Background
The main tracking task for the php-opcache corruption we saw in production was T224491. This was closed in September after various workarounds were put in place that reduced the chances of corruption happening.
In a nutshell:
- For PHP to perform acceptably, we need a compilation cache. This is enabled by default and is called opcache. Similar systems existed in HHVM and PHP5 as well (it is not new). It translates the .php files from disk into a parsed and optimised machine-readable format, stored in RAM.
- Our current deployment model is based on changing files directly on disk (Scap, and rsync), and does not involve containers, new servers, or servers being depooled/re-pooled. Instead, servers keep serving live traffic while files change underneath them.
- PHP picks up updates to files by comparing the mtime of files on disk, if more than a configurable number of seconds have passed since the last time it checked.
- When it finds such an update, it recompiles the file on-the-fly and adds it to memory. It does not remove or replace the existing entry, as another on-going request might still be using that.
- In addition to not removing or replacing entries on demand, there is also no background process or other garbage collection concept in place. Instead, the cache grows indefinitely until it runs out of memory, at which point it is forced to reset the opcache and start from scratch. When this happens, php7-opcache breaks badly and causes unrecoverable corruption to the interpreted source code. (Unrecoverable meaning it is not temporary or self-correcting: any corruption that occurs tends to be sticky until a human restarts the server.)
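To make the growth pattern above concrete, here is a toy model in Python. It is only a sketch of the behaviour described in this task, not how PHP implements opcache internally. The class name, the unit-less `capacity` and the `size` of an entry are made up for illustration; `revalidate_freq` is named after the real `opcache.revalidate_freq` ini directive.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOpcache:
    """Toy model of the behaviour described above; not PHP's real implementation."""

    capacity: int            # hypothetical memory budget, in arbitrary units
    revalidate_freq: float   # seconds that must pass before a file's mtime is re-checked
    used: int = 0            # memory consumed, including stale copies of old code
    resets: int = 0          # how many times the cache was wiped because it was full
    entries: dict = field(default_factory=dict)  # path -> (mtime, last_checked)

    def load(self, path: str, mtime: float, now: float, size: int = 1) -> str:
        entry = self.entries.get(path)
        if entry is not None:
            cached_mtime, last_checked = entry
            if now - last_checked < self.revalidate_freq:
                return "hit"  # too soon to look at the disk again
            if cached_mtime == mtime:
                self.entries[path] = (cached_mtime, now)
                return "hit"  # checked the disk, file is unchanged
        # New or changed file: compile it and add it. The old copy is never freed.
        if self.used + size > self.capacity:
            self.entries.clear()  # forced reset: start from scratch
            self.used = 0
            self.resets += 1
        self.used += size
        self.entries[path] = (mtime, now)
        return "compiled"
```

Calling `load()` for the same path with a newer `mtime` after each deploy makes `used` go up by one every time, even though only one copy is ever live. The only thing that brings `used` back down is the forced reset, which is the moment the real opcache is at risk of corrupting itself.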
What we did to close T224491:
- A cronjob is active on all MW app servers that checks every few minutes whether opcache is close to running out of memory. If it is, we try to prevent it from corrupting itself by automatically depooling the server, doing a clean restart while it has no live traffic (and thus presumably no way to trigger the race condition that causes the corruption), and then repooling it. This cronjob has Icinga alerting on it, and it is spread out so that we don't restart "too many" servers at once.
- The Scap deployment tool also invokes the same script as the cronjob to perform this restart around deployments, so that if we know we're close to running out of memory we don't wait for traffic to fill memory with the new source code, but rather catch it proactively.
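The check-then-restart logic shared by the cronjob and Scap can be sketched roughly as follows. This is not the production script. The function name, the callbacks and the 20% threshold are placeholders, and `free_fraction` is assumed to be computed elsewhere (for instance from the `memory_usage` values reported by PHP's `opcache_get_status()`).

```python
from typing import Callable


def restart_if_low_on_opcache(
    free_fraction: float,
    depool: Callable[[], None],
    restart_php_fpm: Callable[[], None],
    repool: Callable[[], None],
    min_free_fraction: float = 0.2,  # hypothetical threshold, not the production value
) -> bool:
    """Restart php-fpm without live traffic when opcache is nearly full.

    Returns True if a restart was performed, False if there was enough headroom.
    """
    if free_fraction >= min_free_fraction:
        return False      # enough free opcache memory; leave the server alone
    depool()              # stop sending live traffic to this server
    restart_php_fpm()     # clean restart with no requests in flight
    repool()              # only reached if the restart did not raise
    return True
```

In this sketch a failed restart leaves the server depooled, on the assumption that alerting picks it up and a human takes a look. Whether that is the right trade-off is a design choice, not something this task prescribes.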
Status quo
We still see corruptions from time to time. New ones are now tracked at T245183.
We are kind of stuck because any kind of major deployment or other significant temporary or indefinite utilisation of opcache (e.g. T99740) should involve a php-fpm restart to be safe, but we can't easily do a rolling restart because:
- Live traffic has to go somewhere, so we can't restart all at once.
- If we don't restart all at once, that means we have to do a slow rolling one.
- Which means deployments would take 15 minutes or longer. This would be a huge increase compared to the 1-2 minutes they take today.
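The arithmetic behind the slow rolling restart is simple enough to sketch. The function and its parameters are hypothetical, and the numbers in the note below are made up to show the shape of the problem; they are not measurements of our fleet.

```python
import math


def rolling_restart_seconds(servers: int, batch_size: int, seconds_per_batch: float) -> float:
    """Wall-clock time for restarting `servers` hosts, `batch_size` at a time.

    `seconds_per_batch` covers depool, restart and repool for one batch.
    """
    if servers < 0 or batch_size <= 0:
        raise ValueError("servers must be >= 0 and batch_size must be > 0")
    batches = math.ceil(servers / batch_size)  # a partial last batch still costs a full round
    return batches * seconds_per_batch
```

For example, with an invented fleet of 300 servers, restarting 30 at a time at 90 seconds per batch gives `rolling_restart_seconds(300, 30, 90) == 900`, i.e. 15 minutes. Larger batches make it faster, but leave fewer servers to absorb live traffic during each round.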
Ideas
- Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.
  - Method: Memory is controlled by not building up stale copies of old code.
  - Benefit: Easy to add.
  - Downside: We keep all the cronjob and scap complexity we have today.
- Spawn fresh php-fpm instances for each deploy.
  - Method: Some kind of socket transfer for live traffic. Automatic opcache updates would be disabled, thus memory can't grow indefinitely.
  - Benefit: Fast deployment. Relatively clean and easy to reason about.
  - Benefit: We get to remove the complexity we have today.
  - Benefit: We get to prepare and re-use what we learn here for the direction of "MW on containers".
  - Downside: Non-trivial to build.
  - Downside: More disruption to apcu lifetime. We may need to do one of T244340 or T248005 first in that case.
- Get the php-opcache bug(s) fixed upstream.
  - Method: Contractor?
  - Benefit: Fast deployment. Relatively clean and easy to reason about.
  - Benefit: We get to remove the complexity we have today.
  - Downside: Unsure if php-opcache is beyond fixing.