Avoid php-opcache corruption in WMF production
Closed, Resolved (Public)

Authored By

	Krinkle
	May 26 2020, 6:21 PM

Description

NOTE: Please send reports of corruptions to T245183, not here.

Background

The main tracking task for the php-opcache corruption we saw in production was T224491. This was closed in September after various workarounds were put in place that reduced the chances of corruption happening.

In a nutshell:

For PHP to perform acceptably, we need a compilation cache. This is enabled by default and is called opcache. Similar systems existed in HHVM and PHP5 as well (it is not new). This system translates the .php files from disk into a parsed and optimised machine-readable format, stored in RAM.
Our current deployment model is based on changing files directly on disk (Scap, and rsync), and does not involve containers, new servers, or servers being depooled/re-pooled. Instead, they remain live serving.
Updates to files are picked up by PHP by comparing the mtime of files on disk if more than a configurable number of seconds have passed since the last time it checked.
When it finds such an update, it recompiles the file on-the-fly and adds it to memory. It does not remove or replace the existing entry, as another on-going request might still be using that.
In addition to not removing or replacing on-demand, there is also no background process or other garbage collection concept in place. Instead, it grows indefinitely until it runs out of memory, at which point it is forced to reset the opcache and start from scratch. When this happens, php7-opcache shits itself and causes unrecoverable corruption to the interpreted source code. (Unrecoverable meaning that it is not temporary or self-correcting; any corruption that occurs tends to be sticky until a human restarts the server.)
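The growth pattern described in the list above can be illustrated with a toy model. This is only a sketch of the behaviour as described here, not of PHP's real internals: the class name `ToyOpcache`, the byte sizes, and the single `load` method are all hypothetical.

```python
# Toy model only: names and sizes are hypothetical, not PHP's real internals.
class ToyOpcache:
    def __init__(self, capacity_bytes, revalidate_freq):
        self.capacity = capacity_bytes
        self.revalidate_freq = revalidate_freq  # seconds between mtime checks
        self.used = 0       # bytes consumed, including stale copies
        self.wasted = 0     # bytes held by superseded copies
        self.entries = {}   # path -> (mtime, size, last_checked)
        self.resets = 0

    def load(self, path, mtime_on_disk, compiled_size, now):
        entry = self.entries.get(path)
        if entry is not None:
            mtime, size, last_checked = entry
            # Only look at the file again once revalidate_freq seconds have passed.
            if now - last_checked < self.revalidate_freq:
                return "hit"
            if mtime == mtime_on_disk:
                self.entries[path] = (mtime, size, now)
                return "hit"
            # File changed: the old copy is not freed, it only becomes waste.
            self.wasted += size
        if self.used + compiled_size > self.capacity:
            # Out of memory: drop everything and start from scratch.
            self.entries.clear()
            self.used = 0
            self.wasted = 0
            self.resets += 1
        self.used += compiled_size
        self.entries[path] = (mtime_on_disk, compiled_size, now)
        return "compiled"


# Four "deploys" of the same file into a cache that fits three copies.
cache = ToyOpcache(capacity_bytes=300, revalidate_freq=2)
for deploy in range(4):
    cache.load("index.php", mtime_on_disk=deploy, compiled_size=100, now=deploy * 10)
print(cache.used, cache.wasted, cache.resets)
```

In this model the fourth deploy no longer fits, so the whole cache is reset. That forced reset is the moment at which the corruption described above is observed in production.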

What we did to close T224491:

A cronjob is active on all MW app servers that checks every few minutes if opcache is close to running out of memory. If it is, we'll try to prevent it corrupting itself by voluntarily depooling the server automatically, then doing a restart cleanly in a way that has no live traffic and thus presumably no way to trigger the race condition that causes the corruption, and then repool it. This cronjob has Icinga alerting on it. And it is spread out so that we don't restart "too many" servers at once.
The Scap deployment tool also enacts the same script as the cronjob to perform this restart around deployments, so that if we know we're close to running out of memory we won't wait for traffic to increase memory for the new source code, but rather catch it proactively.

Status quo

We still see corruptions from time to time. New ones are now tracked at T245183.

We are kind of stuck because any kind of major deployment or other significant temporary or indefinite utilisation of opcache (e.g. T99740) should involve a php-fpm restart to be safe, but we can't easily do a rolling restart because:

Live traffic has to go somewhere, so we can't restart all at once.
If we don't restart all at once, that means we have to do a slow rolling one.
Which means deployments take 15 minutes or longer. This would be a huge increase compared to the 1-2 minutes it takes today.

Ideas

Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.
- Method: Memory is controlled by not building up stale copies of old code.
- Benefit: Easy to add.
- Downside: We keep all the cronjob and scap complexity we have today.
Spawn fresh php-fpm instances for each deploy.
- Method: Some kind of socket transfer for live traffic. Automatic opcache updates would be disabled, thus memory can't grow indefinitely.
- Benefit: Fast deployment. Relatively clean and easy to reason about.
- Benefit: We get to remove the complexity we have today.
- Benefit: We get to prepare and re-use what we learn here for the direction of "MW on containers".
- Downside: Non-trivial to build.
- Downside: More disruption to apcu lifetime. We may need to do one of T244340 or T248005 first in that case.
Get the php-opcache bug(s) fixed upstream.
- Method: Contractor?
- Benefit: Fast deployment. Relatively clean and easy to reason about.
- Benefit: We get to remove the complexity we have today.
- Downside: Unsure if php-opcache is beyond fixing.

Details

Subject	Repo	Branch	Lines +/-
mediawiki::php bump opcache.max_accelerated_files	operations/puppet	production	+1 -1
mediawiki: reduce the number of cached keys that trigger a restart	operations/puppet	production	+1 -1
mediawiki: Check number of cached keys in php-check-and-restart.sh	operations/puppet	production	+15 -0
php::admin: export additional opcache metrics	operations/puppet	production	+13 -0
hiera: disable php-fpm restarts on mwdebug	operations/puppet	production	+1 -1
mediawiki::php::restarts: Allow disabling of php-fpm restarts	operations/puppet	production	+18 -7
hiera: disable php-fpm restarts on mwdebug	operations/puppet	production	+1 -0


Related Objects

Status	Assigned	Task
Resolved	Krinkle	T212460 Adopt static array files for local disk storage of values (epic)
Open	None	T99740 Use static php array files for l10n cache at WMF (instead of CDB)
Resolved	Krinkle	T245183 PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.)
Resolved	Krinkle	T253673 Avoid php-opcache corruption in WMF production
Resolved	jijiki	T261009 Reproduce opcache corruptions in production
Resolved	Joe	T266055 Update Scap to perform rolling restart for all MW deploy
Resolved	dancy	T243009 Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers
Resolved	thcipriani	T264362 Scap feature: restart php-fpm on deployment
Resolved	dancy	T290038 scap sync-file --force warns "sudo: no tty present and no askpass program specified"
Resolved	dancy	T237033 Scap can't clear opcache on mw servers in Beta Cluster
Resolved	Krinkle	T311788 MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off)

Event Timeline


Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

I believe we currently do this in batches of N servers at once, where the next batch starts after the previous is fully finished. Could this be optimised by letting the batches overlap? E.g. rather than chunks of N, we'd have at most N undergoing a restart at once. More "rolling". I heard some ideas on IRC also involving PoolCounter, but a local variable on the deployment server could perhaps work as well.

We currently don't have coordination from Scap with the cronjobs (which could interfere), so PoolCounter could be used to ensure coordination with that. E.g. we'd have at most N servers in a DC undergoing restart, the server would take care of it locally, and Scap just invokes the script on all servers and each one waits as needed until it's done. Another way could be to communicate with PyBal instead and base it on ensuring a minimum number of pooled servers (as opposed to ensuring a max of depooled servers). I suppose there can be race conditions there though, so maybe PoolCounter is the better way.
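To make the "at most N undergoing a restart at once" idea concrete, here is a minimal sketch. It is not how Scap, PoolCounter, or PyBal work; `rolling_restart`, `FakeFleet`, and the host names are hypothetical, and the fake restart only sleeps. A worker pool of size N starts the next server as soon as any worker is free, instead of waiting for a whole batch to finish.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def rolling_restart(servers, max_in_flight, restart_one):
    """Restart every server, with at most max_in_flight restarts running at once."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # list() forces completion and re-raises any exception from a restart.
        list(pool.map(restart_one, servers))


class FakeFleet:
    """Stand-in for depool + restart + repool; records the peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0
        self.restarted = []

    def restart(self, server):
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        time.sleep(0.01)  # pretend the restart takes a while
        with self._lock:
            self.in_flight -= 1
            self.restarted.append(server)


fleet = FakeFleet()
hosts = [f"mw-example-{i}" for i in range(20)]  # hypothetical host names
rolling_restart(hosts, max_in_flight=3, restart_one=fleet.restart)
print(len(fleet.restarted), fleet.peak)
```

This only covers the limit enforced from one coordinating process. Coordinating with the per-server cronjobs, as discussed above, would still need a shared lock such as PoolCounter.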

Spawn fresh php-fpm instances for each deploy.

What would it take to do this?

Get the php-opcache bug(s) fixed upstream.

[…]
Downside: Unsure if php-opcache is beyond fixing.

TODO: Ref upstream tickets and determine if there is any hope.

Krinkle added a subtask: T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.May 26 2020, 6:29 PM

Krinkle added a parent task: T99740: Use static php array files for l10n cache at WMF (instead of CDB).

In T253673#6166500, @Krinkle wrote:

Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

I would like to do this -- it seems like it would solve a long-tail of issues that we've whack-a-mole'd -- but I have reservations about it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

My major worry isn't that deploys themselves take 15 minutes -- it's that rollbacks would also take 15 minutes. Waiting 15 minutes for a bad deploy to propagate and then waiting 15 minutes for it to be fixed is problematic (cf: T244544)

The current paradigm of deployments involves rolling forward to individual groups of wikis as a way to de-risk a given deployment. We don't have another way to gain that confidence currently. Without that confidence we need a fast rollback mechanism. Maybe making rollbacks faster is easier to solve than making every deployment fast (?) -- not that I know how to do that exactly :)

I believe we currently do this in batches of N servers at once, where the next batch starts after the previous is fully finished. Could this be optimised by letting the batches overlap? E.g. rather than chunks of N, we'd have at most N undergoing a restart at once. More "rolling". I heard some ideas on IRC also involving PoolCounter, but a local variable on the deployment server could perhaps work as well.

The way this is currently implemented is all the groups of servers (jobrunner, appserver, appserver_api, testserver) are restarted in parallel using a pool of workers. Not more than 10% of a given group is restarted/depooled at the same time. @Joe could probably speak to the actual depooling piece.

A contractor is going to ask for the same thing that upstream is asking for: a reproduction case. There's not much to go on without one.

Have you tried depooling an appserver, disabling the workarounds on that server, turning on opcache.protect_memory=1, setting low limits for opcache, and hitting the server with synthetic traffic (replay GET requests from production logs using ab)?

CDanis subscribed.May 27 2020, 11:09 PM

Krinkle updated the task description. (Show Details)Jun 2 2020, 2:42 PM

Krinkle moved this task from Limbo to Perf recommendation on the Performance-Team (Radar) board.Jun 2 2020, 2:45 PM

Krinkle added a parent task: T245183: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.).Jun 10 2020, 6:03 PM

Krinkle mentioned this in T254210: ParameterAssertionException "Bad value for parameter $row->rev_timestamp" from RevisionStoreRecord.php.Jun 11 2020, 1:49 AM

Addshore subscribed.Jun 12 2020, 6:38 PM

darthmon_wmde subscribed.Jun 15 2020, 9:24 AM

Addshore mentioned this in T255282: mw1384 is misbehaving.Jun 15 2020, 9:25 AM

Krinkle mentioned this in T255699: LoadBalancer.php: PHP Warning: Invalid argument supplied for foreach().Jun 17 2020, 6:51 PM

Krinkle triaged this task as High priority.Jun 22 2020, 7:10 PM

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).

Krinkle moved this task from Inbox, needs triage to Blocked (old) on the Performance-Team board.

Ladsgroup mentioned this in T256305: Fatal Error: Class MediaWiki\HookContainer\HookRunner contains 1 abstract method and must therefore be declared abstract.Jun 24 2020, 8:10 PM

hashar merged a task: T256305: Fatal Error: Class MediaWiki\HookContainer\HookRunner contains 1 abstract method and must therefore be declared abstract.Jun 25 2020, 8:28 AM

hashar added subscribers: brennen, Agusbou2015, tstarling and 3 others.

Platonides subscribed.Jun 26 2020, 12:06 AM

Krinkle mentioned this in T239724: Fatal error: "Object does not support method calls" (from MemcachedPeclBagOStuff).Jul 1 2020, 12:34 AM

jijiki subscribed.Jul 1 2020, 2:31 PM

Today at 2020-07-08T14:00:12, with no deployments happening, mw1346 started generating exceptions at high rate:

/srv/mediawiki/php-1.35.0-wmf.39/includes/config/GlobalVarConfig.php:53 GlobalVarConfig::get: undefined option: 'MinervaOverflowInPageActhons'

#0 /srv/mediawiki/php-1.35.0-wmf.39/skins/MinervaNeue/includes/MinervaHooks.php(131): GlobalVarConfig->get(string)
#1 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(320): MinervaHooks::onMobileFrontendFeaturesRegistration(MobileFrontend\Features\FeaturesManager)
#2 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(131): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#3 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/Features/FeaturesManager.php(43): MediaWiki\HookContainer\HookContainer->run(string, array)
#4 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/ServiceWiring.php(55): MobileFrontend\Features\FeaturesManager->useHookToRegisterExtensionOrSkinFeatures()
#5 /srv/mediawiki/php-1.35.0-wmf.39/includes/libs/services/ServiceContainer.php(451): Wikimedia\Services\ServiceContainer->{closure}(MediaWiki\MediaWikiServices)
#6 /srv/mediawiki/php-1.35.0-wmf.39/includes/libs/services/ServiceContainer.php(419): Wikimedia\Services\ServiceContainer->createService(string)
#7 /srv/mediawiki/php-1.35.0-wmf.39/extensions/MobileFrontend/includes/MobileFrontendHooks.php(1115): Wikimedia\Services\ServiceContainer->getService(string)
#8 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(320): MobileFrontendHooks::onMakeGlobalVariablesScript(array, OutputPage)
#9 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookContainer.php(131): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#10 /srv/mediawiki/php-1.35.0-wmf.39/includes/HookContainer/HookRunner.php(2516): MediaWiki\HookContainer\HookContainer->run(string, array)
#11 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3386): MediaWiki\HookContainer\HookRunner->onMakeGlobalVariablesScript(array, OutputPage)
#12 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3035): OutputPage->getJSVars()
#13 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(3056): OutputPage->getRlClient()
#14 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinMustache.php(82): OutputPage->headElement(SkinApi)
#15 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinMustache.php(57): SkinMustache->getTemplateData()
#16 /srv/mediawiki/php-1.35.0-wmf.39/includes/skins/SkinTemplate.php(141): SkinMustache->generateHTML()
#17 /srv/mediawiki/php-1.35.0-wmf.39/includes/OutputPage.php(2616): SkinTemplate->outputPage()
#18 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiFormatBase.php(333): OutputPage->output()
#19 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiFormatRaw.php(82): ApiFormatBase->closePrinter()
#20 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(1834): ApiFormatRaw->closePrinter()
#21 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(608): ApiMain->printResult(integer)
#22 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(532): ApiMain->handleException(ConfigException)
#23 /srv/mediawiki/php-1.35.0-wmf.39/includes/api/ApiMain.php(496): ApiMain->executeActionWithErrorHandling()
#24 /srv/mediawiki/php-1.35.0-wmf.39/api.php(89): ApiMain->execute()
#25 /srv/mediawiki/php-1.35.0-wmf.39/api.php(44): wfApiMain()
#26 /srv/mediawiki/w/api.php(3): require(string)
#27 {main}

Note the spelling of MinervaOverflowInPageActhons.

Pybal quickly depooled the host, but probably monitoring kept querying the server and generating the exceptions.

@Joe did php7adm /opcache-free on mw1346 and the exceptions cleared.

Interestingly, according to the opcache metadata, the file where the error was (/srv/mediawiki/php-1.35.0-wmf.39/skins/MinervaNeue/includes/MinervaHooks.php) had been in opcache for 2 weeks, which isn't great because it means none of the usual triggers caused this issue:

no deploy
no opcache invalidation by chance
no php restarts

Krinkle mentioned this in T245183: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.).Jul 8 2020, 4:17 PM

FTR i-->h is a single bit-flip in the LSB.

In T253673#6290704, @CDanis wrote:

FTR i-->h is a single bit-flip in the LSB.

Sorry, I was off by one; it's actually a transposition. So seems much less likely to be a random flip.

I think it's also interesting to compare this failure to T221347: there it was L -> K. It's a -1 in both cases. Unsure what this could mean...
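For reference, the character codes of the two pairs mentioned in this thread can be compared directly. This is just arithmetic on the ASCII values and makes no claim about the cause; the helper name `describe` is made up for this sketch.

```python
def describe(expected, observed):
    """Compare two single characters by numeric delta and by differing bits."""
    a, b = ord(expected), ord(observed)
    xor = a ^ b
    return {"delta": b - a, "xor": xor, "bits_flipped": bin(xor).count("1")}


print(describe("i", "h"))  # MinervaOverflowInPageActions -> ...Acthons
print(describe("L", "K"))  # the pair reported in T221347
```

Both pairs have a delta of -1. At the bit level, i/h differ in one bit while L/K differ in three, so "value decremented by one" describes both cases but "single bit-flip" does not.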

Krinkle added a project: Sustainability (Incident Followup).Jul 8 2020, 7:23 PM

In order to move this a little bit forward, we can try to reproduce and have a go at @ori 's suggestion. If we don't get anywhere, we will revisit the pros and cons of restarting after every deploy (option 1 in description), what we can do to optimise it, and possibly proceed with it. I hope to make some time for this in August.

Krinkle updated the task description. (Show Details)Jul 16 2020, 9:20 PM

Chatted a bit about this with @jijiki. If the data corruption was triggered by code deployments, maybe the way to reproduce this bug is to simulate the effect of many code deployments by spamming opcache with auto-generated, randomized PHP code.

The generated code would consist of a class with a $data property, a $hash property, and a method that verifies that md5($this->data) == $this->hash. The name of the class and the method and the literal values of $data and $hash would be randomly generated by the codegen script. (Randomizing both the value of the data attribute and the class and method names should help ensure we stress both the code cache and the interned strings buffer.)

The test harness would generate the code, copy the generated PHP code to the server's document root, curl it multiple times in parallel, and repeat. It should be possible to run many iterations of this test on a depooled app server very quickly, with no risk of corrupting production data or triggering MediaWiki bugs.

One reason this might not work is if the bug is not truly internal to opcache -- e.g., if there's some specific PHP extension that does something opcache doesn't expect, etc. If that's the case, a different approach would be needed.

Thoughts?

Script to generate randomized, self-validating code:

codegen.php (1 KB, attachment)
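The attachment itself is not reproduced here. As a rough illustration of the idea described above, the following sketch emits randomized, self-validating PHP source from Python. It is not the attached codegen.php: the function names, identifier prefixes, and payload length are assumptions, and the generated PHP uses `===` where the description says `==`.

```python
import hashlib
import random
import string


def random_identifier(rng, prefix, length=12):
    """Random lowercase identifier; the prefix keeps it a valid PHP name."""
    return prefix + "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def generate_php(rng):
    """Return PHP source for one class that checks md5($data) against $hash."""
    class_name = random_identifier(rng, "C")
    method_name = random_identifier(rng, "m")
    alphabet = string.ascii_letters + string.digits
    data = "".join(rng.choice(alphabet) for _ in range(64))
    digest = hashlib.md5(data.encode("ascii")).hexdigest()
    return f"""<?php
class {class_name} {{
    public $data = '{data}';
    public $hash = '{digest}';
    public function {method_name}() {{
        return md5($this->data) === $this->hash;
    }}
}}
$obj = new {class_name}();
echo $obj->{method_name}() ? "OK\\n" : "CORRUPT\\n";
"""


print(generate_php(random.Random()))
```

A harness along the lines described above would write each generated file to the document root, request it several times in parallel, and flag any response that is not `OK`.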

And on the subject of useful php.ini debug settings:

opcache.protect_memory

If the bug is caused by a PHP extension mutating data that opcache expects to be immutable, opcache.protect_memory=1 should help by causing a crash with a stack trace at the point of mutation. PHP bug #73933 is an example of a bug like that.

Turning on opcache.protect_memory for the stress test I proposed above won't be useful, because the randomized code for the stress test doesn't exercise any PHP extensions. But the setting could be useful for the tests that replay requests with the full MediaWiki codebase.

opcache.consistency_checks

Came across this one today:

opcache.consistency_checks integer
If non-zero, OPcache will verify the cache checksum every N requests, where N is the value of this configuration directive. This should only be enabled when debugging, as it will impair performance.

Here's the code that actually performs the check:
https://github.com/php/php-src/blob/517c9938af/ext/opcache/ZendAccelerator.c#L2119-L2142

I'm not totally sure what it does, but it looks like it includes an Adler32 checksum with each compiled script cache entry, and verifies it on load. If there's a checksum mismatch it logs an INFO-level message and restarts opcache.

I wonder if the impact on performance would really be so bad if this is turned on in production with a value of, say, 1000.
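To show what an Adler-32 check buys in the kind of failure seen earlier in this task, here is a small sketch using Python's `zlib.adler32`. It checksums a plain byte string, whereas opcache checksums the memory of a compiled script, so this only illustrates the principle. The correctly spelled option name is inferred from the stack trace above.

```python
import zlib


def checksum(blob: bytes) -> int:
    """Adler-32 over a byte string (same algorithm family as the opcache check)."""
    return zlib.adler32(blob)


original = b"MinervaOverflowInPageActions"
corrupted = b"MinervaOverflowInPageActhons"  # spelling seen in the exception

stored = checksum(original)           # computed once, when the entry is cached
print(checksum(original) == stored)   # an intact entry still matches
print(checksum(corrupted) == stored)  # a one-byte change no longer matches
```

A single altered byte always changes the Adler-32 value, so a periodic check of this kind would catch a corruption like the one above at the next verification rather than leaving it in place until a human notices.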

taavi subscribed.Aug 17 2020, 7:30 PM

In T253673#6386605, @ori wrote:

The test harness would generate the code, copy the generated PHP code to the server's document root, curl it multiple times in parallel, and repeat. It should be possible to run many iterations of this test on a depooled app server very quickly, with no risk of corrupting production data or triggering MediaWiki bugs.

One reason this might not work is if the bug is not truly internal to opcache -- e.g., if there's some specific PHP extension that does something opcache doesn't expect, etc. If that's the case, a different approach would be needed.

Thoughts?

I am wondering if this test will increase the wasted memory and trigger an opcache restart with an empty cache. I think we have set this to 10%. We do suspect that code deployments per se might not be the issue: one file mentioned in this thread was cached weeks before its corruption. Nevertheless, it is worth a shot and testing is cheap, so we may test it.

In T253673#6386921, @ori wrote:

And on the subject of useful php.ini debug settings:

opcache.protect_memory

If the bug is caused by a PHP extension mutating data that opcache expects to be immutable, opcache.protect_memory=1 should help by causing a crash with a stack trace at the point of mutation. PHP bug #73933 is an example of a bug like that.

+1 I will try that

opcache.consistency_checks

Came across this one today:

opcache.consistency_checks integer
If non-zero, OPcache will verify the cache checksum every N requests, where N is the value of this configuration directive. This should only be enabled when debugging, as it will impair performance.

Here's the code that actually performs the check:
https://github.com/php/php-src/blob/517c9938af/ext/opcache/ZendAccelerator.c#L2119-L2142

I'm not totally sure what it does, but it looks like it includes an Adler32 checksum with each compiled script cache entry, and verifies it on load. If there's a checksum mismatch it logs an INFO-level message and restarts opcache.

I wonder if the impact on performance would really be so bad if this is turned on in production with a value of, say, 1000.

I didn't know about this, sounds promising!

Summing up, after we finish up with a minor maintenance we have been doing on our clusters, we can do some testing hoping to reproduce the corruption. With @Krinkle we have a list of webrequests we can start with and run it against

an app server, disable the systemd timer and run an ab test from mwdebug*.
an api server, with opcache.protect_memory=1 and wait for a segfault.

Moreover, we can:

test performance of opcache.consistency_checks, see if it makes sense to have it enabled on some servers or include it in the above ones
run codegen and see if it adds something here or confuses us more

jijiki moved this task from Incoming 🐫 to 🔦Unused2 on the serviceops board.Aug 17 2020, 11:45 PM

jijiki mentioned this in T261009: Reproduce opcache corruptions in production .Aug 21 2020, 5:42 PM

• Mholloway subscribed.Aug 24 2020, 11:38 PM

Change 622761 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki::php::restarts: Allow disabling of php-fpm restarts

https://gerrit.wikimedia.org/r/622761

gerritbot added a project: Patch-For-Review.Aug 27 2020, 10:19 AM

Change 622762 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: disable php-fpm restarts on mwdebug

https://gerrit.wikimedia.org/r/622762

Change 622762 abandoned by Effie Mouzeli:
[operations/puppet@production] hiera: disable php-fpm restarts on mwdebug

Reason:
conflict

https://gerrit.wikimedia.org/r/622762

Change 622765 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: disable php-fpm restarts on mwdebug

https://gerrit.wikimedia.org/r/622765

Change 622761 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki::php::restarts: Allow disabling of php-fpm restarts

https://gerrit.wikimedia.org/r/622761

Change 622765 merged by Effie Mouzeli:
[operations/puppet@production] hiera: disable php-fpm restarts on mwdebug

https://gerrit.wikimedia.org/r/622765

jijiki added a subtask: T261009: Reproduce opcache corruptions in production .Sep 7 2020, 9:05 AM

jijiki added a project: User-jijiki.Sep 8 2020, 10:13 AM

jijiki moved this task from Incoming🐅 to In Progress 🏋️‍♀️ on the User-jijiki board.Sep 8 2020, 10:17 AM

jijiki moved this task from In Progress 🏋️‍♀️ to Radar 📻 on the User-jijiki board.Sep 8 2020, 11:19 AM

Change 625224 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] php::admin: export additional opcache metrics

https://gerrit.wikimedia.org/r/625224

Change 625224 merged by Effie Mouzeli:
[operations/puppet@production] php::admin: export additional opcache metrics

https://gerrit.wikimedia.org/r/625224

jijiki changed the status of subtask T261009: Reproduce opcache corruptions in production from Open to Stalled.Sep 16 2020, 6:17 PM

In T253673#6166689, @thcipriani wrote:

In T253673#6166500, @Krinkle wrote:

Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

I would like to do this -- it seems like it would solve a long-tail of issues that we've whack-a-mole'd -- but I have reservations about it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

My major worry isn't that deploys themselves take 15 minutes -- it's that rollbacks would also take 15 minutes. Waiting 15 minutes for a bad deploy to propagate and then waiting 15 minutes for it to be fixed is problematic (cf: T244544)

@Joe said today that a rolling restart takes 5-10 minutes, not 15 minutes.

RE: Having an option to take a shortcut, I don't oppose it existing per se, but I don't think skipping the restarts as proposed in T244544/T243009 would be effective because solving this task requires that we disable the dangerous revalidation option in opcache. Thus not restarting equates to effectively not having deployed any code.

However, if we want a shortcut that doesn't do things in a rolling way but immediately sends an all-out restart, that's something for @Joe and team to balance and decide. For example, in a disaster scenario where most requests are HTTP 5xx anyway, and where a lack of appserver capacity is handled gracefully higher in the traffic stack, an option for a bigger, riskier batch could make sense.

But, I think for the short term, we should proceed without this and just take 5-10 min as our new safe default and then work on reducing it.

Is anything else blocking this? Could we try it for a few deploys and test drive?

BTW, I produced a short writeup aimed at deployers and others close to production: https://wikitech.wikimedia.org/wiki/User:CDanis/Diagnosing_opcache_corruption

Krinkle mentioned this in T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.Oct 5 2020, 8:17 PM

• mmodell subscribed.Oct 5 2020, 8:26 PM

kostajh subscribed.Oct 5 2020, 8:36 PM

• mmodell awarded a token.Oct 13 2020, 3:59 PM

Krinkle mentioned this in T266055: Update Scap to perform rolling restart for all MW deploy.Oct 20 2020, 5:40 PM

Krinkle added a subtask: T266055: Update Scap to perform rolling restart for all MW deploy.

My idea for detection/prevention of opcache corruption is to use a memory protection key to do essentially what opcache.protect_memory=1 does, but fast enough for it to be always enabled in production.

My theory of opcache corruption is that the large number of pointers into shared memory during a request provides many opportunities for accidental writes, due to dangling pointers or other programmer errors. It's not feasible to do mprotect() on the shared memory every time shared memory is written to, because mprotect() needs to write to every page table entry which makes it O(N) in the size of the segment. Every request writes to shared memory, because shared memory contains locks which are incremented and decremented during read operations.

My idea is to tag shared memory with a pkey. Then when entering or exiting a section of the code that writes to shared memory, only a single instruction (WRPKRU) needs to be executed to change the permissions on all of shared memory.

The goal is to convert shared memory corruption into segfaults, which are less damaging in production. Segfaults can produce core dumps, potentially giving a lead as to the root cause of the memory corruption.

Tgr mentioned this in T266052: Interface 'MediaWiki\EditPafe\IEditObject' not found.Oct 21 2020, 6:49 AM

Yesterday we had opcache corruptions on 2 servers, mw2328 and mw2252. I don't know about other times, but for those specific 2 corruptions, I can say that they happened right after opcache restarted because it had reached its maximum number of cached keys on these servers:

mw2328:
    "start_time": 1600177590, -> Tuesday, 15 September 2020 13:46:30
    "last_restart_time": 1603211850, -> Tuesday, 20 October 2020 16:37:30
    "oom_restarts": 0,
    "hash_restarts": 2,

mw2252:
    "start_time": 1600174055, -> Tuesday, 15 September 2020 12:47:35
    "last_restart_time": 1603217533, -> Tuesday, 20 October 2020 18:12:13
    "oom_restarts": 0,
    "hash_restarts": 2,
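The epoch values above can be converted with a few lines of Python. This sketch assumes the human-readable times are UTC, which matches the values quoted for both hosts; the helper name `to_utc` is made up.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds):
    """Format a Unix timestamp the way the annotations above are written."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%A, %d %B %Y %H:%M:%S")


for label, ts in [
    ("mw2328 last_restart_time", 1603211850),
    ("mw2252 start_time", 1600174055),
    ("mw2252 last_restart_time", 1603217533),
]:
    print(label, "->", to_utc(ts))
```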

Yesterday some more servers had their opcache restarted for the same reason. Looking for servers where opcache_statistics.hash_restarts is 2:

(48) mw[2218-2220,2222-2223,2252-2253,2262,2283,2285-2289,2291-2300,2304,2306,2308,2317,2320-2324,2328,2332,2334,2350,2352,2358,2360,2362,2364,2366-2368,2370,2372,2374].codfw.wmnet
----- OUTPUT of 'php7adm  /opcach...cs.hash_restarts' -----
2

Looking at when those servers had their opcache restarted, it was yesterday, and most of them (if not all, I have not checked yet) are api servers:

mw2218.codfw.wmnet: 1603215616
mw2219.codfw.wmnet: 1603212256
mw2220.codfw.wmnet: 1603215818
mw2222.codfw.wmnet: 1603222186
mw2223.codfw.wmnet: 1603218900
mw2252.codfw.wmnet: 1603217533
mw2253.codfw.wmnet: 1603212502
mw2262.codfw.wmnet: 1603224353
mw2283.codfw.wmnet: 1603219582
mw2285.codfw.wmnet: 1603215620
mw2286.codfw.wmnet: 1603222040
mw2287.codfw.wmnet: 1603211649
mw2288.codfw.wmnet: 1603215617
mw2289.codfw.wmnet: 1603217488
mw2291.codfw.wmnet: 1603213619
mw2292.codfw.wmnet: 1603211083
mw2293.codfw.wmnet: 1603222034
mw2294.codfw.wmnet: 1603220461
mw2295.codfw.wmnet: 1603215815
mw2296.codfw.wmnet: 1603213118
mw2297.codfw.wmnet: 1603221654
mw2298.codfw.wmnet: 1603219884
mw2299.codfw.wmnet: 1603221725
mw2300.codfw.wmnet: 1603215618
mw2304.codfw.wmnet: 1603221327
mw2306.codfw.wmnet: 1603221212
mw2308.codfw.wmnet: 1603214999
mw2317.codfw.wmnet: 1603216066
mw2320.codfw.wmnet: 1603214233
mw2321.codfw.wmnet: 1603216670
mw2322.codfw.wmnet: 1603220571
mw2323.codfw.wmnet: 1603220597
mw2324.codfw.wmnet: 1603221161
mw2328.codfw.wmnet: 1603211850
mw2332.codfw.wmnet: 1603221654
mw2334.codfw.wmnet: 1603223000
mw2350.codfw.wmnet: 1603221653
mw2352.codfw.wmnet: 1603224062
mw2358.codfw.wmnet: 1603217075
mw2360.codfw.wmnet: 1603220878
mw2362.codfw.wmnet: 1603217662
mw2364.codfw.wmnet: 1603215611
mw2366.codfw.wmnet: 1603213081
mw2367.codfw.wmnet: 1603200046
mw2368.codfw.wmnet: 1603220098
mw2370.codfw.wmnet: 1603221962
mw2372.codfw.wmnet: 1603216361
mw2374.codfw.wmnet: 1603213716

That being said, we can enhance the cronjob script we have to check this metric as well, and trigger a php-fpm restart. So it will be restarted when free opcache is below 200mb or if cached keys are over 32k.
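The combined condition can be sketched as follows. The real check lives in the shell script php-check-and-restart.sh, which is not reproduced here. This Python version assumes a status dictionary shaped like the output of PHP's `opcache_get_status()`, and the exact thresholds (MiB rather than MB, 32,000 rather than 32,768) are assumptions based on the "200mb" and "32k" figures above.

```python
MIN_FREE_BYTES = 200 * 1024 * 1024  # "below 200mb"; MiB is an assumption
MAX_CACHED_KEYS = 32_000            # "over 32k"; exact value is an assumption


def should_restart(status):
    """Return True if php-fpm should be restarted before opcache resets itself."""
    free = status["memory_usage"]["free_memory"]
    keys = status["opcache_statistics"]["num_cached_keys"]
    return free < MIN_FREE_BYTES or keys > MAX_CACHED_KEYS


# Hypothetical sample: plenty of memory left, but too many cached keys.
sample = {
    "memory_usage": {"free_memory": 500 * 1024 * 1024},
    "opcache_statistics": {"num_cached_keys": 32_500, "hash_restarts": 0},
}
print(should_restart(sample))
```

Checking the key count as well as free memory matters because the restarts listed above were hash restarts (`hash_restarts: 2`, `oom_restarts: 0`), which a memory-only check would not have anticipated.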

In T253673#6566529, @tstarling wrote:

My idea is to tag shared memory with a pkey. Then when entering or exiting a section of the code that writes to shared memory, only a single instruction (WRPKRU) needs to be executed to change the permissions on all of shared memory.

The goal is to convert shared memory corruption into segfaults, which are less damaging in production. Segfaults can produce core dumps, potentially giving a lead as to the root cause of the memory corruption.

How difficult would that be to implement? It sounds relatively straight-forward but I'm not very familiar with php internals. Segfault would definitely be a huge improvement over the random behavior we've been seeing.

Change 635854 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki: Check number of cached keys in php-check-and-restart.sh

https://gerrit.wikimedia.org/r/635854

Change 635854 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki: Check number of cached keys in php-check-and-restart.sh

https://gerrit.wikimedia.org/r/635854

Change 636047 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki::php bump opcache.max_accelerated_files

https://gerrit.wikimedia.org/r/636047

• dpifke subscribed.Jan 19 2021, 7:17 PM

Change 657398 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki: reduce the number of cached keys that trigger a restart

https://gerrit.wikimedia.org/r/657398

Change 657398 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki: reduce the number of cached keys that trigger a restart

https://gerrit.wikimedia.org/r/657398

Change 636047 abandoned by Effie Mouzeli:
[operations/puppet@production] mediawiki::php bump opcache.max_accelerated_files

Reason:
maybe another time

https://gerrit.wikimedia.org/r/636047

Krinkle closed subtask T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers as Resolved.Jan 26 2021, 8:57 PM

Krinkle closed subtask T264362: Scap feature: restart php-fpm on deployment as Resolved.Jan 26 2021, 9:03 PM

Krinkle mentioned this in T240775: RFC: Support PHP 7.4 preload.Feb 10 2021, 4:03 AM

Krinkle moved this task from Blocked (old) to Radar on the Performance-Team board.Mar 1 2021, 8:22 PM

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.

Krinkle mentioned this in T274041: Reduce performance impact of HookRunner.php loading 500+ interfaces.Mar 25 2021, 10:43 PM

Quarterly update for clarity: This is currently blocked on subtask T266055.

Krenair subscribed.Apr 10 2021, 10:19 PM

jijiki closed subtask T261009: Reproduce opcache corruptions in production as Resolved.Apr 24 2021, 5:38 AM

Krinkle mentioned this in T278382: Clean up CirrusSearch job retries.Apr 30 2021, 12:44 AM

Krinkle updated the task description. (Show Details)Apr 21 2022, 9:07 PM

Joe closed subtask T266055: Update Scap to perform rolling restart for all MW deploy as Resolved.Jul 28 2022, 6:11 AM

In T245183#8042310, @Krinkle wrote:

Any remaining "smells like opcache" problems we see can't be caused by php-opcache revalidation mode, since that mode is now disabled on production web servers as per T266055.

One notable remaining issue in particular is: T254209: Spike of impossible "Cannot declare class" fatal errors (opcache)

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).Jul 31 2022, 2:49 AM

Krinkle moved this task from Inbox, needs triage to Doing: Goals on the Performance-Team board.

Krinkle mentioned this in T314240: Enable rolling restart for all MW servers (tracking).Jul 31 2022, 6:28 AM

	F32410334: image.png
	Oct 21 2020, 4:43 PM

	F32410331: image.png
	Oct 21 2020, 4:43 PM

	F32166877: codegen.php
	Aug 15 2020, 5:52 PM
