wikireplicas: Define MW sections per host
Closed, Resolved, Public

Authored By

	Marostegui
	Oct 9 2020, 12:50 PM

Description

After being unable to find anywhere whether we had decided which sections will go on each host, I am creating this task to get a proposal.
In the planning we have this:

Ideal is 8 slices * 2 types (web + analytics) * 2 instances each == 32 database instances. Deployed as multi-instance with 4 instances per physical node == 8 nodes

Each host will have 512GB RAM; let's use 80% for the buffer pool, so that's 410GB
Usable disk space for MySQL will be 8.7TB
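
For reference, the sizing arithmetic above can be reproduced with a short Python sketch. The constant names are illustrative only and are not taken from any real configuration; the values come from the planning notes quoted above.

```python
# Sizing arithmetic from the planning notes (names are illustrative).
SLICES = 8               # s1..s8
SERVICE_TYPES = 2        # web + analytics
INSTANCES_PER_TYPE = 2   # 2 instances of each slice per service type
INSTANCES_PER_NODE = 4   # multi-instance deployment

total_instances = SLICES * SERVICE_TYPES * INSTANCES_PER_TYPE
nodes = total_instances // INSTANCES_PER_NODE

RAM_GB = 512
BUFFER_POOL_FRACTION = 0.80
# 512 * 0.8 = 409.6, rounded to the 410GB figure used in this task
buffer_pool_gb = round(RAM_GB * BUFFER_POOL_FRACTION)

print(total_instances, nodes, buffer_pool_gb)  # 32 8 410
```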

Host 1:
s1 (enwiki): 150
s3: 70
s4 (commons): 140
s5: 50

Total disk space needed (with InnoDB compression enabled): 5TB

Host 2:

s2: 60
s6: 50
s7 (some big wikis like arwiki, eswiki, metawiki or viwiki): 100
s8 (wikidatawiki): 200

Total disk space needed (with InnoDB compression enabled): 4.2TB
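
A quick sanity check of the proposal, as a Python sketch. It assumes the per-section numbers are buffer pool allocations in GB; the task does not state the unit, but each host's figures add up to exactly the 410GB budget. The host keys and variable names are hypothetical.

```python
# Assumption: per-section figures are buffer pool sizes in GB.
BUFFER_POOL_BUDGET_GB = 410

proposal = {
    "host1": {"s1": 150, "s3": 70, "s4": 140, "s5": 50},
    "host2": {"s2": 60, "s6": 50, "s7": 100, "s8": 200},
}

# Sum the per-section allocations for each host.
totals = {host: sum(sections.values()) for host, sections in proposal.items()}

for host, total in totals.items():
    status = "ok" if total <= BUFFER_POOL_BUDGET_GB else "over budget"
    print(f"{host}: {total}GB of {BUFFER_POOL_BUDGET_GB}GB ({status})")
```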

@Bstorm it is not yet clear to me whether you guys want to have full redundancy between hosts (per service). Taking the Analytics service as an example, we can have 2 models within the same service:

Model a)
host1 and host3 having identical data and serving s1, s3, s4 and s5
host2 and host4 having identical data and serving s2, s6, s7 and s8

Or whether you'd like to have more computational power for reads and have for example:

Model b)
Host1 serving s1
Host 2 serving s8
Host 3 serving s4 and s7
Host 4 serving: s2, s3, s5 and s6

Model a gives us more redundancy, as we can lose up to two hosts per array (like a RAID 10!), but less computational power for reads, as we share more sections and hence each section has less buffer pool available.
Model b gives us more power for reads, as big wikis like s1 (enwiki), s4 (commons) and s8 (wikidatawiki) have dedicated resources, but if we lose a host, we lose some sections.
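
The redundancy trade-off between the two models can be illustrated with a small Python sketch. The layouts mirror Model a and Model b as described above; the function name and host keys are hypothetical.

```python
# Layouts as described in Model a and Model b (host names are placeholders).
MODEL_A = {
    "host1": {"s1", "s3", "s4", "s5"},
    "host2": {"s2", "s6", "s7", "s8"},
    "host3": {"s1", "s3", "s4", "s5"},
    "host4": {"s2", "s6", "s7", "s8"},
}
MODEL_B = {
    "host1": {"s1"},
    "host2": {"s8"},
    "host3": {"s4", "s7"},
    "host4": {"s2", "s3", "s5", "s6"},
}
ALL_SECTIONS = {f"s{i}" for i in range(1, 9)}


def unavailable_sections(layout, failed_hosts):
    """Return the sections with no surviving host after the given failures."""
    surviving = set()
    for host, sections in layout.items():
        if host not in failed_hosts:
            surviving |= sections
    return sorted(ALL_SECTIONS - surviving)


print(unavailable_sections(MODEL_A, {"host1"}))           # []
print(unavailable_sections(MODEL_A, {"host1", "host2"}))  # []
print(unavailable_sections(MODEL_B, {"host1"}))           # ['s1']
```

Note that Model a only survives two failures when the failed hosts are not mirrors of each other: losing host1 and host3 together would take s1, s3, s4 and s5 offline.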

Related Objects

Status	Subtype	Assigned	Task
Resolved		Marostegui	T233766 labsdb1011 mariadb crashed
			Restricted Task
			Restricted Task
Open		None	T204950 Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users
Open		None	T215858 Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema
Resolved		fnegri	T280152 Mitigate breaking changes from the new Wiki Replicas architecture
			Unknown Object (Task)
Resolved		RobH	T260441 (Need By: ASAP) rack/setup/install clouddb10[13-20]
Resolved		• Bstorm	T260389 Redesign and rebuild the wikireplicas service using a multi-instance architecture
Resolved		• Bstorm	T260843 Set up roles for new wiki replicas layout
Resolved		Marostegui	T265135 wikireplicas: Define MW sections per host
Resolved		Marostegui	T267090 Productionize clouddb10[13-20]
Resolved		Marostegui	T268312 Deploy labsdbuser and views to new clouddb hosts
Resolved		• Bstorm	T269200 Create end-user accounts on the new clouddb hosts
Resolved		• Bstorm	T269620 maintain-dbusers doesn't close connections right on harvest-replicas
Resolved		MoritzMuehlenhoff	T268725 Include mail on standard_packages.pp
Resolved		Marostegui	T268742 Test upgrading sanitarium hosts to Buster + 10.4
Resolved		Marostegui	T272008 Move wikireplicas under the new sanitarium hosts (db1154, db1155)
Resolved		• Cmjohnson	T272125 Memory errors on clouddb1019
Resolved		dcaro	T272127 2021-01-15: PROBLEM alert - labstore1004/Ensure mysql credential creation for tools users is running is CRITICAL
Resolved		Marostegui	T280492 Upgrade all sanitarium masters to 10.4 and Buster
Resolved	Request	wiki_willy	T281794 decommission db1082.eqiad.wmnet
Resolved	Request	• Cmjohnson	T281959 decommission db1074.eqiad.wmnet
Resolved	Request	• Cmjohnson	T282079 decommission db1079.eqiad.wmnet
Resolved	Request	• Cmjohnson	T282093 decommission db1087.eqiad.wmnet
Resolved	Request	• Cmjohnson	T282096 decommission db1085.eqiad.wmnet

Event Timeline

Marostegui created this task. Oct 9 2020, 12:50 PM

Restricted Application added a subscriber: Aklapper. Oct 9 2020, 12:50 PM

Marostegui mentioned this in T260843: Set up roles for new wiki replicas layout. Oct 13 2020, 5:25 AM

• Bstorm moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board. Oct 13 2020, 3:31 PM

After discussing this as a group, we came to the conclusion that more redundancy would serve everyone better. However, do you have any statistics on what the performance differences would be between the options? Especially in contrast to our existing setup, presumably both would be more performant, yes?

It is pretty hard to know the performance differences, as right now we just have a big pool and depending on the queries the pool changes. I would assume that the most hit wikis are enwiki, commons and wikidata, so I would assume most of the pool is used for those.
Though, if there are bots or users scraping other wikis, the pool would get "dirty" for everyone, even if that specific wiki is only used once.
I am fine with both models, it is really up to WMCS to decide :-)

If we go for option a, does that proposal of sections sound good?

Seems fine, yes. We will learn more about how it affects performance after deployment. I hope it works well 😁

On the section split, WMCS agrees with your logic of the most hit wikis being enwiki, commons and wikidata. We won't be perfect in distributing load, but separating those makes sense.

So yes, option a, with the proposed sections sounds good.

In regards to performance, I understand it's hard to give numbers. We presume the performance will be at least equivalent to today, and likely better. For posterity however, how much effort would be required if we needed/wanted to change to option b during the rollout? I don't foresee this being an issue, but I wanted to understand how much effort it would be if we needed to pivot for any reason.

We could do the switch from model a to model b, but that means we'd need to repopulate the data across the hosts, which means downtime for them. I would estimate around 5-7 days if all goes well.

aborrero moved this task from Needs discussion to Watching on the cloud-services-team (Kanban) board. Oct 21 2020, 3:44 PM

Closing this as the question has been answered.
We are going for model a, with this data structure:

clouddb1013 s1+s3
clouddb1014 s2+s7
clouddb1015 s4+s6
clouddb1016 s8+s5

clouddb1017 s1+s3
clouddb1018 s2+s7
clouddb1019 s4+s6
clouddb1020 s8+s5

@Bstorm @nskaggs can you sign this off? Looks good?

By splitting the data like this, the major wikis s1, s4 and s8 end up on separate hosts, each sharing resources with one of the "smaller" sections. If a host in any of the "pools" goes down, we still have the other mirrored one.
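
As a rough check of the properties described here, the following Python sketch verifies that every section lives on exactly two hosts and that s1, s4 and s8 never share a host. The host-to-section mapping is copied from the list above; the helper name is hypothetical.

```python
from collections import defaultdict

# Final layout as listed in this task.
LAYOUT = {
    "clouddb1013": ["s1", "s3"],
    "clouddb1014": ["s2", "s7"],
    "clouddb1015": ["s4", "s6"],
    "clouddb1016": ["s8", "s5"],
    "clouddb1017": ["s1", "s3"],
    "clouddb1018": ["s2", "s7"],
    "clouddb1019": ["s4", "s6"],
    "clouddb1020": ["s8", "s5"],
}
BIG_SECTIONS = {"s1", "s4", "s8"}


def hosts_by_section(layout):
    """Invert the host -> sections mapping into section -> hosts."""
    result = defaultdict(list)
    for host, sections in layout.items():
        for section in sections:
            result[section].append(host)
    return dict(result)


placement = hosts_by_section(LAYOUT)

# Every section should have exactly one mirror.
mirrored = all(len(hosts) == 2 for hosts in placement.values())

# No host should carry more than one of the big sections.
big_split = all(len(BIG_SECTIONS & set(s)) <= 1 for s in LAYOUT.values())

print(placement["s1"], mirrored, big_split)
```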

Marostegui mentioned this in T267090: Productionize clouddb10[13-20]. Nov 3 2020, 6:36 AM