User Details
- User Since: Aug 26 2020, 8:28 PM
- Roles: Disabled
- LDAP User: Razzi
- MediaWiki User: RAbuissa (WMF) [ Global Accounts ]
May 17 2022
VM created. Work continues at https://phabricator.wikimedia.org/T308597
I'm going to go ahead and put this on row A. Here's a little snippet I used to look at the ganeti resource totals by row (python -m pip install pandas ipython first):
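The snippet itself is not preserved in this feed. A minimal sketch of the idea, assuming the node capacity data has already been collected (e.g. from `gnti-node list`-style output) into records of node, row, and free resources; the host names and numbers below are invented for illustration:

```python
# Hypothetical reconstruction of the "resource totals by row" snippet.
# All node names and capacity figures are made up for illustration.
import pandas as pd

nodes = pd.DataFrame(
    [
        ("ganeti1009", "A", 45000, 1200),
        ("ganeti1010", "A", 52000, 1400),
        ("ganeti1011", "B", 30000, 900),
        ("ganeti1012", "C", 61000, 1600),
    ],
    columns=["node", "row", "free_mem_mb", "free_disk_gb"],
)

# Sum free resources per rack row, to pick the row with the most headroom
# for the new VM.
totals = nodes.groupby("row")[["free_mem_mb", "free_disk_gb"]].sum()
print(totals)
```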
May 16 2022
I downloaded the whole dashboard as json, edited the json to make the name have "TEST COPY", scp'd it to the superset host, and loaded it with:
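The load command itself is not preserved in this feed. The "edited the json" step can be sketched as follows; note the export layout (a top-level `"dashboards"` list with a `"dashboard_title"` field per entry) is an assumption about the Superset export format, not taken from the original:

```python
# Sketch of renaming a dashboard export so the copy is obvious in the UI.
# The "dashboards" / "dashboard_title" structure is assumed, not confirmed.
import json

def mark_as_test_copy(path_in, path_out):
    with open(path_in) as f:
        data = json.load(f)
    # Append "TEST COPY" to every dashboard title in the export.
    for dash in data.get("dashboards", []):
        dash["dashboard_title"] = dash["dashboard_title"] + " TEST COPY"
    with open(path_out, "w") as f:
        json.dump(data, f, indent=2)
```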
Upgrading the VM worked for Turnilo, but Superset needs updating before it will work on Bullseye. Generally there's no guarantee that both staging services will be compatible with the same Debian version, so I say we split Turnilo staging onto its own server.
May 12 2022
I forgot there's a way to upgrade a virtual machine's operating system: https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM and I'm following that now for the superset / turnilo staging instance (an-tool1005).
Thanks for the explanation @BBlack, nothing to do here so I'll close this.
May 11 2022
I merged the related patch, but when I restarted pybal it caused an alert, so I'm waiting for input from the traffic team before proceeding: https://phabricator.wikimedia.org/T308174
May 5 2022
I tried out Superset 1.5 briefly, but found it requires Python 3.8, and an-tool1005 is currently running Python 3.7. The error:
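The error output itself is not preserved in this feed. A quick preflight check for this kind of incompatibility, before attempting the upgrade on a given host, could look like:

```python
# Check the interpreter meets Superset 1.5's minimum (Python 3.8, per the
# failure described above) before bothering with the upgrade.
import sys

def supports_superset_15():
    return sys.version_info >= (3, 8)

print(supports_superset_15())
```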
May 4 2022
Here's the traceback of a 500 error I got:
[2022-05-04 17:25:52,664] ERROR in app: Exception on /api/query/stop [POST]
Traceback (most recent call last):
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./quarry/web/api.py", line 152, in api_stop_query
    cur.execute("KILL %s;", (result_dictionary["connection_id"]))
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/cursors.py", line 148, in execute
    result = self._query(query)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/cursors.py", line 310, in _query
    conn.query(q)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 548, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 775, in _read_query_result
    result.read()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 1156, in read
    first_packet = self.connection._read_packet()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 725, in _read_packet
    packet.raise_for_error()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/protocol.py", line 221, in raise_for_error
    err.raise_mysql_exception(self._data)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/err.py", line 143, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1094, 'Unknown thread id: 8295086')
May 3 2022
It's working! Visit https://superset-next.wikimedia.org/
Updated netbox status to "Active".
May 2 2022
I sent out an email that all the named hosts, other than an-airflow, will be rebooted this Friday, May 6, in a window from 17:00 to 19:00 UTC (10am-12pm Pacific).
Apr 28 2022
Apr 27 2022
If we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915/ that @BTullis and I came up with, we can do the following to reimage these hosts.
This is done. None of the views were changing, but passing --replace-all was hanging because some views were currently in use, so I answered "no" to every "replace" prompt with the following:
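The exact command is not preserved in this feed. For tools that read their prompt answers from stdin, the pattern of answering "no" to every prompt non-interactively can be sketched like this; the command name is a stand-in, not the real tool:

```python
# Feed a long run of "n" answers to an interactive tool's stdin so every
# "replace?" prompt is declined without a human at the keyboard.
import subprocess

def run_answering_no(cmd, prompts=1000):
    # "prompts" just needs to exceed the number of questions the tool asks.
    result = subprocess.run(cmd, input="n\n" * prompts, text=True)
    return result.returncode

# e.g. run_answering_no(["some-maintain-views-command", "--replace-all"])
# (hypothetical command name)
```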
This seems to be resolved now that clouddb1021 has been reimaged successfully. I'll close this but if I'm missing something please reopen.
Apr 26 2022
I did the reimage again just now and it worked fine selecting "No" when prompted to load missing firmware. @MoritzMuehlenhoff I misread your comment and didn't realize your change should have been submitted first, sorry!! Let me know if I can still be useful in testing that, but otherwise, this ticket can be closed. Thanks for your input everybody.
Apr 18 2022
I reimaged the host back to Buster for now, which went smoothly. Replication lag is a few days behind but is catching up gradually: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=clouddb1021&var-port=13311&viewPanel=6&from=now-5m&to=now
Apr 14 2022
@Marostegui says I should tag @MoritzMuehlenhoff - hopefully we can all solve this together :)
Possibly relevant links thanks to @jhathaway:
Apr 13 2022
Apr 12 2022
Ok, after some help with wmf-pt-kill in https://phabricator.wikimedia.org/T305974 and a patch to update netboot for the other clouddb10xx hosts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/779557), the reimage of clouddb1014 went smoothly. I'm repooling all hosts and will continue with clouddb1015-1021 tomorrow.
Thank you both! Looks good 👍 👍
I forgot to tell netboot to treat these hosts as database hosts, which I have now done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/779488
Apr 11 2022
According to @hashar we can get node 12.22.5 by upgrading Debian to version 11 Bullseye (staging and production Turnilo run Debian 10). I'll try upgrading Debian on the staging host and see if the latest Turnilo works then.
Apr 7 2022
I'll do this next week. To my knowledge these hosts are pretty much the same as the dbstore hosts I did this week for https://phabricator.wikimedia.org/T299481, except that, if I depool the hosts first, there will be no downtime.
Apr 6 2022
Ah, ok, it appears we're now too far behind on Node.js versions.
I made a patch for this, but the scap deploy to staging failed due to some error with locales:
Apr 5 2022
Hi Product Analytics, superset 1.4.2 is ready to be tested on staging. Once we confirm there are no showstopping bugs we'll release it to production.
Ok, looks like the following will resolve it:
I thought I'd update the staging database to be the same as production before sharing superset staging widely, and I'm glad I did, because it looks like there's some sort of database issue with the update.
All the reimages are done. Thanks for your input @Marostegui and @Ladsgroup .
Looks like the reimage went fine; the Icinga warning just means replication has not caught up yet, but I see Seconds_Behind_Master decreasing over time.
Mar 31 2022
I'm thinking about requiring superset-next to have 2fa for 2 reasons:
- since it's the staging environment for superset, it's more likely to have misconfigurations that would lead to security issues
- to test out what it would look like to eventually require 2fa for superset
I removed stat100* directories. All done!
I removed the data in stat1006. Thanks everyone.
Ok, I have removed the data on each stat host and also ran hdfs dfs -rmdir on the empty Hive database.
By declaring the host -> IP mappings with https://github.com/kelseyhightower/confd/blob/master/docs/templates.md#map earlier in the etcd template, we should be able to keep the data stored in etcd in line with the current node statuses, as simple as "pooled / not pooled". I'd prefer this, since repeating the IP address in etcd is a potential point of confusion.
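A minimal sketch of what that could look like, using the `map` function from the linked confd template docs; the host names, IPs, etcd key path, and rendered output format are all invented for illustration:

```
{{/* Hypothetical confd template sketch: declare host -> IP mappings once in
     the template, then look hosts up by the plain names stored in etcd, so
     etcd only has to hold pooled/not-pooled state. All values invented. */}}
{{$ips := map "an-tool1005" "10.64.5.31" "an-tool1006" "10.64.5.32"}}
{{range gets "/pools/superset/*"}}
  {{$host := .Value}}
  server {{$host}} {{index $ips $host}}:443;
{{end}}
```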
Mar 30 2022
Mar 29 2022
Ok thanks for chiming in @Marostegui and @Ladsgroup. Here is my updated plan, and I'm planning to kick this off a week from today on April 5 at 15:00 UTC.
Superset 1.4.2 is running on superset-staging: an-tool1005. When making the change, I had to pin markupsafe to 2.0.1 since the default markupsafe it downloaded is not compatible with the version of flask that superset is using.
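The pin, as it might appear in the deploy's frozen requirements (placement and comment are illustrative; the underlying issue is that markupsafe 2.1 removed `soft_unicode`, which the older Flask/Jinja2 stack still imports):

```
markupsafe==2.0.1  # 2.1 removed soft_unicode, breaking the flask version superset uses
```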
I'm working on another superset upgrade (https://phabricator.wikimedia.org/T304972) and I think it's a good time to reconsider this superset-next domain :)
Mar 25 2022
Merged my latest patch to remove Type=notify (https://gerrit.wikimedia.org/r/c/operations/puppet/+/773387; I used the other karapace ticket, so it didn't post here), and after manually restarting karapace, puppet completes without error.
I have opened a patch for this that writes to a config file other than the actual one, so it can be inspected for correctness: https://gerrit.wikimedia.org/r/c/operations/puppet/+/773386
Mar 24 2022
Timers are present!
I can get started on this one. Here's my plan; if it looks good we can announce downtime. I vote to do the upgrade next Tuesday, March 29. All the reimages could be done in one day, which would leave three days until Friday, April 1, when the next round of monthly statistics is computed. If that timeline is too short, we can wait until the week of April 4.
Mar 23 2022
Ok thanks for confirming @KCVelaga_WMF!
@Joe I'm interested in your input, since you mentioned etcd is the way to go here - does the above plan make sense?
Mar 22 2022
Currently this is connected to kafka-test1006.eqiad.wmnet. It is my understanding we will use the "jumbo" kafka cluster. The following netcat times out; we'll need to open the firewall for traffic to kafka-jumbo.
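The netcat invocation itself is not preserved here. An equivalent reachability probe in Python would be the following; the kafka-jumbo broker hostname and port 9092 are assumptions based on the cluster name mentioned above:

```python
# TCP reachability probe, equivalent to the timing-out netcat described above.
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable networks, and timeouts.
        return False

# e.g. port_open("kafka-jumbo1001.eqiad.wmnet", 9092)  # hypothetical broker name
```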