Today
Hi @brouberol!
I would say that we already have at least a draft of the minimal required documentation: https://wikitech.wikimedia.org/wiki/Metrics_Platform/MPIC/Administration
Could you take a look and review it? Is it enough so far? I'm pretty sure we could add more details but, at the same time, I'm not sure where the limit is. So far it covers the deployment details, plus a quick deployment and troubleshooting guide.
Please suggest anything you think would be interesting to have there.
At this point there is already a draft with all the basic details about the Kubernetes deployment: https://wikitech.wikimedia.org/wiki/Metrics_Platform/MPIC/Administration
This is the related MR: https://gitlab.wikimedia.org/repos/data-engineering/mpic/-/merge_requests/97
Fri, Sep 6
@cjming @Sarai-WMF According to the ticket description, some fields are going to be excluded here, among them stream_name. Shouldn't we exclude schema_title as well?
And I guess that all the fields we consider excluded for experiments are the ones we should make optional in the database (so far these fields have been defined as mandatory).
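If we do make them optional, the change could look something like this hypothetical migration sketch (knex-style syntax; the table name, column types and migration tooling are assumptions, not MPIC's actual setup):

```typescript
// Hypothetical sketch only: relax the excluded fields from NOT NULL to
// nullable so experiment records can omit them. Names are illustrative.
import type { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable('instruments', (table) => {
    table.string('schema_title').nullable().alter();
    table.string('stream_name').nullable().alter();
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable('instruments', (table) => {
    table.string('schema_title').notNullable().alter();
    table.string('stream_name').notNullable().alter();
  });
}
```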
After the previous discussion, I have been exploring the suggestions made there and, finally, I have written some notes with three different options for implementing the disaster recovery plan for MPIC: https://docs.google.com/document/d/11exlY2vFBKpqUTKU_dwQdm88-oJn-94icM041kOe3-8
Thu, Sep 5
The current dashboard is available at https://grafana-rw.wikimedia.org/d/ee2057f3-eb34-45a7-a48b-489e3ff0b2ec/mpic?orgId=1
@SGupta-WMF Could you take a look at it? I have been tuning the expressions and queries a bit and it seems that everything is working now.
According to some comments Ben left after merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1070649, we should move the monitoring configuration from the chart to the helmfile values for the next deployment. That's still pending.
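For reference, that usually means enabling it from the service's helmfile values instead of the chart defaults. A hedged sketch only, since the exact key layout depends on the chart's common templates:

```yaml
# helmfile.d/services/mpic/values.yaml (sketch; actual keys may differ)
monitoring:
  enabled: true
```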
The monitoring code is already implemented and deployed, and the dashboard is already live, but some tuning is needed to display all the metrics properly. We based the current dashboard on an existing one for AQS, and the metric names are quite different. We are working on that at the moment.
Just for tracking purposes, I will put here the MR where we implemented the prometheus client library change for MPIC: https://gitlab.wikimedia.org/repos/data-engineering/mpic/-/merge_requests/96. It's already merged.
Tue, Sep 3
We also have the task T361347: Add documentation related to the kubernetes deployment to the MPIC service page, to document some technical details about the Kubernetes deployment for MPIC. The purpose of that ticket is already clear and we can start working on it, but I was wondering whether the right place for that documentation should also be decided here, according to what gets decided about the structure of the MPIC documentation.
After talking with SREs, looking for help and guidance (https://wikimedia.slack.com/archives/C055QGPTC69/p1725291450945039), we have some extra context about this task:
- We should instrument our code using prom-client (see the sketch after this list)
- We can expose the default metrics and also add custom ones
- We have to provide a specific endpoint for metrics, something like /metrics (using the same port as the webapp or a different one)
- At the moment, monitoring is disabled in the MPIC helmfile for both environments, staging and production
- There is already a Grafana dashboard that shows some metrics for MPIC through Envoy: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?forceLogin&orgId=1&var-app=All&var-datasource=thanos&var-destination=All&var-kubernetes_namespace=mpic&var-prometheus=k8s-dse&var-site=eqiad
- We will need a separate dashboard if we add any additional metrics using prom-client
- It's easy to copy items from an existing dashboard to create a new one, and we can also use Grizzly to define the new one as code (the Observability team could help us with this)
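A minimal sketch of that instrumentation, assuming prom-client with an Express app (the /metrics path comes from the notes above; the metric name and labels are illustrative, not MPIC's actual metrics):

```typescript
import express from 'express';
import client from 'prom-client';

const app = express();
const register = new client.Registry();

// Default process/runtime metrics, as suggested above.
client.collectDefaultMetrics({ register });

// An example of an additional custom metric on top of the defaults.
const httpRequests = new client.Counter({
  name: 'mpic_http_requests_total',
  help: 'Total HTTP requests handled by MPIC',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequests.inc({ method: req.method, path: req.path, status: res.statusCode });
  });
  next();
});

// The dedicated metrics endpoint (same port as the webapp in this sketch).
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
```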
After talking with SREs about this, we have some new clues about how to address this task:
- We can expand the template mentioned in the description with any content we consider useful. We could take https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Cluster/Spark_History as a good example, and also https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration, as Andrew suggested above
- The final destination for the documentation should be somewhere close to the project, beneath https://wikitech.wikimedia.org/wiki/Metrics_Platform
I think so. From my perspective, we should consider every action that makes a user lose information they have already filled in the form. I have found that when a user clicks any of the radio buttons there, the form is completely emptied. I would say we should ask for confirmation before doing that.
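Something like the following could work (a minimal sketch; the function names are hypothetical, not MPIC's actual frontend code):

```typescript
// Hypothetical sketch: guard the destructive reset behind a confirmation.
// `formIsDirty` and `onRadioChange` stand in for whatever the real form uses.
function formIsDirty(form: HTMLFormElement): boolean {
  return Array.from(new FormData(form).values()).some((value) => value !== '');
}

function onRadioChange(form: HTMLFormElement, apply: () => void): void {
  if (formIsDirty(form)) {
    const ok = window.confirm('Changing this option will clear the form. Continue?');
    if (!ok) {
      return; // keep what the user already typed
    }
    form.reset();
  }
  apply();
}
```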
Mon, Sep 2
It's difficult to choose one! I think all of them are really good mocks!
Hi! Here are my answers to your questions. The details I think you need are here: https://docs.google.com/spreadsheets/d/1F-Y-hLIkOtGSwraFdvjPKKTiZj9ejG4ollNCg-ijKrg/edit?gid=342440600#gid=342440600
- How many APIs are involved: In the "Sample requests/response" tab you will see all the endpoints we currently have, with sample requests and responses
- The full requests for each API: In the "Sample requests/response" tab you will see sample requests for every endpoint (the mock data we use is too long, so it is kept in separate tabs: mockInstrument, mockApiInstrumentsArray and mockInstrumentsArray, and referenced in the right places by those names). POST_INSTRUMENT_COLUMNS and PUT_INSTRUMENT_COLUMNS (also included there) show which columns from the mockInstrument object we use when registering and updating an instrument, respectively
- Sample payloads for each of the protocols: Sample requests and responses are included in the "Sample requests/response" tab
- Authentication (if any): There is a login mechanism but, for now, we are running all tests using a bypass that disables it. When using the bypass, the username is always testuser. But we could of course discuss how to test the login part as well
- Environment to test in (staging or local): Apart from the local testing environment (details below), we also have a staging environment at https://mpic-next.wikimedia.org
- Process to deploy the env: Take a look at https://gitlab.wikimedia.org/repos/data-engineering/mpic#running-the-development-environment
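As a quick smoke test against staging, something like this should work (the endpoint path below is hypothetical; the real paths and payloads are in the spreadsheet's "Sample requests/response" tab):

```typescript
// Hypothetical smoke test against the staging environment; the path is
// illustrative, check the spreadsheet for the real endpoints.
async function main(): Promise<void> {
  const res = await fetch('https://mpic-next.wikimedia.org/api/v1/instruments');
  console.log(res.status);
  console.log(await res.json());
}

main().catch(console.error);
```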
Tue, Aug 27
Just a note so we don't forget that this deployment is not a regular one. We'll probably have to update the Kubernetes chart to include the new config property so CIM uses the data gateway (eventually all AQS services will use it, so it seems a good idea to add it as a chart property). And we should also remove all the properties that CIM no longer needs (the Cassandra-related ones).
Thu, Aug 22
My initial inkling after reading your original comment about the new instruments' status was to think that setting them to "on" by default could actually be less error-prone 🤔 In their current workflow, could users forget to enable an instrument on time and thus put data collection at risk? Or would they get any sort of error while setting up their experiment in case the instrument is disabled?
To be honest, I don't have a strong opinion about this. Everything sounds very similar to me (probably because of the way my mind translates).
Wed, Aug 21
It will be a pleasure to be involved here. I'll take a look at the document.
I'll copy the dropdown but, anyway, I'll try to agree with @cjming on what to do from our side to avoid confusion. It wouldn't be fair to have two Eng votes, right? xD
@EChukwukere-WMF @SGupta-WMF I'm not sure we can consider this a bug because, historically, if we base the criteria on the existing AQS services, that character (_) was not treated as invalid. No service fails with that value for the project parameter, which is the parameter we validate across the rest of the AQS services.
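To make the question concrete, the historical behaviour would correspond to a validation rule along these lines (illustrative only; the real rule lives in the AQS service code):

```typescript
// Illustrative sketch of the validation question: historically "_" was
// accepted in the project parameter, so a rule like this would not flag it.
const PROJECT_RE = /^[a-z0-9._-]+$/;

function isValidProject(project: string): boolean {
  return PROJECT_RE.test(project);
}

console.log(isValidProject('en.wikipedia.org')); // true
console.log(isValidProject('en_wikipedia.org')); // true under the historical rule
```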
Sorry, my fault! I moved it there by mistake. It should be tested before being moved to deploy.
Tue, Aug 20
I agree with you, @Sarai-WMF, regarding Scenario 3. The new message is more appropriate.
Mon, Aug 19
Done! Just a couple of pending changes! There are some details in the MR
Just in case it helps, I would like to add that we are very interested in automating these integration tests. We have already created a ticket to explore this (T371922: MPIC: Automate integration tests). At the moment we have to run them manually because we need a database, but it seems that GitLab services could run a database as a container within the pipeline (is that feature supported by our current GitLab installation?). That way these tests could run automatically every time we push/merge something.
In fact, this is something we also miss when working with the APIs and the existing test suite we have for them. If I'm not wrong, we already explored this feature there, and the conclusion was that it's not supported by the pipeline we have for that repo (Gerrit + Jenkins).
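If our GitLab runners do support services (the open question above), the job could look roughly like this (a sketch only; job name, images and variables are illustrative):

```yaml
# .gitlab-ci.yml (sketch): run a throwaway database next to the test job.
integration-tests:
  image: node:20
  services:
    - name: mariadb:10.6
      alias: db
  variables:
    MYSQL_DATABASE: mpic_test
    MYSQL_ROOT_PASSWORD: testpassword
  script:
    - npm ci
    - npm run test:integration
```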