Gaps in Grafana graphs using Thanos
Open, Needs Triage · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

  • There are gaps in the RefreshLinksJob::getParserOutputFromCache (Prometheus) graph
  • When zooming out to 12 hours or more, these gaps vanish and the data seems to be present.
  • The gaps are not present in the RefreshLinksJob::getParserOutputFromCache (graphite) graph (these two graphs should show the same data, and they appear to, except that the Prometheus data is more "spiky")

What should have happened instead?:

Other information (browser name/version, screenshots, etc.):

24 hour window works fine:

grafik.png (404×1 px, 65 KB)

6 hour window is broken:

grafik.png (407×1 px, 52 KB)

15 minute window is broken:

grafik.png (408×1 px, 44 KB)

The query works fine for a 6 hour window on Thanos:

grafik.png (740×780 px, 165 KB)

Event Timeline

Thank you for the detailed report @daniel ! It made it super easy to reproduce and investigate. I played around with the rate() interval and it looked like there wasn't enough data for e.g. rate([2m]) to plot something meaningful. Given that we scrape every 60s, I thought maybe Prometheus can't scrape often enough from statsd-exporter. Sure enough, e.g. for the mw-jobrunner namespace, statsd-exporter appears to be heavily CPU-throttled (unless I'm misreading the graphs): https://grafana.wikimedia.org/goto/pfQrsErIR?orgId=1 (the link will unfortunately break when statsd-exporter gets new pods)

2024-08-06-155659_1858x1123_scrot.png (1×1 px, 146 KB)
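
As a rough sanity check of the missing samples (a sketch only; the metric is taken from the exporter output quoted further down in this task, and the labels may need adjusting), counting samples per window makes the problem visible: with a healthy 60s scrape a 10m window should contain about 10 points, and rate() needs at least two samples in its window to return anything at all.

# Sketch: how many samples actually landed in each 10m window?
# Values well below 10 (with 60s scrapes) explain why rate(...[2m]) has
# nothing to work with and the panel shows gaps.
count_over_time(
  mediawiki_refreshlinks_parsercache_operations_total{status="cache_hit"}[10m]
)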

I'm working on bumping the statsd-exporter limits, and in the meantime I got curious about general CPU throttling stats in k8s: https://w.wiki/ArXp

Unsurprisingly, statsd-exporter is in the top 15, along with a few service containers:

{container="statsd-exporter", namespace="mw-web", site="eqiad"}21.931203316177598
{container="statsd-exporter", namespace="mw-api-ext", site="eqiad"}13.23978883967244
{container="statsd-exporter", namespace="mw-web", site="codfw"}11.908566182694393
{container="statsd-exporter", namespace="mw-api-int", site="eqiad"}7.6056624905245105
{container="thumbor-8082", namespace="thumbor", site="eqiad"}6.3241434191838914
{container="statsd-exporter", namespace="mw-jobrunner", site="eqiad"}6.201692060091147
{container="statsd-exporter", namespace="mw-api-ext", site="codfw"}5.848601039738285
{container="linkrecommendation-internal", namespace="linkrecommendation", site="eqiad"}5.102385104628782
{container="wikifeeds-production-tls-proxy", namespace="wikifeeds", site="eqiad"}4.340646963265761
{container="thumbor-8081", namespace="thumbor", site="eqiad"}3.969756827945107
{container="wikifeeds-production-tls-proxy", namespace="wikifeeds", site="codfw"}2.745671794745576
{container="statsd-exporter", namespace="mw-parsoid", site="eqiad"}2.6688748266749003
{container="thumbor-8080", namespace="thumbor", site="eqiad"}1.9259288830619037
{container="mobileapps-production", namespace="mobileapps", site="eqiad"}1.4022554075214957
{container="thumbor-8080", namespace="thumbor", site="codfw"}1.175055309842738

Change #1060411 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mw-jobrunner: bump limit/request for statsd-exporter

https://gerrit.wikimedia.org/r/1060411

Change #1060411 merged by Filippo Giunchedi:

[operations/deployment-charts@master] mw-jobrunner: bump limit/request for statsd-exporter

https://gerrit.wikimedia.org/r/1060411
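
The patch itself isn't reproduced here; in chart-values terms a change like this roughly amounts to raising the statsd-exporter container's requests/limits, along the lines of the sketch below (key names and numbers are illustrative, not the values actually merged in change 1060411):

# Illustrative values-file fragment only, not the merged change.
statsd_exporter:
  resources:
    requests:
      cpu: 500m
      memory: 100Mi
    limits:
      cpu: 1
      memory: 200Mi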

Even though there's no throttling now, the mw-jobrunner statsd-exporter still shows as down from e.g. the Prometheus eqiad k8s instance: https://prometheus-eqiad.wikimedia.org/k8s/targets?scrapePool=k8s-pods&search=jobrunner

And indeed, when trying to simulate a scrape, curl just sits there waiting for an answer after the GET:

prometheus1005:~$ curl 10.67.184.160:9125/metrics -v
* Uses proxy env variable no_proxy == '.wmnet'
*   Trying 10.67.184.160:9125...
* Connected to 10.67.184.160 (10.67.184.160) port 9125 (#0)
> GET /metrics HTTP/1.1
> Host: 10.67.184.160:9125
> User-Agent: curl/7.74.0
> Accept: */*
>

Just a quick update: @hnowlan rightfully pointed out that I was looking at port 9125, which does indeed fail; however, statsd-exporter actually exposes its metrics on 9102, which works as expected:

prometheus1005:~$ curl 10.67.184.160:9102/metrics -s | grep -i refreshlinks
mediawiki_EventBus_outgoing_events_total{event_service_name="eventgate_main",event_service_uri="http_localhost_6005_v1_events",event_type="TYPE_JOB",function_name="send",stream_name="mediawiki_job_refreshLinks"} 373350
# HELP mediawiki_WikibaseClient_PageUpdates_RefreshLinks_jobs_total Metric autogenerated by statsd_exporter.
# TYPE mediawiki_WikibaseClient_PageUpdates_RefreshLinks_jobs_total counter
mediawiki_WikibaseClient_PageUpdates_RefreshLinks_jobs_total 54934
# HELP mediawiki_WikibaseClient_PageUpdates_RefreshLinks_titles_total Metric autogenerated by statsd_exporter.
# TYPE mediawiki_WikibaseClient_PageUpdates_RefreshLinks_titles_total counter
mediawiki_WikibaseClient_PageUpdates_RefreshLinks_titles_total 54934
# HELP mediawiki_refreshlinks_failures_total Metric autogenerated by statsd_exporter.
# TYPE mediawiki_refreshlinks_failures_total counter
mediawiki_refreshlinks_failures_total{reason="lock_failure"} 60
mediawiki_refreshlinks_failures_total{reason="page_not_found"} 6
# HELP mediawiki_refreshlinks_parsercache_operations_total Metric autogenerated by statsd_exporter.
# TYPE mediawiki_refreshlinks_parsercache_operations_total counter
mediawiki_refreshlinks_parsercache_operations_total{status="cache_hit"} 106048
mediawiki_refreshlinks_parsercache_operations_total{html_changed="n_a",status="cache_hit"} 23027
mediawiki_refreshlinks_parsercache_operations_total{html_changed="no",status="cache_miss"} 48282
mediawiki_refreshlinks_parsercache_operations_total{html_changed="unknown",status="cache_miss"} 97965
mediawiki_refreshlinks_parsercache_operations_total{html_changed="yes",status="cache_miss"} 4332
# HELP mediawiki_refreshlinks_superseded_updates_total Metric autogenerated by statsd_exporter.
# TYPE mediawiki_refreshlinks_superseded_updates_total counter
mediawiki_refreshlinks_superseded_updates_total 78182

Investigation continues

Regarding the throttling: Maybe it would help to set GOMAXPROCS to something sensible/related to the CPU limit (see https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits, https://github.com/uber-go/automaxprocs/tree/master)
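
For reference, one way to do the former without touching the exporter's code is the Kubernetes downward API, which can derive the value from the container's CPU limit (a sketch; the field layout follows the upstream Kubernetes API, not any particular WMF chart). The automaxprocs library linked above achieves the same thing at process start-up.

# Sketch: expose the CPU limit (rounded up to whole cores by the divisor)
# as GOMAXPROCS, so the Go runtime doesn't schedule more threads than the
# CFS quota allows and then get throttled.
env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        containerName: statsd-exporter
        resource: limits.cpu
        divisor: "1"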

Edit: Whoops, I completely missed T371885#10048618 onward before posting this. In any case, question still stands re: the annotation :)

For the endpoints marked down: it looks as if prometheus is scraping both container ports - i.e., 9102 (correct) and 9125 (statsd listen port, incorrect).

Not sure if that could somehow cause problems like those described in the task description, but it would at least explain T371885#10048571.

I wonder if we need to add an explicit prometheus.io/port annotation to ensure only 9102 is scraped?

I couldn't find any gaps in the data, but please let me know if you do!

For awareness: https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/

Possibly $__rate_interval is calculating some interval that yields no data (or not enough data to render a valid rate() result). I tried updating "min step" on the graph to 2m and zooming seems to work with that.
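
For illustration, this is roughly what such a panel query looks like and why a small interval breaks it (the metric name is taken from the exporter output above; the exact dashboard query may differ):

# Sketch of the panel query. Grafana derives $__rate_interval from the panel
# interval (which shrinks as you zoom in) and the scrape interval; when the
# resulting window covers fewer than two 60s scrapes, most windows hold <2
# samples, rate() returns nothing, and the graph shows gaps.
sum(rate(
  mediawiki_refreshlinks_parsercache_operations_total{status="cache_hit"}[$__rate_interval]
))
# Setting the panel's "Min step"/"Min interval" to 2m keeps the window at or
# above two scrape periods, which is the workaround described above.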

> For the endpoints marked down: it looks as if prometheus is scraping both container ports - i.e., 9102 (correct) and 9125 (statsd listen port, incorrect).
>
> I wonder if we need to add an explicit prometheus.io/port annotation to ensure only 9102 is scraped?

That's not the best way to do it in modern-wmf-k8s, although we haven't fixed a lot of those cases. The correct thing to do here is to declare prometheus.io/scrape_by_name: true instead of prometheus.io/scrape: true; that will make prometheus only scrape ports whose name ends in -metrics.
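
Sketched out, that change amounts to swapping the annotation and making sure the exporter's metrics port carries a -metrics name (the fragment below is illustrative; the real change lives in the deployment-charts templates):

# Illustrative pod spec fragment, not the merged chart change.
metadata:
  annotations:
    prometheus.io/scrape_by_name: "true"   # instead of prometheus.io/scrape: "true"
spec:
  containers:
    - name: statsd-exporter
      ports:
        - name: statsd-metrics       # names ending in -metrics get scraped
          containerPort: 9102
        - name: statsd               # ingest port, no longer scraped
          containerPort: 9125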

Change #1060669 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] statsd-exporter: only scrape metrics from the prometheus ports

https://gerrit.wikimedia.org/r/1060669

Change #1060669 merged by jenkins-bot:

[operations/deployment-charts@master] statsd-exporter: only scrape metrics from the prometheus ports

https://gerrit.wikimedia.org/r/1060669

> Possibly $__rate_interval is calculating some interval that yields no data (or not enough data to render a valid rate() result). I tried updating "min step" on the graph to 2m and zooming seems to work with that.

Right - so this is an issue with Grafana rather than Thanos/Prometheus.

It seems like $__rate_interval is based on the "Interval" query option, which in turn is based on the zoom level. When that drops below the actual scrape rate, no data is found for most intervals. Setting the "Min interval" option to 2m indeed fixes it. The documentation says that Min interval should be set to the write frequency. Is there a way we can do that automatically based on the data source? Or at least globally for all Prometheus sources? Doing this for every graph manually is really annoying...

Thank you all for the investigation and help on this -- appreciate it!

To recap, this problem is actually the same as what's discussed at T371102: Include long-retention Prometheus data from Thanos into Grafana queries, and the fix I have is to set min interval + scrape interval on the default thanos datasource so dashboards using $__rate_interval work out of the box. Also for context, $__rate_interval is set to 4x the min interval (IIRC), and the scrape interval is also taken into account. In our case the scrape interval is 60s, and with min interval set to 30s that will yield a minimum rate_interval of 2m, which is what we're after. I'll follow up on T371102
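
In datasource-provisioning terms this boils down to something like the sketch below (not the actual WMF provisioning; the URL is a placeholder and jsonData.timeInterval is the field behind the datasource-level "Scrape interval"/"Min time interval" setting):

# Sketch only. Per the Grafana post linked earlier, $__rate_interval is
# roughly max(panel interval + scrape interval, 4 x scrape interval), so a
# 30s value here keeps it at 2m or more even when zoomed far in.
apiVersion: 1
datasources:
  - name: thanos
    type: prometheus
    access: proxy
    url: http://thanos-query.example.internal   # placeholder, not the real URL
    jsonData:
      timeInterval: "30s"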

Change #1061856 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mediawiki: bump limit/request for statsd-exporter

https://gerrit.wikimedia.org/r/1061856

Change #1061856 merged by Filippo Giunchedi:

[operations/deployment-charts@master] mediawiki: bump limit/request for statsd-exporter

https://gerrit.wikimedia.org/r/1061856

Mentioned in SAL (#wikimedia-operations) [2024-08-19T17:29:00Z] <swfrench-wmf> statsd-exporter resource bumps (https://gerrit.wikimedia.org/r/1061856) are now everywhere - T371885