
[MPIC] Analyse risk of potential performance issues with static approach to stream configuration
Open, High · Public · 8 Estimated Story Points

Description

Based on discussion with DE (https://docs.google.com/document/d/1S2Ij3FikNGdN8ZwZKcwpAfNmGQoNtah9AwNilxqhC3k/edit), it has been decided to move forward with the static stream config option as a preference.

However, there are unknowns in terms of performance that need to be investigated to ensure that the workflow will scale. This work is necessary irrespective of the static vs dynamic approach, though the two would have slightly different risk profiles.

For example, how will filtering impact performance (especially for dashboarding) with Parquet / Iceberg?

Acceptance Criteria

  • Documented set of usage scenarios covering scaling of storage, requests, etc.
  • Posit performance implications for each
  • Propose mitigation for most likely scenarios

Final status

  • An extreme case for unified instrument volumes was suggested.
  • Some dummy queries were provided.
  • An existing stand-in Hive Parquet table with a suitable structure and volume was selected.
  • Dummy filtering queries were executed against the stand-in table.

The results are summarized at T366627#10015322. The comparison of the results against a 'baseline' dummy 'per-instrument' table was flawed, but the query times against the unified instrument table were not.

Note that Presto query times can vary wildly, as the Presto engine is a shared environment.

The dummy queries filtering for a small instrument in a large unified instrument table can take around 5 seconds.

In T366627#10030263, it was noted that this is fine.

NOTE: in the next quarter or so, Data Engineering should be able to support custom partitioning for event tables, which can only (greatly) improve query times when filtering for a small instrument.

As ~5 seconds is acceptable for now, and things will only improve, we call this task done and will not do further work to compare query times.

Event Timeline

WDoranWMF created this task.

I think the metric we need here mostly is: how will fewer tables affect dashboard (Superset?) latency? If all events for a given instrument(?)/schema(?) (@phuedx help me with terms please) are in one Hive/Iceberg table, will dashboard loading latency be significantly worse?

It will be hard to answer this without volume estimates. How many events per hour (or second) will these streams receive? @phuedx can you make a guess on lower and upper bounds? A guess is fine. Also a guess about volumes per 'discriminating' filter/field would be helpful too.

It also depends on what the dashboarding queries are. Do dashboards generally operate by aggregating events per hour? Per day? Are there intermediate summary tables that are created via pipelines?


All that is to say: we can make some guesses, but it will be hard to give a good answer without an understanding of what the data is and how it will be used.

All that is to say: we can make some guesses, but it will be hard to give a good answer without an understanding of what the data is and how it will be used.

This.

I think that potential performance issues are a risk that we need to be aware of and should be monitoring. Can we monitor query performance on a per table/per query basis? Can we dashboard it?

We already know plenty of techniques to mitigate performance issues (partitioning, generating intermediate summary tables as mentioned above). Are there any query performance optimisations that we couldn't implement because stream configurations would only be static?

Are there any query performance optimisations that we couldn't implement because stream configurations would only be static?

I don't think so. And, if a particular 'discriminator'(?) is too high volume, even if it was in its own table, it could still cause increased dashboard latency.

I guess the question is, will having lots of events for one discriminator in one table affect dashboard latency for other discriminators?

We can probably give a guess to this question. Maybe @JAllemandou or @xcollazo who know more about Iceberg / Parquet details could help.

Question for Joseph and Xabriel: how smart are Parquet/Iceberg with simple 'predicate pushdown' filters? If all dashboard queries use a simple WHERE filter, does having a lot of irrelevant records for that query significantly affect query performance?

Add status quo info for the usage scenarios as well as baseline performance for superset and improvement targets for it.

Question for Joseph and Xabriel: how smart are Parquet/Iceberg with simple 'predicate pushdown' filters? If all dashboard queries use a simple WHERE filter, does having a lot of irrelevant records for that query significantly affect query performance?

I should first qualify my answer: Iceberg is not Druid. It is not designed to build dashboarding solutions on top of. Neither is Hive.

Having said that, not all dashboards need sub-second performance. And if we shape the data carefully, you can indeed get 'couple of seconds' performance with Iceberg, even if it's terabytes of data. Iceberg can certainly do predicate pushdowns, and moreover, we can order the writes so that we efficiently skip many files and many Parquet row groups. We use this trick on wmf_dumps.wikitext_raw, and it gave us a night and day difference for that particular use case (see T340863#9397991 and on). But that is for a very specific query pattern.

If we can share a specific table schema and some specific query examples, we could assess better. We should also specify what the latency expectations of this dashboard are, to better assess whether Iceberg is the right tool.

Thanks @xcollazo!

I think the question at hand is, how much will the query latency of these 2 situations differ?

Situation A:

  • Tables button_experiment_a and button_experiment_b have the same schema.
  • button_experiment_a is very large, but button_experiment_b is not so large.
  • A dashboard for button_experiment_b is built.
    • Let's say a query is select day(dt) as day, session_id, count(*) from button_experiment_b group by day(dt), session_id.

Situation B:

  • A single table button_instrument with a field experiment_id with values like "a" and "b".
  • There are many rows where experiment_id == "a", but not so many where experiment_id == "b".
  • A dashboard is built that only cares about rows where experiment_id == "b".
    • This would have a query like select day(dt) as day, session_id, count(*) from button_instrument where experiment_id = 'b' group by day(dt), session_id

How much will the presence of many irrelevant rows where experiment_id != "b" affect the query latency in Situation B? A lot? A little? I think even a guess like this could help.

I'm hoping that predicate pushdown might mean that the latency is barely affected, but I don't really know!

...does Iceberg have indexes? :p

How much will the presence of many irrelevant rows where experiment_id != "b" affect the query latency in Situation B? A lot? A little? I think even a guess like this could help.
I'm hoping that predicate pushdown might mean that the latency is barely affected, but I don't really know!

Ah, this one is simple: We partition the table by experiment_id. Both Iceberg and Hive support this.

So in fact no predicate pushdown farther than the first layer of metadata is needed.
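
For illustration, a minimal Spark SQL sketch of that approach, using the hypothetical button_instrument table and columns from the example above (not actual DDL from our setup):

-- Hypothetical Iceberg table partitioned by the discriminator column.
-- A query with WHERE experiment_id = 'b' then only reads that partition's files.
CREATE TABLE button_instrument (
  dt            TIMESTAMP,
  experiment_id STRING,
  session_id    STRING,
  action        STRING
)
USING iceberg
PARTITIONED BY (experiment_id);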

How much will the presence of many irrelevant rows where experiment_id != "b" affect the query latency in Situation B?

Zero effect.

...does Iceberg have indexes? :p

It does not, but your use case doesn't seem to need them. IIRC from Iceberg Summit, a couple of companies rolled their own indexes for Iceberg, and they were planning to spec it out for official support. But that is likely 1+ year away.

We partition the table by experiment_id. Both Iceberg and Hive support this.

Hm! This may be easier in the Iceberg world than in Hive because, IIUC, when writing, the partitioning is handled by Iceberg based on the data values, whereas in Hive we have to explicitly tell it what the partition is, right?

Custom partitioning would be a use case for Datasets Config for sure. cc @JAllemandou.


@xcollazo for curiosity sake, what if the table were not partitioned by experiment_id? Would predicate pushdown be enough to ameliorate query latency concerns here?

whereas in Hive we have to explicitly tell it what the partition is, right?

Hive lets you do dynamic partitioning. Here is an example of us using that feature. Note I'm not advocating for Hive; I'm just saying this partitioning pattern is supported by both systems and will give you zero cost for irrelevant experiments when filtering by experiment_id.
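
For reference, a rough sketch of what Hive-style dynamic partitioning looks like (hypothetical table and source names, not the example referenced above):

-- Hypothetical Hive-style dynamic partitioning: the partition value is taken
-- from the data at insert time instead of being spelled out per INSERT.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE button_instrument_hive (
  dt         STRING,
  session_id STRING,
  action     STRING
)
PARTITIONED BY (experiment_id STRING)
STORED AS PARQUET;

-- The partition column must come last in the SELECT.
INSERT INTO TABLE button_instrument_hive PARTITION (experiment_id)
SELECT dt, session_id, action, experiment_id
FROM raw_button_events;  -- hypothetical source table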

for curiosity sake, what if the table were not partitioned by experiment_id? Would predicate pushdown be enough to ameliorate query latency concerns here?

(Assuming Iceberg now) Depends on the data landing:

  • If the INSERTs would typically only include data for a specific experiment_id, then the min/max parquet statistics will kick in and help you a lot because naturally files will only include one experiment, thus other files will be skipped.
  • If the INSERTs would typically include data for multiple experiment_ids, then we could try WRITE ORDERED BY experiment_id to skip as much as we can.

But let's partition! It is the cheapest and most effective solution.

Also, would this system also do dt filters? We could partition/ORDER BY that as well and gain further perf.
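
A sketch of both ideas combined, on the hypothetical button_instrument Iceberg table from above (Spark SQL with the Iceberg extensions enabled; illustrative only):

-- Also partition by day so dt filters prune partitions
-- (assumes the hypothetical button_instrument table sketched earlier).
ALTER TABLE button_instrument ADD PARTITION FIELD days(dt);

-- Sort writes so Parquet min/max statistics can skip files for other
-- experiments even within a partition.
ALTER TABLE button_instrument WRITE ORDERED BY experiment_id, dt;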

If the INSERTs would typically include data for multiple experiment_ids

And if we don't do any special partitioning or inserting?

I ask because these tables are created by automated ingestion jobs. We are working on support for custom configuration per table, but any custom partitioning or insert logic would be manually applied at the moment.

If the INSERTs would typically include data for multiple experiment_ids

And if we don't do any special partitioning or inserting?

Then every SELECT will go over all the data and will have to filter it at runtime, meaning that experiment_ids with more data will definitely slow down queries for smaller ones.

How will these dashboards be served? Via Presto?

How will these dashboards be served? Via Presto?

Not sure, likely yes. But I believe there may be some desire to have some pipelines that get this stuff into AQS somehow. @phuedx @VirginiaPoundstone

Then every SELECT will go over all the data and will have to filter it at runtime, meaning that experiment_ids with more data will definitely slow down queries for smaller ones.

Okay, then the question to answer is: by how much? It'd be nice if we could make a guess.

@phuedx can you come up with

  • A guess for average throughput for 2 instrumentations(?), one with lots of events and one with few
  • A guess at a naive and simple query someone would run on a dashboard. (Maybe count per day? e.g. a daily button click count?)

We could then generate some artificial data and compare.

@MNeisler just caught me up on this. I just want to share some thoughts about

How will these dashboards be served? Via Presto?

I think we (not just Product Analytics but data practitioners in general) might have gotten too dependent on Presto in Superset. When Presto was introduced, it was in the Oozie / pre-Airflow times, so the ability to calculate a metric from raw event data with Presto and then easily turn the query results into a chart that can be added to a dashboard was revolutionary to our workflow. It also meant that metrics could be defined on a project-by-project basis and easily implemented & monitored.

We're trying to move away from that (teams coming up with project-specific metrics for each project), and instead pursue our strategy/vision of everyone using a shared set of essential metrics with governance and trusted datasets of their measurements. Presto was originally a boon for our workflow but is also, I think, enabling what I would call a bad habit in the long term.

Where Presto (via Superset's SQL Lab) shines is very quick analyses/answers to small questions and prototyping/iterating on metrics and dashboards. For creating trusted datasets and stable, performant dashboards that feature teams use to monitor usage of their features (including results of their experiments), we should use pre-computed essential metrics with all calculations offloaded to Airflow pipelines.
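
As a hedged illustration of that approach (all names and the schema below are invented; this is not an existing pipeline), a scheduled job could materialize a small daily rollup that dashboards read instead of the raw events:

-- Hypothetical pre-computed metric table; an Airflow task would append one
-- day's rows per run, so dashboards never scan the raw event table.
CREATE TABLE IF NOT EXISTS metrics.button_daily_clicks (
  day             DATE,
  experiment_name STRING,
  variant_name    STRING,
  clicks          BIGINT
)
USING iceberg
PARTITIONED BY (day);

INSERT INTO metrics.button_daily_clicks
SELECT
  CAST(dt AS DATE)        AS day,
  experiment.name         AS experiment_name,
  experiment.variant_name AS variant_name,
  COUNT(1)                AS clicks
FROM raw_button_events  -- hypothetical raw event table
WHERE action = 'click'
  AND CAST(dt AS DATE) = DATE '2024-07-17'  -- the run's target day
GROUP BY 1, 2, 3;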

@phuedx can you come up with

  • A guess for average throughput for 2 instrumentations(?), one with lots of events and one with few

A lot? SessionTick. A few? EditAttemptStep (I'm sure there are instruments that submit fewer events if we need a better example).

Instrument Name | Average Daily Event Rate (events/day)
SessionTick | 117,504,000
EditAttemptStep | 7,361,280
  • A guess at a naive and simple query someone would run on a dashboard. (Maybe count per day? e.g. a daily button click count?)
[0]
SELECT COUNT(1) AS count FROM …;
[1]
SELECT
  action,
  COUNT(1) as count
FROM
  …
WHERE
  instrument_name = …
GROUP BY
  1
;
[2]
SELECT
  experiment.name,
  COUNT(DISTINCT(enrollment_token)) AS n_enrollments
FROM
  …
WHERE
  experiment.name = _
  AND action = 'enroll'
GROUP BY
  1
;
[3]
SELECT
  experiment.name,
  experiment.variant_name,
  SUM(IF(action = 'click', 1, 0)) / COUNT(1) AS click_through_rate
FROM
  …
WHERE
  experiment.name = …
  AND action IN ( 'init', 'click' )
GROUP BY
  1, 2
;

All credit to @mpopov for above example queries.

we should use pre-computed essential metrics with all calculations offloaded to Airflow pipelines

wow that sounds amazing!

So, if we have pipelines anyway, then that could mitigate these concerns by splitting off the experiments/instruments/discriminator whatever into their own standardized metric tables?

As currently envisioned, a future datasets config / management system where every dataset/table is explicitly declared may make automation (e.g. a pipeline auto-splitting into per-experiment tables here) more difficult, but I hope we can work with the datasets config design to compromise with defaults for automation. TBD.

By that time, ideally we could just use custom Iceberg partitions as Xabriel suggests.

We haven't done any number crunching, but I think we can assume that we will be able to ultimately handle this with Iceberg partitioning.

It would be nice to generate some fake data and run some test queries on it to get an idea for latency added on by filtering on discriminator. I have not had time to do this yet.

It would be nice to generate some fake data and run some test queries on it to get an idea for latency added on by filtering on discriminator. I have not had time to do this yet.

Is this something that you can own? Would you need any support from Data Products?

We discussed in Data Eng Sync meeting today.

We will eventually support custom table partitioning, as described here, so we can be sure that ultimately the 'static stream config' approach will be fine.

In the near term...

It would be nice to generate some fake data and run some test queries on it to get an idea for latency added on by filtering on discriminator

Is this something that you can own? Would you need any support from Data Products?

Yes, Data Engineering team will own this. For prioritization: do you have a sense on timeline/urgency on this? (It will probably be me doing this and I have to balance a bunch of stuff! :) )

@phuedx ^ in case you missed:

For prioritization: do you have a sense on timeline/urgency on this? (It will probably be me doing this and I have to balance a bunch of stuff! :) )

@Ottomata mid August? First Product team will use MPIC early October and I want to make sure we have time to QA our implementation.

Great thank you. I will try to get this done by then.

Okay, here are some dummy results for ya.

I used event.mediawiki_client_session_tick as my example data. While this is not an 'instrument/experiment' table, its shape and size is enough to execute queries similar to what might be executed on instrument metric tables.

I used meta.domain as a stand-in for 'experiment_name' (or whatever the discriminator will be). unified_experiment_table is a table with a full day of event.mediawiki_client_session_tick data, about 100 million records, as a stand-in for what might be a single table with multiple experiments in it. small_experiment_table has a subset of about 1 million records where meta.domain == 'ar.wikipedia.org'.

Each pair of queries shows the Presto query latency I got when executing it on small_experiment_table vs on unified_experiment_table filtering on meta.domain (my stand-in for experiment_name).

The actual queries can be found in this Dummy Experiment Table Presto Query Latencies spreadsheet. Thanks to @mpopov and @phuedx for query examples.

Results

Queries filtering for a small experiment in a large table are indeed slower. For this data size, it adds around 4 seconds to the query time.

Is this additional query latency acceptable for Metrics Platform immediate dashboarding needs?

Note that if we partition the table on the discriminator, the query times shouldn't be affected. We don't currently have the capability to automate custom partitioning of event tables, but we should be able to support it in the future.

If the query latency is not acceptable, it might be possible to manually partition the MP Hive event tables sooner. We'd have to look into this to be sure.

(Also: please check my work! My SQL-fu is low!)

Description | Latency | Result if relevant
Total count of records in unified table | 384ms | 107845742

Simple count
  small experiment table | 315ms | 1097193
  unified table filter for small experiment | 4s | 1097193

Group by and count
  small experiment table | 1s
  unified table filter for small experiment | 6s

Count distinct data field, with filter on other data field
  small experiment table | 2s
  unified table filter for small experiment | 6s

Group by and sum on conditional
  small experiment table | 2s
  unified table filter for small experiment | 5s
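
For reference, the "unified table filter for small experiment" queries were all of this general shape; for example, the simple count (illustrative only — the exact queries are in the spreadsheet linked above):

-- Simple count against the unified stand-in table, filtering on the
-- stand-in discriminator (meta.domain).
SELECT COUNT(1) AS count
FROM otto.unified_experiment_table
WHERE meta.domain = 'ar.wikipedia.org';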

Here are the queries I used to create the dummy tables:


CREATE TABLE `otto`.`unified_experiment_table` (
  `_schema` STRING,
  `meta` STRUCT<`uri`: STRING, `request_id`: STRING, `id`: STRING, `dt`: STRING, `domain`: STRING, `stream`: STRING>,
  `http` STRUCT<`protocol`: STRING, `method`: STRING, `status_code`: BIGINT, `has_cookies`: BOOLEAN, `request_headers`: MAP<STRING, STRING>, `response_headers`: MAP<STRING, STRING>>,
  `client_dt` STRING,
  `tick` BIGINT,
  `config_tick_ms` BIGINT,
  `config_idle_ms` BIGINT,
  `config_reset_ms` BIGINT,
  `test` MAP<STRING, BIGINT>,
  `user_agent_map` MAP<STRING, STRING>,
  `dt` STRING,
  `is_wmf_domain` BOOLEAN,
  `normalized_host` STRUCT<`project_class`: STRING, `project`: STRING, `qualifiers`: ARRAY<STRING>, `tld`: STRING, `project_family`: STRING>,
  `datacenter` STRING,
  `year` BIGINT,
  `month` BIGINT,
  `day` BIGINT,
  `hour` BIGINT)
USING parquet;



CREATE TABLE `otto`.`small_experiment_table` (
  `_schema` STRING,
  `meta` STRUCT<`uri`: STRING, `request_id`: STRING, `id`: STRING, `dt`: STRING, `domain`: STRING, `stream`: STRING>,
  `http` STRUCT<`protocol`: STRING, `method`: STRING, `status_code`: BIGINT, `has_cookies`: BOOLEAN, `request_headers`: MAP<STRING, STRING>, `response_headers`: MAP<STRING, STRING>>,
  `client_dt` STRING,
  `tick` BIGINT,
  `config_tick_ms` BIGINT,
  `config_idle_ms` BIGINT,
  `config_reset_ms` BIGINT,
  `test` MAP<STRING, BIGINT>,
  `user_agent_map` MAP<STRING, STRING>,
  `dt` STRING,
  `is_wmf_domain` BOOLEAN,
  `normalized_host` STRUCT<`project_class`: STRING, `project`: STRING, `qualifiers`: ARRAY<STRING>, `tld`: STRING, `project_family`: STRING>,
  `datacenter` STRING,
  `year` BIGINT,
  `month` BIGINT,
  `day` BIGINT,
  `hour` BIGINT)
USING parquet;



INSERT INTO `otto`.`unified_experiment_table`
SELECT * FROM `event`.`mediawiki_client_session_tick`
WHERE year=2024 AND month=7 AND day=17;


INSERT INTO `otto`.`small_experiment_table`
SELECT * FROM `otto`.`unified_experiment_table`
WHERE meta.domain = 'ar.wikipedia.org';

Queries filtering for a small experiment in a large table are indeed slower. For this data size, it adds around 4 seconds to the query time.

@Ottomata it adds 4 seconds on top of an existing baseline of query time? Or is it 4 seconds total to run a query for a dashboard? I'm looking for estimated total times.

Is this additional query latency acceptable for Metrics Platform immediate dashboarding needs?

Maybe? Depends on the current baseline vs the estimated new time. CC @mpopov for opinions on this too.

it adds 4 seconds on top of an existing baseline of query time?

My test was for an extreme case: a query filtering for an experiment with very few records in a unified table with many many records.

The more records that need to be searched through, the longer it will take. This unified table had about 100 million records for 1 day, which adds up to about 1000 events per second. The small experiment had about 1 million records, so about 10 events per second.

This example was mocking an extreme case where an instrumentation (?) (all events in one stream per schema?) emitted 1000 events per second, and a single experiment emitted 10 events per second. At that scale, assuming daily metric rollups, then yes: having all events in one stream table increases query times for queries concerned only with the small experiment from around 1-2 seconds to 4-5 seconds.

(Once again, please forgive if I am misusing terms like instrumentation and experiment. I'm still not sure what the 'discriminator' is intended to be here.)

I'm looking for estimated total times.

The estimated total times are in the Latency column of the table in the Results section of this comment.

Hm, I wonder if filtering on a bucketed field would help. Partitioning would be better, but we could probably do bucketing now. I will try it!
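
For reference, a bucketed copy of the stand-in table could be created with something like this (Spark SQL sketch, illustrative only; the nested meta.domain field is pulled out to a top-level column since bucketing columns need to be top-level):

-- Hypothetical bucketed copy of the stand-in data, 32 buckets on the
-- discriminator; rows for the same domain land in the same bucket files.
CREATE TABLE otto.unified_experiment_table_bucketed
USING parquet
CLUSTERED BY (domain) INTO 32 BUCKETS
AS SELECT u.*, u.meta.domain AS domain
FROM otto.unified_experiment_table u;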

I just tried the same queries on a unified table bucketed by meta.domain (my discriminator field) into 32 buckets. The query results did not improve.

I've updated the spreadsheet with the results, but since bucketing didn't make a difference I won't update the summarized results in my comment above.

meta.domain has about 796 distinct values in my dataset, so 32 buckets would be about 24 domains per bucket. Of course, meta.domain is highly skewed; I guess it makes sense that bucketing didn't necessarily improve query times. If the domain I'm filtering on happens to be in the same bucket with a domain with larger requests, the bucketing won't help that much.

Maybe some other bucket value would help? Probably not.

I'm getting out of my comfort zone here. @xcollazo does ^ make sense to you?

(yes, we should just partition. we will ultimately do that.)

We have prepared some work to Refine raw events directly into Iceberg tables.

I looked at the previous experiments on presto query latency. I was very interested, so I built one of my own with Iceberg to verify the hypothesis that Iceberg is faster than Hive+Parquet in the case of a per-experiment request on a unified table.

So, I created an Iceberg table with hourly partitioning (meta.dt) + partitioning by meta.domain. I ran the same query on both tables, and the results were unexpected.

First, the remnants of malformed deleted data (from my first iteration of ingestion) in my table puzzled me. By creating plenty of hourly partitions, they made the query time skyrocket.

Past this "bug", I could run on a clean Iceberg table, yet the query times still exceed the request on hive+parquet by far.

After investigations, I found a way to optimize the query time on Iceberg:

  • Switching the time partitioning from hourly to daily helps a little in our case, as fewer partitions were created.
  • The main improvements come from querying nested columns properly. This can be achieved:
    • by setting spark.sql.optimizer.nestedSchemaPruning.enabled=true and querying the nested columns as usual,
    • or by unnesting the columns into their own columns (meta.dt > dt, meta.domain > domain). I found only this option working from Presto.

After optimizations, the "unified table filter for small experiment" query takes 1s on Iceberg, down from 4s with Hive+Parquet. Script here: https://gitlab.wikimedia.org/-/snippets/148

Let us know if you want to see those tables on Iceberg sooner than on Hive.
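
For reference, a minimal sketch of the optimized layout described above (illustrative names; the real script is in the linked snippet):

-- Option 1 (Spark only): keep the nested columns and enable pruning:
--   SET spark.sql.optimizer.nestedSchemaPruning.enabled=true;
-- Option 2 (also works from Presto): unnest the hot fields and partition
-- by day plus the discriminator, e.g.:
CREATE TABLE otto.unified_experiment_table_iceberg (
  dt     TIMESTAMP,
  domain STRING,
  tick   BIGINT
  -- ...remaining columns as in the Parquet stand-in table
)
USING iceberg
PARTITIONED BY (days(dt), domain);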

If the domain I'm filtering on happens to be in the same bucket with a domain with larger requests, the bucketing won't help that much.

Maybe some other bucket value would help? Probably not.

I'm getting out of my comfort zone here. @xcollazo does ^ make sense to you?

This makes sense. Bucketing will only help when data skew is not an issue, and it mostly helps in joins rather than outright reads, and IIRC we do not require joins in this scenario.

(yes, we should just partition. we will ultimately do that.)

I thought the reason we don't partition now is that that requires new DDL automation code that can be the dataset config thingy, no? If so, bucketing would require the same automation, no?

So, I created an Iceberg table with hourly partitioning (meta.dt) + partitioning by meta.domain. I ran the same query on both tables, and the results were unexpected.

First, the remnants of malformed deleted data (from my first iteration of ingestion) in my table puzzled me. By creating plenty of hourly partitions, they made the query time skyrocket.

I speculate query planning was taking most of the time. This goes back to Iceberg not being designed to run dashboards directly on top of (i.e. not meant for sub 1 second queries). But, considering that it is what it is, we need to balance partition cardinality with query planning time.

I thought the reason we don't partition now is that that requires new DDL automation code that can be the dataset config thingy, no? If so, bucketing would require the same automation, no?

We can manually create new tables with whatever DDL we like; as is, the Refine automated table management step will not mess with partitions or bucketing once the table is created. But to do this in an automated way, yes: we need an automated way to set the partitions (or bucketing... which we don't really need) for a table. We will rely on EventStreamConfig to do this for the Refine-in-Airflow refactor for now.

We might be able to do custom partitioning sooner rather than later, especially if we do Iceberg explicitly for new stream tables only. @Antoine_Quhen and I are discussing this.

But, for this ticket, I'd still like to have an estimate of the effect on query times when filtering for a small instrument (I think instrument_name is the correct discriminator, not experiment_name... :) ) in a table with many events for other instruments, in Hive Parquet tables.

Upon investigation, my previous analysis was flawed: The comparison was done between data files of different sizes. Also, it seems Presto query times can vary wildly, even on the same data! Sometimes a query will take < 1 second, other times it will take 5 or 6 seconds. My tables also did not have any partitions at all; I just inserted a day of data. I expected this to be fine because I wasn't doing any filtering on time. For a better comparison I will add the hourly partitions.

I will:

  • Do a new comparison using a small instrument table with many (about 4) data files per partition (just like the event tables have), as well as with the same partition layout as the baseline table.
  • Run the Presto queries many times and report average, etc. stats on query latency times.

@Ottomata @xcollazo: I can't review all the discussion on this ticket but Andrew pointed me here:

will create significant limitations for how much data we would be able to query with Presto

If we partition correctly, it shouldn't. T366627#9871868

I saw there was some discussion about partitioning based on experiment info. I wanted to update you that the way we are going to tag events with experiment data is different than how you've been talking about it, see T368326: Update Metrics Platform Client Libraries to accept experiment membership – in case you were talking seriously about using that strategy and weren't just using it as a fake example.

partitioning based on experiment info

Ya, I think I've been using the wrong terminology. (I never really knew if it was previously a stream per instrument (I don't really know what the definition of an instrument is) or a stream per experiment.) From more recent readings, I think I should be using 'instrument' instead of 'experiment' for the purposes of this ticket.

In any case, the outcome will be the same. It's select from <focused_table> vs select from <unified_table> where <discriminator> = <focused_value>. We are pretty sure that partitioning on <discriminator> should make presto query times between those approximately the same. Some extra query time will be spent on partition pruning, but there shouldn't be more data that needs to be filtered through.

What do you think about these latencies @nettrom_WMF @mpopov? Acceptable?

@Milimetric what impact would this have if we use growthbook?

What do you think about these latencies @nettrom_WMF @mpopov? Acceptable?

Copying my response over from Slack so we keep the conversation in one place:

Took me a little work to find the decision task where @mpopov mentioned the consequences of going to one table, but also described how some of our work practices are affected (in T367057#10025457). I agree with his point that we should work towards not querying/dashboarding on the raw data and instead have something like processing steps/ETLs, something which has also come up in various other conversations the past couple of weeks. I think the latency can be a motivator for that kind of improvement, although I'm reluctant to expect we'll have resources to make a lot of headway on that soon. While I wish for blazingly fast queries in any work I do, I think the approach we're looking at and the resulting latency is acceptable.

Yes. It is on the DE board in the Done column. I think @Ahoelzl might like to resolve in bulk at the end of the sprint/quarter though?