wmfdata-r v2 should mainly be a wrapper for wmfdata-py
Open, Stalled, Low, Public

Assigned To

Authored By

	mpopov
	Mar 17 2022, 3:41 PM

Description

wmfdata-py is currently the best way to query our production data sources for data analysis. It has excellent support for:

Hive via PyHive. Meanwhile, wmfdata-r still relies on wrapping the hive CLI and suffers from the same problems that wmfdata-py had prior to the switch (T275233).
Spark via PySpark. Our configuration prevents us from using sparklyr.
Presto via presto-python-client. RPresto's support for Kerberos-ized setups appears to be ¯\_(ツ)_/¯

A new version of wmfdata-r is needed, one that is just a wrapper for wmfdata-py's database-accessing functions (via reticulate).
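
A minimal sketch of the thin-delegation idea, written in Python for illustration only. The names run_query and SUPPORTED_SOURCES are hypothetical, and the sketch assumes each wmfdata-py submodule (hive, spark, presto) exposes a run() function that accepts a query string. The point is that the wrapper holds no query logic of its own:

```python
import importlib
from typing import Any, Optional

# Data sources named in the task description. Assumption: each maps to a
# wmfdata-py submodule that exposes a run() function.
SUPPORTED_SOURCES = ("hive", "spark", "presto")


def run_query(query: str, source: str = "hive", backend: Optional[Any] = None) -> Any:
    """Hypothetical thin wrapper: validate the source, then delegate."""
    if source not in SUPPORTED_SOURCES:
        raise ValueError(f"unsupported source: {source!r}")
    if backend is None:
        # Imported lazily so the wrapper itself loads without wmfdata installed.
        backend = importlib.import_module(f"wmfdata.{source}")
    # No query logic lives here; the result is whatever the backend returns.
    return backend.run(query)
```

In wmfdata-r the same delegation would go through reticulate, with the R function importing the Python module and returning its result. The optional backend argument exists only so the delegation can be exercised without access to production systems.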

Not only is wmfdata-r vastly outdated and limited in its ability to access production data sources, it has also become bloated: miscellaneous functions (e.g. sample size calculations for the χ² test, Wikimedia color palettes) were added over time and should be factored out into a separate package or dropped entirely.

Related Objects

Mentioned Here: T275233: wmfdata-python's Hive query output includes logspam

Event Timeline

mpopov created this task. Mar 17 2022, 3:41 PM

While it would be a significant initial investment, long-term maintenance would be minimal, since the majority of the maintenance burden would fall on the underlying Python codebase.

mpopov updated the task description. Mar 17 2022, 4:03 PM

ldelench_wmf triaged this task as Low priority. Mar 29 2022, 5:11 PM

ldelench_wmf moved this task from Triage to Backlog on the Product-Analytics board.

Repo with package skeleton created at https://gitlab.wikimedia.org/repos/product-analytics/wmfdata-r

Nothing actually implemented yet

@mpopov: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

mpopov changed the task status from Open to Stalled. May 16 2024, 5:55 PM

mpopov claimed this task.

mpopov moved this task from Backlog to Icebox on the Product-Analytics board.
