wmfdata-r v2 should mainly be a wrapper for wmfdata-py
Open, Stalled, Low, Public

Description

wmfdata-py is currently the best way to query our production data sources for data analysis. It has excellent support for

  • Hive via PyHive, whereas wmfdata-r still wraps the hive CLI and suffers from the same problems that wmfdata-py had prior to the switch (T275233)
  • Spark via PySpark, whereas our configuration prevents us from using sparklyr
  • Presto via presto-python-client, whereas RPresto's support for Kerberos-ized setups appears to be ¯\_(ツ)_/¯

A new version of wmfdata-r is needed, one that is just a wrapper for wmfdata-py's database-accessing functions (via reticulate).
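
As a rough illustration of the "just a wrapper" idea, here is a minimal Python sketch of the delegation pattern: one entry point that holds no query logic and hands the SQL to whichever backend runner was registered. The names make_query and demo_backends are hypothetical and exist only for this example. The stand-in runners take the place of wmfdata-py's functions (assumed here to be wmfdata.hive.run, wmfdata.spark.run and wmfdata.presto.run; check the installed version). On the R side, the same shape would presumably be reached by importing the Python module with reticulate::import("wmfdata").

```python
from typing import Any, Callable, Mapping

Runner = Callable[[str], Any]


def make_query(backends: Mapping[str, Runner]) -> Callable[..., Any]:
    """Return one query(sql, source) entry point that delegates to a backend runner."""

    def query(sql: str, source: str = "hive") -> Any:
        if source not in backends:
            known = ", ".join(sorted(backends))
            raise ValueError(f"unknown source {source!r}; expected one of: {known}")
        # No query logic lives in the wrapper; the backend does all the work.
        return backends[source](sql)

    return query


# Stand-in runners so the sketch runs anywhere. In a real session these would
# be the wmfdata-py functions (ASSUMED to be wmfdata.hive.run,
# wmfdata.spark.run and wmfdata.presto.run).
demo_backends = {
    "hive": lambda sql: {"engine": "hive", "sql": sql},
    "presto": lambda sql: {"engine": "presto", "sql": sql},
}

query = make_query(demo_backends)
result = query("SELECT 1", source="presto")
print(result["engine"])
```

Because the wrapper only routes calls, fixes to connection handling or authentication would stay in the Python codebase, which is the maintenance argument this task makes.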

Not only is wmfdata-r badly outdated and limited in its ability to access production data sources, it has also become bloated: miscellaneous functions (e.g. sample size calculations for the χ² test, Wikimedia color palettes) were added over time and should be factored out into a separate package or dropped entirely.

Event Timeline

While it would be a significant initial investment, long-term maintenance would be minimal, since the majority of the maintenance burden would fall on the underlying Python codebase.

ldelench_wmf moved this task from Triage to Backlog on the Product-Analytics board.

Repo with package skeleton created at https://gitlab.wikimedia.org/repos/product-analytics/wmfdata-r

Nothing is actually implemented yet.

Aklapper subscribed.

@mpopov: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!

mpopov changed the task status from Open to Stalled. May 16 2024, 5:55 PM
mpopov claimed this task.
mpopov moved this task from Backlog to Icebox on the Product-Analytics board.