wmfdata-py is currently the best way to query our production data sources for data analysis. It has excellent support for:
- Hive via PyHive, whereas wmfdata-r still wraps the Hive CLI and suffers from the same problems wmfdata-py had prior to the switch (T275233)
- Spark via PySpark, whereas our configuration prevents us from using sparklyr on the R side
- Presto via presto-python-client, whereas RPresto's support for Kerberized setups is unclear
A new version of wmfdata-r is needed, one that is just a wrapper for wmfdata-py's database-accessing functions (via reticulate).
Not only is wmfdata-r vastly outdated and limited in its ability to access production data sources, it has also become bloated: miscellaneous functions (e.g. sample size calculations for the χ² test, Wikimedia color palettes) have been added over time and should be factored out into a separate package or dropped entirely.
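A rough sketch of how thin the delegation layer could be, written in Python so it shows the entry point an R wrapper would reach through reticulate. It assumes that each backend lives in a submodule of wmfdata-py (`hive`, `spark`, `presto`) exposing a `run(query)` function; the `run_query` name and its `package` parameter are hypothetical and exist only for illustration.

```python
import importlib

# Assumption: wmfdata-py exposes one submodule per backend, each with run(query).
BACKENDS = ("hive", "spark", "presto")


def run_query(query, engine="presto", package="wmfdata"):
    """Forward a SQL query to the chosen backend and return its result.

    `run_query` and `package` are hypothetical names for this sketch;
    `package` is only a parameter so the backend can be swapped in tests.
    """
    if engine not in BACKENDS:
        raise ValueError(
            f"Unknown engine {engine!r}; expected one of {', '.join(BACKENDS)}"
        )
    # Import lazily so that merely loading the wrapper does not require
    # every backend's dependencies to be installed.
    backend = importlib.import_module(f"{package}.{engine}")
    return backend.run(query)
```

On the R side, the equivalent would presumably be little more than loading the Python package with reticulate's `import()` and calling the same backend function, leaving authentication and connection handling to wmfdata-py.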