port statistics and to_tsv.py to bigquery #715
loading into bigquery would also help with a recent request from @tantek: find the most webmentioned posts, both all time and ongoing. https://chat.indieweb.org/2017-08-16#t1502909939532000
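a rough sketch of what that query could look like once the data is loaded (assumes the datastore.Response table used below, whose repeated sent field holds the target URLs of delivered webmentions):

```python
from google.cloud import bigquery

client = bigquery.Client()

# one row per delivered webmention target; the posts that received the
# most webmentions sort to the top.
query = """
SELECT target, COUNT(*) AS webmentions
FROM datastore.Response, UNNEST(sent) AS target
GROUP BY target
ORDER BY webmentions DESC
LIMIT 20
"""
for row in client.query(query).result():
    print(row['target'], row['webmentions'])
```

adding a WHERE created >= ... filter would give the ongoing flavor instead of all time.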
i tried to load the datastore backups; two problems so far. first, the bigquery docs claim it supports wildcards in cloud storage URIs, e.g. …, but no luck. second, when i tried loading a single file (…), that failed too.
aha, i needed to load the kind's .backup_info metadata file, e.g. …
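for reference, a sketch of that load with the google-cloud-bigquery python client (bucket and file names are placeholders, not the real ones; the bq CLI equivalent is bq load --source_format=DATASTORE_BACKUP):

```python
from google.cloud import bigquery

client = bigquery.Client(project='brid-gy')

# datastore backups have their own source format, and the load has to
# point at the kind's .backup_info metadata file, not the raw output files.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP)

job = client.load_table_from_uri(
    'gs://my-backups/2017-11-19/xyz.Response.backup_info',  # placeholder URI
    'datastore.Response',  # dataset.table used by the queries below
    job_config=job_config)
job.result()  # block until the load finishes
```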
ok, i may or may not maintain it. we'll see.
loaded the rest of the data (accounts, publishes, etc.) into bigquery from an 11/19 snapshot, and started trying google data studio as a BI tool: https://datastudio.google.com/open/10lOfwjq6844GqRmqtfv-2nZc2F9BZyWq

overall, it's improved a lot since i tried it this past summer for indie map...but it's still not quite ready for prime time. i've sent the problems i hit as feedback and filed them in the issue tracker: …
next candidates: qlik, mode, and metabase.
so, qlik's bigquery connector requires their desktop app, and mode only had a free trial, not a permanent free tier, and my trial ended. so i'm now trying out local metabase. working pretty well so far.
spoke too soon. metabase has been giving me a hell of a time with bigquery views (e.g. a combined Sources view across all the different silo account types), legacy vs standard SQL (especially in views), and graphing multiple series on the same chart. next idea: use bigquery to populate the current spreadsheet, and keep using that to generate the graphs.
worked! the spreadsheet now has updated data and graphs. will document the new process soon. in the meantime, here's the monster bigquery query that populates the spreadsheet data:

```sql
WITH
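-- one row per calendar day, from bridgy's first month through the newest response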
days AS (SELECT * FROM UNNEST(
GENERATE_DATE_ARRAY('2013-12-01', (SELECT MAX(CAST(created AS date)) FROM datastore.Response))) AS day),
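-- new source accounts created per day, broken out by silo kind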
s as (SELECT cast(created as date) AS day, __key__.kind, count(*) as num FROM datastore.Sources GROUP BY day, kind),
bl as (SELECT * FROM s WHERE kind = 'Blogger'),
fb as (SELECT * FROM s WHERE kind = 'FacebookPage'),
fl as (SELECT * FROM s WHERE kind = 'Flickr'),
gp as (SELECT * FROM s WHERE kind = 'GooglePlusPage'),
ig as (SELECT * FROM s WHERE kind = 'Instagram'),
me as (SELECT * FROM s WHERE kind = 'Medium'),
tu as (SELECT * FROM s WHERE kind = 'Tumblr'),
tw as (SELECT * FROM s WHERE kind = 'Twitter'),
wp as (SELECT * FROM s WHERE kind = 'WordPress'),
resp as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.Response GROUP BY day),
bp as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.BlogPost GROUP BY day),
pub as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.Publish GROUP BY day),
bw as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.BlogWebmention GROUP BY day),
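-- Response and BlogPost carry the same per-webmention status arrays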
resps_and_bps as (
SELECT created, sent, unsent, error, failed, skipped FROM datastore.Response
UNION ALL
SELECT created, sent, unsent, error, failed, skipped FROM datastore.BlogPost
),
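-- daily webmention totals: successfully sent, plus all links in any status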
wms as (
SELECT cast(created as date) AS day, count(*) as num,
SUM(ARRAY_LENGTH(sent)) AS sent,
SUM(ARRAY_LENGTH(sent) + ARRAY_LENGTH(unsent) + ARRAY_LENGTH(error) + ARRAY_LENGTH(failed) + ARRAY_LENGTH(skipped)) AS links
FROM resps_and_bps GROUP BY day),
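-- daily counts of sources by enabled feature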
features as (
SELECT cast(created as date) AS day, feat, count(*) as num
FROM datastore.Sources s, s.features feat GROUP BY feat, day),
listen as (SELECT * FROM features WHERE feat = 'listen'),
publish as (SELECT * FROM features WHERE feat = 'publish'),
webmention as (SELECT * FROM features WHERE feat = 'webmention'),
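-- the first day each unique target domain received a sent webmention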
domains as (
SELECT DISTINCT NET.HOST(s) AS domain, MIN(CAST(rbs.created AS date)) AS day
FROM resps_and_bps rbs, rbs.sent s GROUP BY domain ORDER BY day),
domain_counts as (SELECT day, COUNT(*) AS num FROM domains GROUP BY day)
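-- the window below turns each daily count into a running, all-time total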
SELECT day,
SUM(bl.num) OVER win AS Blogger,
SUM(fb.num) OVER win AS FacebookPage,
SUM(fl.num) OVER win AS Flickr,
SUM(gp.num) OVER win AS GooglePlusPage,
SUM(ig.num) OVER win AS Instagram,
SUM(me.num) OVER win AS Medium,
SUM(tu.num) OVER win AS Tumblr,
SUM(tw.num) OVER win AS Twitter,
SUM(wp.num) OVER win AS WordPress,
SUM(resp.num) OVER win AS Response,
SUM(bp.num) OVER win AS BlogPost,
SUM(pub.num) OVER win AS Publish,
SUM(bw.num) OVER win AS BlogWebmention,
SUM(listen.num) OVER win AS f_listen,
SUM(publish.num) OVER win AS f_publish,
SUM(webmention.num) OVER win AS f_blog_webmention,
SUM(wms.links) OVER win AS links,
SUM(wms.sent + bp.num) OVER win AS webmentions,
SUM(domain_counts.num) OVER win AS domains
FROM days
LEFT JOIN bl USING(day)
LEFT JOIN fb USING(day)
LEFT JOIN fl USING(day)
LEFT JOIN gp USING(day)
LEFT JOIN ig USING(day)
LEFT JOIN me USING(day)
LEFT JOIN tu USING(day)
LEFT JOIN tw USING(day)
LEFT JOIN wp USING(day)
LEFT JOIN resp USING(day)
LEFT JOIN bp USING(day)
LEFT JOIN pub USING(day)
LEFT JOIN bw USING(day)
LEFT JOIN listen USING(day)
LEFT JOIN publish USING(day)
LEFT JOIN wms USING(day)
LEFT JOIN webmention USING(day)
LEFT JOIN domain_counts USING(day)
WINDOW win AS (ORDER BY day ROWS UNBOUNDED PRECEDING)
ORDER BY day
```
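and for illustration, roughly how the query output can land in the spreadsheet via the sheets v4 api (spreadsheet id and range are placeholders, STATS_QUERY stands for the monster query above, and the documented process may differ):

```python
from google.cloud import bigquery
from googleapiclient.discovery import build

bq = bigquery.Client()
rows = bq.query(STATS_QUERY).result()  # STATS_QUERY = the query above

# header row from the schema, then one stringified row per day
values = [[f.name for f in rows.schema]]
values += [['' if v is None else str(v) for v in row] for row in rows]

sheets = build('sheets', 'v4')  # uses application default credentials
sheets.spreadsheets().values().update(
    spreadsheetId='SPREADSHEET_ID',  # placeholder
    range='data!A1',                 # placeholder sheet and cell
    valueInputOption='RAW',
    body={'values': values},
).execute()
```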
we currently use to_tsv.py and a google spreadsheet to generate our periodic usage and growth stats, e.g. https://snarfed.org/2016-06-06_bridgy-stats-update-2 ...but we've outgrown it. we have ~25GB of data now, so just downloading it is prohibitive, and processing it afterward takes ~4GB of memory and over 6h on a 2014 MBP. ugh.
fortunately google has solved this for us. we can load the existing datastore backups into bigquery, then port to_tsv.py to a handful of SQL queries. we may even get better viz and charts for free. yahtzee!
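to make the port concrete, a minimal to_tsv.py replacement could look something like this (a sketch; the query shown is just the daily Response count, one of the handful we'd need, matching the resp CTE above):

```python
import csv
from google.cloud import bigquery

client = bigquery.Client()

# daily Response counts, the same aggregation the resp CTE above uses
rows = client.query("""
    SELECT CAST(created AS date) AS day, COUNT(*) AS num
    FROM datastore.Response
    GROUP BY day
    ORDER BY day
""").result()

# write a TSV the spreadsheet can import, in the spirit of to_tsv.py's output
with open('stats.tsv', 'w') as f:
    out = csv.writer(f, delimiter='\t')
    out.writerow([field.name for field in rows.schema])
    for row in rows:
        out.writerow(list(row))
```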