The jsonstats utility becomes ever more powerful. by lemire · Pull Request #890 · simdjson/simdjson · GitHub

The jsonstats utility becomes ever more powerful. #890

Merged 1 commit on May 19, 2020

lemire (Member) commented on May 18, 2020

We now record how many bytes are used by repeated keys, that is, keys that appear more than once. Many keys appear only once, and we do not want to count those.
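To make the idea concrete, here is a minimal Python sketch of the two statistics. It is not the jsonstats implementation (which is part of simdjson and written in C++). The function names are hypothetical, and the byte-counting rule is an assumption: the sketch charges the UTF-8 length of a repeated key for every occurrence after the first.

```python
import json
from collections import Counter


def key_counts(doc):
    """Count how often each object key occurs anywhere in a parsed JSON value."""
    counts = Counter()
    stack = [doc]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            counts.update(node.keys())
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return counts


def repeated_key_stats(text):
    """Return (distinct repeated keys, bytes attributed to repeated keys)."""
    counts = key_counts(json.loads(text))
    repeated = {key: n for key, n in counts.items() if n > 1}
    # Assumption: only occurrences after the first are charged.
    byte_count = sum(len(key.encode("utf-8")) * (n - 1)
                     for key, n in repeated.items())
    return len(repeated), byte_count


sample = '{"type":"A","items":[{"type":"B","id":1},{"type":"C","id":2}]}'
print(repeated_key_stats(sample))  # prints (2, 10)
```

In the sample, "type" occurs three times and "id" twice, while "items" occurs once and is ignored. Note that `json.loads` keeps only the last of several identical keys inside a single object, so this sketch only sees repetition across objects.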

First observation: all of our test files have fewer than 128 distinct repeated keys:

$ for i in ../jsonexamples/*.json; do echo $i; ./tools/jsonstats  $i|grep repeated_key_distin; done
../jsonexamples/apache_builds.json
      "repeated_key_distinct_count"=          3,
../jsonexamples/canada.json
      "repeated_key_distinct_count"=          1,
../jsonexamples/citm_catalog.json
      "repeated_key_distinct_count"=         24,
../jsonexamples/github_events.json
      "repeated_key_distinct_count"=        114,
../jsonexamples/google_maps_api_compact_response.json
      "repeated_key_distinct_count"=          6,
../jsonexamples/google_maps_api_response.json
      "repeated_key_distinct_count"=          6,
../jsonexamples/gsoc-2018.json
      "repeated_key_distinct_count"=          9,
../jsonexamples/instruments.json
      "repeated_key_distinct_count"=         61,
../jsonexamples/marine_ik.json
      "repeated_key_distinct_count"=         23,
../jsonexamples/mesh.json
      "repeated_key_distinct_count"=          0,
../jsonexamples/mesh.pretty.json
      "repeated_key_distinct_count"=          0,
../jsonexamples/numbers.json
      "repeated_key_distinct_count"=          0,
../jsonexamples/random.json
      "repeated_key_distinct_count"=         11,
../jsonexamples/repeat.json
      "repeated_key_distinct_count"=          2,
../jsonexamples/tree-pretty.json
      "repeated_key_distinct_count"=         40,
../jsonexamples/twitter.json
      "repeated_key_distinct_count"=         83,
../jsonexamples/twitter10.json
      "repeated_key_distinct_count"=         94,
../jsonexamples/twitter_api_compact_response.json
      "repeated_key_distinct_count"=         58,
../jsonexamples/twitter_api_response.json
      "repeated_key_distinct_count"=         69,
../jsonexamples/twitter_timeline.json
      "repeated_key_distinct_count"=         65,
../jsonexamples/twitterescaped.json
      "repeated_key_distinct_count"=         83,
../jsonexamples/update-center.json
      "repeated_key_distinct_count"=         22,

Yet in many cases these few keys account for much of the total string volume. In citm_catalog.json, for example, repeated keys take up 202039 of 221379 string bytes (about 91%):

$ for i in ../jsonexamples/*.json; do echo $i; ./tools/jsonstats  $i|egrep "(repeated_key_byte_count|string_byte_count)"; done
../jsonexamples/apache_builds.json
      "string_byte_count"        =      76964,
      "repeated_key_byte_count"  =      10523;
../jsonexamples/canada.json
      "string_byte_count"        =         90,
      "repeated_key_byte_count"  =          8;
../jsonexamples/citm_catalog.json
      "string_byte_count"        =     221379,
      "repeated_key_byte_count"  =     202039;
../jsonexamples/github_events.json
      "string_byte_count"        =      45778,
      "repeated_key_byte_count"  =       6861;
../jsonexamples/google_maps_api_compact_response.json
      "string_byte_count"        =       6760,
      "repeated_key_byte_count"  =       4047;
../jsonexamples/google_maps_api_response.json
      "string_byte_count"        =       6760,
      "repeated_key_byte_count"  =       4047;
../jsonexamples/gsoc-2018.json
      "string_byte_count"        =    2945815,
      "repeated_key_byte_count"  =     128855;
../jsonexamples/instruments.json
      "string_byte_count"        =      69760,
      "repeated_key_byte_count"  =      67944;
../jsonexamples/marine_ik.json
      "string_byte_count"        =     126909,
      "repeated_key_byte_count"  =     125312;
../jsonexamples/mesh.json
      "string_byte_count"        =         92,
      "repeated_key_byte_count"  =          0;
../jsonexamples/mesh.pretty.json
      "string_byte_count"        =         92,
      "repeated_key_byte_count"  =          0;
../jsonexamples/numbers.json
      "string_byte_count"        =          0,
      "repeated_key_byte_count"  =          0;
../jsonexamples/random.json
      "string_byte_count"        =     334043,
      "repeated_key_byte_count"  =      90944;
../jsonexamples/repeat.json
      "string_byte_count"        =       3299,
      "repeated_key_byte_count"  =        596;
../jsonexamples/tree-pretty.json
      "string_byte_count"        =       8456,
      "repeated_key_byte_count"  =       6221;
../jsonexamples/twitter.json
      "string_byte_count"        =     367917,
      "repeated_key_byte_count"  =     166051;
../jsonexamples/twitter10.json
      "string_byte_count"        =    3679170,
      "repeated_key_byte_count"  =    1670860;
../jsonexamples/twitter_api_compact_response.json
      "string_byte_count"        =       7894,
      "repeated_key_byte_count"  =       2738;
../jsonexamples/twitter_api_response.json
      "string_byte_count"        =       8564,
      "repeated_key_byte_count"  =       3251;
../jsonexamples/twitter_timeline.json
      "string_byte_count"        =      31116,
      "repeated_key_byte_count"  =      15948;
../jsonexamples/twitterescaped.json
      "string_byte_count"        =     367917,
      "repeated_key_byte_count"  =     166051;
../jsonexamples/update-center.json
      "string_byte_count"        =     440851,
      "repeated_key_byte_count"  =     106920;
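The share of string bytes taken up by repeated keys can be computed directly from the figures above. The following Python sketch does so for a few of the files; the helper name `repeated_share` is hypothetical, and the numbers are copied from the output listed here.

```python
def repeated_share(string_bytes, repeated_bytes):
    """Fraction of string bytes attributed to repeated keys (0.0 if no strings)."""
    if string_bytes == 0:
        return 0.0
    return repeated_bytes / string_bytes


# (string_byte_count, repeated_key_byte_count) taken from the jsonstats output.
figures = {
    "citm_catalog.json": (221379, 202039),
    "gsoc-2018.json": (2945815, 128855),
    "numbers.json": (0, 0),
    "twitter.json": (367917, 166051),
}

for name, (string_bytes, repeated_bytes) in figures.items():
    print(f"{name}: {repeated_share(string_bytes, repeated_bytes):.1%}")
```

The zero check matters for files such as numbers.json, which contain no strings at all.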

@lemire lemire merged commit f346362 into master May 19, 2020
@lemire lemire deleted the dlemire/evenbetterjsonstats branch May 19, 2020 14:31