Merge branch 'master' into jkeiser/json-pointer

Commit dedf0c6 (JavaScriptExpert/simdjson), 2 parents: 0bcda5e + 5514ae3

File tree: 3 files changed, 46 additions, 7 deletions

README.md (22 additions, 4 deletions)

@@ -12,6 +12,7 @@ This library is part of the [Awesome Modern C++](https://awesomecpp.com) list.
 [![Build Status](https://cloud.drone.io/api/badges/simdjson/simdjson/status.svg)](https://cloud.drone.io/simdjson/simdjson)
 [![CircleCI](https://circleci.com/gh/simdjson/simdjson.svg?style=svg)](https://circleci.com/gh/simdjson/simdjson)
+[![Fuzzing Status](https://oss-fuzz-build-logs.storage.googleapis.com/badges/simdjson.svg)](https://bugs.chromium.org/p/oss-fuzz/issues/list?sort=-opened&q=proj%3Asimdjson&can=2)
 [![Build status](https://ci.appveyor.com/api/projects/status/ae77wp5v3lebmu6n/branch/master?svg=true)](https://ci.appveyor.com/project/lemire/simdjson-jmmti/branch/master)
 [![][license img]][license]

@@ -34,7 +35,7 @@ simdjson is easily consumable with a single .h and .cpp file.
      std::cout << tweets["search_metadata"]["count"] << " results." << std::endl;
    }
    ```
-3. `g++ -o parser parser.cpp` (or clang++)
+3. `c++ -o parser parser.cpp simdjson.cpp -std=c++17`
 4. `./parser`
    ```
    100 results.

@@ -63,6 +64,7 @@ QCon San Francisco 2019 (best voted talk):
 - [Microsoft FishStore](https://github.com/microsoft/FishStore)
 - [Yandex ClickHouse](https://github.com/yandex/ClickHouse)
 - [Clang Build Analyzer](https://github.com/aras-p/ClangBuildAnalyzer)
+- [azul](https://github.com/tudelft3d/azul)

 If you are planning to use simdjson in a product, please work from one of our releases.

@@ -143,11 +145,27 @@ The json stream parser is threaded, using exactly two threads.

 ## Large files

-If you are processing large files (e.g., 100 MB), it is likely that the performance of simdjson will be limited by page misses and/or page allocation. [On some systems, memory allocation runs far slower than we can parse (e.g., 1.4GB/s).](https://lemire.me/blog/2020/01/14/how-fast-can-you-allocate-a-large-block-of-memory-in-c/)
+If you are processing large files (e.g., 100 MB), it is possible that the performance of simdjson will be limited by page misses and/or page allocation. [On some systems, memory allocation runs far slower than we can parse (e.g., 1.4GB/s).](https://lemire.me/blog/2020/01/14/how-fast-can-you-allocate-a-large-block-of-memory-in-c/)

-You will get best performance with large or huge pages. Under Linux, you can enable transparent huge pages with a command like `echo always > /sys/kernel/mm/transparent_hugepage/enabled` (root access may be required). We recommend that you report performance numbers with and without huge pages.
+A viable strategy is to amortize the cost of page allocation by reusing the same `parser` object over several files:
+
+```C++
+// create one parser
+simdjson::document::parser parser;
+...
+// the parser is going to pay a memory allocation price
+auto [doc1, error1] = parser.parse(largestring1);
+...
+// use the same parser again; it will be faster
+auto [doc2, error2] = parser.parse(largestring2);
+...
+auto [doc3, error3] = parser.load("largefilename");
+```
+
+If you cannot reuse the same parser instance, maybe because your application just processes one large document once, you will get the best performance with large or huge pages. Under Linux, you can enable transparent huge pages with a command like `echo always > /sys/kernel/mm/transparent_hugepage/enabled` (root access may be required). It may be more difficult to achieve the same result under other systems like macOS or Windows.
+
+In general, when running benchmarks over large files, we recommend that you report performance numbers with and without huge pages if possible. Furthermore, you should amortize the parsing (e.g., by parsing several large files) to distinguish the time spent parsing from the time spent allocating memory.

-Another strategy is to reuse pre-allocated buffers. That is, you avoid reallocating memory. You just allocate memory once and reuse the blocks of memory.

 ## Including simdjson

fuzz/Fuzzing.md (9 additions, 3 deletions)

@@ -6,9 +6,15 @@
 - https://github.com/lemire/simdjson/issues/351
 - https://github.com/lemire/simdjson/issues/345

-Simdjson tries to follow [fuzzing best practises](https://google.github.io/oss-fuzz/advanced-topics/ideal-integration/#summary).
+The simdjson library tries to follow [fuzzing best practises](https://google.github.io/oss-fuzz/advanced-topics/ideal-integration/#summary).
+
+The simdjson library is continuously fuzzed on [oss-fuzz](https://github.com/google/oss-fuzz).
+
+## Currently open bugs
+
+You can find the currently open bugs, if any, at [bugs.chromium.org](https://bugs.chromium.org/p/oss-fuzz/issues/list?sort=-opened&q=proj%3Asimdjson&can=2); make sure not to miss the "Open Issues" selector. Bugs that are fixed by follow-up commits are automatically closed.

-Simdjson is continuously fuzzed on [oss-fuzz](https://github.com/google/oss-fuzz).

 ## Fuzzing as a CI job

@@ -29,7 +35,7 @@ The corpus will grow over time and easy to find bugs will be detected already du

 ## Corpus

-Simdjson does not benefit from a corpus as much as other projects, because the library is very fast and explores the input space very well. With that said, it is still beneficial to have one. The CI job stores the corpus on bintray between runs, and is available here: https://dl.bintray.com/pauldreik/simdjson-fuzz-corpus/corpus/corpus.tar
+The simdjson library does not benefit from a corpus as much as other projects, because the library is very fast and explores the input space very well. With that said, it is still beneficial to have one. The CI job stores the corpus on bintray between runs, and it is available here: https://dl.bintray.com/pauldreik/simdjson-fuzz-corpus/corpus/corpus.tar

 One can also grab the corpus as an artifact from the github actions job. Pick a run, then go to artifacts and download.

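For readers unfamiliar with the oss-fuzz integration referenced above, a harness in that style is a single entry point that must accept arbitrary bytes without crashing. Below is a minimal sketch of that shape; the `toy_is_balanced` target is a hypothetical stand-in of our own, not simdjson code (the library's real harnesses live under its `fuzz/` directory).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in target: any function that consumes untrusted bytes.
// Here: check that '['/'{' and ']'/'}' are balanced, treating the two
// bracket kinds interchangeably for simplicity.
static bool toy_is_balanced(const std::string &input) {
  int depth = 0;
  for (char c : input) {
    if (c == '[' || c == '{') depth++;
    if (c == ']' || c == '}') depth--;
    if (depth < 0) return false; // closed more than opened
  }
  return depth == 0;
}

// libFuzzer entry point: called repeatedly with mutated inputs.
// It must tolerate any byte sequence and always return 0.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  std::string input(reinterpret_cast<const char *>(data), size);
  toy_is_balanced(input); // the result itself does not matter to the fuzzer
  return 0;
}
```

With clang, such a file is typically built as `clang++ -g -fsanitize=address,fuzzer harness.cpp`; libFuzzer supplies `main` and drives the entry point.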

scripts/ruby/kostya_large.rb (15 additions, 0 deletions)

@@ -1,6 +1,21 @@
 #!/usr/bin/env ruby
 require 'json'

+y = []
+
+10000.times do
+  h = {
+    'x' => rand,
+    'y' => rand,
+    'z' => rand,
+    'name' => ('a'..'z').to_a.shuffle[0..5].join + ' ' + rand(10000).to_s,
+    'opts' => {'1' => [1, true]},
+  }
+  y << h
+end
+
+File.open("2.json", 'w') { |f| f.write JSON.pretty_generate('coordinates' => y, 'info' => "some info") }
+
 x = []

 524288.times do
0 commit comments
