Commit d10cb52
Merge pull request scala#61 from axel22/gh-pages
Benchmarking section of parallel collection overview.
2 parents 555e9b0 + 527afb2 commit d10cb52

1 file changed: overviews/parallel-collections/performance.md (+147 −1 lines)
@@ -68,7 +68,6 @@ next cycle as far as possible.
The `scala.testing.Benchmark` trait is predefined in the Scala standard library and is designed with
the above in mind. Here is an example of benchmarking a map operation on a concurrent trie:

    import collection.parallel.mutable.ParCtrie
    import collection.parallel.ForkJoinTaskSupport
@@ -111,6 +110,153 @@ code gets optimized. Further, we can see that the benefit of hyperthreading is not high,
as going from `4` to `8` threads results only in a minor performance improvement.

## How big should a collection be to go parallel?

This is a commonly asked question, and the answer is somewhat involved.

The size of the collection at which parallelization pays off really
depends on many factors. Some of them, but not all, include:

- Machine architecture. Different CPU types have different
  performance and scalability characteristics. Orthogonal to that,
  it also matters whether the machine is multicore or has multiple
  processors communicating via the motherboard.
- JVM vendor and version. Different VMs apply different
  optimizations to the code at runtime. They implement different memory
  management and synchronization techniques. Some do not support
  `ForkJoinPool`, reverting to `ThreadPoolExecutor`s, which results in
  more overhead.
- Per-element workload. The function or predicate for a parallel
  operation determines how big the per-element workload is. The
  smaller the workload, the higher the number of elements needed to
  gain speedups when running in parallel (a sketch illustrating this
  follows the list).
- Specific collection. For example, `ParArray` and
  `ParCtrie` have splitters that traverse the collection at
  different speeds, meaning there is more per-element work in just the
  traversal itself.
- Specific operation. For example, `ParVector` is a lot slower for
  transformer methods (like `filter`) than it is for accessor methods (like `foreach`).
- Side-effects. When modifying memory areas concurrently or using
  synchronization within the body of `foreach`, `map`, etc.,
  contention can occur.
- Memory management. When allocating a lot of objects, a garbage
  collection cycle can be triggered. Depending on how the references
  to new objects are passed around, the GC cycle can take more or less time.

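To make the per-element workload factor concrete, here is a minimal sketch in the same style as the benchmarks below. The `Workload` object, its `costly` flag, and the `expensive` function are our own illustration, not part of the overview; with `-Dcostly=true`, speedups should appear at much smaller values of `length` than with the cheap variant:

    import collection.parallel.immutable.ParVector

    object Workload extends testing.Benchmark {
      val length = sys.props("length").toInt
      val par = sys.props("par").toInt
      val costly = sys.props("costly").toBoolean
      val parvector = ParVector((0 until length): _*)

      parvector.tasksupport = new collection.parallel.ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(par))

      // hypothetical, deliberately expensive per-element function: the more
      // work each element costs, the fewer elements are needed before the
      // parallel version pays off
      def expensive(x: Int): Int = {
        var acc = x
        var i = 0
        while (i < 1000) { acc = acc * 31 + i; i += 1 }
        acc
      }

      def run = {
        if (costly) parvector map expensive
        else parvector map { x => x + 1 }
      }
    }

It can be invoked the same way as the `Reduce` benchmark below, with an extra `-Dcostly=true` or `-Dcostly=false` property.
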
Even in isolation, it is not easy to reason about the factors above and
give a precise answer to what the collection size should be. To
roughly illustrate what the size should be, we give an example of the
performance of a cheap, side-effect-free parallel vector reduce (in this
case, sum) operation on an i7 quad-core processor (not using
hyperthreading) on JDK7:

    import collection.parallel.immutable.ParVector

    object Reduce extends testing.Benchmark {
      // collection length and parallelism level are passed in as JVM system properties
      val length = sys.props("length").toInt
      val par = sys.props("par").toInt
      val parvector = ParVector((0 until length): _*)

      // set the parallelism level through a dedicated fork/join pool
      parvector.tasksupport = new collection.parallel.ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(par))

      def run = {
        parvector reduce {
          (a, b) => a + b
        }
      }
    }

    object ReduceSeq extends testing.Benchmark {
      val length = sys.props("length").toInt
      val vector = collection.immutable.Vector((0 until length): _*)

      def run = {
        vector reduce {
          (a, b) => a + b
        }
      }
    }

We first run the benchmark with `250000` elements and obtain the
following results, for `1`, `2` and `4` threads:

    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=1 -Dlength=250000 Reduce 10 10
    Reduce$    54    24    18    18    18    19    19    18    19    19
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=2 -Dlength=250000 Reduce 10 10
    Reduce$    60    19    17    13    13    13    13    14    12    13
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=4 -Dlength=250000 Reduce 10 10
    Reduce$    62    17    15    14    13    11    11    11    11    9

We then decrease the number of elements down to `120000` and use `4`
threads to compare the time to that of a sequential vector reduce:

    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=4 -Dlength=120000 Reduce 10 10
    Reduce$    54    10    8    8    8    7    8    7    6    5
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dlength=120000 ReduceSeq 10 10
    ReduceSeq$    31    7    8    8    7    7    7    8    7    8

`120000` elements seems to be around the threshold in this case.
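
A rough way to locate such a threshold without rerunning the benchmark by hand for every size is to sweep over sizes programmatically. The sketch below is our own addition (the `FindThreshold` object, its size list, and the best-of-20 wall-clock timing are assumptions, and far less rigorous than the warmed-up `testing.Benchmark` runs above), but it illustrates the idea:

    import collection.parallel.immutable.ParVector

    object FindThreshold extends App {
      // crude best-of-n wall-clock timer; prefer testing.Benchmark for
      // rigorous measurements
      def best(n: Int)(body: => Unit): Long = {
        def once(): Long = {
          val start = System.nanoTime
          body
          (System.nanoTime - start) / 1000 // microseconds
        }
        (1 to n).map(_ => once()).min
      }

      for (size <- List(30000, 60000, 120000, 250000)) {
        val vector    = collection.immutable.Vector((0 until size): _*)
        val parvector = ParVector((0 until size): _*)
        val seq = best(20) { vector reduce { (a, b) => a + b } }
        val par = best(20) { parvector reduce { (a, b) => a + b } }
        println(size + " elements: sequential " + seq + " us, parallel " + par + " us")
      }
    }
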
As another example, we take the `mutable.ParHashMap` and the `map`
method (a transformer method) and run the following benchmark in the same environment:

    import collection.parallel.mutable.ParHashMap

    object Map extends testing.Benchmark {
      val length = sys.props("length").toInt
      val par = sys.props("par").toInt
      val phm = ParHashMap((0 until length) zip (0 until length): _*)

      // set the parallelism level through a dedicated fork/join pool
      phm.tasksupport = new collection.parallel.ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(par))

      def run = {
        phm map {
          kv => kv
        }
      }
    }

    object MapSeq extends testing.Benchmark {
      val length = sys.props("length").toInt
      val hm = collection.mutable.HashMap((0 until length) zip (0 until length): _*)

      def run = {
        hm map {
          kv => kv
        }
      }
    }

For `120000` elements we get the following times when ranging the
number of threads from `1` to `4`:

    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=1 -Dlength=120000 Map 10 10
    Map$    187    108    97    96    96    95    95    95    96    95
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=2 -Dlength=120000 Map 10 10
    Map$    138    68    57    56    57    56    56    55    54    55
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=4 -Dlength=120000 Map 10 10
    Map$    124    54    42    40    38    41    40    40    39    39

Now, if we reduce the number of elements to `15000` and compare that
to the sequential hashmap:

    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=1 -Dlength=15000 Map 10 10
    Map$    41    13    10    10    10    9    9    9    10    9
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=2 -Dlength=15000 Map 10 10
    Map$    48    15    9    8    7    7    6    7    8    6
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dlength=15000 MapSeq 10 10
    MapSeq$    39    9    9    9    8    9    9    9    9    9

For this collection and this operation, it makes sense
to go parallel when there are above `15000` elements (in general,
it is feasible to parallelize hashmaps and hashsets with fewer
elements than would be required for arrays or vectors).

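Finally, a brief illustration of the side-effects factor from the list above. This sketch is our own (the `SideEffects` object and its three variants are assumptions, not code from the overview); it shows why accumulating into shared state from a parallel operation both skews benchmark results and risks incorrect answers, while the side-effect-free `reduce` form used throughout this section avoids contention altogether:

    import collection.parallel.immutable.ParVector
    import java.util.concurrent.atomic.AtomicLong

    object SideEffects extends App {
      val parvector = ParVector((0 until 10000): _*)

      // broken: unsynchronized writes to shared state from multiple
      // threads can be lost, so the result is nondeterministic
      var unsafeSum = 0
      parvector foreach { x => unsafeSum += x }

      // correct but contended: every element update goes through the
      // same atomic variable, which erodes the parallel speedup
      val atomicSum = new AtomicLong(0)
      parvector foreach { x => atomicSum.addAndGet(x) }

      // side-effect-free: each worker reduces its own chunk and only
      // the per-chunk results are combined, so there is no contention
      val sum = parvector reduce { (a, b) => a + b }

      println("unsafe: " + unsafeSum + ", atomic: " + atomicSum.get + ", reduce: " + sum)
    }
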
## References

1. [Anatomy of a flawed microbenchmark, Brian Goetz][1]
