next cycle as far as possible.

The `scala.testing.Benchmark` trait is predefined in the Scala standard library and is designed with
the above in mind. Here is an example of benchmarking a map operation on a concurrent trie:

    import collection.parallel.mutable.ParCtrie
    import collection.parallel.ForkJoinTaskSupport

code gets optimized. Further, we can see that the benefit of hyperthreading is not high,
as going from `4` to `8` threads results only in a minor performance improvement.

## How big should a collection be to go parallel?

This is a commonly asked question. The answer is somewhat involved.

The size of the collection at which parallelization pays off really
depends on many factors. Some of them, but not all, include:
- Machine architecture. Different CPU types have different
performance and scalability characteristics. Orthogonal to that,
whether the machine is multicore or has multiple processors
communicating via the motherboard.
- JVM vendor and version. Different VMs apply different
optimizations to the code at runtime. They implement different memory
management and synchronization techniques. Some do not support
`ForkJoinPool`, reverting to `ThreadPoolExecutor`s, resulting in
more overhead.
- Per-element workload. A function or a predicate for a parallel
operation determines how big the per-element workload is. The
smaller the workload, the higher the number of elements needed to
gain speedups when running in parallel.
- Specific collection. For example, `ParArray` and
`ParCtrie` have splitters that traverse the collection at
different speeds, meaning there is more per-element work in just the
traversal itself.
- Specific operation. For example, `ParVector` is a lot slower for
transformer methods (like `filter`) than it is for accessor methods (like `foreach`).
- Side-effects. When modifying memory areas concurrently or using
synchronization within the body of `foreach`, `map`, etc.,
contention can occur, as the sketch after this list illustrates.
- Memory management. When allocating a lot of objects, a garbage
collection cycle can be triggered. Depending on how the references
to new objects are passed around, the GC cycle can take more or less time.

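To make the side-effect point concrete, here is a minimal sketch (not part of the original text; the object name `SideEffectSum` and the element count are arbitrary choices for illustration). It sums a parallel vector twice: once through a shared `AtomicLong` updated inside `foreach`, where every worker thread keeps hitting the same counter, and once through a side-effect-free `aggregate`, which sums each chunk independently and then combines the partial results:

    import java.util.concurrent.atomic.AtomicLong
    import collection.parallel.immutable.ParVector

    object SideEffectSum {
      def main(args: Array[String]) {
        val parvector = ParVector((0 until 1000000): _*)

        // Shared state: all worker threads update one atomic counter,
        // so they contend on it and scalability suffers.
        val acc = new AtomicLong(0)
        parvector foreach { x => acc.addAndGet(x) }

        // No shared state: each chunk is summed on its own and the
        // partial sums are combined afterwards.
        val sum = parvector.aggregate(0L)(_ + _, _ + _)

        println((acc.get, sum))
      }
    }

Both versions compute the same sum, but the contended version tends to scale worse as the number of threads grows, because every element update synchronizes on the single shared counter.
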
Even in isolation, it is not easy to reason about the factors above and
give a precise answer to what the collection size should be. To
roughly illustrate what the size should be, we give an example of measuring
the performance of a cheap, side-effect-free parallel vector reduce (in this case, sum)
operation on an i7 quad-core processor (not using
hyperthreading) on JDK7:

    import collection.parallel.immutable.ParVector

    object Reduce extends testing.Benchmark {
      // collection size and number of threads are read from JVM system properties
      val length = sys.props("length").toInt
      val par = sys.props("par").toInt
      val parvector = ParVector((0 until length): _*)

      // run the parallel operations on a fork/join pool with `par` worker threads
      parvector.tasksupport = new collection.parallel.ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(par))

      // the body of `run` is what the benchmark measures
      def run = {
        parvector reduce {
          (a, b) => a + b
        }
      }
    }

    object ReduceSeq extends testing.Benchmark {
      val length = sys.props("length").toInt
      val vector = collection.immutable.Vector((0 until length): _*)

      def run = {
        vector reduce {
          (a, b) => a + b
        }
      }
    }

We first run the benchmark with `250000` elements and obtain the
following results, for `1`, `2` and `4` threads:

    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=1 -Dlength=250000 Reduce 10 10
    Reduce$ 54 24 18 18 18 19 19 18 19 19
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=2 -Dlength=250000 Reduce 10 10
    Reduce$ 60 19 17 13 13 13 13 14 12 13
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=4 -Dlength=250000 Reduce 10 10
    Reduce$ 62 17 15 14 13 11 11 11 11 9

We then decrease the number of elements down to `120000` and use `4`
threads to compare the time to that of a sequential vector reduce:

    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=4 -Dlength=120000 Reduce 10 10
    Reduce$ 54 10 8 8 8 7 8 7 6 5
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dlength=120000 ReduceSeq 10 10
    ReduceSeq$ 31 7 8 8 7 7 7 8 7 8

`120000` elements seems to be around the threshold in this case.

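As a sketch of how such a measured threshold might be used in application code (the object `ThresholdSum` and the constant `parThreshold` are illustrative names, and the value is only meaningful for a machine and workload like the one benchmarked here):

    object ThresholdSum {
      // rough break-even point measured above; re-measure it for the
      // actual machine, JVM and per-element workload before relying on it
      val parThreshold = 120000

      def sum(v: Vector[Int]): Int =
        if (v.length >= parThreshold) v.par reduce { (a, b) => a + b }
        else v reduce { (a, b) => a + b }
    }

Below the threshold the sequential path avoids the scheduling overhead; above it the parallel path starts to pay off.
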
As another example, we take the `mutable.ParHashMap` and the `map`
method (a transformer method) and run the following benchmark in the same environment:

    import collection.parallel.mutable.ParHashMap

    object Map extends testing.Benchmark {
      val length = sys.props("length").toInt
      val par = sys.props("par").toInt
      val phm = ParHashMap((0 until length) zip (0 until length): _*)

      phm.tasksupport = new collection.parallel.ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(par))

      def run = {
        phm map {
          kv => kv
        }
      }
    }

    object MapSeq extends testing.Benchmark {
      val length = sys.props("length").toInt
      val hm = collection.mutable.HashMap((0 until length) zip (0 until length): _*)

      def run = {
        hm map {
          kv => kv
        }
      }
    }

For `120000` elements we get the following times when ranging the
number of threads from `1` to `4`:

    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=1 -Dlength=120000 Map 10 10
    Map$ 187 108 97 96 96 95 95 95 96 95
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=2 -Dlength=120000 Map 10 10
    Map$ 138 68 57 56 57 56 56 55 54 55
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=4 -Dlength=120000 Map 10 10
    Map$ 124 54 42 40 38 41 40 40 39 39

Now, if we reduce the number of elements to `15000` and compare that
to the sequential hashmap:

    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=1 -Dlength=15000 Map 10 10
    Map$ 41 13 10 10 10 9 9 9 10 9
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dpar=2 -Dlength=15000 Map 10 10
    Map$ 48 15 9 8 7 7 6 7 8 6
    java -server -cp .:../../build/pack/lib/scala-library.jar -Dlength=15000 MapSeq 10 10
    MapSeq$ 39 9 9 9 8 9 9 9 9 9

For this collection and this operation it makes sense
to go parallel when there are above `15000` elements (in general,
it is feasible to parallelize hashmaps and hashsets with fewer
elements than would be required for arrays or vectors).

## References

1. [Anatomy of a flawed microbenchmark, Brian Goetz][1]