-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Hash table getOrElseUpdate and indexing 2.12 backport [ci: last-only] #5566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
protected final def improve(hcode: Int, seed: Int): Int = { | ||
val hash = scala.util.hashing.byteswap32(hcode) | ||
val shift = seed & ((1 << 5) - 1) | ||
(hash >>> shift) | (hash << (32 - shift)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed :)
4dab23f
to
8ed464a
Compare
/rebuild |
val shifted = (improved >> (32 - java.lang.Integer.bitCount(ones))) & ones | ||
shifted | ||
val exponent = Integer.numberOfLeadingZeros(ones) | ||
(improve(hcode, seedvalue) >>> exponent) & ones |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I realize: since ones
is of the form 00...0011...11
, I do not think there is any value of x
for which (x >>> exponent) & ones
is different from (x >>> exponent)
. It seems to me that & ones
is redundant and can be dropped. That's because & ones
masks off the exponent
most significant bits of (x >>> exponent)
, which by construction already has 0
's in all its exponent
most significant bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that also (though I thought I still have to zero out the highest bit), but haven't validated my find yet.
Thanks, lemme' recheck :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Integer.MAX_VALUE >>> 32 & 0) != (Integer.MAX_VALUE >>> 32)
i.e. if the length is 1
, ones
will be 0
, exponent
will be 32
, which would produce the same shift as 0
(i.e. 123 >>> 32
== 123
) ... other than that (i.e. if length > 1
) this seems to be true indeed.
Should I simply assert that the length >= 2
at the beginning?
Edit:
protected final def index(hcode: Int): Int = {
assert(table.length >= 2)
improve(hcode, seedvalue) >>> numberOfLeadingZeros(table.length - 1)
}
or simply
protected final def index(hcode: Int): Int = improve(hcode, seedvalue) >>> numberOfLeadingZeros(table.length - 1)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good observation!
But I don't think we need to care. If length == 1
, we do not care about the quality of hashes by definition, since everything enters the same bucket anyway. So this does not seem to be an issue. It probably deserves a comment, though.
Should I simply assert that the length >= 2 at the beginning?
No, assert
would kill all the performance of this performance-sensitive method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I'm not paying enough attention tonight, it seems. This is the index
method, which needs to return 0
for correctness when length == 1
(assuming that ever happens).
I don't know whether it is reasonable to assume that length
is never 1. If it is, then we can drop & ones
. If it isn't, we need to keep the code as it currently is, but then it would probably be a good thing to add a comment that & ones
is there specifically to handle length == 1
, in case someone wonders in the future.
8ed464a
to
cd0a77b
Compare
val exponent = Integer.numberOfLeadingZeros(ones) | ||
(improve(hcode, seedvalue) >>> exponent) & ones | ||
} | ||
protected final def index(hcode: Int): Int = if (table.length == 1) 0 else improve(hcode, seedvalue) >>> numberOfLeadingZeros(table.length - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sjrd, thanks for the comments, is this too awkward?
Branch prediction should eliminate this fork.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. At this point we're past my understanding of JVM performance. I would expect & ones
(one CPU cycle with operands already loaded in registers so no stalling due to memory) to always be faster than a branch, even if correctly predicted and/or using a conditional move, but I cannot assert that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instruction Level Parallelism can achieve > 1 instruction/clock (IPC), so a predictable branch guarding an operation might in fact be faster than a branchless algorithm with a data dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @retronym, and also, the JVM can detect that the first branch is never taken and eliminate it completely :)
99a46f9
to
cd0a77b
Compare
@szeiger, what should I do to avoid all these build failures, I can't even check the source of the |
@paplorinc |
You can open the corresponding "validate-main" build on jenkins, log in, and click "Rebuild" |
cd0a77b
to
347b9db
Compare
Thanks @szeiger, squashed it :) |
uh, that's not very nice :) i knew that jenkins control is only available to a few, but causing 404 for others is rather unfriendly.. does it help to access https://scala-ci.typesafe.com/logout ? |
a906ecd
to
04259f5
Compare
Great, thanks! My comment was meant to indicate I did a quick check on this PR, making sure that the backport indeed got these files up to date with 2.12.x. |
For your next backport, I'd suggest using git cherry-pick. For this PR, the command would have been |
(cherry picked from commit 9a24860)
(`ops/s`, smaller is better) `Before (9c5d3f8)`: ```scala [info] # Run complete. Total time: 00:08:15 [info] [info] Benchmark (size) Mode Cnt Score Error Units [info] s.c.immutable.VectorMapBenchmark.groupBy 10 avgt 20 645.594 ± 9.435 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 100 avgt 20 2084.216 ± 37.814 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 1000 avgt 20 19878.481 ± 262.404 ns/op [info] s.c.mutable.HashMapBenchmark.get 10 avgt 20 689.941 ± 5.850 ns/op [info] s.c.mutable.HashMapBenchmark.get 100 avgt 20 7357.330 ± 45.956 ns/op [info] s.c.mutable.HashMapBenchmark.get 1000 avgt 20 95767.200 ± 1550.771 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 10 avgt 20 509.181 ± 2.683 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 100 avgt 20 5563.301 ± 32.335 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 1000 avgt 20 71965.365 ± 1809.738 ns/op [info] s.c.mutable.HashMapBenchmark.put 10 avgt 20 247.270 ± 3.972 ns/op [info] s.c.mutable.HashMapBenchmark.put 100 avgt 20 5646.185 ± 106.172 ns/op [info] s.c.mutable.HashMapBenchmark.put 1000 avgt 20 81303.663 ± 954.938 ns/op ``` `Changed modulo to bitwise and in hash calculation (4c729fe)`: ```scala [info] Benchmark (size) Mode Cnt Score Error Units [info] s.c.immutable.VectorMapBenchmark.groupBy 10 avgt 20 631.291 ± 9.269 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 100 avgt 20 2077.885 ± 59.737 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 1000 avgt 20 15458.278 ± 317.347 ns/op [info] s.c.mutable.HashMapBenchmark.get 10 avgt 20 678.013 ± 4.453 ns/op [info] s.c.mutable.HashMapBenchmark.get 100 avgt 20 7258.522 ± 76.088 ns/op [info] s.c.mutable.HashMapBenchmark.get 1000 avgt 20 94748.845 ± 1226.120 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 10 avgt 20 498.042 ± 5.006 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 100 avgt 20 5243.154 ± 110.372 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 1000 avgt 20 68194.752 ± 655.436 ns/op [info] s.c.mutable.HashMapBenchmark.put 10 avgt 20 257.275 ± 1.411 ns/op [info] s.c.mutable.HashMapBenchmark.put 100 avgt 20 5318.532 ± 152.923 ns/op [info] s.c.mutable.HashMapBenchmark.put 1000 avgt 20 79607.160 ± 651.779 ns/op ``` `Optimized HashTable.index (6cc1504)`: ```scala [info] Benchmark (size) Mode Cnt Score Error Units [info] s.c.immutable.VectorMapBenchmark.groupBy 10 avgt 20 616.164 ± 4.712 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 100 avgt 20 2034.447 ± 14.495 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 1000 avgt 20 14712.164 ± 119.983 ns/op [info] s.c.mutable.HashMapBenchmark.get 10 avgt 20 679.046 ± 6.872 ns/op [info] s.c.mutable.HashMapBenchmark.get 100 avgt 20 7242.097 ± 41.244 ns/op [info] s.c.mutable.HashMapBenchmark.get 1000 avgt 20 95342.919 ± 1521.328 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 10 avgt 20 488.034 ± 4.554 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 100 avgt 20 4883.123 ± 59.268 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 1000 avgt 20 65174.034 ± 496.759 ns/op [info] s.c.mutable.HashMapBenchmark.put 10 avgt 20 267.983 ± 1.797 ns/op [info] s.c.mutable.HashMapBenchmark.put 100 avgt 20 5097.351 ± 104.538 ns/op [info] s.c.mutable.HashMapBenchmark.put 1000 avgt 20 78772.540 ± 543.935 ns/op ``` Summary, i.e. the effect of this PR, according to the benchmarks: * `groupBy` has a `~35%` speedup * `get` didn't change * `getOrElseUpdate` has a `~10%` speedup * `put` has a `~3%` speedup Note: caching the `exponent` to a local private field (`Byte` or `Int`) didn't have any performance advantage (only a minor slowdown was measured, possibly because it's accessed via an interface now) (cherry picked from commit a501444)
(cherry picked from commit 7952525)
(cherry picked from commit bc91223)
Fixes https://issues.scala-lang.org/browse/SI-10049 Since `groupBy` uses this method extensively and suffered a measurable slowdown in `2.12.0`, this modification restores (and exceeds) its original speed. --- included benchmarks: (`ns/op` → smaller is better) `before (2.12.0):` ```java Benchmark (size) Mode Cnt Score Error Units s.c.immutable.VectorMapBenchmark.groupBy 10 avgt 20 865.693 ± 7.869 ns/op s.c.immutable.VectorMapBenchmark.groupBy 100 avgt 20 3095.657 ± 56.438 ns/op s.c.immutable.VectorMapBenchmark.groupBy 1000 avgt 20 28247.005 ± 470.513 ns/op s.c.mutable.HashMapBenchmark.get 10 avgt 20 679.448 ± 11.809 ns/op s.c.mutable.HashMapBenchmark.get 100 avgt 20 7240.178 ± 61.734 ns/op s.c.mutable.HashMapBenchmark.get 1000 avgt 20 95725.127 ± 2373.458 ns/op s.c.mutable.HashMapBenchmark.getOrElseUpdate 10 avgt 20 836.561 ± 20.085 ns/op s.c.mutable.HashMapBenchmark.getOrElseUpdate 100 avgt 20 7891.368 ± 56.808 ns/op s.c.mutable.HashMapBenchmark.getOrElseUpdate 1000 avgt 20 97478.629 ± 1782.497 ns/op s.c.mutable.HashMapBenchmark.put 10 avgt 20 243.422 ± 2.915 ns/op s.c.mutable.HashMapBenchmark.put 100 avgt 20 5810.927 ± 60.054 ns/op s.c.mutable.HashMapBenchmark.put 1000 avgt 20 82175.539 ± 1690.296 ns/op ``` `after:` ```java Benchmark (size) Mode Cnt Score Error Units s.c.immutable.VectorMapBenchmark.groupBy 10 avgt 20 627.007 ± 9.718 ns/op s.c.immutable.VectorMapBenchmark.groupBy 100 avgt 20 2086.955 ± 19.042 ns/op s.c.immutable.VectorMapBenchmark.groupBy 1000 avgt 20 19515.234 ± 173.647 ns/op s.c.mutable.HashMapBenchmark.get 10 avgt 20 683.977 ± 11.843 ns/op s.c.mutable.HashMapBenchmark.get 100 avgt 20 7345.675 ± 41.092 ns/op s.c.mutable.HashMapBenchmark.get 1000 avgt 20 95085.926 ± 1702.997 ns/op s.c.mutable.HashMapBenchmark.getOrElseUpdate 10 avgt 20 503.208 ± 2.643 ns/op s.c.mutable.HashMapBenchmark.getOrElseUpdate 100 avgt 20 5526.483 ± 28.262 ns/op s.c.mutable.HashMapBenchmark.getOrElseUpdate 1000 avgt 20 69265.900 ± 674.958 ns/op s.c.mutable.HashMapBenchmark.put 10 avgt 20 252.481 ± 7.597 ns/op s.c.mutable.HashMapBenchmark.put 100 avgt 20 5708.034 ± 110.360 ns/op s.c.mutable.HashMapBenchmark.put 1000 avgt 20 82051.378 ± 1432.009 ns/op ``` i.e. for the given benchmark conditions `~40%` faster `groupBy` and `getOrElseUpdate` (cherry picked from commit b67ca7d)
(cherry picked from commit 26c87f1)
04259f5
to
1ee7d21
Compare
Thanks! Also appreciate your patience with that sneaky genLoadIdent bug that's been plaguing the 2.11.x PR queue :-/ We've spent a lot of time chasing it already. I'm sure @retronym will corner it soon. |
Since these are all backports, I'll add the [ci: last-only] marker so that only the last commit needs to pass the CI. |
/rebuild |
@SethTisue, thanks for your patience, this build seems to fail again for some weird reason too: https://scala-ci.typesafe.com/job/scala-2.11.x-validate-publish-core/2978/console |
/rebuild |
(The weird compiler crash is tracked as scala/scala-dev#251) |
Rebased in #5626 |
Backported by recommendation from @Ichoran and @retronym: #5528 (comment)