forked from scala-js/scala-js
[pull] main from scala-js:main #55
Merged
Previously, we used several typed arrays pointing to the same `ArrayBuffer`. Now, we use a single `DataView`. Since we can (must) force a specific endianness with `DataView`, we get rid of the `highOffset` and `lowOffset` fields, which further streamlines the code. The assembly produced by V8 is now better. Since it knows that it manipulates a single buffer, it needs fewer loads and type tests internally. I expect the same would be true for other optimizing engines.
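A minimal JavaScript sketch of the technique described above (the names are illustrative, not the actual Scala.js internals): one scratch `DataView` with an explicitly forced endianness replaces several typed-array views, so no `highOffset`/`lowOffset` bookkeeping is needed.

```javascript
// One scratch DataView over a single ArrayBuffer, instead of several
// typed arrays aliasing the same buffer. Names here are hypothetical.
const scratch = new DataView(new ArrayBuffer(8));

// Split a Double into (hi, lo) 32-bit halves. The `true` flag forces
// little-endian byte order, so the offsets 0 and 4 are fixed.
function doubleToBits(d) {
  scratch.setFloat64(0, d, true);
  return {
    lo: scratch.getInt32(0, true),
    hi: scratch.getInt32(4, true),
  };
}

function bitsToDouble(hi, lo) {
  scratch.setInt32(0, lo, true);
  scratch.setInt32(4, hi, true);
  return scratch.getFloat64(0, true);
}
```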
The compilation scheme for `finally` is a beast on Wasm. It's good to have direct feedback on it without having to wait for the full partest.
As well as for the locals of `Labeled` blocks whose `Return`s cross a `try..finally` boundary. If their original type was not defaultable, we cast away nullability when reading them back.
Fix #5165: Use defaultable types for the locals of TryFinallys.
That allows the IR cleaner to see the JS type definitions from the library when it appears as a jar on the classpath. This is the case for the linker private library. Adding that support will allow the linker private library to use JS type definitions from the Scala.js library, as long as the references can be erased away.
Previously, our floating point bit manipulation methods were implemented in user space. This commit instead introduces dedicated IR `UnaryOp`s for them.

On Wasm, the implementations are straightforward opcodes. In JavaScript, we use the same strategies as before, but moved to the linker world. In most cases, we use a unique scratch `DataView` to perform the manipulations. It is now allocated as a `globalVar` in the emitter, rather than in user space. The functions for `Float`/`Int` conversions are straightforward, as are those for `Double`/`Long` when we use `bigint`s for `Long`s. When we use `RuntimeLong`s, the conversions are implemented in `RuntimeLong`, but require the scratch `DataView` as an argument, which the linker injects.

This new strategy brings several benefits:

* For JavaScript, the generated code is essentially optimal now. It produces tight sequences of assembly instructions that do not allocate anything.
* For Wasm, the implementation does not rely on JS interop anymore, even when the optimizer does not run. This property will be required when we target Wasm outside of a JS host.
* We open a path to adding support for the "raw" variants `floatToRawIntBits`/`doubleToRawLongBits` in Wasm in the future, should we need it.
Introduce IR UnaryOps for floating point bit manipulation.
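A hedged sketch of what the JS lowering of these UnaryOps can look like, using a single shared scratch `DataView` (the actual emitter allocates it as a `globalVar`; the function names here are hypothetical):

```javascript
// Single scratch DataView, shared by all bit-manipulation lowerings.
const scratchView = new DataView(new ArrayBuffer(8));

// Float/Int conversions: write as one type, read back as the other,
// with a forced little-endian byte order.
function floatToIntBits(f) {
  scratchView.setFloat32(0, f, true);
  return scratchView.getInt32(0, true);
}

function intBitsToFloat(bits) {
  scratchView.setInt32(0, bits, true);
  return scratchView.getFloat32(0, true);
}

// Double/Long conversion, straightforward when Longs are bigints:
function doubleToLongBits(d) {
  scratchView.setFloat64(0, d, true);
  return scratchView.getBigInt64(0, true);
}
```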
This allows us to share more of their implementation with the signed divisions. Moreover, our three implementation strategies (JS with `RuntimeLong`, JS with `bigint`, and Wasm) each have a different efficient implementation of those operations. Using IR BinaryOps for them lets each backend pick the most appropriate implementation, while letting the optimizer generically reason about their mathematical properties.
Introduce IR BinaryOps for unsigned division and remainder.
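For the JS backend, one plausible lowering of 32-bit unsigned division and remainder (a sketch, not necessarily the exact emitted code) reinterprets both operands as unsigned with `>>> 0`, applies the ordinary operators, and truncates back to Int32:

```javascript
// Unsigned 32-bit division: `x >>> 0` reads the Int32 bit pattern as
// an unsigned value in [0, 2^32), and `| 0` folds the result back
// into Int32 range.
function divUnsigned(a, b) {
  return ((a >>> 0) / (b >>> 0)) | 0;
}

function remUnsigned(a, b) {
  return ((a >>> 0) % (b >>> 0)) | 0;
}
```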
This way, we only have `require-jdk*` directories for JDK versions that we actually test in our CI. They also correspond to JDK LTS versions, so they are not arbitrary.
Implement jl.Math.{multiplyFull, multiplyHigh, unsignedMultiplyHigh}.
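As a reference for the semantics of these three methods, here is a sketch using JavaScript `BigInt` (an actual backend may instead decompose the operands into 32-bit halves; `Int` arguments are plain numbers, `Long` arguments are bigints):

```javascript
// multiplyFull: the full 64-bit product of two 32-bit Ints.
function multiplyFull(a, b) {
  return BigInt(a) * BigInt(b);
}

// multiplyHigh: the high 64 bits of the signed 128-bit product of two Longs.
function multiplyHigh(a, b) {
  return BigInt.asIntN(64, (a * b) >> 64n);
}

// unsignedMultiplyHigh: same, treating both operands as unsigned.
function unsignedMultiplyHigh(a, b) {
  const ua = BigInt.asUintN(64, a);
  const ub = BigInt.asUintN(64, b);
  return BigInt.asIntN(64, (ua * ub) >> 64n);
}
```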
We only need the load spec; this allows memory to be freed more eagerly.
Improvements to the IRCleaner
Extract branches so that they mostly depend on `isInclusive` and `step >= 0`. These are often constant, and even when they are not statically constant, they are probably predictable anyway.
Previously, `Range` used a number of intermediate operations on `Long`s to avoid overflow. The optimizer was already good enough to remove them in many cases, in particular when the step is `1` and/or the start is `0`, which are common cases. Now that the optimizer can reason about unsigned division, we can do a lot better. We can entirely avoid `Long` computations, *and* streamline a lot of code, by using unsigned `Int` arithmetic. This produces much better code for the cases where the optimizer cannot entirely get rid of the computations.
Opt: Use unsigned arithmetic in Range, instead of Longs.
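A simplified sketch of the idea, assuming `step > 0` and `start <= end` (the real code handles the remaining cases): the wrapped Int difference `end - start`, reinterpreted as unsigned, is exact, so the length fits entirely in unsigned 32-bit arithmetic with no widening to `Long`.

```javascript
// Hypothetical helper, not the actual Range implementation.
function rangeLength(start, end, step, isInclusive) {
  // `| 0` mimics Int wrap-around on subtraction; `>>> 0` then reads
  // the wrapped difference as the exact unsigned gap in [0, 2^32).
  const gap = ((end - start) | 0) >>> 0;
  const q = (gap / step) >>> 0;  // unsigned quotient (truncated)
  const r = gap - q * step;      // unsigned remainder
  // exclusive range: ceil(gap / step); inclusive range: floor + 1
  return isInclusive ? q + 1 : q + (r !== 0 ? 1 : 0);
}
```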
Previously, we used instance methods on `RuntimeLong` to implement the operators whose lhs is a `RuntimeLong`. Other operators, such as `fromDouble`, were implemented as methods of the module. Now, we use static methods for all the operators. Once inlined, it does not make any difference. However, it leaves behind a few unused static methods, instead of some instance methods. The former can be removed by off-the-shelf JS minifiers, unlike the latter. Paradoxically, the `Minify` output, as we measure it, gets bigger. That's because static method names are not renamed by the name minifier.
The original implementation of `ju.ArrayDeque` used `js.Array` for its internal data structure. However, when compiling to WebAssembly, accessing the underlying `js.Array` requires JavaScript interop calls, leading to a significant performance overhead. This commit switches the internal data structure of `ju.ArrayDeque` to `scala.Array` instead, for both the WebAssembly and JavaScript backends. Using `scala.Array` in both environments avoids complicating the logic by conditionally choosing between `js.Array` and `scala.Array` depending on whether we target Wasm or JS. This change significantly improves `ArrayDeque`'s performance on WebAssembly and causes only a minor performance degradation on JavaScript. See the discussion at #5164
Implement ArrayDeque using scala.Array.
Use static methods as entry points for RuntimeLong operator methods.
This particular operation is used by a number of our core numerical algorithms, so it makes sense for it to receive a dedicated opcode, which also lets the optimizer reason about its purity.
Introduce IR UnaryOps for `numberOfLeadingZeros` (`clz`).
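On the JS side, `Int_clz` maps directly onto the `Math.clz32` primitive, and a 64-bit `clz` over `(hi, lo)` halves can be built from it. A sketch (the helper name is hypothetical):

```javascript
// 64-bit count-leading-zeros from two 32-bit halves: if the high word
// has any set bit, its clz is the answer; otherwise add 32 to the
// clz of the low word. Math.clz32(0) === 32, so clz64(0, 0) === 64.
function clz64(hi, lo) {
  return hi !== 0 ? Math.clz32(hi) : 32 + Math.clz32(lo);
}
```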
This was forgotten in f414dae.
Through clever analysis of approximation errors, it turns out we can rely on a `Double` division by 10^9 rather than a `Long` division. Since there was already a `Double` division (and remainder) at the end of the `Long` division algorithm, the new implementation should be strictly better. We do have a new call to `js.Math.floor`, but that should be efficient, as `floor` is typically a hardware instruction. We also remove the hackish way of reusing `unsignedDivModHelper` to compute `toString()`. According to the `longMicro` benchmark, this change gives a 3x speed improvement to this code path.
Discovered while working on #5077. This makes it possible to remove unnecessary unboxes and unlocks more inlining.
Opt: Avoid the `Long` division in `RuntimeLong.toString`.
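A heavily simplified sketch of the core idea: replace the `Long` division by 10^9 with a `Double` division and `Math.floor`, then derive the remainder. It is shown here only for values that are exactly representable as a `Double`; the real algorithm bounds the approximation error for the general case, and the helper name is hypothetical.

```javascript
// Split a non-negative, exactly-representable value (< 2^53) into
// quotient and remainder by 10^9 using only Double arithmetic.
function toStringAvoidingLongDiv(approx) {
  const quot = Math.floor(approx / 1e9); // Double division, no Long
  const rem = approx - quot * 1e9;
  // The low part carries exactly 9 decimal digits, zero-padded.
  return quot === 0 ? String(rem) : String(quot) + String(rem).padStart(9, "0");
}
```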
Add missing cases in `Printers` for `{Int,Long}_clz`.
Opt: Recognize non-nullable RuntimeLong as a subtype of LongType
If we cannot inline RuntimeLong (the class or its static methods), something is broken and we should fail.
Optimizer: Fail if we cannot inline RuntimeLong
Hacker's Delight offers branchless formulas for double-word addition and subtraction, without access to the machine's carry bit. Although the new formula contains more elementary instructions, the removal of the branch is significant. When one operand is constant, folding reduces the carry computation to 3 bitwise operations, which is very fast. When both operands are variable, the carry is often 50/50 unpredictable, which means the branch is unpredictable, and removing it is worth the full 5 bitwise operations anyway.

Negation does not need a special code path anymore. Regular folding of the `0L - b` formula yields optimal, branchless code anyway. We remove `RuntimeLong.neg` and its code paths in the optimizer and emitter. While we are at it, we also remove `RuntimeLong.not`, since the regular code paths for `-1L ^ b` fold in the same way.

This change reduces the execution time of the `sha512` benchmark by a whopping 25-30%.
Opt: Branchless addition, subtraction and negation for `RuntimeLong`.
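The branchless carry referred to above is the classic Hacker's Delight formula: the carry out of an unsigned 32-bit addition `lo = a + b` is the sign bit of the majority function of `a`, `b`, and `~lo`. A sketch of a branchless 64-bit addition over `(hi, lo)` pairs (names illustrative, not the actual `RuntimeLong` code):

```javascript
function longAdd(ahi, alo, bhi, blo) {
  const lo = (alo + blo) | 0;
  // 5 bitwise operations plus a shift, no branch: the carry is 1 iff
  // at least two of { alo's MSB, blo's MSB, ~lo's MSB } are set.
  const carry = ((alo & blo) | ((alo | blo) & ~lo)) >>> 31;
  const hi = (ahi + bhi + carry) | 0;
  return { hi: hi, lo: lo };
}
```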
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.1)