cross-platforming compression by yurique · Pull Request #2771 · typelevel/fs2 · GitHub

cross-platforming compression #2771


Closed
wants to merge 41 commits into main from feature/js-compression-platform

Conversation

yurique
Contributor
@yurique yurique commented Dec 25, 2021

No description provided.

@yurique
Contributor Author
yurique commented Dec 26, 2021

One question would be: can we think of something to organize all this stuff better? Right now we have (probably) too many "*platforms" and co.

@yurique
Contributor Author
yurique commented Jan 18, 2022

Is there a way to reduce the number of Pull "stages" in the implementation, without relying on custom constructs like your UnsafeMutableContainer?

I couldn't find a way to simplify things further, and I don't think even using more Refs / UnsafeMutableContainers would make that possible.

The Pulls here are like this:

  • read the length N from the stream and skip N bytes (updating the crc32 while skipping)
  • read the bytes until there's a 0 (updating the crc32), but only take at most approximately M bytes (soft limits for file name / comment length)
  • go through the stream and feed it into the inflater (updating the crc32 with the inflater output), stop when the inflater says it's done, get trailerSize bytes from the [ bytes not processed by the inflater ++ rest of the stream ]

Some of those are simpler (though complicated by the need to update the crc32 and/or count the bytes); others are tricky in their own right.
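
For illustration, the first of those stages in isolation looks roughly like this – a minimal sketch assuming the JVM's juz.CRC32; skipTracked and its exact shape are hypothetical, not the code in this PR:

import java.util.zip.CRC32
import fs2.{INothing, Pull, Stream}

// Sketch: skip `n` bytes of the stream, feeding the skipped bytes to the
// CRC32, and return the remainder of the stream (if any).
def skipTracked[F[_]](crc: CRC32, n: Long)(
    s: Stream[F, Byte]
): Pull[F, INothing, Option[Stream[F, Byte]]] =
  if (n <= 0) Pull.pure(Some(s))
  else
    s.pull.uncons.flatMap {
      case None => Pull.pure(None) // input ended before n bytes were skipped
      case Some((hd, tl)) =>
        val k = math.min(n, hd.size.toLong).toInt
        val slice = hd.take(k).toArraySlice
        crc.update(slice.values, slice.offset, slice.length)
        if (k == hd.size) skipTracked(crc, n - k)(tl)
        else Pull.pure(Some(tl.cons(hd.drop(k))))
    }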

I'd be happy to receive suggestions, though :)

@yurique
Contributor Author
yurique commented Jan 18, 2022

The results with a combined Ref:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score     Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    42.204 ±   8.658  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   240.016 ±  32.985  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  2089.465 ±  87.978  ops/s
[info] CompressionBenchmark.gunzip               false  thrpt    3   669.763 ±  12.048  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    42.233 ±   1.149  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   230.807 ±  29.052  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2471.074 ± 417.530  ops/s
[info] CompressionBenchmark.inflate              false  thrpt    3   715.019 ±  21.269  ops/s

Before:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score     Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    42.188 ±   2.239  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   239.414 ±  20.532  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  2059.542 ±  43.628  ops/s
[info] CompressionBenchmark.gunzip               false  thrpt    3   665.145 ±   5.446  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    41.829 ±   2.228  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   230.061 ±  26.736  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2479.163 ± 150.095  ops/s
[info] CompressionBenchmark.inflate              false  thrpt    3   711.984 ±   7.221  ops/s

For reference (copied from above): after doing weird things and removing the crc wrapper, the best I got was this:

[info] CompressionBenchmark.gunzip               true  thrpt    3  2171.996 ± 24.345  ops/s

And main:

[info] CompressionBenchmark.gunzip                true  thrpt    3  2256.685 ±  50.507  ops/s

@yurique
Contributor Author
yurique commented Jan 18, 2022

Combining the Refs complicates things in JS (those Refs are set in different places/callbacks there).

@armanbilge
Member

@yurique thanks for all the updates. Actually that last pair of numbers you posted in #2771 (comment) look quite good! Basically the same within the margin of error.

At this point I'm not sure what constitutes weird anymore :) for example, I wonder if these defs could somehow be nested so that they can share a var instead of using UnsafeMutableContainer—but I'm not sure if that's more or less weird.

I'll try and take a deeper look sometime this week. Thanks again, for the massive effort this turned out to be...

@yurique
Contributor Author
yurique commented Jan 18, 2022

I wonder if these defs could somehow be nested so that they can share a var instead of using UnsafeMutableContainer

That would be... tricky :)

The Refs (or UnsafeMutableContainers) are created and set in platform-specific inflate implementations, but consumed in the shared Gzip.gunzip.

The latest numbers are not too bad, I agree :) (but combining Refs didn't give us the perf improvement I was silently hoping for).

@yurique
Contributor Author
yurique commented Jan 18, 2022

I'm still concerned about all the "weird" things we have to have with GunzipResult.

Should we instead think about introducing a new set of APIs: Gzip (next to Compression), have the existing jvm gzip functions forward to it, make a new GunzipResult (in a new package) and deprecate the old one? 🤔
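
Something along these lines (a purely hypothetical shape, just to illustrate the idea):

import fs2.{Pipe, Stream}
import fs2.compression.{DeflateParams, GunzipResult, InflateParams}

// hypothetical sketch of a standalone Gzip capability, for illustration only
trait Gzip[F[_]] {
  def gzip(params: DeflateParams): Pipe[F, Byte, Byte]
  def gunzip(params: InflateParams): Stream[F, Byte] => Stream[F, GunzipResult[F]]
}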

@armanbilge
Member

I'm still concerned about all the "weird" things we have to have with GunzipResult.

I'm less worried about that. In this case it would be okay to implement that class twice, once on JS and once on JVM and avoid the weirdness.

If you still have appetite to try a different approach, here's an idea: you could try interacting directly with the Node.js APIs, rather than converting them to Stream. Then, maybe more of the current JVM code can be reused by building a common interface on top of the JDK/Node.js APIs. This may still be tricky because Node.js is async. WDYT?

@yurique
Contributor Author
yurique commented Jan 18, 2022

you could try interacting directly with the Node.js APIs

This may still be tricky because Node.js is async

Interesting 🤔

Also, there are sync alternatives to the async functions in node. I can't remember if I considered this initially (and if so, what it was that made me not do it this way) – I'll take a peek :)

@armanbilge
Member

I'd advise against the sync methods, since they are effectively blocking the main (and only) thread in the Node runtime.

@yurique
Contributor Author
yurique commented Jan 18, 2022

Oh, now I also remember – the sync functions operate on whole inputs (as do the async ones). What we use are the "inflater" and "deflater", which are Duplexes, and we have this whole suspendReadableAndRead machinery to deal with those (which involves subscribing to events etc.) – something tells me we can't really do better than that :) (or utilize these things the way we utilize the JVM's inflater and deflater, which are sync and designed to take data in chunks).

@armanbilge
Member

or utilize these things the way we utilize jvm's inflater and deflater, which are sync and designed to take data in in chunks

The Duplex works basically the same way: you can write chunks, and it emits chunks as events. Only it happens async instead of sync, but it's all F in the end 😉 Assuming one chunk comes out per chunk that goes in, it should be easy, I would think. Otherwise, it gets a lot harder and you'd probably end up re-implementing Stream 😆
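
For reference, the part of the Duplex surface being described is roughly this (a simplified Scala.js facade sketch, not fs2's actual bindings):

import scala.scalajs.js
import scala.scalajs.js.typedarray.Uint8Array

// input goes in via write/end; output comes back asynchronously via
// 'data' events, and completion is signalled by the 'end' event
@js.native
trait Duplex extends js.Object {
  def write(chunk: Uint8Array): Boolean = js.native
  def on(event: String, listener: js.Function1[js.Any, Unit]): this.type = js.native
  def end(): Unit = js.native
}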

@yurique
Contributor Author
yurique commented Jan 18, 2022

Yeah, it can probably be more or less straightforward :)

Those pulls that do (approx.) inflater.inflate(nextChunk); Pull.output(inflaterOutput) would have to do (approx.) Pull.eval(inflateChunk(nextChunk).flatMap(_.get)).flatMap(output => Pull.output(output)).

The inflateChunk(nextChunk) would probably have to return something like a Deferred – but we don't have Async in the JVM interface, and this could also affect performance quite a bit.

@yurique
Contributor Author
yurique commented Jan 18, 2022

Otherwise, having only this lower-level part be platform-specific, with everything else shared, would be nice indeed!

@armanbilge
Member

I don't think it has to return Deferred. I'm hoping the difference should be essentially F.delay( ... do something sync ... ) for JVM vs F.async_(cb => ... do something async ...) for JS, which both give you F.
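
Concretely, the split could look something like this (a sketch with hypothetical names; the JS body is only outlined, since it needs the Duplex event wiring):

import cats.effect.Sync
import fs2.Chunk
import java.util.zip.Inflater

// hypothetical single-chunk interface, for illustration
trait SimpleChunkInflater[F[_]] {
  def inflateChunk(bytes: Chunk[Byte]): F[Chunk[Byte]]
}

// JVM: juz.Inflater is synchronous, so F.delay is enough
def jvmInflater[F[_]](inflater: Inflater, buffer: Array[Byte])(implicit
    F: Sync[F]
): SimpleChunkInflater[F] =
  new SimpleChunkInflater[F] {
    def inflateChunk(bytes: Chunk[Byte]): F[Chunk[Byte]] =
      F.delay {
        val slice = bytes.toArraySlice
        inflater.setInput(slice.values, slice.offset, slice.length)
        val n = inflater.inflate(buffer)
        Chunk.array(java.util.Arrays.copyOf(buffer, n))
      }
  }

// JS: the Node.js inflater produces its output asynchronously, so the same
// method becomes F.async_(cb => /* write `bytes` to the zlib Duplex and
// complete `cb` from its 'data' event */) – which also just yields an F.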

@yurique
Contributor Author
yurique commented Jan 18, 2022

F.async_(cb => ... do something async ...) for JS

Indeed! :)

Well, I'll give it a try as soon as I can. This should work. Curious to see how it affects the benchmark.

@yurique yurique force-pushed the feature/js-compression-platform branch from 1745d7a to 022229e on January 22, 2022 at 22:46
@yurique
Contributor Author
yurique commented Jan 22, 2022

@armanbilge

I tried implementing inflate in terms of this:

trait ChunkInflater[F[_]] {
  /** @param bytesChunk bytes to inflate
    * @return (inflatedChunk, remainingBytes, finished)
    */
  def inflateChunk(
      bytesChunk: Chunk[Byte]
  ): Pull[F, INothing, (Chunk[Byte], Int, Boolean)]
}

jvm implementation:

new ChunkInflater[F] {
  def inflateChunk(
      bytesChunk: Chunk.ArraySlice[Byte],
      offset: Int
  ): Pull[F, INothing, (Chunk[Byte], Int, Boolean)] = {
    // feed the inflater the part of the slice that hasn't been consumed yet
    inflater.setInput(
      bytesChunk.values,
      bytesChunk.offset + offset,
      bytesChunk.length - offset
    )
    val inflatedBytes = inflater.inflate(inflatedBuffer)
    Pull.pure(
      (
        copyAsChunkBytes(inflatedBuffer, inflatedBytes),
        inflater.getRemaining, // input bytes the inflater didn't consume
        inflater.finished()
      )
    )
  }
}

(changes: https://github.com/yurique/fs2/pull/1/files)

The tests pass on the JVM.

The inflate implementation is the same, except it uses ChunkInflater instead of directly interfacing with juz.Inflater.
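
Roughly, the shared loop drives the inflater like this (my paraphrase of the shape, not the exact code; re-feeding the unconsumed tail is shown with takeRight for brevity, while the actual implementation keeps an ArraySlice plus an offset – see the side note below):

import fs2.{Chunk, Pull, Stream}

// sketch: feed a chunk, emit its output, re-feed whatever input the
// inflater didn't consume, then pull the next chunk
def inflateLoop[F[_]](inflater: ChunkInflater[F])(
    s: Stream[F, Byte]
): Pull[F, Byte, Unit] = {
  def feed(chunk: Chunk[Byte], rest: Stream[F, Byte]): Pull[F, Byte, Unit] =
    inflater.inflateChunk(chunk).flatMap { case (inflated, remaining, finished) =>
      Pull.output(inflated) >> {
        if (finished) Pull.done
        else if (remaining > 0) feed(chunk.takeRight(remaining), rest)
        else
          rest.pull.uncons.flatMap {
            case Some((hd, tl)) => feed(hd, tl)
            case None           => Pull.done
          }
      }
    }
  s.pull.uncons.flatMap {
    case Some((hd, tl)) => feed(hd, tl)
    case None           => Pull.done
  }
}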

Benchmarks look weird, though (ran it a few times, it's consistent):

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score    Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    4    41.692 ±  0.356  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    4   235.074 ±  1.212  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    4  1557.763 ± 19.680  ops/s <----
[info] CompressionBenchmark.gunzip               false  thrpt    4   603.812 ±  2.886  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    4    41.193 ±  0.687  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    4   226.022 ±  3.354  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    4  2446.234 ± 15.744  ops/s <----
[info] CompressionBenchmark.inflate              false  thrpt    4   700.276 ±  4.743  ops/s

before:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score    Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    41.596 ±  0.976  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   234.454 ± 13.996  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  2023.898 ± 87.731  ops/s <----
[info] CompressionBenchmark.gunzip               false  thrpt    3   655.208 ±  5.633  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    41.279 ±  1.060  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   225.414 ± 28.269  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2450.515 ± 53.559  ops/s <----
[info] CompressionBenchmark.inflate              false  thrpt    3   700.596 ± 14.874  ops/s

I can't explain the difference in the gunzip benchmarks given near-identical results for inflate 🤔 (gunzip implementation is the same). Can you spot something I'm missing?


I'm about halfway there with the JS implementation (though it's turning out even trickier than I expected), but unless we can fix this performance issue, I'm not sure this is the way to go.


As a side note: in the inflate implementation, I have this:

              bytesChunk: Chunk.ArraySlice[Byte],
              offset: Int,

This is used to send the whole chunk into the inflater (inflater doesn't always accept the whole input because the output might overflow the output buffer, so this slice is being re-used, and the offset within that slice is increased as needed).

So I wondered whether I should maybe just use the Chunk and .takeRight(...) instead.

(changes: https://github.com/yurique/fs2/pull/2/files)

Got a rather surprising result:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt    Score    Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3   41.563 ±  1.225  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3  234.662 ± 25.051  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  374.088 ±  8.450  ops/s  <----
[info] CompressionBenchmark.gunzip               false  thrpt    3  601.324 ±  5.705  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3   41.181 ±  0.693  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3  223.710 ± 29.040  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  422.174 ± 16.779  ops/s  <----
[info] CompressionBenchmark.inflate              false  thrpt    3  697.039 ±  4.659  ops/s

@armanbilge
Member

@yurique so if I'm reading those benchmarks correctly, gunzip performance decreased but inflate stayed the same. IIRC the difference is that gunzip calculates the CRC for verification but inflate does not (apologies if I remember incorrectly). So could it be related to that somehow?

@yurique
Contributor Author
yurique commented Jan 24, 2022

@armanbilge yes, that's right – gunzip "requests" the calculation of the crc and the trailer.

I was thinking "those parts haven't changed at all either", but your comment reminded me of something, so I looked through it again. And there is a subtle difference:

if (track) crcBuilder.update(inflatedChunk)

So now inflatedChunk is a Chunk, and crcBuilder.update calls .toArraySlice under the hood. Before, there was no .toArraySlice involved, as the inflater output buffer was available as an Array[Byte]. This must be it.

I changed it so that the ChunkInflater returns its buffer + length without wrapping in a chunk, so there's no extra .toArraySlice when calling crcBuilder.update.
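
I.e. roughly this change of shape (hypothetical sketch):

import fs2.{Chunk, INothing, Pull}

// revised shape: hand back the raw buffer and the inflated length, so the
// CRC can be updated straight from the Array[Byte] without wrapping it in
// a Chunk (and thus without .toArraySlice)
trait ChunkInflater[F[_]] {
  def inflateChunk(
      bytesChunk: Chunk[Byte]
  ): Pull[F, INothing, (Array[Byte], Int, Int, Boolean)]
  // = (buffer, inflatedLength, remainingInput, finished)
}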

And the benchmark is almost back to where it was:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score     Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    41.041 ±   4.382  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   231.742 ±  18.002  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  1971.619 ±  67.197  ops/s
[info] CompressionBenchmark.gunzip               false  thrpt    3   646.719 ±   6.831  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    40.631 ±   2.349  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   221.801 ±  35.506  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2408.610 ± 173.101  ops/s
[info] CompressionBenchmark.inflate              false  thrpt    3   690.958 ±  10.401  ops/s

But I'm really surprised: the Chunk is created with Chunk.array(target, 0, length) – I'd assumed .toArraySlice would be nearly a no-op in such a case. I guess it's not :)

  def toArraySlice[O2 >: O](implicit ct: ClassTag[O2]): Chunk.ArraySlice[O2] =
    this match {
      case as: Chunk.ArraySlice[_] if ct.wrap.runtimeClass eq as.getClass =>
        as.asInstanceOf[Chunk.ArraySlice[O2]]
      case _ => Chunk.ArraySlice(toArray, 0, size)
    }

(If I'm reading that guard right, ct.wrap.runtimeClass is the Array[Byte] class while as.getClass is the ArraySlice class itself, so for a Chunk[Byte] the first case never matches and toArray copies every time.)

But this is good! I'll try finishing the JS implementation.

@armanbilge
Member

Nice! 🎉 And very interesting about toArraySlice 🤔

@yurique
Contributor Author
yurique commented Jan 24, 2022

@armanbilge so I managed to make a JS implementation – it was hell, but it works =)

The benchmark results are the same as before the JS implementation (as they should be – I added a couple of things to the shared inflate implementation, but those don't kick in on the JVM).

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score     Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    41.846 ±   0.349  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   234.625 ±  14.218  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  2015.259 ± 119.846  ops/s
[info] CompressionBenchmark.gunzip               false  thrpt    3   655.609 ±  10.494  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    41.318 ±   1.437  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   225.468 ±  14.121  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2413.895 ± 354.697  ops/s
[info] CompressionBenchmark.inflate              false  thrpt    3   697.494 ±   5.701  ops/s

As a result, the inflate and gzip implementations are shared between JS and the JVM, apart from the platform-specific interface to the inflater.

Changes are here: https://github.com/yurique/fs2/pull/1/files

jvm chunk inflater: https://github.com/yurique/fs2/pull/1/files#diff-8c2b0bb4111ef230a00138813c001efa1fa187ee4c6f43cf514e0074b570e084R147-R181

js chunk inflater: https://github.com/yurique/fs2/pull/1/files#diff-1b982c5a5c6bd8cdbb47ca7c2114327cb202075ee95749b8024d7078b6fa7490R73-R207

shared code: https://github.com/yurique/fs2/pull/1/files#diff-e7f835b1438d510176b8ad23b915f2c27e152bc2768847265da54e9ce98acb50R33
It's mostly the same as the JVM implementation was before this change, but with an additional pull "stage" – drain – which is only invoked in JS; it reads the output without writing to the inflater (when there's nothing more to write) until the reader's end event is emitted.
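
The drain stage is roughly this shape (a hypothetical sketch; the real code waits for the reader's end event rather than a finished flag):

import fs2.{Chunk, Pull}

// keep requesting output without feeding new input, until done
def drainStage[F[_]](inflater: ChunkInflater[F]): Pull[F, Byte, Unit] =
  inflater.inflateChunk(Chunk.empty[Byte]).flatMap { case (inflated, _, finished) =>
    Pull.output(inflated) >> (if (finished) Pull.done else drainStage(inflater))
  }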

@armanbilge
Member

Haven't had a chance to look yet but just wanted to 🎉 that's amazing!

@armanbilge
Member

I'm really sorry, has this PR just been waiting for my review? 😬 😅 I'll try and resolve the merge conflicts.

@armanbilge
Member

Oh I see, there are more changes in a PR against this PR :)

@yurique
Contributor Author
yurique commented Apr 22, 2022

I'm trying to recollect what was going on here.

I think the PR against this PR was an "experiment" that turned out to be a successful one – more shared code between JS and JVM, and without a performance hit. But I can't remember all the details off the top of my head now :'(

@armanbilge
Member

I remember :) you were able to implement a common Pull-based inflater with both the JDK and Node.js APIs which (finally!) wasn't a performance regression on the JVM 🎉 Then everything else could be expressed in terms of that.

I think that's the hardest part of this PR 😅 so thanks for your big effort on it!

IIRC some remaining things are dealing with bin-compatibly changing the API to use FiniteDuration instead of java.time, and also platforming the OS-detection stuff. These are annoying, but doable.

The other thing I remember being frustrating was the poor performance of the CRC calculation on JS, particularly when using the scodec-bits implementation. This PR inlines a mutable implementation which works better (I can't remember how much better). I wish we didn't have to take that on, but I'm not sure there's another way, since Node.js doesn't expose native methods for calculating the CRC.
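
For context, a table-driven CRC-32 in that mutable style looks roughly like this (a sketch of the standard algorithm with the reflected polynomial 0xEDB88320, not the code inlined by this PR):

// standard table-driven CRC-32; illustrative only
final class Crc32Builder {
  private[this] val table: Array[Int] = Array.tabulate(256) { i =>
    var c = i
    var k = 0
    while (k < 8) {
      c = if ((c & 1) != 0) 0xedb88320 ^ (c >>> 1) else c >>> 1
      k += 1
    }
    c
  }
  private[this] var crc: Int = 0xffffffff

  def update(bytes: Array[Byte], offset: Int, length: Int): Unit = {
    var i = offset
    val end = offset + length
    while (i < end) {
      crc = table((crc ^ bytes(i)) & 0xff) ^ (crc >>> 8)
      i += 1
    }
  }

  def value: Long = (~crc).toLong & 0xffffffffL
}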

@armanbilge
Member

So I took another look at how http4s is using the Compression trait, especially because I was wondering how the Node.js native HTTP client supports compression without all this functionality; e.g. see:
https://nodejs.org/api/zlib.html#compressing-http-requests-and-responses

It seems none of the http4s gzip middlewares use the additional headers such as file name or modification time; they just interface directly with the underlying byte stream:

https://github.com/http4s/http4s/blob/b3220ab944042a6729ecdcbbd2b8a034250e98f5/client/jvm/src/main/scala/org/http4s/client/middleware/GZip.scala#L65

https://github.com/http4s/http4s/blob/b3220ab944042a6729ecdcbbd2b8a034250e98f5/server/jvm/src/main/scala/org/http4s/server/middleware/GZip.scala#L78

IIUC it should be possible to support these http4s use-cases just by shelling out directly to the Node.js APIs, as in the example there. Not to discount your huge efforts on this PR :( but maybe the easy win here for http4s is to add new cross-platform APIs to Compression for gzipping without the optional headers. I think this would also be more performant, since calculating CRCs in raw JS was a bottleneck.

Meanwhile, I see there is an (unfortunately inactive) feature request for Node.js itself to add native support for these optional headers. If that ever materializes, we would be able to implement these APIs much more easily and with much better performance.

What do you think?

@armanbilge armanbilge mentioned this pull request Jul 27, 2022