cross-platforming compression by yurique · Pull Request #2771 · typelevel/fs2 · GitHub

cross-platforming compression #2771


Closed
wants to merge 41 commits into main from feature/js-compression-platform

Conversation

yurique
Contributor
@yurique yurique commented Dec 25, 2021

No description provided.

@yurique
Contributor Author
yurique commented Dec 26, 2021

One question would be: can we think of something to organize all this stuff better? Right now we have (probably) too many "*platforms" and co.

@yurique
Contributor Author
yurique commented Jan 18, 2022

Is there a way to reduce the number of Pull "stages" in the implementation, without relying on custom constructs like your UnsafeMutableContainer?

I couldn't find a way to simplify things further, and I don't think even using more Refs / UnsafeMutableContainers would make that possible.

The Pulls here are like this:

  • read the length N from the stream and skip N bytes (updating the crc32 while skipping)
  • read the bytes until there's a 0 (updating the crc32), but only take at most approximately M bytes (soft limits for file name / comment length)
  • go through the stream and feed it into the inflater (updating the crc32 with the inflater output), stop when the inflater says it's done, get trailerSize bytes from the [ bytes not processed by the inflater ++ rest of the stream ]

Some of those are simpler (though complicated by the need to update the crc32 and/or count the bytes); others are tricky in their own right.
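
For illustration, the first of those stages in isolation looks roughly like this – a minimal sketch assuming the JVM's juz.CRC32; skipTracked and its exact shape are hypothetical, not the code in this PR:

import java.util.zip.CRC32
import fs2.{INothing, Pull, Stream}

// Sketch: skip `n` bytes of the stream, feeding the skipped bytes to the
// CRC32, and return the remainder of the stream (if any).
def skipTracked[F[_]](crc: CRC32, n: Long)(
    s: Stream[F, Byte]
): Pull[F, INothing, Option[Stream[F, Byte]]] =
  if (n <= 0) Pull.pure(Some(s))
  else
    s.pull.uncons.flatMap {
      case None => Pull.pure(None) // input ended before n bytes were skipped
      case Some((hd, tl)) =>
        val k = math.min(n, hd.size.toLong).toInt
        val slice = hd.take(k).toArraySlice
        crc.update(slice.values, slice.offset, slice.length)
        if (k == hd.size) skipTracked(crc, n - k)(tl)
        else Pull.pure(Some(tl.cons(hd.drop(k))))
    }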

I'd be happy to receive suggestions, though :)

@yurique
Contributor Author
yurique commented Jan 18, 2022

The results with a combined Ref:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score     Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    42.204 ±   8.658  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   240.016 ±  32.985  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  2089.465 ±  87.978  ops/s
[info] CompressionBenchmark.gunzip               false  thrpt    3   669.763 ±  12.048  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    42.233 ±   1.149  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   230.807 ±  29.052  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2471.074 ± 417.530  ops/s
[info] CompressionBenchmark.inflate              false  thrpt    3   715.019 ±  21.269  ops/s

Before:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score     Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    42.188 ±   2.239  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   239.414 ±  20.532  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  2059.542 ±  43.628  ops/s
[info] CompressionBenchmark.gunzip               false  thrpt    3   665.145 ±   5.446  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    41.829 ±   2.228  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   230.061 ±  26.736  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2479.163 ± 150.095  ops/s
[info] CompressionBenchmark.inflate              false  thrpt    3   711.984 ±   7.221  ops/s

For reference (copied from above): after doing weird things and removing the crc wrapper, the best I got was this:

[info] CompressionBenchmark.gunzip               true  thrpt    3  2171.996 ± 24.345  ops/s

And main:

[info] CompressionBenchmark.gunzip                true  thrpt    3  2256.685 ±  50.507  ops/s

@yurique
Contributor Author
yurique commented Jan 18, 2022

Combining the Refs complicates things in JS (those Refs are set in different places/callbacks there).

@armanbilge
Member

@yurique thanks for all the updates. Actually that last pair of numbers you posted in #2771 (comment) look quite good! Basically the same within the margin of error.

At this point I'm not sure what constitutes weird anymore :) for example, I wonder if these defs could somehow be nested so that they can share a var instead of using UnsafeMutableContainer—but I'm not sure if that's more or less weird.

I'll try and take a deeper look sometime this week. Thanks again, for the massive effort this turned out to be...

@yurique
Contributor Author
yurique commented Jan 18, 2022

I wonder if these defs could somehow be nested so that they can share a var instead of using UnsafeMutableContainer

That would be... tricky :)

The Refs (or UnsafeMutableContainers) are created and set in platform-specific inflate implementations, but consumed in the shared Gzip.gunzip.

The latest numbers are not too bad, I agree :) (but combining Refs didn't give us the perf improvement I was silently hoping for).

@yurique
Contributor Author
yurique commented Jan 18, 2022

I'm still concerned about all the "weird" things we have to have with GunzipResult.

Should we instead think about introducing a new set of APIs: Gzip (next to Compression), have the existing jvm gzip functions forward to it, make a new GunzipResult (in a new package) and deprecate the old one? 🤔
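
Something along these lines (a purely hypothetical shape, just to illustrate the idea):

import fs2.{Pipe, Stream}
import fs2.compression.{DeflateParams, GunzipResult, InflateParams}

// hypothetical sketch of a standalone Gzip capability, for illustration only
trait Gzip[F[_]] {
  def gzip(params: DeflateParams): Pipe[F, Byte, Byte]
  def gunzip(params: InflateParams): Stream[F, Byte] => Stream[F, GunzipResult[F]]
}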

@armanbilge
Member

I'm still concerned about all the "weird" things we have to have with GunzipResult.

I'm less worried about that. In this case it would be okay to implement that class twice, once on JS and once on JVM and avoid the weirdness.

If you still have appetite to try a different approach, here's an idea: you could try interacting directly with the Node.js APIs, rather than converting them to Stream. Then, maybe more of the current JVM code can be reused by building a common interface on top of the JDK/Node.js APIs. This may still be tricky because Node.js is async. WDYT?

@yurique
Contributor Author
yurique commented Jan 18, 2022

you could try interacting directly with the Node.js APIs

This may still be tricky because Node.js is async

Interesting 🤔

Also, there are sync alternatives to the async functions in node. I can't remember if I considered this initially (and if so, what it was that made me not do it this way) – I'll take a peek :)

@armanbilge
Member

I'd advise against the sync methods, since they are effectively blocking the main (and only) thread in the Node runtime.

@yurique
Contributor Author
yurique commented Jan 18, 2022

Oh, now I also remember – the sync functions operate on whole inputs (as do the async ones). What we use are the "inflater" and "deflater", which are Duplexes, and we have this whole suspendReadableAndRead machinery to deal with those (which involves subscribing to events etc.) – something tells me we can't really do better than that :) (or utilize these things the way we utilize the JVM's inflater and deflater, which are sync and designed to take data in chunks).

@armanbilge
Member

or utilize these things the way we utilize jvm's inflater and deflater, which are sync and designed to take data in in chunks

The Duplex works basically the same way: you can write chunks, and it emits chunks as events. Only it happens async instead of sync, but it's all F in the end 😉 Assuming one chunk comes out per chunk that goes in, it should be easy, I would think. Otherwise, it gets a lot harder and you'd probably end up re-implementing Stream 😆
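
For reference, the part of the Duplex surface being described is roughly this (a simplified Scala.js facade sketch, not fs2's actual bindings):

import scala.scalajs.js
import scala.scalajs.js.typedarray.Uint8Array

// input goes in via write/end; output comes back asynchronously via
// 'data' events, and completion is signalled by the 'end' event
@js.native
trait Duplex extends js.Object {
  def write(chunk: Uint8Array): Boolean = js.native
  def on(event: String, listener: js.Function1[js.Any, Unit]): this.type = js.native
  def end(): Unit = js.native
}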

@yurique
Contributor Author
yurique commented Jan 18, 2022

Yeah, it can probably be more or less straightforward :)

Those pulls that do (approx.) inflater.inflate(nextChunk); Pull.output(inflaterOutput) would have to do (approx.) Pull.eval(inflateChunk(nextChunk).flatMap(_.get)).flatMap(output => Pull.output(output)).

The inflateChunk(nextChunk) would probably have to return something like a Deferred – but we don't have Async in the JVM interface, and this could also affect performance quite a bit.

@yurique
Contributor Author
yurique commented Jan 18, 2022

Otherwise, having only this lower-level part be platform-specific, with everything else shared, would be nice indeed!

@armanbilge
Member

I don't think it has to return Deferred. I'm hoping the difference should be essentially F.delay( ... do something sync ... ) for JVM vs F.async_(cb => ... do something async ...) for JS, which both give you F.
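
Concretely, the split could look something like this (a sketch with hypothetical names; the JS body is only outlined, since it needs the Duplex event wiring):

import cats.effect.Sync
import fs2.Chunk
import java.util.zip.Inflater

// hypothetical single-chunk interface, for illustration
trait SimpleChunkInflater[F[_]] {
  def inflateChunk(bytes: Chunk[Byte]): F[Chunk[Byte]]
}

// JVM: juz.Inflater is synchronous, so F.delay is enough
def jvmInflater[F[_]](inflater: Inflater, buffer: Array[Byte])(implicit
    F: Sync[F]
): SimpleChunkInflater[F] =
  new SimpleChunkInflater[F] {
    def inflateChunk(bytes: Chunk[Byte]): F[Chunk[Byte]] =
      F.delay {
        val slice = bytes.toArraySlice
        inflater.setInput(slice.values, slice.offset, slice.length)
        val n = inflater.inflate(buffer)
        Chunk.array(java.util.Arrays.copyOf(buffer, n))
      }
  }

// JS: the Node.js inflater produces its output asynchronously, so the same
// method becomes F.async_(cb => /* write `bytes` to the zlib Duplex and
// complete `cb` from its 'data' event */) – which also just yields an F.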

@yurique
Contributor Author
yurique commented Jan 18, 2022

F.async_(cb => ... do something async ...) for JS

Indeed! :)

Well, I'll give it a try as soon as I can. This should work. Curious to see how it affects the benchmark.

@yurique yurique force-pushed the feature/js-compression-platform branch from 1745d7a to 022229e on January 22, 2022 at 22:46
@yurique
Contributor Author
yurique commented Jan 22, 2022

@armanbilge

I tried implementing inflate in terms of this:

trait ChunkInflater[F[_]] {
  /** @param bytesChunk bytes to inflate
    * @return (inflatedChunk, remainingBytes, finished)
    */
  def inflateChunk(
      bytesChunk: Chunk[Byte]
  ): Pull[F, INothing, (Chunk[Byte], Int, Boolean)]
}

jvm implementation:

new ChunkInflater[F] {
  def inflateChunk(
      bytesChunk: Chunk.ArraySlice[Byte],
      offset: Int
  ): Pull[F, INothing, (Chunk[Byte], Int, Boolean)] = {
    // feed the inflater the part of the slice that hasn't been consumed yet
    inflater.setInput(
      bytesChunk.values,
      bytesChunk.offset + offset,
      bytesChunk.length - offset
    )
    val inflatedBytes = inflater.inflate(inflatedBuffer)
    Pull.pure(
      (
        copyAsChunkBytes(inflatedBuffer, inflatedBytes),
        inflater.getRemaining, // input bytes the inflater didn't consume
        inflater.finished()
      )
    )
  }
}

(changes: https://github.com/yurique/fs2/pull/1/files)

The tests pass on the JVM.

The inflate implementation is the same, except it uses ChunkInflater instead of directly interfacing with juz.Inflater.
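
Roughly, the shared loop drives the inflater like this (my paraphrase of the shape, not the exact code; re-feeding the unconsumed tail is shown with takeRight for brevity, while the actual implementation keeps an ArraySlice plus an offset – see the side note below):

import fs2.{Chunk, Pull, Stream}

// sketch: feed a chunk, emit its output, re-feed whatever input the
// inflater didn't consume, then pull the next chunk
def inflateLoop[F[_]](inflater: ChunkInflater[F])(
    s: Stream[F, Byte]
): Pull[F, Byte, Unit] = {
  def feed(chunk: Chunk[Byte], rest: Stream[F, Byte]): Pull[F, Byte, Unit] =
    inflater.inflateChunk(chunk).flatMap { case (inflated, remaining, finished) =>
      Pull.output(inflated) >> {
        if (finished) Pull.done
        else if (remaining > 0) feed(chunk.takeRight(remaining), rest)
        else
          rest.pull.uncons.flatMap {
            case Some((hd, tl)) => feed(hd, tl)
            case None           => Pull.done
          }
      }
    }
  s.pull.uncons.flatMap {
    case Some((hd, tl)) => feed(hd, tl)
    case None           => Pull.done
  }
}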

Benchmarks look weird, though (ran it a few times, it's consistent):

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score    Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    4    41.692 ±  0.356  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    4   235.074 ±  1.212  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    4  1557.763 ± 19.680  ops/s <----
[info] CompressionBenchmark.gunzip               false  thrpt    4   603.812 ±  2.886  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    4    41.193 ±  0.687  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    4   226.022 ±  3.354  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    4  2446.234 ± 15.744  ops/s <----
[info] CompressionBenchmark.inflate              false  thrpt    4   700.276 ±  4.743  ops/s

before:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score    Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    41.596 ±  0.976  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   234.454 ± 13.996  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  2023.898 ± 87.731  ops/s <----
[info] CompressionBenchmark.gunzip               false  thrpt    3   655.208 ±  5.633  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    41.279 ±  1.060  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   225.414 ± 28.269  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2450.515 ± 53.559  ops/s <----
[info] CompressionBenchmark.inflate              false  thrpt    3   700.596 ± 14.874  ops/s

I can't explain the difference in the gunzip benchmarks given near-identical results for inflate 🤔 (gunzip implementation is the same). Can you spot something I'm missing?


I'm about halfway there with the JS implementation (though it's turning out even trickier than I expected), but unless we can fix this performance issue, I'm not sure this is the way to go.


As a side note: in the inflate implementation, I have this:

              bytesChunk: Chunk.ArraySlice[Byte],
              offset: Int,

This is used to send the whole chunk into the inflater (inflater doesn't always accept the whole input because the output might overflow the output buffer, so this slice is being re-used, and the offset within that slice is increased as needed).

So I wondered whether I should maybe just use the Chunk and .takeRight(...) instead.

(changes: https://github.com/yurique/fs2/pull/2/files)

Got a rather surprising result:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt    Score    Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3   41.563 ±  1.225  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3  234.662 ± 25.051  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  374.088 ±  8.450  ops/s  <----
[info] CompressionBenchmark.gunzip               false  thrpt    3  601.324 ±  5.705  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3   41.181 ±  0.693  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3  223.710 ± 29.040  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  422.174 ± 16.779  ops/s  <----
[info] CompressionBenchmark.inflate              false  thrpt    3  697.039 ±  4.659  ops/s

@armanbilge
Member

@yurique so if I'm reading those benchmarks correctly, gunzip performance decreased but inflate stayed the same. IIRC the difference is that gunzip calculates the CRC for verification but inflate does not (apologies if I remember incorrectly). So could it be related to that somehow?

@yurique
Contributor Author
yurique commented Jan 24, 2022

@armanbilge yes, that's right – gunzip "requests" the calculation of the crc and the trailer.

I was thinking "those parts haven't changed at all either", but your comment reminded me of something, so I looked through it again. And there is a subtle difference:

if (track) crcBuilder.update(inflatedChunk)

So now inflatedChunk is a Chunk, and crcBuilder.update calls .toArraySlice under the hood. Before, there was no .toArraySlice involved, as the inflater output buffer was available as an Array[Byte]. This must be it.

I changed it so that the ChunkInflater returns its buffer + length without wrapping in a chunk, so there's no extra .toArraySlice when calling crcBuilder.update.
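
I.e. roughly this change of shape (hypothetical sketch):

import fs2.{Chunk, INothing, Pull}

// revised shape: hand back the raw buffer and the inflated length, so the
// CRC can be updated straight from the Array[Byte] without wrapping it in
// a Chunk (and thus without .toArraySlice)
trait ChunkInflater[F[_]] {
  def inflateChunk(
      bytesChunk: Chunk[Byte]
  ): Pull[F, INothing, (Array[Byte], Int, Int, Boolean)]
  // = (buffer, inflatedLength, remainingInput, finished)
}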

And the benchmark is almost back to where it was:

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score     Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    41.041 ±   4.382  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   231.742 ±  18.002  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  1971.619 ±  67.197  ops/s
[info] CompressionBenchmark.gunzip               false  thrpt    3   646.719 ±   6.831  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    40.631 ±   2.349  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   221.801 ±  35.506  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2408.610 ± 173.101  ops/s
[info] CompressionBenchmark.inflate              false  thrpt    3   690.958 ±  10.401  ops/s

But I'm really surprised: the Chunk is created with Chunk.array(target, 0, length) – I'd assumed .toArraySlice would be nearly a no-op in such a case. I guess it's not :)

  def toArraySlice[O2 >: O](implicit ct: ClassTag[O2]): Chunk.ArraySlice[O2] =
    this match {
      case as: Chunk.ArraySlice[_] if ct.wrap.runtimeClass eq as.getClass =>
        as.asInstanceOf[Chunk.ArraySlice[O2]]
      case _ => Chunk.ArraySlice(toArray, 0, size)
    }

(If I'm reading that guard right, ct.wrap.runtimeClass is the Array[Byte] class while as.getClass is the ArraySlice class itself, so for a Chunk[Byte] the first case never matches and toArray copies every time.)

But this is good! I'll try finishing the JS implementation.

@armanbilge
Member

Nice! 🎉 And very interesting about toArraySlice 🤔

@yurique
Contributor Author
yurique commented Jan 24, 2022

@armanbilge so I managed to make a JS implementation – it was hell, but it works =)

The benchmark results are the same as before the JS implementation (as they should be – I added a couple of things to the shared inflate implementation, but those don't kick in on the JVM).

[info] Benchmark                     (withRandomBytes)   Mode  Cnt     Score     Error  Units
[info] CompressionBenchmark.deflate               true  thrpt    3    41.846 ±   0.349  ops/s
[info] CompressionBenchmark.deflate              false  thrpt    3   234.625 ±  14.218  ops/s
[info] CompressionBenchmark.gunzip                true  thrpt    3  2015.259 ± 119.846  ops/s
[info] CompressionBenchmark.gunzip               false  thrpt    3   655.609 ±  10.494  ops/s
[info] CompressionBenchmark.gzip                  true  thrpt    3    41.318 ±   1.437  ops/s
[info] CompressionBenchmark.gzip                 false  thrpt    3   225.468 ±  14.121  ops/s
[info] CompressionBenchmark.inflate               true  thrpt    3  2413.895 ± 354.697  ops/s
[info] CompressionBenchmark.inflate              false  thrpt    3   697.494 ±   5.701  ops/s

As a result, the inflate and gzip implementations are shared between JS and the JVM, apart from the platform-specific interface to the inflater.

Changes are here: https://github.com/yurique/fs2/pull/1/files

jvm chunk inflater: https://github.com/yurique/fs2/pull/1/files#diff-8c2b0bb4111ef230a00138813c001efa1fa187ee4c6f43cf514e0074b570e084R147-R181

js chunk inflater: https://github.com/yurique/fs2/pull/1/files#diff-1b982c5a5c6bd8cdbb47ca7c2114327cb202075ee95749b8024d7078b6fa7490R73-R207

shared code: https://github.com/yurique/fs2/pull/1/files#diff-e7f835b1438d510176b8ad23b915f2c27e152bc2768847265da54e9ce98acb50R33
It's mostly the same as the JVM implementation was before this change, but with an additional pull "stage" – drain – which is only invoked in JS; it reads the output without writing to the inflater (when there's nothing more to write) until the reader's end event is emitted.
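
The drain stage is roughly this shape (a hypothetical sketch; the real code waits for the reader's end event rather than a finished flag):

import fs2.{Chunk, Pull}

// keep requesting output without feeding new input, until done
def drainStage[F[_]](inflater: ChunkInflater[F]): Pull[F, Byte, Unit] =
  inflater.inflateChunk(Chunk.empty[Byte]).flatMap { case (inflated, _, finished) =>
    Pull.output(inflated) >> (if (finished) Pull.done else drainStage(inflater))
  }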

@armanbilge
Member

Haven't had a chance to look yet but just wanted to 🎉 that's amazing!

@armanbilge
Member

I'm really sorry, has this PR just been waiting for my review? 😬 😅 I'll try and resolve the merge conflicts.

@armanbilge
Member

Oh I see, there are more changes in a PR against this PR :)

@yurique
Contributor Author
yurique commented Apr 22, 2022

I'm trying to recollect what was going on here.

I think the PR against this PR was an "experiment" that turned out to be a successful one – more shared code between JS and JVM, and without a performance hit. But I can't remember all the details off the top of my head now :'(

@armanbilge
Member

I remember :) you were able to implement a common Pull-based inflater with both the JDK and Node.js APIs which (finally!) wasn't a performance regression on the JVM 🎉 Then everything else could be expressed in terms of that.

I think that's the hardest part of this PR 😅 so thanks for your big effort on it!

IIRC some remaining things are dealing with bin-compatibly changing the API to use FiniteDuration instead of java.time, and also platforming the OS-detection stuff. These are annoying, but doable.

The other thing I remember being frustrating was the poor performance of the CRC calculation on JS, particularly when using the scodec-bits implementation. This PR inlines a mutable implementation which works better (I can't remember how much better). I wish we didn't have to take that on, but I'm not sure there's another way, since Node.js doesn't expose native methods for calculating the CRC.
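
For context, a table-driven CRC-32 in that mutable style looks roughly like this (a sketch of the standard algorithm with the reflected polynomial 0xEDB88320, not the code inlined by this PR):

// standard table-driven CRC-32; illustrative only
final class Crc32Builder {
  private[this] val table: Array[Int] = Array.tabulate(256) { i =>
    var c = i
    var k = 0
    while (k < 8) {
      c = if ((c & 1) != 0) 0xedb88320 ^ (c >>> 1) else c >>> 1
      k += 1
    }
    c
  }
  private[this] var crc: Int = 0xffffffff

  def update(bytes: Array[Byte], offset: Int, length: Int): Unit = {
    var i = offset
    val end = offset + length
    while (i < end) {
      crc = table((crc ^ bytes(i)) & 0xff) ^ (crc >>> 8)
      i += 1
    }
  }

  def value: Long = (~crc).toLong & 0xffffffffL
}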

@armanbilge
Member

So I took another look at how http4s is using the Compression trait, especially because I was wondering how the Node.js native HTTP client supports compression without all this functionality; e.g. see:
https://nodejs.org/api/zlib.html#compressing-http-requests-and-responses

It seems none of the http4s gzip middlewares use the additional headers such as file name or modification time; they just interface directly with the underlying byte stream:

https://github.com/http4s/http4s/blob/b3220ab944042a6729ecdcbbd2b8a034250e98f5/client/jvm/src/main/scala/org/http4s/client/middleware/GZip.scala#L65

https://github.com/http4s/http4s/blob/b3220ab944042a6729ecdcbbd2b8a034250e98f5/server/jvm/src/main/scala/org/http4s/server/middleware/GZip.scala#L78

IIUC it should be possible to support these http4s use-cases just by shelling out directly to the Node.js APIs, as in the example there. Not to discount your huge efforts on this PR :( but maybe the easy win here for http4s is to add new cross-platform APIs to Compression for gzipping without the optional headers. I think this would also be more performant, since calculating CRCs in raw JS was a bottleneck.

Meanwhile, I see there is an (unfortunately inactive) feature request for Node.js itself to add native support for these optional headers. If that ever materializes, we would be able to implement these APIs much more easily and with much better performance.

What do you think?

@armanbilge armanbilge mentioned this pull request Jul 27, 2022