add `Buffer.unsafe_internal_buffer` by c-cube · Pull Request #10982 · ocaml/ocaml

c-cube · 2022-02-02T15:50:21Z

This is something I've wanted to do for the past 8 years or so. Currently, Buffer.t is useful as a way to produce values of type string efficiently, but it falls short in other areas where access to the underlying byte buffer is required, because it provides no escape hatch at all. Functions like output_buffer, or add_int64_ne have to be hardcoded, and anything else will require an extra copy, which is problematic when the use of a buffer in the first place is to reduce allocation. If I wanted to read an int64 from a buffer, I'd have to extract to bytes first. If I want to use a Lwt_io function to write bytes to some channel, or use gzip bindings to compress the content of a buffer, or any other form of C bindings, or search for a delimiter (like \n), I have to copy/sub.

This is "unsafe" in the sense of "I know what I'm doing", but it doesn't violate any OCaml invariant. It's less risky than Bytes.unsafe_to_string, so I'm not entirely sure it deserves the prefix.

nojb · 2022-02-02T17:09:10Z

I understand the need, but my intuition is that this is not a good idea: for one, it leaks an implementation detail, but more to the point each issue mentioned above revolves around the possibility of reading from a Buffer.t, as opposed to appending to it (I am ignoring Buffer.add_int64_ne which already exists).

I would rather first discuss whether we agree to extend the API of Buffer.t with "read" operations. These could be exposed and implemented with the same API as string/bytes. Missing functions such as output_buffer could also be added, even though to be honest I am also wondering if one extra copy in Buffer.contents really justifies all these additions.

c-cube · 2022-02-02T17:15:30Z

@nojb I'm afraid I disagree on all points :).

Buffer is already readable, see nth. It's just not very efficient. Event extracting the content is, after all, a form of reading!
Buffer being backed by a bytes array isn't an implementation detail, it's a defining property. I don't think it'd bad to leak it, as a lot of existing functions rely on it for the performance guarantees. I would expect to_string to be one call to sub, add_slice to be an optional resize + blit, etc. so the argument that we could some day replace by a rope or something else seems moot to me.

alainfrisch · 2022-02-02T17:24:50Z

Related: #6538, #2672, and also #6975 (comment).

While I like the idea of exposing a more powerful API to turn Buffer.t into proper mutable/extensible strings, I think we should indeed avoid exposing its internal implementation. It could for instance be considered in the future to switch to char bigarrays (esp. now that bigarray is part of the stdlib) which might be better suited than bytes to hold long-lived large strings, and which might allow faster "direct" I/O.

c-cube · 2022-02-02T17:32:39Z

@alainfrisch honestly, even moving to bigarray seems dubious to me. It's clear that allocating a lot of small buffers is fine, and people do it (even in, say, Format.asprintf — you have no choice in the new multicore world!), whereas doing the same with bigarrays will badly behave.

I think the backup buffer is an implementation detail, but the main one isn't. It's been around for decades so the implementation details have irreversibly leaked. relevant xkcd if one wanted to modify the implementation.

c-cube · 2022-02-02T17:34:12Z

Of course, I should add that I've regularly reimplemented my own crappy buffer because I needed access to the internals after some IO reads. Even calling String.find or String.index_from on the content is currently impossible.

alainfrisch · 2022-02-02T17:37:17Z

It's clear that allocating a lot of small buffers is fine, and people do it

Which is not incompatible with automatically switching to char bigarray when resizing above a certain limit...

c-cube · 2022-02-02T17:46:41Z

8000

Another remark:

@alainfrisch said:
which might allow faster "direct" I/O

how do you do that if you don't have access to the internal buffer? 🙂 . Or are you advocating for having these builtin in the stdlib?

To me that's a different, specialized type, of IO buffer with a long lifetime and guaranteed bigarray backing. You don't want your writev to fail because it's actually a small bytes buffer underneath instead of a bigstring, do you? Buffer.t already has tons of use cases that preclude a pure bigarray backing storage, and mixing the two seems too complicated.

Why not just make Buffer.t more useful in the situations where it already kind of works, instead of trying to make it into what it'll never be (i.e. a high performance, C-compatible, IO buffer)?

xavierleroy · 2022-02-02T17:48:59Z

Why not just make Buffer.t more useful in the situations where it already kind of works, instead of trying to make it into what it'll never be (i.e. a high performance, C-compatible, IO buffer)?

Why don't you work out a design and an implementation of the "high-performance, C-compatible, IO buffer" or whatever you have in mind as a separate library, or as part of one of the alternate standard libraries around there ? The OCaml stdlib is not the place for experiments, but can benefit from such experiments.

c-cube · 2022-02-02T18:00:10Z

I'm not sure I understand your angle, @xavierleroy. I'm advocating to commit to the current design (which I think is arguably fixed in stone at this point), and provide access to the buffer's data as it is. On the other hand, my understanding is that @alainfrisch is advocating for keeping the internal buffer private so it could change in the future (and perhaps be the foundation for bigstring-backed IOs); I think this ship has sailed for a module that is 23 years old.

xavierleroy · 2022-02-02T18:12:11Z

I probably quoted the wrong part of the discussion, sorry. It's the part where you mention " I've regularly reimplemented my own crappy buffer because I needed access to the internals after some IO reads" that made me think you could first design and implement your own non-crappy buffer that does what you want, then show it to the world. I would hope something could come out of this effort that would be nicer than just "let's expose the internal state of the existing abstraction".

c-cube · 2022-02-02T18:37:27Z

I don't have a nicer abstraction, I don't think there can exist one unless we had rust's immutable slices, in which case it'd be val peek : t -> slice. Rust's buffer, called Vec<u8>, exposes that, and they can then leverage plenty of generic functions on slices (or iterators, obtainable from slices). My better buffer type is this PR 😇 .

A few examples of where I just redid my thing:

this to be able to write this
this so I could write this, although admittedly Buffer could fit the bill now (I don't remember). There's also a lot of terrible code in most of my projects that pairs bytes ref (or a mutable field) with one or two integers, precisely to represent a byte slice. Example in this project + camlzip.

As for more general IO buffers, we have biniou which long powered Yojson (although this should change with 2.0), and ocamlnet's netbuffer that is linked in a previous issue.

dinosaure · 2022-02-03T08:48:32Z

I don't think there can exist one unless we had rust's immutable slices, in which case it'd be val peek : t -> slice. Rust's buffer, called Vec, exposes that, and they can then leverage plenty of generic functions on slices (or iterators, obtainable from slices).

I'm not sure to understand this missing part where cstruct exists. It offers a large and stable API and, structurally, it's a slice of a bigarray.

johnwhitington · 2022-02-03T11:40:32Z

Some discussion of reading integers, floats etc. from a Buffer without copying to a string here, and elsewhere in that issue:

#5443 (comment)

c-cube · 2022-02-03T13:05:32Z

I'm not sure to understand this missing part where cstruct exists. It offers a large and stable API and, structurally, it's a slice of a bigarray.

these slices:

allocate.
work on bigarrays only, so, not valid in the present case. A lot of buffers, in my experience, are small and short lived (my go-to initial size is 32…), so you don't want to replace all buffers with cstruct (think of the impact on Format.asprintf, for example).

so I don't think it's a valid counterpoint.

alainfrisch · 2022-02-03T13:33:53Z

Just to clarify my position: I'm not trying to push for a change of representation, but I don't think it's a good idea to prevent any such change in the future. Abstraction is useful!

I think that extending Buffer with all functions on bytes, also with buffer->buffer blitting, etc would cover most use cases. It's a bit of work, but it seems much more robust than leaking parts of the internal representation. (One case not covered : directly accessing the data from C bindings.)

bluddy · 2022-02-03T13:40:51Z

Buffer is a high-performance data structure, and realistically its implementation is extremely unlikely to change. If we ever want a BigArray-backed Buffer, we'll have to make a new type. Just think of having to test the performance impact of that kind of change in the OCaml universe: it's realistically never ever going to happen. I therefore see nothing wrong with exposing the internals so as to let the data structure do what it was meant to do, which is to allow for high performance operations.

c-cube · 2022-02-03T17:51:48Z

I agree with @bluddy. Also, abstraction is useful, but not for such low-level components; one also expects bytes to be an array and not, say, a rope.

@alainfrisch could you sketch what a realistic change to Buffer.t's internals would look like — one that keeps the current level of performance for blit, sub, add_char, and the good GC behavior on potentially fitting on the minor heap? Even just a type definition would do. To me it's not possible, but I'd like to see what you have in mind.

I should have opened this years ago and I'm really frustrated by the pushback. I never wrote CCBuffer in containers just because I'm locked out of the implementation, and I'll end up using %identity if needs be. Why shouldn't we strive to share common vocabulary via the stdlib, on important things like byte buffers? It happened for iterators, either, unicode chars (which have of_int, btw — the implementation is fixed in practice in the exact same way).

c-cube · 2022-02-03T17:55:35Z

As for the "missing functions": the de-facto blocking IO interfaces in OCaml are:

val read: bytes -> int -> int -> int
val write: bytes -> int -> int -> unit

You cannot add to Buffer.t all the IO functions that require access to this kind of buffer. It's hardcoded for in_channel and out_channel (which, of course, are not extensible, see RFC 19), but that's not helpful if you use anything else. It'd be possible to add with_buf : Buffer.t -> (bytes -> int -> int -> 'a) -> 'a but that's less convenient and less efficient than just giving access to the bytes.

vphantom · 2022-02-03T18:20:08Z

I'm a relatively new user to OCaml (since 2018 part-time, 2020 full-time) but I really like this PR. One of my growing pains has been to choose between bytes and Buffer.t for various uses, and whenever I did use Buffer.t I always had to blit to a bytes copy in order to read or especially to pass it along to other modules (ocaml-protoc to name just one). This happened enough that I recently wrote my own Buffer-like module that can use bytes as its storage in an open manner, to avoid the extra copying.

A Buffer.unsafe_to_bytes would do wonders to make the Stdlib more uniform: it would be clearer that Buffer would be the ideal structure to use to write incrementally into a growing memory area and that Bytes would be best suited for random-access reading and writing, thanks to that escape hatch. Just like the Bytes vs String unsafe conversions, this would require care from callers, since resizing the underlying buffer would invalidate the previous storage area. I don't think this would be unreasonable, given the obvious "unsafe" name.

vphantom · 2022-02-03T19:53:23Z

BTW, is there any reason this is called unsafe_internal_buffer instead of unsafe_to_bytes like we see elsewhere (i.e. String)?

c-cube · 2022-02-03T19:58:30Z

The reason is twofold:

such names are usually bikeshed to death, so I didn't put too much effort in picking one
to_bytes exists and refers to the valid slice of the buffer. By contrast, this exposes the full underlying byte buffer.

yallop · 2022-02-03T20:04:00Z

I'd also prefer to keep Buffer.t abstract. Even if there are no immediate plans to switch to a more efficient internal representation, there doesn't seem to be a strong argument for making it impossible to switch in future.

c-cube · 2022-02-03T20:12:41Z

… Am I making no arguments here? There's been no plan to change it FOR THE 23 YEARS IT HAS EXISTED. no one has a suggestion of how it might change (the bigstring one would break memory behavior for so many programs it's not funny).

This is seriously the last time I even consider contributing to the stdlib.

vphantom · 2022-02-03T20:23:13Z

Since the proposed function is clearly labeled "unsafe" and that it would solve a problem, I think it would be perfectly reasonable to include it now and if in some distant future the underlying implementation changed, which is unlikely but not impossible, the function could become a harmless alias of to_bytes to avoid breaking existing code.

alainfrisch · 2022-02-03T20:39:06Z

@alainfrisch could you sketch what a realistic change to Buffer.t's internals would look like — one that keeps the current level of performance for blit, sub, add_char, and the good GC behavior on potentially fitting on the minor heap?

Something like:

type buf = Bytes of bytes | BaString of (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type t =
 {mutable buffer : buf;
  mutable position : int;
  mutable length : int;
  initial_buffer : buf}

(with an unboxing annotation onf buf once it exists, because bytes and bigarrays have different tags.)

Are you concerned by the cost of the extra pattern-matching indirection on every access?

dbuenzli · 2022-02-03T20:47:47Z

(the bigstring one would break memory behavior for so many programs it's not funny).

You can make that opt-in at buffer creation time.

if in some distant future the underlying implementation changed, which is unlikely but not impossible, the function could become a harmless alias of to_bytes to avoid breaking existing code.

That one would break memory behaviour for so many programs it's not funny :–)

In general the story on how we do bytes – bytes or Bigbytes.t, interaction with (runtime released or not) C code, and IO – is really not great at the moment.

For example it would be nice to be able to choose the backing store of a buffer abstraction so that buffer values can be used to encode to either bigbytes or bytes without client even knowing to what they are encoding. I'm not sure these kind of things can or should be retrofitted in Buffer and maybe this should go in an entirely new API, however for sure this PR closes a few options and I understand those who don't want to break the abstraction.

gasche · 2022-02-03T20:48:04Z

in_channel and out_channel (which, of course, are not extensible, see RFC 19)

Apropos this, do you actually have plans to implement something along the lines of RFC 19 and submitting it as a PR? I think this is the expectation, and I'm curious if you have any plans in this direction.

c-cube · 2022-02-03T20:49:57Z

Apropos this, do you actually have plans to implement something along the lines of RFC 19 and submitting it as a PR? I think this is the expectation, and I'm curious if you have any plans in this direction.

I don't have the patience, energy, or motivation to work on that, especially if it's to face 2 years of nitpicking on the PR afterwards. So, no, fuck it.

c-cube · 2022-02-03T20:50:19Z

I'll do my own buffer somewhere else, probably in containers, and let the stdlib do its thing.

avsm · 2022-02-03T20:52:34Z

@c-cube wrote:

There's been no plan to change it FOR THE 23 YEARS IT HAS EXISTED

It's just that in the 24th year, OCaml 5.0 happens to have non-relocating byte (above a certain size, the multicore runtime just allocates those values using malloc, and there is currently no compaction). So the distinction between Bytes.t and Bigbytes.t is a lot smaller than it used to be. A slightly unfortunate time to propose breaking abstraction in Buffer (not necessarily a bad idea in my book, but there are a lot of moving parts right now).

c-cube · 2022-02-03T20:52:43Z

Are you concerned by the cost of the extra pattern-matching indirection on every access?

given nth exists, kind of, yes.

c-cube · 2022-02-03T21:07:12Z

@avsm bytes becoming better means that Buffer.t has even less reason to change, correct?

c-cube · 2022-02-04T01:28:16Z

I experimented a bit with what @alainfrisch was suggesting, and here's a super basic comparison of add_char for buffers implemented with just bytes (current) and a sum type with either bytes or bigstring on godbolt.

add_char, which typically should be inlined (and is, according to the .cmx in my 4.12 opam switch), is arguably one of the most critical functions in Buffer. By adding a sum type to dynamically switch between bytes and bigstring (not included in the sample), we trade 1 conditional jump + 1 function call, for 3 conditional jumps + 2 jmp + 1 function call, and the assembly is 55 lines long instead of 28.

I'm really not an assembly expert. It doesn't look that bad, but it's still a likely slowdown for code that calls add_char a lot. It's probably better if one stores the capacity in the record, too.

edit: a bigger problem is probably the conditional jump in nth.

alainfrisch · 2022-02-04T08:59:15Z

I also did a quick benchmark: bench.ml.txt

I cannot observe any difference between the two versions (note the inlining annotation on nth and add_char), and this doesn't even include a possible future optimization which would avoid an extra memory load in the new version (unboxing of the sum type, using the fact that bytes and bigarrays have different tags).

(Side note: I doubt that performance-critical code would rely a lot on nth!)

But my point is not so much about this possible change, but about keeping the possibility to consider changes in the future. This could be a radically different implementation when targetting Javascript, or keeping list of chunks above a certain size (which can avoid fragmentation when there is no compaction, and avoid copying large strings if compaction exists).

avsm · 2022-02-04T08:59:50Z

bytes becoming better means that Buffer.t has even less reason to change, correct?

Maybe, maybe not. It depends whether we add a compactor or not. Certainly, there's a lot of work to do to figure it out (some of which is happening in https://github.com/ocaml-multicore/eio). I'm not in a rush to break abstraction in Buffer now of all times.

c-cube · 2022-02-04T13:54:37Z

That's alright, I'm done with this issue anyway. It's clear that Buffer isn't designed for my use cases beyond "append*; contents". I'll write my own with stronger guarantees, fragmentation notwithstanding. I'd have loved to propose a standard vector implementation at some point, but I feel like yesterday: not going to happen.

add Buffer.unsafe_internal_buffer

0bd1829

alainfrisch added the stdlib label Feb 2, 2022

c-cube added 3 commits February 3, 2022 14:36

shorter lines

14d5e66

add changes entry

9658dec

changes: shorter lines

4afac52

c-cube closed this Feb 3, 2022

c-cube deleted the expose-internal-buffer-buffer branch February 3, 2022 20:50

c-cube mentioned this pull request Sep 28, 2022

add Dynarray to the stdlib. #11563

Closed

gasche mentioned this pull request Feb 21, 2024

add Buffer.iter_chunks #12931

Open

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants