E520 add `Buffer.unsafe_internal_buffer` by c-cube · Pull Request #10982 · ocaml/ocaml · GitHub
[go: up one dir, main page]

Skip to content

add Buffer.unsafe_internal_buffer#10982

Closed
c-cube wants to merge 4 commits intoocaml:trunkfrom
c-cube:expose-internal-buffer-buffer
Closed

add Buffer.unsafe_internal_buffer#10982
c-cube wants to merge 4 commits intoocaml:trunkfrom
c-cube:expose-internal-buffer-buffer

Conversation

@c-cube
Copy link
Contributor
@c-cube c-cube commented Feb 2, 2022

This is something I've wanted to do for the past 8 years or so. Currently, Buffer.t is useful as a way to produce values of type string efficiently, but it falls short in other areas where access to the underlying byte buffer is required, because it provides no escape hatch at all. Functions like output_buffer, or add_int64_ne have to be hardcoded, and anything else will require an extra copy, which is problematic when the use of a buffer in the first place is to reduce allocation. If I wanted to read an int64 from a buffer, I'd have to extract to bytes first. If I want to use a Lwt_io function to write bytes to some channel, or use gzip bindings to compress the content of a buffer, or any other form of C bindings, or search for a delimiter (like \n), I have to copy/sub.

This is "unsafe" in the sense of "I know what I'm doing", but it doesn't violate any OCaml invariant. It's less risky than Bytes.unsafe_to_string, so I'm not entirely sure it deserves the prefix.

@nojb
Copy link
Contributor
nojb commented Feb 2, 2022

I understand the need, but my intuition is that this is not a good idea: for one, it leaks an implementation detail, but more to the point each issue mentioned above revolves around the possibility of reading from a Buffer.t, as opposed to appending to it (I am ignoring Buffer.add_int64_ne which already exists).

I would rather first discuss whether we agree to extend the API of Buffer.t with "read" operations. These could be exposed and implemented with the same API as string/bytes. Missing functions such as output_buffer could also be added, even though to be honest I am also wondering if one extra copy in Buffer.contents really justifies all these additions.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 2, 2022

@nojb I'm afraid I disagree on all points :).

  • Buffer is already readable, see nth. It's just not very efficient. Event extracting the content is, after all, a form of reading!
  • Buffer being backed by a bytes array isn't an implementation detail, it's a defining property. I don't think it'd bad to leak it, as a lot of existing functions rely on it for the performance guarantees. I would expect to_string to be one call to sub, add_slice to be an optional resize + blit, etc. so the argument that we could some day replace by a rope or something else seems moot to me.

@alainfrisch
Copy link
Contributor

Related: #6538, #2672, and also #6975 (comment).

While I like the idea of exposing a more powerful API to turn Buffer.t into proper mutable/extensible strings, I think we should indeed avoid exposing its internal implementation. It could for instance be considered in the future to switch to char bigarrays (esp. now that bigarray is part of the stdlib) which might be better suited than bytes to hold long-lived large strings, and which might allow faster "direct" I/O.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 2, 2022

@alainfrisch honestly, even moving to bigarray seems dubious to me. It's clear that allocating a lot of small buffers is fine, and people do it (even in, say, Format.asprintf — you have no choice in the new multicore world!), whereas doing the same with bigarrays will badly behave.

I think the backup buffer is an implementation detail, but the main one isn't. It's been around for decades so the implementation details have irreversibly leaked. relevant xkcd if one wanted to modify the implementation.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 2, 2022

Of course, I should add that I've regularly reimplemented my own crappy buffer because I needed access to the internals after some IO reads. Even calling String.find or String.index_from on the content is currently impossible.

@alainfrisch
Copy link
Contributor

It's clear that allocating a lot of small buffers is fine, and people do it

Which is not incompatible with automatically switching to char bigarray when resizing above a certain limit...

@c-cube
Copy link
Contributor Author
c-cube commented Feb 2, 2022
8000

Another remark:

@alainfrisch said:
which might allow faster "direct" I/O

how do you do that if you don't have access to the internal buffer? 🙂 . Or are you advocating for having these builtin in the stdlib?

To me that's a different, specialized type, of IO buffer with a long lifetime and guaranteed bigarray backing. You don't want your writev to fail because it's actually a small bytes buffer underneath instead of a bigstring, do you? Buffer.t already has tons of use cases that preclude a pure bigarray backing storage, and mixing the two seems too complicated.

Why not just make Buffer.t more useful in the situations where it already kind of works, instead of trying to make it into what it'll never be (i.e. a high performance, C-compatible, IO buffer)?

@xavierleroy
Copy link
Contributor
xavierleroy commented Feb 2, 2022

Why not just make Buffer.t more useful in the situations where it already kind of works, instead of trying to make it into what it'll never be (i.e. a high performance, C-compatible, IO buffer)?

Why don't you work out a design and an implementation of the "high-performance, C-compatible, IO buffer" or whatever you have in mind as a separate library, or as part of one of the alternate standard libraries around there ? The OCaml stdlib is not the place for experiments, but can benefit from such experiments.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 2, 2022

I'm not sure I understand your angle, @xavierleroy. I'm advocating to commit to the current design (which I think is arguably fixed in stone at this point), and provide access to the buffer's data as it is. On the other hand, my understanding is that @alainfrisch is advocating for keeping the internal buffer private so it could change in the future (and perhaps be the foundation for bigstring-backed IOs); I think this ship has sailed for a module that is 23 years old.

@xavierleroy
Copy link
Contributor

I probably quoted the wrong part of the discussion, sorry. It's the part where you mention " I've regularly reimplemented my own crappy buffer because I needed access to the internals after some IO reads" that made me think you could first design and implement your own non-crappy buffer that does what you want, then show it to the world. I would hope something could come out of this effort that would be nicer than just "let's expose the internal state of the existing abstraction".

@c-cube
Copy link
Contributor Author
c-cube commented Feb 2, 2022

I don't have a nicer abstraction, I don't think there can exist one unless we had rust's immutable slices, in which case it'd be val peek : t -> slice. Rust's buffer, called Vec<u8>, exposes that, and they can then leverage plenty of generic functions on slices (or iterators, obtainable from slices). My better buffer type is this PR 😇 .

A few examples of where I just redid my thing:

  • this to be able to write this
  • this so I could write this, although admittedly Buffer could fit the bill now (I don't remember). There's also a lot of terrible code in most of my projects that pairs bytes ref (or a mutable field) with one or two integers, precisely to represent a byte slice. Example in this project + camlzip.

As for more general IO buffers, we have biniou which long powered Yojson (although this should change with 2.0), and ocamlnet's netbuffer that is linked in a previous issue.

@dinosaure
Copy link

I don't think there can exist one unless we had rust's immutable slices, in which case it'd be val peek : t -> slice. Rust's buffer, called Vec, exposes that, and they can then leverage plenty of generic functions on slices (or iterators, obtainable from slices).

I'm not sure to understand this missing part where cstruct exists. It offers a large and stable API and, structurally, it's a slice of a bigarray.

@johnwhitington
Copy link
Contributor

Some discussion of reading integers, floats etc. from a Buffer without copying to a string here, and elsewhere in that issue:

#5443 (comment)

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

I'm not sure to understand this missing part where cstruct exists. It offers a large and stable API and, structurally, it's a slice of a bigarray.

these slices:

  • allocate.
  • work on bigarrays only, so, not valid in the present case. A lot of buffers, in my experience, are small and short lived (my go-to initial size is 32…), so you don't want to replace all buffers with cstruct (think of the impact on Format.asprintf, for example).

so I don't think it's a valid counterpoint.

@alainfrisch
Copy link
Contributor

Just to clarify my position: I'm not trying to push for a change of representation, but I don't think it's a good idea to prevent any such change in the future. Abstraction is useful!

I think that extending Buffer with all functions on bytes, also with buffer->buffer blitting, etc would cover most use cases. It's a bit of work, but it seems much more robust than leaking parts of the internal representation. (One case not covered : directly accessing the data from C bindings.)

@bluddy
Copy link
Contributor
bluddy commented Feb 3, 2022

Buffer is a high-performance data structure, and realistically its implementation is extremely unlikely to change. If we ever want a BigArray-backed Buffer, we'll have to make a new type. Just think of having to test the performance impact of that kind of change in the OCaml universe: it's realistically never ever going to happen. I therefore see nothing wrong with exposing the internals so as to let the data structure do what it was meant to do, which is to allow for high performance operations.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

I agree with @bluddy. Also, abstraction is useful, but not for such low-level components; one also expects bytes to be an array and not, say, a rope.

@alainfrisch could you sketch what a realistic change to Buffer.t's internals would look like — one that keeps the current level of performance for blit, sub, add_char, and the good GC behavior on potentially fitting on the minor heap? Even just a type definition would do. To me it's not possible, but I'd like to see what you have in mind.


I should have opened this years ago and I'm really frustrated by the pushback. I never wrote CCBuffer in containers just because I'm locked out of the implementation, and I'll end up using %identity if needs be. Why shouldn't we strive to share common vocabulary via the stdlib, on important things like byte buffers? It happened for iterators, either, unicode chars (which have of_int, btw ­— the implementation is fixed in practice in the exact same way).

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

As for the "missing functions": the de-facto blocking IO interfaces in OCaml are:

val read: bytes -> int -> int -> int
val write: bytes -> int -> int -> unit

You cannot add to Buffer.t all the IO functions that require access to this kind of buffer. It's hardcoded for in_channel and out_channel (which, of course, are not extensible, see RFC 19), but that's not helpful if you use anything else. It'd be possible to add with_buf : Buffer.t -> (bytes -> int -> int -> 'a) -> 'a but that's less convenient and less efficient than just giving access to the bytes.

@vphantom
Copy link
vphantom commented Feb 3, 2022

I'm a relatively new user to OCaml (since 2018 part-time, 2020 full-time) but I really like this PR. One of my growing pains has been to choose between bytes and Buffer.t for various uses, and whenever I did use Buffer.t I always had to blit to a bytes copy in order to read or especially to pass it along to other modules (ocaml-protoc to name just one). This happened enough that I recently wrote my own Buffer-like module that can use bytes as its storage in an open manner, to avoid the extra copying.

A Buffer.unsafe_to_bytes would do wonders to make the Stdlib more uniform: it would be clearer that Buffer would be the ideal structure to use to write incrementally into a growing memory area and that Bytes would be best suited for random-access reading and writing, thanks to that escape hatch. Just like the Bytes vs String unsafe conversions, this would require care from callers, since resizing the underlying buffer would invalidate the previous storage area. I don't think this would be unreasonable, given the obvious "unsafe" name.

@vphantom
Copy link
vphantom commented Feb 3, 2022

BTW, is there any reason this is called unsafe_internal_buffer instead of unsafe_to_bytes like we see elsewhere (i.e. String)?

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

The reason is twofold:

  • such names are usually bikeshed to death, so I didn't put too much effort in picking one
  • to_bytes exists and refers to the valid slice of the buffer. By contrast, this exposes the full underlying byte buffer.

@yallop
Copy link
Member
yallop commented Feb 3, 2022

I'd also prefer to keep Buffer.t abstract. Even if there are no immediate plans to switch to a more efficient internal representation, there doesn't seem to be a strong argument for making it impossible to switch in future.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

… Am I making no arguments here? There's been no plan to change it FOR THE 23 YEARS IT HAS EXISTED. no one has a suggestion of how it might change (the bigstring one would break memory behavior for so many programs it's not funny).

This is seriously the last time I even consider contributing to the stdlib.

@vphantom
Copy link
vphantom commented Feb 3, 2022

Since the proposed function is clearly labeled "unsafe" and that it would solve a problem, I think it would be perfectly reasonable to include it now and if in some distant future the underlying implementation changed, which is unlikely but not impossible, the function could become a harmless alias of to_bytes to avoid breaking existing code.

@alainfrisch
Copy link
Contributor

@alainfrisch could you sketch what a realistic change to Buffer.t's internals would look like — one that keeps the current level of performance for blit, sub, add_char, and the good GC behavior on potentially fitting on the minor heap?

Something like:

type buf = Bytes of bytes | BaString of (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type t =
 {mutable buffer : buf;
  mutable position : int;
  mutable length : int;
  initial_buffer : buf}

(with an unboxing annotation onf buf once it exists, because bytes and bigarrays have different tags.)

Are you concerned by the cost of the extra pattern-matching indirection on every access?

@dbuenzli
Copy link
Contributor
dbuenzli commented Feb 3, 2022

(the bigstring one would break memory behavior for so many programs it's not funny).

You can make that opt-in at buffer creation time.

if in some distant future the underlying implementation changed, which is unlikely but not impossible, the function could become a harmless alias of to_bytes to avoid breaking existing code.

That one would break memory behaviour for so many programs it's not funny :–)

In general the story on how we do bytes – bytes or Bigbytes.t, interaction with (runtime released or not) C code, and IO – is really not great at the moment.

For example it would be nice to be able to choose the backing store of a buffer abstraction so that buffer values can be used to encode to either bigbytes or bytes without client even knowing to what they are encoding. I'm not sure these kind of things can or should be retrofitted in Buffer and maybe this should go in an entirely new API, however for sure this PR closes a few options and I understand those who don't want to break the abstraction.

@gasche
Copy link
Member
gasche commented Feb 3, 2022

in_channel and out_channel (which, of course, are not extensible, see RFC 19)

Apropos this, do you actually have plans to implement something along the lines of RFC 19 and submitting it as a PR? I think this is the expectation, and I'm curious if you have any plans in this direction.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

Apropos this, do you actually have plans to implement something along the lines of RFC 19 and submitting it as a PR? I think this is the expectation, and I'm curious if you have any plans in this direction.

I don't have the patience, energy, or motivation to work on that, especially if it's to face 2 years of nitpicking on the PR afterwards. So, no, fuck it.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

I'll do my own buffer somewhere else, probably in containers, and let the stdlib do its thing.

@c-cube c-cube closed this Feb 3, 2022
@c-cube c-cube deleted the expose-internal-buffer-buffer branch February 3, 2022 20:50
@avsm
Copy link
Member
avsm commented Feb 3, 2022

@c-cube wrote:

There's been no plan to change it FOR THE 23 YEARS IT HAS EXISTED

It's just that in the 24th year, OCaml 5.0 happens to have non-relocating byte (above a certain size, the multicore runtime just allocates those values using malloc, and there is currently no compaction). So the distinction between Bytes.t and Bigbytes.t is a lot smaller than it used to be. A slightly unfortunate time to propose breaking abstraction in Buffer (not necessarily a bad idea in my book, but there are a lot of moving parts right now).

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

Are you concerned by the cost of the extra pattern-matching indirection on every access?

given nth exists, kind of, yes.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 3, 2022

@avsm bytes becoming better means that Buffer.t has even less reason to change, correct?

@c-cube
Copy link
Contributor Author
c-cube commented Feb 4, 2022

I experimented a bit with what @alainfrisch was suggesting, and here's a super basic comparison of add_char for buffers implemented with just bytes (current) and a sum type with either bytes or bigstring on godbolt.

add_char, which typically should be inlined (and is, according to the .cmx in my 4.12 opam switch), is arguably one of the most critical functions in Buffer. By adding a sum type to dynamically switch between bytes and bigstring (not included in the sample), we trade 1 conditional jump + 1 function call, for 3 conditional jumps + 2 jmp + 1 function call, and the assembly is 55 lines long instead of 28.

I'm really not an assembly expert. It doesn't look that bad, but it's still a likely slowdown for code that calls add_char a lot. It's probably better if one stores the capacity in the record, too.

edit: a bigger problem is probably the conditional jump in nth.

@alainfrisch
Copy link
Contributor
alainfrisch commented Feb 4, 2022

I also did a quick benchmark: bench.ml.txt

I cannot observe any difference between the two versions (note the inlining annotation on nth and add_char), and this doesn't even include a possible future optimization which would avoid an extra memory load in the new version (unboxing of the sum type, using the fact that bytes and bigarrays have different tags).

(Side note: I doubt that performance-critical code would rely a lot on nth!)

But my point is not so much about this possible change, but about keeping the possibility to consider changes in the future. This could be a radically different implementation when targetting Javascript, or keeping list of chunks above a certain size (which can avoid fragmentation when there is no compaction, and avoid copying large strings if compaction exists).

@avsm
Copy link
Member
avsm commented Feb 4, 2022

bytes becoming better means that Buffer.t has even less reason to change, correct?

Maybe, maybe not. It depends whether we add a compactor or not. Certainly, there's a lot of work to do to figure it out (some of which is happening in https://github.com/ocaml-multicore/eio). I'm not in a rush to break abstraction in Buffer now of all times.

@c-cube
Copy link
Contributor Author
c-cube commented Feb 4, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0