string type memory representation · Issue #158 · libdynd/libdynd · GitHub

string type memory representation #158

Open
mwiebe opened this issue Oct 31, 2014 · 1 comment
Comments

@mwiebe
Member
mwiebe commented Oct 31, 2014

DyND's string representation could use some refinement. Currently, there are two ways strings are represented: a NumPy-style fixed-size buffer, and a pooled allocation in which each string is a pair of pointers into that pool. The default string type is the latter, using the UTF-8 encoding. This has some slightly unintuitive consequences, the biggest being that the string acts as a "write once" type. This is fine for simple data conversions and some kinds of computations, but not for interactive manipulation or for algorithms that repeatedly append to or modify an array of strings.
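
To make the "write once" behaviour concrete, here is a minimal sketch of a pooled representation. It is only an illustration under stated assumptions: `pooled_string`, `string_pool`, and `append` are hypothetical names, not libdynd's actual API, and the pool reserves a fixed capacity up front so that earlier pointers stay valid.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not libdynd's actual API.
// A string element is a [begin, end) pair of pointers into a shared pool.
struct pooled_string {
    const char *begin;
    const char *end;
};

// A deliberately simple append-only pool. Capacity is reserved up front so
// that the vector never reallocates and earlier pointers stay valid.
class string_pool {
    std::vector<char> buf_;

public:
    explicit string_pool(std::size_t capacity) { buf_.reserve(capacity); }

    // Copies the bytes to the end of the pool and returns the pointer pair.
    pooled_string add(const char *data, std::size_t size) {
        assert(buf_.size() + size <= buf_.capacity());
        std::size_t offset = buf_.size();
        buf_.insert(buf_.end(), data, data + size);
        const char *start = buf_.data() + offset;
        return pooled_string{start, start + size};
    }

    // Total bytes consumed, including bytes no element refers to any more.
    std::size_t used() const { return buf_.size(); }
};

// An element cannot grow in place, because other strings may sit right after
// it in the pool. "Appending" therefore writes a whole new copy and leaves
// the old bytes behind as dead space.
pooled_string append(string_pool &pool, pooled_string s, const char *suffix) {
    std::string tmp(s.begin, s.end);
    tmp += suffix;
    return pool.add(tmp.data(), tmp.size());
}
```

In this sketch, appending `"de"` to a pooled `"abc"` leaves 8 bytes used in the pool rather than 5, which is why repeated modification of an array of such strings gets expensive.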

Some properties we would like DyND's string representation to have include:

  1. Heap allocation by default, but allow for pooled allocation and referring to strings inside other buffers.
  2. Support the small string optimization, so strings that fit in 15 or fewer bytes don't require a separate memory allocation.
  3. Have other string representations, like fixed-size buffers or various encodings, be expression types whose value type is the standard string type, and whose storage type is bytes[N] or bytes.
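
As a rough illustration of property 3, a fixed-size buffer can be treated purely as storage, with a pair of conversions to and from the standard string value. This sketch uses `std::string` as a stand-in for the standard string type and assumes NUL padding and UTF-8 data; `fixed_bytes`, `to_value`, and `to_storage` are hypothetical names, not part of libdynd.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a "bytes[N]" storage element that holds a
// NUL-padded string, as in a NumPy-style fixed-size buffer.
template <std::size_t N>
struct fixed_bytes {
    char data[N];
};

// Storage -> value: read up to the first NUL (or all N bytes if none).
template <std::size_t N>
std::string to_value(const fixed_bytes<N> &storage) {
    std::size_t len = 0;
    while (len < N && storage.data[len] != '\0') ++len;
    return std::string(storage.data, len);
}

// Value -> storage: copy the bytes and pad the rest with NULs.
// Returns false, leaving the storage untouched, if the value does not fit.
template <std::size_t N>
bool to_storage(const std::string &value, fixed_bytes<N> &storage) {
    if (value.size() > N) return false;
    std::memcpy(storage.data, value.data(), value.size());
    std::memset(storage.data + value.size(), 0, N - value.size());
    return true;
}
```

The point of the design is that code operating on strings only ever sees the value type; the fixed-size layout is confined to the two conversion functions.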

It may be desirable to have an additional "rope" type to represent enormous editable strings, but this is not an immediate priority.

The implementation changes to represent strings satisfying the desired properties are:

  1. Change the memory block allocation to have a heap vs pooled capability.
  2. Introduce the small string optimization. Make the storage be two 64-bit values on all platforms, with the last byte signalling whether it refers to an external buffer or data in the first 15 bytes. Some accounting for big/little endian must occur here.
  3. Change the fixedstring, etc. types to be "adapt" types, to fit them into a uniform adaptation mechanism. It is probably good to still have string[...] aliases in the type representation, for simple spellings of these types.
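
A minimal sketch of what change 2 could look like follows. The layout is an assumption made for illustration, not libdynd's actual one: the last byte holds the inline length (0 to 15) or a tag value marking an external buffer, and the external size is stored byte by byte in a fixed order, which is one way to sidestep the big/little endian question. `sso_string` and its members are hypothetical names, and copying is disabled to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical 16-byte layout:
//   inline:   bytes[0..14] hold the data, bytes[15] holds the length (0..15)
//   external: bytes[0..7] hold a pointer, bytes[8..14] a 56-bit size,
//             bytes[15] == external_tag
struct sso_string {
    static constexpr unsigned char external_tag = 0xFF;
    static constexpr std::size_t max_inline = 15;

    unsigned char bytes[16];

    sso_string(const char *data, std::size_t size) {
        std::memset(bytes, 0, sizeof bytes);
        if (size <= max_inline) {
            // Small string: no separate allocation.
            std::memcpy(bytes, data, size);
            bytes[15] = static_cast<unsigned char>(size);
        } else {
            // Large string: heap-allocate and store pointer plus size.
            char *heap = new char[size];
            std::memcpy(heap, data, size);
            std::memcpy(bytes, &heap, sizeof heap);  // pointer is at most 8 bytes
            std::uint64_t n = size;
            for (int i = 0; i < 7; ++i) {  // least significant byte first
                bytes[8 + i] = static_cast<unsigned char>(n & 0xFF);
                n >>= 8;
            }
            bytes[15] = external_tag;
        }
    }

    ~sso_string() {
        if (is_external()) delete[] external_ptr();
    }

    // Copying is left out of this sketch.
    sso_string(const sso_string &) = delete;
    sso_string &operator=(const sso_string &) = delete;

    bool is_external() const { return bytes[15] == external_tag; }

    std::size_t size() const {
        if (!is_external()) return bytes[15];
        std::uint64_t n = 0;
        for (int i = 6; i >= 0; --i) n = (n << 8) | bytes[8 + i];
        return static_cast<std::size_t>(n);
    }

    const char *data() const {
        return is_external() ? external_ptr()
                             : reinterpret_cast<const char *>(bytes);
    }

private:
    char *external_ptr() const {
        char *p;
        std::memcpy(&p, bytes, sizeof p);
        return p;
    }
};

static_assert(sizeof(sso_string) == 16, "two 64-bit words on all platforms");
```

If the storage were instead declared as two native 64-bit integers, the "last byte" would be the most significant byte of the second word on little-endian machines and the least significant byte on big-endian ones, which is presumably the accounting the item refers to.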
@jreback
Contributor
jreback commented Oct 31, 2014

your number 2 is effectively interning, so +1 on that.
In theory, strings < 5 in length should be handled differently, but that probably complicates things (as you can hold this in a single 64-bit pointer).
