🚀 The feature, motivation and pitch
This is useful to avoid copies related to copy-on-write (actually copy-on-read, since Python's refcount updates dirty the shared pages) problems with DataLoader: #13246. Typical applications: lists of file names or file paths in a dataset (avoiding the creation of hundreds of thousands or millions of Python string objects on the heap), and string-to-token lookup tables.
For fixed-size characters (ASCII, UTF-32) there is my prototype in https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57 , but other designs can be considered. In essence, this initially means some fast, parallelized APIs for conversion from Python string lists and for accessing individual elements / sub-lists (maybe parallelized string encoding/decoding). Different storage formats can be envisaged: e.g. a fixed-length string array (with some null-byte padding) or a no-padding packed format as in the gist above. I think that for practical use in compacting strings in datasets the no-padding format is needed (although for parallelized hashing the fixed-length strings may be easier). It should also be decided whether (stable) hashes can be precomputed/cached/stored along with the strings.
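To make the packed format concrete, here is a minimal sketch (the helper names are made up for illustration only, not an existing or proposed API surface): the whole string collection lives in a flat uint8 data tensor plus an int64 offsets tensor, so no per-string Python objects need to stay alive.

```python
import torch

def pack_strings(strings, encoding="utf_8"):
    # Encode each string once and concatenate into a single flat uint8 buffer.
    encoded = [s.encode(encoding) for s in strings]
    data = torch.frombuffer(bytearray(b"".join(encoded)), dtype=torch.uint8)
    # offsets[i] .. offsets[i + 1] delimit the i-th string inside `data`.
    lengths = torch.tensor([len(b) for b in encoded], dtype=torch.int64)
    offsets = torch.cat([torch.zeros(1, dtype=torch.int64), lengths.cumsum(0)])
    return data, offsets

def unpack_string(data, offsets, i, encoding="utf_8"):
    lo, hi = int(offsets[i]), int(offsets[i + 1])
    # Decoding allocates a Python str only for the single requested element.
    return data[lo:hi].numpy().tobytes().decode(encoding)

strings = ["train/000001.jpg", "train/000002.jpg", "val/000001.jpg"]
data, offsets = pack_strings(strings)
assert unpack_string(data, offsets, 2) == "val/000001.jpg"
```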
This would be useful for fast column-based string processing or for mmap'ing dataset files. NumPy / HDF5 / Apache Arrow / Parquet / data frame libraries probably also have some support along these lines.
It seems that torcharrow.StringColumn might implement this. I think it is worth moving a string list class like this into core. Maybe even something more lightweight: a Tensor subclass, or just methods for working with string-array-holding uint8/int16/int32 tensors, because this is very useful for working around #13246 and, more generally, for more economical/parallelized basic manipulation of strings and file paths.
Useful string functions to include are parallelized string hashing methods that are stable (e.g. hashing all of the strings in the array at once). These could then be used for fast hash table construction / key hash computation. Another useful concept is "string lists" that allow appends (with some exponential storage reallocation): #64359
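As an illustration of what "hash all of the strings at once" could look like, here is a hedged sketch using the fixed-length padded layout and 32-bit FNV-1a (the algorithm choice and helper names are mine, for illustration only): the loop runs over the maximum string length, so all strings advance in lockstep with plain tensor ops instead of per-string Python loops, and the result is stable across runs and processes.

```python
import torch

def fnv1a_batched(padded, lengths):
    # `padded`: (num_strings, max_len) uint8 tensor, zero-padded on the right.
    # `lengths`: (num_strings,) int64 tensor of true byte lengths.
    # 32-bit FNV-1a computed column by column over all strings at once.
    h = torch.full((padded.shape[0],), 0x811C9DC5, dtype=torch.int64)
    for col in range(padded.shape[1]):
        byte = padded[:, col].to(torch.int64)
        active = col < lengths                          # ignore padding positions
        mixed = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF  # xor, multiply by FNV prime, mod 2**32
        h = torch.where(active, mixed, h)
    return h

strings = [b"cat", b"dog", b"mouse"]
max_len = max(len(s) for s in strings)
padded = torch.zeros(len(strings), max_len, dtype=torch.uint8)
for i, s in enumerate(strings):
    padded[i, :len(s)] = torch.tensor(list(s), dtype=torch.uint8)
lengths = torch.tensor([len(s) for s in strings])
print(fnv1a_batched(padded, lengths))  # same values every run / in every worker
```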
Related issues on "zero-copy": #43949, #33041, #34651 (about getting a bytes view over a sub-tensor, which can be useful as an ASCII string substitute and, in general, for zero-copy PyTorch interop; I wonder if Python has some native string views over UTF-8 strings?). It seems that there is even an option to hack around the CPython PyUnicode structure and create a "view" over char bytes (stored in a tensor) without any char-byte copies (although it is maybe not very safe): python/cpython#104689 , https://stackoverflow.com/questions/76291943/create-a-python-string-from-a-native-pointer-without-char-buffer-copy
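For the bytes-view part specifically, here is a small illustration (not a proposal for the API shape) of what is already possible today: a memoryview over a uint8 sub-tensor's storage gives a copy-free window into the bytes, while any decode to a Python str still has to copy.

```python
import torch

data = torch.frombuffer(bytearray(b"train/000001.jpg\x00val/000001.jpg"),
                        dtype=torch.uint8)

# Zero-copy: the memoryview aliases the tensor's storage, no bytes are duplicated.
view = memoryview(data.numpy())[0:16]

# Any decode to a Python str necessarily copies the character data.
print(bytes(view).decode("ascii"))  # 'train/000001.jpg'
```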
Going further, maybe some simplistic dataframe class could be added to PyTorch (a tuple of tensors with an equal leftmost dim). These dataframes would primarily be used for simple dataset serialization/deserialization, filtering and transformation. Ideally, a dataframe should support two modes of serialization: array-of-structs and column-based. Imagine having a list of COCO per-image annotation objects and just giving it to some sort of dataframe constructor (maybe along with some schema/spec) and getting back a set of column tensors (with some helper accessor methods). This dataframe could be scattered without copies to DataLoader workers. Native CUDA-accelerated basic CSV parsing could also be nice (especially if combined with mmap-based file reading?). I can see that this is implemented by torcharrow; maybe it is time to move some of its core structures to core?
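A minimal sketch of the "dict/tuple of equal-leftmost-dim column tensors" idea (the class and method names are hypothetical): the constructor takes a list of per-record dicts of primitive values, roughly in the spirit of COCO per-image records, and produces one tensor per column that supports row access and filtering without per-row Python objects.

```python
import torch

class ColumnFrame:
    """Columns are tensors sharing the same leftmost dimension."""

    def __init__(self, records):
        # records: list of dicts of primitive (int/float) values.
        keys = records[0].keys()
        self.columns = {k: torch.tensor([r[k] for r in records]) for k in keys}
        self.num_rows = len(records)

    def __len__(self):
        return self.num_rows

    def __getitem__(self, i):
        # Row access materializes a small dict; column access stays zero-copy.
        return {k: v[i] for k, v in self.columns.items()}

    def filter(self, mask):
        out = object.__new__(ColumnFrame)
        out.columns = {k: v[mask] for k, v in self.columns.items()}
        out.num_rows = int(mask.sum())
        return out

records = [
    {"image_id": 1, "width": 640, "height": 480},
    {"image_id": 2, "width": 1024, "height": 768},
]
frame = ColumnFrame(records)
wide = frame.filter(frame.columns["width"] > 800)
print(len(wide), wide[0])
```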
Discussion of conversion of nested structures to columns:
- https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
- https://research.google/pubs/pub36632/
Maybe some simple nested schemas can be supported first:
- array of dicts of primitive types
- array of dicts with nested arrays of dicts of primitive types
These might be enough to represent data annotation schemas of common datasets (?)
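To illustrate the second schema, here is a hedged sketch of columnarizing an array of dicts with a nested array of dicts of primitives, in the Arrow/Dremel spirit of the links above (function and field names are illustrative only): children of all records are concatenated into shared child columns, and an offsets tensor records where each parent's children start and end.

```python
import torch

def columnarize_nested(records, child_key):
    # records: list of dicts; records[i][child_key] is a list of dicts of primitives,
    # e.g. per-image records each holding a list of per-object annotations.
    parent = {k: torch.tensor([r[k] for r in records])
              for k in records[0] if k != child_key}
    children = [c for r in records for c in r[child_key]]
    child = {k: torch.tensor([c[k] for c in children]) for k in children[0]}
    # offsets[i] .. offsets[i + 1] delimit record i's children in the child columns.
    counts = torch.tensor([len(r[child_key]) for r in records])
    offsets = torch.cat([torch.zeros(1, dtype=torch.int64), counts.cumsum(0)])
    return parent, child, offsets

records = [
    {"image_id": 1, "objects": [{"category": 3, "area": 120.0},
                                {"category": 7, "area": 64.5}]},
    {"image_id": 2, "objects": [{"category": 3, "area": 200.0}]},
]
parent, child, offsets = columnarize_nested(records, "objects")
# Children of image 0 live in the child columns at rows offsets[0]:offsets[1].
print(child["category"][offsets[0]:offsets[1]])  # tensor([3, 7])
```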