-
-
Notifications
You must be signed in to change notification settings - Fork 8
Add asciidtype #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add asciidtype #10
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a brief look through if I saw anything. I wonder if it would make sense to make the scalar just a subclass of str
(i.e. on the path to having no scalar).
The main problem with that is NumPy's insistence to convert 0-D arrays to scalars (so long we enforce getitem not having a scalar is probably even fine).
int contig = (strides[0] == ((ASCIIDTypeObject *)descrs[0])->size * | ||
sizeof(char) && | ||
strides[1] == ((ASCIIDTypeObject *)descrs[1])->size * | ||
sizeof(char)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that your contig loop also assumes descrs[0]->size == descrs[1]->size
which is not checked below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was planning to relax that requirement on a second pass, but added a hard error for now.
*out = *in; | ||
out++; | ||
in++; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could even use plain memcpy
if you like.
asciidtype/asciidtype/src/casts.c
Outdated
*out_loop = (PyArrayMethod_StridedLoop *)&ascii_to_ascii_strided; | ||
} | ||
else { | ||
*out_loop = (PyArrayMethod_StridedLoop *)&ascii_to_ascii_unaligned; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aligment of char
will be 1, so while I think I currently sometimes pass "unaligned" to not bother figuring out alignment, you could just use the same loop for both. I suppose you can also argue that char isn't standardized to 1 byte :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still not clear to me how I'd actually reach the unaligned loop at all. What sort of python-level operation would trigger it? Not necessarily for ASCIIDType if that makes the question easier to answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its hard, but there are four paths right now how you can reach unaligned casts:
- I think I had some scalar paths where I didn't bother checking alignment and just calling the unaligned version). (I have to double check this, but I think it happens for casts between dtypes when assigning a single value.)
- You can manage it by manually ensuring you have unaligned data ;). But that is tricky, e.g. with
np.zeros(1001, dtype="uint8")[1:].view("U10")
. (EDIT: sorry forgot the uint8) - Structured dtypes by default use packed structs rather than aligned ones. So if you do
dtype=[("a", "?"), ("b", "U5")]
than the"b"
field is unaligned. (EDIT: this is the default, you can sayalign=True
). - Structured dtypes are a mess and they (currently) do not even try to pass on correct alignment to the field dtypes when necessary (i.e. in casts). That is something that can be done, but I really would like to pass through the actual alignment, rather than just "its all aligned". (MAINT: Figure out alignment for complex loops numpy#17359)
The last point is the main one really, the second one is rather theoretical... Which is why my suggestion to allow flagging "do not allow unaligned".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Having such a flag, with the default to not allow unaligned, would certainly reduce the conceptual overhead quite a bit.
common_instance(ASCIIDTypeObject *dtype1, ASCIIDTypeObject *dtype2) | ||
{ | ||
if (!PyObject_RichCompareBool((PyObject *)dtype1, (PyObject *)dtype2, | ||
Py_EQ)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N.B.: This is actually the same as checking whether your resolver returns NO_CASTING
or EQUIV_CASTING
.
dtype = ASCIIDType(7) | ||
arr = np.array(["hello", "this", "is", "an", "array"], dtype=dtype) | ||
assert repr(arr) == ( | ||
"array(['hello', 'this', 'is', 'an', 'array'], dtype=ASCIIDType(7))") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have length detection like the built-in unicode/byte types:
arr = np.array(["hello", "this", "is", "an", "array"], dtype=ASCIIDType)
assert arr.dtype == ASCIIDType(5)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to say yes, but currently the dtype=ASCIIDType
isn't passed through correctly, it probably works for np.core.multiarray._multiarray_umath._discover_array_parameters()
, though.
Passing through the DType type/class rather than just the dtype is something we/I need to look into... (A similar alternative might be to introduce abstract instances, that cannot be attached to an array, but represent the same thing as ASCIIDType).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened numpy/numpy#22756 to track this, seems like it would be useful.
Adds a fixed-width ASCII dtype. This is still pretty bare-bones but I figured it was worth getting in the repo in case anyone wants to look at a string dtype. I'll be continuing to work on this.
Going to leave this open until tomorrow for reviews and then self-merge then if I don't get any reviews, assuming tests pass.