8000 Add asciidtype by ngoldbaum · Pull Request #10 · numpy/numpy-user-dtypes · GitHub
[go: up one dir, main page]

Skip to content

Add asciidtype #10

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Dec 7, 2022
Merged

Add asciidtype #10

merged 9 commits into from
Dec 7, 2022

Conversation

ngoldbaum
Copy link
Member

Adds a fixed-width ASCII dtype. This is still pretty bare-bones but I figured it was worth getting in the repo in case anyone wants to look at a string dtype. I'll be continuing to work on this.

Going to leave this open until tomorrow for reviews and then self-merge then if I don't get any reviews, assuming tests pass.

Copy link
Member
@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a brief look through if I saw anything. I wonder if it would make sense to make the scalar just a subclass of str (i.e. on the path to having no scalar).
The main problem with that is NumPy's insistence to convert 0-D arrays to scalars (so long we enforce getitem not having a scalar is probably even fine).

int contig = (strides[0] == ((ASCIIDTypeObject *)descrs[0])->size *
sizeof(char) &&
strides[1] == ((ASCIIDTypeObject *)descrs[1])->size *
sizeof(char));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that your contig loop also assumes descrs[0]->size == descrs[1]->size which is not checked below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was planning to relax that requirement on a second pass, but added a hard error for now.

*out = *in;
out++;
in++;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could even use plain memcpy if you like.

*out_loop = (PyArrayMethod_StridedLoop *)&ascii_to_ascii_strided;
}
else {
*out_loop = (PyArrayMethod_StridedLoop *)&ascii_to_ascii_unaligned;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aligment of char will be 1, so while I think I currently sometimes pass "unaligned" to not bother figuring out alignment, you could just use the same loop for both. I suppose you can also argue that char isn't standardized to 1 byte :).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still not clear to me how I'd actually reach the unaligned loop at all. What sort of python-level operation would trigger it? Not necessarily for ASCIIDType if that makes the question easier to answer.

Copy link
Member
@seberg seberg Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its hard, but there are four paths right now how you can reach unaligned casts:

  • I think I had some scalar paths where I didn't bother checking alignment and just calling the unaligned version). (I have to double check this, but I think it happens for casts between dtypes when assigning a single value.)
  • You can manage it by manually ensuring you have unaligned data ;). But that is tricky, e.g. with np.zeros(1001, dtype="uint8")[1:].view("U10"). (EDIT: sorry forgot the uint8)
  • Structured dtypes by default use packed structs rather than aligned ones. So if you do dtype=[("a", "?"), ("b", "U5")] than the "b" field is unaligned. (EDIT: this is the default, you can say align=True).
  • Structured dtypes are a mess and they (currently) do not even try to pass on correct alignment to the field dtypes when necessary (i.e. in casts). That is something that can be done, but I really would like to pass through the actual alignment, rather than just "its all aligned". (MAINT: Figure out alignment for complex loops numpy#17359)

The last point is the main one really, the second one is rather theoretical... Which is why my suggestion to allow flagging "do not allow unaligned".

Copy link
Member Author
@ngoldbaum ngoldbaum Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Having such a flag, with the default to not allow unaligned, would certainly reduce the conceptual overhead quite a bit.

8000
common_instance(ASCIIDTypeObject *dtype1, ASCIIDTypeObject *dtype2)
{
if (!PyObject_RichCompareBool((PyObject *)dtype1, (PyObject *)dtype2,
Py_EQ)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B.: This is actually the same as checking whether your resolver returns NO_CASTING or EQUIV_CASTING.

dtype = ASCIIDType(7)
arr = np.array(["hello", "this", "is", "an", "array"], dtype=dtype)
assert repr(arr) == (
"array(['hello', 'this', 'is', 'an', 'array'], dtype=ASCIIDType(7))")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have length detection like the built-in unicode/byte types:

arr = np.array(["hello", "this", "is", "an", "array"], dtype=ASCIIDType)
assert arr.dtype == ASCIIDType(5)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to say yes, but currently the dtype=ASCIIDType isn't passed through correctly, it probably works for np.core.multiarray._multiarray_umath._discover_array_parameters(), though.

Passing through the DType type/class rather than just the dtype is something we/I need to look into... (A similar alternative might be to introduce abstract instances, that cannot be attached to an array, but represent the same thing as ASCIIDType).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened numpy/numpy#22756 to track this, seems like it would be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0