-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
ENH: Allow genfromtxt to unpack structured arrays #16650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a couple minor things. Thanks @AndrewEckart !
numpy/lib/npyio.py
Outdated
if unpack: | ||
return output.squeeze().T | ||
return output.squeeze() | ||
if names is not None and len(names) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the len(names) > 1
condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the goal is to match squeeze
, then len(names) != 1
would work better for when names == []
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change it, but I think names == []
necessarily implies output
is empty (i.e. the input was an empty file).
Consider the following two calls:
np.genfromtxt(StringIO(''), dtype=None, unpack=True)
np.genfromtxt(StringIO(''), dtype={'names': (), 'formats': ()}, unpack=True)
With len(names > 1)
, both will return an empty ndarray
, but with len(names) != 1)
, both will return an empty list.
numpy/lib/tests/test_io.py
Outdated
def test_unpack_auto_dtype(self): | ||
# Regression test for gh-4341 | ||
# Unpacking should work when dtype=None | ||
txt = TextIO("21 72\n35 58") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM! would be also useful to add a test with a mix of integer and string to make sure that the field array dtypes are being discerned correctly.
numpy/lib/tests/test_io.py
Outdated
# even if dtype is structured | ||
txt = TextIO("1") | ||
dt = {'names': ('a',), 'formats': ('<i4',)} | ||
x = np.genfromtxt(txt, dtype=dt, unpack=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what I get with this example (using StringIO
instead of the internal TextIO
):
In [37]: dt = np.dtype({'names': ('a',), 'formats': ('<i4',)})
In [38]: x = np.genfromtxt(StringIO('1'), dtype=dt, unpack=True)
In [39]: x
Out[39]: array((1,), dtype=[('a', '<i4')])
I don't think the dtype
of the return value should still be the structured dtype in this case--unpack=True
is supposed to unpack the fields. So I expect this output to be array(1, dtype=int32)
. In other words, the structure should still be pulled apart even when len(names) == 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I modified the logic to return the single unpacked ndarray when len(names) == 1
and a list of ndarrays when len(names > 1)
, which I think is a little bit cleaner. Technically, len(names) == 0)
is still possible with an empty input file and a structured dtype, but this already gives a UserWarning.
Previously, `genfromtxt` failed to transpose its output when `unpack=True` and `dtype` was structured (or `dtype=None` and the inferred `dtype` was structured). This patch resolves the issue by returning a list of arrays, as in `loadtxt`. Includes tests and updates the docstring to match `loadtxt`. Fixes numpy#4341
Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
I think this is probably fine, although did not look too careful. Maybe @WarrenWeckesser can push this through? It seems that this does change behaviour fairly strongly? Probably the old behaviour can be defined as a clear bug here? @AndrewEckart could you comment briefly on that for me, and add a release note unless it was a loud error before? The release notes go into |
I'm not sure how to characterize it. In #4341, it's described as a bug, and the old behavior is contrasted with |
Definitely needs a release note IMO. Someones code will probably break somewhere. We have to make a call that it is very few people... |
Would you consider this |
Lets put it under compat, since the actual improvement is pretty straight forward (and arguably even a bug fix). |
one for each column:: | ||
|
||
>>> np.genfromtxt(StringIO("1 2.0 \n 3 4.0"), unpack=True) | ||
[array([1, 3]), array([2., 4.])]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this example requires a structured dtype=
being passed in? It may be good to point/show an example which triggers the changed behaviour without passing in dtype=
(if that is possible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(i.e. I tried the example on this branch, and the behaviour is not as written here – also there is an additional closing bracket)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, it looks like the numbers in my example were too small so the inferred dtype
was '<u4'. I replaced it with the test case for dtype=None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I have to take a closer look at the tests, or am I running the wrong code? But it seems to me that this gives:
In [2]: from io import StringIO
In [3]: np.genfromtxt(StringIO("21 58.0 \n 35 72.0"), unpack=True)
Out[3]:
array([[21., 35.],
[58., 72.]])
OTOH, this one changed behaviour as expected:
In [5]: np.genfromtxt(StringIO("21 58.0 \n 35 72.0"), unpack=True, dtype="i,d")
so, right now I think the example is still incorrect, and if there are tests for this, they should probably be refined to check that the result is (correctly, I assume) a single array in the first case, since it should be a single array when there is no structured dtype detected (which seems to be the default?).
Unless this should change and I am missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, you were running the right code; it turns out that the example was incorrect. I forgot that detection of a structured dtype relies on passing dtype=None
to override the default dtype=float
(otherwise, a single array is the correct return value). My example in the release note failed to specify the dtype
argument. Should be correct now.
Unfortunately CircleCI does not merge-from-master before building. Could you either merge from master or rebase to clear the non-related failure? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the liberty of expanding the release note a bit. There are a couple things that need addressing in the updated tests
Changes introduced after initial approval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for sticking with this @AndrewEckart !
Hi, this change means that genfromtxt now ignores The example I am thinking of is, say we have a simple csv or space delimited document with 4 columns: val1 val2 valignore1 valignore2 I read this in using Now we get [array(val1, val3, val5), array(val1, val3, val5)], previously it would be an array of arrays with names (used to be called recarray I think but maybe that name has changed). |
Previously,
genfromtxt
failed to transpose its output whenunpack=True
anddtype
was structured (ordtype=None
and theinferred
dtype
was structured). This patch resolves the issue byreturning a list of arrays, as in
loadtxt
. Includes tests and updatesthe docstring to match
loadtxt
.Fixes #4341