ENH: change default empty array dtype, currently 'float64' · Issue #10405 · numpy/numpy · GitHub

Open
nschloe opened this issue Jan 14, 2018 · 11 comments

Comments

@nschloe
Contributor
nschloe commented Jan 14, 2018

(From #10135.)

In NumPy, empty arrays have the default data type numpy.float64.

print(numpy.array([]).dtype)  # float64

I'm questioning whether this is the right choice.

In principle, the default dtype choice for empty arrays is arbitrary since there is no data to work with. One good reason to choose a small data type is that operations with empty arrays should not needlessly augment the dtype. The current setting leads to rather surprising dtype changes like

print(numpy.concatenate([[0], []]))  # [0.], not [0]

This would not have happened with, e.g., numpy.array([]).dtype == numpy.int8.
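The promotion described above can be made concrete with `np.result_type`, NumPy's documented dtype-promotion rule, which is what `concatenate` follows when its inputs disagree. A minimal sketch:

```python
import numpy as np

# concatenate promotes to the common dtype of its inputs; the empty
# array contributes float64, so the result is float64
a = np.array([0])   # platform default integer dtype
b = np.array([])    # float64 by default
print(np.result_type(a, b))          # float64
print(np.concatenate([a, b]).dtype)  # float64

# with an explicit integer dtype on the empty array, the integer
# dtype of the non-empty input survives the promotion
b_int = np.array([], dtype=np.int8)
print(np.concatenate([a, b_int]).dtype)
```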

@pv
Member
pv commented Jan 14, 2018

The default data type is also float64 for zeros, empty, et al.
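This is easy to verify for the standard array-creation functions:

```python
import numpy as np

# float64 is the default across the array-creation functions,
# not just for np.array([])
for factory in (np.zeros, np.ones, np.empty):
    print(factory(3).dtype)  # float64 each time
```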

It's also useful to keep in mind that the NumPy API is no longer as open to change as it was 10 years ago when it was established, and small cosmetic changes (which are usually subjective) are more likely to be harmful than useful, in that they can break many existing codebases.

@seberg
Member
seberg commented Jan 14, 2018

I do not even believe there is a reasonable "minimal dtype". Even if a change were reasonably possible, I doubt it makes sense. Yes, this is sometimes annoying, but no, changing the default type is not the solution. Generally, I personally think the solution is: too bad, sometimes the user has to put in the dtype explicitly by doing the array cast by hand.
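The explicit-cast workaround mentioned above looks like this in practice: state the intended dtype at creation time instead of relying on the float64 default.

```python
import numpy as np

# Give the empty array the dtype you actually mean, and the
# surprising float64 promotion never happens
empty = np.array([], dtype=int)
result = np.concatenate([[0], empty])
print(result, result.dtype)  # [0] with a platform integer dtype
```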

@nschloe
Contributor Author
nschloe commented Jan 15, 2018

Thank you very much for the comments. I agree with some of them and disagree with others.

> Yes, this is sometimes annoying, but no, changing the default type is not the solution,

> and small cosmetic changes (which usually are subjective)

I presented one example where changing the dtype does fix a problem. The change is certainly not cosmetic.

> It's also useful to keep in mind that the NumPy API is no longer as open to change as it was 10 years ago when it was established, and small cosmetic changes (which are usually subjective) are more likely to be harmful than useful, in that they can break many existing codebases.

Indeed, changing something can always lead to problems downstream, and that is particularly risky if there is no estimate whatsoever of how much could break. I would suggest that I make the change locally and compile a handful of large numerical libraries against it (e.g., scipy, sympy, pandas, scikit-learn, and astropy). If too many downstream changes are required, I'd say let's forget about it. If nothing breaks, perhaps we could continue discussing.

@seberg
Member
seberg commented Jan 15, 2018

Maybe it is not just cosmetic, but I disagree with the solution. It has a bad smell to me: basically it seems like replacing one problem with another, and yes, that might fix many problems, but it will also create weird new ones (of course those might be better, but they might also be trickier?!).
There are cases (such as concatenate) where you can assume that all arguments are probably supposed to have the same dtype, or cases like: this is an indexing array, so we can probably assume an integer is wanted. However, I believe those cases are limited to specific scenarios, and you are proposing a scenario-agnostic solution, which I do not believe in.

@nschloe
Contributor Author
nschloe commented Jan 16, 2018

Thanks again for the comment.

Let me first say that I'm not emotionally attached to the fix; it's just an inconsistency I noticed which I thought could be fixed easily enough. I can also understand that a package like NumPy takes a rather conservative approach to software development. When rejecting, it certainly makes things clearer if a small counterexample is given, something along the lines of:

> The suggestion may fix one thing, but unfortunately it changes the behavior of `something(else)`, so it's probably not worth fixing it this way.

That said, feel free to close.

@seberg
Member
seberg commented Jan 17, 2018

Well, I do not have a very concrete example, but if you sum up an empty array or use its dtype for anything specific, you will then get an integer dtype (possibly a small integer).
This whole thing is just guessing; int8 will in many cases be a more useful guess, but it is still a guess. And it is a dangerous type if you actually manage to use it somehow (read: most users do not want it or even really understand it well).
So yeah, maybe it is even a better guess on average, but I think it can also be an even more confusing guess sometimes.
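The "use its dtype for anything specific" point can be illustrated with a reduction over an empty array; under the current default the result is a float, while under a hypothetical small-integer default it would be an integer (note that `sum()` itself promotes small integer dtypes to a platform-sized integer):

```python
import numpy as np

# Today, reductions over empty arrays inherit the float64 default:
total = np.array([]).sum()
print(total, total.dtype)  # 0.0 float64

# With an explicit int8 empty array, the sum is an integer instead;
# sum() promotes int8 to a platform-sized integer for the accumulator
print(np.array([], dtype=np.int8).sum().dtype)
```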

@godaygo
Contributor
godaygo commented Jan 19, 2018

While I agree with the initial idea that the float type is not the best default choice, in the case of an integer type, especially int8, the following result will be very strange to many users, although correct in nature:

# hypothetical: default dtype is np.int8
>>> np.concatenate([[100], []]) + np.concatenate([[100], []])
array([-56], dtype=int8)

But, having said this, I still believe that choosing NumPy's default integer type as the default would be more reasonable to me.
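The wrap-around in the example above can be reproduced today with explicit int8 arrays, without changing any default: 100 + 100 = 200 does not fit in int8 and wraps to 200 − 256 = −56.

```python
import numpy as np

# int8 holds values in [-128, 127]; array arithmetic wraps on overflow
a = np.array([100], dtype=np.int8)
print(a + a)  # [-56]
```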

@mattip mattip changed the title empty array has needlessly large data type ENH: change default empty array dtype, currently 'float64' Jun 21, 2018
@mattip mattip added 01 - Enhancement 54 - Needs decision 57 - Close? Issues which may be closable unless discussion continued labels Jun 21, 2018
@hmaarrfk
Contributor
hmaarrfk commented Oct 6, 2018

I think the toy example could be made more explicit to avoid this unexpected behaviour.

index_array_a = np.asarray([0], dtype=int)
index_array_b = np.asarray([], dtype=int)
np.concatenate([index_array_a, index_array_b])

Seems to solve that particular edge case at least :S

If list concatenation is all that is needed, then [0] + [] should be used.

I would be wary of stating that int8 is faster than float. I've tried some simple benchmarks and sometimes didn't find much of a difference.

@hmaarrfk
Contributor
hmaarrfk commented Oct 6, 2018

very similar discussion happening here: #1586

@mattip mattip added 23 - Wish List and removed 54 - Needs decision 57 - Close? Issues which may be closable unless discussion continued labels Apr 26, 2019
@rusu24edward

> I think the toy example could be made more explicit to avoid this unexpected behaviour.
>
>     index_array_a = np.asarray([0], dtype=int)
>     index_array_b = np.asarray([], dtype=int)
>     np.concatenate([index_array_a, index_array_b])

Would it be worthwhile to have dtype as an additional argument in concatenate?

@seberg
Member
seberg commented Sep 3, 2020

Pretty much just requires a merge: gh-16134 (although in the meantime a merge conflict has appeared).

EDIT: Although, since that still requires casting, I am not sure it would help you much; it would still be a floating-point array, so it is an unsafe cast to integer. That PR also adds casting="unsafe", but the question is whether that is much nicer in that use case. (It is still an important thing, because it was discussed as a prerequisite for other things.)
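For readers landing here later: the `dtype=` and `casting=` arguments added by gh-16134 shipped in NumPy 1.20, so the original example can now be written as follows (the empty list still becomes float64 internally, which is why the cast to integer must be opted into with `casting="unsafe"`):

```python
import numpy as np

# Requires NumPy >= 1.20, where concatenate gained dtype= and casting=.
# float64 -> int64 is an unsafe cast, so it must be requested explicitly.
out = np.concatenate([[0], []], dtype=np.int64, casting="unsafe")
print(out, out.dtype)  # [0] int64
```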
