DOC: Design drafts to assist with next-gen pandas internals discussion #13944

@@ -17,7 +17,7 @@ Logical types and Physical Storage Decoupling

Since this is the most important, but perhaps also most controversial, change
(in my opinion) to pandas, I'm going to go over it in great detail. I think the
hardest part is coming up with clear language and definitions for concepts so
that we can communicate effectively. For example, the term "data type" is vague
and may mean different things to different people.

@@ -124,9 +124,9 @@ types. For example: you could have a categorical type (a logical construct
consisting of multiple arrays of data) whose categories are some other logical
type.

For historical reasons, **pandas never developed a clear or clean semantic
separation in its user API between logical and physical data types**. Also, the
addition of new, pandas-only "synthetic" dtypes that are unknown to NumPy (like
categorical, datetimetz, etc.) has expanded this conflation considerably. If
you also consider pandas's custom missing / NULL data behavior, the addition of
ad hoc missing data semantics to a physical NumPy data type created, by the

@@ -168,7 +168,7 @@ The major goals of introducing a logical type abstraction are as follows:
  right code branches based on the data type.
* Enabling pandas to decouple both its internal semantics and physical storage
  from NumPy's metadata and APIs. Note that this is already happening with
  categorical types, since a particular instance of ``CategoricalDtype`` may
  physically be stored in one of 4 NumPy data types.
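
To illustrate that last point, here is a small sketch of current behavior: the
integer codes backing a Categorical are stored in the smallest suitable NumPy
integer type (``int8``, ``int16``, ``int32`` or ``int64``), so a single logical
categorical type maps onto several possible physical dtypes. The example values
below are chosen purely for illustration.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Few categories: the codes fit in int8
   small = pd.Categorical(['a', 'b', 'a', 'c'])
   small.codes.dtype    # dtype('int8')

   # Many categories: the same logical type now uses int32 codes
   many = pd.Categorical(np.arange(100000) % 50000)
   many.codes.dtype     # dtype('int32')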

Physical storage decoupling
~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -189,19 +189,70 @@ By separating pandas data from the presumption of using a particular physical
  data by forming a composite data structure consisting of a NumPy array plus a
  bitmap marking the null / not-null values (see the sketch after this list).

* We can start to think about improved behavior around data ownership (like
  copy-on-write) which may yield many benefits. I will write a dedicated
  section about this.
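
To make the array-plus-bitmap idea above concrete, here is a minimal sketch.
The ``MaskedIntArray`` name and its API are hypothetical, purely for
illustration (and a boolean mask stands in for a packed bitmap); it is not an
actual pandas data structure.

.. code-block:: python

   import numpy as np

   class MaskedIntArray:
       """Composite of a NumPy int64 array plus a validity mask."""

       def __init__(self, values, valid):
           # values: the physical int64 storage; valid: True marks not-null
           self.values = np.asarray(values, dtype='int64')
           self.valid = np.asarray(valid, dtype=bool)

       def sum(self):
           # Reductions consult the mask instead of relying on NaN sentinels,
           # so integer data never needs to be upcast to float64
           return self.values[self.valid].sum()

   arr = MaskedIntArray([1, 2, 3, 4], [True, False, True, True])
   arr.sum()   # 8 -- the second element is treated as null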

Note that neither of these points implies that we are trying to use NumPy
less. We already have large amounts of code that implement algorithms similar
to those found in NumPy (e.g. ``pandas.unique`` or the implementation of
``Series.sum``), but taking into account pandas's missing data representation,
etc. Internally, we can use NumPy when its computational semantics match those
we've chosen for pandas, and elsewhere we can invoke pandas-specific code.
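
A small illustration of this difference in computational semantics (current
behavior; ``Series.sum`` skips missing values by default, while the raw
``ndarray`` reduction propagates ``NaN``):

.. code-block:: python

   import numpy as np
   import pandas as pd

   arr = np.array([1.0, np.nan, 3.0])

   # NumPy semantics: NaN propagates through the reduction
   arr.sum()             # nan

   # pandas semantics: NaN is treated as missing and skipped (skipna=True)
   pd.Series(arr).sum()  # 4.0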

A major concern here, based on these ideas, is **preserving NumPy
interoperability**, so I'll examine this topic in some detail next.

Preserving NumPy interoperability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some of the types of intended interoperability between NumPy and pandas are as
follows:

* Users can obtain a ``numpy.ndarray`` (possibly a view depending on the
  internal block structure, more on this soon) in constant time and without
  copying the actual data. This has a couple of other implications:

  * Changes made to this array will be reflected in the source pandas object.
  * If you write C extension code (possibly in Cython) and respect pandas's
    missing data details, you can invoke certain kinds of fast custom code on
    pandas data (but it's somewhat inflexible -- see the latest discussion on
    adding a native code API to pandas).

* NumPy ufuncs (like ``np.sqrt`` or ``np.log``) can be invoked on
  pandas objects like Series and DataFrame.

* ``numpy.asarray`` will always yield some array, even if it discards metadata
  or has to create a new array. For example ``asarray`` invoked on
  ``pandas.Categorical`` yields a reconstructed array (rather than either the
  categories or codes internal arrays).

* Many NumPy methods designed to work on subclasses (or duck-typed classes) of
  ``ndarray`` may be used. For example ``numpy.sum`` may be used on a Series
  even though it does not invoke NumPy's internal C sum algorithm. This means
  that a Series may be used as an interchangeable argument in a large set of
  functions that only know about NumPy arrays.

Review comment: I am a little wary about this. I've worked on this
compatibility issue from both sides (

Author reply: I deliberately chose pretty hedgy language here. "Many methods
... may be used" instead of "All methods ... can be used". Maybe we should say
that we're going to back away from this here? Maintaining an explicit API
contract that you can pass a Series or DataFrame wherever you might otherwise
pass an ndarray seems like a bad idea to me, and is likely to be a continuous
source of bugs and maintainability problems. I know you've worked a lot on this
and very honestly I cringed when I saw the patches come through -- I would
rather not reinforce this interchangeability in users' minds. I agree that
accommodating NumPy functions where it does no harm to pandas is okay, but I
don't think it's our responsibility to do so. But honestly: the average pandas
user is, in my anecdotal experience, more familiar with pandas's features than
NumPy's features, so the audience of people for whom "semantic parity" with
NumPy is important is probably growing smaller over time. I would rather focus
on creating a consistent and self-contained user experience in pandas.

Review comment: Toning down the promise of inter-op functionality with
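
To make a few of the interoperability points in the bulleted list above
concrete, here is a short sketch of current behavior (the outputs shown in
comments are indicative, not guaranteed across versions):

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series([1.0, 2.0, 3.0])

   # NumPy ufuncs applied to a Series return a Series (index preserved)
   np.sqrt(s)

   # numpy.asarray always yields *some* ndarray; for a Categorical it
   # reconstructs the values rather than exposing the internal codes or
   # categories arrays
   cat = pd.Categorical(['a', 'b', 'a'])
   np.asarray(cat)   # array(['a', 'b', 'a'], dtype=object)

   # Functions that only know about ndarrays will often accept a Series,
   # even though NumPy's internal C sum routine is not what ends up running
   np.sum(s)         # 6.0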

By and large, I think much of this can be preserved, but there will be some API
breakage.

If we add more composite data structures (Categorical can be thought of as
one existing composite data structure) to pandas or alternate non-NumPy data
structures, there will be cases where the semantic information in a Series
cannot be adequately represented in a NumPy array.

As one example, if we add pandas-only missing data support to integer and
boolean data (a long requested feature), calling ``np.asarray`` on such data
may not have well-defined behavior. At present, pandas is implicitly converting
these types to ``float64`` (see more below), which isn't too great. A decision
does not need to be made now, but the benefits of solving this long-standing
issue may merit breaking ``asarray`` as long as we provide an explicit way to
obtain the casted ``float64`` NumPy array (with ``NaN`` for NULL/NA values).
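
The implicit conversion referred to here looks roughly like this today (a small
illustration of current behavior):

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series([1, 2, 3])
   s.dtype           # dtype('int64')

   # Introducing a missing value silently casts the whole Series to float64,
   # because NaN is the only NULL sentinel available for NumPy numeric dtypes
   s2 = pd.Series([1, 2, None])
   s2.dtype          # dtype('float64')

   np.asarray(s2)    # array([ 1.,  2., nan])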

For pandas data that does not step outside NumPy's semantic realm, we can
continue to provide zero-copy views in many cases.
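
A sketch of the kind of zero-copy view meant here, as it works in the pandas
versions discussed in this document (0.18.x); newer copy-on-write semantics may
change these details:

.. code-block:: python

   import numpy as np
   import pandas as pd

   arr = np.arange(5, dtype='float64')
   s = pd.Series(arr)

   # The Series is backed by the same memory as the source ndarray ...
   np.shares_memory(arr, s.values)   # True

   # ... so mutations through the NumPy array are visible from pandas
   arr[0] = 100.0
   s.iloc[0]                         # 100.0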

Removal of BlockManager / new DataFrame internals
=================================================

@@ -360,7 +411,7 @@ computations. Let's take for example:

Profiling ``s.sum()`` with ``%prun`` in IPython, I am seeing 116 function
calls (pandas 0.18.1). Let's look at the microperformance:

.. code-block:: text

   In [14]: timeit s.sum()
   10000 loops, best of 3: 31.7 µs per loop

Review comment: I very much agree with this and think it is a crucial point
that could be emphasised more in the docs. I suspect that this is a tripping
point for many users regarding the application of ``numpy`` functionality on
``pandas`` objects.