DOC: Design drafts to assist with next-gen pandas internals discussion · Pull Request #13944 · pandas-dev/pandas

Status: Closed · wants to merge 8 commits

Changes from 1 commit
Section on numpy interoperability
wesm committed Aug 9, 2016
commit c742d5d17487be59731bc5ea175ca3eda721da3b
2 changes: 1 addition & 1 deletion doc/pandas-2.0/source/conf.py
@@ -229,7 +229,7 @@
 # author, documentclass [howto, manual, or own class]).
 latex_documents = [
     (master_doc, 'pandas20DesignDocs.tex', 'pandas 2.0 Design Docs Documentation',
-     'pandas Core Team', 'manual'),
+     'Wes McKinney', 'manual'),
 ]
73 changes: 62 additions & 11 deletions doc/pandas-2.0/source/internal-architecture.rst
@@ -17,7 +17,7 @@ Logical types and Physical Storage Decoupling

Since this is the most important, but perhaps also most controversial, change
(in my opinion) to pandas, I'm going to go over it in great detail. I think the
-hardest part of coming up with clear language and definitions for concepts so
+hardest part is coming up with clear language and definitions for concepts so
that we can communicate effectively. For example the term "data type" is vague
and may mean different things to different people.

@@ -124,9 +124,9 @@
types. For example: you could have a categorical type (a logical construct
consisting of multiple arrays of data) whose categories are some other logical
type.
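
For instance, an illustrative snippet using the current pandas API (not a
proposed new one): a categorical whose categories are timestamps is a logical
type parameterized by another logical type.

.. code-block:: python

   import pandas as pd

   # A categorical whose categories are themselves a datetime logical type
   cat = pd.Categorical(pd.to_datetime(['2016-01-01', '2016-01-02',
                                        '2016-01-01']))
   cat.categories  # DatetimeIndex(['2016-01-01', '2016-01-02'], ...)
   cat.codes       # array([0, 1, 0], dtype=int8)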

-For historical reasons, **pandas never developed a clear semantic separation in
-its user API between logical and physical data types**. Also, the addition of
-new, pandas-only "synthetic" dtypes that are unknown to NumPy (like
+For historical reasons, **pandas never developed a clear or clean semantic
+separation in its user API between logical and physical data types**. Also, the
+addition of new, pandas-only "synthetic" dtypes that are unknown to NumPy (like
categorical, datetimetz, etc.) has expanded this conflation considerably. If
you also consider pandas's custom missing / NULL data behavior, the addition of
ad hoc missing data semantics to a physical NumPy data type created, by the
@@ -168,7 +168,7 @@
The major goals of introducing a logical type abstraction are as follows:
right code branches based on the data type.
* Enabling pandas to decouple both its internal semantics and physical storage
from NumPy's metadata and APIs. Note that this is already happening with
-categorical types, since a particular instance ``CategoricalDtype`` may
+categorical types, since a particular instance of ``CategoricalDtype`` may
physically be stored in one of 4 NumPy data types.
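
To make this concrete with the current API (the exact integer widths chosen
are an implementation detail, not a contract):

.. code-block:: python

   import numpy as np
   import pandas as pd

   # One logical categorical type, two different physical dtypes for the
   # codes, chosen based on the number of categories
   pd.Categorical(['a', 'b', 'a']).codes.dtype      # dtype('int8')
   pd.Categorical(np.arange(100000)).codes.dtype    # dtype('int32')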

Physical storage decoupling
~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -189,19 +189,70 @@
By separating pandas data from the presumption of using a particular physical
data by forming a composite data structure consisting of a NumPy array plus a
bitmap marking the null / not-null values (see the sketch after this list).

* We can start to think about improved behavior around data ownership (like
copy-on-write) which may yield many benefits. I will write a dedicated
section about this.
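
The following is a minimal sketch of such a composite structure. The class
name and layout are hypothetical, and a byte-per-value boolean mask stands in
for a real packed bitmap:

.. code-block:: python

   import numpy as np

   class NullableInt64Array(object):
       """Hypothetical composite array: int64 storage plus validity mask."""

       def __init__(self, values, valid):
           self.values = np.asarray(values, dtype=np.int64)
           # One byte per value for simplicity; a real implementation
           # would pack 8 validity bits per byte
           self.valid = np.asarray(valid, dtype=np.bool_)

       def sum(self):
           # Null-aware reduction: only the valid entries participate
           return self.values[self.valid].sum()

   arr = NullableInt64Array([1, 2, 3], [True, False, True])
   arr.sum()  # 4, since the second value is treated as NA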

Note that neither of these points implies that we are trying to use NumPy
-less. We already have large amounts of code that implement algorithms also
-found in NumPy (see ``pandas.unique`` or the implementation of ``Series.sum``),
-but taking into account pandas's missing data representation, etc. Internally,
-we can use NumPy when its computational semantics match those we've chosen for
-pandas, and elsewhere we can invoke pandas-specific code.
+less. We already have large amounts of code that implement algorithms similar
+to those found in NumPy (e.g. ``pandas.unique`` or the implementation of
+``Series.sum``), but taking into account pandas's missing data representation,
+etc. Internally, we can use NumPy when its computational semantics match those
+we've chosen for pandas, and elsewhere we can invoke pandas-specific code.

Review comment (Member):

    I very much agree with this and think it is a crucial point that could be
    emphasised more in the docs. I suspect that this is a tripping point for
    many users regarding the application of numpy functionality on pandas
    objects.
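
As a concrete illustration of these diverging computational semantics, using
the current APIs:

.. code-block:: python

   import numpy as np
   import pandas as pd

   # np.unique sorts its result; pd.unique preserves order of appearance
   np.unique(np.array([3, 1, 3, 2]))     # array([1, 2, 3])
   pd.unique(np.array([3, 1, 3, 2]))     # array([3, 1, 2])

   # Series.sum skips missing values by default; ndarray.sum does not
   pd.Series([1.0, np.nan, 2.0]).sum()   # 3.0
   np.array([1.0, np.nan, 2.0]).sum()    # nan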

A major concern here based on these ideas is **preserving NumPy
interoperability**, so I'll examine this topic in some detail next.

Preserving NumPy interoperability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some of the types of intended interoperability between NumPy and pandas are as
follows:

* Users can obtain a ``numpy.ndarray`` (possibly a view, depending on the
internal block structure; more on this soon) in constant time and without
copying the actual data. This has a couple of other implications:

* Changes made to this array will be reflected in the source pandas object.
* If you write C extension code (possibly in Cython) and respect pandas's
missing data details, you can invoke certain kinds of fast custom code on
pandas data (but it's somewhat inflexible -- see the latest discussion on
adding a native code API to pandas).

* NumPy ufuncs (like ``np.sqrt`` or ``np.log``) can be invoked on
pandas objects like Series and DataFrame.

* ``numpy.asarray`` will always yield some array, even if it discards metadata
or has to create a new array. For example ``asarray`` invoked on
``pandas.Categorical`` yields a reconstructed array (rather than either the
categories or codes internal arrays).

* Many NumPy methods designed to work on subclasses (or duck-typed classes) of
``ndarray`` may be used. For example ``numpy.sum`` may be used on a Series
even though it does not invoke NumPy's internal C sum algorithm. This means
that a Series may be used as an interchangeable argument in a large set of
functions that only know about NumPy arrays.

Review comment from @gfyoung (Member), Aug 9, 2016:

    I am a little wary about this. I've worked on this compatibility issue
    from both sides (numpy and pandas), and in pandas, my feeling is that
    we've had to bend over backwards a bit (this whole directory alone here
    speaks for itself) for such compatibility. Do we really want to commit
    ourselves to this in a new-and-improved pandas?

Reply from @wesm (Member, Author), Aug 9, 2016:

    I deliberately chose pretty hedgy language here. "Many methods ... may be
    used" instead of "All methods ... can be used". Maybe we should say that
    we're going to back away from this here?

    Maintaining an explicit API contract that you can pass a Series or
    DataFrame wherever you might otherwise pass an ndarray seems like a bad
    idea to me, and is likely to be a continuous source of bugs and
    maintainability problems. I know you've worked a lot on this and very
    honestly I cringed when I saw the patches come through — I would rather
    not reinforce this interchangeability in users' minds. I agree that
    accommodating NumPy functions where it does no harm to pandas is okay,
    but I don't think it's our responsibility to do so.

    But honestly: the average pandas user is, in my anecdotal experience,
    more familiar with pandas's features than NumPy's features, so the
    audience of people for whom "semantic parity" with NumPy is important is
    probably growing smaller over time. I would rather focus on creating a
    consistent and self-contained user experience in pandas.

Reply from @gfyoung (Member), Aug 9, 2016:

    Toning down the promise of inter-op functionality with numpy would be a
    good idea for now, and yes, I agree: consistent with decoupling from
    numpy is to create a more self-contained user experience in pandas.
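
Returning to the list itself: a short demonstration of these interoperability
behaviors as they work today:

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series([1.0, 2.0, 3.0])

   # ufuncs operate directly on pandas objects and return a Series
   np.sqrt(s)

   # numpy.sum accepts a Series, dispatching to Series.sum internally
   np.sum(s)   # 6.0

   # asarray on a Categorical reconstructs a dense array rather than
   # exposing the internal codes / categories arrays
   np.asarray(pd.Categorical(['a', 'b', 'a']))
   # array(['a', 'b', 'a'], dtype=object)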

By and large, I think much of this can be preserved, but there will be some API
breakage.

If we add more composite data structures (Categorical can be thought of as
one existing composite data structure) to pandas or alternate non-NumPy data
structures, there will be cases where the semantic information in a Series
cannot be adequately represented in a NumPy array.

As one example, if we add pandas-only missing data support to integer and
boolean data (a long requested feature), calling ``np.asarray`` on such data
may not have well-defined behavior. At present, pandas implicitly converts
these types to ``float64`` (see more below), which isn't too great. A decision
does not need to be made now, but the benefits of solving this long-standing
issue may merit breaking ``asarray``, as long as we provide an explicit way to
obtain the ``float64``-casted NumPy array (with ``NaN`` for NULL/NA values).
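
The current coercion behavior referenced above looks like this:

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Without missing values, integer data stays integer
   pd.Series([1, 2, 3]).dtype    # dtype('int64')

   # Introducing a missing value silently upcasts to float64 so that
   # NaN can serve as the NULL sentinel
   s = pd.Series([1, None, 3])
   s.dtype                       # dtype('float64')
   np.asarray(s)                 # array([  1.,  nan,   3.])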

For pandas data that does not step outside NumPy's semantic realm, we can
continue to provide zero-copy views in many cases.
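
For example (current behavior; whether ``.values`` is a zero-copy view depends
on the internal block layout):

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series(np.arange(5))

   # Here .values is a view on the underlying block, so mutations
   # are visible through the Series
   v = s.values
   v[0] = 100
   s.iloc[0]   # 100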

Removal of BlockManager / new DataFrame internals
=================================================

@@ -360,7 +411,7 @@
computations. Let's take for example:
Profiling ``s.sum()`` with ``%prun`` in IPython, I am seeing 116 function
calls (pandas 0.18.1). Let's look at the microperformance:

-.. code-block:: python
+.. code-block:: text

     In [14]: timeit s.sum()
     10000 loops, best of 3: 31.7 µs per loop