DOC: Design drafts to assist with next-gen pandas internals discussion by wesm · Pull Request #13944 · pandas-dev/pandas · GitHub
DOC: Design drafts to assist with next-gen pandas internals discussion #13944

Closed
wants to merge 8 commits into from
Some exposition on missing data
wesm committed Aug 8, 2016
commit ec953d2975d13df57a0c9b2680a8e2e912c2b7ab
12 changes: 7 additions & 5 deletions doc/pandas-2.0/source/conf.py
@@ -24,12 +24,14 @@
# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = []

extensions = ['IPython.sphinxext.ipython_directive',
              'IPython.sphinxext.ipython_console_highlighting']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
@@ -46,9 +48,9 @@
master_doc = 'index'

# General information about the project.
project = 'pandas 2.0 Design Docs'
copyright = '2016, pandas Core Team'
author = 'pandas Core Team'
project = "Wes's pandas 2.0 Design Docs"
copyright = '2016, Wes McKinney'
author = 'Wes McKinney'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
3 changes: 2 additions & 1 deletion doc/pandas-2.0/source/goals.rst
@@ -7,7 +7,8 @@
.. note::

These documents are largely written by Wes McKinney, and at this point
reflect his opinions for the time being
reflect his opinions for the time being. Many things may change as we discuss
and work to reach a consensus about the path forward.

The pandas codebase is now over 8 years old, having grown to over 200,000 lines
of code from its original ~10,000 LOC in the original 0.1 open source release
4 changes: 2 additions & 2 deletions doc/pandas-2.0/source/index.rst
@@ -1,5 +1,5 @@
pandas 2.0 Design Documents
===========================
Wes's pandas 2.0 Design Documents
=================================

These are a set of documents, based on discussions started in December 2015, to
assist with discussions around changes to Python pandas's internal design
139 changes: 139 additions & 0 deletions doc/pandas-2.0/source/internal-architecture.rst
@@ -0,0 +1,139 @@
.. _internal-architecture:

.. ipython:: python
   :suppress:

   import numpy as np
   import pandas as pd
   np.set_printoptions(precision=4, suppress=True)
   pd.options.display.max_rows = 100

===============================
Internal Architecture Changes
===============================

Logical types and Physical Storage Decoupling
=============================================

Removal of BlockManager / new DataFrame internals
=================================================

``pandas.Array`` and ``pandas.Table``
=====================================

Missing data consistency
========================

Once the physical memory representation has been effectively decoupled from the
user API, we can consider various approaches to implementing missing data in a
consistent way for every logical pandas data type.

To motivate this, let's look at some integer data:

.. ipython:: python

   s = pd.Series([1, 2, 3, 4, 5])
   s
   s.dtype
   s.values

If we assign ``numpy.nan`` to one of the elements, the integer data is silently
cast to floating point:

.. ipython:: python

   s[2] = np.nan
   s
   s.dtype
   s.values

The story for boolean data is similar:

.. ipython:: python

   s = pd.Series([True, False, True])
   s.dtype
   s[2] = np.nan
   s.dtype
   s.values

This implicit behavior appears in many scenarios, such as:

* Loading data from any source: databases, CSV files, R data files, etc.
* Joins or reindexing operations introducing missing data
* Pivot / reshape operations
* Time series resampling
* Certain types of GroupBy operations

A proposed solution
~~~~~~~~~~~~~~~~~~~

My proposal for introducing missing data into any NumPy type outside of
floating point (which uses ``NaN`` for now) and Python object (which uses
``None`` or ``NaN`` interchangeably) is to **allocate and manage an internal
bitmap** (which the user never sees). This has numerous benefits:

* 1 byte of memory overhead for each 8 values
* Bitmaps can propagate their nulls in C through bitwise ``&`` or ``|``
  operations, which are inexpensive.
* Getting and setting bits on modern hardware is very CPU-inexpensive. For
  single-pass array operations (like groupbys) on very large arrays this may
  also result in better CPU cache utilization (fewer main-memory reads of the
  bitmap).
* Hardware and SIMD "popcount" intrinsics (which can operate on 64-128 bits at
  a time) can be used to count bits and skip null-handling on segments of data
  containing no nulls.
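
The points in the list above can be sketched in NumPy. This is only an
illustration under stated assumptions, not the proposed C/C++ implementation:
the helper names ``pack_validity`` and ``count_valid`` are made up for this
example, a set bit means "not null", and ``np.unpackbits(...).sum()`` stands
in for a hardware popcount.

```python
import numpy as np


def pack_validity(valid):
    """Pack a boolean 'is valid' mask into bytes, LSB first (1 = not null)."""
    return np.packbits(np.asarray(valid, dtype=bool), bitorder="little")


def count_valid(bitmap, length):
    """Count set bits among the first `length` bits (software popcount)."""
    return int(np.unpackbits(bitmap, count=length, bitorder="little").sum())


n = 10
# Element 2 and element 9 are null on the left; element 1 is null on the right.
left_valid = pack_validity([1, 1, 0, 1, 1, 1, 1, 1, 1, 0])
right_valid = pack_validity([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

# Ten values need only two bytes of bitmap: one byte per 8 values, rounded up.
assert left_valid.nbytes == 2

# The result of a binary operation is null wherever either input is null,
# so the result bitmap is a bytewise AND of the two input bitmaps.
result_valid = left_valid & right_valid

# If every bit is set, null handling can be skipped for this segment.
n_valid = count_valid(result_valid, n)
has_nulls = n_valid != n
```

With these inputs the result has nulls at positions 1, 2 and 9, so
``n_valid`` is 7 and ``has_nulls`` is ``True``.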

Notably, this is the way that PostgreSQL handles null values. For example, we
might have:

.. code-block::
@chrisaycock (Contributor), Aug 15, 2016:

This code block is not showing up in the generated HTML document. Is a language required?

Member Author:

Yep, apparently code-block does not default even to plain text if you don't indicate the language. I'll fix soon on my pass through to incorporate all the feedback here

   [0, 1, 2, NA, NA, 5, 6, NA]

        i: 7 6 5 4 3 2 1 0
   bitmap: 0 1 1 0 0 1 1 1

Here, the convention of 1 for "not null" (a la PostgreSQL) and
least-significant bit ordering (LSB "bit endianness") is being used.
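
As a rough check of this layout, the same eight values can be packed with
NumPy. The sketch assumes ``None`` as a stand-in for ``NA``, and the helpers
``is_valid`` and ``set_null`` are hypothetical names used only here.

```python
import numpy as np

values = [0, 1, 2, None, None, 5, 6, None]  # None stands in for NA
valid = np.array([v is not None for v in values])

# bitorder="little" stores element i in bit i of its byte (LSB ordering).
bitmap = np.packbits(valid, bitorder="little")

# Printed with bit 7 on the left and bit 0 on the right, as in the table above.
print(format(int(bitmap[0]), "08b"))  # prints 01100111


def is_valid(bitmap, i):
    """Return True if element i is not null."""
    return bool((int(bitmap[i // 8]) >> (i % 8)) & 1)


def set_null(bitmap, i):
    """Clear bit i in place, marking element i as null."""
    bitmap[i // 8] = int(bitmap[i // 8]) & ~(1 << (i % 8)) & 0xFF
```

Element ``i`` lives in byte ``i // 8`` at bit ``i % 8``, so reading or clearing
a bit is a shift and a mask.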

Under the new regime, users could simply write:

.. code-block:: python

   s[2] = pandas.NA

and the data type would be unmodified. It may be necessary to write something
akin to:

.. code-block:: python

   s.to_numpy(dtype=np.float64, na_rep=np.nan)

and that would emulate the current behavior. Attempts to use ``__array__`` (for
example, calling ``np.sqrt`` on the data) would result in an error, since we
will likely want to refuse to guess what casting behavior the user desires.
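
A minimal sketch of how these semantics could fit together is shown below.
Everything in it is an assumption made for illustration: ``NullableInt64`` is
not an existing pandas class, ``None`` stands in for the proposed
``pandas.NA``, a plain boolean mask replaces the bitmap for brevity, and the
``to_numpy`` signature simply mirrors the call written above.

```python
import numpy as np


class NullableInt64:
    """Hypothetical integer array with a hidden validity mask."""

    def __init__(self, values, valid):
        self._values = np.array(values, dtype=np.int64)
        self._valid = np.array(valid, dtype=bool)

    @property
    def dtype(self):
        return self._values.dtype

    def __setitem__(self, i, value):
        if value is None:  # None stands in for the proposed pandas.NA
            self._valid[i] = False  # only the mask changes, not the dtype
        else:
            self._values[i] = value
            self._valid[i] = True

    def to_numpy(self, dtype, na_rep):
        """Explicit conversion: the caller picks the dtype and the null marker."""
        out = self._values.astype(dtype)
        out[~self._valid] = na_rep
        return out

    def __array__(self, dtype=None, copy=None):
        # Refuse implicit coercion rather than guess a casting rule.
        raise TypeError(
            "ambiguous conversion to NumPy; call to_numpy(dtype=..., na_rep=...)"
        )


s = NullableInt64([1, 2, 3, 4, 5], [True] * 5)
s[2] = None
dtype_after = s.dtype  # still int64
as_float = s.to_numpy(dtype=np.float64, na_rep=np.nan)
```

Because NumPy consults ``__array__`` when coercing an object, raising there
forces the user to state the desired conversion explicitly.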

Tradeoffs
~~~~~~~~~

One potential downside of the bitmap approach is that missing data implemented
outside of NumPy's domain will need to be explicitly converted if it is needed
in another library that only knows about NumPy. I argue that this is better
than the current

Proper types for strings and some non-numeric data
==================================================

I believe that frequently occurring data types, such as UTF-8 strings, are
important enough to deserve a dedicated logical pandas data type. This will
enable us both to enforce tighter API semantics (e.g. attempts to assign a
non-string into string data will raise a ``TypeError``) and to improve
performance and memory use under the hood. I will devote an entire section to
talking about strings.
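
The stricter assignment semantics might look roughly like the following.
``StringColumn`` is a hypothetical name, not a pandas class, and using
``None`` as the missing-value marker is an assumption of this sketch. It only
illustrates the type check, not the storage or performance side.

```python
class StringColumn:
    """Hypothetical string-only container that rejects non-string values."""

    def __init__(self, values):
        self._data = []
        for value in values:
            self._check(value)
            self._data.append(value)

    @staticmethod
    def _check(value):
        # None is allowed as the missing-value marker in this sketch.
        if value is not None and not isinstance(value, str):
            raise TypeError(
                f"expected str or None, got {type(value).__name__}"
            )

    def __setitem__(self, i, value):
        self._check(value)  # validate before mutating
        self._data[i] = value

    def __getitem__(self, i):
        return self._data[i]


col = StringColumn(["a", "b", None])
col[0] = "z"  # fine: a string
# col[1] = 1 would raise TypeError and leave the column unchanged.
```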

C++11/14 for lowest implementation tier
=======================================

3rd-party native API (i.e. Cython and C / C++)
==============================================