DOC: Design drafts to assist with next-gen pandas internals discussion by wesm · Pull Request #13944 · pandas-dev/pandas · GitHub

DOC: Design drafts to assist with next-gen pandas internals discussion #13944

Closed
wesm committed Aug 8, 2016
commit 26bb013e690dc42ff6faa5bc4881483b86530261
81 changes: 71 additions & 10 deletions doc/pandas-2.0/source/goals.rst
@@ -13,18 +13,23 @@ The pandas codebase is now over 8 years old, having grown to over 200,000 lines
of code from its original ~10,000 LOC in the original 0.1 open source release
in January 2010.

At a high level, the "pandas 2.0" effort is based on a couple of observations:

* The pandas 0.x series of releases have been primarily iterative improvements
to the library, with new features, bug fixes, and improved
documentation. There have also been a series of deprecations, API changes,
and other evolutions of pandas's API to account for suboptimal design choices
(for example: the ``.ix`` operator) made in the early days of the project
(2010 to 2012).
At a high level, the "pandas 2.0" effort is based on a number of observations:

* The pandas 0.x series of releases have consisted of huge amounts of
TomAugspurger (Contributor) commented on Aug 9, 2016:

This first point muddies the waters a bit (for me); likewise with the point starting on line 79, "Removal of deprecated / underutilized functionality".

Further down you seem to split between

  1. API changes that are the result of pandas fixing 0.X implementation details (e.g. integer-NA), and
  2. API changes that would be because the original idea may be flawed / out of scope (.ix, plotting).

Does it make sense to limit this document entirely to the internals refactoring, and only talk about API changes of the first kind?

EDIT: I realize I didn't say why I thought the discussion should be limited to just the first kind. I worry discussions about arbitrary API changes will distract from what is probably the more important issue of the internals refactoring. I imagine there are people on the internet who will raise havoc if you try to take away their DataFrame.plot 😄

wesm (Member, Author) replied:

Good points. I mainly fixated on .ix because the .loc and .iloc indexing operators will probably need to be reimplemented as part of this internals overhaul, and having to drag along things like .ix would add implementation burden for unclear benefit. I agree talking about other refactoring / cleaning is a distraction. Will make some amendments to make this more clear.

The other intent of this first point was that the iterative / agile development style of the project (from its early days until now) has made it difficult to consider large/invasive changes to the internals, and after so much time we are due to seriously contemplate what's working well and what's not working well.
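
For readers less familiar with the indexers, here is a minimal sketch (an illustration, not part of the PR) of the label/position split that .loc and .iloc make explicit. The Series and its integer labels are made up for the example; .ix mixed the two modes depending on the index type, which is the kind of ambiguity alluded to above.

```python
import pandas as pd

# Hypothetical data: integer labels that deliberately differ from positions.
s = pd.Series(["a", "b", "c"], index=[2, 1, 0])

by_label = s.loc[0]      # looks up the row whose label is 0
by_position = s.iloc[0]  # looks up the first row, whatever its label

print(by_label, by_position)
```

With purely label-based and purely position-based operators, each lookup has one meaning regardless of what the index happens to contain.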

jorisvandenbossche (Member) commented on Aug 19, 2016:

+1 on making this distinction. We can also start drafting documents that lay out other API changes (not related to internal refactoring) and put those in the same directory (so the 'goals and motivations' can touch both aspects), but in separate PRs

EDIT: I see you already said the same below .. :-)

iterative improvements to the library along with some major new features, bug
fixes, and improved documentation. There have also been a series of
deprecations, API changes, and other evolutions of pandas's API to account
for suboptimal design choices (for example: the ``.ix`` operator) made in the
early days of the project (2010 to 2012).
* The unification of Series and DataFrame internals to be based on a common
``NDFrame`` base class and "block manager" data structure (heroically
championed by Jeff Reback), while introducing many benefits to pandas, has
come to be viewed as a long-term source of technical debt and code
complexity.
* pandas's ability to support an increasingly broad set of use cases has been
significantly constrained (as will be examined in detail in these documents)
by its tight coupling to NumPy and therefore subject to design limitations in
NumPy.
by its tight coupling to NumPy and therefore subject to various limitations
in NumPy.
* Making significant functional additions (particularly filling gaps in NumPy)
to pandas, particularly new data types, has grown increasingly complex with
very obvious accumulations of technical debt.
@@ -78,6 +83,8 @@ As this will be a quite nuanced discussion, especially for those not intimately
familiar with pandas's implementation details, I wanted to speak to a couple of
commonly-asked questions in brief:

````

1. **Will this work make it harder to use pandas with NumPy, scikit-learn,
statsmodels, SciPy, or other libraries that depend on NumPy
interoperability?**
@@ -91,6 +98,8 @@ commonly-asked questions in brief:
generally make production code using a downstream library like scikit-learn
more dependable and future-proof.

````

2. **By decoupling from NumPy, it sounds like you are reimplementing NumPy or
adding a new data type system**

@@ -113,6 +122,58 @@ commonly-asked questions in brief:
types** (i.e. NumPy's dtypes) and **logical types** (i.e. what pandas
currently has, implicitly).
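
As one way to picture the physical/logical distinction, the sketch below uses pandas's existing categorical type. It is an illustration added for clarity rather than something taken from the design draft, and the example values are assumptions: the user sees a logical "category" type, while the data is physically stored as small integer codes.

```python
import pandas as pd

# Hypothetical example values.
cat = pd.Series(["low", "high", "low"], dtype="category")

logical = cat.dtype             # the type the user reasons about
physical = cat.cat.codes.dtype  # the NumPy dtype of the stored codes
codes = list(cat.cat.codes)     # positions into the sorted categories

print(logical, physical, codes)
```

The same logical type could in principle sit on top of a different physical representation without the user-facing semantics changing.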

````

3. **Shouldn't you try to accomplish your goals by contributing work to NumPy
instead of investing major work in pandas's internals?**

* In my opinion, this is a "false dichotomy"; i.e. these things are not
mutually exclusive.

* Yes, we should define, scope, and if possible help implement improvements
to NumPy that make sense. As NumPy serves a significantly larger and more
diverse set of users, major changes to the NumPy C codebase must be
approached more conservatively.

* It is unclear that pandas's body of domain-specific data handling and
computational code is entirely "in scope" for NumPy. Some technical
details, such as our categorical or datetime data semantics, "group by"
functionality, relational algebra (joins), etc., may be ideal for pandas
but not necessarily ideal for a general user of NumPy. My opinion is that
functionality from NumPy we wish to use in pandas should "pass through" to
the user unmodified, but we must retain the flexibility to work "outside
the box" (implement things not found in NumPy) without adding technical
debt or user API complexity.

````

4. **API changes / breaks are thought to be bad; don't you have a
responsibility to maintain backwards compatibility for users that heavily
depend on pandas?**

* It's true that APIs should not be broken or changed lightly; any such change
should be approached with extreme caution.

* The goal of the pandas 2.0 initiative is to only make "good" API breaks
that yield a net benefit that can be easily demonstrated. As an example:
adding native missing data support to integer and boolean data (without
casting to another physical storage type) may break user code that has
knowledge of the "rough edge" (the behavior that we are fixing). As these
changes will mostly affect advanced pandas users, I expect they will be
welcomed.

* Any major API change or break will be documented and justified to assist
with code migration.

* As soon as we are able, we will post binary development artifacts for the
pandas 2.0 development branch to get early feedback from heavy pandas
users to understand the impact of changes and how we can better help the
existing user base.

* Some users will find that a certain piece of code has been working "by
accident" (i.e. relying upon undocumented behavior). This kind of breakage
is already a routine occurrence unfortunately.
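
The "rough edge" in the integer example above can be reproduced in a few lines. This is a hedged sketch of the behavior of NumPy-backed integer data, using made-up values; it is not a statement of how pandas 2.0 will behave.

```python
import pandas as pd

# Hypothetical integer data with no missing values.
ints = pd.Series([1, 2, 3])

# Reindexing introduces a missing value at label 3. NumPy integers cannot
# represent NaN, so the whole column is cast to floating point.
with_gap = ints.reindex([0, 1, 2, 3])

print(ints.dtype, with_gap.dtype)
```

Code that checks for a float dtype after such an operation, or that casts back to integers by hand, is the kind of user code that would be affected by native missing data support.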

Summary
=======
