PDEP-14: Dedicated string data type for pandas 3.0 by jorisvandenbossche · Pull Request #58551 · pandas-dev/pandas · GitHub
PDEP-14: Dedicated string data type for pandas 3.0 #58551


Merged
28 commits
fbeb69d
PDEP: Dedicated string data type for pandas 3.0
jorisvandenbossche May 3, 2024
f03f54d
small textual edits and typos
jorisvandenbossche May 3, 2024
561de87
address part of the feedback
jorisvandenbossche May 5, 2024
86f4e51
Update web/pandas/pdeps/00xx-string-dtype.md
jorisvandenbossche May 5, 2024
30c7b43
rename file
jorisvandenbossche May 13, 2024
54a43b3
expand Missing value semantics section
jorisvandenbossche May 13, 2024
5b5835b
expand Naming subsection with storage+na_value proposal
jorisvandenbossche May 13, 2024
9ede2e6
Expand Backward compatibility section + add proposal for deprecation
jorisvandenbossche May 13, 2024
f5faf4e
update timeline
jorisvandenbossche May 13, 2024
f554909
Apply suggestions from code review
jorisvandenbossche May 13, 2024
ac2d21a
Apply suggestions from code review
jorisvandenbossche May 13, 2024
82027d2
reflow after online edits
jorisvandenbossche May 13, 2024
5b24c24
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche May 13, 2024
f9c55f4
Apply suggestions from code review
jorisvandenbossche May 13, 2024
2c58c4c
Fixup table (#2)
rhshadrach May 14, 2024
0a68504
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche May 20, 2024
8974c5b
next round of updates (small text updates, add capitalized String alias)
jorisvandenbossche May 20, 2024
cca3a7f
use capitalized alias in the overview table
jorisvandenbossche May 20, 2024
d24a80a
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 10, 2024
9c5342a
New revision: keep back compat for 'string', introduce 'str' for the …
jorisvandenbossche Jun 10, 2024
b5663cc
Apply suggestions from code review
jorisvandenbossche Jun 11, 2024
1c4c2d9
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche Jun 12, 2024
c44bfb5
rephrase main points in proposal
jorisvandenbossche Jun 12, 2024
af5ad3c
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 14, 2024
bd52f39
tiny edit
jorisvandenbossche Jun 14, 2024
f8fbc61
mismatched quote
jorisvandenbossche Jun 14, 2024
d78462d
Update 0014-string-dtype.md
phofl Jul 22, 2024
4de20d1
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jul 24, 2024
small textual edits and typos
jorisvandenbossche committed May 3, 2024
commit f03f54d8c67a011a62779a363798d0dc2d9cf0f4
26 changes: 14 additions & 12 deletions web/pandas/pdeps/00xx-string-dtype.md
@@ -13,7 +13,7 @@ default in pandas 3.0:

* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
or otherwise the numpy object-dtype alternative.
Contributor

Should you allow the possibility of a NumPy 2 improved type for pandas 3? With a hierarchy arrow -> np 2 -> np object?

Member Author
@jorisvandenbossche May 3, 2024

This proposal does not preclude any further improvements for the numpy-based string dtype using numpy 2.0. A few lines below I explicitly mention it as a future improvement and in the "Object-dtype "fallback" implementation" section as well.

I just don't want to explicitly commit to anything for pandas 3.0 related to that, given it is hard to judge right now how well it will work / how much work it is to get it ready (not only our own implementation, but also support in the rest of the ecosystem). If it is ready by 3.0, then we can evaluate that separately, but this proposal doesn't stand or fall with it.

Regardless of whether to also use numpy 2.0, we have to agree on 1) making a "string" dtype the default for 3.0, 2) the missing value behaviour to use for this dtype, and 3) whether to provide an alternative for PyArrow (in which case we need the object-dtype version anyway since we also can't require numpy 2.0). I would like the proposal to focus on those aspects.

- * The default string dtype will use missing value semantics using NaN consistent
+ * The default string dtype will use missing value semantics (using NaN) consistent
with the other default data types.
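
The "PyArrow if available, otherwise object-dtype" rule in the first bullet can be pictured with a small sketch. This is not the pandas implementation: `choose_string_storage` is a hypothetical helper, and the returned names `"pyarrow"` and `"python"` are only assumed here to mirror the existing `storage` options of `StringDtype`.

```python
import importlib.util


def choose_string_storage() -> str:
    """Hypothetical helper: name the backend a default string dtype would use."""
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("pyarrow") is not None:
        return "pyarrow"
    # Fallback: strings kept as Python objects in a NumPy object array.
    return "python"


storage = choose_string_storage()
print(f"default string storage would be: {storage}")
```

The point of the sketch is that the choice happens once, based on the environment, and user code does not need to change between the two cases.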

This will give users a long-awaited proper string dtype for 3.0, while 1) not
@@ -26,12 +26,12 @@ using NumPy 2.0, etc).
## Background

Currently, pandas by default stores text data in an `object`-dtype NumPy array.
- The current implementation has two primary drawbacks: First, `object`-dtype is
+ The current implementation has two primary drawbacks. First, `object` dtype is
not specific to strings: any Python object can be stored in an `object`-dtype
array, not just strings, and seeing `object` as the dtype for a column with
strings is confusing for users. Second: this is not efficient (all string
- methods on a Series are eventually done by calling Python methods on the
- individual string objects).
+ methods on a Series are eventually calling Python methods on the individual
+ string objects).
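
A minimal illustration of the first drawback, assuming only that pandas is installed. `dtype=object` is passed explicitly so the result does not depend on which dtype a given pandas version infers for strings.

```python
import pandas as pd

# Both Series report the same dtype, although only one holds just strings.
texts = pd.Series(["apple", "banana"], dtype=object)
mixed = pd.Series(["apple", 42, 3.5], dtype=object)
print(texts.dtype, mixed.dtype)

# The dtype cannot tell the two apart; the elements have to be inspected.
all_strings_texts = all(isinstance(value, str) for value in texts)
all_strings_mixed = all(isinstance(value, str) for value in mixed)
print(all_strings_texts, all_strings_mixed)
```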

To solve the first issue, a dedicated extension dtype for string data has
already been
@@ -51,8 +51,9 @@ This could be specified with the `storage` keyword in the opt-in string dtype
Since its introduction, the `StringDtype` has always been opt-in, and has used
the experimental `pd.NA` sentinel for missing values (which was also [introduced
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)).
- However, up to this date, pandas has not yet made the step to use `pd.NA` by
- default.
+ However, up to this date, pandas has not yet taken the step to use `pd.NA` by
+ default, and thus the `StringDtype` deviates in missing value behaviour compared
+ to the default data types.

In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html)
proposed to start using a PyArrow-backed string dtype by default in pandas 3.0
@@ -125,15 +126,15 @@ This option will be expanded to also work when PyArrow is not installed.

### Missing value semantics

- Given that all other default data types uses NaN semantics for missing values,
+ Given that all other default data types use NaN semantics for missing values,
this proposal says that a new default string dtype should still use the same
default semantics. Further, it should result in default data types when doing
operations on the string column that result in a boolean or numeric data type
(e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison
operators like `==`, should result in default `int64` and `bool` data types).

- Because the current original `StringDtype` implementations already use `pd.NA`
- and return masked integer and boolean arrays in operations, a new variant of the
+ Because the original `StringDtype` implementations already use `pd.NA` and
+ return masked integer and boolean arrays in operations, a new variant of the
existing dtypes that uses `NaN` and default data types is needed.
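
The difference described here can be seen with the opt-in dtype that exists today. The sketch below assumes the opt-in `"string"` alias keeps its `pd.NA` behaviour; the sample values are illustrative only.

```python
import pandas as pd

values = ["apple", None, "kiwi"]

# Opt-in StringDtype: pd.NA as the sentinel, masked (nullable) result dtypes.
opt_in = pd.Series(values, dtype="string")
masked_len = opt_in.str.len()
masked_starts = opt_in.str.startswith("a")
print(masked_len.dtype, masked_starts.dtype)

# Object dtype: the missing length becomes NaN, which forces a float result.
object_len = pd.Series(values, dtype=object).str.len()
print(object_len.dtype)
```

The proposed default dtype is meant to behave like the second case with respect to missing values, while still being a dedicated string dtype.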

### Object-dtype "fallback" implementation
@@ -175,7 +176,7 @@ To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
the default missing value sentinel? using the new NumPy 2.0 capabilities?), we
could also delay introducing a default string dtype until there is more clarity
- for those other discussions.
+ in those other discussions.

However:

@@ -184,11 +185,12 @@ However:
significant part of the user base that has PyArrow installed) in performance.
Contributor

I don't think we can just claim this. I don't disagree, but this should be backed up more.

At least from the feedback received from #57073 and the other issue, there's at least a significant part of the user base that doesn't use strings.

There's also a significant chunk of the population that can't install pyarrow (due to size requirements or exotic platforms or whatever).

Member

I am not sure this argument is that convincing either, although for slightly different reasons. I don't think we need to feel rushed for the next release

Member Author
@jorisvandenbossche May 6, 2024

> I don't think we can just claim this. I don't disagree, but this should be backed up more.

@lithomas1 can you clarify which part of the paragraph you think requires more backing up?
The fact that I say a "significant" part of our user base has pyarrow installed?

I don't think we can ever know exact numbers for this, but one data point is that pandas currently has 210M monthly downloads and pyarrow has 120M monthly downloads. Of course not all of those pyarrow users are also using pandas, but let's just assume that half of those pyarrow downloads come from people using pandas, that would mean that around 30% of our users already have pyarrow installed, which I would consider as a "significant part".
(and my guess is that for people working with larger datasets, where the speed of pyarrow becomes more important, this percentage will be higher, for example because of using the parquet IO)
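
The back-of-the-envelope estimate above can be written out as follows. The download figures are the ones quoted in the comment; the 50% share is the comment's own assumption, not a measured value.

```python
pandas_downloads = 210_000_000   # monthly pandas downloads quoted above
pyarrow_downloads = 120_000_000  # monthly pyarrow downloads quoted above
assumed_share = 0.5              # assumption: half of pyarrow downloads are pandas users

pandas_users_with_pyarrow = pyarrow_downloads * assumed_share
fraction = pandas_users_with_pyarrow / pandas_downloads
print(f"{fraction:.1%}")  # 28.6%, i.e. "around 30%"
```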

But anyway, we are never going to know this exact number, but IMO we do know that a significant part of our userbase has pyarrow and will benefit from using that by default.

> there's at least a significant part of the user base that doesn't use strings.

Yes, and then this PDEP is not relevant for them. But it's not because some users don't use strings, that we shouldn't improve the life of those users that do use strings? (so just not really understanding how this is a relevant argument)

> There's also a significant chunk of the population that can't install pyarrow

Yes, and this PDEP addresses that by allowing a fallback when pyarrow is not installed.

> I am not sure this argument is that convincing either, although for slightly different reasons.

@WillAyd can you then clarify which other reasons?

Member

My other reason is that I don't think there is ever a rush to get a release out; we have historically never operated that way

Member Author
@jorisvandenbossche May 7, 2024

> I don't think there is ever a rush to get a release out; we have historically never operated that way

For the last six years, we have roughly released a new feature release every six months. We indeed never rush a specific release if there is something holding it up for a bit, but historically we have been releasing somewhat regularly.

At this point, a next feature release will be 3.0 given the amount of changes we already made on the main branch that require the next release cut from main to be 3.0 and not 2.3 (enforced deprecations etc).
(we can cut a 2.3 release from the 2.2.x maintenance branch, which we might want to do for several reasons, but not counting that as a feature release for this discussion, as that will not actually contain features)

So I would say there is not necessarily a rush to do a release with a default "string" dtype (that is up for debate, i.e. this PDEP), but there is some rush to get a 3.0 release out, in the sense that I think we don't want to delay 3.0 for like half a year or longer.

So for me delaying the string dtype, essentially means not including it in 3.0 but postponing it to pandas 4.0 (I should maybe be clearer in the paragraph above about that).

And then I try to argue in the text here that postponing it for 4.0 has a cost (or, missed benefit), because we have an implementation we could use for a default string dtype in pandas 3.0, and postponing its introduction means that users will use the sub-optimal object dtype for longer, for (IMO) no good reason.

Contributor

> > I don't think we can just claim this. I don't disagree, but this should be backed up more.

> @lithomas1 can you clarify which part of the paragraph you think requires more backing up? The fact that I say a "significant" part of our user base has pyarrow installed?

It'd be nice to add how much perf benefits Arrow strings are expected to bring (e.g. 20%? 2x? 10x?).
Putting in the part about how many users have pyarrow would also help.

It'd also be good to elaborate on the usability part. IIUC, the main benefit here is not having to manually check each element to see whether your object dtype'd column contains strings (since I think all the string methods work on object dtype'd columns).

I think it's also fair to amend this part to say "massive benefits to users that use strings" (instead of in general).

Member

Benchmarks are going to be highly dependent on usage and context. If working in an Arrow native ecosystem, the speedup of strings may be a factor over 100x. If working in a space where you have to copy back and forth a lot with NumPy, that number goes way down.

I think trying to set expectations on one number / benchmark for performance is futile, but generally Arrow only helps, and makes it so that we as developers don't need to write custom I/O solutions (eg: ADBC Drivers, parquet, read_csv with pyarrow all work with Arrow natively with no extra pandas dev effort)

Member Author

> It'd be nice to add how much perf benefits Arrow strings are expected to bring (e.g. 20%? 2x? 10x?).

> Benchmarks are going to be highly dependent on usage and context.

Indeed, for single operations you can easily get a >10x speedup, but of course a typical workflow does not consist of just string operations, and the overall speedup depends a lot (see this slide for one small example comparison (https://phofl.github.io/pydata-berlin/pydata-berlin-2023/intro.html#74) and this blogpost from Patrick showing the benefit in a dask example workflow (https://towardsdatascience.com/utilizing-pyarrow-to-improve-pandas-and-dask-workflows-2891d3d96d2b)).

> but generally Arrow only helps, and makes it so that we as developers don't need to write custom I/O solutions

That is often true, but except for strings ;).
For strings, the faster compute kernels will still give a lot of value even if your IO wasn't done through Arrow (and give a lot more value compared to using pyarrow for numeric data)
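
For readers who want to measure a single operation on their own machine, a rough timing sketch could look like the following. The sample data, the choice of `.str.upper()`, and the helper `time_upper` are illustrative assumptions; the numbers depend heavily on data and environment, and the PyArrow case is skipped when pyarrow is not installed.

```python
import importlib.util
import timeit

import pandas as pd

words = ["pandas", "arrow", "numpy", "string"] * 50_000
object_series = pd.Series(words, dtype=object)


def time_upper(series, repeats=3):
    """Best-of-N wall time in seconds for one .str.upper() call."""
    return min(timeit.repeat(lambda: series.str.upper(), number=1, repeat=repeats))


timings = {"object": time_upper(object_series)}

# Only time the PyArrow-backed dtype when pyarrow can be imported.
if importlib.util.find_spec("pyarrow") is not None:
    arrow_series = pd.Series(words, dtype="string[pyarrow]")
    timings["string[pyarrow]"] = time_upper(arrow_series)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

A single-operation timing like this says little about a full workflow, which is the caveat made in the comment above.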

2. In case we eventually transition to use `pd.NA` as the default missing value
Member

> the challenges around this will not be unique to the string dtype and
> therefore not a reason to delay this.

I might be missing the intent but I don't understand why the larger issue of NA handling means we should be faster to implement this

Member Author

> I don't understand why the larger issue of NA handling means we should be faster to implement this

It's not a reason to do it "faster", but I meant to say that the discussion regarding NA is not a reason to do it "slower" (to delay introducing a dedicated string dtype)

Member

I think the flip side is that if we aren't careful about the NA handling we can introduce some new keywords / terminology that makes it very confusing in the long run (which is essentially one of the problems with our strings naming conventions)

As a practical example, if we decided we wanted semantics= as a keyword argument to StringDtype in this PDEP to move the NA discussion along, that might be counter-productive when we look at more data types and decide semantics= was not a clear way to allow datetime data types to support pd.NaT as the missing value.

(not saying the above is necessarily the truth, just cherry picking from conversation so far)

Member Author

That's one reason that I personally would prefer not introducing a keyword specifically for the missing value semantics, for now (just for this PDEP / the string dtype). I just listed some options in #58613, and I think we can do without it.

sentinel, we will need a migration path for _all_ our data types, and thus
- the challenges around this will not be unique to the string dtype.
+ the challenges around this will not be unique to the string dtype and
+ therefore not a reason to delay this.

### Why not use the existing StringDtype with `pd.NA`?

- Because adding even more variants of the string dtype will make things only more
+ Wouldn't adding even more variants of the string dtype will make things only more
confusing? Indeed, this proposal unfortunately introduces more variants of the
string dtype. However, the reason for this is to ensure the actual default user
Member

This just retroactively clarifies the reasoning for string[pyarrow_numpy] to have existed in the first place right? Or is it supposed to be hinting at some other feature that the implementation details of the PDEP is proposing?

Member Author

Yes, it's indeed explaining why we did this, which is of course "retroactively" given I was asked to write this PDEP partly for changes that have already been released. So a big part of the PDEP is retroactive in that sense (which is not necessarily helping to write it clearly ..).

Member

> Or is it supposed to be hinting at some other feature that the implementation details of the PDEP is proposing?

however, more importantly, the PDEP makes this (the already added dtype) the default in 3.0. It would remain behind the future flag for the next release if enough people feel we are not ready.

experience is _less_ confusing, and the new string dtype fits better with the