API: Add string extension type by TomAugspurger · Pull Request #27949 · pandas-dev/pandas · GitHub

API: Add string extension type #27949

Merged on Oct 5, 2019 (59 commits)
Changes from 1 commit
Commits
c24b5b6
API: Add string extension type
TomAugspurger Jul 31, 2019
3ecb5cc
test fixups
TomAugspurger Aug 16, 2019
59a7d39
string dtype
TomAugspurger Aug 16, 2019
7c07070
35 compat
TomAugspurger Aug 16, 2019
9e1a73b
doc
TomAugspurger Aug 16, 2019
16ccad8
fixups
TomAugspurger Aug 16, 2019
1027463
doc
TomAugspurger Aug 16, 2019
aafb53b
doc
TomAugspurger Aug 19, 2019
9cdfe2f
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Aug 19, 2019
ab49169
fix doc warnings
TomAugspurger Aug 19, 2019
978fb55
fixup docstrings
TomAugspurger Aug 19, 2019
aebc688
fixup docstrings
TomAugspurger Aug 19, 2019
d90d0ad
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 9, 2019
41dc0f9
lint
TomAugspurger Sep 9, 2019
b783559
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 16, 2019
13cdddd
typing
TomAugspurger Sep 16, 2019
78c2eaa
removed double assert
TomAugspurger Sep 18, 2019
726d0af
experimental
TomAugspurger Sep 19, 2019
69d24e5
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 19, 2019
9cd9945
failing
TomAugspurger Sep 19, 2019
070fb76
xfails
TomAugspurger Sep 19, 2019
2b90639
Handle non-ndarray in add
TomAugspurger Sep 19, 2019
381c889
fixup
TomAugspurger Sep 19, 2019
bf82aad
fixup
TomAugspurger Sep 19, 2019
79bd87a
note
TomAugspurger Sep 19, 2019
2af8c81
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 23, 2019
fd24274
spacing
TomAugspurger Sep 23, 2019
0635ede
warning note
TomAugspurger Sep 23, 2019
d3311ee
update doc
TomAugspurger Sep 23, 2019
dce9258
doc updates
TomAugspurger Sep 23, 2019
0524f7e
update ctor
TomAugspurger Sep 23, 2019
292a8f3
clean up wrapping
TomAugspurger Sep 23, 2019
2c88e3b
clarify
TomAugspurger Sep 23, 2019
1b8c83a
reduce sum
TomAugspurger Sep 23, 2019
f1dad2a
skip reduce sum
TomAugspurger Sep 23, 2019
be95ecb
rename
TomAugspurger Sep 23, 2019
903ea2f
move
TomAugspurger Sep 23, 2019
0e1f479
missed
TomAugspurger Sep 23, 2019
c168ecf
missed
TomAugspurger Sep 23, 2019
d06ba73
fixup rename
TomAugspurger Sep 24, 2019
3ba27c3
fixup
TomAugspurger Sep 24, 2019
fe8ee77
doctest
TomAugspurger Sep 24, 2019
d9f63aa
updates
TomAugspurger Sep 24, 2019
d3c49e2
fixups
TomAugspurger Sep 24, 2019
dcb84f9
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 24, 2019
43b51cd
length check
TomAugspurger Sep 24, 2019
4fd2d11
unimplement sum
TomAugspurger Sep 24, 2019
713f807
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 26, 2019
777b295
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 30, 2019
8714a53
fixup
TomAugspurger Sep 30, 2019
41f234c
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Oct 1, 2019
dc9ef3c
rename
TomAugspurger Oct 1, 2019
9419af2
rename
TomAugspurger Oct 1, 2019
462b29d
doc updates
TomAugspurger Oct 1, 2019
0391563
fixups
TomAugspurger Oct 1, 2019
129fe29
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Oct 3, 2019
6aebd8c
move and perf
TomAugspurger Oct 4, 2019
2ee5e30
test is_string_dtype
TomAugspurger Oct 4, 2019
7e92cde
helper
TomAugspurger Oct 4, 2019
rename
TomAugspurger committed Oct 1, 2019
commit dc9ef3cffb5caca17eaa23245db67581c72b6c6f
10 changes: 5 additions & 5 deletions doc/source/getting_started/basics.rst
Original file line number Diff line number Diff line change
Expand Up @@ -986,7 +986,7 @@ not noted for a particular column will be ``NaN``:

tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})

.. _basics.aggregation.mixed_dtypes:
.. _basics.aggregation.mixed_string:

Mixed dtypes
++++++++++++
Expand Down Expand Up @@ -1716,7 +1716,7 @@ always uses them).
.. note::

Prior to pandas 1.0, string methods were only available on ``object`` -dtype
``Series``. Pandas 1.0 added the :class:`TextDtype` which is dedicated
``Series``. Pandas 1.0 added the :class:`StringDtype` which is dedicated
to strings. See :ref:`text.types` for more.

Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
Expand Down Expand Up @@ -1932,15 +1932,15 @@ period (time spans) :class:`PeriodDtype` :class:`Period` :class:`arrays.
sparse :class:`SparseDtype` (none) :class:`arrays.SparseArray` :ref:`sparse`
intervals :class:`IntervalDtype` :class:`Interval` :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer :class:`Int64Dtype`, ... (none) :class:`arrays.IntegerArray` :ref:`integer_na`
Text :class:`TextDtype` :class:`str` :class:`arrays.TextArray` :ref:`text`
Strings :class:`StringDtype` :class:`str` :class:`arrays.StringArray` :ref:`text`
=================== ========================= ================== ============================= =============================

Pandas has two ways to store strings.

1. ``object`` dtype, which can hold any Python object, including strings.
2. :class:`TextDtype`, which is dedicated to strings.
2. :class:`StringDtype`, which is dedicated to strings.

Generally, we recommend using :class:`TextDtype`. See :ref:`text.types` for more.
Generally, we recommend using :class:`StringDtype`. See :ref:`text.types` for more.

Finally, arbitrary objects may be stored using the ``object`` dtype, but should
be avoided to the extent possible (for performance and interoperability with
Expand Down
10 changes: 5 additions & 5 deletions doc/source/reference/arrays.rst
Expand Up @@ -24,7 +24,7 @@ Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.array
Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na`
Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical`
Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse`
Text :class:`TextDtype` :class:`str` :ref:`api.arrays.string`
Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string`
=================== ========================= ================== =============================

Pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
Expand Down Expand Up @@ -467,21 +467,21 @@ Text data
---------

When working with text data, where each valid element is a string or missing,
we recommend using :class:`TextDtype` (with the alias ``"text"``).
we recommend using :class:`StringDtype` (with the alias ``"string"``).

.. autosummary::
:toctree: api/
:template: autosummary/class_without_autosummary.rst

arrays.TextArray
arrays.StringArray

.. autosummary::
:toctree: api/
:template: autosummary/class_without_autosummary.rst

TextDtype
StringDtype

The ``Series.str`` accessor is available for ``Series`` backed by a :class:`arrays.TextArray`.
The ``Series.str`` accessor is available for ``Series`` backed by a :class:`arrays.StringArray`.
See :ref:`api.series.str` for more.
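
A minimal sketch of what "each valid element is a string or missing" looks like in practice. It assumes a pandas release that includes this PR (1.0 or later) and uses the final ``"string"`` alias; the variable names are illustrative only.

```python
import pandas as pd

# A string-dtype Series may contain strings and missing values.
s = pd.Series(["abc", None, "def"], dtype="string")

# The missing entry is reported by isna(), like for other nullable dtypes.
missing_mask = s.isna()
print(missing_mask.tolist())  # [False, True, False]

# The .str accessor works and propagates the missing value.
lengths = s.str.len()
print(lengths.dropna().tolist())  # [3, 3]
```

The exact missing-value sentinel and the dtype of ``lengths`` may vary between pandas releases, so the sketch only relies on ``isna()`` and ``dropna()``.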


Expand Down
16 changes: 8 additions & 8 deletions doc/source/user_guide/text.rst
Expand Up @@ -16,9 +16,9 @@ Text Data Types
There are two main ways to store text data

1. ``object`` -dtype NumPy array.
2. :class:`TextDtype` extension type.
2. :class:`StringDtype` extension type.

We recommend using :class:`TextDtype` to store text data.
We recommend using :class:`StringDtype` to store text data.

Prior to pandas 1.0, ``object`` dtype was the only option. This was unfortunate
for many reasons:
Expand All @@ -32,13 +32,13 @@ for many reasons:
than ``text``.

Currently, the performance of ``object`` dtype arrays of strings and
:class:`arrays.TextArray` are about the same. We expect future enhancements
:class:`arrays.StringArray` are about the same. We expect future enhancements
to significantly increase the performance and lower the memory overhead of
:class:`~arrays.TextArray`.
:class:`~arrays.StringArray`.

.. warning::

``TextArray`` is currently considered experimental. The implementation
``StringArray`` is currently considered experimental. The implementation
and parts of the API may change without warning.

For backwards-compatibility, ``object`` dtype remains the default type we
Expand All @@ -53,7 +53,7 @@ To explicitly request ``text`` dtype, specify the ``dtype``
.. ipython:: python

pd.Series(['a', 'b', 'c'], dtype="text")
pd.Series(['a', 'b', 'c'], dtype=pd.TextDtype())
pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())

Or ``astype`` after the ``Series`` or ``DataFrame`` is created
Member

Not sure of the convention: should Series and DataFrame be ":class:Foo" in this context?

Contributor Author

Don't think we have a formal policy. I vaguely recall a discussion somewhere about doing it ~once per paragraph?
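
For reference, a minimal sketch of the two construction spellings and the ``astype`` route touched by this hunk. It assumes a pandas release that includes this PR (1.0 or later) and the ``"string"`` alias introduced by the rename, not the earlier ``"text"`` one; the variable names are illustrative only.

```python
import pandas as pd

# Request the string extension dtype by alias ...
by_alias = pd.Series(["a", "b", "c"], dtype="string")

# ... or by passing a dtype instance.
by_instance = pd.Series(["a", "b", "c"], dtype=pd.StringDtype())

# Or convert an existing object-dtype Series after it is created.
obj = pd.Series(["a", "b", "c"], dtype=object)
converted = obj.astype("string")

print(by_alias.dtype, by_instance.dtype, converted.dtype)
```

All three results carry a ``StringDtype``; ``obj`` itself is left unchanged because ``astype`` returns a new ``Series``.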


Expand Down Expand Up @@ -170,8 +170,8 @@ It is easy to expand this to return a DataFrame using ``expand``.

s2.str.split('_', expand=True)

When original ``Series`` has :class:`TextDtype`, the output columns will all
be :class:`TextDtype` as well.
When original ``Series`` has :class:`StringDtype`, the output columns will all
be :class:`StringDtype` as well.

It is also possible to limit the number of splits:

Expand Down
18 changes: 9 additions & 9 deletions doc/source/whatsnew/v1.0.0.rst
Expand Up @@ -50,39 +50,39 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

.. _whatsnew_100.text:
.. _whatsnew_100.string:

Dedicated text data type
^^^^^^^^^^^^^^^^^^^^^^^^

We've added :class:`TextDtype`, an extension type dedicated to string data.
We've added :class:`StringDtype`, an extension type dedicated to string data.
Previously, strings were typically stored in object-dtype NumPy arrays.

.. warning::

``TextDtype`` is currently considered experimental. The implementation
``StringDtype`` is currently considered experimental. The implementation
and parts of the API may change without warning.

The text extension type solves several issues with object-dtype NumPy arrays:

1. You can accidentally store a *mixture* of strings and non-strings in an
``object`` dtype array. A ``TextArray`` can only store strings.
``object`` dtype array. A ``StringArray`` can only store strings.
2. ``object`` dtype breaks dtype-specific operations like :meth:`DataFrame.select_dtypes`.
There isn't a clear way to select *just* text while excluding non-text,
but still object-dtype columns.
3. When reading code, the contents of an ``object`` dtype array is less clear
than ``text``.
than ``string``.


.. ipython:: python

pd.Series(['abc', None, 'def'], dtype=pd.TextDtype())
pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())

You can use the alias ``"text"`` as well.
You can use the alias ``"string"`` as well.

.. ipython:: python

s = pd.Series(['abc', None, 'def'], dtype="text")
s = pd.Series(['abc', None, 'def'], dtype="string")
s

The usual string accessor methods work. Where appropriate, the return type
Expand All @@ -91,7 +91,7 @@ of the Series or columns of a DataFrame will also have string dtype.
s.str.upper()
s.str.split('b', expand=True).dtypes

We recommend explicitly using the ``text`` data type when working with strings.
We recommend explicitly using the ``string`` data type when working with strings.
See :ref:`text.types` for more.
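
A minimal sketch of the behaviour described in this whatsnew entry: string accessor methods on a string-dtype ``Series`` return string-dtype results. It assumes pandas 1.0 or later with the ``"string"`` alias; ``upper`` and ``parts`` are illustrative names.

```python
import pandas as pd

s = pd.Series(["abc", None, "def"], dtype="string")

# Element-wise string methods keep the string dtype.
upper = s.str.upper()
print(upper.dtype)

# With expand=True the result is a DataFrame whose columns
# are expected to be string dtype as well.
parts = s.str.split("b", expand=True)
print(parts.dtypes)
```

Checking ``isinstance(dtype, pd.StringDtype)`` is more robust than comparing printed dtype names, since the repr may differ between releases.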

.. _whatsnew_1000.enhancements.other:
Expand Down
2 changes: 1 addition & 1 deletion pandas/__init__.py
Expand Up @@ -66,7 +66,7 @@
PeriodDtype,
IntervalDtype,
DatetimeTZDtype,
TextDtype,
StringDtype,
# missing
isna,
isnull,
Expand Down
4 changes: 2 additions & 2 deletions pandas/arrays/__init__.py
Expand Up @@ -11,7 +11,7 @@
PandasArray,
PeriodArray,
SparseArray,
TextArray,
StringArray,
TimedeltaArray,
)

Expand All @@ -23,6 +23,6 @@
"PandasArray",
"PeriodArray",
"SparseArray",
"TextArray",
"StringArray",
"TimedeltaArray",
]
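
A minimal sketch of how the renamed public names can be reached after this change, assuming pandas 1.0 or later. The concrete array class returned by ``pd.array`` is not asserted here, because later releases may use a different backing array class for the same dtype.

```python
import pandas as pd

# The array class is exported under pandas.arrays after the rename.
print(hasattr(pd.arrays, "StringArray"))

# pd.array with the "string" alias builds a string extension array.
arr = pd.array(["a", None, "c"], dtype="string")
print(type(arr).__name__, arr.dtype)

# A Series wrapping the array exposes the same dtype.
ser = pd.Series(arr)
print(ser.dtype)
```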
2 changes: 1 addition & 1 deletion pandas/core/api.py
Expand Up @@ -23,7 +23,7 @@
UInt32Dtype,
UInt64Dtype,
)
from pandas.core.arrays.text import TextDtype
from pandas.core.arrays.text import StringDtype
from pandas.core.construction import array
from pandas.core.groupby import Grouper, NamedAgg
from pandas.core.index import (
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/arrays/__init__.py
Expand Up @@ -10,5 +10,5 @@
from .numpy_ import PandasArray, PandasDtype # noqa: F401
from .period import PeriodArray, period_array # noqa: F401
from .sparse import SparseArray # noqa: F401
from .text import TextArray # noqa: F401
from .text import StringArray # noqa: F401
from .timedeltas import TimedeltaArray # noqa: F401