8000 REF: StringArray._from_sequence, use less memory by topper-123 · Pull Request #35519 · pandas-dev/pandas · GitHub
[go: up one dir, main page]

Skip to content

REF: StringArray._from_sequence, use less memory #35519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next Next commit
REF: StringArray._from_sequence
  • Loading branch information
topper-123 committed Aug 16, 2020
commit df8e4d6498ee63fef64e4d7dd704b65f78e21e12
5 changes: 5 additions & 0 deletions doc/source/whatsnew/v1.1.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,11 @@ Categorical
- Bug in :class:`DataFrame` constructor failing to raise ``ValueError`` in some cases when data and index have mismatched lengths (:issue:`33437`)
-

**Strings**

- fix memory usage issue when instantiating large :class:`pandas.arrays.StringArray` (:issue:`35499`)


.. ---------------------------------------------------------------------------

.. _whatsnew_111.contributors:
Expand Down
14 changes: 14 additions & 0 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1698,6 +1698,20 @@ cpdef bint is_string_array(ndarray values, bint skipna=False):
return validator.validate(values)


cpdef ndarray ensure_string_array(ndarray values, object na_value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should return a new array
and pls
add a doc string

also i think we have a routine like this already

cdef:
Py_ssize_t i = 0, n = len(values)

for i in range(n):
val = values[i]
if not checknull(val):
values[i] = str(val)
else:
values[i] = na_value

return values


cdef class BytesValidator(Validator):
cdef inline bint is_value_typed(self, object value) except -1:
return isinstance(value, bytes)
Expand Down
21 changes: 4 additions & 17 deletions pandas/core/arrays/string_.py
Original file line number Diff line number Diff line change
Expand Up @@ -177,11 +177,10 @@ class StringArray(PandasArray):

def __init__(self, values, copy=False):
values = extract_array(values)
skip_validation = isinstance(values, type(self))

super().__init__(values, copy=copy)
self._dtype = StringDtype()
if not skip_validation:
if not isinstance(values, type(self)):
self._validate()

def _validate(self):
Expand All @@ -195,28 +194,16 @@ def _validate(self):
)

@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
def _from_sequence(cls, scalars, dtype=None, copy=True):
if dtype:
assert dtype == "string"

result = np.asarray(scalars, dtype="object")
if copy and result is scalars:
result = result.copy()

# Standardize all missing-like values to NA
# TODO: it would be nice to do this in _validate / lib.is_string_array
# We are already doing a scan over the values there.
na_values = isna(result)
has_nans = na_values.any()
if has_nans and result is scalars:
# force a copy now, if we haven't already
result = result.copy()

# convert to str, then to object to avoid dtype like '<U3', then insert na_value
result = np.asarray(result, dtype=str)
result = np.asarray(result, dtype="object")
if has_nans:
result[na_values] = StringDtype.na_value
# convert non-na-likes to str, and nan-likes to StringDtype.na_value
result = lib.ensure_string_array(result, StringDtype.na_value)

return cls(result)

Expand Down
14 changes: 9 additions & 5 deletions pandas/tests/arrays/string_/test_string.py
Original file line number Diff line number Diff line change
Expand Up @@ -206,12 +206,16 @@ def test_constructor_raises():

@pytest.mark.parametrize("copy", [True, False])
def test_from_sequence_no_mutate(copy):
a = np.array(["a", np.nan], dtype=object)
original = a.copy()
result = pd.arrays.StringArray._from_sequence(a, copy=copy)
expected = pd.arrays.StringArray(np.array(["a", pd.NA], dtype=object))
nan_arr = np.array(["a", np.nan], dtype=object)
na_arr = np.array(["a", pd.NA], dtype=object)

result = pd.arrays.StringArray._from_sequence(nan_arr, copy=copy)
expected = pd.arrays.StringArray(na_arr)

tm.assert_extension_array_equal(result, expected)
tm.assert_numpy_array_equal(a, original)

expected = nan_arr if copy else na_arr
tm.assert_numpy_array_equal(nan_arr, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this test is correctly changed. I think we should never mutate the input, whether copy is True or False (which is what it was testing before).
IMO the copy keyword is to indicate to simply always copy, or if False it will only copy when needed. And when you need to mutate, I would say the copy is needed.

It's also not taking a copy of the original array, so not even checking it wasn't changed (because even if it was changed, it would still compare equal to itself)



def test_astype_int():
Expand Down
0