ENH: Add replace ufunc for bytes and unicode dtypes by lysnikolaou · Pull Request #25171 · numpy/numpy · GitHub

ENH: Add replace ufunc for bytes and unicode dtypes #25171

Merged 5 commits on Dec 24, 2023

Conversation

lysnikolaou
Member

No description provided.

@lysnikolaou lysnikolaou force-pushed the string-ufuncs-replace-v2 branch from e774722 to 19396d2 Compare November 17, 2023 15:31
@lysnikolaou
Member Author

@seberg @ngoldbaum Another somewhat hacky ufunc for replace here. What's your opinion on this?

Member
@seberg seberg left a comment

I am generally OK with the hack here. Yes, the casting safety is side-stepped, but since we don't expose `out` when the ufunc is wrapped, that isn't a huge issue.

counts = count(x1, x2, 0, numpy.iinfo(numpy.int_).max)
buffersizes = str_len(x1) + counts * (str_len(x3)-str_len(x2))
max_buffersize = numpy.max(buffersizes)
out = numpy.empty(x1.shape, dtype=f"{x1.dtype.char}{max_buffersize}")
Member

The one caveat here is that for Nathan's new string dtype (or a user-defined one) this doesn't make sense, so we should only do it for the NumPy builtin strings (maybe in principle also for some other fixed-length ones, but that seems less important).

Member Author

That's true. Do you think it'd be fine to leave it as is for now and deal with it once stringdtype is in? cc @ngoldbaum
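
For reference, the buffer-size arithmetic in the snippet under review can be sketched in plain Python. This is only an illustration, not the PR's code: `str.count` and `len` stand in for the `count` and `str_len` ufuncs, and the helper names `replace_buffer_size` and `max_replace_buffer_size` are hypothetical. The result is the fixed width a builtin `U` or `S` output array would need, which is why the approach does not carry over to a variable-width string dtype.

```python
def replace_buffer_size(s, old, new):
    """Length of s.replace(old, new) when every occurrence is replaced.

    Hypothetical helper: mirrors str_len(x1) + counts * (str_len(x3) - str_len(x2))
    from the reviewed snippet, using plain Python strings instead of ufuncs.
    """
    return len(s) + s.count(old) * (len(new) - len(old))


def max_replace_buffer_size(strings, old, new):
    """Widest result over all elements, i.e. the itemsize a fixed-width dtype needs."""
    return max(replace_buffer_size(s, old, new) for s in strings)


# Inputs taken from the np.char.replace docstring quoted later in this thread.
strings = ["That is a mango", "Monkeys eat mangos"]
width = max_replace_buffer_size(strings, "mango", "banana")
print(width)
```

With these inputs the computed width agrees with the `<U19` result dtype shown in the `np.char.replace` docstring example.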

@lysnikolaou lysnikolaou force-pushed the string-ufuncs-replace-v2 branch from 61330b5 to cc561bf Compare December 6, 2023 14:08
@lysnikolaou lysnikolaou force-pushed the string-ufuncs-replace-v2 branch from cc561bf to bc5f633 Compare December 6, 2023 14:23
@lysnikolaou
Member Author

Let's try to merge this before the 2.0 release candidate if possible.

@ngoldbaum
Member

Missed this, will try to look at it tomorrow.

At the last community meeting Chuck said something about the middle of January at the earliest, so we have some time.

Are you still planning to do an np.strings PR? If not I'm happy to try.

I'm also planning to try working on stringdtype ufuncs next week. I'll ping you if I run into trouble adapting your code to work with utf8.

@lysnikolaou
Member Author

> Are you still planning to do an np.strings PR? If not I'm happy to try.

Yup, I'll start working on it today.

@ngoldbaum ngoldbaum merged commit 1b861a2 into numpy:main Dec 24, 2023
@seiko2plus
Member
seiko2plus commented Dec 24, 2023

It seems this PR is responsible for the following test failure across all CI jobs:

Test failure
=================================== FAILURES ===================================
  _____________________ TestMethodsScalarValues.test_replace _____________________
  
  self = <numpy._core.tests.test_defchararray.TestMethodsScalarValues object at 0x7f27b5cde3a0>
  
      def test_replace(self):
  >       assert_equal(np.char.replace('Python is good', 'good', 'great'),
                       'Python is great')
  
  self       = <numpy._core.tests.test_defchararray.TestMethodsScalarValues object at 0x7f27b5cde3a0>
  
  numpy/_core/tests/test_defchararray.py:736: 
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
  
  a = 'Python is good', old = 'good', new = 'great', count = 9223372036854775807
  
      @array_function_dispatch(_replace_dispatcher)
      def replace(a, old, new, count=None):
          """
          For each element in `a`, return a copy of the string with all
          occurrences of substring `old` replaced by `new`.
      
          Calls :meth:`str.replace` element-wise.
      
          Parameters
          ----------
          a : array-like of str or unicode
      
          old, new : str or unicode
      
          count : int, optional
              If the optional argument `count` is given, only the first
              `count` occurrences are replaced.
      
          Returns
          -------
          out : ndarray
              Output array of str or unicode, depending on input type
      
          See Also
          --------
          str.replace
      
          Examples
          --------
          >>> a = np.array(["That is a mango", "Monkeys eat mangos"])
          >>> np.char.replace(a, 'mango', 'banana')
          array(['That is a banana', 'Monkeys eat bananas'], dtype='<U19')
      
          >>> a = np.array(["The dish is fresh", "This is it"])
          >>> np.char.replace(a, 'is', 'was')
          array(['The dwash was fresh', 'Thwas was it'], dtype='<U19')
      
          """
          max_int64 = numpy.iinfo(numpy.int64).max
          count = count if count is not None else max_int64
      
          counts = numpy._core.umath.count(a, old, 0, max_int64)
          buffersizes = (
              numpy._core.umath.str_len(a)
              + counts * (numpy._core.umath.str_len(new) -
                          numpy._core.umath.str_len(old))
          )
          max_buffersize = numpy.max(buffersizes)
  >       out = numpy.empty(a.shape, dtype=f"{a.dtype.char}{max_buffersize}")
  E       AttributeError: 'str' object has no attribute 'shape'

@ngoldbaum
Member

Thanks! Should be fixed by #25484
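
The traceback above comes down to a plain Python `str` having no `.shape` attribute. A minimal sketch of one way to guard against that is to coerce the input with `np.asanyarray` before allocating the output. This is an assumption for illustration only: the actual change in #25484 is not shown in this thread, the helper name `allocate_replace_output` is hypothetical, and the sketch assumes `old` and `new` are scalar strings of the same kind as `a`.

```python
import numpy as np


def allocate_replace_output(a, old, new):
    """Allocate a fixed-width output array for a replace operation.

    Hypothetical helper, not NumPy's implementation. Assumes `old` and `new`
    are scalar strings (so len() applies) matching the string kind of `a`.
    """
    # A plain Python str becomes a 0-d array with shape (), so .shape and
    # .dtype are always available below.
    arr = np.asanyarray(a)
    counts = np.char.count(arr, old)
    sizes = np.char.str_len(arr) + counts * (len(new) - len(old))
    max_size = int(np.max(sizes))
    return np.empty(arr.shape, dtype=f"{arr.dtype.char}{max_size}")


# The scalar call from the failing test no longer trips on `.shape`.
out = allocate_replace_output('Python is good', 'good', 'great')
print(out.shape, out.dtype)
```

Coercing once at the top keeps the scalar and array code paths identical; only the allocation is shown here, not the replacement itself.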
