ENH: Add replace ufunc for bytes and unicode dtypes #25171

lysnikolaou · 2023-11-17T15:13:05Z

No description provided.

lysnikolaou · 2023-11-17T15:32:52Z

@seberg @ngoldbaum Another somewhat hacky ufunc for replace here. What's your opinion on this?

seberg

I am generally OK with the hack here. Yes, the casting safety is side-stepped, but since we don't expose out if it is wrapped, that isn't a huge issue.

numpy/_core/umath.py

seberg · 2023-11-20T11:14:18Z

+    counts = count(x1, x2, 0, numpy.iinfo(numpy.int_).max)
+    buffersizes = str_len(x1) + counts * (str_len(x3)-str_len(x2))
+    max_buffersize = numpy.max(buffersizes)
+    out = numpy.empty(x1.shape, dtype=f"{x1.dtype.char}{max_buffersize}")


The one caveat here is that this doesn't make sense for Nathan's new string dtype (or a user-defined one), so we should only do it for the NumPy builtin strings (maybe in principle also for some other fixed-length ones, but that seems less important).

That's true. Do you think it'd be fine to leave it as is for now and deal with it once stringdtype is in? cc @ngoldbaum
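For readers following the buffer-size arithmetic in the diff above, here is a minimal sketch of the same calculation using only public `np.char` functions rather than the private ufuncs in `numpy._core.umath`. The helper name `max_replaced_width` is hypothetical and not part of the PR; it assumes a non-empty array with a fixed-width `U` or `S` dtype, and that `old` and `new` are plain Python strings.

```python
import numpy as np


def max_replaced_width(a, old, new):
    """Hypothetical helper: widest result after replacing `old` with `new`.

    Mirrors the arithmetic in the diff: each occurrence of `old` changes
    the element length by len(new) - len(old).
    """
    a = np.asarray(a)
    counts = np.char.count(a, old)  # occurrences of `old` per element
    sizes = np.char.str_len(a) + counts * (len(new) - len(old))
    return int(np.max(sizes))  # assumes `a` is not empty


a = np.array(["That is a mango", "Monkeys eat mangos"])
width = max_replaced_width(a, "mango", "banana")

# Build the fixed-width output dtype the same way the diff does.
out_dtype = np.dtype(f"{a.dtype.char}{width}")
print(width, out_dtype)
```

The width only has a meaning for fixed-width dtypes, which is the caveat raised above: a variable-width string dtype has no per-array item size to precompute.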

lysnikolaou · 2023-12-21T23:49:25Z

Let's try to merge this before the 2.0 release candidate if possible.

ngoldbaum · 2023-12-22T00:04:03Z

Missed this, will try to look at it tomorrow.

At the last community meeting Chuck said something about the middle of January at the earliest, so we have some time.

Are you still planning to do an np.strings PR? If not I'm happy to try.

I'm also planning to try working on stringdtype ufuncs next week. I'll ping you if I run into trouble adapting your code to work with utf8.

lysnikolaou · 2023-12-22T11:11:59Z

> Are you still planning to do an np.strings PR? If not I'm happy to try.

Yup, I'll start working on it today.

seiko2plus · 2023-12-24T12:34:56Z

It seems this PR is responsible for the following test failure across all CI jobs:

Test failure

=================================== FAILURES ===================================
  _____________________ TestMethodsScalarValues.test_replace _____________________
  
  self = <numpy._core.tests.test_defchararray.TestMethodsScalarValues object at 0x7f27b5cde3a0>
  
      def test_replace(self):
  >       assert_equal(np.char.replace('Python is good', 'good', 'great'),
                       'Python is great')
  
  self       = <numpy._core.tests.test_defchararray.TestMethodsScalarValues object at 0x7f27b5cde3a0>
  
  numpy/_core/tests/test_defchararray.py:736: 
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
  
  a = 'Python is good', old = 'good', new = 'great', count = 9223372036854775807
  
      @array_function_dispatch(_replace_dispatcher)
      def replace(a, old, new, count=None):
          """
          For each element in `a`, return a copy of the string with all
          occurrences of substring `old` replaced by `new`.
      
          Calls :meth:`str.replace` element-wise.
      
          Parameters
          ----------
          a : array-like of str or unicode
      
          old, new : str or unicode
      
          count : int, optional
              If the optional argument `count` is given, only the first
              `count` occurrences are replaced.
      
          Returns
          -------
          out : ndarray
              Output array of str or unicode, depending on input type
      
          See Also
          --------
          str.replace
      
          Examples
          --------
          >>> a = np.array(["That is a mango", "Monkeys eat mangos"])
          >>> np.char.replace(a, 'mango', 'banana')
          array(['That is a banana', 'Monkeys eat bananas'], dtype='<U19')
      
          >>> a = np.array(["The dish is fresh", "This is it"])
          >>> np.char.replace(a, 'is', 'was')
          array(['The dwash was fresh', 'Thwas was it'], dtype='<U19')
      
          """
          max_int64 = numpy.iinfo(numpy.int64).max
          count = count if count is not None else max_int64
      
          counts = numpy._core.umath.count(a, old, 0, max_int64)
          buffersizes = (
              numpy._core.umath.str_len(a)
              + counts * (numpy._core.umath.str_len(new) -
                          numpy._core.umath.str_len(old))
          )
          max_buffersize = numpy.max(buffersizes)
  >       out = numpy.empty(a.shape, dtype=f"{a.dtype.char}{max_buffersize}")
  E       AttributeError: 'str' object has no attribute 'shape'

ngoldbaum · 2023-12-24T13:01:58Z

Thanks! Should be fixed by #25484
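The traceback above fails because a plain Python `str` has no `.shape` or `.dtype`. A minimal sketch of one way to avoid that, assuming the input is coerced with `np.asanyarray` before the buffer is sized, is shown below. This is an illustration only and is not claimed to be the change made in #25484; the function name `presized_empty` is hypothetical.

```python
import numpy as np


def presized_empty(a, old, new):
    """Hypothetical sketch: allocate the output buffer for scalar or array input."""
    # A plain Python str becomes a 0-d unicode array, so .shape and
    # .dtype are available afterwards.
    arr = np.asanyarray(a)
    counts = np.char.count(arr, old)
    sizes = np.char.str_len(arr) + counts * (len(new) - len(old))
    width = int(np.max(sizes))
    return np.empty(arr.shape, dtype=f"{arr.dtype.char}{width}")


out = presized_empty("Python is good", "good", "great")
print(out.shape, out.dtype)
```

With the scalar from the failing test, the coerced input is 0-dimensional, so the allocated buffer is 0-dimensional as well and wide enough to hold 'Python is great'.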

github-actions bot added the 01 - Enhancement label Nov 17, 2023

ENH: Add replace ufunc for bytes and unicode dtypes

19396d2

lysnikolaou force-pushed the string-ufuncs-replace-v2 branch from e774722 to 19396d2 Compare November 17, 2023 15:31

seberg reviewed Nov 20, 2023

Fix strcmp

4792b9f

lysnikolaou force-pushed the string-ufuncs-replace-v2 branch from 61330b5 to cc561bf Compare December 6, 2023 14:08

lysnikolaou added 2 commits December 6, 2023 15:22

Merge branch 'main' into string-ufuncs-replace-v2

5556f98

Rename to _replace and move implementation to np.char

bc5f633

lysnikolaou force-pushed the string-ufuncs-replace-v2 branch from cc561bf to bc5f633 Compare December 6, 2023 14:23

Fix tests

6c59e6b

ngoldbaum merged commit 1b861a2 into numpy:main Dec 24, 2023

ngoldbaum mentioned this pull request Dec 24, 2023

BUG: handle scalar input in np.char.replace #25484

Merged

This was referenced Dec 31, 2023

BUG: segfault in devdeps CI due to numpy-dev bug astropy/astropy#15797

Closed

BUG: A segfault in chararray (2.0.0dev0 regression) #25513

Closed

mhvk mentioned this pull request Dec 31, 2023

BUG: three string ufunc bugs, one leading to segfault #25515

Merged

neutrinoceros mentioned this pull request Feb 14, 2024

Deprecate use of chararray in io.fits astropy/astropy#3862

Open
