[SPARK-42647][PYTHON] Change alias for numpy deprecated and removed types #40220
Conversation
numpy 1.24.0: The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed.
numpy 1.20.0: Using the aliases of builtin types like np.int is deprecated.
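For illustration, the swap those release notes call for can be sketched as follows (a standalone sketch, not pyspark code; the `REPLACEMENTS` table and array here are my own names):

```python
import numpy as np

# Replacement table implied by the release notes above: the deprecated
# aliases on the left were just the builtins / NumPy scalar types on the
# right, so swapping them changes no behavior.
REPLACEMENTS = {
    "np.object": object,
    "np.bool": bool,
    "np.int": int,
    "np.float": float,
    "np.str": str,
    "np.object0": np.object_,
    "np.bool8": np.bool_,
}

# Using the builtin directly works on every numpy version:
arr = np.array(["a", 1], dtype=object)  # formerly dtype=np.object
assert arr.dtype == np.object_
```

The underscore-suffixed scalar types (`np.object_`, `np.bool_`) are unaffected by the deprecations and remain valid.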
Sounds good. Mind filing a JIRA please?
Good morning @HyukjinKwon, I requested a JIRA account yesterday and am still waiting for confirmation.
Yeah, if the test failures look related, let's fix them.
If there are not a lot, it would be great to fix them together. If the occurrences are a lot, feel free to create a separate PR.
If you rebase or merge the upstream, the tests will be triggered, and the GitHub check status will be updated.
@HyukjinKwon: I grepped for all the deprecated types and list my findings below; please let me know if you see something that should not be changed. For the deprecations introduced by numpy 1.24.0, grepping the master branch as cloned yesterday:
As we can see, we have only one np.object0, so we are pretty safe with these numpy changes. For the deprecations introduced by numpy 1.20.0 that resulted in removals in 1.24.0, again grepping the master branch as cloned yesterday:
As you can see, the two most difficult types are int and float, which I find even in Scala files and one JS file. I will go through the lines thoroughly and let you know.
I fixed all the changes I could find.
I suggest we do the review, then I squash and remove the WIP tag. What do you say?
We need to file a JIRA too.
Actually, I don't know; this change might theoretically be breaking? I wasn't clear.
@srowen: I will create the JIRA; I am still waiting on an answer from the mailing list.
I don't know, maybe it doesn't. For example if I have something like
@srowen: While that code does not exist in pyspark (it is only used as an example in the infer_return_type() function), I can tell you that with numpy>=1.24.0 your example will result in an AttributeError.
OK. After this change, would this still work with numpy 1.20.0, for example? I think that's the question. |
Actually, to make it even clearer: the only functional changes are in the conversion.py file. The rest of the changes are inside tests, in the user guide, or in docstrings describing specific functions. Since I am not an expert in these tests, I will wait for the reviewer and for people with more experience in the pyspark code. But to answer your question: these types are aliases for classic Python types, so yes, they should work with all numpy versions [1][2]. The error or warning comes from the call into numpy. I attached two links which explain the use case.
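The point that these aliases were just the builtin Python types can be shown in isolation (a minimal sketch, not pyspark code; the variable names are mine):

```python
import numpy as np

# The removed np.<builtin> aliases were literally the builtins, so the
# replacement is a pure rename; behavior is identical on any numpy version.
values = np.array(["x", 1], dtype=object)  # formerly dtype=np.object
assert isinstance(values[0], str)          # plain Python objects inside
assert values.dtype == np.object_          # underscore-suffixed type is stable
assert bool(np.bool_(True)) is True        # np.bool_ is also unaffected
```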
Maybe let's create a JIRA.
BTW, what's the title of your email?
The email never arrived, although I sent it to private[at]spark.apache.org as the contribution guide says. Consequently, I just created the JIRA.
Seems like the linter fails (https://github.com/aimtsou/spark/actions/runs/4304579333/jobs/7506798202).
Yes, but this is the original code. Shall I remove the comment? And how did it pass the linter to get into master?
The line changed, and now the 'ignore' is no longer relevant. Yes, remove it to pass the linter.
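Concretely, the changed line drops both the alias and the now-stale mypy suppression (sketched here with a stand-in `corrected_dtypes` list; the real code assigns into the per-column dtype list built inside toPandas()):

```python
# Before (breaks on numpy >= 1.24 and needed a mypy suppression):
#     corrected_dtypes[index] = np.object  # type: ignore[attr-defined]
# After: `object` is the builtin the alias always pointed to, so no
# suppression comment is needed.
corrected_dtypes = [None]  # stand-in for toPandas()'s per-column dtype list
index = 0
corrected_dtypes[index] = object
```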
The tests are complete; shall I squash and remove the WIP tag from the pull request?
Yes, remove WIP, just for completeness. No need to squash; the script does that.
### Problem description
NumPy has started changing the aliases of some of its data types, which means that users with the latest version of numpy will face either warnings or errors, depending on the type they are using. This affects all users of numpy > 1.20.0. One of the types was already fixed back in September with pull request #37817.

- [numpy 1.24.0](numpy/numpy#22607): The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed.
- [numpy 1.20.0](numpy/numpy#14882): Using the aliases of builtin types like np.int is deprecated.

### What changes were proposed in this pull request?
From numpy 1.20.0 we receive a deprecation warning on np.object (https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations), and from numpy 1.24.0 an AttributeError:

```
attr = 'object'

    def __getattr__(attr):
        # Warn for expired attributes, and return a dummy function
        # that always raises an exception.
        import warnings
        try:
            msg = __expired_functions__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)

            def _expired(*args, **kwds):
                raise RuntimeError(msg)

            return _expired

        # Emit warnings for deprecated attributes
        try:
            val, msg = __deprecated_attrs__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return val

        if attr in __future_scalars__:
            # And future warnings for those that will change, but also give
            # the AttributeError
            warnings.warn(
                f"In the future `np.{attr}` will be defined as the "
                "corresponding NumPy scalar.", FutureWarning, stacklevel=2)

        if attr in __former_attrs__:
>           raise AttributeError(__former_attrs__[attr])
E           AttributeError: module 'numpy' has no attribute 'object'.
E           `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
E           The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E               https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

From numpy 1.24.0 we receive a deprecation warning on np.object0, every other np.<datatype>0 alias, and np.bool8:

```
>>> np.object0(123)
<stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24)
```

### Why are the changes needed?
The changes are needed so pyspark can be compatible with the latest numpy and avoid:
- attribute errors on data types deprecated in version 1.20.0: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
- warnings on data types deprecated in version 1.24.0: https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations

### Does this PR introduce _any_ user-facing change?
The change suppresses the warnings coming from numpy 1.24.0 and the error coming from numpy 1.24.0 for the aliases deprecated in 1.20.0.

### How was this patch tested?
I assume that the existing tests should catch this (see the Extra questions section). I found this to be a problem in a project at work, where our unit tests use the toPandas() function, which converts to np.object. Attaching the run result of our test:

```
/usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in run_testcase
    self.handler.compare_df(result, expected, config=self.compare_config)
/usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in compare_df
    actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True)
/usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: in toPandas
    corrected_dtypes[index] = np.object  # type: ignore[attr-defined]

[... same numpy __getattr__ traceback and AttributeError as above ...]

/usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError
```

Although I cannot provide that code, the following Python session shows the problem:

```
>>> import numpy as np
>>> np.object0(123)
<stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24)
123
>>> np.object(123)
<stdin>:1: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

I do not have a use case for np.object0 in my tests, but I fixed it as numpy suggests.

### Supported versions
I propose this fix be included in pyspark 3.3 and onwards.

### JIRA
I know a JIRA ticket should be created; I sent an email and am waiting for the answer so I can document the case there as well.

### Extra questions
By grepping for np.bool and np.object I see that the tests include them. Shall we change them also? Data types with a trailing _ are, I think, not affected.

```
git grep np.object
python/pyspark/ml/functions.py:        return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list))
python/pyspark/ml/functions.py:        return any(data.dtypes == np.object_) and any(
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[4], np.object)  # datetime.date
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[6], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[7], np.object)

git grep np.bool
python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool    BooleanType
python/pyspark/pandas/indexing.py:            isinstance(key, np.bool_) for key in cols_sel
python/pyspark/pandas/tests/test_typedef.py:            np.bool: (np.bool, BooleanType()),
python/pyspark/pandas/tests/test_typedef.py:            bool: (np.bool, BooleanType()),
python/pyspark/pandas/typedef/typehints.py:    elif tpe in (bool, np.bool_, "bool", "?"):
python/pyspark/sql/connect/expressions.py:            assert isinstance(value, (bool, np.bool_))
python/pyspark/sql/connect/expressions.py:        elif isinstance(value, np.bool_):
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[2], np.bool)
python/pyspark/sql/tests/test_functions.py:            (np.bool_, [("true", "boolean")]),
```

If yes, since the bool change was merged already, should we fix it too?

Closes #40220 from aimtsou/numpy-patch.

Authored-by: Aimilios Tsouvelekakis <aimtsou@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit b3c26b8)
Signed-off-by: Sean Owen <srowen@gmail.com>
Merged to master/3.4/3.3, for consistency with https://issues.apache.org/jira/browse/SPARK-40376