[SPARK-42647][PYTHON] Change alias for numpy deprecated and removed types #40220
Conversation
numpy 1.24.0: The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed.
numpy 1.20.0: Using the aliases of builtin types like np.int is deprecated.
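For illustration, the swap those release notes call for can be sketched as follows (a standalone sketch, not pyspark code; the `REPLACEMENTS` table and array here are my own names):

```python
import numpy as np

# Replacement table implied by the release notes above: the deprecated
# aliases on the left were just the builtins / NumPy scalar types on the
# right, so swapping them changes no behavior.
REPLACEMENTS = {
    "np.object": object,
    "np.bool": bool,
    "np.int": int,
    "np.float": float,
    "np.str": str,
    "np.object0": np.object_,
    "np.bool8": np.bool_,
}

# Using the builtin directly works on every numpy version:
arr = np.array(["a", 1], dtype=object)  # formerly dtype=np.object
assert arr.dtype == np.object_
```

The underscore-suffixed scalar types (`np.object_`, `np.bool_`) are unaffected by the deprecations and remain valid.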
Sounds good. Mind filing a JIRA please?
Good morning @HyukjinKwon, I requested a JIRA account yesterday and am still waiting for confirmation.
Yeah, if the test failures look related, let's fix them.
If there are not a lot, it would be great to fix them together. If the occurrences are a lot, feel free to create a separate PR.
If you rebase or merge the upstream, the tests will be triggered, and the GitHub check status will be updated.
@HyukjinKwon: I grepped for all the deprecated types and list my findings below; please let me know if you see something that should not be changed. For the deprecations introduced by numpy 1.24.0, grepping the master branch as cloned yesterday:
As we can see, we have only one np.object0, so we are pretty safe with these numpy changes. For the deprecations introduced by numpy 1.20.0 that resulted in removals in 1.24.0, again grepping the master branch as cloned yesterday:
As you can see, the two most difficult types are int and float, which I find even in Scala files and one JS file. I will go through the lines thoroughly and let you know.
I fixed all the changes I could find.
I suggest we do the review, then I squash and remove the WIP tag. What do you say?
We need to file a JIRA too.
Actually, I don't know; this change might theoretically be breaking? I wasn't clear.
@srowen: I will create the JIRA; I am still waiting on an answer from the mailing list.
I don't know, maybe it doesn't. For example if I have something like
@srowen: While that code does not exist in pyspark (it is only used as an example in the infer_return_type() function), I can tell you that with numpy>=1.24.0 your example will result in an AttributeError.
OK. After this change, would this still work with numpy 1.20.0, for example? I think that's the question. |
Actually, to make it even clearer: the only functional changes are in the conversion.py file. The rest of the changes are inside tests, in the user guide, or in docstrings describing specific functions. Since I am not an expert in these tests, I will wait for the reviewer and for people with more experience in the pyspark code. But to answer your question: these types are aliases for classic Python types, so yes, they should work with all numpy versions [1][2]. The error or warning comes from the call into numpy. I attached two links which explain the use case.
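The point that these aliases were just the builtin Python types can be shown in isolation (a minimal sketch, not pyspark code; the variable names are mine):

```python
import numpy as np

# The removed np.<builtin> aliases were literally the builtins, so the
# replacement is a pure rename; behavior is identical on any numpy version.
values = np.array(["x", 1], dtype=object)  # formerly dtype=np.object
assert isinstance(values[0], str)          # plain Python objects inside
assert values.dtype == np.object_          # underscore-suffixed type is stable
assert bool(np.bool_(True)) is True        # np.bool_ is also unaffected
```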
Maybe let's create a JIRA.
BTW, what's the title of your email?
The email never arrived, although I sent it to private[at]spark.apache.org as the contribution guide says. Consequently, I just created the JIRA.
Seems like the linter fails (https://github.com/aimtsou/spark/actions/runs/4304579333/jobs/7506798202).
Yes, but this is the original code. Shall I remove the comment? And how did it pass the linter to get into master?
The line changed, and now the 'ignore' is no longer relevant. Yes, remove it to pass the linter.
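Concretely, the changed line drops both the alias and the now-stale mypy suppression (sketched here with a stand-in `corrected_dtypes` list; the real code assigns into the per-column dtype list built inside toPandas()):

```python
# Before (breaks on numpy >= 1.24 and needed a mypy suppression):
#     corrected_dtypes[index] = np.object  # type: ignore[attr-defined]
# After: `object` is the builtin the alias always pointed to, so no
# suppression comment is needed.
corrected_dtypes = [None]  # stand-in for toPandas()'s per-column dtype list
index = 0
corrected_dtypes[index] = object
```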
The tests are complete; shall I squash and remove the WIP tag from the pull request?
Yes, remove WIP, just for completeness. No need to squash; the script does that.
### Problem description
NumPy has started changing the aliases of some of its data types, which means that users with the latest version of numpy will face either warnings or errors, depending on the type they are using. This affects all users of numpy > 1.20.0. One of the types was already fixed back in September with pull request #37817.

- [numpy 1.24.0](numpy/numpy#22607): The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed.
- [numpy 1.20.0](numpy/numpy#14882): Using the aliases of builtin types like np.int is deprecated.

### What changes were proposed in this pull request?
From numpy 1.20.0 we receive a deprecation warning on np.object (https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations), and from numpy 1.24.0 an AttributeError:

```
attr = 'object'

    def __getattr__(attr):
        # Warn for expired attributes, and return a dummy function
        # that always raises an exception.
        import warnings
        try:
            msg = __expired_functions__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)

            def _expired(*args, **kwds):
                raise RuntimeError(msg)

            return _expired

        # Emit warnings for deprecated attributes
        try:
            val, msg = __deprecated_attrs__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return val

        if attr in __future_scalars__:
            # And future warnings for those that will change, but also give
            # the AttributeError
            warnings.warn(
                f"In the future `np.{attr}` will be defined as the "
                "corresponding NumPy scalar.", FutureWarning, stacklevel=2)

        if attr in __former_attrs__:
>           raise AttributeError(__former_attrs__[attr])
E           AttributeError: module 'numpy' has no attribute 'object'.
E           `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
E           The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E               https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

From numpy 1.24.0 we receive a deprecation warning on np.object0, every other np.<datatype>0 alias, and np.bool8:

```
>>> np.object0(123)
<stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24)
```

### Why are the changes needed?
The changes are needed so pyspark can be compatible with the latest numpy and avoid:
- attribute errors on data types deprecated in version 1.20.0: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
- warnings on data types deprecated in version 1.24.0: https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations

### Does this PR introduce _any_ user-facing change?
The change suppresses the warnings coming from numpy 1.24.0 and the error coming from numpy 1.24.0 for the aliases deprecated in 1.20.0.

### How was this patch tested?
I assume that the existing tests should catch this (see the Extra questions section). I found this to be a problem in a project at work, where our unit tests use the toPandas() function, which converts to np.object. Attaching the run result of our test:

```
/usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in run_testcase
    self.handler.compare_df(result, expected, config=self.compare_config)
/usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in compare_df
    actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True)
/usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: in toPandas
    corrected_dtypes[index] = np.object  # type: ignore[attr-defined]

[... same numpy __getattr__ traceback and AttributeError as above ...]

/usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError
```

Although I cannot provide that code, the following Python session shows the problem:

```
>>> import numpy as np
>>> np.object0(123)
<stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24)
123
>>> np.object(123)
<stdin>:1: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

I do not have a use case for np.object0 in my tests, but I fixed it as numpy suggests.

### Supported versions
I propose this fix be included in pyspark 3.3 and onwards.

### JIRA
I know a JIRA ticket should be created; I sent an email and am waiting for the answer so I can document the case there as well.

### Extra questions
By grepping for np.bool and np.object I see that the tests include them. Shall we change them also? Data types with a trailing _ are, I think, not affected.

```
git grep np.object
python/pyspark/ml/functions.py:        return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list))
python/pyspark/ml/functions.py:        return any(data.dtypes == np.object_) and any(
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[4], np.object)  # datetime.date
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[6], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[7], np.object)

git grep np.bool
python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool    BooleanType
python/pyspark/pandas/indexing.py:            isinstance(key, np.bool_) for key in cols_sel
python/pyspark/pandas/tests/test_typedef.py:            np.bool: (np.bool, BooleanType()),
python/pyspark/pandas/tests/test_typedef.py:            bool: (np.bool, BooleanType()),
python/pyspark/pandas/typedef/typehints.py:    elif tpe in (bool, np.bool_, "bool", "?"):
python/pyspark/sql/connect/expressions.py:            assert isinstance(value, (bool, np.bool_))
python/pyspark/sql/connect/expressions.py:        elif isinstance(value, np.bool_):
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[2], np.bool)
python/pyspark/sql/tests/test_functions.py:            (np.bool_, [("true", "boolean")]),
```

If yes, since the bool change was merged already, should we fix it too?

Closes #40220 from aimtsou/numpy-patch.

Authored-by: Aimilios Tsouvelekakis <aimtsou@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit b3c26b8)
Signed-off-by: Sean Owen <srowen@gmail.com>
Merged to master/3.4/3.3, for consistency with https://issues.apache.org/jira/browse/SPARK-40376