[SPARK-42647][PYTHON] Change alias for numpy deprecated and removed types · snmvaughan/spark@b3c26b8 · GitHub

Commit b3c26b8

aimtsou authored and srowen committed
[SPARK-42647][PYTHON] Change alias for numpy deprecated and removed types
### Problem description

Numpy has started changing the aliases of some of its data types. This means that users on the latest version of numpy will face either warnings or errors, depending on the type they are using. This affects all users of numpy > 1.20.0. One of the types was fixed back in September with this [pull request](apache#37817).

[numpy 1.24.0](numpy/numpy#22607): The scalar type aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0 as well as np.bool8 are now deprecated and will eventually be removed.

[numpy 1.20.0](numpy/numpy#14882): Using the aliases of builtin types like np.int is deprecated.

### What changes were proposed in this pull request?

From numpy 1.20.0 we receive a deprecation warning on np.object (https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations), and from numpy 1.24.0 we receive an attribute error:

```
attr = 'object'

    def __getattr__(attr):
        # Warn for expired attributes, and return a dummy function
        # that always raises an exception.
        import warnings
        try:
            msg = __expired_functions__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)

            def _expired(*args, **kwds):
                raise RuntimeError(msg)

            return _expired

        # Emit warnings for deprecated attributes
        try:
            val, msg = __deprecated_attrs__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return val

        if attr in __future_scalars__:
            # And future warnings for those that will change, but also give
            # the AttributeError
            warnings.warn(
                f"In the future `np.{attr}` will be defined as the "
                "corresponding NumPy scalar.", FutureWarning, stacklevel=2)

        if attr in __former_attrs__:
>           raise AttributeError(__former_attrs__[attr])
E           AttributeError: module 'numpy' has no attribute 'object'.
E           `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
E           The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E               https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

From numpy 1.24.0 we also receive a deprecation warning on np.object0, every np.<datatype>0, and np.bool8:

```
>>> np.object0(123)
<stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24)
```

### Why are the changes needed?

The changes are needed so pyspark can stay compatible with the latest numpy and avoid:
- attribute errors on the data types deprecated in version 1.20.0: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
- warnings on the data types deprecated in version 1.24.0: https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations

### Does this PR introduce _any_ user-facing change?

The change will suppress the warnings coming from numpy 1.24.0 and the error coming from numpy 1.24.0, where the aliases deprecated in 1.20.0 were removed.

### How was this patch tested?

I assume that the existing tests should catch this (see also the "Extra questions" section). I found this to be a problem in my work's project, where our unit tests use the toPandas() function, which converts to np.object.

Attaching the run result of our test:

```
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in run_testcase
    self.handler.compare_df(result, expected, config=self.compare_config)
/usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in compare_df
    actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True)
/usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: in toPandas
    corrected_dtypes[index] = np.object  # type: ignore[attr-defined]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

attr = 'object'

    def __getattr__(attr):
        # Warn for expired attributes, and return a dummy function
        # that always raises an exception.
        import warnings
        try:
            msg = __expired_functions__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)

            def _expired(*args, **kwds):
                raise RuntimeError(msg)

            return _expired

        # Emit warnings for deprecated attributes
        try:
            val, msg = __deprecated_attrs__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return val

        if attr in __future_scalars__:
            # And future warnings for those that will change, but also give
            # the AttributeError
            warnings.warn(
                f"In the future `np.{attr}` will be defined as the "
                "corresponding NumPy scalar.", FutureWarning, stacklevel=2)

        if attr in __former_attrs__:
>           raise AttributeError(__former_attrs__[attr])
E           AttributeError: module 'numpy' has no attribute 'object'.
E           `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
E           The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E               https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

/usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError
```

Although I cannot provide the code, doing the following in python should show the problem:

```
>>> import numpy as np
>>> np.object0(123)
<stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24)
123
>>> np.object(123)
<stdin>:1: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

I do not have a use-case in my tests for np.object0, but I fixed it following the suggestion from numpy.

### Supported versions

I propose this fix to be included in pyspark 3.3 and onwards.

### JIRA

I know a JIRA ticket should be created; I have sent an email and I am waiting for the answer so I can document the case there as well.

### Extra questions

By grepping for np.bool and np.object I see that the tests include them. Shall we change them also? Data types with a trailing underscore are, I think, not affected.

```
git grep np.object
python/pyspark/ml/functions.py:        return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list))
python/pyspark/ml/functions.py:        return any(data.dtypes == np.object_) and any(
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[4], np.object)  # datetime.date
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[6], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[7], np.object)

git grep np.bool
python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool        BooleanType
python/pyspark/pandas/indexing.py:            isinstance(key, np.bool_) for key in cols_sel
python/pyspark/pandas/tests/test_typedef.py:            np.bool: (np.bool, BooleanType()),
python/pyspark/pandas/tests/test_typedef.py:            bool: (np.bool, BooleanType()),
python/pyspark/pandas/typedef/typehints.py:    elif tpe in (bool, np.bool_, "bool", "?"):
python/pyspark/sql/connect/expressions.py:            assert isinstance(value, (bool, np.bool_))
python/pyspark/sql/connect/expressions.py:        elif isinstance(value, np.bool_):
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[2], np.bool)
python/pyspark/sql/tests/test_functions.py:        (np.bool_, [("true", "boolean")]),
```

If yes, since the np.bool change was merged already, should we fix those too?

Closes apache#40220 from aimtsou/numpy-patch.

Authored-by: Aimilios Tsouvelekakis <aimtsou@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
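In short, the patch replaces the removed aliases with the builtins that numpy itself recommends. A minimal sketch of the behavior (illustrative, not part of the patch; assumes numpy >= 1.24):

```python
import numpy as np

# Replacements applied throughout this patch, as recommended by numpy's
# own error message; the underscored scalar types (np.object_, np.bool_,
# np.str_, ...) remain valid spellings.
#
#   np.object  -> object      np.bool  -> bool      np.str -> str
#   np.int     -> int         np.float -> float
#   np.object0 -> object (or np.object_)

# On numpy >= 1.24 the removed alias raises, while the builtin just works:
try:
    dtype = np.object  # AttributeError on numpy >= 1.24
except AttributeError:
    dtype = object

arr = np.empty(3, dtype=object)  # instead of dtype=np.object
print(arr.dtype == np.object_)   # True: builtin and scalar type agree
```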
1 parent fd50043 commit b3c26b8

File tree

8 files changed: +20 −32 lines changed

python/docs/source/user_guide/pandas_on_spark/types.rst

Lines changed: 0 additions & 4 deletions
```diff
@@ -168,13 +168,9 @@ np.byte        ByteType
 np.int16       ShortType
 np.int32       IntegerType
 np.int64       LongType
-np.int         LongType
 np.float32     FloatType
-np.float       DoubleType
 np.float64     DoubleType
-np.str         StringType
 np.unicode\_   StringType
-np.bool        BooleanType
 np.datetime64  TimestampType
 np.ndarray     ArrayType(StringType())
 =============  =======================
```
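A quick way to confirm which rows had to go: on numpy >= 1.24 only the sized spellings in the remaining rows still resolve (a hedged sketch, not part of the commit):

```python
import numpy as np

# Rows kept in the table: sized types are still numpy attributes.
print(hasattr(np, "int64"), hasattr(np, "float64"), hasattr(np, "datetime64"))
# True True True

# Rows removed from the table: the plain aliases are gone on numpy >= 1.24,
# so getattr falls through to the default.
print(getattr(np, "int", None), getattr(np, "float", None))
# None None
```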

python/pyspark/pandas/groupby.py

Lines changed: 5 additions & 5 deletions
```diff
@@ -1923,7 +1923,7 @@ def apply(self, func: Callable, *args: Any, **kwargs: Any) -> Union[DataFrame, S
 
         In case of Series, it works as below.
 
-        >>> def plus_max(x) -> ps.Series[np.int]:
+        >>> def plus_max(x) -> ps.Series[int]:
         ...     return x + x.max()
         >>> df.B.groupby(df.A).apply(plus_max).sort_index()  # doctest: +SKIP
         0    6
@@ -1941,7 +1941,7 @@ def apply(self, func: Callable, *args: Any, **kwargs: Any) -> Union[DataFrame, S
 
         You can also return a scalar value as an aggregated value of the group:
 
-        >>> def plus_length(x) -> np.int:
+        >>> def plus_length(x) -> int:
         ...     return len(x)
         >>> df.B.groupby(df.A).apply(plus_length).sort_index()  # doctest: +SKIP
         0    1
@@ -1950,7 +1950,7 @@ def apply(self, func: Callable, *args: Any, **kwargs: Any) -> Union[DataFrame, S
 
         The extra arguments to the function can be passed as below.
 
-        >>> def calculation(x, y, z) -> np.int:
+        >>> def calculation(x, y, z) -> int:
         ...     return len(x) + y * z
         >>> df.B.groupby(df.A).apply(calculation, 5, z=10).sort_index()  # doctest: +SKIP
         0    51
@@ -3077,7 +3077,7 @@ def transform(self, func: Callable[..., pd.Series], *args: Any, **kwargs: Any) -
         1  a string 2  a string 6
         2  a string 3  a string 5
 
-        >>> def plus_max(x) -> ps.Series[np.int]:
+        >>> def plus_max(x) -> ps.Series[int]:
         ...     return x + x.max()
         >>> g.transform(plus_max)  # doctest: +NORMALIZE_WHITESPACE
            B  C
@@ -3111,7 +3111,7 @@ def transform(self, func: Callable[..., pd.Series], *args: Any, **kwargs: Any) -
 
         You can also specify extra arguments to pass to the function.
 
-        >>> def calculation(x, y, z) -> ps.Series[np.int]:
+        >>> def calculation(x, y, z) -> ps.Series[int]:
         ...     return x + x.min() + y + z
         >>> g.transform(calculation, 5, z=20)  # doctest: +NORMALIZE_WHITESPACE
            B  C
```
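The updated doctests run the same way with the builtin annotation; a small usage sketch (sample data is illustrative):

```python
import pyspark.pandas as ps

df = ps.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})

# builtin int in the type hint, where the doctests previously used np.int
def plus_max(x) -> ps.Series[int]:
    return x + x.max()

df.B.groupby(df.A).apply(plus_max).sort_index()
```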

python/pyspark/pandas/tests/indexes/test_base.py

Lines changed: 0 additions & 2 deletions
```diff
@@ -2340,7 +2340,6 @@ def test_astype(self):
         psidx = ps.Index(pidx)
 
         self.assert_eq(psidx.astype(int), pidx.astype(int))
-        self.assert_eq(psidx.astype(np.int), pidx.astype(np.int))
         self.assert_eq(psidx.astype(np.int8), pidx.astype(np.int8))
         self.assert_eq(psidx.astype(np.int16), pidx.astype(np.int16))
         self.assert_eq(psidx.astype(np.int32), pidx.astype(np.int32))
@@ -2356,7 +2355,6 @@ def test_astype(self):
         self.assert_eq(psidx.astype("i"), pidx.astype("i"))
         self.assert_eq(psidx.astype("long"), pidx.astype("long"))
         self.assert_eq(psidx.astype("short"), pidx.astype("short"))
-        self.assert_eq(psidx.astype(np.float), pidx.astype(np.float))
         self.assert_eq(psidx.astype(np.float32), pidx.astype(np.float32))
         self.assert_eq(psidx.astype(np.float64), pidx.astype(np.float64))
         self.assert_eq(psidx.astype("float"), pidx.astype("float"))
```

python/pyspark/pandas/tests/test_series.py

Lines changed: 0 additions & 2 deletions
```diff
@@ -1576,7 +1576,6 @@ def _test_numeric_astype(self, pser):
         psser = ps.Series(pser)
 
         self.assert_eq(psser.astype(int), pser.astype(int))
-        self.assert_eq(psser.astype(np.int), pser.astype(np.int))
         self.assert_eq(psser.astype(np.int8), pser.astype(np.int8))
         self.assert_eq(psser.astype(np.int16), pser.astype(np.int16))
         self.assert_eq(psser.astype(np.int32), pser.astype(np.int32))
@@ -1592,7 +1591,6 @@ def _test_numeric_astype(self, pser):
         self.assert_eq(psser.astype("i"), pser.astype("i"))
         self.assert_eq(psser.astype("long"), pser.astype("long"))
         self.assert_eq(psser.astype("short"), pser.astype("short"))
-        self.assert_eq(psser.astype(np.float), pser.astype(np.float))
         self.assert_eq(psser.astype(np.float32), pser.astype(np.float32))
         self.assert_eq(psser.astype(np.float64), pser.astype(np.float64))
         self.assert_eq(psser.astype("float"), pser.astype("float"))
```

python/pyspark/pandas/tests/test_typedef.py

Lines changed: 1 addition & 5 deletions
```diff
@@ -321,20 +321,16 @@ def test_as_spark_type_pandas_on_spark_dtype(self):
             np.int16: (np.int16, ShortType()),
             np.int32: (np.int32, IntegerType()),
             np.int64: (np.int64, LongType()),
-            np.int: (np.int64, LongType()),
             int: (np.int64, LongType()),
             # floating
             np.float32: (np.float32, FloatType()),
-            np.float: (np.float64, DoubleType()),
             np.float64: (np.float64, DoubleType()),
             float: (np.float64, DoubleType()),
             # string
-            np.str: (np.unicode_, StringType()),
             np.unicode_: (np.unicode_, StringType()),
             str: (np.unicode_, StringType()),
             # bool
-            np.bool: (np.bool, BooleanType()),
-            bool: (np.bool, BooleanType()),
+            bool: (np.bool_, BooleanType()),
             # datetime
             np.datetime64: (np.datetime64, TimestampType()),
             datetime.datetime: (np.dtype("datetime64[ns]"), TimestampType()),
```
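The mapping under test can be checked directly with the as_spark_type helper this test file exercises; the builtins land on the same Spark types the removed alias rows declared (a sketch):

```python
import numpy as np
from pyspark.pandas.typedef.typehints import as_spark_type

print(as_spark_type(int))       # LongType()   (what the np.int row asserted)
print(as_spark_type(float))     # DoubleType() (what the np.float row asserted)
print(as_spark_type(bool))      # BooleanType()
print(as_spark_type(np.int64))  # LongType()   (sized types are unaffected)
```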

python/pyspark/pandas/typedef/typehints.py

Lines changed: 6 additions & 6 deletions
```diff
@@ -391,15 +391,15 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
     >>> inferred.spark_type
     LongType()
 
-    >>> def func() -> ps.DataFrame[np.float, str]:
+    >>> def func() -> ps.DataFrame[float, str]:
     ...    pass
     >>> inferred = infer_return_type(func)
     >>> inferred.dtypes
     [dtype('float64'), dtype('<U')]
     >>> inferred.spark_type
     StructType([StructField('c0', DoubleType(), True), StructField('c1', StringType(), True)])
 
-    >>> def func() -> ps.DataFrame[np.float]:
+    >>> def func() -> ps.DataFrame[float]:
     ...    pass
     >>> inferred = infer_return_type(func)
     >>> inferred.dtypes
@@ -423,31 +423,31 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
     >>> inferred.spark_type
     LongType()
 
-    >>> def func() -> 'ps.DataFrame[np.float, str]':
+    >>> def func() -> 'ps.DataFrame[float, str]':
     ...    pass
     >>> inferred = infer_return_type(func)
     >>> inferred.dtypes
     [dtype('float64'), dtype('<U')]
     >>> inferred.spark_type
     StructType([StructField('c0', DoubleType(), True), StructField('c1', StringType(), True)])
 
-    >>> def func() -> 'ps.DataFrame[np.float]':
+    >>> def func() -> 'ps.DataFrame[float]':
     ...    pass
     >>> inferred = infer_return_type(func)
     >>> inferred.dtypes
     [dtype('float64')]
     >>> inferred.spark_type
     StructType([StructField('c0', DoubleType(), True)])
 
-    >>> def func() -> ps.DataFrame['a': np.float, 'b': int]:
+    >>> def func() -> ps.DataFrame['a': float, 'b': int]:
     ...    pass
     >>> inferred = infer_return_type(func)
     >>> inferred.dtypes
     [dtype('float64'), dtype('int64')]
     >>> inferred.spark_type
     StructType([StructField('a', DoubleType(), True), StructField('b', LongType(), True)])
 
-    >>> def func() -> "ps.DataFrame['a': np.float, 'b': int]":
+    >>> def func() -> "ps.DataFrame['a': float, 'b': int]":
     ...    pass
     >>> inferred = infer_return_type(func)
     >>> inferred.dtypes
```
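The updated doctests can be reproduced interactively; builtin float infers the same dtypes and Spark schema that the np.float spelling used to (mirroring the doctest above):

```python
import pyspark.pandas as ps
from pyspark.pandas.typedef.typehints import infer_return_type

def func() -> ps.DataFrame[float, str]:
    pass

inferred = infer_return_type(func)
print(inferred.dtypes)      # [dtype('float64'), dtype('<U')]
print(inferred.spark_type)
# StructType([StructField('c0', DoubleType(), True),
#             StructField('c1', StringType(), True)])
```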

python/pyspark/sql/pandas/conversion.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -182,7 +182,7 @@ def toPandas(self) -> "PandasDataFrameLike":
                         field.dataType
                     )
                     corrected_panda_types[tmp_column_names[index]] = (
-                        np.object0 if pandas_type is None else pandas_type
+                        object if pandas_type is None else pandas_type
                     )
 
                 pdf = pd.DataFrame(columns=tmp_column_names).astype(
@@ -232,7 +232,7 @@ def toPandas(self) -> "PandasDataFrameLike":
                 if isinstance(field.dataType, IntegralType) and pandas_col.isnull().any():
                     corrected_dtypes[index] = np.float64
                 if isinstance(field.dataType, BooleanType) and pandas_col.isnull().any():
-                    corrected_dtypes[index] = np.object  # type: ignore[attr-defined]
+                    corrected_dtypes[index] = object
 
         df = pd.DataFrame()
         for index, t in enumerate(corrected_dtypes):
```
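This is the code path from the reporter's traceback: a BooleanType column containing nulls falls back to the object dtype. A usage sketch (assumes an active SparkSession named spark):

```python
# A nullable boolean column now takes the builtin object dtype on toPandas(),
# so no np.object access happens and no AttributeError is raised.
df = spark.createDataFrame([(True,), (None,)], "b boolean")
print(df.toPandas().dtypes)  # b    object
```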

python/pyspark/sql/tests/test_dataframe.py

Lines changed: 6 additions & 6 deletions
```diff
@@ -1119,10 +1119,10 @@ def test_to_pandas(self):
         pdf = self._to_pandas()
         types = pdf.dtypes
         self.assertEqual(types[0], np.int32)
-        self.assertEqual(types[1], np.object)
-        self.assertEqual(types[2], np.bool)
+        self.assertEqual(types[1], object)
+        self.assertEqual(types[2], bool)
         self.assertEqual(types[3], np.float32)
-        self.assertEqual(types[4], np.object)  # datetime.date
+        self.assertEqual(types[4], object)  # datetime.date
         self.assertEqual(types[5], "datetime64[ns]")
         self.assertEqual(types[6], "datetime64[ns]")
         self.assertEqual(types[7], "timedelta64[ns]")
@@ -1181,7 +1181,7 @@ def test_to_pandas_avoid_astype(self):
         df = self.spark.createDataFrame(data, schema)
         types = df.toPandas().dtypes
         self.assertEqual(types[0], np.float64)  # doesn't convert to np.int32 due to NaN value.
-        self.assertEqual(types[1], np.object)
+        self.assertEqual(types[1], object)
         self.assertEqual(types[2], np.float64)
 
     @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: ignore
@@ -1242,8 +1242,8 @@ def test_to_pandas_from_null_dataframe(self):
         self.assertEqual(types[3], np.float64)
         self.assertEqual(types[4], np.float32)
         self.assertEqual(types[5], np.float64)
-        self.assertEqual(types[6], np.object)
-        self.assertEqual(types[7], np.object)
+        self.assertEqual(types[6], object)
+        self.assertEqual(types[7], object)
         self.assertTrue(np.can_cast(np.datetime64, types[8]))
         self.assertTrue(np.can_cast(np.datetime64, types[9]))
         self.assertTrue(np.can_cast(np.timedelta64, types[10]))
```
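These assertions keep passing because pandas dtype equality accepts the builtins directly; a minimal check (illustrative):

```python
import numpy as np
import pandas as pd

types = pd.DataFrame({"s": ["a", None], "b": [True, False]}).dtypes
print(types["s"] == object)      # True: the builtin works in dtype comparisons
print(types["b"] == bool)        # True: same for bool
print(types["s"] == np.object_)  # True: equivalent underscored spelling
```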
