Impl Series.skew() by Rubtsowa · Pull Request #813 · IntelPython/sdc · GitHub

This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Impl Series.skew() #813

Merged 16 commits on Apr 27, 2020

@Rubtsowa commented Apr 21, 2020
Contributor
name         nthreads  type    size       median    min       max        compile   boxing
Series.skew  1         Python  100000000  5.041041  4.553035  6.009044   NaN       NaN
Series.skew  1         SDC     100000000  1.082     1.066     1.091      0.205984  0.00001
Series.skew  4         Python  100000000  6.681195  4.275043  14.061077  NaN       NaN
Series.skew  4         SDC     100000000  1.07      0.99      1.135      0.329002  0.00001

@pep8speaks commented Apr 21, 2020

Hello @Rubtsowa! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-24 16:12:26 UTC

else:
    _skipna = skipna

infinite_mask = numpy.isfinite(self._data)
Collaborator

No, that kills performance and scalability.


infinite_mask = numpy.isfinite(self._data)
len_val = len(infinite_mask)
data = self._data[infinite_mask]
Collaborator

That too, actually.
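
The two review comments above object to building a boolean mask and then a filtered copy of the data. One way to avoid both allocations is to count the finite values and accumulate the sums in a single pass. The sketch below is plain Python and only illustrates the idea; it assumes the real implementation would compile an equivalent loop with Numba, and the name `finite_power_sums` is hypothetical rather than taken from the PR.

```python
import math


def finite_power_sums(values):
    """Return (n, sum, square_sum, cube_sum) over the finite entries of values.

    Hypothetical helper: one pass, no mask array and no filtered copy.
    """
    n = 0
    _sum = 0.0
    square_sum = 0.0
    cube_sum = 0.0
    for v in values:
        if math.isfinite(v):  # skips NaN as well as +/-inf
            n += 1
            _sum += v
            square_sum += v * v
            cube_sum += v * v * v
    return n, _sum, square_sum, cube_sum


sums = finite_power_sums([1.0, float("nan"), 2.0, float("inf"), 3.0])
print(sums)  # (3, 6.0, 14.0, 36.0)
```

The loop touches each element once and keeps only four scalars, so memory use does not grow with the input size.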

@AlexanderKalistratov
Collaborator

@Rubtsowa looks good. Could you please remeasure performance?

@Rubtsowa
Contributor Author
name         nthreads  type    size       median    min       max       compile   boxing
Series.skew  1         Python  100000000  9.940206  5.385197  19.73827  NaN       NaN
Series.skew  1         SDC     100000000  0.989     0.711     1.241     0.167002  0
Series.skew  4         SDC     100000000  0.731     0.719     0.847     0.105322  0.000417

@Rubtsowa
Contributor Author
name         nthreads  type    size       median    min       max       compile   boxing
Series.skew  1         Python  100000000  6.973869  4.221513  9.153432  NaN       NaN
Series.skew  1         SDC     100000000  0.597     0.595     0.604     0.12701   0.00008
Series.skew  4         SDC     100000000  0.199     0.197     0.208     0.134013  0.000026

return numpy.nan

n = nfinite
m2 = (square_sum - _sum * _sum / n) / n
Collaborator

I believe we could move this formula somewhere shared and then reuse it in the Series/Rolling/GroupBy methods.

@densmirn ?
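
A shared helper of the kind suggested here might look like the following sketch. It takes a count and the raw sums and returns the adjusted sample skewness; `m2` follows the expression quoted from the PR, while the `m3` expansion, the small-sample and constant-data conventions, and the name `skew_from_sums` are assumptions made for illustration, not the code that was merged.

```python
import math


def skew_from_sums(n, _sum, square_sum, cube_sum):
    """Adjusted sample skewness from a count and raw power sums (hypothetical helper)."""
    if n < 3:
        return math.nan  # assumed convention: undefined for fewer than 3 values
    # Second central moment, same expression as in the PR snippet.
    m2 = (square_sum - _sum * _sum / n) / n
    # Third central moment, expanded from sum((x - mean) ** 3) / n.
    m3 = (cube_sum - 3.0 * _sum * square_sum / n + 2.0 * _sum ** 3 / (n * n)) / n
    if m2 <= 0.0:
        return 0.0  # assumed convention: constant data, or rounding pushed m2 to <= 0
    return math.sqrt(n * (n - 1.0)) / (n - 2.0) * m3 / m2 ** 1.5


data = [1.0, 2.0, 3.0, 10.0]
result = skew_from_sums(
    len(data),
    sum(data),
    sum(x * x for x in data),
    sum(x * x * x for x in data),
)
print(result)  # about 1.76 for this right-skewed sample
```

Because the helper depends only on four scalars, the same function could in principle be called from Series, Rolling, and GroupBy code paths that maintain those sums. Note that raw-sum formulas can lose precision through cancellation when the mean is large relative to the spread.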

@Rubtsowa
Contributor Author
name                       nthreads  type    size       median    min       max       compile   boxing
Series.skew(skipna=False)  1         Python  100000000  3.396047  3.068078  4.135047  NaN       NaN
Series.skew(skipna=False)  1         SDC     100000000  0.165     0.165     0.171     0.200001  0.000002
Series.skew(skipna=False)  4         SDC     100000000  0.08      0.078     0.09      0.211993  0.000009

def skew_impl(arr):
    len_val = len(arr)
    n = 0
    _sum = 0
Collaborator

Let's initialize _sum, square_sum and cube_sum with float values 0.

Contributor Author

This will lead to errors.

Collaborator

What kind of errors?

@@ -135,6 +135,8 @@ def _test_case(self, pyfunc, name, total_data_length, input_data=None, data_num=
TC(name='shape', size=[10 ** 7], call_expr='data.shape', usecase_params='data'),
TC(name='shift', size=[10 ** 8]),
TC(name='size', size=[10 ** 7], call_expr='data.size', usecase_params='data'),
TC(name='skew', size=[10 ** 8], params='skipna=True'),
TC(name='skew', size=[10 ** 8], params='skipna=False', input_data=[test_global_input_data_float64[0]]),
Collaborator

I don't think using test_global_input_data_float64[0] is actually a good idea. It contains max_float values, which make it impossible to compare results and may cause other issues.
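
The concern can be illustrated with a small sketch. It assumes that "max_float" means values near the largest finite float64; the actual contents of `test_global_input_data_float64[0]` are not shown in this conversation. Squaring or cubing such a value overflows to infinity, and the moment formula then produces NaN, so the Python and SDC results cannot be compared meaningfully.

```python
import sys

big = sys.float_info.max  # largest finite float64
data = [1.0, 2.0, big]

n = len(data)
_sum = sum(data)
square_sum = sum(x * x for x in data)    # big * big overflows to inf
cube_sum = sum(x * x * x for x in data)  # inf as well
m2 = (square_sum - _sum * _sum / n) / n  # inf - inf gives nan

print(square_sum, cube_sum, m2)  # inf inf nan
```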

@@ -0,0 +1,41 @@
# -*- coding: utf-8 -*-
Collaborator

Let's name this file 'statistics'

@@ -0,0 +1,41 @@
# -*- coding: utf-8 -*-
Collaborator

'statistics', not 'statics'.

@AlexanderKalistratov AlexanderKalistratov merged commit 0e155d4 into IntelPython:master Apr 27, 2020
@Rubtsowa Rubtsowa deleted the series_skew branch April 27, 2020 09:45
akharche added a commit that referenced this pull request May 15, 2020
* Df.at impl (#738)

* Series.add / Series.lt with fill_value (#655)

* Impl Series.skew() (#813)

* Run tests in separate processes (#833)

* Run tests in separate processes

* Take tests list from sdc/tests/__init__.py

* change README (#818)

* change README

* change README for doc

* add refs

* change ref

* change ref

* change ref

* change readme

* Improve boxing (#832)

* Specify sdc version from channel for examples testing (#837)

* Specify sdc version from channel for examples testing

It occurs that conda resolver can take Intel SDC package
not from first channel where it is found.
Specify particular SDC version to avoid this in examples
for now.
Also print info for environment creation and package installing

* Fix incerrectly used f-string

* Fix log_info call

* Numba 0.49.0 all (#824)

* Fix run tests

Remove import of _getitem_array1d

* expectedFailure

* expectedFailure-2

* expectedFailure-3

* Conda recipe numba==0.49

* expectedFailure-4

* Refactor imports from Numba

* Unskip tests

* Fix using of numpy_support.from_dtype()

* Unskip tests

* Fix DataFrame tests with rewrite IR without Del statements

* Unskip tests

* Fix corr_overload with type inference error for none < 1

* Fix hpat_pandas_series_cov with type inference error for none < 2

* Unskip tests

* Unskip tests

* Fixed iternext_series_array with using _getitem_array1d.

_getitem_array1d is replaced with _getitem_array_single_int in Numba 0.49.

* Unskip tests

* Unskip old test

* Fix Series.at

* Unskip tests

* Add decrefs in boxing (#836)

* Adding extension type for pd.RangeIndex (#820)

* Adding extension type for pd.RangeIndex

This commit adds Numba extension types for pandas.RangeIndex class,
allowing creation of pd.RangeIndex objects and passing and returning them
to/from nopython functions.

* Applying review comments

* Fix for PR 831 (#839)

* Update pyarrow version to 0.17.0

Update recipe, code and docs.

* Disable intel channel

* Disable intel channel for testing

* Fix remarks

Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>

* Update to Numba 0.49.1 (#838)

* Update to Numba 0.49.1

* Fix requirements.txt

* Add travis

Co-authored-by: Elena Totmenina <totmeninal@mail.ru>
Co-authored-by: Rubtsowa <36762665+Rubtsowa@users.noreply.github.com>
Co-authored-by: Sergey Pokhodenko <sergey.pokhodenko@intel.com>
Co-authored-by: Vyacheslav-Smirnov <51660067+Vyacheslav-Smirnov@users.noreply.github.com>
Co-authored-by: Alexey Kozlov <52973316+kozlov-alexey@users.noreply.github.com>
Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>
densmirn added a commit that referenced this pull request May 30, 2020
* Turn on Azure CI for branch (#822)

* Redesign DataFrame structure (#817)

* Merge master (#840)

* Re-implement df.getitem based on new structure (#845)

* Re-implement df.getitem based on new structure

* Re-implemented remaining getitem overloads, add tests

* Re-implement df.values based on new structure (#846)

* Re-implement df.pct_change based on new structure (#847)

* Re-implement df.drop based on new structure (#848)

* Re-implement df.append based on new structure (#857)

* Re-implement df.reset_index based on new structure (#849)

* Re-implement df._set_column based on new strcture (#850)

* Re-implement df.rolling methods based on new structure (#852)

* Re-implement df.index based on new structure (#853)

* Re-implement df.copy based on new structure (#854)

* Re-implement df.isna based on new structure (#856)

* Re-implement df.at/iat/loc/iloc based on new structure (#858)

* Re-implement df.head based on new structure (#855)

* Re-implement df.head based on new structure

* Simplify codegen docstring

* Re-implement df.groupby methods based on new structure (#859)

* Re-implement dataframe boxing based on new structure (#861)

* Re-implement DataFrame unboxing (#860)

* Boxing draft

Merge branch 'master' of https://github.com/IntelPython/sdc into merge_master

# Conflicts:
#	sdc/hiframes/pd_dataframe_ext.py
#	sdc/tests/test_dataframe.py

* Implement unboxing in new structure

* Improve variable names + add error handling

* Return error status

* Move getting list size to if_ok block

* Unskipped unexpected success tests

* Unskipped unexpected success tests in GroupBy

* Remove decorators

* Change to incref False

* Skip tests failed due to unimplemented df structure

* Bug in rolling

* Fix rolling (#865)

* Undecorate tests on reading CSV (#866)

* Re-implement df structure: enable rolling tests that pass (#867)

* Re-implement df structure: refactor len (#868)

* Re-implement df structure: refactor len

* Undecorated all the remaining methods

Co-authored-by: Denis <denis.smirnov@intel.com>

* Merge master to feature/dataframe_model_refactoring (#869)

* Enable CI on master

Co-authored-by: Angelina Kharchevnikova <angelina.kharchevnikova@intel.com>
Co-authored-by: Elena Totmenina <totmeninal@mail.ru>
Co-authored-by: Rubtsowa <36762665+Rubtsowa@users.noreply.github.com>
Co-authored-by: Sergey Pokhodenko <sergey.pokhodenko@intel.com>
Co-authored-by: Vyacheslav-Smirnov <51660067+Vyacheslav-Smirnov@users.noreply.github.com>
Co-authored-by: Alexey Kozlov <52973316+kozlov-alexey@users.noreply.github.com>
Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>