8000 ENH: Add the geometric standard deviation function by Kai-Striega · Pull Request #9810 · scipy/scipy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Add the geometric standard deviation function #9810

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 2, 2019

Conversation

Kai-Striega
Copy link
Member
@Kai-Striega Kai-Striega commented Feb 13, 2019

Redo of #9395.

This PR implements a geometric standard deviation function and closes #9154. The geometric standard deviation describes the spread of a set of numbers where the geometric mean is preferred.

The original PR included several revisions handling masked arrays, mstats and invalid values. These revisions obfuscate the commit history and are starting to repeat the original commits.

Several issues occurred checking for and handling invalid values (see #9558 and numpy/numpy#4959) . In particular the use of nan_policy. The current implementation does not use nan_policy. It attempts a naive calculation of the geometric standard deviation. If this fails it checks for invalid arguments, raising a descriptive error message.

Another discussion centered around the use of the mstats module. The original PR included support for both regular and masked arrays. Later changing to include an implementation in both stats and mstats. This later changed back to only include a stats version.

@chrisb83 @WarrenWeckesser would you be willing to take a look over this?

@Kai-Striega Kai-Striega added scipy.stats enhancement A new feature or improvement labels Feb 13, 2019
@Kai-Striega Kai-Striega changed the title ENH: Add the geometric standard deviation ENH: Add the geometric standard deviation function Feb 13, 2019
Copy link
Member
@chrisb83 chrisb83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in my view.

Two comments:

  • function can handle normal arrays and masked arrays; no separate mstats version. I think this is a clean solution.
  • the try … catch construction is normally not used in other stats functions that rather test a certain condition such as if any(a < 0) raise .... my understanding is that try catch is slower (https://docs.python.org/2/faq/design.html#how-fast-are-exceptions), but not sure if that really matters here. any views?

@Kai-Striega
Copy link
Member Author
Kai-Striega commented Feb 25, 2019

The idea behind the try..except statement is to avoid checking the input arrays twice. This way (should) work faster when a valid array is passed. But would be significantly slower when it wouldn't work, to me, this seemed like a good trade-off.

I've implemented them as checks to avoid the try..except statement. Here are some timings (units of seconds as floats for 30 runs) for a 1000x1000 array of np.float64.

Repeating 5 runs of 30 trials

Array is valid: 
Time with checks:  [1.203507800000807, 1.1911648000022979, 1.1953944000015326, 1.1901376999994682, 1.191319300000032]
Time with exceptions:  [0.6507973999978276, 0.63852659999975, 0.6370251000007556, 0.6375709000021743, 0.6372553000001062]

Array contains zero:
Time with checks:  [0.0644265999981144, 0.06380280000303173, 0.06387930000346387, 0.06453880000117351, 0.0649091999985103]
Time with exceptions:  [0.5850529999988794, 0.5865052000008291, 0.5919960999999603, 0.5863754000019981, 0.5864228000027651]

Array contains negative:
Time with checks:  [0.034902499999589054, 0.03421309999976074, 0.03458369999862043, 0.034749700000247685, 0.03605949999837321]
Time with exceptions:  [0.528536300000269, 0.5280047000014747, 0.5277626999995846, 0.5291075000022829, 
8000
0.5284775000000081]

Array contains inf:
Time with checks:  [0.018257800002174918, 0.01787029999832157, 0.017890900002385024, 0.01802010000028531, 0.01797169999917969]
Time with exceptions:  [0.5066953000023204, 0.5100607999993372, 0.508547900000849, 0.5079172999976436, 0.5103061000008893]

For the valid case, using try..except is ~2x as fast as using checks. HOWEVER for invalid cases try..except is much slower than trying to catch errors. As it stands I'm not sure which one would be preferred. It depends on whether we expect the user to provide "correct" input, and to be able to provide a faster function.

What do you think?

@chrisb83
Copy link
Member

Thanks for the additional info. I think the trade-off higher speed for correct input vs slower speed for incorrect input is fine. Let's see what others say

Copy link
Member
@chrisb83 chrisb83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good in my view

@Kai-Striega
Copy link
Member Author

@tylerjereddy any thoughts on using try..except vs checks? Otherwise, I'll look to merge it soon.

Copy link
Contributor
@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like domain expert review is positive now. The try/except tradeoffs seem sensible to me too. We sometimes like to add asv benchmarks for new functions where there are performance concerns.

I checked the line coverage of the addition locally--there's a partial hit and one miss, but I'm not saying you have to hit 100% on this PR; here's the screen cap of the html cov with pertinent lines & cmd used:

python runtests.py --mode=full -t "scipy/stats/tests/test_stats.py" -- --cov-report html --cov=scipy/stats

image

Docstring render in CircleCI looks nice -- we sometimes use versionadded for new functions I think

@tylerjereddy tylerjereddy added this to the 1.3.0 milestone Feb 27, 2019
@rgommers
Copy link
Member
rgommers commented Mar 2, 2019

@Kai-Striega since this now has 3 merges of master into your branch, it would be useful to rebase to get rid of those. If it doesn't rebase cleanly, perhaps squash and rebase?

@Kai-Striega
Copy link
Member Author

@tylerjereddy I included the timing to demonstrate the difference between checking and try..except. The performance (shouldn't) really be a concern and I don't think adding asv benchmarks are required.

I've added a versionadded of 1.3.0.

@rgommers will do.

This PR implements a geometric standard deviation function. The
geometric standard deviation describes the spread of a set of
numbers where the geometric mean is preferred.
Minor simplification to use np.asanyarray in favour of checking if
passed value is not an array, then manually converting to an array.
@Kai-Striega
Copy link
Member Author

@rgommers does the version history look better now? If so could you merge it once the tests pass?

@rgommers
Copy link
Member
rgommers commented Mar 2, 2019

Thanks, version history looks good now.

Had a quick look at the try-except, that seems fine here. Normally we do explicit checks, but when the checks become expensive compared to the one-liner that's the actual functionality, then I'm fine with try-except to speed that up.

Everyone is happy it seems, so in it goes. Thanks @Kai-Striega & reviewers!

@rgommers rgommers merged commit 880bcd4 into scipy:master Mar 2, 2019
@rgommers
Copy link
Member
rgommers commented Mar 2, 2019

@Kai-Striega Kai-Striega deleted the enh_gstd branch March 2, 2019 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Geometric standard deviation
4 participants
0