8000 MAINT: stats.DiscreteDistribution: fix inversion methods by mdhaber · Pull Request #22970 · scipy/scipy · GitHub
[go: up one dir, main page]

Skip to content

MAINT: stats.DiscreteDistribution: fix inversion methods #22970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 15, 2025

Conversation

mdhaber
Copy link
Contributor
@mdhaber mdhaber commented May 11, 2025

Reference issue

gh-22312

What does this implement/fix?

gh-22312 noted a problem with method='inversion' for discrete distributions when the correct inverse is the left end of the support. This fixes the bug.

Additional information

It may still be slow because it waits for bracket finding to fail. An enhancement (which I might include before merge) would be to skip inversion when the correct result is the left end of the support.

@steppi were there any other tests we skipped due to this issue that we should unskip?

@github-actions github-actions bot added scipy.stats maintenance Items related to regular maintenance tasks labels May 11, 2025
@mdhaber mdhaber added this to the 1.16.0 milestone May 11, 2025
@mdhaber mdhaber marked this pull request as ready for review May 14, 2025 03:58
@mdhaber mdhaber requested a review from ev-br as a code owner May 14, 2025 03:58
@mdhaber mdhaber requested review from j-bowhay and removed request for ev-br May 14, 2025 03:59
Copy link
Member
@j-bowhay j-bowhay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time looking through this and I am happy that this matches what we discussed yesterday

@j-bowhay
Copy link
Member

Just a merge conflict to resolve

Copy link
Contributor
@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I rebase this branch locally and resolve merge conflicts I consistently see a single test failure locally, reproducible via spin test -j 12 -t scipy/stats/tests/test_continuous.py::TestMakeDistribution::test_rv_generic --durations=10.

___________________________________ TestMakeDistribution.test_rv_generic[128-distdata128] _______________________________________
[gw1] darwin -- Python 3.12.3 /Users/treddy/python_venvs/py_312_scipy_dev/bin/python3

self = <scipy.stats.tests.test_continuous.TestMakeDistribution object at 0x11b53ec60>, i = 128
distdata = ['nchypergeom_fisher', (140, 80, 60, 0.5)]

    @pytest.mark.parametrize('i, distdata', enumerate(distcont + distdiscrete))
    def test_rv_generic(self, i, distdata):
        distname = distdata[0]
    
        slow = {'argus', 'exponpow', 'exponweib', 'genexpon', 'gompertz', 'halfgennorm',
                'johnsonsb', 'kappa4', 'ksone', 'kstwo', 'kstwobign', 'norminvgauss',
                'powerlognorm', 'powernorm', 'recipinvgauss', 'studentized_range',
                'vonmises_line', # continuous
                'skellam'}  # discrete
        if not int(os.environ.get('SCIPY_XSLOW', '0')) and distname in slow:
            pytest.skip('Skipping as XSLOW')
    
        if distname in {              # skip these distributions
            'levy_stable',            # private methods seem to require >= 1d args
            'vonmises',               # circular distribution; shouldn't work
            'poisson_binom',          # vector shape parameter
            'hypergeom',              # distribution functions need interpolation
        }:
            return
    
        # skip single test, mostly due to slight disagreement
        custom_tolerances = {'ksone': 1e-5, 'kstwo': 1e-5}  # discontinuous PDF
        skip_entropy = {'kstwobign', 'pearson3'}  # tolerance issue
        skip_skewness = {'exponpow', 'ksone', 'nchypergeom_wallenius'}  # tolerance
        skip_kurtosis = {'chi', 'exponpow', 'invgamma',  # tolerance
                         'johnsonsb', 'ksone', 'kstwo',  # tolerance
                         'nchypergeom_wallenius'}  # tolerance
        skip_logccdf = {'arcsine', 'skewcauchy', 'trapezoid', 'triang'}  # tolerance
        skip_raw = {2: {'alpha', 'foldcauchy', 'halfcauchy', 'levy', 'levy_l'},
                    3: {'pareto'},  # stats.pareto is just wrong
                    4: {'invgamma'}}  # tolerance issue
        skip_standardized = {'exponpow', 'ksone', 'nchypergeom_wallenius'}  # tolerances
    
        dist = getattr(stats, distname)
        params = dict(zip(dist.shapes.split(', '), distdata[1])) if dist.shapes else {}
        rng = np.random.default_rng(7548723590230982)
        CustomDistribution = stats.make_distribution(dist)
        X = CustomDistribution(**params)
        Y = dist(**params)
        x = X.sample(shape=10, rng=rng)
        p = X.cdf(x)
        rtol = custom_tolerances.get(distname, 1e-7)
        atol = 1e-12
    
        with np.errstate(divide='ignore', invalid='ignore'):
            m, v, s, k = Y.stats('mvsk')
            assert_allclose(X.support(), Y.support())
            if distname not in skip_entropy:
                assert_allclose(X.entropy(), Y.entropy(), rtol=rtol)
            if isinstance(Y, stats.rv_discrete):
                # some continuous distributions have trouble with `logentropy` because
                # it uses complex numbers
                assert_allclose(np.exp(X.logentropy()), Y.entropy(), rtol=rtol)
            assert_allclose(X.median(), Y.median(), rtol=rtol)
            assert_allclose(X.mean(), m, rtol=rtol, atol=atol)
            assert_allclose(X.variance(), v, rtol=rtol, atol=atol)
            if distname not in skip_skewness:
                assert_allclose(X.skewness(), s, rtol=rtol, atol=atol)
            if distname not in skip_kurtosis:
                assert_allclose(X.kurtosis(convention='excess'), k,
                                rtol=rtol, atol=atol)
            if isinstance(dist, stats.rv_continuous):
                assert_allclose(X.logpdf(x), Y.logpdf(x), rtol=rtol)
                assert_allclose(X.pdf(x), Y.pdf(x), rtol=rtol)
            else:
                assert_allclose(X.logpmf(x), Y.logpmf(x), rtol=rtol)
                assert_allclose(X.pmf(x), Y.pmf(x), rtol=rtol)
            assert_allclose(X.logcdf(x), Y.logcdf(x), rtol=rtol)
            assert_allclose(X.cdf(x), Y.cdf(x), rtol=rtol)
            if distname not in skip_logccdf:
                assert_allclose(X.logccdf(x), Y.logsf(x), rtol=rtol)
            assert_allclose(X.ccdf(x), Y.sf(x), rtol=rtol)
    
            # old infrastructure convention for ppf(p=0) and isf(p=1) is different than
            # new infrastructure. Adjust reference values accordingly.
            a, _ = Y.support()
            ref_ppf = Y.ppf(p)
            ref_ppf[p == 0] = a
            ref_isf = Y.isf(p)
            ref_isf[p == 1] = a
    
>           assert_allclose(X.icdf(p), ref_ppf, rtol=rtol)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-07, atol=0
E           
E           Mismatched elements: 6 / 10 (60%)
E           Max absolute difference among violations: 1.
E           Max relative difference among violations: 0.03448276
E            ACTUAL: array([30., 27., 30., 31., 28., 29., 29., 29., 32., 26.])
E            DESIRED: array([31., 27., 31., 31., 29., 30., 30., 30., 32., 26.])

CustomDistribution = <class 'scipy.stats._distribution_infrastructure._make_distribution_rv_generic.<locals>.CustomDistribution'>
X          = NoncentralHypergeometricFisher(M=np.float64(140.0), N=np.float64(60.0), n=np.float64(80.0), odds=np.float64(0.5))
Y          = <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x11b33f470>
_          = np.int64(60)
a          = np.int64(0)
atol       = 1e-12
custom_tolerances = {'ksone': 1e-05, 'kstwo': 1e-05}
dist       = <scipy.stats._discrete_distns.nchypergeom_fisher_gen object at 0x11a73d430>
distdata   = ['nchypergeom_fisher', (140, 80, 60, 0.5)]
distname   = 'nchypergeom_fisher'
i          = 128
k          = np.float64(-0.015031959595980027)
m          = np.float64(28.44642744936175)
p          = array([0.76340047, 0.37233802, 0.76340047, 0.85618448, 0.50915771,
       0.64451321, 0.64451321, 0.64451321, 0.92056245, 0.24967749])
params     = {'M': 140, 'N': 60, 'n': 80, 'odds': 0.5}
ref_isf    = array([26., 29., 26., 25., 28., 27., 27., 27., 24., 30.])
ref_ppf    = array([31., 27., 31., 31., 29., 30., 30., 30., 32., 26.])
rng        = Generator(PCG64) at 0x10C93A340
rtol       = 1e-07
s          = np.float64(0.025421969966314347)
self       = <scipy.stats.tests.test_continuous.TestMakeDistribution object at 0x11b53ec60>
skip_entropy = {'kstwobign', 'pearson3'}
skip_kurtosis = {'chi', 'exponpow', 'invgamma', 'johnsonsb', 'ksone', 'kstwo', ...}
skip_logccdf = {'arcsine', 'skewcauchy', 'trapezoid', 'triang'}
skip_raw   = {2: {'alpha', 'foldcauchy', 'halfcauchy', 'levy', 'levy_l'}, 3: {'pareto'}, 4: {'invgamma'}}
skip_skewness = {'exponpow', 'ksone', 'nchypergeom_wallenius'}
skip_standardized = {'exponpow', 'ksone', 'nchypergeom_wallenius'}
slow       = {'argus', 'exponpow', 'exponweib', 'genexpon', 'gompertz', 'halfgennorm', ...}
v          = np.float64(8.300922457539917)
x          = array([30., 27., 30., 31., 28., 29., 29., 29., 32., 26.])

scipy/stats/tests/test_continuous.py:1189: AssertionError

The 3 slowest test durations on that call are also all > 12 seconds on a fairly recent M3 Mac:

======================================================= slowest 10 durations =======================================================
25.08s call     build-install/usr/lib/python3.12/site-packages/scipy/stats/tests/test_continuous.py::TestMakeDistribution::test_rv_generic[120-distdata120]
13.45s call     build-install/usr/lib/python3.12/site-packages/scipy/stats/tests/test_continuous.py::TestMakeDistribution::test_rv_generic[138-distdata138]
12.89s call     build-install/usr/lib/python3.12/site-packages/scipy/stats/tests/test_continuous.py::TestMakeDistribution::test_rv_generic[130-distdata130]

On latest main, the same test incantation has no test slower than 0.20 seconds.

Matt and I are still in the same physical location--I'll see if we can chat after supper today.

@mdhaber
Copy link
Contributor Author
mdhaber commented May 15, 2025

Thanks @tylerjereddy! I skipped the slow tests and prevented the failing nchypergeom distributions from being accepted by make_distribution - they can't be evaluated at non-integral x, so I wouldn't expect inversion to work properly.

@tylerjereddy
Copy link
Contributor

Jake wasn't aware of a few other recent changes, but I'll consider his approving review still valid given the minor review commits that Matt applied after my review and the passing CI.

I poked around locally with my above incantation and it doesn't seem to cause problems with failure or performance of tests now.

Thanks Matt

@tylerjereddy tylerjereddy merged commit 73ec34c into scipy:main May 15, 2025
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0