MAINT: stats.DiscreteDistribution: fix inversion methods #22970

mdhaber · 2025-05-11T02:06:11Z

Reference issue

What does this implement/fix?

gh-22312 noted a problem with method='inversion' for discrete distributions when the correct inverse is the left end of the support. This fixes the bug.

Additional information

It may still be slow because it waits for bracket finding to fail. An enhancement (which I might include before merge) would be to skip inversion when the correct result is the left end of the support.

@steppi were there any other tests we skipped due to this issue that we should unskip?

j-bowhay

I spent some time looking through this and I am happy that this matches what we discussed yesterday

j-bowhay · 2025-05-14T18:31:33Z

Just a merge conflict to resolve

tylerjereddy

When I rebase this branch locally and resolve merge conflicts I consistently see a single test failure locally, reproducible via spin test -j 12 -t scipy/stats/tests/test_continuous.py::TestMakeDistribution::test_rv_generic --durations=10.

___________________________________ TestMakeDistribution.test_rv_generic[128-distdata128] _______________________________________
[gw1] darwin -- Python 3.12.3 /Users/treddy/python_venvs/py_312_scipy_dev/bin/python3

self = <scipy.stats.tests.test_continuous.TestMakeDistribution object at 0x11b53ec60>, i = 128
distdata = ['nchypergeom_fisher', (140, 80, 60, 0.5)]

    @pytest.mark.parametrize('i, distdata', enumerate(distcont + distdiscrete))
    def test_rv_generic(self, i, distdata):
        distname = distdata[0]
    
        slow = {'argus', 'exponpow', 'exponweib', 'genexpon', 'gompertz', 'halfgennorm',
                'johnsonsb', 'kappa4', 'ksone', 'kstwo', 'kstwobign', 'norminvgauss',
                'powerlognorm', 'powernorm', 'recipinvgauss', 'studentized_range',
                'vonmises_line', # continuous
                'skellam'}  # discrete
        if not int(os.environ.get('SCIPY_XSLOW', '0')) and distname in slow:
            pytest.skip('Skipping as XSLOW')
    
        if distname in {              # skip these distributions
            'levy_stable',            # private methods seem to require >= 1d args
            'vonmises',               # circular distribution; shouldn't work
            'poisson_binom',          # vector shape parameter
            'hypergeom',              # distribution functions need interpolation
        }:
            return
    
        # skip single test, mostly due to slight disagreement
        custom_tolerances = {'ksone': 1e-5, 'kstwo': 1e-5}  # discontinuous PDF
        skip_entropy = {'kstwobign', 'pearson3'}  # tolerance issue
        skip_skewness = {'exponpow', 'ksone', 'nchypergeom_wallenius'}  # tolerance
        skip_kurtosis = {'chi', 'exponpow', 'invgamma',  # tolerance
                         'johnsonsb', 'ksone', 'kstwo',  # tolerance
                         'nchypergeom_wallenius'}  # tolerance
        skip_logccdf = {'arcsine', 'skewcauchy', 'trapezoid', 'triang'}  # tolerance
        skip_raw = {2: {'alpha', 'foldcauchy', 'halfcauchy', 'levy', 'levy_l'},
                    3: {'pareto'},  # stats.pareto is just wrong
                    4: {'invgamma'}}  # tolerance issue
        skip_standardized = {'exponpow', 'ksone', 'nchypergeom_wallenius'}  # tolerances
    
        dist = getattr(stats, distname)
        params = dict(zip(dist.shapes.split(', '), distdata[1])) if dist.shapes else {}
        rng = np.random.default_rng(7548723590230982)
        CustomDistribution = stats.make_distribution(dist)
        X = CustomDistribution(**params)
        Y = dist(**params)
        x = X.sample(shape=10, rng=rng)
        p = X.cdf(x)
        rtol = custom_tolerances.get(distname, 1e-7)
        atol = 1e-12
    
        with np.errstate(divide='ignore', invalid='ignore'):
            m, v, s, k = Y.stats('mvsk')
            assert_allclose(X.support(), Y.support())
            if distname not in skip_entropy:
                assert_allclose(X.entropy(), Y.entropy(), rtol=rtol)
            if isinstance(Y, stats.rv_discrete):
                # some continuous distributions have trouble with `logentropy` because
                # it uses complex numbers
                assert_allclose(np.exp(X.logentropy()), Y.entropy(), rtol=rtol)
            assert_allclose(X.median(), Y.median(), rtol=rtol)
            assert_allclose(X.mean(), m, rtol=rtol, atol=atol)
            assert_allclose(X.variance(), v, rtol=rtol, atol=atol)
            if distname not in skip_skewness:
                assert_allclose(X.skewness(), s, rtol=rtol, atol=atol)
            if distname not in skip_kurtosis:
                assert_allclose(X.kurtosis(convention='excess'), k,
                                rtol=rtol, atol=atol)
            if isinstance(dist, stats.rv_continuous):
                assert_allclose(X.logpdf(x), Y.logpdf(x), rtol=rtol)
                assert_allclose(X.pdf(x), Y.pdf(x), rtol=rtol)
            else:
                assert_allclose(X.logpmf(x), Y.logpmf(x), rtol=rtol)
                assert_allclose(X.pmf(x), Y.pmf(x), rtol=rtol)
            assert_allclose(X.logcdf(x), Y.logcdf(x), rtol=rtol)
            assert_allclose(X.cdf(x), Y.cdf(x), rtol=rtol)
            if distname not in skip_logccdf:
                assert_allclose(X.logccdf(x), Y.logsf(x), rtol=rtol)
            assert_allclose(X.ccdf(x), Y.sf(x), rtol=rtol)
    
            # old infrastructure convention for ppf(p=0) and isf(p=1) is different than
            # new infrastructure. Adjust reference values accordingly.
            a, _ = Y.support()
            ref_ppf = Y.ppf(p)
            ref_ppf[p == 0] = a
            ref_isf = Y.isf(p)
            ref_isf[p == 1] = a
    
>           assert_allclose(X.icdf(p), ref_ppf, rtol=rtol)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-07, atol=0
E           
E           Mismatched elements: 6 / 10 (60%)
E           Max absolute difference among violations: 1.
E           Max relative difference among violations: 0.03448276
E            ACTUAL: array([30., 27., 30., 31., 28., 29., 29., 29., 32., 26.])
E            DESIRED: array([31., 27., 31., 31., 29., 30., 30., 30., 32., 26.])

CustomDistribution = <class 'scipy.stats._distribution_infrastructure._make_distribution_rv_generic.<locals>.CustomDistribution'>
X          = NoncentralHypergeometricFisher(M=np.float64(140.0), N=np.float64(60.0), n=np.float64(80.0), odds=np.float64(0.5))
Y          = <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x11b33f470>
_          = np.int64(60)
a          = np.int64(0)
atol       = 1e-12
custom_tolerances = {'ksone': 1e-05, 'kstwo': 1e-05}
dist       = <scipy.stats._discrete_distns.nchypergeom_fisher_gen object at 0x11a73d430>
distdata   = ['nchypergeom_fisher', (140, 80, 60, 0.5)]
distname   = 'nchypergeom_fisher'
i          = 128
k          = np.float64(-0.015031959595980027)
m          = np.float64(28.44642744936175)
p          = array([0.76340047, 0.37233802, 0.76340047, 0.85618448, 0.50915771,
       0.64451321, 0.64451321, 0.64451321, 0.92056245, 0.24967749])
params     = {'M': 140, 'N': 60, 'n': 80, 'odds': 0.5}
ref_isf    = array([26., 29., 26., 25., 28., 27., 27., 27., 24., 30.])
ref_ppf    = array([31., 27., 31., 31., 29., 30., 30., 30., 32., 26.])
rng        = Generator(PCG64) at 0x10C93A340
rtol       = 1e-07
s          = np.float64(0.025421969966314347)
self       = <scipy.stats.tests.test_continuous.TestMakeDistribution object at 0x11b53ec60>
skip_entropy = {'kstwobign', 'pearson3'}
skip_kurtosis = {'chi', 'exponpow', 'invgamma', 'johnsonsb', 'ksone', 'kstwo', ...}
skip_logccdf = {'arcsine', 'skewcauchy', 'trapezoid', 'triang'}
skip_raw   = {2: {'alpha', 'foldcauchy', 'halfcauchy', 'levy', 'levy_l'}, 3: {'pareto'}, 4: {'invgamma'}}
skip_skewness = {'exponpow', 'ksone', 'nchypergeom_wallenius'}
skip_standardized = {'exponpow', 'ksone', 'nchypergeom_wallenius'}
slow       = {'argus', 'exponpow', 'exponweib', 'genexpon', 'gompertz', 'halfgennorm', ...}
v          = np.float64(8.300922457539917)
x          = array([30., 27., 30., 31., 28., 29., 29., 29., 32., 26.])

scipy/stats/tests/test_continuous.py:1189: AssertionError

The 3 slowest test durations on that call are also all > 12 seconds on a fairly recent M3 Mac:

======================================================= slowest 10 durations =======================================================
25.08s call     build-install/usr/lib/python3.12/site-packages/scipy/stats/tests/test_continuous.py::TestMakeDistribution::test_rv_generic[120-distdata120]
13.45s call     build-install/usr/lib/python3.12/site-packages/scipy/stats/tests/test_continuous.py::TestMakeDistribution::test_rv_generic[138-distdata138]
12.89s call     build-install/usr/lib/python3.12/site-packages/scipy/stats/tests/test_continuous.py::TestMakeDistribution::test_rv_generic[130-distdata130]

On latest main, the same test incantation has no test slower than 0.20 seconds.

Matt and I are still in the same physical location--I'll see if we can chat after supper today.

mdhaber · 2025-05-15T07:35:48Z

Thanks @tylerjereddy! I skipped the slow tests and prevented the failing nchypergeom distributions from being accepted by make_distribution - they can't be evaluated at non-integral x, so I wouldn't expect inversion to work properly.

tylerjereddy · 2025-05-15T21:05:44Z

Jake wasn't aware of a few other recent changes, but I'll consider his approving review still valid given the minor review commits that Matt applied after my review and the passing CI.

I poked around locally with my above incantation and it doesn't seem to cause problems with failure or performance of tests now.

Thanks Matt

MAINT: stats.DiscreteDistribution: fix inversion methods

af3f28b

github-actions bot added scipy.stats maintenance Items related to regular maintenance tasks labels May 11, 2025

mdhaber added this to the 1.16.0 milestone May 11, 2025

mdhaber marked this pull request as ready for review May 14, 2025 03:58

mdhaber requested a review from ev-br as a code owner May 14, 2025 03:58

mdhaber requested review from j-bowhay and removed request for ev-br May 14, 2025 03:59

j-bowhay approved these changes May 14, 2025

View reviewed changes

tylerjereddy reviewed May 15, 2025

View reviewed changes

MAINT: stats.make_distribution: fix slow and failing distributions

111135c

TST: stats.make_distribution: mark as slow

a08e18c

tylerjereddy merged commit 73ec34c into scipy:main May 15, 2025
39 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

MAINT: stats.DiscreteDistribution: fix inversion methods #22970

MAINT: stats.DiscreteDistribution: fix inversion methods #22970

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MAINT: stats.DiscreteDistribution: fix inversion methods #22970

MAINT: stats.DiscreteDistribution: fix inversion methods #22970

Uh oh!

Conversation

Uh oh!

Reference issue

What does this implement/fix?

Additional information

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!