
⚠️ CI failed on Linux.ubuntu_atlas ⚠️ #24131


Closed
scikit-learn-bot opened this issue Aug 6, 2022 · 9 comments

@scikit-learn-bot
Contributor
scikit-learn-bot commented Aug 6, 2022

CI is still failing on Linux.ubuntu_atlas (Aug 30, 2022)

  • Test Collection Failure
@github-actions github-actions bot added the Needs Triage Issue requires triage label Aug 6, 2022
@scikit-learn-bot
Contributor Author
scikit-learn-bot commented Aug 7, 2022

CI is no longer failing! ✅

Successful run on Nov 25, 2022

@lesteve lesteve added Build / CI and removed Needs Triage Issue requires triage labels Aug 8, 2022
@lesteve
Member
lesteve commented Aug 8, 2022

Seems like a real issue for global_random_seed=20. On my Ubuntu 20.04 machine (making sure ATLAS is used by choosing it via sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu) I get an error rather than a crash:

SKLEARN_RUN_FLOAT32_TESTS=1 SKLEARN_TESTS_GLOBAL_RANDOM_SEED=20 pytest sklearn/decomposition/tests/test_fastica.py -k test_fastica_simple -k False 

The error below is due to a linalg operation being called on an array containing NaNs:

ValueError: On entry to SLASCL parameter number 4 had an illegal value
Full pytest failure info
_______________________________________ test_fastica_simple[20-float32-False] _______________________________________

add_noise = False, global_random_seed = 20, global_dtype = <class 'numpy.float32'>

    @pytest.mark.filterwarnings(
        "ignore:Starting in v1.3, whiten='unit-variance' will be used by default."
    )
    @pytest.mark.parametrize("add_noise", [True, False])
    def test_fastica_simple(add_noise, global_random_seed, global_dtype):
        # Test the FastICA algorithm on very simple data.
        rng = np.random.RandomState(global_random_seed)
        n_samples = 1000
        # Generate two sources:
        s1 = (2 * np.sin(np.linspace(0, 100, n_samples)) > 0) - 1
        s2 = stats.t.rvs(1, size=n_samples, random_state=global_random_seed)
        s = np.c_[s1, s2].T
        center_and_norm(s)
        s = s.astype(global_dtype)
        s1, s2 = s
    
        # Mixing angle
        phi = 0.6
        mixing = np.array([[np.cos(phi), np.sin(phi)], [np.sin(phi), -np.cos(phi)]])
        mixing = mixing.astype(global_dtype)
        m = np.dot(mixing, s)
    
        if add_noise:
            m += 0.1 * rng.randn(2, 1000)
    
        center_and_norm(m)
    
        # function as fun arg
        def g_test(x):
            return x**3, (3 * x**2).mean(axis=-1)
    
        algos = ["parallel", "deflation"]
        nls = ["logcosh", "exp", "cube", g_test]
        whitening = ["arbitrary-variance", "unit-variance", False]
        for algo, nl, whiten in itertools.product(algos, nls, whitening):
            if whiten:
                kwargs = dict(X=m.T, fun=nl, whiten=whiten, algorithm=algo, random_state=rng)
                import copy
                kwargs_copy = copy.deepcopy(kwargs)
                try:
>                   k_, mixing_, s_ = fastica(
                       **kwargs
                    )

sklearn/decomposition/tests/test_fastica.py:113: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/decomposition/_fastica.py:324: in fastica
    S = est._fit_transform(X, compute_sources=compute_sources)
sklearn/decomposition/_fastica.py:685: in _fit_transform
    self.mixing_ = linalg.pinv(self.components_, check_finite=False)
/usr/lib/python3/dist-packages/scipy/linalg/basic.py:1308: in pinv
    x, resids, rank, s = lstsq(a, b, cond=cond, check_finite=False)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

a = array([[-0.0178364 ,  0.02612615],
       [        nan,         nan]], dtype=float32)
b = array([[1., 0.],
       [0., 1.]], dtype=float32), cond = 2.384185791015625e-07, overwrite_a = False
overwrite_b = False, check_finite = False, lapack_driver = None

    def lstsq(a, b, cond=None, overwrite_a=False, overwrite_b=False,
              check_finite=True, lapack_driver=None):
        """
        Compute least-squares solution to equation Ax = b.
    
        Compute a vector x such that the 2-norm ``|b - A x|`` is minimized.
    
        Parameters
        ----------
        a : (M, N) array_like
            Left hand side array
        b : (M,) or (M, K) array_like
            Right hand side array
        cond : float, optional
            Cutoff for 'small' singular values; used to determine effective
            rank of a. Singular values smaller than
            ``rcond * largest_singular_value`` are considered zero.
        overwrite_a : bool, optional
            Discard data in `a` (may enhance performance). Default is False.
        overwrite_b : bool, optional
            Discard data in `b` (may enhance performance). Default is False.
        check_finite : bool, optional
            Whether to check that the input matrices contain only finite numbers.
            Disabling may give a performance gain, but may result in problems
            (crashes, non-termination) if the inputs do contain infinities or NaNs.
        lapack_driver : str, optional
            Which LAPACK driver is used to solve the least-squares problem.
            Options are ``'gelsd'``, ``'gelsy'``, ``'gelss'``. Default
            (``'gelsd'``) is a good choice.  However, ``'gelsy'`` can be slightly
            faster on many problems.  ``'gelss'`` was used historically.  It is
            generally slow but uses less memory.
    
            .. versionadded:: 0.17.0
    
        Returns
        -------
        x : (N,) or (N, K) ndarray
            Least-squares solution.  Return shape matches shape of `b`.
        residues : (K,) ndarray or float
            Square of the 2-norm for each column in ``b - a x``, if ``M > N`` and
            ``ndim(A) == n`` (returns a scalar if b is 1-D). Otherwise a
            (0,)-shaped array is returned.
        rank : int
            Effective rank of `a`.
        s : (min(M, N),) ndarray or None
            Singular values of `a`. The condition number of a is
            ``abs(s[0] / s[-1])``.
    
        Raises
        ------
        LinAlgError
            If computation does not converge.
    
        ValueError
            When parameters are not compatible.
    
        See Also
        --------
        scipy.optimize.nnls : linear least squares with non-negativity constraint
    
        Notes
        -----
        When ``'gelsy'`` is used as a driver, `residues` is set to a (0,)-shaped
        array and `s` is always ``None``.
    
        Examples
        --------
        >>> from scipy.linalg import lstsq
        >>> import matplotlib.pyplot as plt
    
        Suppose we have the following data:
    
        >>> x = np.array([1, 2.5, 3.5, 4, 5, 7, 8.5])
        >>> y = np.array([0.3, 1.1, 1.5, 2.0, 3.2, 6.6, 8.6])
    
        We want to fit a quadratic polynomial of the form ``y = a + b*x**2``
        to this data.  We first form the "design matrix" M, with a constant
        column of 1s and a column containing ``x**2``:
    
        >>> M = x[:, np.newaxis]**[0, 2]
        >>> M
        array([[  1.  ,   1.  ],
               [  1.  ,   6.25],
               [  1.  ,  12.25],
               [  1.  ,  16.  ],
               [  1.  ,  25.  ],
               [  1.  ,  49.  ],
               [  1.  ,  72.25]])
    
        We want to find the least-squares solution to ``M.dot(p) = y``,
        where ``p`` is a vector with length 2 that holds the parameters
        ``a`` and ``b``.
    
        >>> p, res, rnk, s = lstsq(M, y)
        >>> p
        array([ 0.20925829,  0.12013861])
    
        Plot the data and the fitted curve.
    
        >>> plt.plot(x, y, 'o', label='data')
        >>> xx = np.linspace(0, 9, 101)
        >>> yy = p[0] + p[1]*xx**2
        >>> plt.plot(xx, yy, label='least squares fit, $y = a + bx^2$')
        >>> plt.xlabel('x')
        >>> plt.ylabel('y')
        >>> plt.legend(framealpha=1, shadow=True)
        >>> plt.grid(alpha=0.25)
        >>> plt.show()
    
        """
        a1 = _asarray_validated(a, check_finite=check_finite)
        b1 = _asarray_validated(b, check_finite=check_finite)
        if len(a1.shape) != 2:
            raise ValueError('Input array a should be 2-D')
        m, n = a1.shape
        if len(b1.shape) == 2:
            nrhs = b1.shape[1]
        else:
            nrhs = 1
        if m != b1.shape[0]:
            raise ValueError('Shape mismatch: a and b should have the same number'
                             ' of rows ({} != {}).'.format(m, b1.shape[0]))
        if m == 0 or n == 0:  # Zero-sized problem, confuses LAPACK
            x = np.zeros((n,) + b1.shape[1:], dtype=np.common_type(a1, b1))
            if n == 0:
                residues = np.linalg.norm(b1, axis=0)**2
            else:
                residues = np.empty((0,))
            return x, residues, 0, np.empty((0,))
    
        driver = lapack_driver
        if driver is None:
            driver = lstsq.default_lapack_driver
        if driver not in ('gelsd', 'gelsy', 'gelss'):
            raise ValueError('LAPACK driver "%s" is not found' % driver)
    
        lapack_func, lapack_lwork = get_lapack_funcs((driver,
                                                     '%s_lwork' % driver),
                                                     (a1, b1))
        real_data = True if (lapack_func.dtype.kind == 'f') else False
    
        if m < n:
            # need to extend b matrix as it will be filled with
            # a larger solution matrix
            if len(b1.shape) == 2:
                b2 = np.zeros((n, nrhs), dtype=lapack_func.dtype)
                b2[:m, :] = b1
            else:
                b2 = np.zeros(n, dtype=lapack_func.dtype)
                b2[:m] = b1
            b1 = b2
    
        overwrite_a = overwrite_a or _datacopied(a1, a)
        overwrite_b = overwrite_b or _datacopied(b1, b)
    
        if cond is None:
            cond = np.finfo(lapack_func.dtype).eps
    
        if driver in ('gelss', 'gelsd'):
            if driver == 'gelss':
                lwork = _compute_lwork(lapack_lwork, m, n, nrhs, cond)
                v, x, s, rank, work, info = lapack_func(a1, b1, cond, lwork,
                                                        overwrite_a=overwrite_a,
                                                        overwrite_b=overwrite_b)
    
            elif driver == 'gelsd':
                if real_data:
                    lwork, iwork = _compute_lwork(lapack_lwork, m, n, nrhs, cond)
>                   x, s, rank, info = lapack_func(a1, b1, lwork,
                                                   iwork, cond, False, False)
E                   ValueError: On entry to SLASCL parameter number 4 had an illegal value

/usr/lib/python3/dist-packages/scipy/linalg/basic.py:1210: ValueError
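For reference: in the SciPy version from the traceback above, pinv is implemented via lstsq/gelsd, so the error can be reproduced directly by letting NaNs past the finiteness check. This is only a sketch; the exact failure depends on the SciPy/LAPACK build:

import numpy as np
from scipy import linalg

# Same kind of input as in the traceback: one row of the matrix has
# collapsed to NaN before pinv is called.
a = np.array([[-0.0178364, 0.02612615],
              [np.nan, np.nan]], dtype=np.float32)

# check_finite=False lets the NaNs reach LAPACK; depending on the
# SciPy/LAPACK build this raises the SLASCL ValueError above, another
# LAPACK error, or a LinAlgError.
linalg.pinv(a, check_finite=False)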

Debugging a bit further, the relevant code in fastica's deflation loop is:

w1 = (X * gwtx).mean(axis=1) - g_wtx.mean() * w  # FastICA fixed-point update for the current component
_gs_decorrelation(w1, W, j)  # Gram-Schmidt decorrelation against the j components already found
w1 /= np.sqrt((w1**2).sum())  # normalization: divides by zero if w1 is exactly 0

With ATLAS:

At one point w1 is exactly 0, so the normalization divides by zero and you get NaNs from there on (see the minimal sketch at the end of this comment):

component 0 iteration 0 w1 after decorrelation: [ 0.23501348 -0.19262002]
component 0 iteration 1 w1 before decorrelation: [0.04253763 0.40387636]
component 0 iteration 1 w1 after decorrelation: [0.04253763 0.40387636]
component 0 iteration 2 w1 before decorrelation: [ 0.13986371 -0.6734375 ]
component 0 iteration 2 w1 after decorrelation: [ 0.13986371 -0.6734375 ]
component 0 iteration 3 w1 before decorrelation: [-0.14534193  0.7702849 ]
component 0 iteration 3 w1 after decorrelation: [-0.14534193  0.7702849 ]
component 0 iteration 4 w1 before decorrelation: [ 0.14539295 -0.77054155]
component 0 iteration 4 w1 after decorrelation: [ 0.14539295 -0.77054155]
component 1 iteration 0 w1 before decorrelation: [-0.14537632  0.7704534 ]
component 1 iteration 0 w1 after decorrelation: [0. 0.]
component 1 iteration 1 w1 before decorrelation: [nan nan]
...
(keeps going with NaNs until max_iter)

With OpenBLAS:

Instead of exactly zero you get a small value on the order of 1e-8, i.e. below float32 precision (see the note after the trace below):

component 0 iteration 0 w1 after decorrelation: [ 0.23501337 -0.19262013]
component 0 iteration 1 w1 before decorrelation: [0.04253733 0.4038767 ]
component 0 iteration 1 w1 after decorrelation: [0.04253733 0.4038767 ]
component 0 iteration 2 w1 before decorrelation: [ 0.13986367 -0.6734381 ]
component 0 iteration 2 w1 after decorrelation: [ 0.13986367 -0.6734381 ]
component 0 iteration 3 w1 before decorrelation: [-0.1453419   0.77028495]
component 0 iteration 3 w1 after decorrelation: [-0.1453419   0.77028495]
component 0 iteration 4 w1 before decorrelation: [ 0.14539292 -0.77054155]
component 0 iteration 4 w1 after decorrelation: [ 0.14539292 -0.77054155]
component 1 iteration 0 w1 before decorrelation: [-0.1453763  0.7704534]
component 1 iteration 0 w1 after decorrelation: [-1.4901161e-08  0.0000000e+00]
component 1 iteration 1 w1 before decorrelation: [-0.3199173  -0.06967697]
...
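For scale (a quick check, assuming nothing beyond NumPy):

import numpy as np

print(np.finfo(np.float32).eps)  # ~1.1920929e-07
# The OpenBLAS residual -1.4901161e-08 (= -2**-26) is well below float32's
# machine epsilon, so the normalization divides by a tiny but nonzero norm
# instead of by exactly zero.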

Suggestions are more than welcome on how to fix or work around this issue!
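For illustration, a minimal standalone sketch of the failure mode described above (hypothetical code, not the actual fastica internals):

import numpy as np

w1 = np.zeros(2, dtype=np.float32)   # decorrelation left an exactly-zero vector
norm = np.sqrt((w1**2).sum())        # 0.0
with np.errstate(invalid="ignore"):  # silence the 0/0 RuntimeWarning
    w1 /= norm
print(w1)                            # [nan nan] -- and every later update stays NaN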

@lesteve
Member
lesteve commented Aug 11, 2022

Edit: should be fixed by #24168

Hmmm, the latest failure is a separate one:

test_pairwise_distances_argkmin[80-float32-parallel_on_X-cityblock-1000000.0-500]
test_pairwise_distances_argkmin[80-float32-parallel_on_Y-cityblock-1000000.0-500]

Maybe related to #23865, cc @jjerphan for more visibility.

I can reproduce both failures on an Ubuntu 20.04 box with ATLAS and SKLEARN_TESTS_GLOBAL_RANDOM_SEED=72 (for some reason, with 80 only the parallel_on_X one fails ...):

 SKLEARN_RUN_FLOAT32_TESTS=1 SKLEARN_TESTS_GLOBAL_RANDOM_SEED=72 pytest sklearn/metrics/tests/test_pairwise_distances_reduction.py -k 'parallel_on_ and float32 and cityblock and 1000000.0'
Full pytest failure
========================================================================= FAILURES ==========================================================================
_____________________________________ test_pairwise_distances_argkmin[72-float32-parallel_on_X-cityblock-1000000.0-500] _____________________________________

global_random_seed = 72, n_features = 500, translation = 1000000.0, metric = 'cityblock', strategy = 'parallel_on_X', dtype = <class 'numpy.float32'>
n_samples = 100, k = 10

    @pytest.mark.filterwarnings("ignore:WMinkowskiDistance:FutureWarning:sklearn")
    @pytest.mark.parametrize("n_features", [50, 500])
    @pytest.mark.parametrize("translation", [0, 1e6])
    @pytest.mark.parametrize("metric", CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS)
    @pytest.mark.parametrize("strategy", ("parallel_on_X", "parallel_on_Y"))
    @pytest.mark.parametrize("dtype", [np.float64, np.float32])
    def test_pairwise_distances_argkmin(
        global_random_seed,
        n_features,
        translation,
        metric,
        strategy,
        dtype,
        n_samples=100,
        k=10,
    ):
        # TODO: can we easily fix this discrepancy?
        edge_cases = [
            (np.float32, "chebyshev", 1000000.0),
            (np.float32, "chebyshev", 1000000.0),
        ]
        if (dtype, metric, translation) in edge_cases:
            pytest.xfail("Numerical differences lead to small differences in results.")
    
        rng = np.random.RandomState(global_random_seed)
        spread = 1000
        X = translation + rng.rand(n_samples, n_features).astype(dtype) * spread
        Y = translation + rng.rand(n_samples, n_features).astype(dtype) * spread
    
        # Haversine distance only accepts 2D data
        if metric == "haversine":
            X = np.ascontiguousarray(X[:, :2])
            Y = np.ascontiguousarray(Y[:, :2])
    
        metric_kwargs = _get_metric_params_list(metric, n_features)[0]
    
        # Reference for argkmin results
        if metric == "euclidean":
            # Compare to scikit-learn GEMM optimized implementation
            dist_matrix = euclidean_distances(X, Y)
        else:
            dist_matrix = cdist(X, Y, metric=metric, **metric_kwargs)
        # Taking argkmin (indices of the k smallest values)
        argkmin_indices_ref = np.argsort(dist_matrix, axis=1)[:, :k]
        # Getting the associated distances
        argkmin_distances_ref = np.zeros(argkmin_indices_ref.shape, dtype=np.float64)
        for row_idx in range(argkmin_indices_ref.shape[0]):
            argkmin_distances_ref[row_idx] = dist_matrix[
                row_idx, argkmin_indices_ref[row_idx]
            ]
    
        argkmin_distances, argkmin_indices = PairwiseDistancesArgKmin.compute(
            X,
            Y,
            k,
            metric=metric,
            metric_kwargs=metric_kwargs,
            return_distance=True,
            # So as to have more than a chunk, forcing parallelism.
            chunk_size=n_samples // 4,
            strategy=strategy,
        )
    
>       ASSERT_RESULT[(PairwiseDistancesArgKmin, dtype)](
            argkmin_distances,
            argkmin_distances_ref,
            argkmin_indices,
            argkmin_indices_ref,
        )

sklearn/metrics/tests/test_pairwise_distances_reduction.py:919: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ref_dist = array([[155344.875 , 155930.1875, 156232.6875, 157066.75  , 157164.8125,
        157680.3125, 157733.4375, 158197.25  ...4375, 154084.75  , 154663.4375, 155331.    ,
        156545.8125, 156738.125 , 157577.4375, 157760.375 , 158019.8125]])
dist = array([[155344.875 , 155930.1875, 156232.6875, 157066.75  , 157164.8125,
        157680.3125, 157733.4375, 158197.25  ...4375, 154084.75  , 154663.4375, 155331.    ,
        156545.8125, 156738.125 , 157577.4375, 157760.375 , 158019.8125]])
ref_indices = array([[66, 61, 96, 60, 76, 22, 38, 89, 97, 37],
       [59, 26, 47, 50,  4, 14, 69, 46, 60, 76],
       [27, 38, 22, ...38, 37, 34, 70, 87],
       [29, 78, 22, 27, 73, 56, 45, 15, 97,  0],
       [60, 98, 16, 61, 66, 76,  6, 68, 26, 59]])
indices = array([[66, 61, 96, 60, 76, 22, 38, 89, 97, 37],
       [59, 26, 47, 50,  4, 14, 69, 46, 60, 76],
       [27, 38, 22, ...38, 37, 34, 70, 87],
       [29, 78, 22, 27, 73, 56, 45, 15, 97,  0],
       [60, 98, 16, 61, 66, 76,  6, 68, 26, 59]])
rtol = 0.0001

    def assert_argkmin_results_quasi_equality(
        ref_dist,
        dist,
        ref_indices,
        indices,
        rtol=1e-4,
    ):
        """Assert that argkmin results are valid up to:
          - relative tolerance on computed distance values
          - permutations of indices for distances values that differ up to
            a precision level
    
        To be used for testing neighbors queries on float32 datasets: we
        accept neighbors rank swaps only if they are caused by small
        rounding errors on the distance computations.
        """
        is_sorted = lambda a: np.all(a[:-1] <= a[1:])
    
        n_significant_digits = -(int(floor(log10(abs(rtol)))) + 1)
    
        assert (
            ref_dist.shape == dist.shape == ref_indices.shape == indices.shape
        ), "Arrays of results have various shapes."
    
        n_queries, n_neighbors = ref_dist.shape
    
        # Asserting equality results one row at a time
        for query_idx in range(n_queries):
            ref_dist_row = ref_dist[query_idx]
            dist_row = dist[query_idx]
    
            assert is_sorted(
                ref_dist_row
            ), f"Reference distances aren't sorted on row {query_idx}"
            assert is_sorted(dist_row), f"Distances aren't sorted on row {query_idx}"
    
            assert_allclose(ref_dist_row, dist_row, rtol=rtol)
    
            ref_indices_row = ref_indices[query_idx]
            indices_row = indices[query_idx]
    
            # Grouping indices by distances using sets on a rounded distances up
            # to a given number of decimals of significant digits derived from rtol.
            reference_neighbors_groups = defaultdict(set)
            effective_neighbors_groups = defaultdict(set)
    
            for neighbor_rank in range(n_neighbors):
                rounded_dist = relative_rounding(
                    ref_dist_row[neighbor_rank],
                    n_significant_digits=n_significant_digits,
                )
                reference_neighbors_groups[rounded_dist].add(ref_indices_row[neighbor_rank])
                effective_neighbors_groups[rounded_dist].add(indices_row[neighbor_rank])
    
            # Asserting equality of groups (sets) for each distance
            msg = (
                f"Neighbors indices for query {query_idx} are not matching "
                f"when rounding distances at {n_significant_digits} significant digits "
                f"derived from rtol={rtol:.1e}"
            )
            for rounded_distance in reference_neighbors_groups.keys():
>               assert (
                    reference_neighbors_groups[rounded_distance]
                    == effective_neighbors_groups[rounded_distance]
                ), msg
E               AssertionError: Neighbors indices for query 39 are not matching when rounding distances at 3 significant digits derived from rtol=1.0e-04
E               assert {33, 60, 76, 99} == {41, 60, 76, 99}
E                 Extra items in the left set:
E                 33
E                 Extra items in the right set:
E                 41
E                 Use -v to get more diff

sklearn/metrics/tests/test_pairwise_distances_reduction.py:181: AssertionError
_____________________________________ test_pairwise_distances_argkmin[72-float32-parallel_on_Y-cityblock-1000000.0-500] _____________________________________

global_random_seed = 72, n_features = 500, translation = 1000000.0, metric = 'cityblock', strategy = 'parallel_on_Y', dtype = <class 'numpy.float32'>
n_samples = 100, k = 10

    @pytest.mark.filterwarnings("ignore:WMinkowskiDistance:FutureWarning:sklearn")
    @pytest.mark.parametrize("n_features", [50, 500])
    @pytest.mark.parametrize("translation", [0, 1e6])
    @pytest.mark.parametrize("metric", CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS)
    @pytest.mark.parametrize("strategy", ("parallel_on_X", "parallel_on_Y"))
    @pytest.mark.parametrize("dtype", [np.float64, np.float32])
    def test_pairwise_distances_argkmin(
        global_random_seed,
        n_features,
        translation,
        metric,
        strategy,
        dtype,
        n_samples=100,
        k=10,
    ):
        # TODO: can we easily fix this discrepancy?
        edge_cases = [
            (np.float32, "chebyshev", 1000000.0),
            (np.float32, "chebyshev", 1000000.0),
        ]
        if (dtype, metric, translation) in edge_cases:
            pytest.xfail("Numerical differences lead to small differences in results.")
    
        rng = np.random.RandomState(global_random_seed)
        spread = 1000
        X = translation + rng.rand(n_samples, n_features).astype(dtype) * spread
        Y = translation + rng.rand(n_samples, n_features).astype(dtype) * spread
    
        # Haversine distance only accepts 2D data
        if metric == "haversine":
            X = np.ascontiguousarray(X[:, :2])
            Y = np.ascontiguousarray(Y[:, :2])
    
        metric_kwargs = _get_metric_params_list(metric, n_features)[0]
    
        # Reference for argkmin results
        if metric == "euclidean":
            # Compare to scikit-learn GEMM optimized implementation
            dist_matrix = euclidean_distances(X, Y)
        else:
            dist_matrix = cdist(X, Y, metric=metric, **metric_kwargs)
        # Taking argkmin (indices of the k smallest values)
        argkmin_indices_ref = np.argsort(dist_matrix, axis=1)[:, :k]
        # Getting the associated distances
        argkmin_distances_ref = np.zeros(argkmin_indices_ref.shape, dtype=np.float64)
        for row_idx in range(argkmin_indices_ref.shape[0]):
            argkmin_distances_ref[row_idx] = dist_matrix[
                row_idx, argkmin_indices_ref[row_idx]
            ]
    
        argkmin_distances, argkmin_indices = PairwiseDistancesArgKmin.compute(
            X,
            Y,
            k,
            metric=metric,
            metric_kwargs=metric_kwargs,
            return_distance=True,
            # So as to have more than a chunk, forcing parallelism.
            chunk_size=n_samples // 4,
            strategy=strategy,
        )
    
>       ASSERT_RESULT[(PairwiseDistancesArgKmin, dtype)](
            argkmin_distances,
            argkmin_distances_ref,
            argkmin_indices,
            argkmin_indices_ref,
        )

sklearn/metrics/tests/test_pairwise_distances_reduction.py:919: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ref_dist = array([[155344.875 , 155930.1875, 156232.6875, 157066.75  , 157164.8125,
        157680.3125, 157733.4375, 158197.25  ...4375, 154084.75  , 154663.4375, 155331.    ,
        156545.8125, 156738.125 , 157577.4375, 157760.375 , 158019.8125]])
dist = array([[155344.875 , 155930.1875, 156232.6875, 157066.75  , 157164.8125,
        157680.3125, 157733.4375, 158197.25  ...4375, 154084.75  , 154663.4375, 155331.    ,
        156545.8125, 156738.125 , 157577.4375, 157760.375 , 158019.8125]])
ref_indices = array([[66, 61, 96, 60, 76, 22, 38, 89, 97, 37],
       [59, 26, 47, 50,  4, 14, 69, 46, 60, 76],
       [27, 38, 22, ...38, 37, 34, 70, 87],
       [29, 78, 22, 27, 73, 56, 45, 15, 97,  0],
       [60, 98, 16, 61, 66, 76,  6, 68, 26, 59]])
indices = array([[66, 61, 96, 60, 76, 22, 38, 89, 97, 37],
       [59, 26, 47, 50,  4, 14, 69, 46, 60, 76],
       [27, 38, 22, ...38, 37, 34, 70, 87],
       [29, 78, 22, 27, 73, 56, 45, 15, 97,  0],
       [60, 98, 16, 61, 66, 76,  6, 68, 26, 59]])
rtol = 0.0001

    def assert_argkmin_results_quasi_equality(
        ref_dist,
        dist,
        ref_indices,
        indices,
        rtol=1e-4,
    ):
        """Assert that argkmin results are valid up to:
          - relative tolerance on computed distance values
          - permutations of indices for distances values that differ up to
            a precision level
    
        To be used for testing neighbors queries on float32 datasets: we
        accept neighbors rank swaps only if they are caused by small
        rounding errors on the distance computations.
        """
        is_sorted = lambda a: np.all(a[:-1] <= a[1:])
    
        n_significant_digits = -(int(floor(log10(abs(rtol)))) + 1)
    
        assert (
            ref_dist.shape == dist.shape == ref_indices.shape == indices.shape
        ), "Arrays of results have various shapes."
    
        n_queries, n_neighbors = ref_dist.shape
    
        # Asserting equality results one row at a time
        for query_idx in range(n_queries):
            ref_dist_row = ref_dist[query_idx]
            dist_row = dist[query_idx]
    
            assert is_sorted(
                ref_dist_row
            ), f"Reference distances aren't sorted on row {query_idx}"
            assert is_sorted(dist_row), f"Distances aren't sorted on row {query_idx}"
    
            assert_allclose(ref_dist_row, dist_row, rtol=rtol)
    
            ref_indices_row = ref_indices[query_idx]
            indices_row = indices[query_idx]
    
            # Grouping indices by distances using sets on a rounded distances up
            # to a given number of decimals of significant digits derived from rtol.
            reference_neighbors_groups = defaultdict(set)
            effective_neighbors_groups = defaultdict(set)
    
            for neighbor_rank in range(n_neighbors):
                rounded_dist = relative_rounding(
                    ref_dist_row[neighbor_rank],
                    n_significant_digits=n_significant_digits,
                )
                reference_neighbors_groups[rounded_dist].add(ref_indices_row[neighbor_rank])
                effective_neighbors_groups[rounded_dist].add(indices_row[neighbor_rank])
    
            # Asserting equality of groups (sets) for each distance
            msg = (
                f"Neighbors indices for query {query_idx} are not matching "
                f"when rounding distances at {n_significant_digits} significant digits "
                f"derived from rtol={rtol:.1e}"
            )
            for rounded_distance in reference_neighbors_groups.keys():
>               assert (
                    reference_neighbors_groups[rounded_distance]
                    == effective_neighbors_groups[rounded_distance]
                ), msg
E               AssertionError: Neighbors indices for query 39 are not matching when rounding distances at 3 significant digits derived from rtol=1.0e-04
E               assert {33, 60, 76, 99} == {41, 60, 76, 99}
E                 Extra items in the left set:
E                 33
E                 Extra items in the right set:
E                 41
E                 Use -v to get more diff

sklearn/metrics/tests/test_pairwise_distances_reduction.py:181: AssertionError
======================================================== 2 failed, 6 passed, 426 deselected in 0.53s ========================================================
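To make the grouping logic in assert_argkmin_results_quasi_equality concrete, here is a small self-contained sketch; round_to_significant is a hypothetical stand-in for the test suite's relative_rounding, and the distances are illustrative:

from collections import defaultdict
from math import floor, log10

def round_to_significant(x, n_significant_digits):
    # Hypothetical stand-in for relative_rounding: round to n significant digits.
    if x == 0:
        return 0.0
    magnitude = floor(log10(abs(x)))
    return round(x, -int(magnitude) + n_significant_digits - 1)

rtol = 1e-4
n_significant_digits = -(int(floor(log10(abs(rtol)))) + 1)  # -> 3

# Two neighbors whose distances agree to 3 significant digits may legitimately
# swap ranks between the reference and the tested implementation.
ref_dist_row = [155344.875, 155930.1875, 155930.2500]
ref_indices_row = [66, 61, 96]
indices_row = [66, 96, 61]  # ranks 1 and 2 swapped

reference_groups, effective_groups = defaultdict(set), defaultdict(set)
for rank, dist in enumerate(ref_dist_row):
    key = round_to_significant(dist, n_significant_digits)
    reference_groups[key].add(ref_indices_row[rank])
    effective_groups[key].add(indices_row[rank])

# The swap is accepted because both indices fall in the same rounded-distance
# group; the failures above are swaps across *different* groups.
assert all(reference_groups[k] == effective_groups[k] for k in reference_groups)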

@thomasjpfan
Member

As noted in #24131 (comment), this was fixed in #24168

@lesteve
Member
lesteve commented Aug 16, 2022

🤔 There were two separate issues here. The pairwise-distances one was fixed, but the fastica one debugged in #24131 (comment) still exists.

I'll reopen this one for now. Do you think it is better to open a separate issue about it?

@lesteve lesteve reopened this Aug 16, 2022
@thomasjpfan
Member

Ah I see, there are two issues. I'm okay with keeping this open.

@ogrisel
Member
ogrisel commented Aug 25, 2022

@jjerphan to help understand the nature of the failures found by assert_argkmin_results_quasi_equality, it would help to report the original distances (before rounding) for the neighbor indices in the two rounded-distance groups in the AssertionError message.

@jjerphan
Member

> @jjerphan to help understand the nature of the failures found by assert_argkmin_results_quasi_equality, it would help to report the original distances (before rounding) for the neighbor indices in the two rounded-distance groups in the AssertionError message.

That's a good suggestion. I've added an item in the description of #22587.
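A sketch of what such an enriched message could look like (the wording, layout, and distance values here are hypothetical):

# Hypothetical illustration: index -> exact (unrounded) distance for the two
# rounded-distance groups that mismatched (values made up for the sketch).
reference_group = {33: 155930.1875, 60: 155930.2031, 76: 155930.2188, 99: 155930.2344}
effective_group = {41: 155930.1250, 60: 155930.2031, 76: 155930.2188, 99: 155930.2344}

def fmt(group):
    return ", ".join(f"{idx}: {dist}" for idx, dist in sorted(group.items()))

msg = (
    "Neighbors indices for query 39 are not matching when rounding distances "
    "at 3 significant digits (rtol=1.0e-04).\n"
    f"  reference group (exact distances): {fmt(reference_group)}\n"
    f"  effective group (exact distances): {fmt(effective_group)}"
)
print(msg)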

@lesteve
Member
lesteve commented Nov 25, 2022

Closing: since #24198 has been merged, this should not happen again.
