KNeighborsRegressor gives different results for different n_jobs values · Issue #12672 · scikit-learn/scikit-learn
KNeighborsRegressor gives different results for different n_jobs values #12672
Closed
@mcorella-geoblink

Description

When using the 'seuclidean' distance metric, KNeighborsRegressor produces different predictions for different values of the n_jobs parameter if V is not passed in metric_params. As a consequence, with n_jobs=-1 two machines can produce different results depending on how many cores each one has. The same happens with the 'mahalanobis' distance metric if neither V nor VI is passed in metric_params.
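A plausible explanation, which this report does not confirm, is that when V is omitted the variance is estimated separately on each chunk of data handed to a worker, so the scaling used by the metric changes with the number of chunks. The NumPy-only sketch below illustrates that assumption on synthetic data; the helper `seuclidean`, the chunk count, and the use of `ddof=1` are illustrative choices, not scikit-learn internals.

```python
import numpy as np


def seuclidean(u, v, V):
    """Standardized Euclidean distance between u and v for per-feature variances V."""
    return np.sqrt(np.sum((u - v) ** 2 / V))


rng = np.random.RandomState(0)
X = rng.rand(12, 3)      # synthetic "training" rows
query = rng.rand(3)      # a single synthetic query point

# Variance estimated once on the full data set.
V_full = np.var(X, axis=0, ddof=1)

# Variance estimated independently on each of three chunks (assumed behaviour).
chunks = np.array_split(X, 3)
V_chunks = [np.var(chunk, axis=0, ddof=1) for chunk in chunks]

# The same pair of points gets a different distance depending on which V is used.
d_full = seuclidean(query, X[0], V_full)
d_chunk = seuclidean(query, X[0], V_chunks[0])
print(d_full, d_chunk)
```

If the variance really is computed per chunk, neighbor rankings can change with the chunking, which would be consistent with predictions that vary with n_jobs.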

Steps/Code to Reproduce

# Import required packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Prepare the dataset
dataset = load_boston()
target = dataset.target
data = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Split the dataset
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Create a regressor with seuclidean distance, n_jobs=1, and no V in metric_params
model_n_jobs_1 = KNeighborsRegressor(n_jobs=1, algorithm='brute', metric='seuclidean')
model_n_jobs_1.fit(X_train, y_train)
np.sum(model_n_jobs_1.predict(X_test)) # --> 2127.99999

# Create a regressor with seuclidean distance, n_jobs=3, and no V in metric_params
model_n_jobs_3 = KNeighborsRegressor(n_jobs=3, algorithm='brute', metric='seuclidean')
model_n_jobs_3.fit(X_train, y_train)
np.sum(model_n_jobs_3.predict(X_test)) # --> 2129.38

# Create a regressor with seuclidean distance, n_jobs=-1 (all cores), and no V in metric_params
model_n_jobs_all = KNeighborsRegressor(n_jobs=-1, algorithm='brute', metric='seuclidean')
model_n_jobs_all.fit(X_train, y_train)
np.sum(model_n_jobs_all.predict(X_test)) # --> 2125.29999

Expected Results

The predictions should always be the same, regardless of the value passed to the n_jobs parameter.

Actual Results

The predictions change with the value passed to n_jobs. With n_jobs=-1, this makes the predictions depend on the number of cores of the machine running the code.
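Since the discrepancy is only reported when V is omitted, one possible workaround is to compute V once from the full training set and pass it explicitly through metric_params. The snippet below is a minimal sketch of that idea; the synthetic data, the helper `predict_with`, and the choice of `ddof=1` are assumptions made for illustration and are not part of the original reproduction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data with features on very different scales (illustrative only).
rng = np.random.RandomState(42)
scales = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
X_train = rng.rand(200, 5) * scales
y_train = rng.rand(200)
X_test = rng.rand(50, 5) * scales

# Compute the per-feature variance once, on the whole training set.
V = np.var(X_train, axis=0, ddof=1)


def predict_with(n_jobs):
    """Fit a brute-force seuclidean regressor with an explicit V and predict."""
    model = KNeighborsRegressor(
        n_jobs=n_jobs,
        algorithm='brute',
        metric='seuclidean',
        metric_params={'V': V},
    )
    model.fit(X_train, y_train)
    return model.predict(X_test)


pred_1 = predict_with(1)
pred_2 = predict_with(2)
print(np.allclose(pred_1, pred_2))
```

With V fixed up front, every worker scales distances the same way, so the predictions should no longer depend on n_jobs. For 'mahalanobis', the analogous step would be to pass V or VI explicitly.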

Versions

System

python: 3.6.6 (default, Jun 28 2018, 04:42:43)  [GCC 5.4.0 20160609]
executable: /home/mcorella/.local/share/virtualenvs/outlier_detection-8L4UL10d/bin/python3.6
machine: Linux-4.15.0-39-generic-x86_64-with-Ubuntu-16.04-xenial

BLAS

macros: NO_ATLAS_INFO=1, HAVE_CBLAS=None
lib_dirs: /usr/lib
cblas_libs: cblas

Python deps

pip: 18.1
setuptools: 40.5.0
sklearn: 0.20.0
numpy: 1.15.4
scipy: 1.1.0
Cython: None
pandas: 0.23.4
