8000 `transform_output` set in `config_context` not preserved in the Transformer object? · Issue #25287 · scikit-learn/scikit-learn · GitHub
transform_output set in config_context not preserved in the Transformer object? #25287

Closed
ravwojdyla opened this issue Jan 4, 2023 · 5 comments · Fixed by #25289
@ravwojdyla
Contributor
ravwojdyla commented Jan 4, 2023

Describe the bug

This is related to #23734 (btw I love this enhancement!). When config_context is used, Transformers created within the context do not register/memoize the transform output. This may be expected, though I could not find it stated explicitly in the documentation.


Could the solution be to add a default __init__ to _SetOutputMixin that captures transform_output if it is set?

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # get_config here is sklearn.get_config
        transform_output = get_config()["transform_output"]
        if transform_output != "default":
            self.set_output(transform=transform_output)

Steps/Code to Reproduce

This works as expected:

import sklearn
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(as_frame=True, return_X_y=True)
with sklearn.config_context(transform_output="pandas"):
    scaler = StandardScaler()
    x = scaler.fit_transform(X)
    print(type(x))  # <class 'pandas.core.frame.DataFrame'>

But:

import sklearn
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(as_frame=True, return_X_y=True)
with sklearn.config_context(transform_output="pandas"):
    scaler = StandardScaler()

x = scaler.fit_transform(X)
print(type(x))  # <class 'numpy.ndarray'>

So when fit_transform is called outside config_context(transform_output="pandas"), the output falls back to the default numpy array. The StandardScaler() constructor does not register the config at construction time.

This is slightly confusing because:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(as_frame=True, return_X_y=True)
scaler = StandardScaler().set_output(transform="pandas")

x = scaler.fit_transform(X)
print(type(x))  # <class 'pandas.core.frame.DataFrame'>

Expected Results

As a user, I would expect config_context(transform_output="pandas") to be memoized/registered during transformer construction, similar to explicitly calling set_output on a transformer.

Actual Results

See above.

Versions

System:
    python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]
executable: ...
   machine: Linux-5.10.0-20-cloud-amd64-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: ...
        version: None
    num_threads: 16

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: ...
        version: 0.3.21
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 16

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: ...
        version: None
    num_threads: 16
@ravwojdyla ravwojdyla added Bug Needs Triage Issue requires triage labels Jan 4, 2023
@ravwojdyla ravwojdyla changed the title config_context set transform_output not preserved in the Transformer object? transform_output set in config_context not preserved in the Transformer object? Jan 4, 2023
@betatim
Member
betatim commented Jan 4, 2023

I agree that this can be a bit confusing; however, I don't think this is a bug. A context manager is the tool to use if you want to set up some state of the world that applies only within that context. A different way of thinking of the config context is:

old_foo = sklearn.get_config()['foo']
sklearn.set_config(foo='bar')
# do something while foo is configured to 'bar'
...
sklearn.set_config(foo=old_foo)

If you want the configuration to persist beyond the context, Estimator.set_output() is the way to do it.


Context managers are like Las Vegas: what happens in Vegas, stays in Vegas ;)

@betatim betatim removed the Bug label Jan 4, 2023
@betatim
Member
betatim commented Jan 4, 2023

Maybe something to do, though, is to improve the documentation to explain a bit more about the interplay of a config context and set_output(). I think the output case is one of the few (the only?) cases where you can set something both via the config and via a method of an estimator.

@ravwojdyla
Contributor Author

@betatim thanks for the prompt response, much appreciated :) For some more context on:

If you want the configuration to persist beyond the context, Estimator.set_output() is the way to do it.

In our use case we have a number of separate Pipelines that we construct in one place and use in another, and we want all of them to use pandas output. My first intuition was to construct those Pipeline objects within the options context manager, expecting the setting to be preserved in the pipeline objects (hence this issue); I quite liked that syntax, btw.

Instead we could:

  • call set_output on all of them separately (maybe via some "smart" helper that effectively just calls set_output)
  • have a PandasPipeline class that inherits from Pipeline and sets output to pandas

#25288 implements the original idea, but I'm happy to adjust it to update the documentation.

@glemaitre
Member
glemaitre commented Jan 4, 2023

The behaviour is the expected one when using a context manager:

In [5]: with open("xxx.file", "r") as f:
   ...:     f.read()
   ...: print(f.closed)
True

Here, the context manager is in charge of closing the file. Basically, you cannot expect to do anything with f once you are out of the context manager. If you want the handle to outlive the block, you would store it with f = open(...) and manage the state of f yourself.

In the problem above, you would use set_output on the instance, or set_config to set the output globally, to get the expected results.

So I will close #25288 but I am happy to see an improved documentation.

@ravwojdyla
Contributor Author

Sounds good, thanks for the quick feedback. Pushed an update to the documentation at #25289, PTAL.
