A `pip resolve` command to convert to transitive == requirements very fast by scanning wheels for static dependency info (WORKING PROTOTYPE!) · Issue #7819 · pypa/pip · GitHub

@cosmicexplorer

Description

Please let me know if it would be more convenient to provide this issue in another form, such as a Google Doc!

What's the problem this feature will solve?

At Twitter, we are trying to enable the creation of self-bootstrapping "ipex" files, executable zip files of Python code which can resolve 3rdparty requirements when first run. This approach greatly reduces the time to build, upload, and deploy compared to a typical PEX file, which contains all of its dependencies in a single monolithic zip archive created at pex build time. The implementation of "ipex" in pantsbuild/pants#8793 (more background at that link) will invoke pex at runtime, which will itself invoke a pip subprocess (since pex version 2) to resolve these 3rdparty dependencies. #7729 is a separate performance fix to enable this runtime resolve approach.

Because ipex files do not contain their 3rdparty requirements at build time, it's not necessary to run the entirety of pip download or pip install. Instead, in pantsbuild/pants#8793, pants will take all of the requirements provided by the user (which may include requirements with inequalities, or no version constraints at all), then convert to a list of transitive == requirements. This ensures that the ipex file will resolve the same requirements at build time and run time, even if the index changes in between.

Describe the solution you'd like

A pip resolve command with similar syntax to pip download, which instead writes a list of == requirement strings, each with a single download URL, to stdout, corresponding to the transitive dependencies of the input requirements. These download URLs correspond to every file that would have been downloaded by pip download.

pants would be able to invoke pip resolve as a distinct phase of generating an ipex file. pex would likely not be needed as an intermediary for this resolve command -- we could just execute pip resolve directly as a subprocess from within pants. The pants v2 engine makes process executions individually cacheable, and transparently executable on a remote cluster via the Bazel Remote Execution API, so pants users would then be able to generate these "dehydrated" ipex files at extremely low latency if the pip resolve command can be made performant enough.

Alternative Solutions / Prototype Implementation

As described above, pantsbuild/pants#8793 is able to create ipex files already, by simply using pip download via pex to extract the transitive == requirements. The utility of a separate pip resolve command, if any, would lie in whether it can achieve the same end goal of extracting transitive == requirements, but with significantly greater performance.

In a pip branch, I have implemented a prototype pip resolve command which achieves an immediate ~2x speedup over pip download on the first run, then almost immediately levels out to 800ms on every subsequent run.

This performance is achieved with two techniques:

  1. Extracting the contents of the METADATA file from a url for a wheel without actually downloading the wheel at all.
  • _hacky_extract_sub_reqs() (see https://github.com/cosmicexplorer/pip/blob/a60a3977e929cfaed6d64b0c9e3713d7c502e51e/src/pip/_internal/resolution/legacy/resolver.py#L550-L552) will:
    a. send a HEAD request to get the length of the zip file
    b. perform several successive GET requests to extract the relative location of the METADATA file
    c. extract the DEFLATE-compressed METADATA file and INFLATE it
    d. parse all Requires-Dist lines in METADATA for requirement strings
  • This is surprisingly reliable, and extremely fast! This makes pip resolve tensorflow==1.14 take 15 seconds, compared to 24 seconds for pip download tensorflow==1.14.
  • A URL to a non-wheel file is processed the normal way -- by downloading the file, then preparing it into a dist.
  2. Caching the result of each self._resolve_one() call in a persistent JSON file.
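
The first technique can be illustrated with a minimal sketch. This is not the prototype's `_hacky_extract_sub_reqs()` code: instead of locating METADATA by hand, it swaps in Python's standard `zipfile` module reading from a lazy file object, so only the byte ranges that `zipfile` asks for are fetched. `RangeFetcher`, `RangeBackedFile`, `http_range_fetcher`, and `requires_dist` are hypothetical names, and the sketch assumes that the server honors HTTP `Range` requests and that the wheel size is already known (for example, from the `Content-Length` of a HEAD response).

```python
import io
import urllib.request
import zipfile
from email.parser import Parser
from typing import Callable, List

# Hypothetical type: fetch(start, end) returns bytes start..end, inclusive.
RangeFetcher = Callable[[int, int], bytes]


class RangeBackedFile(io.RawIOBase):
    """Read-only, seekable file whose bytes are fetched lazily by range."""

    def __init__(self, size: int, fetch: RangeFetcher) -> None:
        super().__init__()
        self._size = size
        self._fetch = fetch
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def readinto(self, buffer) -> int:
        end = min(self._pos + len(buffer), self._size)
        if end <= self._pos:
            return 0  # at or past the end of the file
        data = self._fetch(self._pos, end - 1)
        buffer[: len(data)] = data
        self._pos += len(data)
        return len(data)


def http_range_fetcher(url: str) -> RangeFetcher:
    """Build a fetcher that issues one ranged GET per call (assumes server support)."""

    def fetch(start: int, end: int) -> bytes:
        request = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(request) as response:
            if response.status != 206:  # server ignored the Range header
                raise OSError("server did not return partial content")
            return response.read()

    return fetch


def requires_dist(size: int, fetch: RangeFetcher) -> List[str]:
    """Return the Requires-Dist entries from a wheel's METADATA file."""
    with zipfile.ZipFile(RangeBackedFile(size, fetch)) as wheel:
        name = next(n for n in wheel.namelist() if n.endswith(".dist-info/METADATA"))
        text = wheel.read(name).decode("utf-8")
    # METADATA uses RFC 822-style headers, so the email parser can read it.
    return Parser().parsestr(text).get_all("Requires-Dist") or []
```

Against a real index, `fetch` would be `http_range_fetcher(url)`. Because any callable returning the requested inclusive byte range works, the same logic can be exercised against an in-memory wheel without touching the network.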

Additional context

This pip resolve command as described above (with the resolve cache) would possibly be able to resolve this long-standing TODO about separating dependency resolution from preparation, without requiring any separate infrastructure changes on PyPI's part:

Once PyPI has static dependency metadata available, it would be
possible to move the preparation to become a step separated from
dependency resolution.
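
The resolve cache mentioned above could look roughly like the following sketch. It is an assumption-laden illustration, not the prototype's code: `JsonResolveCache` and `get_or_compute` are hypothetical names, the cache key is assumed to be a pinned requirement string, and the cached value is assumed to be the list of its direct dependency strings.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


class JsonResolveCache:
    """Persist a mapping of pinned requirement -> direct dependencies as JSON."""

    def __init__(self, path: str) -> None:
        self._path = path
        try:
            with open(path, encoding="utf-8") as f:
                self._entries: Dict[str, List[str]] = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            # A missing or corrupt cache file is treated as an empty cache.
            self._entries = {}

    def get_or_compute(self, key: str, compute: Callable[[str], List[str]]) -> List[str]:
        """Return the cached value for key, computing and saving it on a miss."""
        if key not in self._entries:
            self._entries[key] = compute(key)
            self._save()
        return self._entries[key]

    def _save(self) -> None:
        # Write to a temporary file, then atomically replace the cache file,
        # so an interrupted run cannot leave a half-written JSON document.
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self._entries, f, indent=2, sort_keys=True)
        os.replace(tmp_path, self._path)
```

A second process that opens the same path sees every entry stored by the first, so `compute` (standing in for the expensive resolve step) runs at most once per key across runs.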

I have only discussed the single "ipex" motivating use case here, but I am filing this issue because I believe a pip resolve command would be generally useful to all pip users. I didn't implement it in the prototype above, but once the pip resolve command stabilizes and any inconsistencies between it and pip download are worked out, it would likely be possible to make pip download consume the output of pip resolve directly. That would allow removal of the if self.quickly_parse_sub_requirements conditionals added to resolver.py, and would probably also improve pip download performance, since every wheel file could be downloaded in parallel once pip resolve has resolved URLs for them!
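
The parallel download idea can be sketched as follows, assuming the URLs have already been resolved. This is only an illustration of the approach, not anything pip does today: `download_all` and `download_one` are hypothetical names, and `download_one` stands in for whatever function fetches a single wheel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def download_all(
    urls: Iterable[str],
    download_one: Callable[[str], bytes],
    max_workers: int = 8,
) -> Dict[str, bytes]:
    """Fetch every resolved URL concurrently and return url -> content."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(url_list, pool.map(download_one, url_list)))
```

Because resolution has already finished, no download has to wait on another one to discover further dependencies, which is what makes a simple thread pool sufficient here.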

For that reason, I think a pip resolve command which can quickly resolve URLs for requirements before downloading them is likely to be a useful feature for all pip users.

I am extremely open to designing/implementing whatever changes pip contributors might desire in order for this change to go in, and I would also fully understand if this use case is something pip isn't able to support right now.

Labels

C: cli, C: download, state: needs discussion, type: enhancement, type: feature request
