8000 Collect model fields when rebuilding a model by Viicos · Pull Request #11388 · pydantic/pydantic · GitHub
[go: up one dir, main page]

Skip to content

Collect model fields when rebuilding a model #11388

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Feb 12, 2025
Merged

Collect model fields when rebuilding a model #11388

merged 10 commits into from
Feb 12, 2025

Conversation

Viicos
Copy link
Member
@Viicos Viicos commented Feb 4, 2025

Fixes #11390, fixes #10695, works towards #9969 (we don't get the KeyError anymore, but we still get a clean schema error because of a missing definition).

Change Summary

#5305 made it so that unknown forward annotations would not stop field collection. The PR is huge and doesn't explain why this was done, but this introduced subtle bugs and code smells that this PR tries removing.

The first issue arising from this is that the model fields can be inaccurate when a model is not fully built yet:

from typing import Annotated

from pydantic import BaseModel

class Model(BaseModel):
    a: 'Annotated[Forward, Field(alias="b")]'

Model.model_fields['a'].alias
#> None, should be 'b'

and I won't be surprised if some users currently make use of model_fields on an incomplete model (especially as we implicitly rebuild on instance creation — meaning most users don't explicitly call model_rebuild()). A relatively big concern I also have is FastAPI relying on FieldInfo instances to validate data. They do manual stuff before calling the Pydantic validate methods, such as checking for FieldInfo.is_required(), alias.

When a model is being rebuilt, we reevaluate the annotation and recreate a FieldInfo instance during the core schema gen process (and we used to even to this in a less performant way; see #10769):

if not field_info.evaluated:
# TODO Can we use field_info.apply_typevars_map here?
try:
evaluated_type = _typing_extra.eval_type(field_info.annotation, *self._types_namespace)
except NameError as e:
raise PydanticUndefinedAnnotation.from_name_error(e) from e
evaluated_type = replace_types(evaluated_type, self._typevars_map)
field_info.evaluated = True
if not has_instance_in_type(evaluated_type, PydanticRecursiveRef):
new_field_info = FieldInfo.from_annotation(evaluated_type)
field_info.annotation = new_field_info.annotation
# Handle any field info attributes that may have been obtained from now-resolved annotations
for k, v in new_field_info._attributes_set.items():
# If an attribute is already set, it means it was set by assigning to a call to Field (or just a
# default value), and that should take the highest priority. So don't overwrite existing attributes.
# We skip over "attributes" that are present in the metadata_lookup dict because these won't
# actually end up as attributes of the `FieldInfo` instance.
if k not in field_info._attributes_set and k not in field_info.metadata_lookup:
setattr(field_info, k, v)

Doing so introduced a code smell where we track explicitly set attributes using _attributes_set. This is known to be a source of bugs as we need to keep track of such explicitly set attrs when you alter the FieldInfo instance:

  • We forgot to update it when using ... as the default value in some cases. See this added test case:
    def test_ellipsis_forward_ref_annotated() -> None:
    """This previously resulted in the ellipsis being used as a default value."""
    class Model(BaseModel):
    f: 'Forward'
    Forward = Annotated[int, Field(...)]
    assert Model.model_fields['f'].default is PydanticUndefined
  • We currently forget to update deprecated (first bug of Deprecated fields bugs with forward annotations #11390).

And more generally:

  • On _generate_schema.py#L1311, we directly mutate the original field info instance to take the newly evaluated annotation. This is the reason we get the KeyError on generics (KeyError when inhering from a self-referencing generic model #9969).
  • We make use of the model fields (even though they might be incorrect) inside ModelMetaclass.__new__ , at model creation (i.e. stuff that isn't reprocessed on model_rebuild()), such as setting deprecated descriptors (second bug of #11390: we set these descriptors based on FieldInfo.deprecated).

Instead, this PR does three things:

  • Add a new __pydantic_fields_complete__ class property on models (for private use). It is True if at least one field annotation failed to evaluate during collect_model_fields(). On a related note: we still set __pydantic_fields__ for backwards compatibility, but a follow up PR can now easily add a user/deprecation warning when accessing model_fields if __pydantic_fields_complete__ == False. In V3, we could remove this backward compatibility logic (TBD).
    Also note that now we don't enter GenerateSchema if the fields collection failed, as it is guaranteed to fail anyway (if an annotation failed to evaluate during fields collection, it will also immediately fail to be evaluated in GenerateSchema).
  • Calling model_rebuild() will redo the fields collection if it wasn't successful the first time. Note that this aligns with Pydantic dataclasses.
  • When an incomplete Pydantic model is encountered as an annotation (inside GenerateSchema), we first call rebuild_model_fields(); a new function that iterates over the existing (but potentially not complete) field infos, and recreate instances if the annotation needs to be evaluated again. This way, we avoid generating a schema for each field just to realize that one field still has a forward annotation:
    class Sub(BaseModel):
        f1: int
        ...  # Many fields...
        f100: 'Forward'
    
    class Model(BaseModel):
        sub: Sub
        # Note that Sub is incomplete, so in the process of generating the schema for
        # `Model`, we don't use the cached schema attribute
        # Thus previously, we would generate a schema for f1, f2, ..., f99. f100 would fail,
        # meaning we generated 99 fields core schemas for nothing.
        # In this PR, we first (re)collect fields for `Sub`, and immediately raise if fields collection did not suceed.
    On important thing to acknowledge: when encountering sub: Sub, we don't call Sub.model_rebuild(). Doing so might be the right thing to do, but is unfortunately not backwards compatible. The reason is that the NS Resolver used for Model might contain annotations (e.g. provided through the parent_namespace) that should be available for Sub. This means that even if sub: Sub successfully builds, Sub isn't complete yet — each model should be rebuilt independently, through model_rebuild()).

TODO:

  • Performance issues: I believe this is a tradeoff to have, as the couple affected benchmarks are models with one field defined, and the new rebuild_model_fields() might add a bit of overhead. However, it will be beneficial if many fields are defined (see the example with f1..100 above and the test_failed_rebuild benchmark).

Related issue number

Checklist

  • The pull request title is a good summary of the changes - it will be used in the changelog
  • Unit tests for the changes exist
  • Tests pass on CI
  • Documentation reflects the changes where applicable
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

This is the biggest change in this PR, and the hardest to
reason about and the trickiest to implement.

Instead of re-evaluating annotations on a field-by-field basis,
we assume all fields have their annotation evaluated (and
consequently the `FieldInfo` attributes -- some of them
set depending on the annotation -- are correct). To do so,
we collect model fields (if necessary, thanks to the new
`__pydantic_fields_complete__` attribute) in `_model_schema()`,
and then have `_common_field_schema()` called with a fully built
`FieldInfo` instance (this mimics what is done for dataclasses).

When `model_rebuild()` is called, we also want to recollect fields
if it wasn't successful the first time. This introduces challenges,
such as keeping the model assignments to be able to recollect fields
and call `FieldInfo.from_annotated_attribute()`.
@github-actions github-actions bot added the relnotes-fix Used for bugfixes. label Feb 4, 2025
Copy link
codspeed-hq bot commented Feb 4, 2025

CodSpeed Performance Report

Merging #11388 will degrade performances by 15.94%

Comparing model-fields-cleanup (de1adec) with main (53f8ece)

Summary

⚡ 2 improvements
❌ 3 regressions
✅ 41 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
test_schema_build 2.5 ms 2.8 ms -11.53%
test_construct_dataclass_schema 1.2 ms 1.1 ms +8.6%
test_failed_rebuild 3,851.1 µs 211.9 µs ×18
test_deeply_nested_recursive_model_schema_generation 1.1 ms 1.3 ms -15.94%
test_recursive_discriminated_union_with_base_model 1.4 ms 1.5 ms -8.22%

@Viicos Viicos force-pushed the model-fields-cleanup branch from 9f735a6 to 131fa9f Compare February 4, 2025 17:04
Copy link
cloudflare-workers-and-pages bot commented Feb 4, 2025

Deploying pydantic-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: de1adec
Status: ✅  Deploy successful!
Preview URL: https://203f8a46.pydantic-docs.pages.dev
Branch Preview URL: https://model-fields-cleanup.pydantic-docs.pages.dev

View logs

The `has_instance_in_type` function is no longer used.

We don't need to track attributes explicitly set on `FieldInfo`
(this was a code smell and I'm glad it's gone now).
@Viicos Viicos added the third-party-tests Add this label on a PR to trigger 3rd party tests label Feb 4, 2025
@Viicos Viicos closed this Feb 4, 2025
@Viicos Viicos reopened this Feb 4, 2025
@Viicos Viicos force-pushed the model-fields-cleanup branch from bba626b to a463e08 Compare February 4, 2025 19:58
…s()`

They need to be performed once we know fields are complete.
@Viicos Viicos force-pushed the model-fields-cleanup branch from 0e3cd84 to 10a643a Compare February 6, 2025 13:54
@Viicos 8000 Viicos force-pushed the model-fields-cleanup branch 4 times, most recently from 74e241e to 783c3be Compare February 7, 2025 10:53
@Viicos Viicos force-pushed the model-fields-cleanup branch from 783c3be to 1743828 Compare February 7, 2025 16:32
@sydney-runkle
Copy link
Contributor

Add a new __pydantic_fields_complete__ class property on models

Great, this seems like a good step.

Calling model_rebuild() will redo the fields collection if it wasn't successful the first time. Note that this aligns with Pydantic dataclasses.

Seems good to be consistent here. We should comb through model_rebuild issues, this might close some of those.

This way, we avoid generating a schema for each field just to realize that one field still has a forward annotation:

I really like this pattern as well. Not only more performant, but also makes more sense logically.

Performance issues: I believe this is a tradeoff to have, as the couple affected benchmarks are models with one field defined, and the new rebuild_model_fields() might add a bit of overhead. However, it will be beneficial if many fields are defined (see the example with f1..100 above and the test_failed_rebuild benchmark).

Sure, willing to move forward here, these regressions aren't super concerning. As I review, I'll see if I can suggest any changes that might lead to slight reductions in the regressions.

Copy link
Contributor
@sydney-runkle sydney-runkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the in depth PR description.

Code changes generally look good. I'd like to learn more about:
a) memory usage consequences
b) where more time is now being spent - the codspeed benchmarks aren't super clear on what's slowing things down

I understand that the slower benchmarks are for pretty specific use cases, but want to make sure we understand perf consequences before merging.

Comment on lines +241 to +245
# The `from_annotated_attribute()` call below mutates the assigned `Field()`, so make a copy:
original_assignment = (
copy(assigned_value) if not evaluated and isinstance(assigned_value, FieldInfo_) else assigned_value
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any metrics on perf impact of this copy? Seems like a pretty special case...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only relevant if an annotation fails to evaluate and has a Field() assigned:

a: Undefined = Field()

So not that common, + this copy could be removed if we manage to refactor how we merge field infos together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's definitely worth investigating refactoring how we merge field infos :).

"""
FieldInfo_ = import_cached_field_info()

rebuilt_fields: dict[str, FieldInfo] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understandable, but downside - this is going to increate memory usage. Could you benchmark this with memray and see what the difference is like?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only going to make temporary memory allocations, as:

  • if this is called from model_rebuild(), it will override the existing __pydantic_fields__, so the old ones will be gc'ed.
  • if this is called in GenerateSchema._model_schema(), as part of another model build, they are going to be gc'ed at the end of the method.

@Viicos Viicos force-pushed the model-fields-cleanup branch from 5fb8716 to 29f7702 Compare February 10, 2025 21:21
@Viicos Viicos marked this pull request as ready for review February 12, 2025 14:05
@Viicos Viicos force-pushed the model-fields-cleanup branch from 83bdb4c to 76d535d Compare February 12, 2025 14:06
Copy link
Contributor
github-actions bot commented Feb 12, 2025

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  pydantic
  fields.py
  main.py
  pydantic/_internal
  _fields.py
  _generate_schema.py
  _model_construction.py
Project Total  

This report was generated by python-coverage-comment-action

Comment on lines +229 to +233
if config_wrapper.defer_build:
# TODO we can also stop there if `__pydantic_fields_complete__` is False.
# However, `set_model_fields()` is currently lenient and we don't have access to the `NameError`.
# (which is useful as we can provide the name in the error message: `set_model_mock(cls, e.name)`)
set_model_mocks(cls)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we link an issue to this comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #11453.

# Private attributes, used to rebuild FieldInfo instances:
self._complete = True
self._original_annotation: Any = PydanticUndefined
self._original_assignment: Any = PydanticUndefined
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be _original_default? Or is that confusing for the Field() case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's confusing for Field() assignments

Copy link
Contributor
@sydney-runkle sydney-runkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here - this takes us in a good direction in terms of refactoring field collection and merging, which have the potential for some modest perf benefits and minimal breaking changes.

@sydney-runkle
Copy link
Contributor

Note - failing 3rd party tests are unrelated and should be fixed by #11429

@Viicos Viicos force-pushed the model-fields-cleanup branch from 76d535d to de1adec Compare February 12, 2025 16:28
@Viicos Viicos merged commit 72c77bf into main Feb 12, 2025
76 of 77 checks passed
@Viicos Viicos deleted the model-fields-cleanup branch February 12, 2025 16:41
Viicos added a commit that referenced this pull request Apr 15, 2025
This results in parameterized classes being different type objects,
although they represent the same type.

A new refactor (#11388)
was introduced since then and fixes the root issue.
@Viicos Viicos mentioned this pull request Apr 15, 2025
5 tasks
Viicos added a commit that referenced this pull request Jun 4, 2025
This applying roughly the same logic from #11388
for dataclasses.
Viicos added a commit that referenced this pull request Jun 4, 2025
This applying roughly the same logic from #11388
for dataclasses.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes-fix Used for bugfixes. third-party-tests Add this label on a PR to trigger 3rd party tests
Projects
None yet
2 participants
0