Fixed #36118 -- Accounted for multiple primary keys in bulk_update max_batch_size. by sarahboyce · Pull Request #19088 · django/django · GitHub

Fixed #36118 -- Accounted for multiple primary keys in bulk_update max_batch_size. #19088


Merged
merged 1 commit into django:main from the ticket-36118 branch on Jan 29, 2025

Conversation

Contributor
@sarahboyce sarahboyce commented Jan 22, 2025

Trac ticket number

ticket-36118

Branch description

I've been struggling to prove that the "pk" additions are required at all (the logic was added in 9cbdb44).

The tests were to help prove a point, but given we have queries.test_bulk_update.BulkUpdateTests.test_large_batch, a test might not be required.

@sarahboyce

This comment was marked as outdated.

        self._for_write = True
        connection = connections[self.db]
-        max_batch_size = connection.ops.bulk_batch_size(["pk", "pk"] + fields, objs)
+        max_batch_size = connection.ops.bulk_batch_size(fields, objs)
Member

Something is wrong in the bulk_batch_size arithmetic if this passes on SQLite 🤔

The query that will be generated here is of the form

UPDATE
SET f0 = (
    CASE WHEN (pk0 = pk0_v0 ... AND ... pkn = pkn_v0) THEN f0_v0
    ...
    CASE WHEN (pk0 = pk0_vm ... AND ... pkn = pkn_vm) THEN f0_vm
)
...
SET fk = (
    CASE WHEN (pk0 = pk0_v0 ... AND ... pkn = pkn_v0) THEN fk_v0
    ...
    CASE WHEN (pk0 = pk0_vm ... AND ... pkn = pkn_vm) THEN fk_vm
)
WHERE (pk0 = pk0_v0 ... AND ... pkn = pkn_v0) ... OR (pk0 = pk0_vm ... AND ... pkn = pkn_vm)

For k = len(fields), m = len(objs), and n = len(pk_fields).

So the number of required placeholders will be

  • len(objs) * len(pk_fields) * len(fields): for each field of each object, the primary key fields must be specified in the CASE/WHEN criteria.
  • len(objs) * len(fields): each field needs to be assigned a different value for each object.
  • len(objs) * len(pk_fields): for the WHERE clause.

So that makes len(objs) * ((len(pk_fields) * len(fields)) + len(fields) + len(pk_fields)) placeholders per query.
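
A minimal sketch of that arithmetic (illustrative only, not part of the patch; the helper name is made up):

def bulk_update_placeholders(n_objs, n_fields, n_pk_fields):
    """Rough placeholder count for the UPDATE ... CASE WHEN form above."""
    case_when = n_objs * n_pk_fields * n_fields  # pk comparisons in the CASE/WHEN criteria
    assignments = n_objs * n_fields  # one THEN value per field per object
    where = n_objs * n_pk_fields  # pk comparisons in the WHERE clause
    return case_when + assignments + where

# e.g. 500 objects, 1 updated field, 3 primary key fields
print(bulk_update_placeholders(500, 1, 3))  # 3500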

Does SQLite have different rules for different placeholders location?

Contributor Author

Does SQLite have different rules for different placeholders location?

It feels like it might, as even when I was very aggressive and added 10 pks in the composite, it still passed without a complaint 🤔
I can try to read the SQLite docs tomorrow 👍

Member
@charettes charettes Jan 22, 2025

So from the test below that updates text for 1000 items, I would expect max_batch_size to return 500 as len(fields) == 1, but if we plug len(objs) = 500 into the above formula for the Basic model:

500 * ((3 * 1) + 1 + 3) = 3500

Which is way greater than 999 🤔

Maybe the version of SQLite we have compiled on CI has a greater limit than 999?

Contributor Author

YES! ⭐
https://www.sqlite.org/limits.html#max_variable_number

To prevent excessive memory allocations, the maximum value of a host parameter number is SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.
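
A quick way to see which default applies for the interpreter's bundled SQLite (a sketch; compile-time overrides of SQLITE_MAX_VARIABLE_NUMBER, as shipped by some distros, are not visible from the version alone):

import sqlite3

# 999 before 3.32.0, 32766 from 3.32.0 onwards, per the limits documentation.
default_limit = 32766 if sqlite3.sqlite_version_info >= (3, 32, 0) else 999
print(sqlite3.sqlite_version, default_limit)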

Member

Nice, that explains it! So we can likely bump max_query_params for SQLite depending on the current version, but we'd likely still want to test against SQLite < 3.32 as our minimum_database_version is (3, 31) right now.

Wonder if we should have something like ticket-35412 to consider dropping < 3.32 support.

Contributor Author

Wonder if we should have something like ticket-35412 to consider dropping < 3.32 support.

If I understand it correctly, it was Ubuntu Focal which kept the support with 3.31. Focal ends standard support in April, and I think Ubuntu Jammy has 3.37.2. So we might be able to bump to 3.37 for 6.0 which would get rid of a couple of flags (cc @felixxm)

Member

Let me double check and I can create a similar ticket later today.

Contributor
@laymonage laymonage Jan 23, 2025

@laymonage I know you were interested in django/django-docker-box#11, I don't suppose you already had a chance to take a look and can help us test against SQLite 3.31? 🙏

Sorry, was a bit busy with Wagtail's feature freeze last week. I'll try to get on it this weekend!

On a related note, this is precisely why I suggested having a CI job that tests against an SQLite version that's independent from the distro in #18899. I explained it in quite some detail in #18899 (comment), including the issue with SQLITE_MAX_VARIABLE_NUMBER being customized by Debian/Ubuntu, even prior to 3.32.

Member

If I understand it correctly, it was Ubuntu Focal which kept the support with 3.31. Focal ends standard support in April, and I think Ubuntu Jammy has 3.37.2. So we might be able to bump to 3.37 for 6.0 which would get rid of a couple of flags (cc @felixxm)

It seems that we could bump to 3.34 (but it doesn't give us anything, so not worth doing). SQLite 3.34 was released in December 2020 (SQLite version support seems like a similar situation as the GEOS libraries, which we generally support for about 5 years after release). Moreover, both Ubuntu Focal and Debian Bullseye are still in LTS support.

Contributor
@laymonage laymonage Jan 29, 2025

Compiling SQLite 3.31.0 without overriding SQLITE_MAX_VARIABLE_NUMBER would cause some tests to fail. Setting the flag to 32766 makes the tests pass, so there is some benefit to bumping the minimum to 3.34. Whether that's worth it – I don't know. It's also likely to be a bug or potential improvement in Django itself, and setting the flag/bumping the version would just be ignoring the problem.

See django/django-docker-box#50 (comment) for more details.

@sarahboyce sarahboyce changed the title Fixed #36118 -- Removed accounting for "pk" fields in bulk_update max_batch_size. Fixed #36118 -- Accounted for multiple primary keys in bulk_update max_batch_size calculation. Jan 23, 2025
@@ -175,3 +175,25 @@ def test_cant_update_to_unsaved_object(self):

        with self.assertRaisesMessage(ValueError, msg):
            Comment.objects.update(user=User())

    def test_bulk_update_max_bulk_batch_size(self):
Contributor Author
@sarahboyce sarahboyce Jan 23, 2025

This acts as a regression test for Oracle, which would fail with:

django.db.utils.DatabaseError: ORA-00907: missing right parenthesis

before the pk field changes.

@sarahboyce

This comment was marked as outdated.

@sarahboyce sarahboyce force-pushed the ticket-36118 branch 2 times, most recently from 1464da3 to 85d9b6f on January 23, 2025 10:31
@sarahboyce

This comment was marked as outdated.

Comment on lines 139 to 144
        # SQLITE_MAX_VARIABLE_NUMBER defaults to 999 for SQLite versions prior
        # to 3.32.0 or 32766 for SQLite versions after 3.32.0. See
        # https://www.sqlite.org/limits.html#max_variable_number
        if Database.sqlite_version_info >= (3, 32):
            return 32766
        return 999
Contributor
@laymonage laymonage Jan 23, 2025

As noted in the discussions and in the linked documentation, this value can vary depending on the actual flags used when compiling SQLite.

I found that the following query can be used to get the real limit value, but I'm not sure how we feel about doing such queries. It is documented though: https://www.sqlite.org/pragma.html#pragma_compile_options

        with self.connection.cursor() as cursor:
            try:
                with transaction.atomic(self.connection.alias):
                    result = cursor.execute(
                         "SELECT compile_options FROM pragma_compile_options "
                         "WHERE compile_options LIKE 'MAX_VARIABLE_NUMBER=%'"
                    ).fetchone()
                    return int(result[0].split('=')[-1])
            except (OperationalError, TypeError):
                # If it fails for some reason, use the defaults as per
                # https://www.sqlite.org/limits.html#max_variable_number
                if Database.sqlite_version_info >= (3, 32):
                    return 32766
                return 999

@@ -70,7 +70,7 @@ class DatabaseFeatures(BaseDatabaseFeatures):
     supports_over_clause = True
     supports_frame_range_fixed_distance = True
     supports_ignore_conflicts = False
-    max_query_params = 2**16 - 1
+    max_query_params = 7999
Contributor Author

This is my current theory: there is a maximum query size of 8000 bytes, and hence this value shouldn't exceed it.
https://docs.oracle.com/cd/E19851-01/819-7557/gfqsr/index.html

Member

These docs are ancient, I assure you that it's not the actual limit.

Member
@charettes charettes Jan 24, 2025

7999 is also the number of bytes that the sum of query parameter values must not exceed, not the number of query parameters itself.

In other words, assuming these docs were still valid, if you have 2 fields per object which are each long values documented to take 8 bytes each (e.g. originating from an IntegerField), you could only have 8000 / 8 = 1000 query parameters.

I'm not sure how the 2**16 - 1 value was picked, but 7999 is definitely not the maximum number of query params.

Member

See docs: "number of formal parameters in an explicit cursor, function, or procedure".

Contributor Author
@sarahboyce sarahboyce Jan 24, 2025

In case it's not obvious, I don't really know what I'm doing and feel quite out of my depth.
I initially thought this would be a "simple" change updating connection.ops.bulk_batch_size(["pk", "pk"] + fields, objs) to connection.ops.bulk_batch_size((pk_fields * 2) + fields, objs).

But it looks like when using composite primary keys, the complexity of the query increases to the point that we are in danger of hitting multiple different database limits.

From trial and error, I could increase the value here to roughly 17000. This doesn't mean much because it assumes the current maths is correct.
However, it is close to half of 32768, and so I could do:

--- a/django/db/backends/oracle/features.py
+++ b/django/db/backends/oracle/features.py
@@ -70,7 +70,7 @@ class DatabaseFeatures(BaseDatabaseFeatures):
     supports_over_clause = True
     supports_frame_range_fixed_distance = True
     supports_ignore_conflicts = False
-    max_query_params = 17000
+    max_query_params = 32768
     supports_partial_indexes = False
diff --git a/django/db/backends/oracle/operations.py b/django/db/backends/oracle/operations.py
index aa67f28a79..a62a8e7b62 100644
--- a/django/db/backends/oracle/operations.py
+++ b/django/db/backends/oracle/operations.py
@@ -700,7 +700,7 @@ END;
     def bulk_batch_size(self, fields, objs):
         """Oracle restricts the number of parameters in a query."""
         if fields:
-            return self.connection.features.max_query_params // len(fields)
+            return self.connection.features.max_query_params // len(fields * 2)
         return len(objs)

And the test passes.

But it's smoke and mirrors, I don't have any strong logic to back the changes.

One option could be to add a docs warning that when using bulk_update on a model with a composite primary key, users should define a batch_size due to the query complexity.
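
As a sketch of that workaround (hypothetical model, field names, and batch size, purely for illustration):

from django.db import models


class Release(models.Model):
    # Hypothetical model with a composite primary key.
    version = models.IntegerField()
    build = models.IntegerField()
    status = models.CharField(max_length=20)
    pk = models.CompositePrimaryKey("version", "build")


# An explicit batch_size keeps each UPDATE's parameter count small regardless
# of how many columns make up the composite primary key.
releases = list(Release.objects.all())
for release in releases:
    release.status = "published"
Release.objects.bulk_update(releases, ["status"], batch_size=100)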

Member

Thanks for spending time investigating here @sarahboyce. Dealing with different backend versions and specialized configuration option and trying to make sense of it is definitely painful.

In order to make progress here I think we need to state a few things.

This ticket was created with an intent of auditing the current state of things and I think we reached that point here in the sense that we now know that the bulk_batch_size approach is broken in a few ways:

  1. The max_query_params value used for SQLite is going to be overly protective in most cases.
  2. There are likely other factors at play (different settings) beyond the maximum number of query parameters (e.g. possibly their respective value in bytes) that must be considered to have a completely bulletproof solution.
  3. Some adjustments to bulk_batch_size calls should still be made to ensure the situation is not worsened by the usage of composite primary keys.

Now, where does that leave us?

I think that 1. can be considered a distinct optimization ticket to reduce the number of queries made during bulk operation.

In the case of 2. I also think that this could be explored as a distinct ticket, likely designing a better API than bulk_batch_size to holistically consider factors that would require reducing the number of parameters.

This leaves us with the core of this issue: the adjustments required to make sure composite primary key usage in bulk operations is not more likely to run into problems. As you may know, all backends but SQLite support native tuple comparison (that is, they are able to do (f0, ..., fn) = %s instead of f0 = %s ... AND ... fn = %s), which means they should not have to be adjusted at all to accommodate composite primary keys (as they'll use the same number of query parameters, which is the only factor bulk_batch_size currently considers).

Assuming all the above makes sense, I think we could focus this PR on the simple linear increase (a constant times the number of additional fields in the primary key) of query parameters on SQLite. A potential solution to this problem could be to systematically pass Field instances to bulk_batch_size (instead of passing str | Field) and let backends unpack them if required. Something like the following, for example:

diff --git a/django/db/backends/sqlite3/operations.py b/django/db/backends/sqlite3/operations.py
index 0078cc077a..f0ddbc66fa 100644
--- a/django/db/backends/sqlite3/operations.py
+++ b/django/db/backends/sqlite3/operations.py
@@ -8,6 +8,7 @@
 from django.core.exceptions import FieldError
 from django.db import DatabaseError, NotSupportedError, models
 from django.db.backends.base.operations import BaseDatabaseOperations
+from django.db.models import CompositePrimaryKey
 from django.db.models.constants import OnConflict
 from django.db.models.expressions import Col
 from django.utils import timezone
@@ -36,6 +37,12 @@ def bulk_batch_size(self, fields, objs):
         If there's only a single field to insert, the limit is 500
         (SQLITE_MAX_COMPOUND_SELECT).
         """
+        fields = list(
+            chain.from_iterable(
+                field.fields if isinstance(field, CompositePrimaryKey) else [field]
+                for field in fields
+            )
+        )
         if len(fields) == 1:
             return 500
         elif len(fields) > 1:
diff --git a/django/db/models/deletion.py b/django/db/models/deletion.py
index fd3d290a96..da2e934c96 100644
--- a/django/db/models/deletion.py
+++ b/django/db/models/deletion.py
@@ -230,9 +230,8 @@ def get_del_batches(self, objs, fields):
         """
         Return the objs in suitably sized batches for the used connection.
         """
-        field_names = [field.name for field in fields]
         conn_batch_size = max(
-            connections[self.using].ops.bulk_batch_size(field_names, objs), 1
+            connections[self.using].ops.bulk_batch_size(fields, objs), 1
         )
         if len(objs) > conn_batch_size:
             return [
diff --git a/django/db/models/query.py b/django/db/models/query.py
index eb17624bf1..610e496e9f 100644
--- a/django/db/models/query.py
+++ b/django/db/models/query.py
@@ -874,11 +874,12 @@ def bulk_update(self, objs, fields, batch_size=None):
         objs = tuple(objs)
         if not all(obj._is_pk_set() for obj in objs):
             raise ValueError("All bulk_update() objects must have a primary key set.")
-        fields = [self.model._meta.get_field(name) for name in fields]
+        opts = self.model._meta
+        fields = [opts.get_field(name) for name in fields]
         if any(not f.concrete or f.many_to_many for f in fields):
             raise ValueError("bulk_update() can only be used with concrete fields.")
-        all_pk_fields = set(self.model._meta.pk_fields)
-        for parent in self.model._meta.all_parents:
+        all_pk_fields = set(opts.pk_fields)
+        for parent in opts.all_parents:
             all_pk_fields.update(parent._meta.pk_fields)
         if any(f in all_pk_fields for f in fields):
             raise ValueError("bulk_update() cannot be used with primary key fields.")
@@ -892,7 +893,9 @@ def bulk_update(self, objs, fields, batch_size=None):
         # and once in the WHEN. Each field will also have one CAST.
         self._for_write = True
         connection = connections[self.db]
-        max_batch_size = connection.ops.bulk_batch_size(["pk", "pk"] + fields, objs)
+        max_batch_size = connection.ops.bulk_batch_size(
+            [opts.pk, opts.pk] + fields, objs
+        )
         batch_size = min(batch_size, max_batch_size) if batch_size else max_batch_size
         requires_casting = connection.features.requires_casted_case_in_updates
         batches = (objs[i : i + batch_size] for i in range(0, len(objs), batch_size))
diff --git a/tests/backends/sqlite/test_operations.py b/tests/backends/sqlite/test_operations.py
index 3ff055248d..10cbffdf80 100644
--- a/tests/backends/sqlite/test_operations.py
+++ b/tests/backends/sqlite/test_operations.py
@@ -1,7 +1,7 @@
 import unittest

 from django.core.management.color import no_style
-from django.db import connection
+from django.db import connection, models
 from django.test import TestCase

 from ..models import Person, Tag
@@ -86,3 +86,25 @@ def test_sql_flush_sequences_allow_cascade(self):
             "zzz'",
             statements[-1],
         )
+
+    def test_bulk_batch_size(self):
+        self.assertEqual(connection.ops.bulk_batch_size([], [Person()]), 1)
+        first_name_field = Person._meta.get_field("first_name")
+        last_name_field = Person._meta.get_field("last_name")
+        self.assertEqual(
+            connection.ops.bulk_batch_size([first_name_field], [Person()]), 500
+        )
+        self.assertEqual(
+            connection.ops.bulk_batch_size(
+                [first_name_field, last_name_field], [Person()]
+            ),
+            connection.features.max_query_params // 2,
+        )
+        composite_pk = models.CompositePrimaryKey("first_name", "last_name")
+        composite_pk.fields = [first_name_field, last_name_field]
+        self.assertEqual(
+            connection.ops.bulk_batch_size(
+                [composite_pk, first_name_field], [Person()]
+            ),
+            connection.features.max_query_params // 3,
+        )

Contributor Author

  1. ticket-36143 created
  2. ticket-36144 created (might be able to improve the description adding a test failure when ticket-36118 is done)
  3. I like the approach, I will push up changes 👍

@sarahboyce sarahboyce changed the title Fixed #36118 -- Accounted for multiple primary keys in bulk_update max_batch_size calculation. Fixed #36118 -- Accounted for multiple primary keys in bulk_update max_batch_size. Jan 23, 2025
@sarahboyce

This comment was marked as outdated.

@sarahboyce sarahboyce force-pushed the ticket-36118 branch 2 times, most recently from 25ab9be to 218fc63 on January 24, 2025 08:13
Fixed #36118 -- Accounted for multiple primary keys in bulk_update max_batch_size.

Co-authored-by: Simon Charette <charette.s@gmail.com>
@sarahboyce
Contributor Author

buildbot, test on oracle.

@sarahboyce sarahboyce marked this pull request as ready for review January 27, 2025 10:58
Comment on lines +709 to +714
        fields = list(
            chain.from_iterable(
                field.fields if isinstance(field, CompositePrimaryKey) else [field]
                for field in fields
            )
        )
Member

We definitely want to add this for now due to TupleExact.as_oracle but one thing that I noticed while playing with the docker-compose version is that Oracle 23ai (actually 23c renamed to 23ai because why not 😅 ) appears to support native tuple comparison while 19c still doesn't for most operations?

One thing that is very weird though is that I can't confirm from raw SQL so I wonder if the oracledb wrapper does something special?

Anyway, I'll continue investigating and this should go in as-is for now.

Member

See #19107 as the conclusion of this investigation.

@sarahboyce sarahboyce merged commit 5a2c1bc into django:main Jan 29, 2025
33 checks passed