Fixed #36118 -- Accounted for multiple primary keys in bulk_update max_batch_size. #19088
Conversation
django/db/models/query.py
         self._for_write = True
         connection = connections[self.db]
-        max_batch_size = connection.ops.bulk_batch_size(["pk", "pk"] + fields, objs)
+        max_batch_size = connection.ops.bulk_batch_size(fields, objs)
Something is wrong in the bulk_batch_size arithmetic if this passes on SQLite 🤔
The query that will be generated here is of the form
UPDATE <table>
SET f0 = (
    CASE WHEN (pk0 = pk0_v0 AND ... AND pkn = pkn_v0) THEN f0_v0
    ...
    CASE WHEN (pk0 = pk0_vm AND ... AND pkn = pkn_vm) THEN f0_vm
)
...
SET fk = (
    CASE WHEN (pk0 = pk0_v0 AND ... AND pkn = pkn_v0) THEN fk_v0
    ...
    CASE WHEN (pk0 = pk0_vm AND ... AND pkn = pkn_vm) THEN fk_vm
)
WHERE (pk0 = pk0_v0 AND ... AND pkn = pkn_v0) OR ... OR (pk0 = pk0_vm AND ... AND pkn = pkn_vm)
For k = len(fields), m = len(objs), and n = len(pk_fields).
So the number of required placeholders will be:
- len(objs) * len(pk_fields) * len(fields): for each field of each object, the primary key fields must be specified in the CASE/WHEN criteria.
- len(objs) * len(fields): each field needs to be assigned a different value for each object.
- len(objs) * len(pk_fields): for the WHERE clause.
So that makes len(objs) * ((len(pk_fields) * len(fields)) + len(fields) + len(pk_fields)) placeholders per query.
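To make that arithmetic concrete, here is a small standalone sketch (an editorial illustration in plain Python, not part of the patch or of Django) that evaluates the formula above:

def bulk_update_placeholders(num_objs, num_pk_fields, num_fields):
    # Placeholders for the CASE/WHEN primary key criteria of every field.
    case_when = num_objs * num_pk_fields * num_fields
    # Placeholders for the THEN value assigned to each field of each object.
    then_values = num_objs * num_fields
    # Placeholders for the WHERE clause matching every object's primary key.
    where = num_objs * num_pk_fields
    return case_when + then_values + where

# 500 objects, 3 primary key columns, 1 updated field (the case discussed below).
print(bulk_update_placeholders(500, 3, 1))  # 3500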
Does SQLite have different rules for different placeholder locations?
Does SQLite have different rules for different placeholder locations?
It feels like it might, as even when I was very aggressive and added 10 pks in the composite, it still passed without a complaint 🤔
I can try to read the SQLite docs tomorrow 👍
So from the test below that updates text for 1000 items, I would expect max_batch_size to return 500 as len(fields) == 1, but if we plug len(objs) = 500 into the above formula for the Basic model:
500 * ((3 * 1) + 1 + 3) = 3500
which is way greater than 999 🤔
Maybe the version of SQLite we have compiled on CI has a greater limit than 999?
YES! ⭐
https://www.sqlite.org/limits.html#max_variable_number
To prevent excessive memory allocations, the maximum value of a host parameter number is SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.
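As a quick local check (an editorial aside, not part of the patch), the SQLite version bundled with a given Python build can be inspected through the stdlib sqlite3 module to see which documented default applies:

import sqlite3

print(sqlite3.sqlite_version_info)
# Per the limits page quoted above: 999 below 3.32.0, 32766 from 3.32.0 on.
print(32766 if sqlite3.sqlite_version_info >= (3, 32, 0) else 999)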
Nice, that explains it! So we can likely bump max_query_params for SQLite depending on the current version, but we'd likely still want to test against SQLite < 3.32 as our minimum_database_version is (3, 31) right now.
Wonder if we should have something akin to ticket-35412 to consider dropping < 3.32 support.
Wonder if we should have something akin to ticket-35412 to consider dropping < 3.32 support.
If I understand it correctly, it was Ubuntu Focal which kept the support with 3.31. Focal ends standard support in April, and I think Ubuntu Jammy has 3.37.2. So we might be able to bump to 3.37 for 6.0 which would get rid of a couple of flags (cc @felixxm)
Let me double check and I can create a similar ticket later today.
@laymonage I know you were interested in django/django-docker-box#11, I don't suppose you already had a chance to take a look and can help us test against SQLite 3.31? 🙏
Sorry, was a bit busy with Wagtail's feature freeze last week. I'll try to get on it this weekend!
On a related note, this is precisely why I suggested having a CI job that tests against an SQLite version that's independent from the distro in #18899. I explained it in quite some detail in #18899 (comment), including the issue with SQLITE_MAX_VARIABLE_NUMBER being customized by Debian/Ubuntu, even prior to 3.32.
If I understand it correctly, it was Ubuntu Focal which kept the support with 3.31. Focal ends standard support in April, and I think Ubuntu Jammy has 3.37.2. So we might be able to bump to 3.37 for 6.0 which would get rid of a couple of flags (cc @felixxm)
It seems that we could bump to 3.34 (but it doesn't give us anything, so not worth doing). SQLite 3.34 was released in December 2020 (SQLite version support seems like a similar situation to the GEOS libraries, which we generally support for about 5 years after release). Moreover, both Ubuntu Focal and Debian Bullseye are still in LTS support.
Compiling SQLite 3.31.0 without overriding SQLITE_MAX_VARIABLE_NUMBER
would cause some tests to fail. Setting the flag to 32766 makes the tests pass, so there is some benefit to bumping the minimum to 3.34. Whether that's worth it – I don't know. It's also likely to be a bug or potential improvement in Django itself, and setting the flag/bumping the version would just be ignoring the problem.
See django/django-docker-box#50 (comment) for more details.
tests/composite_pk/test_update.py
@@ -175,3 +175,25 @@ def test_cant_update_to_unsaved_object(self):

         with self.assertRaisesMessage(ValueError, msg):
             Comment.objects.update(user=User())

+    def test_bulk_update_max_bulk_batch_size(self):
This acts as a regression test for Oracle, which would fail with:
django.db.utils.DatabaseError: ORA-00907: missing right parenthesis
before the pk field changes
        # SQLITE_MAX_VARIABLE_NUMBER defaults to 999 for SQLite versions prior
        # to 3.32.0 or 32766 for SQLite versions after 3.32.0. See
        # https://www.sqlite.org/limits.html#max_variable_number
        if Database.sqlite_version_info >= (3, 32):
            return 32766
        return 999
As noted in the discussions and in the linked documentation, this value can vary depending on the actual flags used when compiling SQLite.
I found that the following query can be used to get the real limit value, but I'm not sure how we feel about doing such queries. It is documented though: https://www.sqlite.org/pragma.html#pragma_compile_options
with self.connection.cursor() as cursor:
    try:
        with transaction.atomic(self.connection.alias):
            result = cursor.execute(
                "SELECT compile_options FROM pragma_compile_options "
                "WHERE compile_options LIKE 'MAX_VARIABLE_NUMBER=%'"
            ).fetchone()
            return int(result[0].split('=')[-1])
    except (OperationalError, TypeError):
        # If it fails for some reason, use the defaults as per
        # https://www.sqlite.org/limits.html#max_variable_number
        if Database.sqlite_version_info >= (3, 32):
            return 32766
        return 999
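As a standalone illustration (editorial, using only the stdlib sqlite3 module rather than Django's connection machinery), the same pragma can be queried against an in-memory database; note that many builds only list MAX_VARIABLE_NUMBER in compile_options when it was overridden at compile time, hence the fallback:

import sqlite3

def max_variable_number(default=999):
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute(
            "SELECT compile_options FROM pragma_compile_options "
            "WHERE compile_options LIKE 'MAX_VARIABLE_NUMBER=%'"
        ).fetchone()
    finally:
        conn.close()
    # Fall back to the documented default when the option isn't listed.
    return int(row[0].split("=")[-1]) if row else default

print(max_variable_number())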
django/db/backends/oracle/features.py
@@ -70,7 +70,7 @@ class DatabaseFeatures(BaseDatabaseFeatures):
     supports_over_clause = True
     supports_frame_range_fixed_distance = True
     supports_ignore_conflicts = False
-    max_query_params = 2**16 - 1
+    max_query_params = 7999
My current theory is that there is a maximum query size of 8000 bytes, and hence this shouldn't exceed that:
https://docs.oracle.com/cd/E19851-01/819-7557/gfqsr/index.html
These docs are ancient, I assure you that it's not the actual limit.
7999 is also the number of bytes that the sum of the query parameter values must not exceed, not the number of query parameters itself. In other words, assuming these docs were still valid, if you have 2 fields per object which are each long values documented to take 8 bytes each (e.g. originating from IntegerField), you could only have 8000 / 8 = 1000 query parameters.
I'm not sure how the 2**16 - 1 value was picked, but 7999 is definitely not the maximum number of query params.
See docs: "number of formal parameters in an explicit cursor, function, or procedure".
In case it's not obvious, I don't really know what I'm doing and feel quite out of my depth.
I initially thought this would be a "simple" change updating connection.ops.bulk_batch_size(["pk", "pk"] + fields, objs) to connection.ops.bulk_batch_size((pk_fields * 2) + fields, objs).
But it looks like when using composite primary keys, the complexity of the query increases to the point that we are in danger of hitting multiple different database limits.
From trial and error, I could increase the value here to roughly 17000. This doesn't mean much because it assumes the current maths is correct.
However, it is close to half of 32768, and so I could do:
--- a/django/db/backends/oracle/features.py
+++ b/django/db/backends/oracle/features.py
@@ -70,7 +70,7 @@ class DatabaseFeatures(BaseDatabaseFeatures):
supports_over_clause = True
supports_frame_range_fixed_distance = True
supports_ignore_conflicts = False
- max_query_params = 17000
+ max_query_params = 32768
supports_partial_indexes = False
diff --git a/django/db/backends/oracle/operations.py b/django/db/backends/oracle/operations.py
index aa67f28a79..a62a8e7b62 100644
--- a/django/db/backends/oracle/operations.py
+++ b/django/db/backends/oracle/operations.py
@@ -700,7 +700,7 @@ END;
def bulk_batch_size(self, fields, objs):
"""Oracle restricts the number of parameters in a query."""
if fields:
- return self.connection.features.max_query_params // len(fields)
+ return self.connection.features.max_query_params // len(fields * 2)
return len(objs)
And the test passes.
But it's smoke and mirrors, I don't have any strong logic to back the changes.
One option could be to add a docs warning that when using bulk_update() on a model with a composite primary key, users should define a batch_size due to the query complexity.
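For reference, such a workaround would look roughly like the following usage sketch (the Order model and its fields are hypothetical names used purely for illustration):

# Hypothetical model and data; only the explicit batch_size matters here.
orders = list(Order.objects.filter(status="pending"))
for order in orders:
    order.status = "shipped"

# Cap the batch size explicitly instead of relying on the computed maximum.
Order.objects.bulk_update(orders, ["status"], batch_size=500)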
Thanks for spending time investigating here @sarahboyce. Dealing with different backend versions and specialized configuration options and trying to make sense of it all is definitely painful.
In order to make progress here I think we need to state a few things.
This ticket was created with the intent of auditing the current state of things, and I think we reached that point here in the sense that we now know that the bulk_batch_size approach is broken in a few ways:
1. The max_query_params value used for SQLite is going to be overly protective in most cases.
2. There are likely other factors at play (different settings) beyond the maximum number of query parameters (e.g. possibly their respective value in bytes) that must be considered to have a completely bulletproof solution.
3. Some adjustments to bulk_batch_size calls should still be made to some extent to ensure the situation is not worsened by the usage of composite primary keys.
Now, where does that leave us?
I think that 1. can be considered a distinct optimization ticket to reduce the number of queries made during bulk operations.
In the case of 2., I also think that this could be explored as a distinct ticket, likely to design a better API than bulk_batch_size that holistically considers the factors that would require reducing the number of parameters.
This leaves us with the core of this issue: the adjustments required to make sure composite primary key usage in bulk operations is not more likely to run into problems. As you may know, all backends but SQLite support native tuple comparison (that is, they are able to do (f0, ..., fn) = %s instead of f0 = %s ... AND ... fn = %s), which means they should not have to be adjusted at all to accommodate composite primary keys (as they'll use the same number of query parameters, which is the only factor bulk_batch_size currently considers).
Assuming all the above makes sense, I think we could focus this PR on the simple linear increase (a constant times the number of additional fields in the primary key) of query parameters on SQLite. A potential solution could be to systematically pass Field instances to bulk_batch_size (instead of passing str | Field) and let backends unpack them if required. Something like the following, for example:
diff --git a/django/db/backends/sqlite3/operations.py b/django/db/backends/sqlite3/operations.py
index 0078cc077a..f0ddbc66fa 100644
--- a/django/db/backends/sqlite3/operations.py
+++ b/django/db/backends/sqlite3/operations.py
@@ -8,6 +8,7 @@
from django.core.exceptions import FieldError
from django.db import DatabaseError, NotSupportedError, models
from django.db.backends.base.operations import BaseDatabaseOperations
+from django.db.models import CompositePrimaryKey
from django.db.models.constants import OnConflict
from django.db.models.expressions import Col
from django.utils import timezone
@@ -36,6 +37,12 @@ def bulk_batch_size(self, fields, objs):
If there's only a single field to insert, the limit is 500
(SQLITE_MAX_COMPOUND_SELECT).
"""
+ fields = list(
+ chain.from_iterable(
+ field.fields if isinstance(field, CompositePrimaryKey) else [field]
+ for field in fields
+ )
+ )
if len(fields) == 1:
return 500
elif len(fields) > 1:
diff --git a/django/db/models/deletion.py b/django/db/models/deletion.py
index fd3d290a96..da2e934c96 100644
--- a/django/db/models/deletion.py
+++ b/django/db/models/deletion.py
@@ -230,9 +230,8 @@ def get_del_batches(self, objs, fields):
"""
Return the objs in suitably sized batches for the used connection.
"""
- field_names = [field.name for field in fields]
conn_batch_size = max(
- connections[self.using].ops.bulk_batch_size(field_names, objs), 1
+ connections[self.using].ops.bulk_batch_size(fields, objs), 1
)
if len(objs) > conn_batch_size:
return [
diff --git a/django/db/models/query.py b/django/db/models/query.py
index eb17624bf1..610e496e9f 100644
--- a/django/db/models/query.py
+++ b/django/db/models/query.py
@@ -874,11 +874,12 @@ def bulk_update(self, objs, fields, batch_size=None):
objs = tuple(objs)
if not all(obj._is_pk_set() for obj in objs):
raise ValueError("All bulk_update() objects must have a primary key set.")
- fields = [self.model._meta.get_field(name) for name in fields]
+ opts = self.model._meta
+ fields = [opts.get_field(name) for name in fields]
if any(not f.concrete or f.many_to_many for f in fields):
raise ValueError("bulk_update() can only be used with concrete fields.")
- all_pk_fields = set(self.model._meta.pk_fields)
- for parent in self.model._meta.all_parents:
+ all_pk_fields = set(opts.pk_fields)
+ for parent in opts.all_parents:
all_pk_fields.update(parent._meta.pk_fields)
if any(f in all_pk_fields for f in fields):
raise ValueError("bulk_update() cannot be used with primary key fields.")
@@ -892,7 +893,9 @@ def bulk_update(self, objs, fields, batch_size=None):
# and once in the WHEN. Each field will also have one CAST.
self._for_write = True
connection = connections[self.db]
- max_batch_size = connection.ops.bulk_batch_size(["pk", "pk"] + fields, objs)
+ max_batch_size = connection.ops.bulk_batch_size(
+ [opts.pk, opts.pk] + fields, objs
+ )
batch_size = min(batch_size, max_batch_size) if batch_size else max_batch_size
requires_casting = connection.features.requires_casted_case_in_updates
batches = (objs[i : i + batch_size] for i in range(0, len(objs), batch_size))
diff --git a/tests/backends/sqlite/test_operations.py b/tests/backends/sqlite/test_operations.py
index 3ff055248d..10cbffdf80 100644
--- a/tests/backends/sqlite/test_operations.py
+++ b/tests/backends/sqlite/test_operations.py
@@ -1,7 +1,7 @@
import unittest
from django.core.management.color import no_style
-from django.db import connection
+from django.db import connection, models
from django.test import TestCase
from ..models import Person, Tag
@@ -86,3 +86,25 @@ def test_sql_flush_sequences_allow_cascade(self):
"zzz'",
statements[-1],
)
+
+ def test_bulk_batch_size(self):
+ self.assertEqual(connection.ops.bulk_batch_size([], [Person()]), 1)
+ first_name_field = Person._meta.get_field("first_name")
+ last_name_field = Person._meta.get_field("last_name")
+ self.assertEqual(
+ connection.ops.bulk_batch_size([first_name_field], [Person()]), 500
+ )
+ self.assertEqual(
+ connection.ops.bulk_batch_size(
+ [first_name_field, last_name_field], [Person()]
+ ),
+ connection.features.max_query_params // 2,
+ )
+ composite_pk = models.CompositePrimaryKey("first_name", "last_name")
+ composite_pk.fields = [first_name_field, last_name_field]
+ self.assertEqual(
+ connection.ops.bulk_batch_size(
+ [composite_pk, first_name_field], [Person()]
+ ),
+ connection.features.max_query_params // 3,
+ )
- ticket-36143 created
- ticket-36144 created (might be able to improve the description by adding a test failure when ticket-36118 is done)
- I like the approach, I will push up changes 👍
…x_batch_size. Co-authored-by: Simon Charette <charette.s@gmail.com>
buildbot, test on oracle.
fields = list(
    chain.from_iterable(
        field.fields if isinstance(field, CompositePrimaryKey) else [field]
        for field in fields
    )
)
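For illustration only (a runnable, plain-Python sketch using a stand-in class rather than Django's CompositePrimaryKey), the flattening above behaves like this:

from itertools import chain

class FakeCompositePK:
    # Stand-in for CompositePrimaryKey: exposes the wrapped fields.
    def __init__(self, *fields):
        self.fields = fields

fields = [FakeCompositePK("tenant_id", "id"), "name", "email"]
flattened = list(
    chain.from_iterable(
        f.fields if isinstance(f, FakeCompositePK) else [f] for f in fields
    )
)
print(flattened)  # ['tenant_id', 'id', 'name', 'email']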
We definitely want to add this for now due to TupleExact.as_oracle
but one thing that I noticed while playing with the docker-compose version is that Oracle 23ai (actually 23c renamed to 23ai because why not 😅 ) appears to support native tuple comparison while 19c still doesn't for most operations?
One thing that is very weird though is that I can't confirm from raw SQL so I wonder if the oracledb
wrapper does something special?
Anyway, I'll continue investigating and this should go in as-is for now.
See #19107 as the conclusion of this investigation.
Trac ticket number
ticket-36118
Branch description
Been struggling to prove that the "pk" additions are required at all (the logic was added in 9cbdb44). The tests were to help prove a point, but given we have queries.test_bulk_update.BulkUpdateTests.test_large_batch, a test might not be required.