-
Notifications
You must be signed in to change notification settings - Fork 111
Enabling an expression based sort
Some of our API endpoints make use of SQL expressions (functions) in the WHERE clause of the underlying SQL statements. In these cases, we should be making use of the same expression in the ORDER BY clause of the SQL statement if we're also sorting (or would like to enable sorting) on the same field(s) found in the WHERE clause.
In order to accomplish this and ensure that the queries remain performant (or become more performant in the case of addressing existing performance issues), we must make use of indexes on expressions. With an index setup on the expression itself we can take advantage of the performance gains in both the WHERE and ORDER BY clauses in the SQL statements generated for the API.
There are three steps for accomplishing this:
- Setting up the expression-based index(es)
- Setting up the expression-based sort clause to map to the column(s) exposed for sorting in the API
- Excluding the expression-based sort mapping from the schema tied to the resource
Let's dive into the details of these three steps using issue #2791 as the basis for the example.
Setting up an expression-based index is identical to a regular index, the only difference is there is a SQL expression in place of a column name. For example, there are COALESCE statements in the WHERE clause for the Schedule B queries:
COALESCE(ofec_sched_b_master.disb_dt, '9999-12-31'::date);We'll be referencing this statement in the example. But first, note that the API is also keeping track of an index for paging, and in the case of Schedule A and B, a CHECK CONSTRAINT for table partitioning due to the size of those data sets. All of our indexes must keep these things in mind, so they are usually setup as multicolumn indexes for maximum performance with our filtering, paging, and sorting options.
Back to our example with Schedule B, in order to create an index for this statement we would execute this statement:
CREATE INDEX idx_ofec_sched_b_1977_1978_election_cycle_coalesce_disb_dt
ON public.ofec_sched_b_1977_1978
USING btree
(two_year_transaction_period, COALESCE(disb_dt, '9999-12-31'::date), sub_id);Please note that we would do this for all tables in the Schedule B partition. With an expression-based index created, the WHERE and ORDER BY clauses can now take advantage of it. In this case, it's the COALESCE statement using the disb_dt column.
A full example of this can be seen in PR #2904, where a migration was created to add a group of expression-based indexes for the Schedule B partition.
With expression-based indexes in place, any query with an expression in the WHERE clause that matches the index will automatically take advantage of it. The ORDER BY clause will not, however, because of the way the API is setup and how our sorting is exposed in the interface and URL. Only a column name is exposed in the API (both for usability and security, as we don't want to open ourselves up to SQL injection attacks), so we must map the column name to the expression we would actually like to sort by. This is transparent to the user, meaning the API continues to work as it always has by specifying one or more (where supported) sorting columns with an optional - in front to toggle between ascending or descending order.
In order to setup this mapping to enable expression-based sorting for a given field, a hybrid_property must be setup on the model associated with the resource that is mapped to the endpoint being modified. For example, when using the API to view the Schedule B (disbursement) data, you would see this in the URL parameters by default: sort=-disbursement_date. This is the column name exposed in the API, but under the hood it should actually sort by COALESCE(ofec_sched_b_master.disb_dt, '9999-12-31'::date). The API model is what enables us to do this if we add a new property to it.
Continuing with the Schedule B example, take a look at the ScheduleB model. First, we need to make sure we're importing these three things at the top of the file (there will likely be additional import statements):
import marshmallow as ma
import sqlalchemy as sa
from sqlalchemy.ext.hybrid import hybrid_propertyWith those in place, find the model we want to add the sort expression support to and add this code (again, this is the code built for Schedule B in this example):
@hybrid_property
def sort_expressions(self):
return {
'disbursement_date': {
'expression': sa.func.coalesce(
self.disbursement_date,
sa.cast('9999-12-31', sa.Date)
),
'field': ma.fields.Date,
'type': 'date',
'null_sort': self.disbursement_date, # This is optional - it will default to the expression itself.
},
}Let's unpack what is going on here:
- First, we need to use the
@hybrid_propertydecorator because we are making use of a column defined in the model and a SQLAlchemy construct (the reference tocoalescein this case). - Second, we create the property as a method and call it
sort_expressions. This is important as the sorting code checks for the presence of this property name on a model to see if any columns need to be swapped for expressions. - Third, we return a
dictin the method that contains the column names found in the model as the keys and an inner dictionary that contains the expression, marshmallow field, and data type we would like to swap them with as the values. Again, the outer keys are the column names that are defined on the model, meaning the model properties, not the database column name itself. In this example,disbursement_dateis the model property anddisb_dtis the database column name. We need to usedisbursement_datebecause that is the model property and what is exposed in the API.
The inner dictionary always has three parts to it, and one optional part:
- The expression itself
- The marshmallow field that the value of the expression would be mapped to
- The data type of the value in the database itself as a lower case string
-
(OPTIONAL) The means for accounting how to sort
NULLvalues
The expression itself is calculated for its actual value in the pagination code, so we need the marshmallow field because it is used to serialize the value of the expression. Furthermore, we need the data type of the expression because that is used to help construct the ORDER BY clause in the pagination code as well. Both of these items are derived programmatically from a column, but they cannot be derived from an expression, so we must explicitly map them as a part of this.
The optional field, null_sort, is for accounting how to sort in the case of NULL values. In this example with Schedule B, the disbursement_date field can be NULL and since we expose it in the API as NULL as well as in the web site, we need to still be able to sort NULL values even though the expression would never yield a NULL value. In this case, we have to provide an alternative value to sort by so the sorting still works as expected. However, if you are working with an expression that does return a NULL value, then this is not needed and the sorting code will default to also using the expression for sorting NULL values.
If we were going to enable sorting by expression for another column, we would add another key/value pair in the dictionary of the model property and the expression, field, and type to order by.
The sorting and paging code takes care of all of the rest for us, including supporting which direction to sort in. With this property and mapping in place, the expression will be used in the ORDER BY clause, and the ORDER BY clause in turn will make use of the index setup for the expression.
If you are interested in seeing the full implementation of how this is constructed, PR #2905 contains the changeset that enables the underlying support for this.
Each end point is tied to a resource, which in turn is usually a mapping between a model and a schema. If there is a schema associated with a resource, we need to also ignore the sort_expressions property in the schema as that should not be returned with the results.
Using Schedule B as the example once more, the Schedule B schema is defined as follows in webservices/schemas.py:
ScheduleBSchema = make_schema(
models.ScheduleB,
fields={
'memoed_subtotal': ma.fields.Boolean(),
'committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
'recipient_committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
'image_number': ma.fields.Str(),
'original_sub_id': ma.fields.Str(),
'sub_id': ma.fields.Str(),
},
options={
'exclude': (
'recipient_name_text',
'disbursement_description_text',
'recipient_street_1',
'recipient_street_2',
),
'relationships': [
Relationship(
models.ScheduleB.committee,
models.CommitteeHistory.name,
'committee_name',
1
),
],
}
)The exclude key in the options dictionary needs an additional entry in it, sort_expressions:
ScheduleBSchema = make_schema(
models.ScheduleB,
fields={
'memoed_subtotal': ma.fields.Boolean(),
'committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
'recipient_committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
'image_number': ma.fields.Str(),
'original_sub_id': ma.fields.Str(),
'sub_id': ma.fields.Str(),
},
options={
'exclude': (
'recipient_name_text',
'disbursement_description_text',
'recipient_street_1',
'recipient_street_2',
'sort_expressions',
),
'relationships': [
Relationship(
models.ScheduleB.committee,
models.CommitteeHistory.name,
'committee_name',
1
),
],
}
)This will ensure the sort_expressions property does not appear in the JSON output of the API.