Handling bulk update or create on Django models through filters

Oct 06, 2021

Recently, I came across a case where I needed to handle either bulk creates or bulk updates on a set of Django records within the same endpoint. Unfortunately, Django doesn't have a way of handling this "either-or" eloquently. I needed to be able to run either operation without the knowledge of primary keys and do so in bulk to minimize the number of statements made against the database.

The problem

This endpoint had a simple requirement: given a set of objects and a combination of fields to filter by, I needed to create new records if they did not already exist, or update them with the new value if they did.

Another way to look at this is through an example of the request body:

[
    {
        "category_id": 23,
        "subcategory_id": 1,
        "post_id": 445,
        "value": "A test value"
    },
    {...}
]

If a record already existed in the table with:

a category_id of 23,
a subcategory_id of 1,
a post_id of 445,

I'd need to update that record's value field with the value provided in the request. Otherwise, I need to create a new record with this data. If any of those three fields did not match, then it would be considered a separate record (such as the case where subcategory_id=2 ).

The equivalent expression would look something like this:

Demo.objects.filter(category_id=23, subcategory_id=1, post_id=445).exists()

The problem statement was straightforward enough, and in normal contexts, should be simple to solve.

An important caveat

However, there was an issue: due to how this data was being created, no primary keys were being provided in the request, and this endpoint is supposed to handle bulk operations. Identifying if the provided records already existed would have been a single statement, but I couldn't check what models already existed in bulk (using the in_bulk method) without primary keys.

Django's in_bulk method doesn't provide a way to filter by several field names like we would need in this case. It's convenient when we have a single list of values, but falls short in more complex cases. Instead, I had to resort to making linear checks through filtering to see if the record existed:

records = [
    {
        "id": Demo.objects.filter(
            category_id=record.get("category_id"),
            subcategory_id=record.get("subcategory_id"),
            post_id=record.get("post_id")
        )
        .first()
        .id
        if Demo.objects.filter(
            category_id=record.get("category_id"),
            subcategory_id=record.get("subcategory_id"),
            post_id=record.get("post_id")
        ).first()
        is not None
        else None,
        **record,
    }
    for record in records
]

This statement looks a bit complex, so let's break it down. We're using list comprehension to return a list of dicts, iterating over every object in the request. For each dict, we're adding a new id key which will hold the primary key of the record that exists given the values, or a value of None if the record can't be found in the database. We're using the filter statement from above to make this comparison. We're then unpacking the data from the request alongside the id key.

The solution

We now have an id key for every record. Django doesn't offer a bulk upsert or update_or_create method, so we have to run bulk_create and bulk_update separately. We'll do this by splitting our new records list into two separate lists which hold the records to update and the records to create.

This is easy to achieve: for every dict in the records list, if the id field is None, then we append that record to the list for creation. Otherwise, that record is appended to the list for updates.

We'll do another list comprehension to achieve this, and we'll also want to remove the id field for records that need to be created since it holds a value of None.

records_to_update = []
records_to_create = []

[
    records_to_update.append(record)
    if record["id"] is not None
    else records_to_create.append(record)
    for record in records
]

[record.pop("id") for record in records_to_create]

We have both of the lists that we need now – all that's left is running our bulk_create and bulk_update methods. By default, Django will attempt to write a single statement for all of the records that you provide it. To limit the length of this statement, we want to specify the batch_size parameter to instruct the method to handle these bulk inserts at a rate of 1000 records per statement.

created_records = Demo.objects.bulk_create(
    [Demo(**values) for values in records_to_create], batch_size=1000
)

# Unless you're using Django 4.0, bulk_update doesn't return a value
Demo.objects.bulk_update(
    [
        Demo(id=values.get("id"), value=values.get("value"))
        for values in records_to_update
    ],
    ["value"],
    batch_size=1000
)

For your bulk_update, Django expects that you specify the list of fields to update as the second argument. Because we are only updating one field in this example, I explicitly set these fields within the model, whereas we are unpacking the full dictionary within the bulk_create statement.

Here is the full code within a viewset:

from rest_framework import generics, status
from rest_framework.response import Response
from api.models import Demo
from api.serializers import DemoSerializer

class DemoViewset(generics.ListCreateAPIView):
    def post(self, request):
        records = request.data.get("data")

        # Let's define two lists:
        # - one to hold the values that we want to insert,
        # - and one to hold the new values alongside existing primary keys to update
        records_to_create = []
        records_to_update = []

        # This is where we check if the records are pre-existing,
				# and add primary keys to the objects if they do
        records = [
            {
                "id": Demo.objects.filter(
                    category_id=record.get("category_id"),
                    subcategory_id=record.get("subcategory_id"),
                    post_id=record.get("post_id")
                )
                .first()
                .id
                if Demo.objects.filter(
                    category_id=record.get("category_id"),
                    subcategory_id=record.get("subcategory_id"),
                    post_id=record.get("post_id")

                ).first()
                is not None
                else None,
                **record,
            }
            for record in records
        ]

        # This is where we delegate our records to our split lists:
        # - if the record already exists in the DB (the 'id' primary key), add it to the update list.
        # - Otherwise, add it to the create list.
        [
            records_to_update.append(record)
            if record["id"] is not None
            else records_to_create.append(record)
            for record in records
        ]

        # Remove the 'id' field, as these will all hold a value of None,
				# since these records do not already exist in the DB
        [record.pop("id") for record in records_to_create]


        created_records = Demo.objects.bulk_create(
            [Demo(**values) for values in records_to_create], batch_size=1000
        )

        # Unless you're using Django 4.0, bulk_update doesn't return a value
        Demo.objects.bulk_update(
            [
                Demo(id=values.get("id"), value=values.get("value"))
                for values in records_to_update
            ],
            ["value"],
            batch_size=1000
        )

        # We may want to return different statuses and content based on the type of operations we ended up doing.
        message = None
        if len(records_to_update) > 0 and len(records_to_create) > 0:
            http_status = status.HTTP_200_OK
        elif len(records_to_update) > 0 and len(records_to_create) == 0:
            http_status = status.HTTP_204_NO_CONTENT
        elif len(records_to_update) == 0 and len(records_to_create) > 0:
            http_status = status.HTTP_201_CREATED
            message = DemoSerializer(created_records, many=True).data

        return Response(message, status=http_status)

Other approaches considered

Initially, I felt weary about this solution, as I would be making a SELECT statement against my model's table for every record to check if it exists by filtering. It felt naive and inefficient. After a bit of reading, however, I found that there wasn't a good solution for this or at least one that was widely agreed upon.

As stated previously, Django doesn't have an eloquent way of handling bulk updates/creates without knowledge of primary keys. You can't easily run bulk upserts against the database, at least without writing raw SQL, which could be time-consuming and complex. While the SQL approach is faster, you risk writing longer expressions that may be less readable or take too much time if you have a simpler case. I wanted something quick and easy, that I knew would work well enough despite potentially taking a minor performance hit.

You can also consider the approach of defining a temporary table, and then copying the newly inserted data to your actual table – but that's cumbersome and can be more difficult to follow. You have an additional table to maintain in your schema, and it can be harder to trace how data is flowing.

Lastly, there's a PyPI package called django-bulk-update-or-create, which adds a Manager to your model's QuerySet. This requires you to import another package for your project, and has been reported to suffer from similar performance drawbacks (many SELECT statements over a large list of models), and may not be feasible if you only have to perform these operations in one view.

The consensus

This approach isn't the most efficient, but it's enough of a solution. With more reading on how to approach this problem, the consensus was that it would be easiest to split your records into two lists: that of which models exist, and models which would need to be created. You could then run a bulk update and bulk create on each list, respectively. At one point, I asked myself:

Why can't I just run a bulk update the same way that I bulk create?

In other words, I wanted to write Demo.objects.bulk_update(Demo(**values)), where I would unpack the dict into the model and expect Django the intelligently handle conflicts. As it turns out, you can't do this, as the statement does not know a primary key. Trying to do this results in an exception:

ValueError: All bulk_update() objects must have a primary key set.

I couldn't find a good workaround for this, so splitting these lists and doing filtering again felt like the best option.

Why do these operations in bulk?

Using bulk_create and bulk_update is ideal so that we can use a single statement against the database. As I mentioned earlier, Django also allows us to use the batch_size parameter to ensure the statement doesn't become overly long, which otherwise could lead to too much CPU and memory usage when executing statements. Our goal was to minimize the number of API requests being made between the two services, and reduce the load on the database, so bulk operations were the perfect candidate.

This can be further optimized by batching the process and running it through jobs; however, I'll leave it up to you to figure out what strategy you want to employ for better memory management and performance.

Return home