Handling bulk update or create on Django models through filters
Recently, I came across a case where I needed to handle either bulk creates or bulk updates on a set of Django records within the same endpoint. Unfortunately, Django doesn't handle this "either-or" case elegantly. I needed to be able to run either operation without knowledge of primary keys, and do so in bulk to minimize the number of statements made against the database.
The problem
This endpoint had a simple requirement: given a set of objects and a combination of fields to filter by, I needed to create new records if they did not already exist, or update them with the new value if they did.
Another way to look at this is through an example of the request body:
[ { "category_id": 23, "subcategory_id": 1, "post_id": 445, "value": "A test value" }, {...} ]
If a record already existed in the table with:
- a category_id of 23,
- a subcategory_id of 1,
- a post_id of 445,
I'd need to update that record's value field with the value provided in the request. Otherwise, I'd need to create a new record with this data. If any of those three fields did not match, the object would be treated as a separate record (such as the case where subcategory_id=2).
The equivalent expression would look something like this:
Demo.objects.filter(category_id=23, subcategory_id=1, post_id=445).exists()
The problem statement was straightforward enough, and in normal contexts, should be simple to solve.
An important caveat
However, there was an issue: due to how this data was being created, no primary keys were provided in the request, and the endpoint had to handle bulk operations. Ideally, identifying which of the provided records already existed would be a single statement, but without primary keys I couldn't check for existing models in bulk (using the in_bulk method).
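For context, here is roughly how in_bulk is normally used. This is a minimal sketch against the Demo model used throughout this post, and the example values are made up:

# in_bulk() takes a single list of values for one field (the primary key by
# default) and returns a dict mapping those values to model instances.
demos = Demo.objects.in_bulk([11, 22, 33])
# returns something like {11: <Demo>, 22: <Demo>, 33: <Demo>}

# field_name lets you key the lookup on a different field, but it must be a
# single unique field. There's no way to key on a combination of fields such
# as (category_id, subcategory_id, post_id), which is what we'd need here.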
Django's in_bulk method doesn't provide a way to filter by several field names like we would need in this case. It's convenient when we have a single list of values, but falls short in more complex cases. Instead, I had to resort to making linear checks through filtering to see if the record existed:
records = [ { "id": Demo.objects.filter( category_id=record.get("category_id"), subcategory_id=record.get("subcategory_id"), post_id=record.get("post_id") ) .first() .id if Demo.objects.filter( category_id=record.get("category_id"), subcategory_id=record.get("subcategory_id"), post_id=record.get("post_id") ).first() is not None else None, **record, } for record in records ]
This statement looks a bit complex, so let's break it down. We're using a list comprehension to return a list of dicts, iterating over every object in the request. For each dict, we add a new id key which holds the primary key of the existing record that matches those values, or None if the record can't be found in the database. We use the filter statement from above to make this comparison, and then unpack the data from the request alongside the id key.
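To make that concrete, suppose the first record from the request body already exists in the table and a second one doesn't. The primary key of 7 and the second record below are made up purely for illustration; records would now look something like this:

records = [
    # matched an existing row, so we carry its primary key along
    {"id": 7, "category_id": 23, "subcategory_id": 1, "post_id": 445, "value": "A test value"},
    # no matching row in the table yet
    {"id": None, "category_id": 23, "subcategory_id": 2, "post_id": 445, "value": "Another value"},
]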
The solution
We now have an id key for every record. Django doesn't offer a bulk upsert or a bulk update_or_create method, so we have to run bulk_create and bulk_update separately. We'll do this by splitting our new records list into two separate lists: one holding the records to update and one holding the records to create.

This is easy to achieve: for every dict in the records list, if the id field is None, we append that record to the list for creation. Otherwise, the record is appended to the list for updates. We'll do another list comprehension to achieve this, and we'll also remove the id field from the records that need to be created, since it only holds a value of None.
records_to_update = []
records_to_create = []

[
    records_to_update.append(record)
    if record["id"] is not None
    else records_to_create.append(record)
    for record in records
]

[record.pop("id") for record in records_to_create]
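Continuing the hypothetical example from above, the split would leave us with something like:

records_to_update = [
    {"id": 7, "category_id": 23, "subcategory_id": 1, "post_id": 445, "value": "A test value"}
]
records_to_create = [
    {"category_id": 23, "subcategory_id": 2, "post_id": 445, "value": "Another value"}
]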
We have both of the lists that we need now – all that's left is running our bulk_create and bulk_update methods. By default, Django will attempt to write a single statement for all of the records that you provide it. To limit the length of this statement, we specify the batch_size parameter, which instructs each method to handle these bulk operations at a rate of 1000 records per statement.
created_records = Demo.objects.bulk_create(
    [Demo(**values) for values in records_to_create], batch_size=1000
)

# Unless you're using Django 4.0, bulk_update doesn't return a value
Demo.objects.bulk_update(
    [
        Demo(id=values.get("id"), value=values.get("value"))
        for values in records_to_update
    ],
    ["value"],
    batch_size=1000,
)
For bulk_update, Django expects you to specify the list of fields to update as the second argument. Because we are only updating one field in this example, I explicitly set the id and value fields on the model instances, whereas we unpack the full dictionary within the bulk_create statement.
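If we were updating more than one field, we'd simply list each of them in that second argument. A quick sketch (other_field is a made-up field name for illustration):

# List every field you want bulk_update to write in the second argument.
Demo.objects.bulk_update(
    [
        Demo(
            id=values.get("id"),
            value=values.get("value"),
            other_field=values.get("other_field"),
        )
        for values in records_to_update
    ],
    ["value", "other_field"],
    batch_size=1000,
)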
Here is the full code within a viewset:
from rest_framework import generics, status
from rest_framework.response import Response

from api.models import Demo
from api.serializers import DemoSerializer


class DemoViewset(generics.ListCreateAPIView):
    def post(self, request):
        records = request.data.get("data")

        # Let's define two lists:
        # - one to hold the values that we want to insert,
        # - and one to hold the new values alongside existing primary keys to update
        records_to_create = []
        records_to_update = []

        # This is where we check if the records are pre-existing,
        # and add primary keys to the objects if they are
        records = [
            {
                "id": Demo.objects.filter(
                    category_id=record.get("category_id"),
                    subcategory_id=record.get("subcategory_id"),
                    post_id=record.get("post_id"),
                )
                .first()
                .id
                if Demo.objects.filter(
                    category_id=record.get("category_id"),
                    subcategory_id=record.get("subcategory_id"),
                    post_id=record.get("post_id"),
                ).first()
                is not None
                else None,
                **record,
            }
            for record in records
        ]

        # This is where we delegate our records to our split lists:
        # - if the record already exists in the DB (the 'id' primary key), add it to the update list.
        # - Otherwise, add it to the create list.
        [
            records_to_update.append(record)
            if record["id"] is not None
            else records_to_create.append(record)
            for record in records
        ]

        # Remove the 'id' field, as these will all hold a value of None,
        # since these records do not already exist in the DB
        [record.pop("id") for record in records_to_create]

        created_records = Demo.objects.bulk_create(
            [Demo(**values) for values in records_to_create], batch_size=1000
        )

        # Unless you're using Django 4.0, bulk_update doesn't return a value
        Demo.objects.bulk_update(
            [
                Demo(id=values.get("id"), value=values.get("value"))
                for values in records_to_update
            ],
            ["value"],
            batch_size=1000,
        )

        # We may want to return different statuses and content
        # based on the type of operations we ended up doing.
        message = None
        if len(records_to_update) > 0 and len(records_to_create) > 0:
            http_status = status.HTTP_200_OK
        elif len(records_to_update) > 0 and len(records_to_create) == 0:
            http_status = status.HTTP_204_NO_CONTENT
        elif len(records_to_update) == 0 and len(records_to_create) > 0:
            http_status = status.HTTP_201_CREATED
            message = DemoSerializer(created_records, many=True).data
        else:
            # Fallback for an empty payload: neither list had any records
            http_status = status.HTTP_204_NO_CONTENT

        return Response(message, status=http_status)
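As a quick sanity check, a request to this endpoint might look like the following. This is a sketch using DRF's test client; the "/demos/bulk/" path is hypothetical, since the post doesn't show the URL configuration:

from rest_framework.test import APIClient

client = APIClient()
response = client.post(
    "/demos/bulk/",  # made-up path for illustration
    {
        "data": [
            {"category_id": 23, "subcategory_id": 1, "post_id": 445, "value": "A test value"}
        ]
    },
    format="json",
)
print(response.status_code)  # 201 if every record was new, 200/204 otherwise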
Other approaches considered
Initially, I felt wary about this solution, as I would be making a SELECT statement against my model's table for every record to check whether it exists by filtering. It felt naive and inefficient. After a bit of reading, however, I found that there wasn't a good solution for this, or at least not one that was widely agreed upon.
As stated previously, Django doesn't have an elegant way of handling bulk updates/creates without knowledge of primary keys. You can't easily run bulk upserts against the database without writing raw SQL, which can be time-consuming and complex. While the raw SQL approach is faster, you risk writing longer, less readable expressions and spending more time than a simpler case warrants. I wanted something quick and easy that I knew would work well enough, despite potentially taking a minor performance hit.
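For a sense of what the raw-SQL route involves, here's a minimal sketch. It assumes PostgreSQL, the default api_demo table name, and a unique constraint covering (category_id, subcategory_id, post_id), none of which this post's setup guarantees:

from django.db import connection

def upsert_demos(records):
    # INSERT ... ON CONFLICT requires a unique constraint covering the three
    # columns; without one, this statement will fail.
    with connection.cursor() as cursor:
        cursor.executemany(
            """
            INSERT INTO api_demo (category_id, subcategory_id, post_id, value)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (category_id, subcategory_id, post_id)
            DO UPDATE SET value = EXCLUDED.value
            """,
            [
                (r["category_id"], r["subcategory_id"], r["post_id"], r["value"])
                for r in records
            ],
        )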
You can also consider the approach of defining a temporary table, and then copying the newly inserted data to your actual table – but that's cumbersome and can be more difficult to follow. You have an additional table to maintain in your schema, and it can be harder to trace how data is flowing.
Lastly, there's a PyPI package called django-bulk-update-or-create, which adds a custom manager/QuerySet to your model. It requires pulling another dependency into your project, has been reported to suffer from similar performance drawbacks (many SELECT statements over a large list of models), and may not be worth it if you only need to perform these operations in one view.
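Its usage looks roughly like the following. Treat the import path, method name, and signature as assumptions based on the package's README rather than something verified here:

from bulk_update_or_create import BulkUpdateOrCreateQuerySet
from django.db import models

class Demo(models.Model):
    # The package's QuerySet is attached to the model as its manager
    objects = BulkUpdateOrCreateQuerySet.as_manager()
    ...

# items is a list of Demo instances. match_field identifies existing rows,
# and note that it's a single field, so it wouldn't cover our three-field
# (category_id, subcategory_id, post_id) match anyway.
Demo.objects.bulk_update_or_create(items, ["value"], match_field="id")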
The consensus
This approach isn't the most efficient, but it's enough of a solution. After more reading on how to approach this problem, the consensus seemed to be that it's easiest to split your records into two lists: one for models that already exist, and one for models that need to be created. You can then run a bulk update and a bulk create on each list, respectively. At one point, I asked myself:
Why can't I just run a bulk update the same way that I bulk create?
In other words, I wanted to write Demo.objects.bulk_update(Demo(**values)), where I would unpack the dict into the model and expect Django to intelligently handle conflicts. As it turns out, you can't do this, because the statement doesn't know the primary key.
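For instance, trying something along these lines (a sketch of what I wanted, not code from the working solution) raises an exception before Django even builds a query:

# Instances built straight from the request data have no primary key set,
# so bulk_update refuses them.
Demo.objects.bulk_update(
    [Demo(**values) for values in records_to_create], ["value"], batch_size=1000
)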
ValueError: All bulk_update() objects must have a primary key set.
I couldn't find a good workaround for this, so splitting these lists and doing filtering again felt like the best option.
Why do these operations in bulk?
Using bulk_create and bulk_update is ideal so that we can use a single statement against the database. As I mentioned earlier, Django also allows us to use the batch_size parameter to ensure the statement doesn't become overly long, which otherwise could lead to too much CPU and memory usage when executing statements. Our goal was to minimize the number of API requests being made between the two services and reduce the load on the database, so bulk operations were the perfect candidate.
This can be further optimized by batching the process and running it through jobs; however, I'll leave it up to you to figure out what strategy you want to employ for better memory management and performance.