DEV Community

Cover image for Could not create unique index: how to solve duplication errors
Thomas Bracher for Doctolib Engineering

Posted on

Could not create unique index: how to solve duplication errors

Unique indexes enforce the uniqueness (as their name implies) of one or more columns' values. They can guarantee one-to-one relationships in your model. For instance, let's have a unique shopping cart per user:

=# create unique index uidx_shopping_carts_on_user_id on shopping_carts (user_id);

ERROR:  could not create unique index "uidx_shopping_carts_by_user_id"
DETAIL:  Key (user_id)=(2) is duplicated.
Enter fullscreen mode Exit fullscreen mode

Wow! What was that? It seems that the index can not be created: some existing rows already infringe the constraint.

Oopsy

Let's explore the data a bit. Can we spot culprits?

=# select user_id from shopping_carts order by user_id;

 user_id 
---------
       1
       2
       2
       3
       4
       5
Enter fullscreen mode Exit fullscreen mode

Oopsy, "2" appears twice. This can be quite an issue. It can go wrong for 3 reasons:

  • The user can see the wrong cart, leading to items disappearing
  • Your data team has to apply heuristics to reunify the duplicates
  • Worst case: it leads to complex bugs because everyone suppose shopping carts are unique

Unfortunately, postgres cannot create a unique index as long as the duplicate items exist.

First things first: how many duplicates do you have? Some self joining request does the trick:

=# select
-=   count(*)
-= from shopping_carts a
-= join (select
-=   user_id,
-=   count(*)
-= from shopping_carts
-= group by user_id
-= having count(*) > 1) as b
-=   on b.user_id = a.user_id; 

 count 
-------
 76533
Enter fullscreen mode Exit fullscreen mode

There are too many duplicates for a manual cleanup... Create a dashboard to track the duplicate injection rate. If your table already includes a created_at datetime, you can use a variation of the previous query:

=# select
-#   last_occurrence,
-#   count(*)
-# from shopping_carts a
-# join (select
-#   user_id,
-#   max(date(created_at)) as last_occurrence,
-#   count(*)
-# from shopping_carts
-# group by user_id
-# having count(*) > 1) as b
-#   on b.user_id = a.user_id
-# group by last_occurrence
-# order by last_occurrence DESC;

 last_occurrence | count 
-----------------+-------
 2022-09-13      |   193
 2022-09-12      |   484
Enter fullscreen mode Exit fullscreen mode

We have the duplicate rate per day. This is useful to check if our future fixes work.

Remove the root cause

Let's consider your endpoint implements the following pseudo-ruby code:

def create
  ShoppingCarts.create!(...)
end
Enter fullscreen mode Exit fullscreen mode

A quick fix is to check the existence of the row before creating it:

def create
  unless ShoppingCarts.where(user_id: current_user.id).exists? do
    ShoppingCarts.create!(...)
  end
end
Enter fullscreen mode Exit fullscreen mode

This way no duplicate creation, right? You deploy and check the numbers the next day:

 last_occurrence | count 
-----------------+-------
 2022-09-14      |     8
 2022-09-13      |   193
 2022-09-12      |   484
Enter fullscreen mode Exit fullscreen mode

Wait what? Still more duplicates!!!!

Transactions

This is a race condition happening. Two requests check existence and insert at the same time.

request 1 request 2 select count(*) from shopping_carts where id = 2
exists()... == false - 0
- exists()... == false 0
insert()... - 1
- insert()... 2

Transactions are gonna help, but not directly.

def create
  ActiveRecord::Base.transaction do
    unless ShoppingCarts.where(user_id: current_user.id).exists? do
      ShoppingCarts.create!(...)
    end
  end
end
Enter fullscreen mode Exit fullscreen mode

You deploy this code and watch its effects:

 last_occurrence | count 
-----------------+-------
 2022-09-15      |    15
 2022-09-14      |     8
 2022-09-13      |   193
 2022-09-12      |   484
Enter fullscreen mode Exit fullscreen mode

The problem is not solved. Transactions are configured to committed read isolation by default. If you replay the previous race condition under this isolation, you'll find no violation. Transactions do not avoid race conditions. Fortunately, there are tools to stop the bleeding: locks.

Locks are usually managed by your database for operations like indexation, etc. Locks are a common way to mutually exclude processes from shared resources. You can also manually lock a row. Mutual exclusion is preventing 2 processes to access the same resource.

In postgres you can lock rows or tables using the for udpate keyword in a select query:

def create
  ActiveRecord::Base.transaction do
    unless ShoppingCarts.lock("FOR UPDATE").where(user_id: current_user.id).exists? do
      ShoppingCarts.create!(...)
    end
  end
end
Enter fullscreen mode Exit fullscreen mode

You deploy and check the numbers the next day:

 last_occurrence | count 
-----------------+-------
 2022-09-16      |    11
 2022-09-15      |    15
 2022-09-14      |     8
 2022-09-13      |   193
 2022-09-12      |   484
Enter fullscreen mode Exit fullscreen mode

Oopsy, the lock does not work? Because the select query returns no rows, no lock is upheld. We need to lock something that exists. Great thing that we have a unique user!

def create
  current_user.with_lock do
    unless ShoppingCarts.where(user_id: current_user.id).exists? do
      ShoppingCarts.create!(...)
    end
  end
end
Enter fullscreen mode Exit fullscreen mode

Here the transaction is finally helpful. The lock is released when the transaction is committed. You can deploy and see the result:

 last_occurrence | count 
-----------------+-------
 2022-09-16      |    11
 2022-09-15      |    15
 2022-09-14      |     8
 2022-09-13      |   193
 2022-09-12      |   484
Enter fullscreen mode Exit fullscreen mode

No line for 2022-09-17: victory! Next, we remove duplicates and create the unique index.

Removing duplicates

Removing duplicates can be a touchy matter. I find the following heuristic true most of the time: the latest modified row is the most updated. Some frameworks automatically generate updated_at columns. This can be useful to apply the heuristic. We use updated_at column to discriminate the old rows to delete:

select
  b.id
from shopping_carts a
join (select
  user_id,
  max(updated_at) AS last_updated_at
from shopping_carts
group by user_id
having (count(*) > 1)) b
  on a.user_id = b.user_id
  and a.updated_at < b.last_updated_at
Enter fullscreen mode Exit fullscreen mode

This way only the last updated rows remain. Let's check if there's no duplicate left:

=# select
-=   count(*)
-= from shopping_carts a
-= join (select
-=   user_id,
-=   count(*)
-= from shopping_carts
-= group by user_id
-= having count(*) > 1) as b
-=   on b.user_id = a.user_id; 

 count 
-------
     0
Enter fullscreen mode Exit fullscreen mode

Nice!

Creating the unique index

With no duplicate in the table left, we can create the index:

=# create unique index uidx_shopping_carts_on_user_id on shopping_carts (user_id);
Enter fullscreen mode Exit fullscreen mode

This time, there's no error. Every user has at most one shopping cart. When attempting to create duplicates, you encounter the following error:

=# insert into shopping_carts(user_id) values (2), (2);

ERROR:  duplicate key value violates unique constraint "uidx_shopping_carts_by_user_id"
Enter fullscreen mode Exit fullscreen mode

No process can violate the business rule, even under race conditions! Great value for the buck if you ask me.

Conclusion

I hope you learned some sql knowledge. Here are the main takeaways:

  1. Learn some sql syntax to navigate easily in your data
  2. Columns created_at and updated_at managed by your ORM are great tools for maintenance
  3. Transactions do not avoid race conditions
  4. Introduce unique constraints before you feed the data (this applies to other constraints)

I had to solve a similar problem at work and we are hiring talented engineers to solve this kind of problems.
Take care.

Photo by Stefan Cosma on Unsplash

Top comments (0)