Curious Case of Database Transaction Deadlocks Using TransactionScope

We built an event driven system in my team over 2 years ago that sometime back ran into database transaction deadlock problems which puzzled me enough to want to take a deeper look into what happened. This mini-post is a bit of documentation of that investigation.

The service has 3 instances running, one per EC2 host on AWS. It is written in C#/.NET Core 3.1 with BackgroundServices and is backed by an Aurora MySql database. It picks off domain events from multiple queues, processes them by mutating the state of the local domain entities and persisting them back in the MySql database. All this happens concurrently and in some cases inside of database transaction blocks that look something like this:

	static async Task Main(string[] args)
	{
	using (var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled))
	{
	await Save(product);
	// we might update more entities
	scope.Complete();
	}
	}

	public static async Task Save(Product product)
	{
	using (var connection = new MySqlConnection(“...”))
	{
	await connection.OpenAsync();

	// select count(1) from Products where Id = @id
	if (await Exists(product, connection))
	// update Products set ... where Id = @id
	await Update(product, connection);
	else
	// insert into Products values(...)
	await Insert(product, connection);
	}
	}

view raw CrudeUpsert.cs hosted with ❤ by GitHub

We are using TransactionScope because the domain operation spans multiple entities and the corresponding data access logic for these entities is implemented in separate classes. The only way to put these in a single transaction in the domain layer is by using a unit of work like TransactionScope.

However, when this code ran in production we’d regularly end up with deadlocks:

So I decided to isolate the transactional code path and run it inside of a Parallel.For loop targeting a local MySql database to see if I can replicate these deadlocks:

	Parallel.For(0, 500,
	async index =>
	{
	var product = new Product
	{
	Id = id,
	Stock = index+1
	};

	using (var scope = new TransactionScope(
	TransactionScopeAsyncFlowOption.Enabled))
	{
	await Save(product);
	scope.Complete();
	}
	});

view raw ParallelInsert.cs hosted with ❤ by GitHub

It didn’t take too many iterations to replicate the error:

MySql.Data.MySqlClient.MySqlException (0x80004005): Deadlock found when trying to get lock; try restarting transaction…

Hmmm! 🤔

Note that a deadlock in and of itself is not the end of the world, because you can re-issue the deadlocked transaction because it gets rolled back by MySql. Its only a problem if it happens so regularly that it makes it difficult to make forward progress and/or results in data loss in other ways. In our case although data loss wasn’t much of an issue but repeated deadlocks were still worrisome. An increase in message volume, could very well make this problem worse in the future. So I decided to investigate and address this.

My first hunch was that may be the transactions are just taking too long due to the fact that we’re SELECTing and then either INSERTing or UPDATing and that’s causing contention with locks being held for too long. But how do I see what’s really happening?

Debugging Locking in MySql

There are at least 2 basic ways to see what’s going on with locks in MySql:

a. Show the most recent deadlocks by running show engine innodb status query against the database. This shows an output like so:

	=====================================
	2021-07-23 21:27:55 0x7f0fec4a3700 INNODB MONITOR OUTPUT
	=====================================
	...
	------------------------
	LATEST DETECTED DEADLOCK
	------------------------
	2021-07-23 21:26:29 0x7f0fd3558700
	*** (1) TRANSACTION:
	TRANSACTION 2631, ACTIVE 0 sec starting index read
	mysql tables in use 1, locked 1
	LOCK WAIT 4 lock struct(s), heap size 1136, 2 row lock(s)
	MySQL thread id 15, OS thread handle 139706363459328, query id 3531 172.20.0.1 root updating
	update TrxDb.Products set stock = 495 where Id = 1000 and Version = 1

	*** (1) HOLDS THE LOCK(S):
	RECORD LOCKS space id 2 page no 4 n bits 72 index PRIMARY of table `TrxDb`.`Products` trx id 2631 lock mode S locks rec but not gap
	Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 0
	...


	*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
	RECORD LOCKS space id 2 page no 4 n bits 72 index PRIMARY of table `TrxDb`.`Products` trx id 2631 lock_mode X locks rec but not gap waiting
	Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 0
	...


	*** (2) TRANSACTION:
	TRANSACTION 2632, ACTIVE 0 sec starting index read
	mysql tables in use 1, locked 1
	LOCK WAIT 4 lock struct(s), heap size 1136, 2 row lock(s)
	MySQL thread id 9, OS thread handle 139706363754240, query id 3533 172.20.0.1 root updating
	update TrxDb.Products set stock = 357 where Id = 1000 and Version = 1

	*** (2) HOLDS THE LOCK(S):
	RECORD LOCKS space id 2 page no 4 n bits 72 index PRIMARY of table `TrxDb`.`Products` trx id 2632 lock mode S locks rec but not gap
	Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 0
	...

	*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
	RECORD LOCKS space id 2 page no 4 n bits 72 index PRIMARY of table `TrxDb`.`Products` trx id 2632 lock_mode X locks rec but not gap waiting
	Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 0
	...

	*** WE ROLL BACK TRANSACTION (2)
	...
	----------------------------
	END OF INNODB MONITOR OUTPUT
	============================

view raw ShowLatestDeadlock.log hosted with ❤ by GitHub

This will list the most recent deadlock and show what locks the transactions held and what locks they were waiting for at the time of deadlock.

b. Log all deadlocks to MySql logs by enabling (i.e. setting it to “ON”, default is “OFF”) the innodb_print_all_deadlocks setting in system variable. This will mostly produce the same output but it will log all the deadlocks. MySql docs recommend enabling it only for debugging and then disabling it once the problem is resolved, probably because deadlock logging could adversely affect normal transaction performance. Although I left this setting enabled in our production database and haven’t noticed any significant impact on transaction processing.

Anyway, in this case, I can see both transaction 1 and 2 hold an S (shared) lock and are waiting for X (exclusive) lock to update the row. This means each transaction is waiting for a lock to write to the record but they each hold another lock that at least one of them needs to release before they can get the X lock, therefore, DEADLOCK!!!

The way these locks work is documented very well in MySql docs if you want to understand more! This is a very deep subject and it’s hard to keep all edge cases in your head at all times.

But why the S lock in the first place? I wondered if a simple SELECT statement was resulting in the S lock by any chance so I also ran just the SELECT statement in a transaction in SQL without committing/rolling back the transaction so that I can run diagnostic queries with the transaction still in RUNNING state:

	set autocommit = off;
	start transaction;
	select * from TrxDb.Products where Id = 1000;

view raw TrySimpleSelectInTrx.sql hosted with ❤ by GitHub

In order to see what kind of locks are being taken, I ran the following diagnostic query:

select * from performance_schema.data_locks;

Hmm! No locks! What if it locks only when done along with an UPDATE statement in the same transaction?

	set autocommit = off;
	Start transaction;
	select count(1) from TrxDb.Products where Id = 1000;
	update TrxDb.Products set Stock = 8 where Id = 1000 and Version = 1;

view raw TrySelectAndUpdateInTrx.sql hosted with ❤ by GitHub

The result from the diagnostic query:

Ok! just the X lock taken by the UPDATE statement (as expected) but still no S lock! Ignore the IX lock its an intention lock that always seems to be granted to all the transactions that request it. It doesn’t seem to affect other locks or transactions!

If I run 2 transactions almost simultaneously (in two separate tabs/sessions and with slightly different data values), the output of the diagnostic query shows that one transaction has got the X lock whilst the other one is waiting for it, but still no S lock in sight*:

After a while waiting, I get a LOCK WAIT TIMEOUT error, not a DEADLOCK:

The plot thickens!!!

Does transaction isolation level affect locking? 🤔

The default isolation level in MySql is REPEATABLE READ so I would assume that the default isolation level in TransactionScope is also REPEATABLE READ which makes the deadlock even weirder because the SQL script execution doesn’t deadlock. So I looked under the hood of TransactionScope and found that:

It internally creates an instance of a CommittableTransaction

Which uses SERIALISABLE as the default isolation level (this is also documented but at the time I didn’t bother to check the documentation that far)!…well! ok! Is that the reason for the deadlock then?

I changed my SQL scripts to set the isolation level to SERIALISABLE and ran them simultaneously-ish again (reason for artificial delays is to increase the likelihood of a deadlock by slightly altering the order in which locks get acquired, this likely also happens in repro code as well which is why it takes a few iterations for the deadlocks to pop-up. Without this the only error I get is LOCK WAIT TIMEOUT! 🤷‍♂️) :

	#Transaction 1
	set autocommit = off;
	set SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
	start transaction;
	select * from TrxDb.Products where Id = 1000;
	do sleep(8);
	update TrxDb.Products set stock = 8 where Id = 1000 and Version = 1;

	#Transaction 2
	set autocommit = off;
	set SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
	start transaction;
	do sleep(5);
	select * from TrxDb.Products where Id = 1000;
	update TrxDb.Products set stock = 8 where Id = 1000 and Version = 1;

view raw TrxWithSerialisable.sql hosted with ❤ by GitHub

And…DEADLOCK!!!

The diagnostic query result shows the locks:

Both transactions acquired S locks, and then they ended up waiting for X lock because once the S lock has been taken, an X lock cannot be granted because that could run the risk of modifying a record from underneath a reader. This is also what I saw in the original deadlock log output at the start.

But why does the SELECT statement take an S lock under serialisable isolation level**?

Well it turns out the answer to the former is pretty simple and somewhat anti-climactic:

SERIALIZABLE
This level is like REPEATABLE READ, but InnoDB implicitly converts all plain SELECT statements to SELECT … FOR SHARE if autocommit is disabled. If autocommit is enabled, the SELECT is its own transaction. It therefore is known to be read only and can be serialized if performed as a consistent (nonlocking) read and need not block for other transactions. (To force a plain SELECT to block if other transactions have modified the selected rows, disable autocommit.)
MySql docs

And a SELECT…FOR SHARE takes the S lock by design, so that’s the source of the S lock!

How did we fix this?

Often deadlocks might indicate application design problems, I noticed that we didn’t actually need to do an UPSERT type operation in the data layer but instead we can split it into independent INSERT and UPDATE operations and move the decision making up in the domain layer.

By splitting the UPSERT we effectively not only shortened the individual transactions but also made them more explicit and simpler. The domain code can then determine whether to create a new record if a record for that id doesn’t yet exist or to update an existing one with the state carried by the event. For all intents and purposes this is a domain decision, not a data access layer decision because domain has the context and the awareness of whether or not it has enough information to continue. The “exists” check need not lock anything so splitting also resulted in fewer locks being taken. We can run into a race where on server A the check says, “product does not exist, create it” and simultaneously on another server B, the same check happens but it ends up creating it first then the server A fails with a primary key violation. But since this system is message driven with automatic retries, a simple re-try a bit later on will appropriately end up updating the record and the system will be consistent again. We can deal with a little bit of a delayed processing.

All of this will of course vary from one system to another so this might not be the right or desirable thing to do in every situation but doing a retrieval to determine whether to insert or update all in one transaction is a longer running transaction by design and could create conditions for lock contention which could lead to deadlocks (as this post highlights).

If you want to play with the repro code and cook up additional scenarios, its on GitHub!

* The way I am running transactions here is of course different from the code because here I am not committing the transaction at all (because I want to catch them in action) and this has a side-effect of the transactions holding onto locks for longer than needed. But the fact that the code still deadlocks, leads me to make a calculated guess that my set up whilst not perfect is not entirely wrong! If you know of a better or more reliable way to set this experiment up, by all means, shout out in the comments!

** It turns out that serialisable isolation level is quite misunderstood (I certainly did), it makes you think that the transactions will execute one after the other in a deterministic fashion so you wonder why would they deadlock if the locks get taken in a deterministic fashion. But that’s not really the case! Giving the impression as though transactions happened one after the other is the not the same as physically executing one after the other. The locks taken in a serialisable transaction don’t prevent other serialisable transactions from starting, they only prevent phantom reads and dirty reads of uncommitted data. Meaning, the transactions will still overlap and deadlocks is how they achieve “one after another” i.e. by making it your responsibility as a engineer!

Curious Case of Database Transaction Deadlocks Using TransactionScope

Debugging Locking in MySql

Does transaction isolation level affect locking? 🤔

How did we fix this?

Published by Aman Agrawal

One Reply to “Curious Case of Database Transaction Deadlocks Using TransactionScope”

Leave a comment Cancel reply

Debugging Locking in MySql

Does transaction isolation level affect locking? 🤔

How did we fix this?

Share this:

Related

Published by Aman Agrawal

One Reply to “Curious Case of Database Transaction Deadlocks Using TransactionScope”

Leave a comment Cancel reply