What specific aspects of AWS Lambda, EFS, or Sqlite3's configuration cause sqlite to fail when used concurrently from Lambda over an EFS filesystem despite nfsv4 fcntl() locking support?

Question

Problem Statement: I would like to use sqlite3 over EFS concurrently from Lambda, as a provisionless SQL database. I'm avoiding the term "serverless" here because Aurora Serverless exists, but you still have to provision ACUs. I'm looking for a "provisionless" SQL solution, i.e. no ACUs to provision or manage, and only pay for the compute you actually use.

There is ample anecdotal evidence across the Internet that sqlite3 does not work well when multiple clients concurrently share the same database file on a remote filesystem. But this is exactly what I'm trying to do. I found the mountain of anecdotal evidence unconvincing and unsatisfying, so I decided to test it myself.

It took only a few minutes of testing to get the error:

> Error: database is locked

It appears that some stale locks are left behind, as the database is NOT locked at the point where the error is encountered.
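For readers who want to try a similar test, here is a minimal sketch of a concurrent-writer stress run. It is not the original test code: the `events` table, the function names, and the 5-second busy timeout are all assumptions made for illustration. Threads stand in for concurrent Lambda invocations so that the sketch runs locally; on Lambda, each invocation would call something like `write_batch` against a database file on the EFS mount.

```python
import sqlite3
import threading


def write_batch(db_path, worker_id, n_rows, timeout_s=5.0):
    """Insert n_rows rows one transaction at a time.

    Returns how many inserts failed with a 'database is locked' style error.
    """
    locked_errors = 0
    conn = sqlite3.connect(db_path, timeout=timeout_s)
    try:
        for seq in range(n_rows):
            try:
                with conn:  # commits on success, rolls back on exception
                    conn.execute(
                        "INSERT INTO events (worker, seq) VALUES (?, ?)",
                        (worker_id, seq),
                    )
            except sqlite3.OperationalError as exc:
                if "locked" in str(exc):
                    locked_errors += 1
                else:
                    raise
    finally:
        conn.close()
    return locked_errors


def run_stress(db_path, n_workers=4, n_rows=50):
    """Run several writers at once; return (rows_committed, locked_errors)."""
    setup = sqlite3.connect(db_path)
    setup.execute("CREATE TABLE IF NOT EXISTS events (worker INTEGER, seq INTEGER)")
    setup.commit()
    setup.close()

    results = [0] * n_workers

    def target(worker_id):
        # Each thread opens its own connection inside write_batch.
        results[worker_id] = write_batch(db_path, worker_id, n_rows)

    threads = [threading.Thread(target=target, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    check = sqlite3.connect(db_path)
    total = check.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    check.close()
    return total, sum(results)
```

Every attempted insert should end up either committed or counted as a lock error, so `rows_committed + locked_errors` should equal `n_workers * n_rows`. A persistent gap, or lock errors that continue after all writers have stopped, would point at stale locks rather than ordinary contention.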

My question is WHY does this happen?

Here's the source code of the unix vfs module where Sqlite accesses the filesystem: https://www.sqlite.org/src/file?name=src/os_unix.c

Ctrl-F to the comment section "Posix Advisory Locking" which begins with the phrase "POSIX advisory locks are broken by design" (insert facepalm emoji here) ... and you'll see that this clearly should work, as all known edge-cases and shortcomings of NFS locking have been thoroughly analyzed and handled in the code.

Some users have reported that EFS locking works as expected: https://stackoverflow.com/questions/53177938/is-it-safe-to-use-flock-on-aws-efs-to-emulate-a-critical-section
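One way to check the locking primitive on its own, separately from sqlite, is a small probe like the sketch below. It is an illustration only: the function names are made up, it takes a whole-file lock through Python's `fcntl.lockf` rather than the specific byte ranges sqlite locks, and it needs a POSIX system. A child process is used because fcntl-style locks are owned per process, so a second lock attempt from the same process would not conflict.

```python
import fcntl
import subprocess
import sys

# Child script: try a non-blocking exclusive lock on the file given as argv[1].
# Exit code 0 means the lock was free, 1 means another process holds it.
CHILD = """
import fcntl, sys
with open(sys.argv[1], "r+b") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        sys.exit(1)
sys.exit(0)
"""


def other_process_can_lock(path):
    """Return True if a separate process can take an exclusive lock on path."""
    result = subprocess.run([sys.executable, "-c", CHILD, path])
    return result.returncode == 0


def probe_locking(path):
    """Return (blocked_while_held, free_after_release) for the file at path."""
    with open(path, "a+b") as f:  # creates the file if it does not exist
        fcntl.lockf(f, fcntl.LOCK_EX)
        blocked_while_held = not other_process_can_lock(path)
        fcntl.lockf(f, fcntl.LOCK_UN)
        free_after_release = other_process_can_lock(path)
    return blocked_while_held, free_after_release
```

On a filesystem where advisory locking behaves correctly, both values should be `True`. Running the holder and the prober from two different Lambda execution environments against the same EFS path would be the more relevant test, but that requires coordinating two invocations and is not shown here.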

So, what could be the actual root cause of sqlite apparently failing to properly acquire and release locks over EFS?

Is there something broken in the way the Lambda container mounts the EFS volume? My test Lambda used Python zip packaging, so it relies on the Lambda platform's container behind the scenes.

Can someone from AWS internal engineering please chime in on whether the issue could be caused by the mount options used by the Lambda platform when mounting EFS volumes? See: https://stackoverflow.com/questions/43914819/file-locks-support-in-docker-volumes-of-nfs4-shares

Answer

Some new notes while I continue to investigate this:
1. The error seems to permanently corrupt a database once it occurs. It is not simply a timeout waiting for locks while under load.
2. This might be a limitation of sqlite3 independent of Lambda or EFS: https://forum.djangoproject.com/t/sqlite-and-database-is-locked-error/26994
3. Also, NFSv4 only guarantees close-to-open consistency, so the fact that concurrent updates work at all is actually somewhat surprising.
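To tell a lock timeout under load apart from a database that stays locked or is actually damaged, a diagnostic along these lines may help. This is a sketch under assumptions: the function name, the report keys, and the 10-second timeout are invented for illustration. It reports the journal mode (WAL relies on shared memory and is documented by SQLite as unsuitable for network filesystems), tries to take and release a write lock with a generous busy timeout, and runs `PRAGMA integrity_check`.

```python
import sqlite3


def diagnose(db_path, busy_timeout_s=10.0):
    """Return a small report on journal mode, write-lock availability, and integrity."""
    # isolation_level=None: Python issues no implicit BEGIN, so the
    # transaction statements below are exactly what SQLite sees.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_s, isolation_level=None)
    report = {"journal_mode": None, "write_lock_ok": False, "integrity": None}
    try:
        try:
            report["journal_mode"] = conn.execute("PRAGMA journal_mode").fetchone()[0]
        except sqlite3.DatabaseError:
            pass  # leave as None if the file cannot even be read

        try:
            # BEGIN IMMEDIATE asks for the write lock up front; with no other
            # active writer this should succeed within the busy timeout.
            conn.execute("BEGIN IMMEDIATE")
            conn.execute("ROLLBACK")
            report["write_lock_ok"] = True
        except sqlite3.OperationalError:
            pass  # still locked after waiting busy_timeout_s

        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
            report["integrity"] = [row[0] for row in rows]  # ["ok"] when healthy
        except sqlite3.DatabaseError:
            pass  # could not run the check (locked or unreadable)
    finally:
        conn.close()
    return report
```

If `write_lock_ok` stays `False` while no writer is running, that is consistent with a stale lock; if `integrity` comes back as anything other than `["ok"]`, the file itself has likely been damaged.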
