What specific aspects of AWS Lambda, EFS, or SQLite3's configuration cause SQLite to fail when used concurrently from Lambda over an EFS filesystem, despite NFSv4 fcntl() locking support?


Problem Statement: I would like to use sqlite3 over EFS concurrently from Lambda, as a provisionless SQL database. I'm avoiding the term "serverless" here because Aurora Serverless exists, but you still have to provision ACUs. I'm looking for a "provisionless" SQL solution, i.e. no ACUs to provision or manage, and only pay for the compute you actually use.

There is ample anecdotal evidence across the Internet that sqlite3 does not work well when multiple sqlite3 clients concurrently share the same database file on a remote filesystem. But this is exactly what I'm trying to do. I found the mountain of anecdotal evidence unconvincing and unsatisfying, so I decided to test it myself.

It took only a few minutes of testing to get the error:

Error: database is locked

It appears that some stale locks are left behind, as the database is NOT locked at the point where the error is encountered.
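
A minimal sketch of a concurrency test of this kind (an illustration under assumptions, not the original test script). The function name `run_writers` and the table `t` are hypothetical, and threads with separate connections stand in for concurrent Lambda invocations. On Lambda, `db_path` would point at a file on the EFS mount, for example a hypothetical `/mnt/efs/test.db`.

```python
import sqlite3
import threading


def run_writers(db_path, workers=4, inserts=50, timeout=30.0):
    """Write to one SQLite file from several connections.

    Returns (row_count, errors), where errors holds the messages of any
    sqlite3.OperationalError raised, e.g. "database is locked".
    """
    # Create the table once before the writers start.
    setup = sqlite3.connect(db_path)
    setup.execute("CREATE TABLE IF NOT EXISTS t (worker INTEGER, n INTEGER)")
    setup.commit()
    setup.close()

    errors = []

    def writer(worker_id):
        # Each writer opens its own connection, as each invocation would.
        conn = sqlite3.connect(db_path, timeout=timeout)
        try:
            for n in range(inserts):
                try:
                    conn.execute("INSERT INTO t VALUES (?, ?)", (worker_id, n))
                    conn.commit()
                except sqlite3.OperationalError as exc:
                    conn.rollback()
                    errors.append(str(exc))
        finally:
            conn.close()

    threads = [threading.Thread(target=writer, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    check = sqlite3.connect(db_path)
    rows = check.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    check.close()
    return rows, errors
```

Every insert either lands as a row or shows up in `errors`, so a run where `errors` contains "database is locked" even though no writer is still active would match the stale-lock symptom described above.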

My question is WHY does this happen?

Here's the source code of the Unix VFS module where SQLite accesses the filesystem: https://www.sqlite.org/src/file?name=src/os_unix.c

Ctrl-F to the comment section "Posix Advisory Locking" which begins with the phrase "POSIX advisory locks are broken by design" (insert facepalm emoji here) ... and you'll see that this clearly should work, as all known edge-cases and shortcomings of NFS locking have been thoroughly analyzed and handled in the code.
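
One way to check the locking layer on its own, without SQLite in the picture, is a small fcntl() probe. This is a sketch under assumptions: the helper names `child_can_lock` and `probe` are hypothetical, and it relies on `os.fork()`, so it is Unix-only. POSIX locks are owned by the process rather than the file descriptor, which is why the contending attempt has to come from a separate process.

```python
import fcntl
import os


def child_can_lock(path):
    """Return True if a forked child can take an exclusive lock on path."""
    pid = os.fork()
    if pid == 0:
        # Child: try a non-blocking exclusive lock, report via exit code.
        status = 1
        try:
            with open(path, "r+b") as f:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                status = 0
        except OSError:
            status = 1
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def probe(path):
    """Return (blocked_while_held, free_after_release) for an existing file."""
    with open(path, "r+b") as holder:
        fcntl.lockf(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # While this process holds the lock, the child should be refused.
        blocked_while_held = not child_can_lock(path)
        fcntl.lockf(holder, fcntl.LOCK_UN)
        # After the unlock, the child should succeed; if it does not,
        # a lock has been left behind.
        free_after_release = child_can_lock(path)
    return blocked_while_held, free_after_release
```

On a filesystem where advisory locking behaves, `probe` should return `(True, True)`. Running it against a file on the EFS mount from inside Lambda would show whether the lock is enforced and whether it is released.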

Some users have reported that EFS locking works as expected: https://stackoverflow.com/questions/53177938/is-it-safe-to-use-flock-on-aws-efs-to-emulate-a-critical-section

So, what could be the actual root cause of sqlite apparently failing to properly acquire and release locks over EFS?

Is there something broken in the way the Lambda container mounts the EFS volume? My test Lambda used python zip packaging, so it relies on the Lambda platform's container behind the scenes.

Can someone from AWS internal engineering please chime in on whether the issue could be caused by the mount options used by the Lambda platform when mounting EFS volumes? See: https://stackoverflow.com/questions/43914819/file-locks-support-in-docker-volumes-of-nfs4-shares

AlexR
asked 1 month ago, 238 views
1 answer

Some new notes while I continue to investigate this:

  1. The error seems to permanently corrupt a database once it occurs. It is not simply a timeout waiting for locks while under load.
  2. This might be a limitation of sqlite3 independent of Lambda or EFS: https://forum.djangoproject.com/t/sqlite-and-database-is-locked-error/26994
  3. Also, NFSv4 only guarantees close-to-open consistency, so the fact that concurrent updates work at all is actually kind of surprising.
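
To tell a lock timeout apart from actual corruption (note 1), a check along these lines may help. It is a sketch under assumptions: the function name `check_database` and the 5000 ms default are hypothetical choices, while `PRAGMA busy_timeout` and `PRAGMA integrity_check` are standard SQLite pragmas.

```python
import sqlite3


def check_database(db_path, busy_timeout_ms=5000):
    """Return the integrity_check result rows, or the error message raised."""
    conn = sqlite3.connect(db_path)
    try:
        # Wait for locks instead of failing immediately.
        conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        return [row[0] for row in rows]
    except sqlite3.DatabaseError as exc:
        # Covers "database is locked" as well as corruption errors,
        # since OperationalError is a subclass of DatabaseError.
        return [str(exc)]
    finally:
        conn.close()
```

A healthy file yields a single `ok` row. A result of "database is locked" after the full timeout, with no other client running, points at a stale lock rather than a damaged file.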
AlexR
answered 1 month ago
