DocumentDB 5.0: intermittent "Error in query execution" with code 40 and empty errorLabels — what triggers it?

We are observing intermittent server-side query errors on a DocumentDB 5.0 cluster. They started after a load incident in early May 2026. The driver receives a MongoServerError with code: 40 and an empty errorLabels array, so it cannot auto-retry. We cannot find documentation that explains what code 40 means in the DocumentDB context; in community MongoDB it is ConflictingUpdateOperators, which makes no sense for read-only queries.

Cluster

  • DocumentDB 5.0.0
  • 1 writer (db.r6g.2xlarge) + 4 readers (db.r6g.xlarge)
  • Region: us-east-1
  • Parameter group: profiler enabled (sampling 0.5, threshold 10s); tls=disabled is set but pending-reboot (never applied)
  • Cluster CPU healthy: writer ~19% avg, readers ~27-30% avg in peak hours, no alarm has fired

Error shape (raw payload from the driver)

{
  "name": "MongoServerError",
  "message": "Error in query execution",
  "code": 40,
  "ok": 0,
  "errorLabels": []
}

The reply contains only {ok, operationTime, code} — no errInfo, no errmsg detail, no codeName. Because errorLabels is empty (no TransientTransactionError / RetryableWriteError), the Node.js driver does not auto-retry.

Symptoms

  • 0.054% rate on production read endpoints (about 290 errors/day).
  • Every occurrence is fast: 15-300 ms.
  • Distributed across many distinct documents (125+ unique IDs in 24h).
  • Started exactly when our cluster had a load spike — zero occurrences in the 30 days prior.
  • Rate is escalating slowly: 0.45 → 0.94 → 1.48 errors/1000 reqs over successive days at the same time-of-day window (12-14 UTC).
  • Affects multiple services, but is loudest on the service with the highest read volume on indexed findOne + populate patterns.

Driver

  • Node.js mongodb 4.17.2 / mongoose 6.13.8
  • URI: mongodb://...@<cluster-endpoint>:27017/<db>?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false

What we have ruled out

  1. Saturation by heavy aggregations — only 1.4% of failures correlate with any slow query (>10s) running in the profiler.
  2. A specific bad reader — 1500 sequential findOne+populate calls per instance via direct connection (writer + each of 4 readers, 7500 total) returned 0 failures.
  3. Profiler — disabled briefly during a peak window. The per-1000 rate did not drop; it was slightly higher in the disabled window.
  4. A relief reader instance added during the original load spike — removed; rate did not change.
  5. Out-of-VPC reproduction with the same connection string, same mongoose version, same query shape, concurrency 10 — 0 failures across 7500 attempts. The failure shape is what DocumentDB itself returns to traffic from inside the VPC.
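For anyone who wants to repeat checks like 2 and 5, a sequential run can be structured roughly as follows. This is a minimal sketch, not the harness actually used: `countFailures` and its `op` callback are hypothetical names, and `op` stands in for one findOne + populate call against whichever endpoint is under test.

```javascript
// Hypothetical harness: runs `attempts` sequential calls of `op` and tallies
// failures by server error code. `op` is assumed to be an async function that
// performs one read (for example a findOne + populate) and throws on error.
async function countFailures(op, attempts) {
  const byCode = new Map(); // error code -> number of occurrences
  let failures = 0;
  for (let i = 0; i < attempts; i++) {
    try {
      await op(i); // sequential on purpose: one call in flight at a time
    } catch (err) {
      failures++;
      const key = err && err.code !== undefined ? err.code : "unknown";
      byCode.set(key, (byCode.get(key) || 0) + 1);
    }
  }
  return { attempts, failures, byCode };
}
```

Running it once per instance endpoint with `attempts` set to 1500, and reading `byCode.get(40)` from each result, would give the per-instance comparison described above.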

Questions

  1. What does code 40 mean in DocumentDB 5.0 specifically? It does not behave like community MongoDB's ConflictingUpdateOperators because the failing operations are read-only findOne calls.
  2. Is there a known cluster-internal condition (compaction, internal election/failover, snapshot or backup window, ttl_monitor activity, profiler sampling, replica catchup) that would cancel an in-flight read query on a secondary and surface it to the client as code: 40 with no errorLabels?
  3. Why does the response not include an errorLabels value the driver can use to retry? Is this expected, or should this be reported as a driver/server compatibility issue?
  4. Is there any cluster-level diagnostic (CloudWatch metric, parameter, audit log) that correlates with these cancellations?
  5. Has anyone seen this rate grow over time and then plateau, or correlate with cluster size / time since last failover?

Happy to provide CloudWatch query IDs, profiler samples, or cluster identifier privately. Thanks!

2 Answers

While official documentation for Code 40 in DocumentDB is sparse, this behavior (especially on read-only findOne + populate patterns) strongly suggests an internal Read-View or History Store conflict within the shared storage volume.

  • Internal Execution Failure: In DocumentDB's architecture, Code 40 often surfaces when the query processor cannot maintain a consistent snapshot of the data. If the storage layer's garbage collection (purging old versions) is too aggressive or struggling after a load spike, the reader may lose the "version" of the document it is trying to fetch.
  • Empty errorLabels: The absence of labels like TransientTransactionError indicates that the engine does not classify this as a standard retryable network/failover event, but rather a hard execution break at the storage level.
  • Escalation Pattern: The fact that the rate is growing suggests a "fragmentation" or "backlog" issue in the underlying storage history that hasn't cleared since the May incident.

I would try the following actions:

  1. Monitor LowOldestReadTimestamp: Check CloudWatch for correlations between this metric and your error spikes.
  2. Manual Retry Logic: Since the driver isn't getting the labels, implement a manual client-side retry specifically for code: 40. Given your 15–300ms error duration, a second attempt will likely succeed.
  3. Audit Logs: Enable DML auditing briefly to see if the server-side logs provide more context than the driver’s raw payload.
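The manual retry in step 2 could look roughly like the sketch below. It assumes the error object carries `code: 40` as in the payload from the question; the helper names (`withCode40Retry`, `isQueryExecutionError`), the retry count, and the delay are illustrative assumptions, not values recommended by AWS.

```javascript
// Sketch: retry a read a bounded number of times when the server returns code 40.
// Only wrap idempotent reads; retry count and delay are assumptions to tune.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// True for the error shape shown in the question (code 40).
function isQueryExecutionError(err) {
  return Boolean(err) && err.code === 40;
}

async function withCode40Retry(op, { retries = 2, delayMs = 50 } = {}) {
  let attempt = 0;
  for (;;) {
    try {
      return await op();
    } catch (err) {
      // Anything other than code 40, or an exhausted budget, is rethrown as-is.
      if (!isQueryExecutionError(err) || attempt >= retries) throw err;
      attempt++;
      await sleep(delayMs * attempt); // simple linear backoff
    }
  }
}
```

With mongoose this would wrap the query execution, for example `await withCode40Retry(() => Model.findOne(filter).populate(path).exec())`, where `Model`, `filter`, and `path` are placeholders. Logging each retried error is worth adding so the underlying rate stays visible.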

Hope that helps!

EXPERT
answered a month ago
  • Update: fully resolved. The root cause was on our side. A hot read path on a 12.6k-doc collection had no usable index for its filter, so every call was a COLLSCAN running 50–300ms. On DocDB 4.0 that was just slow; after our migration to 5.0 in late April, those slow in-flight reads started getting cancelled internally and surfaced to the client as code: 40 / "Error in query execution". The cluster itself was healthy by every external metric, which is why nothing showed up in CPU/MVCC/GC.
    We added the missing compound index online (background:true, ~2s for 12.6k docs) and the error rate went from a steady, escalating ~290/day straight to zero. More than 10 hours later, including peak hours, there have been no occurrences. Reader CPU also dropped ~35% instantly.
    Thanks again for pushing us toward the manual retry and the audit/profiler ideas. The audit log and profiler did not capture the failures themselves, but the profiler at 50ms is what surfaced the COLLSCAN-heavy pattern that pointed to the missing index. Really appreciate the help.
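For anyone checking for the same pattern, here is a minimal sketch of a helper that inspects an explain result for a collection scan. It assumes the plan is nested under `queryPlanner.winningPlan` with `stage`, `inputStage`, and `inputStages` fields; that shape, and the helper names `planStages` and `usesCollScan`, are assumptions to verify against real explain output from your cluster.

```javascript
// Sketch: collect every stage name in an explain plan and flag COLLSCAN.
// The field names below are an assumed plan shape, not a documented contract.
function planStages(node, out = []) {
  if (!node || typeof node !== "object") return out;
  if (typeof node.stage === "string") out.push(node.stage);
  if (node.inputStage) planStages(node.inputStage, out); // single child
  if (Array.isArray(node.inputStages)) {
    for (const child of node.inputStages) planStages(child, out); // multiple children
  }
  return out;
}

// True when any stage of the winning plan is a full collection scan.
function usesCollScan(explainResult) {
  const plan = explainResult && explainResult.queryPlanner
    ? explainResult.queryPlanner.winningPlan
    : undefined;
  return planStages(plan).includes("COLLSCAN");
}
```

With the Node.js driver the input would come from something like `await collection.find(filter).explain()`. If `usesCollScan` returns true for a hot filter, an index covering the filter fields is the candidate fix, as it was here.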

Based on the available documentation, I cannot find specific information about error code 40 in DocumentDB 5.0 or the "Error in query execution" message you're encountering. The sources provided do not contain details about this particular error code or its triggers in the DocumentDB context.

However, I can address some aspects of your situation based on documented behaviors:

Regarding retryable operations and errorLabels: Amazon DocumentDB does not currently support retryable writes. When using MongoDB 4.2+ compatible drivers (which enable retryable writes by default), you need to explicitly disable this feature in your connection string or client constructor. You've already done this with retryWrites=false in your connection string, which is correct.

The empty errorLabels array you're seeing means the error is not classified as a transient transaction error or retryable write error by DocumentDB, which explains why the driver doesn't auto-retry. This behavior aligns with DocumentDB's current limitations around retry semantics.

Regarding read preferences and failover: There is a documented issue with certain Java Mongo drivers where read preferences may not be honored correctly, and clients can enter a bad state after failover events. While your setup uses Node.js, it's worth noting that driver behavior during cluster state changes can sometimes produce unexpected results. Your use of secondaryPreferred with the cluster endpoint should generally work correctly, but driver-level connection handling during internal cluster events could potentially contribute to intermittent errors.

What I recommend:

  1. This error code and behavior do not appear in the public documentation, so open a support case with AWS to get authoritative guidance on what code 40 represents in DocumentDB 5.0.
  2. Request cluster-level diagnostic logs that might correlate with these error timestamps.
  3. Ask AWS Support specifically about any internal maintenance operations, compaction cycles, or replica synchronization behaviors that might trigger query cancellations.

The escalating rate pattern you're observing and its correlation with the load incident suggests this may be related to an internal cluster state that hasn't fully recovered or stabilized, but without access to internal DocumentDB diagnostics, I cannot provide a definitive explanation.

Sources
Functional differences: Amazon DocumentDB and MongoDB - Amazon DocumentDB
Troubleshooting connectivity issues - Amazon DocumentDB

answered a month ago
EXPERT
reviewed a month ago
