We are having intermittent performance issues with our OpenSearch cluster. There are multiple daily occurrences of long-running OpenSearch queries when our Lambda function executes. When this occurs, there is a record like the example below in the OpenSearch es-application-logs:
2022-09-09T11:37:35.083-04:00 [2022-09-09T08:37:34,853][WARN ][o.o.t.TransportService ] [09122640e72c461b2c57179fcc08d339] Received response for a request that has timed out, sent [44029ms] ago, timed out [24016ms] ago, action [PATH], node [{d52117a284ca0ada6bd69784bbd8a8c5}{ekzv2YqtT8q7b-TVvXGKEQ}{aGQCli5rTFKAeGjWRihvTQ}{IP}{IP}{dimr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, _AMAZON_INTERNAL, cross_cluster_transport_address=IP, shard_indexing_pressure_enabled=true, _AMAZON_INTERNAL}], id [3986058]
Normally these queries complete with sub-second response times. The impact to our application is that requests to API Gateway occasionally time out because of the long-running Lambda, resulting in a 504 Gateway Timeout being returned to the API consumer.
Here is an example of the OpenSearch HTTP request, followed by a simplified sketch of how the Lambda issues it:
{
"method": "POST",
"hostname": "vpc-tycc-analytics-dev-os-qs4p2rpyars2oacibpdih4ctmu.ca-central-1.es.amazonaws.com",
"query": {
"size": "500",
"_source": [
"CurrentAgentSnapshot"
],
"sort": "CurrentAgentSnapshot.Configuration.Username:asc"
},
"headers": {
"Content-Type": "application/json",
"host": "vpc-tycc-analytics-dev-os-qs4p2rpyars2oacibpdih4ctmu.ca-central-1.es.amazonaws.com"
},
"body": "{"query":{"bool":{"must_not":{"term":{"CurrentAgentSnapshot.Configuration.RoutingProfile.Name":"Basic Routing Profile"}}}}}",
"protocol": "https:",
"path": "/agent-records-real-time-alias/_search"
}
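Below is that sketch, assuming the @opensearch-project/opensearch Node.js client; the client setup, the 5-second timeout value, and the omission of request signing are illustrative rather than our exact code.

import { Client } from '@opensearch-project/opensearch';

// Illustrative client setup; in the real Lambda the connection would also be
// signed with SigV4 credentials (omitted here for brevity).
const client = new Client({
  node: 'https://vpc-tycc-analytics-dev-os-qs4p2rpyars2oacibpdih4ctmu.ca-central-1.es.amazonaws.com',
});

export async function fetchAgentRecords() {
  // Same search as the logged request, with an explicit per-request timeout so a
  // slow query surfaces as a client-side error instead of stalling the Lambda.
  const response = await client.search(
    {
      index: 'agent-records-real-time-alias',
      size: 500,
      _source: ['CurrentAgentSnapshot'],
      sort: 'CurrentAgentSnapshot.Configuration.Username:asc',
      body: {
        query: {
          bool: {
            must_not: {
              term: {
                'CurrentAgentSnapshot.Configuration.RoutingProfile.Name': 'Basic Routing Profile',
              },
            },
          },
        },
      },
    },
    { requestTimeout: 5000 } // fail fast (5 s) instead of waiting out the default
  );
  return response.body.hits.hits;
}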
Could you please investigate and provide feedback on what is causing the performance issue, along with recommended remediation actions?
Let me know if you require additional information.
Thanks,
BTW
The following is our cluster specification (an illustrative CDK sketch of an equivalent domain follows the list):
• Deployment type: Dev/Test
• OpenSearch version: 1.2 (latest)
• Availability Zones: 1AZ
• Instance type: t3.small.search
• Number of nodes: 2
• Storage type: EBS
• EBS volume type: General Purpose (SSD) - gp2
• EBS storage size per node: 10 GB
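For reproduction purposes, an equivalent dev domain can be described with an aws-cdk-lib sketch like the one below; this is illustrative only (not our actual provisioning code), and the stack/construct names are placeholders.

import { App, Stack } from 'aws-cdk-lib';
import { Domain, EngineVersion } from 'aws-cdk-lib/aws-opensearchservice';
import { EbsDeviceVolumeType } from 'aws-cdk-lib/aws-ec2';

const app = new App();
const stack = new Stack(app, 'AnalyticsDevStack'); // placeholder stack name

// Mirrors the spec above: OpenSearch 1.2, single AZ, 2 x t3.small.search data
// nodes, 10 GB gp2 EBS per node.
new Domain(stack, 'AnalyticsDevDomain', {
  version: EngineVersion.openSearch('1.2'),
  capacity: {
    dataNodes: 2,
    dataNodeInstanceType: 't3.small.search',
  },
  ebs: {
    enabled: true,
    volumeType: EbsDeviceVolumeType.GP2,
    volumeSize: 10, // GiB per data node
  },
  zoneAwareness: { enabled: false }, // single Availability Zone
});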
Many thanks. Please find below the log list for the first part of your answer:
2022-10-10T18:19:51.482-04:00 [2022-10-10T15:19:50,699][WARN ][o.o.m.j.JvmGcMonitorService] [09122640e72c461b2c57179fcc08d339] [gc][3851910] overhead, spent [815ms] collecting in the last [1.1s]
2022-10-10T18:20:53.519-04:00 [2022-10-10T15:20:52,911][WARN ][o.o.m.j.JvmGcMonitorService] [09122640e72c461b2c57179fcc08d339] [gc][3851943] overhead, spent [29.4s] collecting in the last [30.1s]
2022-10-10T22:18:23.563-04:00 [2022-10-10T19:18:23,014][WARN ][o.o.t.TransportService ] [09122640e72c461b2c57179fcc08d339] Received response for a request that has timed out, sent [54249ms] ago, timed out [34233ms] ago, action [PATH], node [{d52117a284ca0ada6bd69784bbd8a8c5}{ekzv2YqtT8q7b-TVvXGKEQ}{vBbzFjeDTymDJ7NEPqJBLg}{IP}{IP}{dimr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, AMAZON_INTERNAL, cross_cluster_transport_address=IP, shard_indexing_pressure_enabled=true, AMAZON_INTERNAL}], id [13800571]
2022-10-10T22:18:23.566-04:00 [2022-10-10T19:18:23,018][WARN ][o.o.t.TransportService ] [09122640e72c461b2c57179fcc08d339] Received response for a request that has timed out, sent [33232ms] ago, timed out [13207ms] ago, action [PATH], node [{d52117a284ca0ada6bd69784bbd8a8c5}{ekzv2YqtT8q7b-TVvXGKEQ}{vBbzFjeDTymDJ7NEPqJBLg}{IP
[gc][3851910] overhead, spent [815ms] collecting in the last [1.1s]
[gc][3851943] overhead, spent [29.4s] collecting in the last [30.1s]
The cluster seems to be spending considerable time in GC, which is not good, but there is also a roughly 4-hour gap between the GC log events and the timeout log events. Can you corroborate this by checking the CloudWatch metrics mentioned in #2 of the initial comment?
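For example, a sketch along these lines (AWS SDK for JavaScript v3; the domain name, account ID, and time window are placeholders) would pull the maximum JVMMemoryPressure around the two warning windows:

import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({ region: 'ca-central-1' });

export async function checkJvmPressure() {
  const result = await cw.send(
    new GetMetricStatisticsCommand({
      Namespace: 'AWS/ES', // Amazon OpenSearch Service domains publish metrics under AWS/ES
      MetricName: 'JVMMemoryPressure',
      Dimensions: [
        { Name: 'DomainName', Value: 'tycc-analytics-dev-os' }, // placeholder domain name
        { Name: 'ClientId', Value: '123456789012' },            // placeholder account ID
      ],
      StartTime: new Date('2022-10-10T18:00:00-04:00'),
      EndTime: new Date('2022-10-10T23:00:00-04:00'),
      Period: 60,
      Statistics: ['Maximum'],
    })
  );
  // Sustained values near 100% around 18:20 and 22:18 local time would line up
  // with the JvmGcMonitorService warnings and the transport timeouts.
  console.log(result.Datapoints);
}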