AWS for SAP Support Playbook
The purpose of this document is to offer general guidance on how to troubleshoot issues related to SAP on AWS and to describe the scope of support. It also explains how to open cases with AWS and what details to provide for faster and better support from AWS.
SAP Support Playbook
Disclaimer
This document provides general guidance for triaging issues related to SAP on AWS and describes the scope of support. It should not be treated as the single source of truth for troubleshooting. The commands and configuration shown are based on a generic cluster setup, so the steps and commands may change depending on your configuration or future OS releases. Customers are advised to validate the commands against the reference documents provided in the runbook, and to run any command or configuration change in a non-production environment before executing it in production.
Scope of SAP on AWS Support:
-
SAP requires customers to have a minimum AWS Business Support plan with AWS. This ensures that any critical issues raised with SAP are also handled and prioritized by AWS. See SAP Note #1656250 - SAP on AWS: Support prerequisites for more details.
-
AWS Support does not provide any support for the actual SAP software or customer applications. For SAP application specific issues, AWS Support Engineering recommends that customers raise an incident with SAP via the SAP support portal.
-
If the customer's issue appears to be related to SAP software, or there is uncertainty whether the root cause lies in SAP software or an AWS service, then the customer should open a support case with SAP support. After the first level of investigation, SAP can redirect the incident to AWS Support if they find an infrastructure related issue that needs to be investigated further by AWS Support. However, if the customer chooses to raise support issues for SAP applications with AWS Support, AWS Support cannot redirect the tickets to SAP. Opening an AWS Support case for issues with SAP software may delay resolution of the actual issue, as AWS Support will end up advising the customer to open an SAP support case.
For example:
- For any RHEL/SLES OS or infrastructure related issues, the customer should raise the issue directly with AWS Support Engineering.
- If the customer's nodes are not configured or do not function correctly, e.g. they are unable to synchronize the database 'outside' of High Availability clustering, then the issue most likely needs to be raised directly with SAP.
Related Resources
- SAP on AWS Support
- SAP on AWS FAQ
- AWS Docs - Architecture guidance for availability and reliability of SAP on AWS
- AWS Support FAQ - General & Third-party Software
- AWS Support FAQ - AWS Incident Detection and Response
SUSE Support
AWS Support provides OS level support i.e. troubleshooting, configuration guidance and assistance for EC2 instances that are launched from the PAYG SLES based AMIs.
- Public Cloud SUSE Support for PAYG SLES - AWS Support is responsible for providing the front line support and initial triage. Further troubleshooting is done by SUSE premium support when AWS Support escalates the case. SUSE does not provide support directly to the AWS customer, and there should be no expectation that SUSE support will join a customer call, etc.
- PAYG SLES Images - The PAYG SUSE AMIs i.e. 'SLES' and 'SLES for SAP Applications' are pre-built images by SUSE and they are made available on the AWS Marketplace for use by AWS customers.
- BYOS SUSE AMIs - The BYOS SUSE AMIs i.e. 'SLES' and 'SLES for SAP Applications', etc. are pre-built images by SUSE and they are made available as community images. They require entitlement and registration with the customer's SCC before these images can be used.
- Technology Previews - Technology previews are packages, stacks and/or features that are delivered by SUSE to provide customers with a glimpse of upcoming innovations. Technology previews are included for the convenience of customers to give them a chance to test new technologies within their environment. SUSE would appreciate any feedback! If the customer does testing with a technology preview, they can contact their SUSE representative and let them know about their experience and use cases. Any input is helpful for future development. Technology previews come with the following limitations:
- Technology previews are still in development. Therefore, they may be functionally incomplete, unstable, or in other ways not suitable for production use.
- Technology Previews are NOT supported.
- Technology Previews may only be available for specific hardware architectures. Details and functionality of Technology Previews are subject to change. As a result, upgrading to subsequent releases of a Technology Preview may be impossible and may require a fresh installation.
- Technology Previews can be removed from a product at any time. This may be the case, for example, if SUSE discovers that a preview does not meet the customer or market needs, or does not comply with enterprise standards.
Related Resources
- https://documentation.suse.com/sle-public-cloud/all/single-html/public-cloud/#sec-intro-support
- https://www.suse.com/releasenotes/x86_64/SUSE-SLES/15-SP3/index.html#intro-support
- https://www.suse.com/releasenotes/x86_64/SUSE-SLES/15-SP3/index.html#intro-technology-preview
Red Hat Support
AWS Support is responsible for providing OS level support i.e. troubleshooting, configuration guidance and assistance for EC2 instances that are launched from the PAYG RHEL AMIs.
- PAYG RHEL Support - AWS Support is responsible for providing the front line support and initial triage. Further troubleshooting is done by Red Hat Support in situations where AWS Support needs to escalate. Red Hat does not provide support directly to the customer, i.e. there should be no expectation that Red Hat support will join a customer call, etc.
- PAYG RHEL Images - The PAYG Red Hat AMIs i.e. 'RHEL', 'RHEL with HA', 'RHEL for SAP Solutions with HA and US', etc are pre-built images by Red Hat and they are made available on the AWS Marketplace for use by customers.
- PAYG RHEL vs Cloud Access “Gold Images” (BYOS/BYOL) - Red Hat PAYG images differ from the Red Hat Cloud Access "Gold Images" aka "BYOS / BYOL" images in the following ways:
- AWS customers purchase / subscribe to the PAYG images directly with AWS.
- The Red Hat product "subscription" comes as part of the PAYG AMI and the entitlement is associated with the EC2 Instances.
- All of the Red Hat PAYG images are already pre-configured to receive updates from the AWS RHUI.
- The AWS RHUI is maintained and managed by Red Hat on behalf of AWS.
- AWS customers receive their Red Hat support directly from AWS Support Engineering.
Case Severity
AWS Support Best Practices
- Before troubleshooting any service, start by checking the Personal Health Dashboard in your AWS console to make sure everything is reporting healthy.
- The Support Playbook sections that follow are general guidelines on the issue details to provide when raising a support case for specific services.
- Provide as much detail as you can while following these guidelines; however, this should not stop you from raising a production issue.
- Request chat in the AWS Support case to get faster response.
- Engage your TAM if you need to escalate a High, Medium, or Low severity case due to lack of expected response.
- AWS Support may not be able to provide RCA related to service issues. Please engage your TAM to get the RCA.
SAP Support Checklist
Steps for preparing a case and collecting the required logs. Commands are provided for both RHEL and SUSE.
- A complete and detailed description of the problem or error, including:
- The respective AWS resource ID(s) e.g. EC2 Instance IDs, etc.
- The current impact to the application / service and/or current system state.
- The date and time including the timezone for when the issue occurred.
- Information about the SAP landscape [for example, ENSA1 or ENSA2 if SAP NetWeaver]
- Any recent change made to the environment, i.e. configuration file or software/OS upgrade
- Any actions already taken, and whether they fixed the issue (for example, stopping and starting the instance, or restarting a service)
- Share the configuration details, system information, and diagnostic information.
- If SUSE, refer to the data collection steps for SUSE:
- Either use the suse-hacollect tool or collect with individual commands, e.g. supportconfig, hb_report, etc.
- If Red Hat, refer to the Red Hat Data Collection Steps for Gathering a System Report
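The checklist above can be captured in a small helper so that the same details are supplied with every case. This is a minimal sketch: the function name `case_summary`, the field labels, and the argument order are illustrative assumptions, not an AWS-defined format.

```shell
#!/usr/bin/env bash
# Print a plain-text summary of the checklist details, ready to paste into a case.
# Usage: case_summary <resource-ids> <impact> <incident-time-with-timezone> <recent-changes>
# The labels and argument order are hypothetical; adapt them to your own process.
case_summary() {
  printf 'Resource ID(s): %s\n' "$1"
  printf 'Current impact: %s\n' "$2"
  printf 'Incident time (with timezone): %s\n' "$3"
  printf 'Recent changes: %s\n' "$4"
  # Record when the summary itself was produced, in UTC.
  printf 'Summary generated (UTC): %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
```

For example, `case_summary "i-0123456789abcdef0" "HANA primary node fenced" "2024-03-10 14:30 UTC" "kernel update the day before"` prints five labelled lines that can be pasted into the case description.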
Data Collection Steps for Troubleshooting OS and Cluster Issues
SUSE
SUSE Data Collection Steps for Gathering a System Report
- Use the supportconfig command-line tool to collect detailed systems information and create a tar archive that can be shared with AWS Support Engineering. The supportconfig command-line tool is provided by the supportutils package, which should be installed by default on any resources that were launched from the PAYG SLES for SAP Applications images.
- For customer resources that are based off PAYG SLES for SAP Applications, make sure to have the customer install both the supportutils-plugin-suse-public-cloud and supportutils-plugin-ha-sap plugin packages.
- supportconfig versions released prior to January 2021 may collect data that a customer's security policies could consider sensitive, e.g. usernames. Sharing this type of data does not typically impact system security; however, some customers have security policies that do not permit sharing certain information, so they will need to review the collected content before sharing it with AWS.
Collecting a supportconfig archive on SLES as the root user
-
Run the following commands on all of the nodes in the cluster to install and/or update the supportutils and plugin packages on the host(s):
(source /etc/os-release; sudo SUSEConnect -p sle-module-public-cloud/$VERSION_ID/x86_64)
# Make sure the supportutils package is installed
sudo zypper in -y supportutils yast2-support
# Make sure the supportconfig sap ha plugin is installed
sudo zypper in -y supportutils-plugin-ha-sap
# Make sure the supportconfig public cloud plugin is installed
sudo zypper in -y supportutils-plugin-suse-public-cloud
-
Run the following on all of the cluster nodes to collect the system information in a compressed tar archive. Switch to the root user (or use sudo) and run the following supportconfig commands:
# Create the default config i.e. /etc/supportconfig.conf
sudo supportconfig -C
# Use the following command to gather the supportconfig
sudo supportconfig -l
Note: For larger high memory HANA hosts, e.g. 18TB or 24TB, use the following command instead of the above to gather the supportconfig.
sudo supportconfig -l -x OFILES,PROC
3. The tool will generate a compressed file in the "/var/log" directory containing the configuration files, the logs, and the already rotated logs.
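The choice between the two supportconfig invocations above can be sketched as a small helper that picks the options from the host memory size. The function names and the threshold are assumptions for illustration; the document only says that hosts in the 18TB and 24TB range should use the second form.

```shell
#!/usr/bin/env bash
# Choose supportconfig options from the host memory size.
# Args: memory in kB (as reported in /proc/meminfo), threshold in kB.
# The threshold is an assumption; pick a value that matches your large HANA hosts.
supportconfig_opts() {
  local mem_kb=$1 threshold_kb=$2
  if [ "$mem_kb" -ge "$threshold_kb" ]; then
    # Same options as the large-host command shown above.
    echo "-l -x OFILES,PROC"
  else
    echo "-l"
  fi
}

# Read MemTotal (kB) from /proc/meminfo on the current host.
host_mem_kb() {
  awk '/^MemTotal:/ {print $2}' /proc/meminfo
}
```

A possible invocation, using a hypothetical threshold of 16 TiB expressed in kB, is `sudo supportconfig $(supportconfig_opts "$(host_mem_kb)" 17179869184)`.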
Related Resources
- SUSE KB - supportconfig
- SUSE KB - Details for supportconfig plugin to gather SAP information related to SUSE solutions
- SUSE Docs - Data Collection for General OS Issues
- SUSE Docs - Troubleshooting and Gathering System Information for Support
SUSE Data Collection Steps for Gathering a Pacemaker Cluster Report
- By default the hb_report tool will try to gather data from cluster nodes either via ssh using the root user on the other cluster nodes or via another specified user using the sudo tool.
- The method that is used will depend on the customer's requirements for their environment. All of the cluster nodes must be able to access each other via ssh.
- Tools like hb_report/crm_report and Hawk's History Explorer require passwordless ssh access between the nodes. If this is not set up, hb_report will only be able to collect data from the node where the command was run.
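Before generating a report, it can help to confirm that passwordless ssh actually works from the current node to each peer. The following is a minimal sketch; the function name `check_passwordless_ssh` is hypothetical, and it relies on the standard OpenSSH `BatchMode` and `ConnectTimeout` options.

```shell
#!/usr/bin/env bash
# Report which cluster nodes accept key-based (passwordless) SSH from this node.
# Args: remote user, then one or more node hostnames.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless_ssh() {
  local user=$1 node failed=0
  shift
  for node in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "${user}@${node}" true 2>/dev/null; then
      echo "${node}: ok"
    else
      echo "${node}: FAILED"
      failed=1
    fi
  done
  # Non-zero exit status when at least one node could not be reached.
  return "$failed"
}
```

Running `check_passwordless_ssh root <node01 hostname> <node02 hostname>` on each node shows whether a cluster-wide report is likely to reach every peer.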
Collecting an HB Report from a SLES for SAP Cluster (As the root user):
-
When generating the hb_report, make sure to specify a START FROM and TIME TO FINISH for the incident with a time frame covering at least twenty-four hours (24) prior to the incident happening and two hours (2) after the incident.
sudo su -
hb_report -u root -f "YYYY/MM/DD HH:MM" -t "YYYY/MM/DD HH:MM" /tmp/hb_report-$(date +%Y%m%d-%s)
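The recommended window (24 hours before the incident, 2 hours after) can be computed rather than typed by hand. This is a sketch that assumes GNU date is available; the function name `hb_report_window` is hypothetical, and the incident time is interpreted in the shell's current time zone.

```shell
#!/usr/bin/env bash
# Print the -f and -t values for hb_report from an incident time:
# line 1 is 24 hours before the incident, line 2 is 2 hours after it.
# Requires GNU date; the input is interpreted in the current TZ.
hb_report_window() {
  local incident=$1 epoch
  # Convert to epoch seconds first so the offsets are plain arithmetic.
  epoch=$(date -d "$incident" +%s) || return 1
  date -d "@$(( epoch - 24 * 3600 ))" +'%Y/%m/%d %H:%M'
  date -d "@$(( epoch + 2 * 3600 ))" +'%Y/%m/%d %H:%M'
}
```

For example, `hb_report_window "2024-03-10 14:30"` prints two lines that can be passed to the -f and -t flags respectively.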
The following formats are acceptable for use with the -f and -t flags:
2pm
1:00
"2007/9/5 12:30"
"09-Sep-07 2:00"
Collecting an HB Report from a SLES for SAP Cluster with a custom Corosync log location
If the customer has configured a custom log location in corosync.conf, e.g. for corosync.log, the hb_report may not capture the log from that location. Make sure to collect it as well using the -E <file> flag (extra log files to collect). This option is cumulative, and by default /var/log/messages is collected along with the other cluster related logs. For example, the following would also gather the corosync.log that is under the /var/log/cluster directory.
sudo su -
hb_report -u root -f "YYYY/MM/DD HH:MM" -t "YYYY/MM/DD HH:MM" -n <node01 hostname> -n <node02 hostname> \
  -E /var/log/cluster/corosync.log /tmp/hb_report-$(date +%Y%m%d-%s)
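To find out whether a custom log location is configured, the logfile setting can be read from corosync.conf. The following is a minimal sketch; the function name `corosync_logfile` is hypothetical, and it assumes the usual `logfile: <path>` directive inside the logging section.

```shell
#!/usr/bin/env bash
# Print the first "logfile:" path found in a corosync.conf-style file.
# Arg: path to the configuration file (defaults to /etc/corosync/corosync.conf).
# Lines such as "to_logfile: yes" are not matched, only the "logfile:" directive.
corosync_logfile() {
  local conf=${1:-/etc/corosync/corosync.conf}
  sed -n 's/^[[:space:]]*logfile:[[:space:]]*//p' "$conf" | head -n 1
}
```

If the printed path is somewhere other than the default cluster log locations, pass it to hb_report with the -E flag as shown above.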
Collect a Cluster Report from a Single Node
A cluster report can also be collected from each node individually, one node at a time:
ssh <user>@<cluster node> sudo -u root /usr/sbin/crm report -S /home/<user>/<cluster node>
(Optional) Pacemaker Cluster Collection Steps
In certain situations it may be helpful to have the customer provide the pengine files. To collect the pengine files, run the following command on each of the cluster nodes:
tar cvfJ $(hostname)-pengine-files.txz /var/lib/pacemaker/pengine/
(Optional) SAP HANA Data Collection Steps
To collect additional SAP HANA related information, switch to the SAP <sid>adm user:
sudo su - <sid>adm
Then run the following commands on each cluster node to collect the additional SAP HANA information:
HDB info >> /tmp/hana.$HOSTNAME
HDBSettings.sh systemOverview.py >> /tmp/hana.$HOSTNAME
/usr/sap/<SAPSID>/<INSTANCE><NUMBER>/exe/sapcontrol -nr <NR> -function GetProcessList >> /tmp/hana.$HOSTNAME
HDBSettings.sh landscapeHostConfiguration.py --sapcontrol=1 >> /tmp/hana.$HOSTNAME; echo $? >> /tmp/hana.$HOSTNAME
HDBSettings.sh systemReplicationStatus.py >> /tmp/hana.$HOSTNAME; echo $? >> /tmp/hana.$HOSTNAME
hdbnsutil -sr_state >> /tmp/hana.$HOSTNAME
In addition, please request the customer to share the file /tmp/hana.$HOSTNAME
Related Resources
- SUSE KB - Data Collection for In-depth HANA Cluster Debugging i.e. Pacemaker, SAP, etc
- SUSE KB - Usage of hb_report for SLE-HAE
- SUSE Docs - Data Collection for Cluster Issues
- SUSE Docs - Collecting a hb_report/crm_report as a non-root user
Red Hat
Red Hat Data Collection Steps for Gathering a Pacemaker Cluster Report
- The sosreport utility should automatically collect a crm_report via the cluster plugin and this is the recommended method for gathering cluster related data. All cluster nodes must be able to access each other via SSH. Tools like hb_report/crm_report for troubleshooting and Hawk's History Explorer require passwordless SSH access between the nodes, otherwise they can only collect data from the current node.
- If you do not have the sos package, you can run the crm_report tool on all cluster nodes and manually provide the Corosync logs.
Collect a Cluster Report as the root user
The crm_report tool is part of the pacemaker package and should already be installed on the host, so there should normally be no need to run the following command.
sudo yum install -y pacemaker
When collecting a crm_report during an investigation, it is important that the date specified is earlier than the known occurrences of the problems or behavior in question, ideally by at least a day or more. If it is unknown when the issue occurred, then going back one week or more is a good strategy. Specify how far in the past you want the report to start and which directory to place the collected data (Note: The directory you specify must not exist). The following command will collect exactly 7 days prior to the current date/time:
sudo su -
crm_report -f "`date --date='7 days ago' +%Y-%m-%d' '%H:%M:%S`" /tmp/crm_report-$(date +%Y%m%d-%s)
To collect a crm_report on a single node, run the following command (Note: the -S means single node):
sudo su -
crm_report -f "`date --date='7 days ago' +%Y-%m-%d' '%H:%M:%S`" -S /tmp/crm_report-$(date +%Y%m%d-%s)
In the event that you use a non-standard SSH port, use the -X option. For example, if your SSH port is 3479, invoke crm_report with:
sudo su -
crm_report -X '-p 3479' -f "`date --date='7 days ago' +%Y-%m-%d' '%H:%M:%S`" -S /tmp/crm_report-$(date +%Y%m%d-%s)
Related Resources
- https://docs.aws.amazon.com/wellarchitected/latest/sap-lens/best-practice-1-6.html
- https://aws.amazon.com/sap/docs/
- How do I generate a crm_report from a RHEL 6 or 7 High Availability cluster node using pacemaker?
Troubleshooting Common Pacemaker Cluster Issues:
[1] Corosync Totem Token Timeout
The following warning message is logged when the token warning threshold is reached. If you see it frequently in the logs, the token timeout may need to be increased. By default (when nothing is set in the corosync configuration), the warning is logged at 75% of the token timeout, e.g. at 750ms for a 1000ms token timeout or at 2250ms for a 3000ms token timeout.
node01 corosync[190620]: [TOTEM ] Token has not been received in X ms
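The threshold arithmetic described above can be sketched as a one-line helper, which is handy when comparing the X value in the log line against a configured token timeout. The function name `token_warning_ms` is hypothetical, and the 75% default is taken from the text above.

```shell
#!/usr/bin/env bash
# Compute when corosync logs the "Token has not been received" warning.
# Args: token timeout in ms, warning percentage (defaults to 75).
token_warning_ms() {
  local token_ms=$1 percent=${2:-75}
  # Integer arithmetic: the warning fires at percent% of the token timeout.
  echo $(( token_ms * percent / 100 ))
}
```

For example, `token_warning_ms 3000` prints 2250, matching the figure given above for a 3000ms token timeout.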
Related Resources
- SUSE KB - How to setup corosync token and consensus in a cluster with more then 2 nodes using unicast (udpu)
- Red Hat KB - How do I change the totem token timeout value on a RHEL 5, 6, 7, 8 or 9 High Availability cluster?
- Red Hat KB - How do I configure the consensus timeout in a Red Hat High Availability cluster?
- SAP on AWS Docs - Increase the Corosync timeout for a RHEL for SAP NetWeaver Setup
- bz1870449
- github.com/corosync/corosync/pull/600
[2] A cluster node was fenced due to corosync transmit errors and messages re-transmitted during heavy load
A cluster node was fenced while it was under heavy load and a large number of corosync messages were being retransmitted. This situation may occur during a short network blip or a third party security software scan that delays the transmission of corosync message packets.
Related Resources
- SUSE KB - Corosync Communication Failure
- Red Hat KB - How to prevent FAILED TO RECEIVE on overloaded network used by corosync?
- Red Hat KB - "[TOTEM] Retransmit List" messages repeatedly seen in RHEL 5, 6, 7, or 8 High Availability cluster node logs
- github.com/corosync/corosync/issues/622
- bz2001969
[3] A Node was fenced after it logged that "A processor failed, forming new configuration"
A customer's node(s) were fenced and the following message was logged in the cluster logs, i.e. corosync:
"A processor failed, forming new configuration"
This is typically due to Corosync token timeout and consensus values that are too low, e.g. 1s, 3s, 5s. It is also the expected cluster behavior: there was a brief period during which the node could not be seen, so the cluster errs on the side of caution to maintain data integrity and consistency.
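When reviewing token and consensus values, it helps to know what consensus defaults to when it is not set. According to the corosync.conf man page, consensus is calculated as 1.2 times the token timeout in that case; the sketch below reproduces that arithmetic. The function name `default_consensus_ms` is hypothetical, and the man page for the installed corosync version should be treated as authoritative.

```shell
#!/usr/bin/env bash
# Compute the consensus timeout corosync derives when none is configured.
# Arg: token timeout in ms. Uses integer arithmetic for 1.2 * token.
default_consensus_ms() {
  local token_ms=$1
  echo $(( token_ms * 12 / 10 ))
}
```

For example, a 5000ms token timeout gives a derived consensus of 6000ms.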
Related Resources
[4] Cluster split-brain scenarios and STONITH Deathmatch aka Fence Races
Fence Race Scenario
In a two-node cluster, both nodes may attempt to fence each other simultaneously, causing both nodes to reboot or power off. This most commonly happens when there is an issue with the heartbeat network, where both nodes are healthy but cannot communicate with each other. This is referred to as "Fence Race".
An example in the logs of two nodes in a fence race might look like the following.
node1 sees node2 is gone and fences.
Nov 20 15:17:40 [117052] node1 cib: notice: crm_update_peer_state_iter: Node node2 state is now lost | nodeid=168364360 previous=member source=crm_update_peer_proc
Nov 20 15:17:41 [117056] node1 pengine: warning: pe_fence_node: Node node2 will be fenced because the node is no longer part of the cluster
node2 sees node1 is gone and attempts to fence at the same time.
Nov 20 15:17:40 [16727] node2 cib: notice: crm_update_peer_state_iter: Node node1 state is now lost | nodeid=168364359 previous=member source=crm_update_peer_proc
Nov 20 15:17:41 [16731] node2 pengine: warning: stage6: Scheduling Node node1 for STONITH
This results in both nodes fencing each other. While data integrity is maintained, this results in a complete loss of all services.
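A quick way to check for this pattern is to search each node's log for the fencing decision and compare the timestamps. The sketch below is illustrative only: the function names are hypothetical, and the search patterns are taken from the two sample log lines above, so other Pacemaker versions may word these messages differently.

```shell
#!/usr/bin/env bash
# Print the lines where Pacemaker decided to fence a peer.
# Arg: path to a cluster log file.
fence_decisions() {
  grep -E 'will be fenced|Scheduling Node .* for STONITH' "$1"
}

# Return 0 when both nodes' logs contain a fencing decision, which is
# consistent with (but does not prove) a fence race.
# Args: log file from the first node, log file from the second node.
looks_like_fence_race() {
  fence_decisions "$1" >/dev/null && fence_decisions "$2" >/dev/null
}
```

If `looks_like_fence_race` succeeds, print both sets of lines with `fence_decisions` and confirm that the timestamps fall within the same few seconds before concluding that a fence race occurred.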
Related Resources
- STONITH Deathmatch
- Split-brain, Quorum and Fencing
- SUSE KB - Preventing a Fence Race in a Split Brain Scenario
- Red Hat KB - fence_aws powers off both nodes during a heartbeat network split
- Red Hat KB - Delaying Fencing in a Two Node Cluster to Prevent Fence Races or "Fence Death" Scenarios
- Red Hat KB - How do I delay fencing to prevent fence races when using a shared stonith device in a two-node cluster?
- bz1780515
- SAP HANA on AWS - STONITH
[5] stonith-timeout and stonith-action properties are ignored when running pcs stonith fence command in a Pacemaker cluster
The stonith-timeout and stonith-action cluster properties only apply to stonith actions (on/off/reboot) created by the Pacemaker scheduler. The stonith-timeout and stonith-action do not apply to fencing by the stonith_admin or pcs stonith fence <node> commands.
The stonith-timeout cluster property does apply to all devices, but it only controls the timeouts for on, off, and reboot actions. This setting does not control other actions like monitor, status, and list. This was not made clear in some older versions of the documentation.
The timeouts for individual actions on individual devices can be configured with the pcmk_<action>_timeout settings (replacing <action> with the appropriate action name), as described in Table 5.2. Advanced Properties of Fencing Devices.
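As an illustration of the pcmk_<action>_timeout settings, the sketch below builds (but does not run) a pcs command for one device and one action. The function name `pcmk_timeout_cmd`, the device ID, and the timeout value are all hypothetical placeholders; check the resulting command against the Red Hat documentation for your release before running it.

```shell
#!/usr/bin/env bash
# Build (but do not run) a pcs command that sets a per-action timeout
# on a single fence device.
# Args: stonith device ID, action name, timeout in seconds.
pcmk_timeout_cmd() {
  local device=$1 action=$2 seconds=$3
  case "$action" in
    reboot|off|on|monitor|status|list) ;;
    *) echo "unsupported action: $action" >&2; return 1 ;;
  esac
  echo "pcs stonith update ${device} pcmk_${action}_timeout=${seconds}"
}
```

Printing the command first makes it easy to review in a non-production environment; for example, `pcmk_timeout_cmd <device id> reboot 480` prints the pcs command that would set pcmk_reboot_timeout for that device.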
Related Resources
- Red Hat KB - stonith-timeout and stonith-action properties are ignored when running pcs stonith fence command in a Pacemaker cluster
- Red Hat KB - stonith-timeout doesn't work as expected in a RHEL 6 or 7 High Availability cluster with pacemaker
- Red Hat - Modifying fencing devices
[6] Fencing operations timed out
The Amazon EC2 API does not provide a method to immediately power off ("pull the plug") or reset EC2 instances. All of the available methods trigger a graceful OS shutdown, including a force stop. As a result, an instance can take anywhere from seconds to several minutes to go from the "stopping" to the "stopped" state. If the instance does not complete this transition before the fence agent's power_timeout expires (the default value is 60 seconds), the fence action fails with an EC_WAITING_OFF error ("Timed out waiting to power OFF").
The shutdown also tends to take longer after a break in heartbeat communication, because Pacemaker resists being stopped gracefully while it views the other node as needing fencing. If a fence delay is configured for node1 and node2 is in the middle of a fence-initiated graceful shutdown, the fence delay may expire before node2 finishes shutting down. When this happens, node2 sends a fence request for node1, and both nodes can be fenced despite the delay.
EC2 bare metal instances typically take much longer to stop than virtualized instances. The fence_aws STONITH action may time out when a metal instance is stopped, because it can take roughly 10 minutes to go from the "stopping" to the "stopped" state.
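To judge whether power_timeout is large enough, it can be useful to measure how long an instance actually takes to reach the "stopped" state during a planned test in a non-production environment. The following is a minimal sketch, assuming the AWS CLI is installed and has credentials; the function names, default wait, and poll interval are hypothetical.

```shell
#!/usr/bin/env bash
# Print the current state name of an EC2 instance (uses the AWS CLI).
# Arg: instance ID.
instance_state() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].State.Name' --output text
}

# Poll until the instance reports "stopped" and print the elapsed seconds.
# Args: instance ID, maximum wait in seconds, poll interval in seconds.
# Returns 1 if the instance has not stopped within the maximum wait.
seconds_until_stopped() {
  local id=$1 max=${2:-900} interval=${3:-5} start now
  start=$(date +%s)
  while :; do
    now=$(date +%s)
    if [ "$(instance_state "$id")" = "stopped" ]; then
      echo $(( now - start ))
      return 0
    fi
    if [ $(( now - start )) -ge "$max" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

Start the timer right after a stop has been requested, then compare the printed number with the configured power_timeout; a measured time close to or above the timeout suggests the fence action is at risk of failing with the error described above.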
Related Resources
- Red Hat KB - The fence_aws fencing action timed out with EC2 Bare Metal Instances
- Red Hat KB - The fence_aws fencing action failed with "Timed out waiting to power OFF" and then "Unable to obtain correct plug status or plug is not available" when a node is panicked in a High Availability cluster
- Red Hat KB - fence_aws powers off both nodes during a heartbeat network split
- Red Hat KB - Why did pacemaker not shutdown until the "Shutdown Escalation" timer expired?
- GitHub - fence_aws.py
- GitHub - fencing.py
[7] Change the default fence behavior from "reboot" to "off" for fence_aws on RHEL
The default pcmk action is to reboot.
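One way this change is commonly expressed is through Pacemaker's pcmk_reboot_action device property, which substitutes a different action when a reboot is requested. The sketch below only prints the command so it can be reviewed first; the function name `fence_off_cmd` and the device ID are hypothetical, and whether this property is the right approach for a given cluster is an assumption to verify against the Red Hat documentation.

```shell
#!/usr/bin/env bash
# Build (but do not run) a pcs command that maps the reboot action to "off"
# for a single fence device.
# Arg: stonith device ID (a placeholder you must supply).
fence_off_cmd() {
  local device=$1
  echo "pcs stonith update ${device} pcmk_reboot_action=off"
}
```

Review the printed command and apply it in a non-production cluster first, then confirm the resulting device configuration before rolling it out further.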
Related Resources