
Why am I receiving a "FAILED TO CONNECT AGENT TO REPLICATION SERVER" error in my AWS MGN or AWS DRS replication initiation steps?

I am attempting to replicate or migrate a server using Application Migration Service (AWS MGN) or Elastic Disaster Recovery (AWS DRS), but the process fails with the error "FAILED TO CONNECT AGENT TO REPLICATION SERVER".

Short description

The "FAILED TO CONNECT AGENT TO REPLICATION SERVER" error shows that the replication agent installed on the source server is unable to establish a network connection with the AWS replication server in the AWS staging area. This connectivity is crucial for transferring data from the source server to the replication server during the migration or disaster recovery process.

Resolution

This error has several potential causes, most of them related to network configuration and connectivity between the source server and the replication server's subnet. Work through the following troubleshooting steps to identify and resolve the root cause:

Network connectivity issues

A common reason for the "FAILED_TO_CONNECT_AGENT_TO_REPLICATION_SERVER" error is a network connectivity issue between the source server and the replication server or staging subnet. To troubleshoot this:

  • Confirm that the Replication Settings specify the data routing that matches your network design, either public or private. For Application Migration Service, see Replication settings. For Elastic Disaster Recovery, see AWS DRS replication settings.
  • Verify that the source server can reach the MGN or DRS service endpoint (for example, mgn.eu-west-1.amazonaws.com) over TCP port 443. Also check that the required ports are open for inbound and outbound traffic between the source server and the staging subnet. For both MGN and DRS, TCP port 1500 is required for data transfer between the source server (Replication Agent) and the replication server in the AWS staging area.
  • Test the connectivity from the source server to the replication server's IP address using the following commands:
echo | telnet <replication_server_ip> 1500
echo | openssl s_client -connect <replication_server_ip>:1500
  • If either command hangs or fails to establish a connection, there is a network connectivity issue between the source server and the replication server that must be resolved.
  • Review the network configuration, including security groups, network ACLs, route tables, and gateways (NAT Gateway, Internet Gateway, Transit Gateway), to ensure proper connectivity between the source server and the staging subnet. For private replication scenarios using AWS Direct Connect or VPN, ensure that the replication subnet's route table has the correct routes configured to allow traffic from the source server's network.
  • Check for any proxy servers, firewalls, or SSL offloading rules that might be interfering with the connection between the source server and the replication server. Configure allow rules or exceptions as needed, and ensure that SSL connections are allowed without interception.
  • Perform an SSL connectivity test between your source environment and the staging area by following this document.
  • Capture network traces (for example, with Wireshark or tcpdump) during the agent's connection attempt to analyze the network traffic and identify where the connection fails.
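
The port checks described above can also be scripted. The following is a minimal sketch for a Linux source server, not an official AWS tool: the function names and the 5-second timeout are assumptions made for illustration, and it relies on bash's built-in /dev/tcp redirection and the coreutils timeout command being available.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight TCP checks from a Linux source server.
# Function names and the 5-second timeout are illustrative assumptions.

# Build the regional service endpoint name, e.g. mgn.eu-west-1.amazonaws.com.
service_endpoint() {
  local service="$1" region="$2"   # service: mgn or drs
  printf '%s.%s.amazonaws.com\n' "$service" "$region"
}

# Return 0 if a TCP connection to host:port opens within 5 seconds.
# Uses bash's /dev/tcp redirection, so telnet does not need to be installed.
check_tcp() {
  local host="$1" port="$2"
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Print a one-line result for a host:port pair.
report_tcp() {
  local host="$1" port="$2"
  if check_tcp "$host" "$port"; then
    echo "OK      $host:$port"
  else
    echo "FAILED  $host:$port"
  fi
}

# Example (replace the Region and the placeholder IP with your own values):
#   report_tcp "$(service_endpoint mgn eu-west-1)" 443
#   report_tcp <replication_server_ip> 1500
```

A FAILED result only shows that the TCP connection did not open within the timeout. It does not say which security group, network ACL, route, or firewall dropped the traffic, so the configuration review in the steps above is still needed.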

SSL/TLS handshake issues

In some cases, the "FAILED TO CONNECT AGENT TO REPLICATION SERVER" error can be caused by an issue with the SSL/TLS handshake between the source server and the replication server. To troubleshoot this:

  • Install OpenSSL on the source server (if not already installed) and run the following OpenSSL commands to test the SSL/TLS connectivity:
echo | openssl s_client -showcerts -connect <replication_server_ip>:1500
echo | openssl s_client -connect <replication_server_ip>:1500
  • Test the SSL/TLS connection using different TLS versions (e.g., -tls1, -tls1_1, -tls1_2, -tls1_3) with the OpenSSL commands.

Example command:

echo | openssl s_client -showcerts -connect <replication_server_ip>:1500 -tls1_2
  • Review the OpenSSL command outputs, network trace logs (if available), and the replication agent logs (agent.log.0) for clues about the SSL/TLS handshake issue. Verify that no SSL offloading or inspection rule sits in the path, because the connection must reach the replication server without SSL interception.
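
To try every TLS version in one pass, the OpenSSL commands above can be wrapped in a loop. This is a sketch under stated assumptions: the function names and the 10-second timeout are hypothetical, and success is inferred from the "New, ..., Cipher is ..." summary line that openssl s_client prints, whose exact wording can vary between OpenSSL releases. Some builds do not accept -tls1_3 (older releases) or have -tls1 and -tls1_1 disabled, so a failure for those flags may reflect the local OpenSSL build rather than the network.

```shell
#!/usr/bin/env bash
# Sketch: try each TLS version against the replication server on port 1500.
# Function names and the 10-second timeout are illustrative assumptions.

# Read `openssl s_client` output on stdin.
# Return 0 only if the summary line reports a negotiated cipher.
handshake_succeeded() {
  grep -E '^New, .*Cipher is ' | grep -qvF '(NONE)'
}

# Run one s_client attempt per TLS version flag and print the outcome.
probe_tls_versions() {
  local host="$1" port="${2:-1500}" flag
  for flag in -tls1 -tls1_1 -tls1_2 -tls1_3; do
    if echo | timeout 10 openssl s_client -connect "$host:$port" "$flag" 2>&1 \
        | handshake_succeeded; then
      echo "$flag: handshake completed"
    else
      echo "$flag: handshake failed or flag not supported by this OpenSSL build"
    fi
  done
}

# Example (replace the placeholder with the replication server's IP address):
#   probe_tls_versions <replication_server_ip>
```

If every version fails while a plain TCP connection to port 1500 succeeds, that pattern points toward interception or filtering of the TLS session rather than a routing problem.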

Replication agent issues

Verify that the replication agent is installed and running correctly on the source server. Check the status of the agent services and confirm that they are running:

  • Linux
sudo systemctl status aws-replication-agent.service
sudo systemctl status aws-replication-tailer.service
sudo systemctl status aws-replication-run-migration-scripts.service
sudo systemctl status aws-replication-update-volumes.service
  • Windows (confirm that each of the following services is running, for example in the Services console)
AwsReplicationService
AwsReplicationLogger
AwsReplicationDriverLogger
AwsReplicationVolumeUpdaterService
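
On Linux, the four status checks can be condensed into one summary. The sketch below uses the service names listed above together with systemctl is-active; the function names are assumptions for illustration. A state other than "active" is a prompt to inspect that unit with systemctl status, not proof of a fault, because some units may legitimately be idle.

```shell
#!/usr/bin/env bash
# Sketch: summarize the state of the replication agent services on Linux.
# Function names are illustrative assumptions; unit names come from the list above.

# Print the agent's systemd unit names, one per line.
agent_services() {
  printf '%s\n' \
    aws-replication-agent.service \
    aws-replication-tailer.service \
    aws-replication-run-migration-scripts.service \
    aws-replication-update-volumes.service
}

# Print "<unit>: <state>" for each unit, using `systemctl is-active`.
check_agent_services() {
  local svc state
  agent_services | while read -r svc; do
    state="$(systemctl is-active "$svc" 2>/dev/null || true)"
    echo "$svc: ${state:-unknown}"
  done
}

# Example:
#   check_agent_services
```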

Review agent logs

Examine the replication agent logs (typically located at C:\Program Files (x86)\AWS Replication Agent\agent.log.0 on Windows or /var/lib/aws-replication-agent/agent.log.0 on Linux) for more detailed error messages about the connectivity issue. These logs often contain specific information that helps pinpoint the root cause of the connection failure. The following example from the agent log file shows a connection timeout (java.net.SocketTimeoutException: connect timed out) raised while the agent tried to open the SSL socket to the replication server:

{"message":"Error creating connection (minor)","log":{"type":"Remote","name":"Agent","level":"WARNING","level_number":30},"source_code":{"line":453,"function":"getChannel","file":"com.amazonaws.cloudendure.agent.driverflavor.ClientCommunicatorDrvFlavor"},"runtime_thread":21,"exception":{"message":"java.net.SocketTimeoutException: connect timed out","type":"com.amazonaws.cloudendure.agent.shared.ConnectionCreator$ConnectToReplicatorTimeoutException","trace":"com.amazonaws.cloudendure.agent.shared.ConnectionCreator$ConnectToReplicatorTimeoutException: java.net.SocketTimeoutException: connect timed out
    at com.amazonaws.cloudendure.agent.shared.ConnectionCreator.createOneReplicatorConnection(ConnectionCreator.java:453)
    at com.amazonaws.cloudendure.agent.shared.ConnectionCreator.createReplicatorConnections(ConnectionCreator.java:534)
    at com.amazonaws.cloudendure.agent.shared.ConnectionCreator.tryCreateConnection(ConnectionCreator.java:244)
    at com.amazonaws.cloudendure.agent.shared.ConnectionCreator.createConnection(ConnectionCreator.java:170)
    at com.amazonaws.cloudendure.agent.driverflavor.ClientCommunicatorDrvFlavor.getChannel(ClientCommunicatorDrvFlavor.java:122)
    at com.amazonaws.cloudendure.agent.driverflavor.ClientCommunicatorDrvFlavor.receive(ClientCommunicatorDrvFlavor.java:224)
    at com.amazonaws.cloudendure.agent.driverflavor.ClientMessageHandler.run(ClientMessageHandler.java:63)
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:293)
    at com.amazonaws.cloudendure.common.ConnectionUtilities.connectToMachine(ConnectionUtilities.java:187)
    at com.amazonaws.cloudendure.agent.shared.ConnectionCreator.createOneReplicatorConnection(ConnectionCreator.java:449)
    ... 6 more
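
Entries like the one above can be located quickly with grep. The following sketch searches the Linux agent log for the strings that appear in that example; the function names are assumptions for illustration, and the search patterns are taken only from this excerpt, so other failure modes may log different text.

```shell
#!/usr/bin/env bash
# Sketch: find connection failures in the replication agent log on Linux.
# Function names are illustrative; patterns come from the log excerpt above.

AGENT_LOG_DEFAULT="/var/lib/aws-replication-agent/agent.log.0"
CONNECTION_ERROR_PATTERN='Error creating connection|connect timed out|ConnectToReplicatorTimeoutException'

# Print every log line that matches one of the connection error patterns.
find_connection_errors() {
  local log_file="${1:-$AGENT_LOG_DEFAULT}"
  grep -E "$CONNECTION_ERROR_PATTERN" "$log_file"
}

# Print the number of matching lines (0 if there are none).
count_connection_errors() {
  local log_file="${1:-$AGENT_LOG_DEFAULT}"
  grep -cE "$CONNECTION_ERROR_PATTERN" "$log_file" || true
}

# Example (reading the log may require root privileges):
#   sudo bash -c "$(declare -f find_connection_errors); find_connection_errors" | tail -n 5
#   count_connection_errors /var/lib/aws-replication-agent/agent.log.0
```

A repeated "connect timed out" message is consistent with traffic to port 1500 being dropped silently, which is typical of a firewall, security group, or missing route rather than a rejected TLS handshake.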

AWS Support Engineer, published 7 months ago, 1.8K views