How can I copy data to and from Amazon EFS in parallel to maximize performance on my EC2 instance?

I have a large number of files to copy. I want to run the copy jobs in parallel between my Amazon Elastic Compute Cloud (Amazon EC2) instance and an Amazon Elastic File System (Amazon EFS) file system.

Short Description

Use one of the following tools to run jobs in parallel on an Amazon EFS file system:

  • GNU parallel: For more information, see GNU Parallel on the GNU Operating System website.
  • msrsync: For more information, see msrsync on the GitHub website.
  • fpsync: For more information, see fpsync on the Ubuntu manuals website.

Resolution

GNU parallel

1.    Install GNU parallel.

Amazon Linux and RHEL 6:

$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
$ sudo yum install parallel nload -y

RHEL 7:

$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install parallel nload -y

Amazon Linux 2:

$ sudo amazon-linux-extras install epel
$ sudo yum install nload sysstat parallel -y

Ubuntu:

$ sudo apt-get install parallel

2.    Use GNU parallel with rsync or cp to copy the files to Amazon EFS:

$ sudo time find -L /src -type f | parallel rsync -avR {} /dst

or

$ sudo time find /src -type f | parallel -j 32 cp {} /dst
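Note that the cp command above copies every file directly into /dst, so the relative paths under /src are lost and files that share a name overwrite each other. The following is a minimal sketch of one way to keep the directory tree, assuming bash and an absolute destination path without spaces: it creates the directories serially, and then copies the files in parallel. The parallel_copy helper, the job count of 4, and the temporary directories are hypothetical and only for illustration. The xargs -P fallback is an assumption added so that the sketch still runs where GNU parallel isn't installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# parallel_copy SRC DST [JOBS]: copy the tree under SRC into DST, keeping relative paths.
# Hypothetical helper for illustration. DST must be an absolute path without spaces.
parallel_copy() {
  local src=$1 dst=$2 jobs=${3:-4}
  (
    cd "$src"
    # Create the directory tree serially so the parallel jobs never race on mkdir.
    find . -type d -print0 | xargs -0 -I{} mkdir -p "$dst/{}"
    # Copy files in parallel. NUL-delimited names survive spaces in file names.
    if [[ "$(parallel --version 2>/dev/null || true)" == *"GNU parallel"* ]]; then
      find . -type f -print0 | parallel -0 -j "$jobs" cp {} "$dst/{}"
    else
      # Fallback (assumption): xargs -P when GNU parallel is not installed.
      find . -type f -print0 | xargs -0 -P "$jobs" -I{} cp {} "$dst/{}"
    fi
  )
}

# Demo on temporary directories that stand in for /src and /dst.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/logs/2024"
printf 'alpha\n' > "$src/logs/file 1.txt"
printf 'beta\n'  > "$src/logs/2024/file2.txt"

parallel_copy "$src" "$dst" 4
find "$dst" -type f | sort
```

Creating the directories first is a design choice: it keeps each parallel job to a single cp call that cannot fail because a parent directory is missing.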

3.    Use the nload console application to monitor network traffic and bandwidth.

$ sudo nload -u M

msrsync

msrsync is a Python wrapper for rsync that runs multiple rsync processes in parallel.

Note: msrsync is a Python script and needs a Python interpreter to run. Run the msrsync script using Python version 2.7.14 or later.

1.    Install msrsync.

$ sudo curl -s https://raw.githubusercontent.com/jbd/msrsync/master/msrsync -o /usr/local/bin/msrsync && sudo chmod +x /usr/local/bin/msrsync

2.    Use the -p option to specify the number of rsync processes that you want to run in parallel. Replace X with the number of rsync processes. The -P option shows the progress of each job.

$ sudo time /usr/local/bin/msrsync -P -p X --stats --rsync "-artuv" /src/ /dst/
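The command above leaves the value of X open. The following is a minimal sketch of one way to pick a starting value, assuming that a small multiple of the CPU count is a reasonable first guess. The multiplier of 4 and the cap of 64 are hypothetical values to tune while watching throughput, not AWS recommendations. The script only prints the msrsync command so that you can review it before you run it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Number of online CPUs. getconf is the fallback where nproc is unavailable.
cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Hypothetical starting point: 4 rsync processes per CPU, capped at 64.
per_cpu=4
cap=64
jobs=$(( cpus * per_cpu ))
if [ "$jobs" -gt "$cap" ]; then
  jobs=$cap
fi

# Build and print the command instead of running it.
msrsync_cmd="sudo time /usr/local/bin/msrsync -P -p $jobs --stats --rsync \"-artuv\" /src/ /dst/"
echo "$msrsync_cmd"
```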

fpsync

The fpsync tool synchronizes directories in parallel using fpart and rsync. It can run several rsync processes locally or launch rsync transfers on several nodes (workers) through SSH.

For more information on fpart, see fpart on the Ubuntu manuals website.
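To make the partition-then-transfer idea concrete, the following sketch loosely imitates it with standard tools only: it deals a file list round-robin into a fixed number of parts, and then runs one background cp loop for each part. This is an illustration under stated assumptions, not fpart's actual partitioning logic and not a substitute for fpsync. The temporary directories, the part count of 3, and the part.N file names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

src=$(mktemp -d); dst=$(mktemp -d); work=$(mktemp -d)
parts=3

# Sample data: six small files in a flat directory.
for i in 1 2 3 4 5 6; do
  printf 'data %s\n' "$i" > "$src/file$i.txt"
done

# Partition step: deal file names round-robin into part.0, part.1, and part.2.
( cd "$src" && find . -type f ) | sort |
  awk -v n="$parts" -v prefix="$work/part." '{ print > (prefix ((NR - 1) % n)) }'

# Transfer step: one background worker per part, then wait for all of them.
for part in "$work"/part.*; do
  while IFS= read -r f; do
    cp "$src/$f" "$dst/$f"
  done < "$part" &
done
wait

copied=$(find "$dst" -type f | wc -l | tr -d ' ')
echo "copied $copied files using $parts workers"
```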

1.    Activate the EPEL repository, and then install the fpart package.

Amazon Linux and RHEL 6:

$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
$ sudo yum install fpart -y

RHEL 7:

$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install fpart -y

Amazon Linux 2:

$ sudo amazon-linux-extras install epel
$ sudo yum install fpart -y

Ubuntu:

$ sudo apt-get install fpart

Note: In Ubuntu, fpsync is part of the fpart package.

2.    Use fpsync to synchronize the /src directory to the /dst directory. Replace X with the number of rsync processes that you want to run in parallel.

$ sudo fpsync -n X /src /dst
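
Whichever tool you use, a quick check after the copy is to list the files that exist under the source but not under the destination. The following sketch is one way to do that with find, sort, and comm in bash. The missing_files helper is hypothetical, and the temporary directories stand in for /src and /dst. An empty result shows only that every relative path exists on both sides, not that the contents are identical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# missing_files SRC DST: print relative paths present under SRC but not under DST.
# Hypothetical helper for illustration.
missing_files() {
  comm -23 <(cd "$1" && find . -type f | sort) <(cd "$2" && find . -type f | sort)
}

# Demo: simulate an incomplete copy, detect it, and then complete it.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/sub"
printf 'a\n' > "$src/a.txt"
printf 'b\n' > "$src/sub/b.txt"
cp "$src/a.txt" "$dst/a.txt"

missing=$(missing_files "$src" "$dst")
echo "missing before fix: $missing"

mkdir -p "$dst/sub"
cp "$src/sub/b.txt" "$dst/sub/b.txt"
remaining=$(missing_files "$src" "$dst")
echo "missing after fix: ${remaining:-none}"
```

If you copied with the rsync -avR command from the GNU parallel section, the files are stored under /dst/src, so compare /src with /dst/src in that case.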

AWS OFFICIAL
Updated a year ago
2 Comments

How about AL2023? I found that all the sync tools mentioned cannot be installed via yum or dnf in AL2023

replied 6 days ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR
replied 5 days ago