Decrease storage capacity
- Take a backup of the source file system for safety
- Create a new Amazon FSx for Windows File Server file system, 5TB with an FSx throughput capacity of 64 MBps, which comes with 8GB of RAM. https://docs.aws.amazon.com/fsx/latest/WindowsGuide/performance.html Microsoft's recommendation is 1GB of RAM per 1TB of data you are trying to dedupe.
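For illustration, a minimal AWS CLI sketch of that creation step (all IDs below are placeholders; adjust the deployment and storage type to your needs):
# 5TB (5120 GiB) SSD file system with 64 MBps throughput, joined to your AD
aws fsx create-file-system `
    --file-system-type WINDOWS `
    --storage-type SSD `
    --storage-capacity 5120 `
    --subnet-ids subnet-0123456789abcdef0 `
    --security-group-ids sg-0123456789abcdef0 `
    --windows-configuration "ActiveDirectoryId=d-0123456789,ThroughputCapacity=64,DeploymentType=SINGLE_AZ_2"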
- Set up an aggressive dedupe schedule that runs directly after the data transfer completes.
Optimization schedule
# Edit variables below:
$DestRPSEndpoint = "amznfsxxxxxyyyy.mytestdomain.local"

# This start time is set for 20 seconds from now (ensure your file server's time zone is in UTC).
# The optimization job needs to start first.
$StartTime = (Get-Date).AddSeconds(20)

# A DurationHours of 8 causes the server to cancel the job after 8 hours if the process has not ended.
$DurationHours = 8

Invoke-Command -Authentication Kerberos -ComputerName ${DestRPSEndpoint} -ConfigurationName FSxRemoteAdmin -ScriptBlock {
    New-FSxDedupSchedule -Name CustomOptimization -Type Optimization -Start $Using:StartTime -Days Mon, Tues, Wed, Thurs, Fri, Sat, Sun -Cores 80 -DurationHours $Using:DurationHours -Memory 70 -Priority High
}

# Get status
Invoke-Command -Authentication Kerberos -ComputerName ${DestRPSEndpoint} -ConfigurationName FSxRemoteAdmin -ScriptBlock {
    Get-FSxDedupSchedule -Name CustomOptimization -Type Optimization
}
Garbage Collection
# This start time is for the garbage collection job, which is what reclaims free space and runs after optimization completes.
# The Microsoft default is 1 hour after optimization, but it all depends on how quickly optimization completes;
# if it finishes before 35 min, then change this value to what works for you.
$StartTime = (Get-Date).AddSeconds(20)

Invoke-Command -Authentication Kerberos -ComputerName ${DestRPSEndpoint} -ConfigurationName FSxRemoteAdmin -ScriptBlock {
    New-FSxDedupSchedule -Name "CustomGarbage" -Type GarbageCollection -Start $Using:StartTime -Days Mon, Tues, Wed, Thurs, Fri, Sat, Sun -DurationHours $Using:DurationHours
}

# Get status
Invoke-Command -ComputerName ${DestRPSEndpoint} -ConfigurationName FSxRemoteAdmin -ScriptBlock {
    Get-FSxDedupSchedule -Name CustomGarbage -Type GarbageCollection
}
- Use the AWS DataSync service or Robocopy to move the data in small batches of about 100GB per batch, from the source to the newly created 5TB file system
Note: Exclude the System Volume Information folder, which contains the dedupe chunk store data. Robocopy example:
/XD '$RECYCLE.BIN' "System Volume Information"
See the robocopy command used in the CloudFormation template in link [2].
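For illustration, a minimal robocopy sketch for one batch; the source path, FSx DNS name, and share name are placeholders, and the copy flags are just one reasonable starting point:
# Copy one ~100GB batch to the new FSx file system, preserving NTFS ACLs and excluding the dedupe chunk store
robocopy "\\source-fileserver\share\Batch01" "\\amznfsxxxxxyyyy.mytestdomain.local\share\Batch01" `
    /E /B /COPY:DATSOU /SECFIX /MT:8 /R:2 /W:5 `
    /XD '$RECYCLE.BIN' "System Volume Information" `
    /V /TEE /LOG+:"C:\RoboCopy.log"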
- Wait for the 100GB data transfer to complete and check the dedupe status to see whether it ran and has started deduping this 100GB batch (see the status-check example below). Wait until the dedup garbage collection job finishes and reclaims space, then repeat the process for the next 100GB chunk.
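Here is a sketch of that status check, assuming the remote-admin dedup cmdlets (the properties mirror Windows Server's Get-DedupStatus):
# Space savings and optimized file counts on the destination file system
Invoke-Command -Authentication Kerberos -ComputerName ${DestRPSEndpoint} -ConfigurationName FSxRemoteAdmin -ScriptBlock {
    Get-FSxDedupStatus
} | Select-Object OptimizedFilesCount, OptimizedFilesSize, SavedSpace, OptimizedFilesSavingsRate

# Any dedup jobs currently queued or running
Invoke-Command -Authentication Kerberos -ComputerName ${DestRPSEndpoint} -ConfigurationName FSxRemoteAdmin -ScriptBlock {
    Get-FSxDedupJob
}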
- Time how long it took to dedup 100GB, then adjust the batch size accordingly (maybe to 200GB) if needed.
- Recreate the SMB shares and permissions (see the share-recreation sketch after this list)
- Create any aliases or SPNs
- Terminate the 20TB file system and use the 5TB one
- The full workflow (excluding the dedupe tips) for share creation, SPNs, and DNS aliases can be found in link [1]
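For the share-recreation step, here is a minimal sketch using the FSx remote-management SMB cmdlets; the share name, path, and AD group are placeholders, and link [1] covers the SPN and DNS alias commands:
# Recreate a share on the new file system and grant change access (placeholder names)
Invoke-Command -Authentication Kerberos -ComputerName ${DestRPSEndpoint} -ConfigurationName FSxRemoteAdmin -ScriptBlock {
    New-FSxSmbShare -Name "share" -Path "D:\share" -Description "Migrated share" -FolderEnumerationMode AccessBased
    Grant-FSxSmbShareAccess -Name "share" -AccountName "MYTESTDOMAIN\FileShareUsers" -AccessRight Change -Force
}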
Change Availability Zone
Restoring from backup allows you to change subnets, which in turn changes the AZ for that file system.
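For example, with the AWS CLI (the backup, subnet, and security group IDs below are placeholders):
# Restore the backup into a different subnet, which places the new file system in that subnet's AZ
aws fsx create-file-system-from-backup `
    --backup-id backup-0123456789abcdef0 `
    --subnet-ids subnet-0123456789abcdef0 `
    --security-group-ids sg-0123456789abcdef0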
Additionally, to switch from Multi-AZ to Single-AZ:
You can automate the migration to a smaller file system using link [2]. The title says upgrade Single-AZ to Multi-AZ, but the same concept applies; it can also be used to move between two Single-AZ file systems or to move from Multi-AZ down to Single-AZ. It moves data, CNAME records, SPNs/aliases, share permissions (ACLs), etc.
In the CloudFormation example, I created a DataSync agent (EC2 instance) to cater for all migration scenarios, including cross-Region and cross-account migrations, but you can change that part of the code to use a normal DataSync source and destination location if needed.
References:
[1] https://docs.aws.amazon.com/fsx/latest/WindowsGuide/migrate-files-fsx.html
My pleasure, glad I can assist.
"How do you setup a deduplication schedule like this?"
I have updated my answer to include an example command. You can work out the timings and rerun those schedules by editing the start time. To edit an existing schedule use:
Set-FSxDedupSchedule
# Or remove them using:
Remove-FSxDedupSchedule -Name CustomOptimization
If you decide to remove the schedules instead of editing them, you can recreate them using different start times.
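For example, to shift the start time of the existing optimization schedule (a sketch, assuming Set-FSxDedupSchedule accepts the same -Name and -Start parameters as Windows Server's Set-DedupSchedule):
# Move the CustomOptimization schedule to start 5 minutes from now
$StartTime = (Get-Date).AddMinutes(5)
Invoke-Command -Authentication Kerberos -ComputerName ${DestRPSEndpoint} -ConfigurationName FSxRemoteAdmin -ScriptBlock {
    Set-FSxDedupSchedule -Name CustomOptimization -Start $Using:StartTime
}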
"For Step 5 "Wait for 100GB data transfer to complete and check the dedupe status to see if it ran and has started deduping this 100GB batch" - Do you recommend transferring the chunk of data, THEN have dedup run, WAIT until dedup finishes, then repeat with another chunk? "
Avoid transferring chunks while dedup is running, because dedup consumes RAM, CPU, and IOPS. The data transfer also consumes IOPS, so rather wait for dedup to process the data and then move on to the next chunk.
There is one public success story using the chunk method, where they split the data into chunks and each chunk/folder represented a DataSync task. See this blog: How ClearScale overcame data migration hurdles using AWS DataSync.
I have done a test in my lab: Amazon FSx for Windows File Server dedup optimization ran against 27GB of mixed-size data (128KB, 1GB, and a majority of 2MB files) on a 32 MBps throughput FSx and took 10 min 38 sec.
On a 1024 MBps throughput FSx with 20.8GB of mixed data (2MB average file size), the dedup optimization process ran START: 8:24:54 PM, END: 8:35:03 PM, total: 10 min 9 sec.
Please could you share your FSx throughput capacity, average file size, and the time it took for your FSx to run the optimization and garbage collection?
Have a great day further!
This was great - thank you.
I have some questions about the process:
For step 3 "Setup an aggressive dedupe schedule that runs every 5min" - How do you set up a deduplication schedule like this? I see the AWS doc that goes over the schedule, but it only covers days of the week and a single time of day to run.
For Step 5 "Wait for 100GB data transfer to complete and check the dedupe status to see if it ran and has started deduping this 100GB batch" - Do you recommend transferring the chunk of data, THEN have dedup run, WAIT until dedup finishes, then repeat with another chunk? - Or can we keep transferring chunks while it dedups?