Hi David,
I understand that you're trying to create a new cluster by using the AMI of the head node of another running cluster as custom_ami, but please correct me if I have misunderstood anything here.
If that's the case, I have to confirm that it cannot work. You can't reuse an AMI taken from a running cluster instance as the base AMI for a new cluster.
The reason is that during instance bootstrap ParallelCluster executes different configuration actions depending on whether the instance is the head node or a compute node of the cluster.
By using the head node AMI, you're trying to create a new cluster from an image on which the configuration steps have already been executed, so this AMI cannot work properly and cannot be used for compute nodes.
If you're using the "Modify an AWS ParallelCluster AMI" approach you should always start from the AMIs in this list: https://github.com/aws/aws-parallelcluster/blob/v2.10.0/amis.txt
See more details here: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html#modify-an-aws-parallelcluster-ami
Let us know if it helps.
Edited by: enrico-aws on Dec 2, 2020 8:20 AM
Hi @ProlucidDavid
Regarding "the resulting instance fails one of its status checks and is inaccessible":
From what you are describing, it seems that one of the modifications made to the instance is causing a problem at the operating system level.
It's hard to say what the root cause is. It could be, for example:
- Failure to boot the operating system
- Failure to mount volumes correctly
- File system issues
- Incompatible drivers
- Kernel panic
I'll link some guides that could help in troubleshooting the root cause:
- https://aws.amazon.com/it/premiumsupport/knowledge-center/system-reachability-check/
- https://aws.amazon.com/it/premiumsupport/knowledge-center/ec2-linux-status-check-failure-os-errors/
- https://aws.amazon.com/it/premiumsupport/knowledge-center/ec2-linux-system-status-check-failure/
Please note that you should be able to start an instance from the custom AMI you have created even outside of a cluster creation process.
So, before using the AMI to create a cluster, please make sure you are able to start an instance from it.
That said, it's not clear what process you followed to create the custom AMI, but it should be OK since you made it through on your first attempt. Please double-check that you followed the official doc: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html
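To check that an instance launched from the custom AMI is healthy before building a cluster on it, the two EC2 status checks can be inspected programmatically. The following is a minimal sketch, not an official procedure: it assumes the response dictionary comes from boto3's `describe_instance_status` call, and the helper name `status_checks_passed` is hypothetical.

```python
# Sketch: decide whether an instance launched from a custom AMI passed both
# EC2 status checks. The response is assumed to have the shape returned by
# boto3, e.g.:
#   boto3.client("ec2").describe_instance_status(InstanceIds=[instance_id])
# The helper name is hypothetical and only used for illustration.

def status_checks_passed(response: dict, instance_id: str) -> bool:
    """Return True only if both the system and instance checks report 'ok'."""
    for entry in response.get("InstanceStatuses", []):
        if entry.get("InstanceId") != instance_id:
            continue
        system_ok = entry.get("SystemStatus", {}).get("Status") == "ok"
        instance_ok = entry.get("InstanceStatus", {}).get("Status") == "ok"
        return system_ok and instance_ok
    # No entry for this instance (for example, it is not running yet).
    return False
```

If this returns False for a running instance, the troubleshooting guides linked above are the place to start before trying the AMI in a cluster.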
I never found the source of the failed AWS launch. However, by following the instructions in your last link (https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html), I was able to create a new template AMI that ParallelCluster can launch successfully.
One issue to be aware of: the template AMI that you start working with must match the version of ParallelCluster. I spent several hours troubleshooting because of this mismatch.
Hi David,
we introduced a validation for the createami process as part of the 2.10 release: https://github.com/aws/aws-parallelcluster/releases
createami:
- Add validation step to fail when using a base AMI created by a different version of ParallelCluster.
- Add validation step for AMI creation process to fail if the selected OS and the base AMI OS are not consistent.
Could you confirm which version you are using by executing the pcluster version command?
Thanks
pcluster v2.9.1 was unable to launch an AMI that was based off of version v2.10.0. There wasn't good visibility on the cause of this behaviour (pcluster didn't report an incompatible image). It looked like the munge service was failing.
pcluster v2.10.0 successfully launched an AMI that was based off of version v2.10.0.
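A version mismatch like the one above can be caught before launching anything. The sketch below compares the output of `pcluster version` with the version embedded in the base AMI name. It rests on an assumption: that the base AMI name starts with `aws-parallelcluster-<version>-`. The AMI name used in the usage note is made up, the helper names are hypothetical, and this is not how the built-in createami validation is implemented.

```python
import re

# Assumption: base AMI names begin with "aws-parallelcluster-<version>-".
# Verify this against the names of the AMIs you actually use.
AMI_NAME_PATTERN = re.compile(r"^aws-parallelcluster-(\d+\.\d+\.\d+)-")


def ami_version_from_name(ami_name: str):
    """Return the version embedded in the AMI name, or None if absent."""
    match = AMI_NAME_PATTERN.match(ami_name)
    return match.group(1) if match else None


def versions_match(cli_version_output: str, ami_name: str) -> bool:
    """Compare the text printed by `pcluster version` with the AMI version."""
    cli_version = cli_version_output.strip().lstrip("v")
    return ami_version_from_name(ami_name) == cli_version
```

For example, `versions_match("2.9.1", "aws-parallelcluster-2.10.0-example")` is False, which corresponds to the failing combination described above.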
I am re-opening this forum post.
Previously I had mentioned that by following these steps I could make a custom AMI with installed software that pcluster could launch: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html
However, if I take an AMI of the resulting pcluster instance, I am unable to restore it. The steps that I take are:
- Use pcluster to launch the HPC cluster using the custom AMI
- Take an AMI of the master node using the web console
- Use the same configuration file that originally launched the HPC cluster, but change the custom_ami setting from my successful base image to the AMI taken in the previous step.
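For reference, the configuration change in the last step can be sketched in Python as follows. This assumes the ParallelCluster 2.x INI-style configuration file with a `[cluster default]` section. The helper name `set_custom_ami` is hypothetical, and the key name and AMI IDs in the usage note are placeholders. As the replies in this thread explain, the AMI set here needs to derive from an official ParallelCluster base AMI rather than from a node of a running cluster.

```python
import configparser
import io


def set_custom_ami(config_text: str, ami_id: str, cluster_label: str = "default") -> str:
    """Return a copy of the config text with custom_ami set for one cluster section."""
    # Interpolation is disabled so that '%' characters in values are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    section = f"cluster {cluster_label}"
    if not parser.has_section(section):
        raise KeyError(f"no [{section}] section in the configuration")

    parser.set(section, "custom_ami", ami_id)

    # Note: configparser drops comments when it writes the file back out.
    output = io.StringIO()
    parser.write(output)
    return output.getvalue()
```

Calling `set_custom_ami(text, "ami-0fedcba9876543210")` on a config whose `[cluster default]` section contains `custom_ami = ami-0123456789abcdef0` returns the same configuration with only that value replaced. Working on a copy of the text keeps the original configuration file available for comparison.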
Regarding the failure you saw with version 2.9.1: unfortunately it's expected, because the validation steps that check the AMI version were only added in the 2.10.0 release.
Hi Enrico,
You understand correctly. I really appreciate your followups since we can now confirm that this is expected behaviour.
For context (in case there are any devs reviewing this thread) we had two goals by investigating this:
- Develop a strategy to take images so we could restore the system if it ever failed.
- It would be easier to develop our base AMI by installing software on a running HPC. In this way, we can verify that the software works as expected when compute nodes are instantiated. We can still test on an HPC, but then we need to reinstall on a base AMI and take an image, so it's an extra setup step.
But that's ok for now. The information on this thread has given us a path forward. I appreciate all your help
Regarding the version check, that's a great feature that's been added! Thanks for confirming the behaviour
Edited by: ProlucidDavid on Dec 3, 2020 7:44 AM
For completeness I just want to mention another way to customize your cluster, using custom bootstrap actions: https://docs.aws.amazon.com/parallelcluster/latest/ug/pre_post_install.html
The great thing about this approach is that it removes the extra step of creating a custom AMI, but it is a good choice only if the customization steps don't take too much time or are only required on the head node.
Anyway thanks for the feedback.
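As a rough illustration of the bootstrap-action approach, a post-install script can branch on the node type so that slow steps run only on the head node. The sketch below is written under assumptions that should be checked against the linked documentation for your version: that ParallelCluster 2.x writes `key=value` settings to `/etc/parallelcluster/cfnconfig`, and that `cfn_node_type` is `MasterServer` on the head node and `ComputeFleet` on compute nodes. The helper names are hypothetical.

```python
# Assumed location of the node settings file in ParallelCluster 2.x.
CFNCONFIG_PATH = "/etc/parallelcluster/cfnconfig"


def read_cfnconfig(text: str) -> dict:
    """Parse shell-style key=value lines, ignoring blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def is_head_node(settings: dict) -> bool:
    """Assumption: the head node reports cfn_node_type=MasterServer."""
    return settings.get("cfn_node_type") == "MasterServer"


def main() -> None:
    with open(CFNCONFIG_PATH) as handle:
        settings = read_cfnconfig(handle.read())
    if is_head_node(settings):
        print("head node: run head-only customization here")
    else:
        print("compute node: run compute customization here")


if __name__ == "__main__":
    main()
```

Keeping the parsing separate from `main` makes it possible to test the branching logic on a workstation, without a cluster.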