"pcluster create" results in error during startup - how to address?


I am attempting to create a small ParallelCluster cluster on OSX with Anaconda. I installed and configured ParallelCluster with no errors. The resulting simple config file looks like this:

[aws]
aws_region_name = us-east-1

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = newmac
base_os = ubuntu1604
scheduler = sge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default

[vpc default]
vpc_id = vpc-058cb4b123bf54848
master_subnet_id = subnet-04998fb2f4bc80ccc
compute_subnet_id = subnet-03c7e26fb2ea81430
use_public_ips = false

When I tried to create the cluster, this is what I saw:

(base) esteban$ pcluster create -c /Users/esteban/.parallelcluster/config first-cluster
Beginning cluster creation for cluster: first-cluster
WARNING: The configuration parameter 'scheduler' generated the following warnings:
The job scheduler you are using (sge) is scheduled to be deprecated in future releases of ParallelCluster. More information is available here: https://github.com/aws/aws-parallelcluster/wiki/Deprecation-of-SGE-and-Torque-in-ParallelCluster
Creating stack named: parallelcluster-first-cluster
Status: parallelcluster-first-cluster - ROLLBACK_IN_PROGRESS
Cluster creation failed. Failed events:

  • AWS::AutoScaling::AutoScalingGroup ComputeFleet Resource creation cancelled

  • AWS::CloudFormation::WaitCondition MasterServerWaitCondition Received FAILURE signal with UniqueId i-0438bb35816b58aa5

I can still log in to the head node, although no SGE binaries are installed:

$ pcluster ssh first-cluster -i ~/.ssh/newmac.pem

So I just deleted the cluster, which appears to have (I hope) cleaned up all the associated services:

pcluster delete -c /Users/esteban/.parallelcluster/config first-cluster

Any ideas? Thanks.

Edited by: Stevie on Jul 23, 2020 11:42 AM


Stevie
asked 4 years ago · 481 views
3 Answers

Hi, Stevie.

The error message you posted, MasterServerWaitCondition Received FAILURE signal with UniqueId INSTANCE_ID, means there was an error when one of the nodes in the cluster was running its configuration tasks after booting. Usually this is due to a failure on the head node of the cluster.

You can start debugging the issue by passing the `-nr`/`--norollback` flag to `pcluster create`. With rollback disabled, the failed head node is left running, so if cluster creation fails again you can log in to it and look for the cause of the failure in /var/log/cfn-init.log.
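As a rough sketch of that workflow: the two shell functions below are hypothetical helpers, not part of the pcluster CLI. The first wraps the create command with rollback disabled; the second is meant to be run on the head node and assumes that the relevant lines in the log contain "error" or "fail" in some letter case, which may not catch every failure.

```shell
# Sketch only: the function names are hypothetical, not part of the pcluster CLI.

# Workstation: retry creation with rollback disabled so that a failed head node
# stays up for inspection. $1 = config file path, $2 = cluster name.
create_without_rollback() {
  pcluster create --norollback -c "$1" "$2"
}

# Head node: print the lines of the cfn-init log that mention an error or a
# failure, prefixed with their line numbers.
# $1 = log path (defaults to /var/log/cfn-init.log).
# Assumption: relevant lines contain "error" or "fail" in some letter case.
cfn_init_errors() {
  grep -n -i -E 'error|fail' "${1:-/var/log/cfn-init.log}"
}
```

With the paths from the question, this would be `create_without_rollback /Users/esteban/.parallelcluster/config first-cluster`, followed by `pcluster ssh first-cluster -i ~/.ssh/newmac.pem` and then `cfn_init_errors` on the head node. Reading the log may require elevated permissions, depending on the image.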

This wiki page contains tips for debugging common cluster creation failures: https://github.com/aws/aws-parallelcluster/wiki/Stack-Creation-Failures

Let us know what you find in the logs, and we'll try and provide more specific guidance based on that.

~Tim

answered 4 years ago

Thank you. I had to move on to another project for a bit, but I will try the suggested remedies. One thing I noticed after deleting the failed cluster: although pcluster delete appeared to have deleted the cluster, I had to go in and manually remove components (VPC, NAT gateway, etc.). In my case, it seemed that the failed create impaired the ability of pcluster delete to back out all supporting components.

Stevie
answered 4 years ago

Hi Stevie,

How did you create the VPC, NAT gateway, and the other network components?
You can use your own default resources, create them by hand, or create them through the pcluster configure command.

In any case, the pcluster create command expects these resources to already exist, so the delete command doesn't delete them.

This is by design: the same VPC can be used for multiple clusters, and there are limits on the number of VPCs, Elastic IPs, etc. that a user can create, so we decided to decouple the VPC and network resources from the cluster's own resources. It is not related to the issue with your previous failed creation.
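If you want to see which network resources are still attached to a VPC after deleting a cluster, a read-only check with the AWS CLI is one option. This is a minimal sketch: `list_vpc_leftovers` is a hypothetical helper name, and it assumes the AWS CLI is installed and configured for the same region as the cluster.

```shell
# Sketch only: list_vpc_leftovers is a hypothetical helper, not a pcluster command.
# It only describes resources and deletes nothing. $1 = VPC ID.
list_vpc_leftovers() {
  # Subnets that still exist in the VPC
  aws ec2 describe-subnets --filters "Name=vpc-id,Values=$1"
  # NAT gateways that still exist in the VPC
  aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=$1"
}
```

For the VPC in the question this would be `list_vpc_leftovers vpc-058cb4b123bf54848`. Anything it lists was created outside the cluster stack and has to be removed separately if it is no longer needed.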

Let us know if it helps.

AWS
answered 4 years ago
