1 Answer
Hi!
I think ParallelCluster by default supports running multiple jobs on a node with multiple GPUs.
Based on my testing, I couldn't reproduce the issue with the g3.16xlarge instances that have 4 GPUs. When submitting two jobs, each requiring one GPU, Slurm correctly assigned both jobs to the same node, with each job using one GPU.
Could you check the CPU and GPU allocation of the jobs by running scontrol show jobs --details within 5 minutes after the job finishes? (By default, Slurm keeps finished jobs visible to scontrol for about 5 minutes, controlled by MinJobAge, which defaults to 300 seconds.)
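For reference, if you are submitting batch jobs instead of srun, a minimal sbatch script sketch along these lines should show the same behavior (the partition name queue-1 matches my test cluster; adjust it and the workload to your setup):
#!/bin/bash
#SBATCH --partition=queue-1      # adjust to your cluster's partition
#SBATCH --gpus-per-node=1        # each job requests a single GPU
#SBATCH --ntasks=1
nvidia-smi -L                    # list the GPU(s) visible to this job
sleep 120
Submitting that script twice with sbatch should land both jobs on the same multi-GPU node, just like the srun example below.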
Here's an example:
- Submit 2 jobs, each requiring 1 GPU
[ec2-user@ip-192-168-60-235 ~]$ srun -p queue-1 --gpus-per-node=1 sleep 120
- The two jobs are running on the same node, each using 1 GPU
[ec2-user@ip-192-168-60-235 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 queue-1 sleep ec2-user R 1:09 1 queue-1-dy-test-resume-1
4 queue-1 sleep ec2-user R 0:09 1 queue-1-dy-test-resume-1
- Run scontrol show jobs --details within 5 minutes of the jobs finishing. As shown in the output below, you can see:
- Job 3 is using Nodes=queue-1-dy-test-resume-1 CPU_IDs=0, GRES=gpu:m60:1(IDX:0)
- Job 4 is using the same node Nodes=queue-1-dy-test-resume-1 but a different CPU (CPU_IDs=1) and a different GPU (GRES=gpu:m60:1(IDX:1))
[ec2-user@ip-192-168-60-235 ~]$ scontrol show jobs --details
JobId=3 JobName=sleep
UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
Priority=4294901757 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:01:15 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-07T23:02:38 EligibleTime=2023-08-07T23:02:38
AccrueTime=Unknown
StartTime=2023-08-07T23:02:38 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-07T23:02:38 Scheduler=Main
Partition=queue-1 AllocNode:Sid=ip-192-168-60-235:9602
ReqNodeList=(null) ExcNodeList=(null)
NodeList=queue-1-dy-test-resume-1
BatchHost=queue-1-dy-test-resume-1
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=474726M,node=1,billing=1
AllocTRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
JOB_GRES=gpu:m60:1
Nodes=queue-1-dy-test-resume-1 CPU_IDs=0 Mem=0 GRES=gpu:m60:1(IDX:0)
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=sleep
WorkDir=/home/ec2-user
Power=
TresPerNode=gres:gpu:1
JobId=4 JobName=sleep
UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
Priority=4294901756 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:15 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-07T23:03:38 EligibleTime=2023-08-07T23:03:38
AccrueTime=Unknown
StartTime=2023-08-07T23:03:38 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-07T23:03:38 Scheduler=Main
Partition=queue-1 AllocNode:Sid=ip-192-168-60-235:10855
ReqNodeList=(null) ExcNodeList=(null)
NodeList=queue-1-dy-test-resume-1
BatchHost=queue-1-dy-test-resume-1
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=474726M,node=1,billing=1
AllocTRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
JOB_GRES=gpu:m60:1
Nodes=queue-1-dy-test-resume-1 CPU_IDs=1 Mem=0 GRES=gpu:m60:1(IDX:1)
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=sleep
WorkDir=/home/ec2-user
Power=
TresPerNode=gres:gpu:1
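If the scontrol output gets long with many jobs, a quick way to pull out just the allocation lines is to filter it, for example (just a convenience one-liner, not required):
scontrol show jobs --details | grep -E 'JobId=|CPU_IDs'
Each matching Nodes=... CPU_IDs=... GRES=... line shows which CPU IDs and GPU index (IDX) a job was given.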
Let me know if you can find any useful information from scontrol show jobs --details.
Thanks!