Is it possible to use smddp in a notebook?

I recently tried smddp (smdistributed.dataparallel) v1.4.0 on a SageMaker notebook instance (not SageMaker Studio), an 8-GPU ml.p3.16xlarge, by setting smddp as the backend directly in the training script. I launched the estimator with instance_type set to local_gpu, and the job failed with an smddp initialization error. The full output is below.

42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 636, in <module>
42u1m0wni0-algo-1-36bbw | main()
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 178, in main
42u1m0wni0-algo-1-36bbw | dist.init_process_group(backend=args.dist_backend)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
42u1m0wni0-algo-1-36bbw | store, rank, world_size = next(rendezvous_iterator)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _env_rendezvous_handler
42u1m0wni0-algo-1-36bbw | rank = int(_get_env_or_raise("RANK"))
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
42u1m0wni0-algo-1-36bbw |     raise _env_error(env_var)
42u1m0wni0-algo-1-36bbw | ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
42u1m0wni0-algo-1-36bbw | Running smdistributed.dataparallel v1.4.0
42u1m0wni0-algo-1-36bbw | Error in atexit._run_exitfuncs:
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
42u1m0wni0-algo-1-36bbw | hm.shutdown()
42u1m0wni0-algo-1-36bbw | RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    Reporting training FAILURE
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
42u1m0wni0-algo-1-36bbw | ExitCode 1
42u1m0wni0-algo-1-36bbw | ErrorMessage "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw |  Environment variable SAGEMAKER_INSTANCE_TYPE is not set Error in atexit._run_exitfuncs: Traceback (most recent call last):   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp hm.shutdown() RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h"

My original goal is to run smddp on a single node for debugging.
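
To narrow this down, a small check like the one below could be run at the top of the training script, before dist.init_process_group is called. It is only a sketch: missing_rendezvous_vars is a hypothetical helper name, and it assumes the default env:// init method, which reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment. It does not touch smddp itself; it only reports which of those variables the launcher did not set.

```python
import os

# Variables read by torch.distributed's env:// rendezvous (assumed here
# to be the init method in use, as the traceback above suggests).
REQUIRED_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_rendezvous_vars(env=None):
    """Return the env:// rendezvous variables that are absent from env.

    Hypothetical helper for debugging; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All env:// rendezvous variables are set")
```

If RANK shows up as missing inside the local_gpu container, that would match the ValueError in the log and suggest the script was started as a plain Python process rather than by a distributed launcher.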

Does smddp only work when the job is launched through the AWS Python SDK, rather than from a notebook? Or is something in my setup incorrect?
