I'm using SageMaker Training Compiler with a Hugging Face Trainer API (PyTorch) program that, for maintainability, I've split into multiple .py files. The job needs to run on multiple GPUs (although at the current scale, a single node with multiple devices would also be acceptable).
Following the steps in the documentation, I added the distributed_training_launcher.py launcher script to my source_dir and pass the real training script via the training_script hyperparameter.
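For reference, my estimator setup looks roughly like the sketch below (script names, framework versions, and the instance type reflect my setup and are illustrative; role is assumed to be defined elsewhere):

```python
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# Minimal sketch: the launcher from the docs is the entry point, and the
# real training script is handed over via the training_script hyperparameter.
estimator = HuggingFace(
    entry_point="distributed_training_launcher.py",  # launcher added to source_dir
    source_dir="src",                                # contains train.py + helper modules
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    transformers_version="4.11",
    pytorch_version="1.9",
    py_version="py38",
    role=role,  # assumed to be defined elsewhere
    hyperparameters={
        "training_script": "train.py",  # the actual (multi-file) training entry
        # ...model/training hyperparameters...
    },
    compiler_config=TrainingCompilerConfig(),  # enables SageMaker Training Compiler
)
estimator.fit()
```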
But when the job tries to launch, I get the following error:
```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 90, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 86, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_gpus)
AttributeError: module 'train' has no attribute '_mp_fn'
```
Any ideas what could be causing this? Are there specific restrictions or extra requirements for training scripts that are split across multiple files?
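From the traceback, xla_spawn.py imports the module named by training_script and calls its _mp_fn, so I'm guessing my train.py needs to expose a top-level _mp_fn. A minimal sketch of what I think it expects (this is my reading of the traceback, not something I've found documented for multi-file scripts):

```python
# train.py -- top-level training entry point
def main():
    # ...parse args, build the Trainer, call trainer.train()...
    pass

def _mp_fn(index):
    # xla_spawn.py spawns one process per device and calls mod._mp_fn(index);
    # each spawned process just runs the normal main().
    main()

if __name__ == "__main__":
    main()
```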
I also tried running in single-GPU mode (on a p3.2xlarge), invoking the training script directly instead of going through the distributed launcher, and saw the error below, which seems to originate from TrainingArguments itself? Not sure why it tries to call into 'tensorflow/compiler' when I'm running PT..?

Edit: I later found that the error below can be resolved by explicitly setting n_gpus, as mentioned in the troubleshooting docs (see the sketch after the traceback), but that just brings me back to the error message above.
File "/opt/ml/code/code/config.py", line 124, in __post_init__
super().__post_init__()
File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 761, in __post_init__
if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 975, in device
return self._setup_devices
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1754, in __get__
cached = self.fget(obj)
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 918, in _setup_devices
device = xm.xla_device()
File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
devices = get_xla_supported_devices(
File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
xla_devices = _DEVICES.value
File "/opt/conda/lib/python3.8/site-packages/torch_xla/utils/utils.py", line 32, in value
self._value = self._gen_fn()
File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
_DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:273 : Missing XLA configuration
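For completeness, this is roughly the single-GPU estimator I ran with the explicit GPU count (a sketch under the same assumptions as above; "n_gpus" is the hyperparameter name I took from the troubleshooting doc, so double-check it against your docs version):

```python
# Single-GPU attempt (ml.p3.2xlarge), calling train.py directly instead of
# the distributed launcher. Explicitly passing the GPU count (per the
# troubleshooting doc) is what got me past "Missing XLA configuration".
estimator = HuggingFace(
    entry_point="train.py",
    source_dir="src",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.11",
    pytorch_version="1.9",
    py_version="py38",
    role=role,  # assumed to be defined elsewhere
    hyperparameters={
        "n_gpus": 1,  # hyperparameter name as given in the troubleshooting doc
        # ...model/training hyperparameters...
    },
    compiler_config=TrainingCompilerConfig(),
)
```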