1 Answer
[The following answer has been translated] Hello,
For a custom SageMaker container or deep learning framework, I prefer to handle checkpointing this way. Below is an example I have tried with PyTorch.
- Entry point file:
import argparse

# 1. Define a custom argument, say checkpointdir
parser = argparse.ArgumentParser()
parser.add_argument("--checkpointdir", help="The checkpoint dir", type=str,
                    default=None)
# 2. You can add additional params for checkpoint frequency etc.
args = parser.parse_args()

# 3. Code for checkpointing
if args.checkpointdir is not None:
    pass  # TODO: save model
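A minimal sketch of what that TODO could look like, assuming a typical PyTorch training loop; the model, optimizer, and epoch names are illustrative and not part of the original answer:

import os
import torch

def save_checkpoint(checkpointdir, model, optimizer, epoch):
    # Write a checkpoint into the local checkpoint dir; SageMaker syncs this
    # directory to the checkpoint_s3_uri configured on the estimator.
    os.makedirs(checkpointdir, exist_ok=True)
    path = os.path.join(checkpointdir, "checkpoint_epoch_{:04d}.pt".format(epoch))
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def load_latest_checkpoint(checkpointdir, model, optimizer):
    # Resume from the newest checkpoint if one was restored into the local dir,
    # e.g. after a spot interruption. Returns the epoch to resume from.
    if checkpointdir is None or not os.path.isdir(checkpointdir):
        return 0
    files = sorted(f for f in os.listdir(checkpointdir) if f.endswith(".pt"))
    if not files:
        return 0
    state = torch.load(os.path.join(checkpointdir, files[-1]))
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"] + 1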
- Jupyter notebook example with the SageMaker estimator:
# 1. Define local and remote variables for checkpoints
checkpoint_s3 = "s3://{}/{}/".format(bucket, "checkpoints")
localcheckpoint_dir = "/opt/ml/checkpoints/"
hyperparameters = {
    "batchsize": "8",
    "epochs": "1000",
    "learning_rate": .0001,
    "weight_decay": 5e-5,
    "momentum": .9,
    "patience": 20,
    "log-level": "INFO",
    "commit_id": commit_id,
    "model": "FasterRcnnFactory",
    "accumulation_steps": 8,
    # 2. Define the hyperparameter for the checkpoint dir
    "checkpointdir": localcheckpoint_dir
}
# In the SageMaker estimator, specify the local and remote checkpoint paths
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='experiment_train.py',
    source_dir='src',
    dependencies=['src/datasets', 'src/evaluators', 'src/models'],
    role=role,
    framework_version="1.0.0",
    py_version='py3',
    git_config=git_config,
    image_name=docker_repo,
    train_instance_count=1,
    train_instance_type=instance_type,
    # 3. The entry point file will pick up the checkpoint location from here
    hyperparameters=hyperparameters,
    output_path=s3_output_path,
    metric_definitions=metric_definitions,
    train_use_spot_instances=use_spot,
    train_max_run=train_max_run_secs,
    train_max_wait=max_wait_time_secs,
    base_job_name="object-detection",
    # 4. SageMaker knows that the checkpoints will need to be periodically copied
    #    from localcheckpoint_dir to the S3 location pointed to by checkpoint_s3
    checkpoint_s3_uri=checkpoint_s3,
    checkpoint_local_path=localcheckpoint_dir)
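Once estimator.fit(...) has started the job (with spot instances enabled here), SageMaker periodically copies the contents of checkpoint_local_path to checkpoint_s3_uri and restores them if an interrupted spot job resumes. A small, hedged way to confirm from the notebook that checkpoints are landing in S3, assuming the bucket variable from above and default boto3 credentials:

import boto3

# List the objects SageMaker has synced from the container's
# /opt/ml/checkpoints/ into the checkpoint S3 prefix.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=bucket, Prefix="checkpoints/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])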