Spot instance integration with automl interactive runner


I'm working on an ML project with the following workflow:

  1. Use SageMaker Studio to propose test configurations for my project.
  2. Export them to SageMaker notebooks.
  3. Adjust the instance types.
  4. Run the project from the SageMaker notebook.

Step 1 produces a notebook with 4-6 candidates, which are trained initially before hyperparameter optimization is run later on. The whole process uses the sagemaker_autoML.AutoMLInteractiveRunner pipeline. An example candidate before modification is shown below:

automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp0",
        "training_resource_config": {
            "instance_type": "ml.m5.12xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.12xlarge",
            "instance_count": 1,
        },
    }
})

My models are usually trained on only ~700 samples and tested on ~100, so I've found that changing the instance types to ml.g4dn.xlarge for training and ml.m5.large for inference cuts the cost by a factor of 3 and is even faster due to GPU acceleration. I've been able to modify the instance types this way without issue, but I'd also like to enable spot instances. My attempted changes to the candidates look like this:

automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp3",
        "training_resource_config": {
            "instance_type": "ml.g4dn.xlarge",  # works
            "instance_count": 1,
            "use_spot_instances": True,  # does not work
            "max_run": 1800,  # does not work
            "max_wait": 3600,  # does not work
            "volume_size_in_gb": 50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.large",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.g4dn.xlarge",  # works
            "instance_count": 1,
            "use_spot_instances": True,  # does not work
            "max_run": 1800,  # does not work
            "max_wait": 3600  # does not work
        },
    }
})

However, when I look at the training jobs this notebook creates after the modification, managed spot training is shown as disabled, which is further confirmed by the billable time and training time being equal. My questions are as follows:

  1. How can I turn on spot instances for training these candidates?
  2. How can I turn on spot instances for hyperparameter optimization?
  3. Where is the documentation for sagemaker_autoML.AutoMLInteractiveRunner?

Any other recommendations on how to streamline this workflow are also welcome!
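
The check described in the question (billable time equal to training time) can also be scripted. The following is a minimal sketch, assuming a dict shaped like a SageMaker DescribeTrainingJob response with the fields EnableManagedSpotTraining, TrainingTimeInSeconds, and BillableTimeInSeconds. The function name spot_summary and the example values are hypothetical and are not taken from a real job.

```python
def spot_summary(job):
    """Summarize managed spot usage from a DescribeTrainingJob-style dict.

    Assumption: `job` carries the fields EnableManagedSpotTraining,
    TrainingTimeInSeconds and BillableTimeInSeconds.
    """
    enabled = bool(job.get("EnableManagedSpotTraining", False))
    training = job.get("TrainingTimeInSeconds")
    billable = job.get("BillableTimeInSeconds")

    savings = None
    if training and billable is not None:
        # Savings are the share of training time that was not billed.
        savings = round((1 - billable / training) * 100, 1)

    return {"spot_enabled": enabled, "savings_percent": savings}


# Hypothetical values mirroring the symptom in the question:
# billable time equals training time, so nothing was saved.
example = {
    "EnableManagedSpotTraining": False,
    "TrainingTimeInSeconds": 600,
    "BillableTimeInSeconds": 600,
}
print(spot_summary(example))
```

With boto3, the input dict would typically come from describe_training_job on a SageMaker client. That call is left out here so the sketch runs without AWS credentials.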
1 Answer

Hello,

Thank you for using SageMaker Service.

Autopilot does not support GPU training for tabular use cases. To investigate any issues with the jobs created by Autopilot further, we would require the job ARN and other key details, which we do not recommend sharing via this portal. I would strongly encourage you to open a case with Support engineering if the issue persists.

You can open a support case with AWS using this link:

https://console.aws.amazon.com/support/home?#/case/create
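
For context, the three settings tried in the question have counterparts in a standalone CreateTrainingJob request: EnableManagedSpotTraining, plus MaxRuntimeInSeconds and MaxWaitTimeInSeconds under StoppingCondition. The sketch below only builds those request fields as a plain dict. The helper name spot_training_fields is hypothetical, and whether AutoMLInteractiveRunner forwards such settings is not confirmed anywhere in this thread.

```python
def spot_training_fields(max_run, max_wait, checkpoint_s3_uri=None):
    """Build the spot-related fields of a CreateTrainingJob-style request.

    Assumption: this is for a standalone training job, not for the
    Autopilot candidate dict shown in the question.
    """
    # The wait time includes the run time, so it cannot be shorter.
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")

    fields = {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run,
            "MaxWaitTimeInSeconds": max_wait,
        },
    }
    if checkpoint_s3_uri:
        # Checkpoints let an interrupted spot job resume instead of restarting.
        fields["CheckpointConfig"] = {"S3Uri": checkpoint_s3_uri}
    return fields


# Same values as in the question: 30 minutes of run time, 60 minutes of wait.
fields = spot_training_fields(max_run=1800, max_wait=3600)
print(fields)
```

These fields would be merged into a full training job request alongside the algorithm, role, and data settings, which are omitted here.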

AWS
Support Engineer
Answered 6 months ago
