SageMaker Experiments: two runs created for one training job

I want to create a Training Job on SageMaker and associate both performance metrics and a model artifact with it. However, I have two problems with this:

  • In the SageMaker "Experiments" section, I see that two runs are created for one run of the notebook. One contains the performance metrics (this is the run I created manually); the other contains the artifact (this run is created automatically by the training job).
  • I tried to circumvent this problem by explicitly attaching the artifact file to the "manual" run via run.log_file(file_path=filepath, name="model"). This should upload the file to S3 and attach it to the run. However, I get the following error, indicating that the S3 bucket is not accessible: botocore.exceptions.ClientError: An error occurred (404) when calling the HeadBucket operation: Not Found.

My questions:

  • How can I avoid the creation of two runs in the first place, so that I end up with one run that has both the metrics and the artifact attached?
  • Where can I change the settings for my training job so that it has access to the S3 bucket and can upload the artifact file?

Here is my code in a shortened version:

1. my notebook

import sagemaker
from sagemaker.experiments import Run
from sagemaker.sklearn.estimator import SKLearn

sm_session = sagemaker.Session(
    sagemaker_config=sagemaker.config.config.load_sagemaker_config()
)
...
with Run(
    experiment_name=experiment_name,
    run_name=run_name,
    run_display_name=run_name,
    sagemaker_session=sm_session
) as run:

    experiment_config = run.experiment_config
    experiment_config.update({
        "TrialName": run_name,
        "TrialComponentDisplayName": run_name,
    })

    estimator = SKLearn(
        source_dir="training_job",
        entry_point="train.py",
        dependencies=["."],
        framework_version="1.2-1",
        instance_type="ml.m5.large",
        disable_output_compression=True,
        sagemaker_session=sm_session,
        experiment_config=experiment_config,
    )
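The snippet stops before the job is actually launched. For completeness: in the SageMaker Python SDK, experiment_config can also be passed to estimator.fit() when the job is started. A minimal sketch of building that dict, mirroring the experiment_config.update(...) above (the helper name is my own, not part of the SDK):

```python
def training_job_experiment_config(experiment_name, run_name):
    """Build the experiment_config dict passed to estimator.fit().

    Overriding TrialName and TrialComponentDisplayName steers the trial
    component created by the training job toward the manually created
    run's names, mirroring the experiment_config.update(...) above.
    """
    return {
        "ExperimentName": experiment_name,
        "TrialName": run_name,
        "TrialComponentDisplayName": run_name,
    }

# Hypothetical launch step (inputs elided in the shortened notebook):
# estimator.fit(inputs, experiment_config=training_job_experiment_config(
#     experiment_name, run_name))
```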

2. train.py

import os

import boto3
from sagemaker.experiments import load_run
from sagemaker.session import Session

boto_session = boto3.session.Session(region_name=os.environ["AWS_REGION"])
sagemaker_session = Session(boto_session=boto_session)

if __name__ == "__main__":
    ...
    filepath = f"{os.environ['SM_MODEL_DIR']}/{args.model_name}.joblib"
    with load_run(sagemaker_session=sagemaker_session) as run:
        run.log_metric(
            f"validation_{metric}", fold_metric
        )
        # The following produces the ClientError
        run.log_file(file_path=filepath, name="model")
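For reference, the args used above would come from an argparse block elided in the shortened script ("..."). A minimal sketch; the --bucket argument is an assumption, anticipating that the artifact bucket has to be passed into the script explicitly:

```python
import argparse


def parse_args(argv=None):
    # Minimal sketch of the argument parsing elided above ("...").
    # --model-name feeds args.model_name; --bucket is a hypothetical
    # argument carrying the S3 bucket for run.log_file() artifacts.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", dest="model_name", type=str, default="model")
    parser.add_argument("--bucket", type=str, default=None)
    return parser.parse_args(argv)
```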
1 Answer
Accepted Answer

As far as I can see, it is standard behavior that an experiment run is created for training jobs (not well documented, though; I found it here: https://docs.aws.amazon.com/sagemaker/latest/dg/experiment-faq.html). I ended up tracking my artifact explicitly with run.log_file() in the experiment run I created on purpose, and I ignore the run created automatically for the training job. Regarding the ClientError: apparently, the artifact_bucket argument from the experiment config is not passed on to the training job. You have to specify it manually, like this:

    with load_run(
        sagemaker_session=sagemaker_session,
        artifact_bucket=args.bucket,
    ) as run:
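For args.bucket to exist in train.py, the bucket name has to reach the script somehow. One way is a hyperparameter on the estimator, since script-mode hyperparameters are forwarded to the entry point as command-line arguments. A sketch; the helper and key names are assumptions matching a --bucket/--model-name parser in train.py:

```python
def build_hyperparameters(bucket_name, model_name="model"):
    # Hypothetical: each key arrives in train.py as "--<key> <value>",
    # so "bucket" becomes "--bucket <bucket_name>" on the training
    # container's command line and ends up in args.bucket.
    return {
        "bucket": bucket_name,
        "model-name": model_name,
    }

# e.g. SKLearn(..., hyperparameters=build_hyperparameters("my-artifact-bucket"))
```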
tadaki
answered 13 days ago
reviewed 12 days ago
