I'm creating a pipeline with multiple steps: one preprocesses a dataset, and the other takes the preprocessed data as input to train a BlazingText model for classification.
My first ProcessingStep outputs augmented manifest files:
step_process = ProcessingStep(
    name="Nab3Process",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=raw_input_data, destination=raw_dir),
        ProcessingInput(source=categories_input_data, destination=categories_dir)
    ],
    outputs=[
        ProcessingOutput(output_name="train", source=train_dir),
        ProcessingOutput(output_name="validation", source=validation_dir),
        ProcessingOutput(output_name="test", source=test_dir),
        ProcessingOutput(output_name="mlb_train", source=mlb_data_train_dir),
        ProcessingOutput(output_name="mlb_validation", source=mlb_data_validation_dir),
        ProcessingOutput(output_name="mlb_test", source=mlb_data_test_dir),
        ProcessingOutput(output_name="le_vectorizer", source=le_vectorizer_dir),
        ProcessingOutput(output_name="mlb_vectorizer", source=mlb_vectorizer_dir)
    ],
    code=preprocessing_dir
)
But I'm having a hard time feeding my train output as a TrainingInput to the training step:
step_train = TrainingStep(
    name="Nab3Train",
    estimator=bt_train,
    inputs={
        "train": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type="application/x-recordio",
            s3_data_type='AugmentedManifestFile',
            attribute_names=['source', 'label'],
            input_mode='Pipe',
            record_wrapping='RecordIO'
        ),
        "validation": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            distribution="FullyReplicated",
            content_type='application/x-recordio',
            s3_data_type='AugmentedManifestFile',
            attribute_names=['source', 'label'],
            input_mode='Pipe',
            record_wrapping='RecordIO'
        )
    }
)
And I'm getting the following error:
'FailureReason': 'ClientError: Could not download manifest file with S3 URL "s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train". Please ensure that the bucket exists in the selected region (us-east-1), that the manifest file exists at that S3 URL, and that the role "arn:aws:iam::xxxxxxxxxx:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole" has "s3:GetObject" permissions on the manifest file. Error message from S3: The specified key does not exist.'
What should I do?
Are you able to view the files from your notebook? For example, run aws s3 ls on the prefix and make sure the manifest file exists there. I would also check that your processing job executed successfully and actually wrote the file to that location. Since the bucket name has sagemaker in it, ServiceCatalogProductsUseRole should have s3:GetObject permissions by default.
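If you would rather do that check from Python than from the CLI, something like the sketch below may help. It is only an illustration: the helper names (split_s3_uri, classify_key, list_keys) are made up for this example, and list_keys assumes boto3 is installed and your notebook role can list the bucket. The idea is to tell apart three cases for the URL in the error message: it names an actual object, it is only a prefix with objects underneath it, or nothing exists there at all. The URL in your error ends at output/train, which looks like an output prefix rather than a file, so the second case seems worth ruling out.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def classify_key(key, existing_keys):
    """Return 'object', 'prefix', or 'missing' for key, given listed keys."""
    if key in existing_keys:
        return "object"  # the URI names a real file
    prefix = key.rstrip("/") + "/"
    if any(k.startswith(prefix) for k in existing_keys):
        return "prefix"  # files exist, but only underneath this path
    return "missing"  # nothing was written here


def list_keys(bucket, prefix):
    """List object keys under a prefix (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Example (not run here, needs AWS access):
# bucket, key = split_s3_uri(
#     "s3://sagemaker-us-east-1-xxxxxxxxxx/Nab3Process-xxxxxxxxxx/output/train"
# )
# print(classify_key(key, list_keys(bucket, key)))
```

If this prints "prefix", the processing job did write files, but the URI passed to TrainingInput stops one level above the manifest file itself. If it prints "missing", the processing job probably did not write anything to that output.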