Using a SageMaker Processing Job manifest file


I have a Processing Job whose input files are stored in different folders of an S3 bucket, and I use a manifest file within the Processing Job to copy them to the /opt/ml/processing/input folder.

This works perfectly fine when I have all the files in one folder, but it won't work when they are under the same prefix but in different folders.

I am following the steps listed at https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html:

[ {"prefix": "s3://customer_bucket/some/prefix/"},
  "relative/path/to/custdata-1",
  "relative/path/custdata-2",
  ...
  "relative/path/custdata-N"
]

If I have all the input files in "relative/path1/custdata-1", the job works fine, but if I add another one, "relative/path2/custdata-2", no files are copied and my script fails with "No such file or directory".

Any suggestions or advice on this would be very helpful.

Ahalli
asked 2 years ago · 1853 views
1 Answer

I've used Processing Job ManifestFile inputs successfully in the past with multiple relative folders (e.g. the "Extract clean input images" section of this notebook; sorry for citing a large, sprawling sample, there are probably simpler ones out there).

I'm not sure exactly what could be going wrong here, so I'll describe how I understand the feature to work and hope that helps:

Given an S3 bucket containing:

s3://customer_bucket/some/prefix/relative/path1/custdata-1
s3://customer_bucket/some/prefix/relative/path2/custdata-2

...and a manifest file like:

[ { "prefix":  "s3://customer_bucket/some/prefix/" },
  "relative/path1/custdata-1",
  "relative/path2/custdata-2"
]
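
If you generate the manifest from code rather than by hand, something like the following minimal sketch may help avoid formatting slips. It assumes the prefix and each relative key are joined by plain concatenation (as the example in the API reference suggests), so it forces a trailing slash on the prefix and strips leading slashes from the keys. The helper name build_manifest is hypothetical, not part of any SDK.

```python
import json


def build_manifest(prefix, relative_keys):
    """Return manifest JSON text: one prefix entry followed by relative keys.

    Assumption: prefix + key are concatenated as-is, so the prefix
    needs a trailing slash and the keys must not start with one.
    """
    if not prefix.endswith("/"):
        prefix += "/"
    entries = [{"prefix": prefix}]
    entries.extend(key.lstrip("/") for key in relative_keys)
    return json.dumps(entries, indent=2)


manifest_text = build_manifest(
    "s3://customer_bucket/some/prefix",
    ["relative/path1/custdata-1", "/relative/path2/custdata-2"],
)
print(manifest_text)
```

The resulting text would then be uploaded to S3 as the manifest object that the processing input points at.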

...for a processing input something like the below (or equivalent if you're using boto3/etc instead of the SageMaker Python SDK):

from sagemaker.processing import ProcessingInput

ProcessingInput(
    destination="/opt/ml/processing/input/mycoolinput",
    input_name="mycoolinput",
    s3_data_type="ManifestFile",
    source="s3://path-to-your-manifest-file",  # S3 URI of the manifest file itself
)
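
For reference, if you're calling the CreateProcessingJob API through boto3 instead, I believe the equivalent input entry looks roughly like the sketch below. It only builds the dictionary that would go in the ProcessingInputs list and does not call AWS. The manifest URI is the same placeholder as above, and "File" input mode is my assumption for a job that reads local files.

```python
# One entry for the ProcessingInputs list of create_processing_job.
# This only constructs the dict; no AWS call is made here.
manifest_input = {
    "InputName": "mycoolinput",
    "S3Input": {
        # Placeholder: S3 URI of the manifest file itself, not of the data
        "S3Uri": "s3://path-to-your-manifest-file",
        "LocalPath": "/opt/ml/processing/input/mycoolinput",
        "S3DataType": "ManifestFile",
        "S3InputMode": "File",  # assumption: copy objects to local disk
    },
}

processing_inputs = [manifest_input]
print(processing_inputs)
```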

...I'd expect your processing job to see files:

/opt/ml/processing/input/mycoolinput/relative/path1/custdata-1
/opt/ml/processing/input/mycoolinput/relative/path2/custdata-2
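
To make that mapping concrete, here is a small sketch that derives the local paths I'd expect from a manifest and a destination folder. It encodes my understanding of the layout described above (relative keys preserved under the destination), not verified service behaviour, and the function name expected_local_paths is made up for illustration.

```python
import json
import posixpath


def expected_local_paths(manifest_text, destination):
    """List the container paths expected for each relative key in a manifest.

    Assumption: each relative key is recreated as-is under `destination`.
    """
    entries = json.loads(manifest_text)
    # Skip the {"prefix": ...} object; the remaining string entries are keys.
    relative_keys = [entry for entry in entries if isinstance(entry, str)]
    return [posixpath.join(destination, key) for key in relative_keys]


example_manifest = """
[ { "prefix": "s3://customer_bucket/some/prefix/" },
  "relative/path1/custdata-1",
  "relative/path2/custdata-2"
]
"""

paths = expected_local_paths(example_manifest, "/opt/ml/processing/input/mycoolinput")
for path in paths:
    print(path)
```

Comparing this list with what the job actually sees on disk should show quickly whether files are missing or just landing somewhere other than where the script looks.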

So in this sense it is possible to have files under the same prefix with different subfolders. In the sample mentioned above, the raw_s3uri prefix contains credit card agreement PDFs categorized into folders by bank/provider, e.g. {raw_s3uri}/Bank1/Card1.pdf, {raw_s3uri}/CreditUnion2/Disclosures.pdf, etc.

To my knowledge it's not possible to have multiple { "prefix": "..." } entries in your manifest, but as I understood it, it didn't sound like you were trying to do that.

Apart from double-checking this overall setup (and maybe using Python os.walk() to recursively print() out the folder contents as your Processing job sees them), the only other thing I could suggest is to check whether your S3 object keys contain any special characters that could cause issues when mapping to a local filesystem, such as files/folders with spaces at the end, or characters that aren't usually allowed in filenames.
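
A rough sketch of both checks, which you could drop at the top of your processing script: the first function walks the input folder with os.walk(), and the second flags keys with suspicious path segments. The rules in suspicious_keys (leading/trailing whitespace, empty segments, control characters) are just my guess at what might cause trouble, and both function names are hypothetical.

```python
import os


def list_input_files(root):
    """Recursively collect every file path under `root`, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


def suspicious_keys(keys):
    """Return keys with a path segment that may not map cleanly to a filesystem.

    Heuristic only: flags empty segments, leading/trailing whitespace,
    and ASCII control characters.
    """
    flagged = []
    for key in keys:
        for segment in key.split("/"):
            has_control_char = any(ord(ch) < 32 for ch in segment)
            if segment == "" or segment != segment.strip() or has_control_char:
                flagged.append(key)
                break
    return flagged


# Print what the job actually sees (prints nothing if the folder is absent).
for path in list_input_files("/opt/ml/processing/input"):
    print(path)

print(suspicious_keys(["relative/path1/custdata-1", "relative/path2 /custdata-2"]))
```

If the listing comes back empty while the manifest keys look clean, the problem is more likely in the manifest location or the input configuration than in the script itself.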

AWS
EXPERT
Alex_T
answered 2 years ago
