Usage of SageMaker Processing Job Manifest File


I have a Processing Job that uses input files saved in different folders of an S3 bucket, and I use a manifest file within the Processing Job to copy them to the /opt/ml/processing/input folder.

This works perfectly fine when I have all the files in one folder, but it won't work when they are under the same prefix but in different folders.

I am following the steps listed at https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html:

[ {"prefix": "s3://customer_bucket/some/prefix/"},

"relative/path/to/custdata-1",

"relative/path/custdata-2",

...

"relative/path/custdata-N"

]

If all the input files are under "relative/path1/custdata-1", the job works fine, but if I add another entry "relative/path2/custdata-2", no files are copied and my script fails with "no such file or directory".

Any suggestions or advice on this will be very helpful.

Ahalli
Asked 2 years ago · 1885 views
1 Answer

I've used Processing Job ManifestFile inputs successfully in the past with multiple relative folders (e.g. the "Extract clean input images" section of this notebook - sorry for citing a large/sprawling sample, there are probably simpler ones out there).

I'm not sure exactly what could be going wrong here, so I'll try to describe how I think of using the feature and hope that helps:

Given an S3 bucket containing:

s3://customer_bucket/some/prefix/relative/path1/custdata-1
s3://customer_bucket/some/prefix/relative/path2/custdata-2

...and a manifest file like:

[ { "prefix":  "s3://customer_bucket/some/prefix/" },
  "relative/path1/custdata-1",
  "relative/path2/custdata-2"
]

...for a processing input something like the below (or equivalent if you're using boto3/etc instead of the SageMaker Python SDK):

from sagemaker.processing import ProcessingInput

# "source" points at the manifest file itself (not the data prefix), and
# s3_data_type="ManifestFile" tells SageMaker to resolve the objects it lists.
ProcessingInput(
    destination="/opt/ml/processing/input/mycoolinput",
    input_name="mycoolinput",
    s3_data_type="ManifestFile",
    source="s3://path-to-your-manifest-file",
)
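
(If you're calling the CreateProcessingJob API through boto3 instead, the matching input definition would look roughly like the sketch below; only the ProcessingInputs portion is shown, and the manifest URI is the same placeholder as above.)

processing_inputs = [
    {
        "InputName": "mycoolinput",
        "S3Input": {
            "S3Uri": "s3://path-to-your-manifest-file",
            "LocalPath": "/opt/ml/processing/input/mycoolinput",
            "S3DataType": "ManifestFile",
            "S3InputMode": "File",
            "S3DataDistributionType": "FullyReplicated",
        },
    }
]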

...I'd expect your processing job to see files:

/opt/ml/processing/input/mycoolinput/relative/path1/custdata-1
/opt/ml/processing/input/mycoolinput/relative/path2/custdata-2

So in this sense it is possible to have files under the same prefix with different subfolders. In the above-mentioned sample, the raw_s3uri prefix contains credit card agreement PDFs categorized into folders by bank/provider - e.g. {raw_s3uri}/Bank1/Card1.pdf, {raw_s3uri}/CreditUnion2/Disclosures.pdf, etc.

To my knowledge it's not possible to have multiple { "prefix": "..." } entries in your manifest, but as I understood it, it didn't sound like you were trying to do that.
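
For what it's worth, if you're generating the manifest programmatically, a minimal sketch with boto3 could look like the below (the manifest key "manifests/my-manifest.json" is just a placeholder I've made up; the bucket and prefix are from your example):

import json

import boto3

# A single-prefix manifest: the first element is the common prefix, the rest
# are object keys relative to that prefix.
manifest = [
    {"prefix": "s3://customer_bucket/some/prefix/"},
    "relative/path1/custdata-1",
    "relative/path2/custdata-2",
]

# Upload the manifest itself to S3 so it can be referenced as the
# ProcessingInput source.
boto3.client("s3").put_object(
    Bucket="customer_bucket",
    Key="manifests/my-manifest.json",
    Body=json.dumps(manifest),
)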

Apart from double-checking this overall setup (and maybe using Python os.walk() to recursively print() out the folder contents as your Processing job sees them), the only other thing I could suggest is to check if your S3 object keys have any special characters in them that could be causing issues when mapping to a local filesystem - such as files/folders with spaces at the end, or characters that aren't usually allowed in filenames?
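
For example, something like this at the top of your processing script would print exactly what the container received (assuming the destination path used above):

import os

# Recursively list everything under the processing input directory so the job
# logs show which files and subfolders were actually downloaded.
input_dir = "/opt/ml/processing/input/mycoolinput"
for root, _dirs, files in os.walk(input_dir):
    for name in files:
        print(os.path.join(root, name))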

AWS Expert
Alex_T
Answered 2 years ago
