Hi,
As specified in the Data format guidelines page of the Personalize documentation, input data must be provided as CSV files.
The first step in the Amazon Personalize workflow is to create a dataset group. If you then want to import data from multiple data sources into an Amazon Personalize dataset, you can use Amazon SageMaker Data Wrangler, a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, and analyze data. See bulk data imports in the documentation.
If your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of that folder; it doesn't use any data in sub-folders. Use the following syntax, with a / after the folder name: s3://<name of your S3 bucket>/<folder path>/. For details, see Importing bulk records with a dataset import job.
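For illustration, here is a minimal sketch of assembling such an import-job request in Python. The bucket name, dataset ARN, and role ARN are placeholders, and the actual boto3 call (`create_dataset_import_job`) is shown only in a comment so the sketch runs standalone:

```python
# Sketch: build a dataset import job request pointing at an S3 folder.
# The ARNs and bucket name below are placeholders, not real resources.

def folder_data_location(bucket: str, folder: str) -> str:
    """Build the s3://<bucket>/<folder>/ path Personalize expects for a folder import."""
    return f"s3://{bucket}/{folder.strip('/')}/"

request = {
    "jobName": "interactions-bulk-import",
    "datasetArn": "arn:aws:personalize:us-east-1:111122223333:dataset/my-group/INTERACTIONS",
    "dataSource": {"dataLocation": folder_data_location("my-bucket", "interactions/")},
    "roleArn": "arn:aws:iam::111122223333:role/PersonalizeS3Role",
}

# With boto3 this request would be submitted as:
#   boto3.client("personalize").create_dataset_import_job(**request)
print(request["dataSource"]["dataLocation"])  # s3://my-bucket/interactions/
```

Note the trailing slash in the dataLocation: that is what tells Personalize to read every file in the first level of the folder.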
Lastly, there are three ways to update your datasets in Personalize; see this blog post for a comprehensive explanation.
Hope this helps.
Hi, the simplest way to work around this is to create a Lambda function that is triggered automatically each time a file is written to your bucket. If the file is a zip, the function can decompress it for you.
See https://levelup.gitconnected.com/automating-zip-extraction-with-lambda-and-s3-9a083d4e8bab for an example
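As a rough sketch of the extraction step such a Lambda would perform (the S3 download/upload plumbing via boto3 is deliberately omitted so this runs standalone; in the Lambda, `zip_bytes` would come from `s3.get_object()` and each member would be written back with `s3.put_object()`):

```python
import io
import zipfile

def extract_members(zip_bytes: bytes) -> dict:
    """Extract every member of a zip archive held in memory.

    In a Lambda handler, zip_bytes would be the body of the uploaded S3
    object; here we keep everything in memory so the sketch is self-contained.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

# Demo: build a small archive in memory and extract it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("interactions.csv", "USER_ID,ITEM_ID,TIMESTAMP\n1,42,1700000000\n")

files = extract_members(buf.getvalue())
print(list(files))  # ['interactions.csv']
```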
Best,
Didier
Thanks for the idea! I'm first seeking to confirm that in fact Personalize does not handle compressed files. Decompressing our full initial training set is quite cumbersome (even with Lambda). Given that every other AWS 'data' product with which I've dealt handles various forms of compression, this still feels like it should be possible.
Also, rather than decompressing each file -- I'd probably instead trigger a Lambda to decompress in-memory (i.e. streaming) and use Personalize's PutItems endpoint for incremental training. (Though would still prefer to just import the GZ CSV files :-)
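To make the streaming idea concrete, here is a sketch of decompressing a gzipped CSV on the fly. The in-memory buffer stands in for the `StreamingBody` an S3 `get_object()` call would return, and the Personalize Events `put_items` call is shown only in a comment (the record shape there is illustrative, not a complete request):

```python
import csv
import gzip
import io

def iter_csv_rows(gz_stream):
    """Yield dict rows from a gzipped CSV without writing anything to disk.

    gz_stream can be any file-like object, e.g. the body returned by
    s3.get_object() in a Lambda (boto3 plumbing omitted here).
    """
    with gzip.GzipFile(fileobj=gz_stream) as gz:
        text = io.TextIOWrapper(gz, encoding="utf-8")
        yield from csv.DictReader(text)

# Demo with an in-memory gzipped CSV standing in for the S3 object body.
raw = "ITEM_ID,CATEGORY\n42,books\n43,music\n"
gz_bytes = gzip.compress(raw.encode("utf-8"))

for row in iter_csv_rows(io.BytesIO(gz_bytes)):
    # An incremental import would batch rows and send them, e.g.:
    #   personalize_events.put_items(datasetArn=..., items=[{"itemId": row["ITEM_ID"], ...}])
    print(row["ITEM_ID"])
```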
Yes, I know. As mentioned in the question: the data is already in CSV format, and is being read correctly in that format. The question is about compressed CSV files. (And secondarily a question about multiple files in an S3 prefix, as is typical for a sharded-storage scheme.)
Agreed that input data for Personalize must be in CSV format.
Personalize does not support compressed file formats today; only plain CSV is supported. Regarding the second point, if your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of that folder; it doesn't use any data in sub-folders. Use the following syntax, with a / after the folder name: s3://<name of your S3 bucket>/<folder path>/. See https://docs.aws.amazon.com/personalize/latest/dg/bulk-data-import-step.html