I want to forecast future demand based on 69 million historical demand records in a CSV file. What is the best practice?


I have historical demand data: 18 GB in CSV format, 69M records, 30 columns.

I'm exploring the SageMaker options, and I see several: Amazon Forecast, SageMaker Studio, SageMaker Canvas, Training Jobs, and a plain Jupyter notebook instance. I believe all of them could theoretically be used, but I'm not sure which ones can actually handle such a huge dataset without taking forever.

I think I've heard that some of these only support a few million records. I'd like to know the best approach for forecasting future demand with this many data points.

Should I use Spark? Can someone lay out how to do this?

Asked 8 months ago · 361 views

1 Answer

Accepted Answer

Hi,

For such large datasets, SageMaker Data Wrangler seems quite appropriate for preparing the data. In https://aws.amazon.com/blogs/machine-learning/process-larger-and-wider-datasets-with-amazon-sagemaker-data-wrangler/ it is benchmarked on a dataset of around 100 GB with 80 million rows and 300 columns.
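
Since you asked about Spark: one option besides the Data Wrangler UI is to run the preparation as a SageMaker Processing job with PySpark. Below is a minimal sketch under a few assumptions: the S3 paths, the "demand" column name, and the instance sizes are placeholders (not values from the blog post), and the cleaning step is just an example of what you might do.

```python
# preprocess.py -- PySpark script executed inside the SageMaker Processing job.
# Reads the raw CSV from S3, does a light cleaning pass, and writes Parquet,
# which is much faster to read back for training than one large CSV.
import argparse

from pyspark.sql import SparkSession

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("demand-prep").getOrCreate()

    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(args.input)
    )

    # Example cleaning step: drop rows missing the target column.
    # "demand" is a placeholder -- use your real column name.
    df = df.dropna(subset=["demand"])

    df.write.mode("overwrite").parquet(args.output)
```

You would then submit that script from a notebook or Studio; the job spins up a short-lived Spark cluster, so the notebook instance itself never has to hold the 18 GB in memory:

```python
# Launcher: submits preprocess.py as a distributed Spark job.
import sagemaker
from sagemaker.spark.processing import PySparkProcessor

role = sagemaker.get_execution_role()  # your SageMaker execution role

spark_processor = PySparkProcessor(
    base_job_name="demand-prep",
    framework_version="3.1",
    role=role,
    instance_count=4,                 # placeholder cluster size
    instance_type="ml.m5.4xlarge",    # placeholder instance type
)

spark_processor.run(
    submit_app="preprocess.py",
    arguments=[
        "--input", "s3://my-bucket/raw/demand.csv",      # placeholder path
        "--output", "s3://my-bucket/prepared/parquet/",  # placeholder path
    ],
)
```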

About training large models with Amazon SageMaker, see this video: https://www.youtube.com/watch?v=XKLIhIeDSCY

Also, regarding the training of your model, this post helps you choose the best data source: https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/
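
To make that concrete, here is a minimal sketch of pointing a training job at the prepared data using FastFile input mode, one of the options that post compares; it streams objects from S3 instead of copying the whole dataset to the instance before training starts. The image URI, role, and S3 paths are placeholders, and whether your container accepts Parquet or CSV depends on the algorithm you choose.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

role = sagemaker.get_execution_role()

estimator = Estimator(
    image_uri=training_image_uri,   # placeholder: your algorithm container
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",  # placeholder instance type
    output_path="s3://my-bucket/model-artifacts/",  # placeholder path
)

train_input = TrainingInput(
    s3_data="s3://my-bucket/prepared/parquet/",  # placeholder path
    input_mode="FastFile",  # stream from S3 instead of downloading ~18 GB up front
)

estimator.fit({"train": train_input})
```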

Best,

Didier

AWS · Expert
Answered 8 months ago
