PySpark DataFrame as RecordIO protobuf


I want to save my PySpark DataFrame in RecordIO protobuf format. I am using Amazon EMR to run my PySpark scripts, and I want to use AWS SageMaker to train a machine learning model. SageMaker Pipe mode only accepts RecordIO protobuf as input, hence my question.

I tried to save my PySpark DataFrame as RecordIO protobuf as follows:

output_path = "s3://my_path/output_processed"
df_transformed.write.format("sagemaker").mode("overwrite").save(output_path)

But when I run the SageMaker model, I get an error about missing values even though my DataFrame has no missing values. Any idea what might help?

Omar
asked 5 months ago, 182 views
1 Answer

Hi,

A missing-value error can come from the data itself or from the serialization step. First, validate the data: confirm that each column's data type matches the expected type, and look for unexpected values or outliers. Second, review how the DataFrame is serialized to RecordIO to confirm that every value is written and none is dropped or misinterpreted, since that can surface as a missing-value error even when the DataFrame has no nulls. Start with a small sample so the integrity checks are easy to verify. For more on data preparation, refer to the "Prepare data with advanced transformations" documentation.
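As an illustration of the validation step, here is a minimal sketch in plain Python. It assumes you have pulled a small sample of rows to the driver as dictionaries (for example with `df.limit(100).collect()` and `row.asDict()`). The helper names `count_missing` and `non_numeric_columns` and the sample columns are hypothetical, not part of any SageMaker or Spark API. It only shows the kind of check that can reveal NaN values or string-typed columns that a null count alone would miss.

```python
import math


def count_missing(rows, columns):
    """Count None and NaN values per column in a list of dict rows."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            value = row.get(name)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[name] += 1
    return counts


def non_numeric_columns(rows, columns):
    """Return the columns that hold at least one non-numeric, non-null value."""
    bad = set()
    for row in rows:
        for name in columns:
            value = row.get(name)
            if value is None:
                continue
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                bad.add(name)
    return sorted(bad)


# Hypothetical sample: one NaN feature, one string feature, one null label.
sample = [
    {"label": 1.0, "f1": 0.5, "f2": 3},
    {"label": 0.0, "f1": float("nan"), "f2": "7"},
    {"label": None, "f1": 1.5, "f2": 2},
]
columns = ["label", "f1", "f2"]

missing = count_missing(sample, columns)
wrong_type = non_numeric_columns(sample, columns)
print(missing)
print(wrong_type)
```

If either check reports a problem on the sample, it is worth fixing the column in the DataFrame (casting or imputing) before writing the RecordIO output again.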

I hope it helps.

BezuW (AWS)
answered 5 months ago
