Support for protobuf on Glue/Spark is very limited.
You could use a third-party plugin such as https://github.com/amanjpro/spark-proto (I haven't used it myself; use at your own risk).
Or read the files as binary and then parse them with the functions in https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/protobuf.html, which are really intended for protobuf messages on streaming.
If your use case is telemetry, instead of writing those files to S3 you could write them to Kinesis/Kafka and then use those functions to parse them easily.
Thanks Gonzalo,
Sharing my insights (and a question) on your answer ;)
https://github.com/amanjpro/spark-proto: nice, but it seems to support Spark only up to 2.4.x.
As for https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/protobuf.html, it works from Spark 3.4.0, whereas Glue 4 only supports Spark up to 3.3 ;)
Using event streaming (Kinesis, Kafka) is a great option. As for Kafka, do you suggest working with a POJO class or another option?
Any plugin in Spark will map from the binary directly to columns; you would only need a POJO if you parse the binary yourself.
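To make "parse the binary yourself" concrete, here is a stdlib-only sketch of what generated classes (or Spark's `from_protobuf`) do under the hood: the protobuf wire format is a sequence of varint-encoded field keys followed by values. This is for illustration only; in practice you would use classes generated by `protoc` rather than hand-rolling a decoder.

```python
# Minimal sketch of decoding the protobuf wire format by hand (stdlib only).
# Handles only wire types 0 (varint) and 2 (length-delimited), which is enough
# to show the structure a POJO/generated class hides from you.

def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            return result, pos
        shift += 7

def decode_message(buf: bytes):
    """Return {field_number: value} for varint and length-delimited fields."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_no, wire_type = key >> 3, key & 0x07
        if wire_type == 0:            # varint (int32/int64/bool/enum)
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited (string/bytes/nested)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[field_no] = value
    return fields

# Hand-encoded test message: field 1 = varint 150, field 2 = bytes b"hi"
raw = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi"
print(decode_message(raw))  # {1: 150, 2: b'hi'}
```

A Spark plugin performs exactly this mapping for you, binary straight to columns, which is why no intermediate POJO is needed on that path.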