Support for protobuf on Glue/Spark is very limited.
You could use a third-party plugin such as https://github.com/amanjpro/spark-proto (I haven't used it myself, so use it at your own risk).
Or read the files as binary and then use from_protobuf (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/protobuf.html) to parse them, though that function is really intended for protobuf messages in streaming pipelines; see the sketch below.
If your use case is telemetry, you could, instead of writing those files to S3, write them to Kinesis/Kafka and then use that same function to parse them easily.
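Here is a minimal, hedged sketch of that second option, assuming Spark >= 3.4 with the spark-protobuf package on the classpath, a compiled descriptor file produced with `protoc --descriptor_set_out`, and a hypothetical `Telemetry` message; the bucket path and file names are placeholders:

```python
# Sketch only: Spark >= 3.4, spark-protobuf package available, one protobuf
# message per file. Message name, descriptor path and S3 path are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.getOrCreate()

# binaryFile gives one row per object; the "content" column holds the raw bytes.
raw = spark.read.format("binaryFile").load("s3://my-bucket/telemetry/*.bin")

parsed = raw.select(
    from_protobuf(
        "content",
        "Telemetry",                        # fully-qualified message name
        descFilePath="/tmp/telemetry.desc"  # protoc --descriptor_set_out output
    ).alias("event")
).select("event.*")

parsed.show(truncate=False)
```

Note that this assumes each S3 object is exactly one serialized message; length-delimited streams of messages in a single file would need custom parsing.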
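For the streaming route, a similarly hedged sketch (broker address, topic name, message name and descriptor path are all placeholders, and the same Spark >= 3.4 / spark-protobuf assumption applies):

```python
# Hypothetical sketch: decode protobuf-encoded Kafka events with from_protobuf.
from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "telemetry")                   # placeholder topic
    .load()
)

# Kafka's "value" column is binary, so it feeds straight into from_protobuf.
decoded = stream.select(
    from_protobuf("value", "Telemetry", descFilePath="/tmp/telemetry.desc").alias("event")
).select("event.*")

query = decoded.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```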
Thanks Gonzalo,
Sharing my insights (and a question) on your answer ;)
https://github.com/amanjpro/spark-proto: nice, though it seems to support only up to Spark 2.4.x.
As for https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/protobuf.html, it works from Spark 3.4.0, whereas Glue 4 only supports Spark up to 3.3 ;)
Using event streaming (Kinesis, Kafka) is a great option. As for Kafka, do you suggest working with a POJO class or another option?
Any Spark plugin will map the protobuf binary directly to columns; you would only need a POJO if you parse the binary yourself.
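To illustrate the "parse it yourself" alternative in PySpark terms (the Python analogue of a POJO would be the classes generated by `protoc`), here is a hedged sketch; `telemetry_pb2`, its fields, and the paths are hypothetical placeholders:

```python
# Hypothetical sketch: manual parsing with protoc-generated Python classes
# (telemetry_pb2 is a placeholder module) inside a UDF, instead of from_protobuf.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, LongType

import telemetry_pb2  # generated with: protoc --python_out=. telemetry.proto

schema = StructType([
    StructField("device_id", StringType()),
    StructField("timestamp", LongType()),
])

@udf(returnType=schema)
def parse_telemetry(raw):
    # Deserialize one message and return its fields as a struct.
    msg = telemetry_pb2.Telemetry()
    msg.ParseFromString(bytes(raw))
    return (msg.device_id, msg.timestamp)

spark = SparkSession.builder.getOrCreate()
raw = spark.read.format("binaryFile").load("s3://my-bucket/telemetry/*.bin")
parsed = raw.select(parse_telemetry("content").alias("event")).select("event.*")
```

With a plugin or from_protobuf this whole UDF goes away, which is why the column-mapping route is usually preferable.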