Glue job that converts protobuf into avro


Hello... I have a query regarding the conversion of Protobuf objects stored in an S3 bucket (I possess a list of .pb.gz objects within the "Origin" bucket). I aim to develop a Glue job to transform these .pb.gz objects into another file format, such as .avro.

Concerning the Protobuf schema, I'm utilizing the OpenTelemetry metrics schema, and I have successfully incorporated it into the Glue Schema Registry. However, when I attempted to manually create a new Glue table, I encountered an issue in saving it, even though my schema appears to be correct. I suspect, but I'm not certain, that this issue arises because when creating a Glue table, I can select the Glue Schema Registry, but the protobuf format is not available in the format options list (JSON, CSV, Avro, Parquet, etc. are available, excluding protobuf). Nevertheless, protobuf can be added to the Glue Schema Registry.

If the Glue table approach works, the task can be accomplished easily with Glue ETL by simply selecting the table for the .pb.gz files. However, if this approach is not feasible, I assume I may need to use Python code to perform the conversion. This might involve a short Python snippet or manually compiling the proto files ( from opentelemetry.proto.collector.metrics.v1.metrics_service_pb2 import ExportMetricsServiceRequest ).
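For illustration, a rough, untested sketch of that pure-Python route (the bucket/key names and the downstream Avro step are my own placeholders, not a confirmed implementation): download a .pb.gz object with boto3, gunzip it, and parse it with the compiled OpenTelemetry classes.

```python
# Hypothetical sketch: bucket, key, and the Avro-writing step are placeholders.
import gzip

import boto3
from google.protobuf.json_format import MessageToDict
from opentelemetry.proto.collector.metrics.v1.metrics_service_pb2 import (
    ExportMetricsServiceRequest,
)

s3 = boto3.client("s3")


def read_request(bucket: str, key: str) -> ExportMetricsServiceRequest:
    """Download a single .pb.gz object and parse it into a protobuf message."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    msg = ExportMetricsServiceRequest()
    msg.ParseFromString(gzip.decompress(body))
    return msg


request = read_request("origin-bucket", "metrics/sample.pb.gz")  # placeholder names
# MessageToDict yields plain dicts that could then be written as Avro records
# (e.g. with fastavro) or turned into a Glue DynamicFrame.
records = MessageToDict(request).get("resourceMetrics", [])
```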

I would appreciate hearing your thoughts on this matter, or any examples you might have of implementing the conversion from S3 (OpenTelemetry metrics schema protobuf objects) to S3 (objects in another file format) as described here. If you have a working code example that loads protobuf from S3 into a DynamicFrame or something similar, it may help too ;)

Thank you for taking the time to read this. I appreciate receiving answers, ideas, or tips based on your experience.

erez
asked 4 months ago · 139 views
2 Answers

Support for protobuf on Glue/Spark is very limited.
You could use a third-party plugin, such as https://github.com/amanjpro/spark-proto (I haven't used it myself, so use at your own risk).
Or read the files as binary and then use the from_protobuf function to parse them (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/protobuf.html), which is really intended for protobuf messages on streaming.
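For example (untested), a minimal sketch of that batch route, assuming a Spark 3.4+ runtime (from_protobuf is not in the Spark 3.3 that Glue 4.0 ships), a descriptor file built with protoc --descriptor_set_out, and that each .pb.gz object holds a single gzipped ExportMetricsServiceRequest; all paths are placeholders:

```python
# Sketch only: paths and message assumptions are placeholders, requires Spark 3.4+.
import gzip

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.protobuf.functions import from_protobuf
from pyspark.sql.types import BinaryType

spark = SparkSession.builder.appName("pb-to-avro").getOrCreate()

# Read the raw objects; the "content" column holds the gzipped protobuf bytes.
raw = spark.read.format("binaryFile").load("s3://origin-bucket/path/*.pb.gz")

# Decompress the gzip payload before handing it to from_protobuf.
gunzip = udf(lambda b: gzip.decompress(bytes(b)), BinaryType())

decoded = raw.select(
    from_protobuf(
        gunzip(col("content")),
        "opentelemetry.proto.collector.metrics.v1.ExportMetricsServiceRequest",
        descFilePath="/tmp/metrics_service.desc",  # built with protoc --descriptor_set_out
    ).alias("request")
)

# Write the parsed struct out as Avro (needs spark-avro available on the cluster).
decoded.write.format("avro").save("s3://target-bucket/avro/")
```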
If your use case is telemetry, instead of writing those files to S3 you could write them to Kinesis/Kafka and then use that function to parse them easily.

AWS EXPERT
answered 4 months ago

Thanks Gonzalo,

Sharing my insights (and a question) on your answer ;)

https://github.com/amanjpro/spark-proto: nice, but it seems to support only Spark versions up to 2.4.x.

As for https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/protobuf.html, it is available from Spark 3.4.0, whereas Glue 4 only supports Spark up to 3.3 ;)

Using event streaming (Kinesis, Kafka) is a great option. As for Kafka, do you suggest working with a POJO class or another option?

erez
answered 4 months ago
  • Any plugin in Spark will map from the binary directly to columns; you would only need a POJO if you parse the binary yourself.
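For example, a hedged sketch of that streaming route (untested; assumes Spark 3.4+, the Kafka connector on the classpath, and a protoc-compiled descriptor file; broker, topic, and paths are placeholders). from_protobuf expands the raw bytes straight into a struct column, so no generated POJO is needed on the Python side:

```python
# Sketch only: broker, topic, descriptor path, and S3 paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.appName("otel-kafka").getOrCreate()

# Read the raw protobuf bytes from Kafka; the "value" column is binary.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "otel-metrics")
    .load()
)

# Let from_protobuf map the bytes directly into a struct column.
metrics = stream.select(
    from_protobuf(
        col("value"),
        "opentelemetry.proto.collector.metrics.v1.ExportMetricsServiceRequest",
        descFilePath="/tmp/metrics_service.desc",
    ).alias("request")
)

# Stream the parsed records out as Avro files.
query = (
    metrics.writeStream.format("avro")
    .option("path", "s3://target-bucket/avro/")
    .option("checkpointLocation", "s3://target-bucket/checkpoints/")
    .start()
)
query.awaitTermination()
```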
