Two different schemas being streamed into a data lake via Kinesis Stream + Firehose: should each have its own stream?


I have an architecture where two entities post data to an API Gateway. Each entity follows its own (different) JSON schema, and API Gateway pushes the data into a Kinesis Data Stream/Firehose. Should I create a separate stream + Firehose for each schema? I understand that I could stream both into the same Kinesis Data Stream / Firehose and use a Lambda to parse each data point and decide where to write it in S3 (roughly sketched below), but I am worried about Lambda concurrency issues if the data velocity spikes. What is the best practice in this context?
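For reference, the single-stream option I'm weighing would look something like this minimal sketch: one Lambda consuming the shared stream, inspecting each record, and writing it to a schema-specific S3 prefix. The `schema_type` discriminator field, the bucket name, and the prefixes are placeholders, not my real values.

```python
import base64
import json

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefixes -- substitute real names.
BUCKET = "my-datalake-bucket"
PREFIXES = {"entity_a": "raw/entity_a/", "entity_b": "raw/entity_b/"}


def handler(event, context):
    """Consume a shared Kinesis stream and route each record to an
    S3 prefix based on which schema it appears to match."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Hypothetical discriminator: assumes each producer tags its
        # payload with a "schema_type" field.
        schema_type = payload.get("schema_type", "unknown")
        prefix = PREFIXES.get(schema_type, "raw/unknown/")

        key = f"{prefix}{record['kinesis']['sequenceNumber']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload))
```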

1 Answer

Mixing the entities is problematic. I recommend a separate Kinesis Data Stream + Lambda, or a separate Kinesis Data Firehose, for each schema. Splitting them into separate streams makes it easier to apply the AWS Glue Schema Registry to validate each JSON schema and to write consumers that simply read the JSON they expect from their own Kinesis Data Stream. Kinesis Data Firehose is a reliable ETL service that can load your entities into a variety of sinks. Keeping the entities separate also pays off downstream, for example by simplifying the AWS Glue schemas/tables you configure for querying with Athena or for machine learning.
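As a minimal sketch of the separate-stream layout (the stream names, ARNs, bucket, and prefixes below are placeholders, and in practice you would likely define this in CloudFormation/CDK/Terraform rather than ad hoc SDK calls), you could create one Firehose delivery stream per schema, each reading from its own Kinesis Data Stream and landing in its own S3 prefix:

```python
import boto3

firehose = boto3.client("firehose")

# Placeholder ARNs -- substitute your own streams, role, and bucket.
STREAMS = {
    "entity-a": "arn:aws:kinesis:us-east-1:123456789012:stream/entity-a",
    "entity-b": "arn:aws:kinesis:us-east-1:123456789012:stream/entity-b",
}
ROLE_ARN = "arn:aws:iam::123456789012:role/firehose-delivery-role"
BUCKET_ARN = "arn:aws:s3:::my-datalake-bucket"

for name, stream_arn in STREAMS.items():
    # One delivery stream per schema, each with its own S3 prefix,
    # so downstream Glue tables and Athena queries stay clean.
    firehose.create_delivery_stream(
        DeliveryStreamName=f"{name}-to-s3",
        DeliveryStreamType="KinesisStreamAsSource",
        KinesisStreamSourceConfiguration={
            "KinesisStreamARN": stream_arn,
            "RoleARN": ROLE_ARN,
        },
        ExtendedS3DestinationConfiguration={
            "RoleARN": ROLE_ARN,
            "BucketARN": BUCKET_ARN,
            "Prefix": f"raw/{name}/",
        },
    )
```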

Here's a blog on Schema Registry: https://aws.amazon.com/blogs/big-data/evolve-json-schemas-in-amazon-msk-and-amazon-kinesis-data-streams-with-the-aws-glue-schema-registry/
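If you adopt the Schema Registry, registering each entity's JSON schema is only a couple of Glue API calls; the registry name, schema name, and schema definition below are purely illustrative:

```python
import json

import boto3

glue = boto3.client("glue")

# Illustrative registry name -- pick one that fits your naming scheme.
glue.create_registry(RegistryName="datalake-ingest")

# Illustrative JSON Schema for one of the two entities.
entity_a_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"id": {"type": "string"}, "value": {"type": "number"}},
    "required": ["id"],
}

glue.create_schema(
    RegistryId={"RegistryName": "datalake-ingest"},
    SchemaName="entity-a",
    DataFormat="JSON",  # Glue Schema Registry supports JSON Schema documents
    Compatibility="BACKWARD",
    SchemaDefinition=json.dumps(entity_a_schema),
)
```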

answered a year ago by Jason_W (AWS)
