Persisting Kinesis messages to S3 in Parquet format - apache-spark

I have a Kinesis stream to which my app writes ~10K messages per second, in protobuf format.
I would like to persist those messages to S3 in Parquet format. For easy searching afterwards, I need to partition the data by the User ID field, which is part of the message.
Currently, I have a Lambda function that is triggered by Kinesis events. It receives up to 10K messages, groups them by User ID, and then writes them to S3 as Parquet files.
My problem is that the files this Lambda function generates are very small, ~200KB, while I would like to create ~200MB files for better query performance (I query those files using AWS Athena).
A naive approach would be to write another Lambda function that reads those small files and rolls them up into one big file, but I feel like I'm missing something and there must be a better way of doing it.
I'm wondering if I should use Spark as described in this question.

Maybe you could use two additional services from AWS:
AWS Kinesis Data Analytics to consume data from Kinesis Stream and generate SQL analysis over your data (group, filter, etc). See more here: https://aws.amazon.com/kinesis/data-analytics/
AWS Kinesis Firehose plugged in after Kinesis Data Analytics. With this service, you can create a Parquet file on S3 every X minutes or every Y MB of arriving data. See more here: https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
A second way to do it is by using Spark Structured Streaming: read from the AWS Kinesis stream, filter out unusable data, and export to S3 as described here:
https://databricks.com/blog/2017/08/09/apache-sparks-structured-streaming-with-amazon-kinesis-on-databricks.html
P.S.: This example shows how to output to a local filesystem, but you can change it to an S3 location.
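For reference, here is a minimal sketch of the Structured Streaming route, assuming the Databricks Kinesis connector (the "kinesis" source, where `spark` is the pre-created session on Databricks) and a hypothetical parse_user_id() function for the protobuf payload; the stream name, region, and bucket paths are illustrative:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def parse_user_id(data):
    # Placeholder: decode the protobuf bytes here and return the User ID field
    raise NotImplementedError

parse_user_id_udf = udf(parse_user_id, StringType())

events = (spark.readStream
    .format("kinesis")                        # Databricks Kinesis connector
    .option("streamName", "my-stream")        # hypothetical stream name
    .option("region", "us-east-1")
    .option("initialPosition", "latest")
    .load())

query = (events
    .withColumn("user_id", parse_user_id_udf(col("data")))   # "data" holds the raw record bytes
    .writeStream
    .format("parquet")
    .partitionBy("user_id")
    .option("path", "s3a://my-bucket/events/")                # hypothetical bucket
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
    .trigger(processingTime="10 minutes")                     # longer triggers yield larger files
    .start())

A longer trigger interval (or a downstream compaction job) is what gets you closer to the ~200MB files Athena prefers.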

Related

Is it safe to read data with boto3 from S3 if that data had been written using Stocator in pyspark?

I have an application that uses Stocator as a connector for Spark. This application writes data to an S3 COS bucket.
Now I am working on a service that's supposed to read that data from S3. According to this thread here, you cannot specify the URI scheme/protocol that boto3 uses. Is it safe to read that data using boto3's default S3 REST API?
The reason I am asking is that I have been told that reading data that was written using Stocator through S3A (another connector) could result in reading duplicates.

Can Records in a single notification event ever have mixed event sources in AWS?

I've configured an S3 bucket to invoke a Lambda on s3:ObjectCreated:*. This results in the Lambda receiving an event with a Records property. Since the property is an array, I assume this means it can potentially contain multiple Records where eventSource equals "s3".
But what if I were to set up a similar Lambda-invocation trigger in a different AWS service as well? Could the Records array contain, for instance, one Record from S3 and another from the other service?
There shouldn't be a mixture, but Records can have different structures. For example, Kinesis events also have Records, and DynamoDB Streams also use Records.
But if you compare Kinesis records with S3 event records, you will see they have different eventSource values: for Kinesis it is aws:kinesis, while for S3 it is aws:s3.
Thus, if you are worried about this, you should check eventSource, or some other characteristic property of the Record, for the respective service.
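A minimal sketch of such a check in a Python Lambda handler (the per-source handlers below are just illustrative):

def handle_s3(record):
    # S3 event records carry the bucket/object details under "s3"
    print("S3 object created:", record["s3"]["object"]["key"])

def handle_kinesis(record):
    # Kinesis records carry the payload and metadata under "kinesis"
    print("Kinesis sequence number:", record["kinesis"]["sequenceNumber"])

def handler(event, context):
    for record in event.get("Records", []):
        # SNS records use "EventSource" (capital E); most other services use "eventSource"
        source = record.get("eventSource") or record.get("EventSource")
        if source == "aws:s3":
            handle_s3(record)
        elif source == "aws:kinesis":
            handle_kinesis(record)
        else:
            print("Unhandled event source:", source)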

How to implement Change Data Capture (CDC) using apache spark and kafka?

I am using spark-sql 2.4.1 with Java 1.8, along with the Kafka artifacts spark-sql-kafka-0-10_2.11 (2.4.3) and kafka-clients 0.10.0.0.
I need to join streaming data with metadata that is stored in RDS,
but the RDS metadata can be added to or changed.
If I read and load the RDS table data once in the application, it would become stale for joining with the streaming data.
I understand I need to use Change Data Capture (CDC).
How can I implement Change Data Capture (CDC) in my scenario?
Any clues or a sample way to implement it?
Thanks a lot.
You can stream a database into Kafka so that the contents of a table plus every subsequent change is available on a Kafka topic. From here it can be used in stream processing.
You can do CDC in two different ways:
Query-based: poll the database for changes, using Kafka Connect JDBC Source
Log-based: extract changes from the database's transaction log using e.g. Debezium
For more details and examples see http://rmoff.dev/ksny19-no-more-silos
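As a rough illustration of the consuming side, here is a sketch that reads such a CDC topic with Spark Structured Streaming and joins it with the main event stream; the broker address, topic names, schema, and join key are all illustrative, and the CDC messages are assumed to be flattened row images (e.g. Debezium with its unwrap transform):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("cdc-join").getOrCreate()

meta_schema = StructType().add("id", StringType()).add("name", StringType())

# RDS table changes streamed into Kafka by the CDC connector
metadata = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "rds.mydb.metadata")
    .load()
    .select(from_json(col("value").cast("string"), meta_schema).alias("m"))
    .select("m.*"))

# Main event stream to be enriched
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload"))

# Inner stream-stream join keeps the enrichment current as metadata changes arrive
enriched = events.join(metadata, "id")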

SPARK STREAMING: I want to do some streaming exercises, how to get a good stream data source?

I want to do some streaming exercises; how do I get a good stream data source?
I am looking for both structured streaming data sources and non-structured streaming data sources.
Will Twitter work?
Local files can be used as sources in structured streaming, e.g.:
stream = spark.readStream.schema(mySchema).option("maxFilesPerTrigger", 1).json("/my/data")
With this you can experiment with data transformations and output very easily, and there are many sample datasets online, e.g. on Kaggle.
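A slightly fuller sketch of that file-source approach, with an illustrative schema and paths (replace them with your own):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("file-stream-exercise").getOrCreate()

# Illustrative schema for JSON files dropped into /my/data
mySchema = (StructType()
    .add("id", StringType())
    .add("value", DoubleType()))

stream = (spark.readStream
    .schema(mySchema)
    .option("maxFilesPerTrigger", 1)   # feed one file per micro-batch
    .json("/my/data"))

# Print each micro-batch to the console while experimenting
query = (stream.writeStream
    .format("console")
    .outputMode("append")
    .start())
query.awaitTermination()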
If you want something production-like, the Twitter API is a good option. You will need some sort of messaging middleware though, like Kafka or Azure Event Hubs: a simple app can send tweets there and you will be able to pick them up easily from Spark. You can also generate data yourself on the input side instead of depending on Twitter.

How to deal with concatenated Avro files?

I'm storing data generated from my web application in Apache Avro format. The data is encoded and sent to an Apache Kinesis Firehose that buffers and writes the data to Amazon S3 every 300 seconds or so. Since I have multiple web servers, this results in multiple blobs of Avro files being sent to Kinesis, upon which it concatenates and periodically writes them to S3.
When I grab the file from S3, I can't use the normal Avro tools to decode it, since it's actually multiple files in one. I could add a delimiter I suppose, but that seems risky in the event that the data being logged also contains the same delimiter.
What's the best way to deal with this? I couldn't find anything in the standard that supports multiple Avro files concatenated into the same file.
It looks like Firehose currently doesn't provide any support for this use case, but it's doable with a regular Kinesis stream.
Instead of sending to Firehose, send your data to a Kinesis stream,
and define your own AWS Lambda function (with a Kinesis event source) that reads the data from the stream and writes it to S3 as a proper Avro file. Here you won't face the problem Firehose has, because you already know the data is in Avro format (and you probably own the schema), so it's up to you to decode and encode it properly (and write the file to S3 in one go).
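A minimal sketch of such a Lambda, assuming each Kinesis record carries a single schemaless Avro datum (no container framing) and using fastavro; the bucket, key, and schema are illustrative:

import base64
import io
import uuid

import boto3
import fastavro

# Illustrative schema; use your real writer schema here
SCHEMA = fastavro.parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [{"name": "payload", "type": "string"}],
})

s3 = boto3.client("s3")

def handler(event, context):
    datums = []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        # Decode one schemaless Avro datum per Kinesis record
        datums.append(fastavro.schemaless_reader(io.BytesIO(raw), SCHEMA))

    # Re-encode all datums as a single, well-formed Avro container file
    buf = io.BytesIO()
    fastavro.writer(buf, SCHEMA, datums)
    s3.put_object(
        Bucket="my-bucket",                          # hypothetical bucket
        Key="avro/{}.avro".format(uuid.uuid4()),
        Body=buf.getvalue(),
    )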
