SPARK STREAMING: I want to do some streaming exercises, how to get a good stream data source? - apache-spark

I want to do some streaming exercises. How can I get a good streaming data source?
I am looking for both structured streaming data sources and non-structured streaming data sources.
Will Twitter work?

Local files can be used as sources in structured streaming, e.g.:
stream = spark.readStream.schema(mySchema).option("maxFilesPerTrigger", 1).json("/my/data")
With this you can experiment with data transformations and output very easily, and there are many sample datasets available online, e.g. on Kaggle.
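A minimal end-to-end sketch of that approach (the schema, path, and column names below are hypothetical placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("file-stream-demo").getOrCreate()

    # Hypothetical schema for JSON files dropped into /my/data
    mySchema = StructType([
        StructField("user", StringType()),
        StructField("value", IntegerType()),
    ])

    # Read the directory as a stream, picking up one file per micro-batch
    stream = (spark.readStream
        .schema(mySchema)
        .option("maxFilesPerTrigger", 1)
        .json("/my/data"))

    # Any DataFrame transformation works here, e.g. a simple aggregation
    counts = stream.groupBy("user").count()

    # Write each micro-batch result to the console for inspection
    query = (counts.writeStream
        .outputMode("complete")
        .format("console")
        .start())
    query.awaitTermination()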
If you want something more production-like, the Twitter API is a good option. You will need some sort of messaging middleware though, like Kafka or Azure Event Hubs: a simple app can push the tweets there and you will be able to pick them up easily from Spark. You can also generate data yourself on the input side instead of depending on Twitter.
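For self-generated input, Spark's built-in rate source produces a timestamped test stream with no external service at all; a minimal sketch (the rows-per-second value is arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rate-demo").getOrCreate()

    # Built-in "rate" source: emits rows with (timestamp, value) columns
    rate = (spark.readStream
        .format("rate")
        .option("rowsPerSecond", 10)
        .load())

    # Print the generated rows to the console
    query = (rate.writeStream
        .format("console")
        .outputMode("append")
        .start())
    query.awaitTermination()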

Related

How can I do real-time log analysis using Spark Streaming? (architecture image attached)

A Spark Streaming application receives data in real time from a lot of IoT devices, but each of them sends only a small amount of data.
The overall flow looks like this: IoT -> Kafka (1 topic / all data) -> Spark Streaming (filter error logs) -> DB (save) -> alert screen
Is there a good way to do real-time log analysis using Spark or Python?
You can use the Spark-Kafka connector to stream the data from the Kafka queue.
This doc is a good starting point for Structured Streaming with Kafka: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
Once you get the streaming DataFrame from Kafka, you can apply Spark's filter() function to your incoming data set.
This Databricks reference application also shows how to implement a log analysis application using Spark Streaming:
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/app/index.html
You can use the above as a reference.
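A rough sketch of the Kafka-to-filter step, assuming a broker at localhost:9092 and a single topic named iot-logs (both placeholder names) and that each record value is a plain-text log line:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Requires the spark-sql-kafka-0-10 package on the classpath
    spark = SparkSession.builder.appName("log-filter").getOrCreate()

    # Subscribe to the Kafka topic carrying all IoT logs
    logs = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "iot-logs")
        .load())

    # Kafka values arrive as binary; cast to string, then keep only error lines
    errors = (logs
        .selectExpr("CAST(value AS STRING) AS line")
        .filter(col("line").contains("ERROR")))

    # For the exercise, print matches to the console; in a real pipeline this
    # would go to a sink such as a database, e.g. via foreachBatch
    query = (errors.writeStream
        .format("console")
        .outputMode("append")
        .start())
    query.awaitTermination()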

Read from multiple sources using Nifi, group topics in Kafka and subscribe with Spark

We use Apache NiFi to get data from multiple sources like Twitter and Reddit at a specific interval (for example 30 s). We would then like to send it to Apache Kafka, and it should probably group both the Twitter and Reddit messages into one topic somehow, so that Spark always receives the data from both sources for a given interval at once.
Is there any way to do that?
@Sebastian What you describe is basic NiFi routing. You would just route both Twitter and Reddit to the same downstream Kafka producer and the same topic. After you get data into NiFi from each service, route it through UpdateAttribute and set the attribute topicName to what you want for each source. If there are additional steps per data source, do them after UpdateAttribute and before PublishKafka.
If you configure all the upstream routes as above, you can route all the different data sources to the PublishKafka processor and use ${topicName} as the topic name dynamically.
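On the Spark side it works either way; Structured Streaming can subscribe to one merged topic or to several topics at once. A sketch, with hypothetical topic names and broker address:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("social-stream").getOrCreate()

    # Option A: NiFi publishes both sources into a single merged topic
    merged = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "social-merged")
        .load())

    # Option B: one topic per source; subscribe to both in one stream
    per_source = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "twitter,reddit")
        .load())

    # The "topic" column tells you which source each record came from
    tagged = per_source.selectExpr("topic", "CAST(value AS STRING) AS value")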

Design Question Kafka Consumer/Producer vs Kafka Stream

I'm working with Node.js microservices; so far they communicate through the Kafka Consumer/Producer APIs. Now I need to build a Logger MS which must record all the messages and do some processing (parse and save to DB), but I'm not sure whether the current approach could be improved using Kafka Streams or whether I should continue using Consumers.
The Streams API is a higher level abstraction that sits on top of the Consumer/Producer APIs. The Streams API allows you to filter and transform messages, and build a topology of processing steps.
For what you're describing, if you're just picking up messages and doing a single processing step, the Consumer API is probably fine. That said, you could do the same thing with the Streams API too and simply not use the other features.
build a Logger MS which must record all the messages and do some processing (parse and save to DB)
I would suggest using something like the Streams API or a Node.js producer + consumer to parse and write back to Kafka.
From your parsed/filtered/sanitized messages, you can run a Kafka Connect cluster to sink your data into a DB.
could be improved using Kafka Stream or if I should continue using Consumers
Ultimately, it depends on what you need. The peek and foreach methods of the Streams DSL are functionally equivalent to a Consumer.
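For illustration of the plain Consumer approach (the question uses Node.js, but the pattern is identical), a minimal Python sketch with the kafka-python client; the topic name and processing step are placeholders:

    import json
    from kafka import KafkaConsumer  # pip install kafka-python

    # Subscribe to the topic the other microservices already produce to
    consumer = KafkaConsumer(
        "service-messages",                    # placeholder topic name
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    # One message in, one processing step out
    for message in consumer:
        record = message.value
        print(record)  # replace with your own parse-and-save-to-DB logic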

How to implement Change Data Capture (CDC) using apache spark and kafka?

I am using spark-sql-2.4.1v with Java 1.8, and the Kafka versions spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0.
I need to join streaming data with metadata that is stored in RDS.
But the RDS metadata can be added to or changed.
If I read and load the RDS table data in the application, it becomes stale for joining with the streaming data.
I understand that I need to use Change Data Capture (CDC).
How can I implement Change Data Capture (CDC) in my scenario?
Any clues or a sample way to implement Change Data Capture (CDC)?
Thanks a lot.
You can stream a database into Kafka so that the contents of a table plus every subsequent change is available on a Kafka topic. From here it can be used in stream processing.
You can do CDC in two different ways:
Query-based: poll the database for changes, using Kafka Connect JDBC Source
Log-based: extract changes from the database's transaction log using e.g. Debezium
For more details and examples see http://rmoff.dev/ksny19-no-more-silos
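Once the CDC connector is feeding the metadata table into a Kafka topic, a rough Spark sketch of consuming it alongside the event stream might look like this. The topic names, columns, and flat JSON payload are placeholders; a real Debezium envelope needs extra parsing, and production stream-stream joins should bound their state with watermarks:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("cdc-join").getOrCreate()

    # Hypothetical schema of the metadata rows emitted by the CDC connector
    meta_schema = StructType([
        StructField("device_id", StringType()),
        StructField("device_name", StringType()),
    ])

    # Metadata changes streamed from RDS into Kafka by the CDC connector
    meta = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "rds.metadata")          # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), meta_schema).alias("m"))
        .select("m.*"))

    # The main event stream, keyed by the same device_id
    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")                # placeholder topic
        .load()
        .selectExpr("CAST(key AS STRING) AS device_id",
                    "CAST(value AS STRING) AS payload"))

    # Inner stream-stream join: metadata updates are picked up as they arrive
    joined = events.join(meta, "device_id")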

Persisting Kinesis messages to S3 in Parquet format

I have a Kinesis stream to which my app writes ~10K messages per second, in proto format.
I would like to persist those messages to S3 in Parquet format. For easy searching afterwards, I need to partition my data by the User ID field, which is part of the message.
Currently, I have a Lambda function that is triggered by a Kinesis event. It receives up to 10K messages, groups them by User ID, and then writes the files to S3 in Parquet format.
My problem is that the files this Lambda function generates are very small, ~200 KB, while I would like to create ~200 MB files for better query performance (I query those files using AWS Athena).
A naive approach would be to write another Lambda function that reads those files and merges them (rollup) into a big file, but I feel like I'm missing something and there must be a better way of doing it.
I'm wondering if I should use Spark as described in this question.
Maybe you could use two additional services from AWS:
AWS Kinesis Data Analytics to consume data from the Kinesis stream and run SQL analysis over your data (group, filter, etc.). See more here: https://aws.amazon.com/kinesis/data-analytics/
AWS Kinesis Data Firehose plugged in after Kinesis Data Analytics. With this service, you can write a Parquet file to S3 every X minutes or every Y MB of arriving data. See more here: https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
A second way to do it is with Spark Structured Streaming: read from the AWS Kinesis stream, filter out unusable data, and export to S3 as described here:
https://databricks.com/blog/2017/08/09/apache-sparks-structured-streaming-with-amazon-kinesis-on-databricks.html
P.S.: That example shows how to output to a local filesystem, but you can change it to an S3 location.
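For the Structured Streaming route, a rough sketch of reading the stream and writing larger, partitioned Parquet files follows. It assumes the Databricks Kinesis connector (format "kinesis") from that blog post, that the Kinesis partition key carries the User ID, and that proto decoding of the payload is handled separately; stream name, region, and paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kinesis-to-parquet").getOrCreate()

    # Read from Kinesis (connector-specific options; see the Databricks post)
    raw = (spark.readStream
        .format("kinesis")
        .option("streamName", "my-stream")        # placeholder
        .option("region", "us-east-1")            # placeholder
        .load())

    # The connector exposes a partitionKey and a binary data column; real proto
    # decoding of "data" is elided here
    decoded = raw.selectExpr(
        "CAST(partitionKey AS STRING) AS user_id",
        "data")

    # Larger trigger intervals produce fewer, bigger Parquet files per partition
    query = (decoded.writeStream
        .format("parquet")
        .option("path", "s3a://my-bucket/events/")                   # placeholder
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
        .partitionBy("user_id")
        .trigger(processingTime="10 minutes")
        .start())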
