Read from multiple sources using Nifi, group topics in Kafka and subscribe with Spark

We use Apache NiFi to pull data from multiple sources, such as Twitter and Reddit, at a fixed interval (for example, every 30 s). We would then like to send it to Apache Kafka, ideally grouping the Twitter and Reddit messages into one topic so that Spark always receives the data from both sources for a given interval at once.
Is there any way to do that?

@Sebastian What you describe is basic NiFi routing. You would just route both Twitter and Reddit to the same downstream Kafka producer and the same topic. After you get data into NiFi from each service, route it to UpdateAttribute and set the attribute topicName to the value you want for each source. If there are additional steps per data source, do them after UpdateAttribute and before PublishKafka.
If you build all the upstream routes as above, you can route all the different data sources to a single PublishKafka processor and set its topic dynamically with ${topicName}.
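On the consuming side, once both sources share one topic, getting "both sources for a given interval" comes down to bucketing records by timestamp. The following is a minimal sketch in plain Python (it is not NiFi or Spark API), assuming each record carries a hypothetical `source` tag and an epoch-seconds `ts` field, and using the 30-second interval from the question as a tumbling window:

```python
from collections import defaultdict

WINDOW_SECONDS = 30  # interval taken from the question; adjust as needed


def window_start(ts, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains ts (epoch seconds)."""
    return ts - (ts % size)


def group_by_window(records, size=WINDOW_SECONDS):
    """Bucket records from a shared topic by window start, then by source."""
    windows = defaultdict(lambda: defaultdict(list))
    for record in records:
        start = window_start(record["ts"], size)
        windows[start][record["source"]].append(record["text"])
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {start: dict(by_source) for start, by_source in windows.items()}


# Hypothetical records, as they might arrive interleaved on one topic.
records = [
    {"source": "twitter", "ts": 1000, "text": "t1"},
    {"source": "reddit", "ts": 1010, "text": "r1"},
    {"source": "twitter", "ts": 1025, "text": "t2"},
    {"source": "reddit", "ts": 1031, "text": "r2"},
]
grouped = group_by_window(records)
```

In Spark the same grouping would normally be expressed with a time window over the event timestamp rather than by hand; the sketch only shows what that grouping produces.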


How to implement Change Data Capture (CDC) using apache spark and kafka?

I am using spark-sql-2.4.1v with Java 1.8, and the Kafka libraries spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0.
I need to join streaming data with metadata that is stored in RDS,
but the RDS metadata can be added to or changed.
If I read and load the RDS table data once in the application, it becomes stale for joining with the streaming data.
I understand that I need to use Change Data Capture (CDC).
How can I implement Change Data Capture (CDC) in my scenario?
Any clues or a sample way to implement it?
Thanks a lot.
You can stream a database into Kafka so that the contents of a table plus every subsequent change are available on a Kafka topic. From there the data can be used in stream processing.
You can do CDC in two different ways:
Query-based: poll the database for changes, using Kafka Connect JDBC Source
Log-based: extract changes from the database's transaction log using e.g. Debezium
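The query-based variant can be sketched without Kafka at all. The example below assumes a hypothetical `metadata` table with an `updated_at` column that the application maintains on every insert and update; `sqlite3` stands in for RDS, and `poll_changes` is an illustrative helper, not part of Kafka Connect.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Return rows changed after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM metadata "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


def demo():
    """Run two polling rounds against an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)", [(1, "a", 100), (2, "b", 105)]
    )
    # First poll: the initial table contents.
    first, mark = poll_changes(conn, 0)
    # The table changes between polls: one update, one insert.
    conn.execute("UPDATE metadata SET name = 'b2', updated_at = 110 WHERE id = 2")
    conn.execute("INSERT INTO metadata VALUES (3, 'c', 112)")
    # Second poll: only rows newer than the previous high-water mark.
    second, mark = poll_changes(conn, mark)
    conn.close()
    return first, second, mark


first, second, mark = demo()
```

A polling approach like this cannot observe deleted rows, which is one reason to consider the log-based option.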
For more details and examples see

SPARK STREAMING: I want to do some streaming exercises, how to get a good stream data source?

I want to do some streaming exercises, how to get a good stream data source ?
I am looking for both structured and non-structured streaming data sources.
Will Twitter work?
Local files can be used as sources in structured streaming, e.g.:
stream = spark.readStream.schema(mySchema).option("maxFilesPerTrigger", 1).json("/my/data")
With this you can experiment with data transformations and output very easily, and there are many sample datasets online, e.g. on Kaggle.
If you want something production-like, the Twitter API is a good option. You will need some sort of messaging middleware though, like Kafka or Azure Event Hubs: a simple app can send tweets there and you will be able to pick them up easily from Spark. You can also generate data yourself on the input side instead of depending on Twitter.
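Generating the input yourself can be as simple as writing small JSON files into the directory the file source reads from. The sketch below is one way to do it; the event fields (`event_id`, `user`, `value`) and the helper name `write_event_files` are made up for illustration, and the output is written to a temporary directory rather than to /my/data.

```python
import json
import os
import random
import tempfile


def write_event_files(directory, n_files=3, events_per_file=5, seed=0):
    """Write n_files JSON Lines files of synthetic events and return their paths."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    os.makedirs(directory, exist_ok=True)
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, f"events-{i:04d}.json")
        with open(path, "w", encoding="utf-8") as fh:
            for j in range(events_per_file):
                event = {
                    "event_id": i * events_per_file + j,
                    "user": f"user{rng.randint(1, 3)}",
                    "value": round(rng.random(), 3),
                }
                # One JSON object per line, the layout Spark's JSON reader expects by default.
                fh.write(json.dumps(event) + "\n")
        paths.append(path)
    return paths


out_dir = tempfile.mkdtemp(prefix="stream-input-")
paths = write_event_files(out_dir)
```

With maxFilesPerTrigger set to 1, each generated file would be picked up as its own micro-batch, which makes it easy to see how results change batch by batch.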

App server Log process

I have a requirement from my client to process the application (Tomcat) server log files for a back-end REST-based app server that is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is: get the data from the app server logs --> push it to Spark Streaming via Kafka and process it --> store the results in Hive --> use Zeppelin to read back the processed, centralized log data and generate reports as per the client's requirements.
But as far as I know, Kafka does not have any feature that reads data from a log file and posts it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the logs from time to time and sends them to the Kafka broker, which I would prefer not to do: it would not be real time, and there could be synchronization issues to worry about, since we have 4 instances of the application server.
Another option, I think, is Apache Flume.
Can anyone suggest which approach would be better in this case, or whether Kafka has any mechanism to read data from a log file on its own, and what advantages or disadvantages we could face in the future in both cases?
I guess another option is Flume + Kafka together, but I cannot speculate much about what would happen, as I have almost no knowledge of Flume.
Any help will be highly appreciated...... :-)
Thanks a lot ....
You can use Kafka Connect (file source connector) to read the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and process the data:
Tomcat -> log files -> Kafka Connect -> Kafka -> Spark -> Hive
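The processing step in that pipeline boils down to parsing each log line and counting. Here is a rough sketch in plain Python, assuming the logs follow the common access log layout (host, identity, user, timestamp, request line, status, size); the actual Tomcat pattern may differ, and `parse_line` and `frequency_report` are illustrative names.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_line(line):
    """Return the fields of one access log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def frequency_report(lines):
    """Count requests per (method, path), skipping lines that cannot be parsed."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[(entry["method"], entry["path"])] += 1
    return counts


# Made-up sample lines; the last one is deliberately malformed.
sample = [
    '10.0.0.1 - - [12/Mar/2019:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Mar/2019:10:00:02 +0000] "GET /api/users HTTP/1.1" 200 498',
    '10.0.0.1 - - [12/Mar/2019:10:00:03 +0000] "POST /api/orders HTTP/1.1" 201 87',
    "not an access log line",
]
report = frequency_report(sample)
```

The same parse function could be applied per record inside the Spark job, with the counting done by a grouped aggregation before the results are written to Hive.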

Spark Streaming from Kafka Consumer

I might need to work with Kafka and I am absolutely new to it. I understand that there are Kafka producers which publish the logs (called events, messages, or records in Kafka) to Kafka topics.
I will need to read from Kafka topics via a consumer. Do I need to set up the consumer API first and then stream using a Spark StreamingContext (PySpark), or can I use the KafkaUtils module directly to read from Kafka topics?
In case I need to set up a Kafka consumer application, how do I do that? Could you please share links to the right docs?
Thanks in Advance!!
Spark provides built-in Kafka integration, so you don't need to write a custom consumer. There are two approaches to connect to Kafka: 1. the receiver-based approach, and 2. the direct approach.
For more detail go through this link
There's no need to set up a separate Kafka consumer application; Spark itself creates the consumer, using one of two approaches. One is the receiver-based approach, which uses KafkaUtils.createStream, and the other is the direct approach, which uses KafkaUtils.createDirectStream.
With the direct approach Spark tracks the Kafka offsets itself, so after a failure in Spark Streaming a restarted job (with checkpointing enabled) continues from the offsets where it left off rather than losing data.
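The idea of resuming from a stored offset can be shown with a toy model. This is a conceptual sketch only, not Spark or Kafka API: the list `log` stands in for a topic partition, the dict `committed` for checkpointed offsets, and `fail_at` is a hypothetical knob that simulates a crash.

```python
def consume(log, committed, sink, fail_at=None):
    """Process records starting at the committed offset.

    committed holds the next offset to read and plays the role of a checkpoint.
    fail_at simulates a crash just before that offset is processed.
    """
    offset = committed.get("next_offset", 0)
    while offset < len(log):
        if offset == fail_at:
            raise RuntimeError(f"simulated failure at offset {offset}")
        sink.append(log[offset])  # "process" the record
        offset += 1
        committed["next_offset"] = offset  # record progress only after processing


log = ["e0", "e1", "e2", "e3", "e4"]
committed, sink = {}, []

try:
    consume(log, committed, sink, fail_at=3)
except RuntimeError:
    pass  # the "job" died after handling offsets 0..2

offset_after_crash = committed["next_offset"]

# Restart: the consumer picks up at the stored offset instead of offset 0.
consume(log, committed, sink)
```

Because progress is recorded only after a record has been handled, a crash leads to a resume from the last stored position, and nothing is skipped.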
For more details, use this link:

Aggregate separate Flume streams in Spark

I am researching the ability to do some "realtime" log processing in our setup and I have a question on how to proceed.
So the current setup (or as we intend to do it) is as follows:
Server A generates logfiles through Rsyslog to a folder per customer.
Server B generates logfiles through Rsyslog to a folder per customer.
Both server A and B generate up to 15 logfiles (1 per customer) in a folder per customer, the structure looks like this:
On server C we have a Flume sink running that listens to Rsyslog TCP messages from server A and server B. Currently, for testing, we only have 1 Flume sink for 1 customer, but I think we will need 1 Flume sink per customer.
This Flume sink then forwards these log lines to a Spark application that should aggregate the results per customer.
Now my question is: how can I make sure that Spark (Streaming) aggregates the results per customer? Let's say each customer has its own Flume sink; how can I make sure Spark aggregates each Flume stream separately and doesn't mix 2 or more Flume streams together?
Or is Kafka more suitable for this kind of scenario?
Any insights would be appreciated.
You can use Kafka with the customer id as the partition key. The basic idea in Kafka is that a message can have both a key and a value. With the default partitioner, Kafka sends all messages with the same key to the same partition (as long as the number of partitions does not change). Spark Streaming understands Kafka partitions and lets you handle each partition separately, even on separate nodes. If you want, you can use Flume's Kafka sink to write the messages to Kafka.
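Why keying by customer keeps each customer's lines together can be seen in a small simulation. This is a sketch under assumptions: `zlib.crc32` stands in for the hash (Kafka's default partitioner uses a different one), the partition count of 4 is arbitrary, and the customer ids and log lines are made up.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # arbitrary for this sketch


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition deterministically (stand-in hash, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(messages, num_partitions=NUM_PARTITIONS):
    """Place (customer_id, line) messages into partitions by customer id."""
    partitions = defaultdict(list)
    for customer_id, line in messages:
        partitions[partition_for(customer_id, num_partitions)].append((customer_id, line))
    return dict(partitions)


messages = [
    ("customer-a", "line 1"),
    ("customer-b", "line 1"),
    ("customer-a", "line 2"),
    ("customer-c", "line 1"),
    ("customer-b", "line 2"),
]
assigned = assign(messages)

# For each customer, collect the set of partitions its messages landed in.
partitions_per_customer = defaultdict(set)
for partition, items in assigned.items():
    for customer_id, _ in items:
        partitions_per_customer[customer_id].add(partition)
```

Several customers may share a partition, so the aggregation itself should still group by customer id; the key only guarantees that one customer's lines are never spread over several partitions.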
