Aggregate separate Flume streams in Spark - apache-spark

I am researching the possibility of doing some "realtime" log processing in our setup and I have a question on how to proceed.
So the current setup (or as we intend to do it) is as follows:
Server A generates logfiles through Rsyslog to a folder per customer.
Server B generates logfiles through Rsyslog to a folder per customer.
Both server A and server B generate up to 15 logfiles (1 per customer) in a folder per customer; the structure looks like this:
/var/log/CUSTOMER/logfile.log
On server C we have a Flume sink running that listens to Rsyslog TCP messages from server A and server B. Currently for testing we only have 1 Flume sink for 1 customer, but I think we will need 1 Flume sink per customer.
This Flume sink then forwards these loglines to a Spark application that should aggregate the results per customer.
Now my question is: how can I make sure that Spark (Streaming) will aggregate the results per customer? So let's say each customer will have its own Flume sink; how can I make sure Spark aggregates each Flume stream separately and doesn't mix 2 or more Flume streams together?
Or is Kafka more suitable for this kind of scenario?
Any insights would be appreciated.

You can use Kafka with the customer id as the partition key. The basic idea in Kafka is that a message can have both a key and a value. Kafka guarantees that all messages with the same key go to the same partition (Spark Streaming understands the concept of partitions in Kafka and lets you have a separate node handling every partition). If you want, you can use Flume's Kafka sink to write the messages to Kafka.
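To illustrate the key-to-partition idea, here is a minimal sketch using Kafka's Java producer API from Scala; the topic name, broker address and helper function are made up for the example (if you go through Flume's Kafka sink, the sink does the producing for you):

    // Sketch: produce log lines keyed by customer id so that all lines for one
    // customer land in the same partition. Topic and broker are placeholders.
    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "serverC:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    def sendLine(customerId: String, logLine: String): Unit =
      // Same key => same partition, so a per-partition consumer (or Spark task)
      // only ever sees one customer's lines.
      producer.send(new ProducerRecord[String, String]("customer-logs", customerId, logLine))

On the Spark side, mapping each record to a (customerId, line) pair and aggregating with reduceByKey (or updateStateByKey) keeps the results separated per customer regardless of how partitions are assigned to executors.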

Related

How do I set up a Spark application to pull from a single Kafka topic on multiple Spark nodes?

My application has a Kafka input stream for a single topic; it does some filtering and aggregation of the data, and then writes to Elasticsearch. What I'm seeing is that while the application is distributed to all of the Spark nodes and processing the data properly, only one node is pulling data, and the rest are idle.
Also, I am using an R53 hostname for the Kafka nodes. Should I use a comma-separated list of the Kafka nodes instead?
The topic has 20 partitions. I am running Spark 3.2.1 using only Spark Streaming (no DFS).
The topic has 20 partitions
Then up to 20 executors should be able to consume in parallel.
using an R53 hostname for the Kafka nodes
Any Kafka client, including Spark, will need to communicate with the brokers individually. This means you'll need to expose each broker's advertised.listeners setting such that Spark can communicate with each broker directly, and not via a single DNS name / load balancer address. If only one broker is resolvable, then you'll only be able to consume from (or produce to) just that one.
Should I use a comma-separated list of the Kafka nodes instead
It's recommended, but not strictly necessary. For example, what happens if the one broker at the single address provided is not responding? The bootstrap protocol returns all advertised.listeners addresses back to the client for the matching listener protocol.
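As a sketch of what that looks like in practice (Scala, spark-streaming-kafka-0-10, which is what the DStream API uses on Spark 3.x); the broker names, topic and group id are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    // setMaster is only for local testing; drop it when submitting to a cluster.
    val conf = new SparkConf().setAppName("kafka-to-es").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      // Comma-separated bootstrap list: any reachable broker can answer the
      // bootstrap request and return the full set of advertised listeners.
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "broker1:9092,broker2:9092,broker3:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "my-app",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest"
    )

    // One Kafka partition maps to one Spark partition, so a 20-partition topic
    // can be consumed by up to 20 tasks in parallel.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )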

Read from multiple sources using Nifi, group topics in Kafka and subscribe with Spark

We use Apache NiFi to get data from multiple sources like Twitter and Reddit at a specific interval (for example 30s). We would then like to send it to Apache Kafka, and it should probably somehow group both the Twitter and Reddit messages into 1 topic, so that Spark would always receive the data from both sources for a given interval at once.
Is there any way to do that?
#Sebastian What you describe is basic NiFi routing. You would just route both Twitter and Reddit to the same downstream Kafka producer and the same topic. After you get data into NiFi from each service, route it to an UpdateAttribute processor and set the attribute topicName to what you want for each source. If there are additional steps per data source, do them after UpdateAttribute and before PublishKafka.
If you configure all the upstream routes as above, you can route all the different data sources to the PublishKafka processor using ${topicName} dynamically.

App server Log process

I have a requirement from my client to process the application (Tomcat) server log files for a back-end REST-based app server which is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is: get the data from the app server logs --> push it to Spark Streaming using Kafka and process it --> store the processed data in Hive --> use Zeppelin to get back that processed and centralized log data and generate reports as per the client's requirements.
But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the logs from time to time and sends them to the Kafka broker, which I would prefer not to do, as then it would not be real time, and there could be synchronization issues to worry about since we have 4 instances of the application server.
Another option, I think we have in this case is Apache Flume.
Can anyone suggest which would be the better approach in this case, or whether Kafka has any way to read data from log files on its own, and what advantages or disadvantages we would have in either case?
I guess another option is Flume + Kafka together, but I cannot speculate much about how that would work out as I have almost no knowledge of Flume.
Any help will be highly appreciated...... :-)
Thanks a lot ....
You can use Kafka Connect (the file source connector) to read/consume the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and churn through the data:
tomcat -> logs ---> kafka connect -> kafka -> spark -> Hive
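For illustration, a minimal standalone config for the bundled file source connector could look like the following; the file path and topic name are assumptions for the example:

    # connect-file-source.properties (sketch; adjust file and topic to your setup)
    name=tomcat-log-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/var/log/tomcat/access.log
    topic=tomcat-logs

You would run this with Kafka's connect-standalone script alongside a worker properties file; the bundled connector reads a single file, so shipping several log files typically means several connector configs.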

Spark Streaming from Kafka Consumer

I might need to work with Kafka and I am absolutely new to it. I understand that there are Kafka producers which publish the logs (called events, messages or records in Kafka) to Kafka topics.
I will need to work on reading from Kafka topics via a consumer. Do I need to set up the consumer API first and then stream using a Spark Streaming context (PySpark), or can I directly use the KafkaUtils module to read from Kafka topics?
In case I need to set up a Kafka consumer application, how do I do that? Please can you share links to the right docs.
Thanks in Advance!!
Spark provides an internal Kafka stream for which you don't need to create a custom consumer. There are 2 approaches to connect with Kafka: 1. with a receiver, 2. the direct approach.
For more detail, go through this link: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
There's no need to set up a Kafka consumer application; Spark itself creates a consumer, with 2 approaches. One is the receiver-based approach, which uses the KafkaUtils class, and the other is the direct approach, which uses the createDirectStream method.
In either case, on a failure in Spark Streaming there's no loss of data; it starts again from the offset where you left off.
For more details, use this link: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
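As a rough Scala sketch of the direct approach described in that guide (assuming the older spark-streaming-kafka-0-8 integration it covers; the same integration also exposes an analogous KafkaUtils.createDirectStream in PySpark); broker address and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-logs").setMaster("local[*]"), Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("logs")

    // Spark manages the Kafka consumer and the offsets itself;
    // no separate consumer application is needed.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).print()   // each record is a (key, value) pair

    ssc.start()
    ssc.awaitTermination()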

Worker Queue option in Kafka

We are developing an application which will receive time-series sensor data as byte arrays from a set of devices via UDP. This data needs to be parsed and stored in a Cassandra database.
We were using RabbitMQ as the message broker, with work-queue-based consumers to parse the data and push it into Cassandra. Because of increasing traffic, we are concerned about RabbitMQ performance and are planning to move to Kafka. Our understanding is that the same can be implemented using a consumer group in Kafka. Is our understanding correct?
With Apache Kafka, you can scale a topic relatively easily. In order to be able to process more data in the same time you'll need to:
Have multiple consumers in the same consumer group; you'll be able to consume multiple messages at the same time. You are limited by the number of partitions of the topic.
Increase the number of partitions for the topic, and increase the number of consumers accordingly.
Increase the number of brokers if you still need to process more data.
I would approach the scalability in the order described above, but Kafka can handle a lot. In a setup with 2 brokers, 4 partitions per topic and 2 consumers (each consumer using one thread per partition), where each consumer decodes the JSON into a Java object, enriches it and stores it to Cassandra, it can handle 30k messages/s (with the data batched into batches of 200 insert statements).
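To make the work-queue analogy concrete, here is a hedged sketch of one consumer-group worker using Kafka's Java consumer API from Scala; the broker addresses, topic and group id are made up, and the parse/Cassandra write is left as a comment. Running several copies of this process with the same group.id gives you the RabbitMQ work-queue behaviour, up to one active consumer per partition:

    import java.time.Duration
    import java.util.Properties
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.jdk.CollectionConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092")
    props.put("group.id", "sensor-parsers")   // same group => partitions shared among workers
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(java.util.Collections.singletonList("sensor-data"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        // parse the byte[] payload and write to Cassandra here
        println(s"partition=${record.partition()} offset=${record.offset()}")
      }
    }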
