Spark Streaming from Kafka Consumer - apache-spark

I might need to work with Kafka and I am absolutely new to it. I understand that there are Kafka producers which publish the logs (called events, messages, or records in Kafka) to Kafka topics.
I will need to read from Kafka topics via a consumer. Do I need to set up a consumer application first and then stream using a Spark StreamingContext (PySpark), or can I use the KafkaUtils module directly to read from Kafka topics?
In case I need to set up a Kafka consumer application, how do I do that? Could you please share links to the right docs?
Thanks in advance!

Spark provides built-in Kafka integration, so you don't need to write a custom consumer. There are two approaches to connect to Kafka: 1. the receiver-based approach, and 2. the direct approach.
For more detail, go through this link: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
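As a rough illustration of the receiver-based approach in PySpark (a minimal sketch assuming a local ZooKeeper at localhost:2181 and a topic called "logs"; all names are placeholders, and it uses the older spark-streaming-kafka-0-8 Python API, which was removed in Spark 3.0):

    # Receiver-based approach: a receiver pulls data from Kafka via ZooKeeper.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaReceiverExample")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # createStream(ssc, zkQuorum, groupId, {topic: number of partitions to consume})
    stream = KafkaUtils.createStream(ssc, "localhost:2181", "my-group", {"logs": 1})

    # Each element is a (key, value) pair; print the message values.
    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()

The job needs the matching Kafka integration jar on the classpath, for example via spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:<your Spark version>.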

There's no need to set up a Kafka consumer application; Spark itself acts as the consumer, with two approaches. One is the receiver-based approach, which uses KafkaUtils.createStream, and the other is the direct approach, which uses KafkaUtils.createDirectStream.
In case of failure, Spark Streaming can resume from the offsets where it left off, so there is no loss of data, provided checkpointing is enabled or you manage the offsets yourself.
For more details, use this link: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
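To make the direct approach concrete, here is a minimal PySpark sketch (broker address, topic name, and app name are placeholders; like the receiver example above, it relies on the spark-streaming-kafka-0-8 Python API, which newer Spark versions replace with Structured Streaming's Kafka source):

    # Direct (receiverless) approach: Spark queries the brokers and tracks offsets itself.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaDirectExample")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # createDirectStream(ssc, [topics], kafkaParams)
    stream = KafkaUtils.createDirectStream(
        ssc, ["logs"], {"metadata.broker.list": "localhost:9092"})

    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()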

Related

Read from multiple sources using Nifi, group topics in Kafka and subscribe with Spark

We use Apache NiFi to get data from multiple sources like Twitter and Reddit at a specific interval (for example, 30s). We would then like to send it to Apache Kafka, and it should probably group both the Twitter and Reddit messages into one topic so that Spark always receives data from both sources for a given interval at once.
Is there any way to do that?
@Sebastian What you describe is basic NiFi routing. You would just route both Twitter and Reddit to the same downstream Kafka producer and the same topic. After you get data into NiFi from each service, route it to an UpdateAttribute processor and set the attribute topicName to whatever you want for each source. If there are additional steps per data source, do them after UpdateAttribute and before PublishKafka.
If you set up all the upstream routes as above, you can route all the different data sources to the PublishKafka processor using ${topicName} dynamically.

Design Question Kafka Consumer/Producer vs Kafka Stream

I'm working with Node.js microservices; so far they communicate through the Kafka Consumer/Producer APIs. Now I need to build a Logger MS which must record all the messages and do some processing (parse and save to db), but I'm not sure if the current approach could be improved using Kafka Streams or if I should continue using Consumers.
The Streams API is a higher level abstraction that sits on top of the Consumer/Producer APIs. The Streams API allows you to filter and transform messages, and build a topology of processing steps.
For what you're describing, if you're just picking up a message and doing a single processing step, the Consumer API is probably fine. That said, you could do the same thing with the Streams API too and simply not use the other features.
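To make that concrete, a bare Consumer-API loop for "pick up a message, parse it, save it" can be as small as the sketch below. It is shown with the Python kafka-python client purely for illustration (the asker's Node.js clients follow the same pattern), and the topic name, broker address, and save_to_db function are placeholders:

    import json
    from kafka import KafkaConsumer  # pip install kafka-python

    def save_to_db(record):
        # Placeholder: write the parsed record to your database of choice.
        pass

    consumer = KafkaConsumer(
        "app-messages",                      # placeholder topic name
        bootstrap_servers="localhost:9092",  # placeholder broker address
        group_id="logger-ms",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        save_to_db(message.value)  # one processing step per message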
build a Logger MS which must record all the messages and do some processing (parse and save to db)
I would suggest using something like Streams API or Nodejs Producer + Consumer to parse and write back to Kafka.
From your parsed/filtered/sanitized messages, you can run a Kafka Connect cluster to sink your data into a DB
could be improved using Kafka Streams or if I should continue using Consumers
Ultimately, it depends on what you need. The peek and foreach methods of the Streams DSL are functionally equivalent to a plain Consumer loop.
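As a rough sketch of that "consume, parse, write back to Kafka" flow (again using kafka-python only for illustration; the topic names and parse function are placeholders, and a Kafka Connect sink connector would then move the cleaned topic into the database):

    import json
    from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

    def parse(raw):
        # Placeholder: sanitize/normalize the incoming message.
        return {"cleaned": raw}

    consumer = KafkaConsumer(
        "raw-messages",                      # placeholder source topic
        bootstrap_servers="localhost:9092",  # placeholder broker address
        group_id="logger-ms",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for message in consumer:
        producer.send("parsed-messages", parse(message.value))  # placeholder target topic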

Spark Streaming task shutdown gracefully when Kafka client sends messages asynchronously

I am building a Spark Streaming application that reads input messages from a Kafka topic, transforms the messages, and outputs the result messages into another Kafka topic. Now I am confused about how to prevent data loss when the application restarts, covering both the Kafka read and the output. Does setting the Spark configuration "spark.streaming.stopGracefullyOnShutdown" to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or anywhere else that allows fast, fault-tolerant lookups).
Though, if you process some records and write the results before you're able to commit the offsets, then you'll end up reprocessing those records on restart. It's claimed that Spark can do exactly-once with Kafka, but as far as I know that only holds with proper offset management: for example, set enable.auto.commit to false in the Kafka properties, then only commit after you've processed and written the data to its destination.
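A minimal sketch of that at-least-once pattern with a plain consumer (kafka-python is used here only for illustration; the topic, broker address, and write_to_destination function are placeholders): disable auto-commit and commit offsets only after the output write has succeeded.

    from kafka import KafkaConsumer  # pip install kafka-python

    def write_to_destination(record):
        # Placeholder: write the transformed record to the output system.
        pass

    consumer = KafkaConsumer(
        "input-topic",                       # placeholder topic
        bootstrap_servers="localhost:9092",  # placeholder brokers
        group_id="my-app",
        enable_auto_commit=False,            # take control of offset commits
    )

    for message in consumer:
        write_to_destination(message.value)
        consumer.commit()  # commit only after the write succeeded (at-least-once)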
If you're just moving data between Kafka topics, Kafka Streams is the included Kafka library to do that, which doesn't require YARN or a cluster scheduler

App server log processing

I have a requirement from my client to process the application (Tomcat) server log files for a back-end REST-based app server which is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is: get the data from the app server logs --> push it to Spark Streaming using Kafka and process the data --> store the processed data in Hive --> use Zeppelin to pull back the processed and centralized log data and generate reports as per the client's requirements.
But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the logs from time to time and sends them to a Kafka broker, which I would prefer not to do, because then it would not be real time, and there would be synchronization issues to worry about since we have 4 instances of the application server.
Another option, I think, we have in this case is Apache Flume.
Can anyone suggest which would be the better approach in this case, or whether Kafka has any way to read data from a log file on its own, and what advantages or disadvantages we might have in future in both cases?
I guess another option is Flume + Kafka together, but I cannot speculate much on what will happen, as I have almost no knowledge about Flume.
Any help will be highly appreciated :-)
Thanks a lot.
You can use Kafka Connect (the file source connector) to read/consume the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and churn the data:
Tomcat -> logs -> Kafka Connect -> Kafka -> Spark -> Hive
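A minimal file-source connector config for that first hop could look like the sketch below (the connector name, log path, and topic name are placeholders; the FileStreamSource connector ships with Kafka and is run with the connect-standalone script alongside a worker config):

    # connect-file-source.properties (placeholder values)
    name=tomcat-log-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/opt/tomcat/logs/access.log
    topic=tomcat-logs

Started with something like: bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties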

Spark streaming with JMS - No API

Is there any API or way to integrate Spark Streaming with JMS? I am able to integrate with Kafka and sockets, but I am unable to integrate with a JMS queue or topic.
I think you should try the receiver API in Spark. You need to create a custom receiver:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
Also check the reply from Tathagata Das, who is a Spark contributor, at
www.apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-td5371.html
If you need help in detail, let me know.
I know this is an old post, but since I am working on something similar: you can use the Spark JMS receiver.
Spark JMS receiver
