Spark Streaming Design for 1000+ topics - apache-spark

I have to design a spark streaming application with below use case. I am looking for best possible approach for this.
I have application which pushing data into 1000+ different topics each has different purpose . Spark streaming will receive data from each topic and after processing it will write back to corresponding another topic.
Ex.
Input Type 1 Topic --> Spark Streaming --> Output Type 1 Topic
Input Type 2 Topic --> Spark Streaming --> Output Type 2 Topic
Input Type 3 Topic --> Spark Streaming --> Output Type 3 Topic
.
.
.
Input Type N Topic --> Spark Streaming --> Output Type N Topic and so on.
I need to answer following questions.
Is it a good idea to launch 1000+ spark streaming application per topic basis ? Or I should have one streaming application for all topics as processing logic going to be same ?
If one streaming context , then how will I determine which RDD belongs to which Kafka topic , so that after processing I can write it back to its corresponding OUTPUT Topic?
Client may add/delete topic from Kafka , how do dynamically handle in Spark streaming ?
How do I restart job automatically on failure ?
Any other issue you guys see here ?
Highly appreicate your response.

1000 different Spark applications will not be maintainable, imagine deploying, or upgrading each application.
You will have to use the recommended "Direct approach" instead of the Receiver approach, otherwise your application is going to use more than 1000 cores, if you don't have more, it will be able to receive data from your Kafka's topic but not to process them. From Spark Streaming Doc :
Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application.
You can see in the Kafka Integration (there is one for Kafka 0.8 and one for 0.10) doc how to see in which topic belongs a message
If a client add new topics or partitions, you will need to update your Spark Streaming's topics conf, and redeploy it. If you use Kafka 0.10 you can also use RegEx for topics' name, see Consumer Strategies. I've experienced reading from a deleted topic in Kafka 0.8, and there was no problems, still verify ("Trust, but verify")
See Spark Streaming's doc about Fault Tolerance, also use the mode --supervise when submiting your application to your cluster, see the Deploying documentation for more information
To achieve exactly-once semantic, I suggest this Github from Spark Streaming's main commiter : https://github.com/koeninger/kafka-exactly-once
Bonus, good similar StackOverFlow post : Spark: processing multiple kafka topic in parallel
Bonus2: Watch out for the soon-to-be-released Spark 2.2 and the Structured Streaming component

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above ? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one in a streaming query (I'm repeating it myself to remember better as I seem to have been caught multiple times while with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (not to share data for which Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select(cast('value).as("string")) ...
val df1 = strDf.filter(...) # in "parallel"
val df2 = strDf.filter(...) # in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.

Advantages of moving message parsing logic from spark to kafka

Currently, I am working on a use-case which requires reading JSON messages from Kafka and process them in Spark via Spark Streaming. We are expecting around 35 Million records per day. With this kind of a load, is it preferred to move the parsing logic (and some filtering logic based on JValue) to Kafka using Custom Kafka Deserializer (extending org.apache.kafka.common.serialization.Deserializer class). Will this have any performance overhead?
Thank you.

Use Akka with apache spark streaming & Kafka?

Below is the high level usecase which im trying to workon.
we have stream of students data published into a Kafka topic and our module has to read the student ids as stream and fetch associated data from multiple sources for each student and perform some calculation for each student and publish the associated calculation for each student into a kafka topic.
So here the question is it better to write a single big Spark job or use Akka to have separate service for each source so that actors can work parallely take bunch of student ids and get the data from respective source and perform some bunch Transformations and actions and finally a calculation associated with each student .
Or do i really need to use Akka here? Will Spark handles this efficiently internally?
Appreciate any thoughts here.
If your transformations take data from Kafka as input and produce output back into Kafka, it appears the most natural fit is Kafka Streams. I'd look to that first. Kafka Streams take advantage of the partitioning of data on Kafka to process partition groups in parallel to each other, but process messages sequentially within in each group, similarly how akka actors work in parallel to each other but each actor internally processes messages sequentially.
However, if your calculation requires e.g. machine learning or in general some iterative data-processing which does re-partitioning (shuffling in spark lingo) of the data between iterations, then Kafka Streams would no longer be that good a fit, I think. Then I'd consider Spark or Flink.
Akka is really powerful and you can use it in both these cases and more. However, it's a lower level library than Kafka Streams, Spark or Flink. Which means you have more power but also more considerations to think about. If using akka, I'd go for akka-streams. They have a good integration with kafka via the akka-stream-kafka (aka reactive-kafka) library.

How to do multiple Kafka topics to multiple Spark jobs in parallel

Please forgive if this question doesn't make sense, as I am just starting out with Spark and trying to understand it.
From what I've read, Spark is a good use case for doing real time analytics on streaming data, which can then be pushed to a downstream sink such as hdfs/hive/hbase etc.
I have 2 questions about that. I am not clear if there is only 1 spark streaming job running or multiple at any given time. Say I have different analytics I need to perform for each topic from Kafka or each source that is streaming into Kafka, and then push the results of those downstream.
Does Spark allow you to run multiple streaming jobs in parallel so you can keep aggregate analytics separate for each stream, or in this case each Kafka topic. If so, how is that done, any documentation you could point me to ?
Just to be clear, my use case is to stream from different sources, and each source could have potentially different analytics I need to perform as well as different data structure. I want to be able to have multiple Kafka topics and partitions. I understand each Kafka partition maps to a Spark partition, and it can be parallelized.
I am not sure how you run multiple Spark streaming jobs in parallel though, to be able to read from multiple Kafka topics, and tabulate separate analytics on those topics/streams.
If not Spark is this something thats possible to do in Flink ?
Second, how does one get started with Spark, it seems there is a company and or distro to choose for each component, Confluent-Kafka, Databricks-Spark, Hadoop-HW/CDH/MAPR. Does one really need all of these, or what is the minimal and easiest way to get going with a big data pipleine while limiting the number of vendors ? It seems like such a huge task to even start on a POC.
You have asked multiple questions so I'll address each one separately.
Does Spark allow you to run multiple streaming jobs in parallel?
Yes
Is there any documentation on Spark Streaming with Kafka?
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
How does one get started?
a. Book: https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624/
b. Easy way to run/learn Spark: https://community.cloud.databricks.com
I agree with Akbar and John that we can run multiple streams reading from different sources in parallel.
I like add that if you want to share data between streams, you can use Spark SQL API. So you can register your RDD as a SQL table and access the same table in all the streams. This is possible since all the streams share the same SparkContext

Apache Kafka and Spark Streaming

I'm reading through this blog post:
http://blog.jaceklaskowski.pl/2015/07/20/real-time-data-processing-using-apache-kafka-and-spark-streaming.html
It discusses about using Spark Streaming and Apache Kafka to do some near real time processing. I completely understand the article. It does show how I could use Spark Streaming to read messages from a Topic. I would like to know if there is a Spark Streaming API that I can use to write messages into Kakfa topic?
My use case is pretty simple. I have a set of data that I can read from a given source at a constant interval (say every second). I do this using reactive streams. I would like to do some analytics on this data using Spark. I want to have fault-tolerance, so Kafka comes into play. So what I would essentially do is the following (Please correct me if I was wrong):
Using reactive streams get the data from external source at constant intervals
Pipe the result into Kafka topic
Using Spark Streaming, create the streaming context for the consumer
Perform analytics on the consumed data
One another question though, is the Streaming API in Spark an implementation of the reactive streams specification? Does it have back pressure handling (Spark Streaming v1.5)?
No, at the moment, none of Spark Streaming's built-in receiver APIs are an implementation of the Reactive Streams implementation. But there's an issue for that you will want to follow.
But Spark Streaming 1.5 has internal back-pressure-based dynamic throttling. There's some work to extend that beyond throttling in the pipeline. This throttling is compatible with the Kafka direct stream API.
You can write to Kafka in a Spark Streaming application, here's one example.
(Full disclosure: I'm one of the implementers of some of the back-pressure work)
If you have to write the results stream to another Kafka topic let say 'topic_x', First, you must have columns by the name 'Key' and 'Value' in the result stream which you are trying to write to the topic_x.
result_stream = result_stream.selectExpr('CAST (key AS STRING)','CAST (value AS STRING)')
kafkaOutput = result_stream \
.writeStream \
.format('kafka') \
.option('kafka.bootstrap.servers','192.X.X.X:9092') \
.option('topic','topic_x') \
.option('checkpointLocation','./resultCheckpoint') \
.start()
kafkaOutput.awaitTermination()
For more details check the documentation at https://spark.apache.org/docs/2.4.1/structured-streaming-kafka-integration.html

Resources