Structured Streaming Job not using all workers - apache-spark

I have a Spark 2.0.2 Structured Streaming job connecting to an Apache Kafka data stream as the source. The job takes in Twitter data (JSON) from Kafka and uses CoreNLP to annotate the data with things like sentiment, part-of-speech tagging, etc. It works well with a local[*] master. However, when I set up a standalone Spark cluster, only one worker gets used to process the data, even though I have two workers with the same capabilities.
Is there something I need to set when submitting my job that I'm missing? I've tried setting --num-executors in my spark-submit command, but I have had no luck.
Thanks in advance for a pointer in the right direction.

I ended up creating the Kafka source stream with more partitions. This seems to have sped up the processing part roughly ninefold. Spark and Kafka have a lot of knobs, and lots to sift through... See Kafka topic partitions to Spark streaming.
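For context, here is a minimal Scala sketch of that setup (the broker address, topic name, and partition count are placeholder assumptions): the Structured Streaming Kafka source creates one Spark partition per Kafka topic partition, and an explicit repartition can additionally spread CPU-heavy work like the CoreNLP annotation across all workers.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TwitterAnnotator").getOrCreate()

// The Kafka source maps one Spark partition to each Kafka topic partition,
// so a single-partition topic pins all the work onto one task (one worker).
val tweets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
  .option("subscribe", "tweets")                     // placeholder topic
  .load()

// Optionally shuffle onto more partitions so the annotation step
// can use every core on both workers, regardless of the topic layout.
val spread = tweets.repartition(8) // e.g. total cores across the cluster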

Related

How to make sure that spark structured streaming is processing all the data in kafka

I developed a Spark Structured Streaming application that reads data from a Kafka topic, aggregates the data, and then outputs to S3.
Now I'm trying to find the most appropriate hardware resources for the application to run properly while also minimizing costs. Having found very little information on how to right-size a Spark cluster given the size of the input, I opted for a trial-and-error strategy: I deploy the application with minimal resources and add resources until it runs in a stable manner.
That being said, how can I make sure that the Spark application is able to process all the data in its Kafka input and is not falling behind? Is there a specific metric to look for? Job duration vs. trigger processing time?
Thank you for your answers!
Track Kafka consumer lag. There should be a consumer group created for your Spark Streaming job.
> bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group test-consumer-group
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test-foo 0 1 3 2 consumer-1-a5d61779-4d04-4c50-a6d6-fb35d942642d /127.0.0.1 consumer-1
If you have metric-collection and plotting tools like Prometheus and Grafana:
Save all Kafka metrics, including the consumer lag, to Prometheus/Graphite.
Use Grafana to query Prometheus and plot them on a graph.
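If you are on Structured Streaming, you can also compare the input rate to the processing rate that Spark itself reports after each trigger. Here is a minimal Scala sketch using the StreamingQueryListener API, assuming a SparkSession named spark (the logging and the alert condition are my own illustrative assumptions):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // If rows arrive faster than they are processed, the query is falling behind.
    if (p.inputRowsPerSecond > p.processedRowsPerSecond) {
      println(s"Falling behind: input=${p.inputRowsPerSecond} rows/s, " +
        s"processed=${p.processedRowsPerSecond} rows/s")
    }
  }
})

An input rate that stays above the processing rate across many triggers is the sign that the application is under-provisioned; occasional spikes are normal.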

Kafka Spark Streaming ingestion for multiple topics

We are currently ingesting Kafka messages into HDFS using Spark Streaming. So far we spawn a whole Spark job for each topic.
Since messages are produced pretty rarely for some topics (an average of one per day), we're thinking about organising the ingestion in pools.
The idea is to avoid creating a whole container (and related resources) for these "infrequent" topics. In fact, Spark Streaming accepts a list of topics as input, so we're thinking about using this feature to have a single job consume all of them, as shown in the sketch below.
Do you think the strategy outlined above is a good one? We also thought about batch ingestion, but we like to keep the real-time behaviour, so we excluded that option. Do you have any tips or suggestions?
Does Spark Streaming handle multiple topics as a source well in case of failures, in terms of offset consistency etc.?
Thanks!
I think Spark should be able to handle multiple topics fine, as it has supported this for a long time. And yes, Kafka Connect is not a Confluent-only API. Confluent provides connectors for its platform, but you can use Connect with the Apache version of Kafka too; it is a little more difficult, but it works. Apache Kafka also has documentation for the Connect API:
https://kafka.apache.org/documentation/#connectapi
Also, if you opt for multiple Kafka topics in a single Spark Streaming job, you may need to think about avoiding small files, since your message frequency seems very low.
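As a sketch of the single-job approach, here is what subscribing one direct stream to several topics looks like in Scala with the spark-streaming-kafka-0-10 integration (broker address, group id, topic names, and batch interval are placeholder assumptions):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(
  new SparkConf().setAppName("MultiTopicIngest"), Seconds(60))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",            // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "hdfs-ingestion"                    // placeholder
)

// One job (one container) for all topics; offsets are tracked per
// topic-partition, so recovery after a failure happens per partition.
val topics = Seq("busy-topic", "rare-topic-1", "rare-topic-2")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))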

Are Spark Streaming, Structured Streaming and Kafka Streaming the same thing?

I have come across three popular streaming techniques that are Spark Streaming, Structured Streaming and Kafka Streaming.
I have gone through various sites but haven't found the answer: are these three the same thing or different?
If they're not the same, what is the basic difference?
I am not looking for an in-depth answer, just an answer to the above question (yes or no) and a little intro to each of them so that I can explore more. :)
Thanks in advance
Subrat
I guess you are referring to Kafka Streams when you say "Kafka Streaming".
Kafka Streams is a JVM library and part of Apache Kafka. It is a way of processing data in Kafka topics that provides an abstraction layer. Applications using the Kafka Streams library can run anywhere (not just in the Kafka cluster; in fact, running them there is not recommended). They consume, process, and produce data to/from the Kafka cluster.
Spark Streaming is part of the Apache Spark distributed data-processing library and provides stream (as opposed to batch) processing. Spark initially provided batch computation only, so a specific layer, Spark Streaming, was added for stream processing. Spark Streaming can be fed with Kafka data, but it can be connected to other sources as well.
Structured Streaming, within the realm of Apache Spark, is a different approach that came to overcome certain limitations of the stream processing done with the previous Spark Streaming approach. It was added to Spark from a certain version onwards (2.0, IIRC).

Where does Spark Streaming run?

As I understand, Spark can analyze streams with Spark Streaming.
And Kafka can receive data from multiple sources.
What I don't understand is: if I have a Kafka cluster receiving data from multiple sources, will the data be sent to a database with Spark Streaming running? Or does Spark Streaming run on an application server?
If you use Spark Streaming, you need to set up a Spark cluster, and you submit your Spark Streaming job to that cluster. Thus, you will have two clusters: Kafka + Spark (or actually three, as you also need a ZooKeeper cluster for Kafka).
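So the typical flow is: package the streaming code into a jar and hand it to the Spark master with spark-submit, along the lines of the command below (the master URL, class, and jar name are placeholders); the cluster's workers then run the consumers and write the results out to your database.
> bin/spark-submit --master spark://spark-master:7077 --class com.example.StreamingJob my-streaming-job.jar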

Spark Streaming and Kafka: one cluster or several standalone boxes?

I am about to make a decision on using the Spark Streaming Kafka integration.
I have a Kafka topic (which I can break into several topics) queuing tens of thousands of messages per minute; my Spark Streaming application ingests the messages, applies transformations, and then updates a UI.
Knowing that all failures are handled and data are replicated in Kafka, what is the best option for implementing the Spark Streaming application in order to achieve the best possible performance and robustness:
One Kafka topic and one Spark cluster.
Several Kafka topics and several standalone Spark boxes (one machine with a standalone Spark cluster for each topic).
Several Kafka topics and one Spark cluster.
I am tempted to go for the second option, but I couldn't find people talking about such a solution.
An important element to consider in this case is the partitioning of the topic.
The parallelism level of your Kafka-Spark integration is determined by the number of partitions of the topic. The direct Kafka model simplifies the consumption model by establishing a 1:1 mapping between the partitions of the topic and the RDD partitions of the corresponding Spark job.
So the recommended setup would be: one Kafka topic with n partitions (where n is tuned for your use case) and a Spark cluster with enough resources to process the data from those partitions in parallel.
Option #2 feels like trying to re-implement what Spark gives you out of the box: Spark provides resilient distributed computing. Option #2 tries to parallelize the payload over several machines and deal with failure by having independent executors, but you get that with a single Spark cluster, with the benefit of improved resource usage and a single deployment.
Option #1 is straightforward, simple, and probably more efficient. If your requirements are met, that's the one to go for (and it honors the KISS principle).
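To make the 1:1 mapping concrete, here is a small Scala sketch with the 0-10 direct stream (topic, broker, and group id are placeholders; ssc is an already-created StreamingContext): each micro-batch RDD has exactly as many partitions as the topic does, so tuning n on the topic directly tunes the job's parallelism.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",  // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "ui-feeder"               // placeholder
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  // With the direct model this equals the topic's partition count (n).
  println(s"RDD partitions this batch: ${rdd.getNumPartitions}")
}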
