How to run multiple spark streaming batch jobs at the same time? - apache-spark

I have been using Spark Streaming to process data in Spark 2.1.0.
9 receivers receive data at 10-second intervals.
The average processing time has been about 10 seconds since I submitted the streaming application. However, queued batches are delayed by more than one day.
Is the queue in the driver, or in each receiver executor?
Among the active batches, only one batch is actually processing data, apart from the 9 receiver jobs. So there are always only 10 batches running.
I am asking how to increase the number of active batches processing data.
Also, only one streaming batch job runs at a time. I set spark.scheduler.mode to FAIR in SparkConf and set the scheduler pool to a fair pool, but batch jobs still run only one at a time.
According to the Spark job scheduling guide, jobs within the same fair pool run in FIFO order. Is this right?
How to run multiple spark streaming batch jobs at the same time?
Spark Streaming runs in spark-on-YARN client mode.
8-node cluster; each node: 32 cores, 128 GB
executor_memory: 6g
executor_cores: 4
driver-memory: 4g
sparkConf.set("spark.scheduler.mode", "FAIR")
ssc.sparkContext.setLocalProperty("spark.scheduler.pool", "production")
"production" is a FAIR pool
sparkConf.set("spark.dynamicAllocation.enabled", "false")
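For reference, the pieces above can be combined in one place. This is only a minimal sketch, not the asker's full setup: spark.streaming.concurrentJobs is an undocumented Spark setting that lets jobs from more than one batch run at once, and the fairscheduler.xml path is hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only; assumes a fairscheduler.xml that defines a "production" pool.
val sparkConf = new SparkConf()
  .setAppName("streaming-app") // hypothetical app name
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // hypothetical path
  // Undocumented setting: allow jobs from up to 4 batches to run concurrently.
  .set("spark.streaming.concurrentJobs", "4")
  .set("spark.dynamicAllocation.enabled", "false")

val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.sparkContext.setLocalProperty("spark.scheduler.pool", "production")
```

Note that concurrentJobs relaxes the one-batch-at-a-time guarantee, so it is only safe when batches do not depend on being processed in order.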

Related

gracefully exit without running pending batches in spark Dstreams

I am trying Spark DStreams with Kafka.
I am not able to commit offsets for pending/lagging batches consumed by the Spark DStream. The issue happens after the streaming context is stopped using ssc.stop(true, true): the streaming context stops, but the Spark context is still running the pending batches.
Here are a few things I have done:
Create a DStream to get data from a Kafka topic. (Successful)
Commit offsets manually back to Kafka using
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
The batch interval is 60 seconds.
The batch time (time taken to perform operations on the incoming data) is 2 minutes.
streamingContext.stop(true, true)
Please tell me if there is a way to commit the offsets for the pending batches as well, or to exit gracefully after the currently running batch and discard the pending batches, so that the offsets for the pending batches are not committed.
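One way to approach this, sketched under the assumption that the spark-streaming-kafka-0-10 integration is in use (with `stream` and `ssc` as in the question): commit offsets only at the end of each batch's work, so a batch that never ran never gets its offsets committed and is simply re-consumed on the next start.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch: commit offsets per completed batch, not up front.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // Only reached if processing succeeded for this batch.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

// stopGracefully = true finishes the batches already received before stopping;
// with stopGracefully = false the pending batches are dropped, and because
// their offsets were never committed above, they are replayed on restart.
ssc.stop(stopSparkContext = true, stopGracefully = true)
```

The design choice here is that the commit marks "processed", not "received", which makes a non-graceful stop safe at the cost of at-least-once delivery.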

Running 16 processes on single jvm machine

I'm using a machine with 64 GB RAM and 24 cores, and have allocated 32 GB to the JVM. I want to run the following processes:
7 Kafka Brokers
3 instances of ZooKeeper
Elasticsearch
Cassandra
Spark
MongoDB
MySQL
Kafka Manager
Node.js
and run 4-5 Spark applications simultaneously, each on 5-6 executors with 1 GB each. The Spark jobs work as follows:
1) One Spark job takes data from Kafka and inserts it into Cassandra.
2) One Spark job takes data from another Kafka topic and inserts it into a different Cassandra table.
3) Two Spark jobs take data from Cassandra, do some processing/analysis, and write the results into their own Cassandra tables.
Sometimes my insertion application hangs. It takes around 500 records/second from Kafka. After running for some time, it starts queueing batches, and although there is no error, the processing time in the Spark dashboard increases gradually.
I used top to check CPU usage and found one process, "0QrmJB", taking 1500+% CPU, while Java takes 200%.
What might be the issue? I'm not able to analyze it. Is it OK to run this many processes on a single machine? Thanks.

Spark Kafka Streaming: pull more messages

I'm using Kafka 0.9 and Spark 1.6. The Spark Streaming application streams messages from Kafka through the direct stream API (version 2.10-1.6.0).
I have 3 workers with 8 GB of memory each. Every minute, about 4,000 messages arrive in Kafka, yet each Spark worker streams only 600 messages. I always see a lag between the Kafka offset and the Spark offset.
I have 5 Kafka partitions.
Is there a way to make Spark pull more messages from Kafka in each batch?
My streaming frequency is 2 seconds.
Spark configuration in the app:
"maxCoresForJob": 3,
"durationInMilis": 2000,
"auto.offset.reset": "largest",
"autocommit.enable": "true",
Would you please explain more? Did you check which piece of code takes the longest to execute? From Cloudera Manager -> YARN -> Applications -> select your application -> Application Master -> Streaming, then select one batch and click it. Try to find out which task takes the longest to execute. How many executors are you using? For 5 partitions, it is better to have 5 executors.
You can also post your transformation logic; there may be some way to tune it.
Thanks
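On the parallelism point above, a hedged sketch (the broker address and topic name are placeholders, and `ssc` is assumed from context): with the Spark 1.6 / Kafka 0.9 direct API, each batch already pulls every available offset unless a rate cap is set, and the number of Kafka partitions fixes the number of read tasks, so a growing lag usually points at slow processing rather than a small pull.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch for a Spark 1.6 / Kafka 0.9 direct stream (versions from the question).
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092", // hypothetical broker list
  "auto.offset.reset"    -> "largest")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic")) // hypothetical topic name

// With 5 partitions the read stage has only 5 tasks; repartitioning afterwards
// lets more than 5 cores share the per-record work downstream.
stream.repartition(15).foreachRDD { rdd =>
  // ... transformation logic here ...
}
```

Repartitioning adds a shuffle, so it only pays off when the per-record processing, not the network read, is the bottleneck.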

Spark Streaming with Kafka: when recovering from checkpointing, all data are processed in only one micro-batch

I'm running a Spark Streaming application that reads data from Kafka.
I have activated checkpointing to recover the job in case of failure.
The problem is that if the application fails, when it restarts it tries to process all the data since the point of failure in a single micro-batch.
This means that if a micro-batch usually receives 10,000 events from Kafka, and the job fails and restarts after 10 minutes, it has to process one micro-batch of 100,000 events.
Now, if I want recovery from checkpointing to succeed, I have to assign much more memory than I normally would.
Is it normal that, when restarting, Spark Streaming tries to execute all the past events since the checkpoint at once, or am I doing something wrong?
Many thanks.
If your application finds it difficult to process all the events in one micro-batch after recovering from a failure, you can set the spark.streaming.kafka.maxRatePerPartition configuration in your Spark conf, either in spark-defaults.conf or inside your application.
That is, if you believe your system/app can safely handle 10K events per second, and your Kafka topic has 2 partitions, add this line to spark-defaults.conf:
spark.streaming.kafka.maxRatePerPartition 5000
or add it inside your code :
val conf = new SparkConf()
conf.set("spark.streaming.kafka.maxRatePerPartition", "5000")
Additionally, I suggest you set this number a little higher and enable backpressure. Backpressure tries to stream data at a rate that doesn't destabilize your streaming app.
conf.set("spark.streaming.backpressure.enabled","true")
Update: note that the configuration is in events per second per partition, not per minute.
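As a quick sanity check on the numbers above, the cap admits at most maxRatePerPartition × partitions × batch-interval-seconds events per micro-batch (the batch interval here is assumed for illustration):

```scala
// Upper bound on events admitted into one micro-batch under the rate cap.
val maxRatePerPartition = 5000 // events per partition per second
val partitions = 2
val batchIntervalSeconds = 10  // hypothetical batch interval
val maxEventsPerBatch = maxRatePerPartition * partitions * batchIntervalSeconds
println(maxEventsPerBatch) // 100000
```

So even right after a long outage, a recovered batch is bounded instead of swallowing the whole backlog at once.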

What's the difference between "Job Duration" and "Batch Processing Time" in Spark Streaming?

The job duration of my application in Spark UI is as below:
[screenshot: Job Duration in Spark UI]
And here is the batch processing time in Spark UI:
[screenshot: Batch Processing Time in Spark UI]
Note that the batch processing time is generally longer than the job duration. So what is the difference between them?
A single batch can contain several jobs; Spark processes all the jobs in a batch together to utilize CPU time better. That is why the batch processing time is greater than an individual job's duration.
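To make the distinction concrete, here is a minimal sketch (not from the question; `dstream` and the output path are placeholders): every output operation on a DStream produces its own Spark job per batch, so one batch interval can cover several jobs.

```scala
// Sketch: two output operations => two jobs per batch in the Spark UI.
dstream.foreachRDD { rdd =>
  println(rdd.count())           // job 1 for this batch
}
dstream.foreachRDD { rdd =>
  rdd.saveAsTextFile("/tmp/out") // job 2 for this batch (hypothetical path)
}
// "Batch Processing Time" spans both jobs plus scheduling overhead;
// each "Job Duration" covers only one of them.
```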
