Spark Kafka Streaming pull more messages - apache-spark

I'm using Kafka 0.9 and Spark 1.6. My Spark Streaming application streams messages from Kafka through the direct stream API (version 2.10-1.6.0).
I have 3 workers with 8 GB of memory each. About 4,000 messages arrive in Kafka every minute, but each Spark worker streams only around 600 messages, so I always see a lag between the Kafka offset and the Spark offset.
I have 5 Kafka partitions.
Is there a way to make Spark stream more messages for each pull from Kafka?
My streaming batch interval is 2 seconds.
Spark configuration in the app:
"maxCoresForJob": 3,
"durationInMilis": 2000,
"auto.offset.reset": "largest",
"autocommit.enable": "true",

Could you explain a bit more? Did you check which piece of code is taking the longest to execute? From Cloudera Manager -> YARN -> Applications -> select your application -> Application Master -> Streaming, then select one batch and click it. Try to find out which task takes the longest to execute. How many executors are you using? For 5 partitions, it is better to have 5 executors.
You can also post your transformation logic; there may be some way to tune it.
Thanks
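As a rough illustration of the one-executor-per-partition suggestion, the resources could be pinned through SparkConf (the values below are examples only, not the asker's actual settings):

import org.apache.spark.SparkConf

// Example only: 5 executors for 5 Kafka partitions, so every partition gets its own task slot.
// spark.executor.instances is honoured on YARN when dynamic allocation is disabled.
val conf = new SparkConf()
  .set("spark.executor.instances", "5")
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "2g")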

Related

large differences in Spark Structured Streaming task duration

I have a Spark Structured Streaming job reading from Kafka whose task durations vary greatly.
I don't know why this is the case, since the topic partitions are not skewed and I am using maxOffsetsPerTrigger on the readStream to cap the batch size, so I would expect each executor to get roughly the same amount of data.
Yet it is common for a stage to have a minimum task duration of 0.8 s and a maximum of 12 s. In the Spark UI, under Event Timeline, the green Executor Computing Time bars show the variation.
Details of the job:
runs on Spark on Kubernetes
uses PySpark via a Jupyter Notebook
reads from a Kafka topic with n partitions
creates n executors to match the topic partition count
sets maxOffsetsPerTrigger on the readStream
has enough memory and CPU
uses a noop output sink to isolate where the lag is happening; normally this would be a Kafka sink
How can I even out the task durations?
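For reference, a stripped-down Scala sketch of this setup (I actually use PySpark, but the options are the same; broker, topic, trigger cap, and checkpoint path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("even-tasks-sketch").getOrCreate()

// maxOffsetsPerTrigger caps the total records per micro-batch; Spark spreads that cap
// across the topic's partitions, so unskewed partitions should yield similar task sizes.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my-topic")
  .option("maxOffsetsPerTrigger", 100000L)
  .load()

// noop sink: discards the output, so any remaining variation comes from reading and
// processing rather than from the sink.
val query = df.writeStream
  .format("noop")
  .option("checkpointLocation", "/tmp/checkpoints/even-tasks-sketch")
  .start()

query.awaitTermination()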

Spark Structured Streaming Print Offsets Per Batch Per Executor

I have a simple job (20 executors, 8 GB memory each) that reads from Kafka (50 partitions), checkpoints to HDFS, and posts data to an HTTP endpoint (1,000 events per second). I recently started to see some straggling executors that take far longer than the others. As part of the investigation I am trying to rule out data skew; is there a way to print partition:offsets per executor? Or is there any other way to track why an executor may be straggling?
I know I can implement StreamingQueryListener, but that will only give me partition:offsets per batch and won't tell me which executor is processing a specific partition.
You can have them printed if you use a foreach sink in Structured Streaming. The ForeachWriter's open method receives the partition and epoch IDs and is executed on each executor, so you can log those details there.
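A minimal sketch of that idea, assuming a ForeachWriter that just logs which host handles which partition and epoch (logging setup and the HTTP call are left out):

import java.net.InetAddress
import org.apache.spark.sql.{ForeachWriter, Row}

class OffsetLoggingWriter extends ForeachWriter[Row] {
  // open() runs once per partition per epoch, on the executor that processes it.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    val host = InetAddress.getLocalHost.getHostName
    println(s"host=$host partitionId=$partitionId epochId=$epochId")
    true // true means this partition should be processed
  }

  override def process(value: Row): Unit = {
    // post the row to the HTTP endpoint here
  }

  override def close(errorOrNull: Throwable): Unit = ()
}

// Usage: df.writeStream.foreach(new OffsetLoggingWriter).start()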

How to run multiple spark streaming batch jobs at the same time?

I have been using Spark Streaming to process data on Spark 2.1.0.
9 receivers receive data at 10-second intervals.
The average processing time has been about 10 seconds since I submitted the streaming application, yet the queued batches are delayed by more than one day.
Is the queue in the driver, or in each receiver executor?
Also, on the Active Batches page, only one batch actually processes data apart from the 9 receiver jobs, so there are always only 10 jobs running.
I am asking how to increase the number of batches actively processing data.
There is only one streaming batch job at a time. I set spark.scheduler.mode to FAIR in SparkConf and set the scheduler pool to a fair pool, but batch jobs still run only one at a time.
According to the Spark job scheduling guide, jobs within the same fair pool run in FIFO order. Is this right?
How to run multiple spark streaming batch jobs at the same time?
Spark Streaming runs in yarn-client mode.
8-node cluster; each node has 32 cores and 128 GB of memory.
executor_memory: 6g
executor_cores: 4
driver-memory: 4g
sparkConf.set("spark.scheduler.mode", "FAIR")
ssc.sparkContext.setLocalProperty("spark.scheduler.pool", "production")
"production" is a FAIR pool
sparkConf.set("spark.dynamicAllocation.enabled", "false")

Spark streaming with Kafka: when recovering from checkpointing, all data is processed in only one micro batch

I'm running a Spark Streaming application that reads data from Kafka.
I have activated checkpointing to recover the job in case of failure.
The problem is that if the application fails, when it restarts it tries to process all the data from the point of failure in a single micro-batch.
This means that if a micro-batch usually receives 10,000 events from Kafka, and the application fails and restarts after 10 minutes, it has to process one micro-batch of 100,000 events.
So if I want recovery from checkpointing to succeed, I have to assign much more memory than I normally would.
Is it normal that, when restarting, Spark Streaming tries to process all the past events since the checkpoint at once, or am I doing something wrong?
Many thanks.
If your application finds it difficult to process all events in one micro-batch after recovering from a failure, you can set the spark.streaming.kafka.maxRatePerPartition configuration, either in spark-defaults.conf or inside your application.
i.e. if you believe your system/app can safely handle 10K events per second, and your Kafka topic has 2 partitions, add this line to spark-defaults.conf:
spark.streaming.kafka.maxRatePerPartition 5000
or add it inside your code:
val conf = new SparkConf()
conf.set("spark.streaming.kafka.maxRatePerPartition", "5000")
Additionally, I suggest you set this number a little higher and enable backpressure. This will try to stream data at a rate that doesn't destabilize your streaming app.
conf.set("spark.streaming.backpressure.enabled","true")
Update: to be clear, this configuration is a number of records per second per partition, not per minute.
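Putting both settings together, a short sketch of the effective cap (the 10-second batch interval is just an example, not taken from the question):

import org.apache.spark.SparkConf

// With maxRatePerPartition = 5000 records/sec, 2 partitions and, say, a 10-second batch
// interval, a single micro-batch is capped at 5000 * 2 * 10 = 100,000 records,
// including the first batch after recovering from a checkpoint.
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "5000")
  .set("spark.streaming.backpressure.enabled", "true")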

New directStream API reads topic's partitions sequentially. Why?

I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
I have a Kafka topic with 8 partitions.
I am running the streaming job on YARN with 8 executors with 1 core each (--num-executors 8 --executor-cores 1).
I noticed that Spark reads all the topic's partitions in one executor, sequentially, which is obviously not what I want.
I want Spark to read all partitions in parallel.
How can I achieve that?
Thank you in advance.
An initial communication with Kafka occurs at job creation, solely to set the offsets of the KafkaRDD; more specifically, the offsets for each KafkaRDD partition that makes up the KafkaRDD across the cluster.
Those offsets are then used to fetch data on each executor once the job is actually executed. Depending on what you observed, you may simply have seen that initial communication (from the driver). If all your tasks are really executing on the same executor, then something else is going wrong beyond just the Kafka usage.
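To check the parallelism yourself, here is a minimal sketch based on the offset-range pattern from the Kafka integration docs (broker and topic are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setAppName("direct-parallel-sketch"), Seconds(10))

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("my-topic"))

stream.foreachRDD { rdd =>
  // Offset ranges are fixed on the driver when the KafkaRDD is created.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // Runs on the executors: one task per Kafka partition, scheduled in parallel.
    val range = ranges(TaskContext.get.partitionId)
    println(s"partition=${range.partition} offsets=${range.fromOffset}..${range.untilOffset} records=${iter.size}")
  }
}

ssc.start()
ssc.awaitTermination()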
