repartition in spark streaming is taking more time?

I am running a Spark streaming application where data comes in every 1 minute. The number of partitions I repartition to is 48. It is running on 12 executors with 4G of executor memory and executor-cores=4.
Below are the streaming batch processing times.
Here we can see that some of the batches take around 20 seconds, but some take around 45 seconds.
I further drilled down into one of the batches that takes less time. Below is the image,
and here is the one that takes more time.
Here we can see that more time is spent in the repartitioning task, whereas the batch above did not spend much time in repartitioning. This happens every 3-4 batches. The data comes from a Kafka stream and has only a value, no key.
Is there any reason related to the Spark configuration?

Try reducing "spark.sql.shuffle.partitions"; the default value is 200, which is overkill here. Reduce the value and analyse the performance.
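A minimal sketch of how this could be set, assuming a SparkSession-based application (the value 48 is only illustrative, mirroring the repartition count in the question):

import org.apache.spark.sql.SparkSession

// Lower the shuffle parallelism so shuffle stages create fewer, larger tasks.
val spark = SparkSession.builder()
  .appName("streaming-app")                        // illustrative name
  .config("spark.sql.shuffle.partitions", "48")    // default is 200
  .getOrCreate()

// The setting can also be adjusted at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "48")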

Related

How long should spark take to read 500GB of data? Suggestions for perf improvement?

It currently takes each task about 2m to read 500MB of data. Is this typical? More details below:
There are 30 executors each with 5 cores. The entire stage takes about 20 minutes overall (or 46h aggregate, or about 1.5h per executor) to read 500-600GB of data into memory. There is 12GB memory allocated per executor and it splits the job into about 1500 tasks. We filter the data down somewhat, but I assume that does not affect the read time since it needs to pull the entire dataset into memory.
Does this amount of time sound right? Are there any easy (or less-easy) improvements I can make to decrease this time (other than just linearly scaling the job)? 20 minutes is a long time and we'd like to cut it down if possible. It is a batch job, by the way.
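As a rough sanity check on the numbers above, here is the arithmetic spelled out (figures are taken from the question; the snippet only illustrates the calculation):

// 30 executors x 5 cores = 150 concurrent tasks
val cores        = 30 * 5
val tasks        = 1500
val minPerTask   = 2.0
val aggregateMin = tasks * minPerTask            // 3000 task-minutes, i.e. ~50 h, close to the reported 46 h aggregate
val wallClockMin = aggregateMin / cores          // = 20 min, matching the observed stage time
val mbPerSec     = 500.0 / (minPerTask * 60)     // about 4.2 MB/s read per task, which is quite slow
println(s"wall clock ~ $wallClockMin min, per-task read ~ $mbPerSec MB/s")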

Spark-Streaming Application Optimisation using Repartition

I am trying to optimize my Spark Streaming application, and I am able to optimize it by repartitioning. However, I am not able to understand how exactly repartitioning works here and optimizes the streaming process.
Can anyone help me understand the scenario below?
I have created 2 Kafka topics, let's say SrcTopic and DestTopic, each with 6 partitions. While processing the data from SrcTopic to DestTopic in my streaming application, I have a batch interval of 5 minutes and keep maxOffsetsPerTrigger at 10000, so the streaming application processes data every 5 minutes, takes at most 10K records in a batch, and produces to DestTopic. This processing works as expected and takes on average 250-300 seconds to process one complete batch (consume from SrcTopic and produce to DestTopic).
Now I have updated my Spark Streaming job, deleted the checkpoints, and am again processing data for the same source and destination (all the topic configurations are exactly the same, using the same topics mentioned above). The only change I made is that before writing the data to DestTopic I repartition my DataFrame (df.repartition(6)) and then sink it into the Kafka topic; a sketch of this pattern is shown after the questions below. For this run I also use a batch interval of 5 minutes and keep maxOffsetsPerTrigger at 10000, so the streaming application processes data every 5 minutes, takes at most 10K records in a batch, and produces to DestTopic. This processing also works as expected, but takes on average only 25-30 seconds to process one complete batch (consume from SrcTopic and produce to DestTopic).
Now my doubt is:
For the first and second runs the number of partitions is exactly the same.
Both runs have 6 partitions in SrcTopic and DestTopic.
I checked the record count of each partition (0,1,2,3,4,5); it is the same in both cases (partition and repartition).
Both applications are executed with exactly the same configuration.
What extra work is repartition doing here, such that it takes 10 times less time compared to the normal partitioning?
Can you help me understand the process?
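For reference, a minimal sketch of the pattern described above, assuming Spark Structured Streaming with the Kafka source and sink (the broker address and checkpoint path are placeholders; topic names and maxOffsetsPerTrigger are taken from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("src-to-dest").getOrCreate()

// Read at most 10K records per 5-minute trigger from SrcTopic.
val src = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "SrcTopic")
  .option("maxOffsetsPerTrigger", "10000")
  .load()

// The only difference between the two runs: with .repartition(6) the rows are
// shuffled evenly across 6 tasks before the Kafka sink writes them out.
src.repartition(6)
  .selectExpr("key", "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("topic", "DestTopic")
  .option("checkpointLocation", "/tmp/checkpoints")   // placeholder
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()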

Spark Job Internals

I tried looking through the various posts but did not get an answer. Let's say my Spark job has 1000 input partitions but I only have 8 executor cores. The job has 2 stages. Can someone help me understand exactly how Spark processes this? If you can help answer the questions below, I'd really appreciate it.
As there are only 8 executor cores, will Spark process Stage 1 of my job 8 partitions at a time?
If the above is true, after the first set of 8 partitions is processed, where is that data stored while Spark is running the second set of 8 partitions?
If I don't have any wide transformations, will this cause a spill to disk?
For a Spark job, what is the optimal file size? I mean, is Spark better at processing 1 MB files with 1000 Spark partitions, or say 10 MB files with 100 Spark partitions?
Sorry if these questions are vague. This is not a real use case, but as I am learning about Spark I am trying to understand the internal details of how the different partitions get processed.
Thank You!
Spark will run all tasks for the first stage before starting the second. This does not mean that it will start 8 partitions, wait for them all to complete, and then start another 8 partitions. Instead, it means that each time an executor finishes a partition, it will start another partition from the first stage until all partitions from the first stage have been started; Spark will then wait until all tasks in the first stage are complete before starting the second stage.
The data is stored in memory, or if not enough memory is available, spilled to disk on the executor. Whether a spill happens will depend on exactly how much memory is available and how much intermediate data results.
The optimal file size varies and is best measured, but some key factors to consider are:
The total number of files limits total parallelism, so it should be greater than the number of cores.
The amount of memory used to process a partition should be less than the amount available to the executor (~4GB for AWS Glue).
There is overhead per file read, so you don't want too many small files.
I would be inclined towards 10MB files or larger if you only have 8 cores.
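To illustrate the partition-sizing side of this, a small sketch of checking and influencing input partitioning (the path and the target sizes are only illustrative assumptions; spark.sql.files.maxPartitionBytes is the setting that packs small files into larger input partitions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sizing").getOrCreate()

// Pack small files together into larger input partitions (default is 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)

val df = spark.read.parquet("/data/input")           // illustrative path
println(s"input partitions: ${df.rdd.getNumPartitions}")

// If there are far more partitions than cores and each one is tiny,
// coalesce reduces per-task overhead without a full shuffle.
val sized = df.coalesce(8 * 4)                       // e.g. a few partitions per core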

Processing time for my Spark program does not decrease on increasing the number of nodes in the cluster

I have a Cloudera cluster with 3 nodes on which Apache Spark is installed. I am running a Spark program which reads data from HBase tables, transforms the data, and stores it in a different HBase table. With 3 nodes, the time taken is approximately 1 minute 10 seconds for 5 million rows of HBase data. On decreasing or increasing the number of nodes, the time taken stayed roughly the same, whereas it was expected to decrease when increasing the number of nodes and increase when decreasing them. Below is the time taken:
1) With 3 nodes: Approximately 1 minute 10 seconds for 5 million rows.
2) With 1 node: Approximately 1 minute 10 seconds for 5 million rows.
3) With 6 nodes: Approximately 1 minute 10 seconds for 5 million rows.
What can be the reason for the time staying the same despite increasing or decreasing the number of nodes?
Thank You.
By default, HBase will probably read the 5 million rows from a single region, or maybe 2 regions (the degree of parallelism). The write will also occur to a single region, or maybe 2, depending on the scale of the data.
Is Spark your bottleneck? Allocating more or fewer resources (cores or memory) will only change the overall job time if the computation in the job is the bottleneck.
If your computation (the transform) is relatively simple, the bottleneck might be reading from or writing to HBase. In that case, irrespective of how many nodes/cores you give it, the run time will be constant.
From the runtimes you have mentioned, it seems that's the issue.
The bottleneck may be on the HBase side, the Spark side, or both. On the HBase side, check the number of regions/region servers for your tables; this determines the read and write parallelism of the data, and more is usually better. You must also watch out for the hotspotting issue.
The Spark-side parallelism can be checked via the number of partitions of your RDD; maybe you should repartition your data. In addition, cluster resource utilization may be your problem. To check this you can monitor the Spark master web interface: the number of nodes, the number of workers per node, the number of jobs and tasks per worker, and so on. You should also check the CPU count and the amount of RAM used per worker in this interface.
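As a rough sketch of the Spark-side check (hbaseRdd below is a stand-in for whatever RDD the HBase read returns, and the partition counts are only illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallelism-check").getOrCreate()
val sc = spark.sparkContext

// Stand-in for the RDD read from HBase; in the real job this would come from
// something like sc.newAPIHadoopRDD with the HBase TableInputFormat.
val hbaseRdd = sc.parallelize(1 to 5000000, numSlices = 2)

println(s"read parallelism: ${hbaseRdd.getNumPartitions}")   // e.g. 2, i.e. only 2 regions

// Spread the transform work across more tasks before the CPU-bound part.
val transformed = hbaseRdd.repartition(24).map(_ * 2)        // 24 is illustrative
println(s"transform parallelism: ${transformed.getNumPartitions}")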

Spark streaming : batch processing time slowly increase

I use Spark with the Cassandra Spark connector and direct Kafka streams.
And I see the batch processing time increasing slowly over time,
even when there is nothing incoming from Kafka to process.
I think it is a few milliseconds per batch, but after a long time a batch can take several seconds more, until it reaches the batch interval and finally crashes.
At first I thought it was a memory leak, but in that case I think the processing time would grow exponentially rather than linearly.
I don't really know whether it is the stages that become longer and longer,
or the latency between stages that increases.
I use Spark 1.4.0.
Any pointers about this?
EDIT:
A closer look at the evolution of the processing time of each batch, comparing it with the total job processing time:
It appears that even though the batch processing time increases, the job processing times are not increasing.
Example: for a batch that takes 7s, the sum of the job processing times is 1.5s (as shown in the image below).
Is it because the computing time on the driver side increases, and not the computing time on the executor side?
And is this driver computing time not shown in the job processing UI?
If that's the case, how can I correct it?
I finally found the solution to my problem.
I had this code in the function that adds filters and transformations to my RDD:
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
Because this is called at each batch, the same converter objects end up registered many times inside TypeConverter.
I don't really know exactly how the Cassandra Spark converter works, but it looks like it uses reflection internally on these objects.
Doing that slow reflection over a converter list that grows with every batch makes the batch processing time keep increasing.
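A minimal sketch of the fix, assuming the converters only need to be registered once per JVM rather than once per batch (SomethingToOptionConverter, OptionToSomethingConverter and EventCC are the classes from the code above; the import assumes the connector's TypeConverter):

import com.datastax.spark.connector.types.TypeConverter

// Register the custom converters once, at application startup,
// instead of inside the per-batch filter/transform function.
object ConverterSetup {
  @volatile private var registered = false

  def ensureRegistered(): Unit = synchronized {
    if (!registered) {
      TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
      TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
      registered = true
    }
  }
}

// Call ConverterSetup.ensureRegistered() once before starting the StreamingContext,
// not inside foreachRDD or the per-batch transformations.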
