I am trying to optimize my Spark Streaming application, and I was able to improve it with repartition. However, I don't understand how exactly repartition works here and optimizes the streaming process.
Can anyone help me understand the scenario below?
I have created 2 Kafka topics, say SrcTopic and DestTopic, each with 6 partitions. While processing the data from SrcTopic to DestTopic in my streaming application, I use a batch interval of 5 minutes and set maxOffsetsPerTrigger to 10000, so the streaming application processes data every 5 minutes, takes at most 10K records in a batch, and produces them to DestTopic. This processing works as expected and takes 250-300 seconds on average to process one complete batch (consume from SrcTopic and produce to DestTopic).
Now I have updated my Spark Streaming job, deleted the checkpoints, and am processing data again for the same source and destination (all topic configurations are exactly the same; I am using the same topics mentioned above). The only change is that before writing the data to DestTopic I repartition my DataFrame (df.repartition(6)) and then sink it to the Kafka topic. For this run I also use a batch interval of 5 minutes and maxOffsetsPerTrigger of 10000, so the streaming application processes data every 5 minutes, takes at most 10K records per batch, and produces them to DestTopic. This processing also works as expected, but takes only 25-30 seconds on average to process one complete batch (consume from SrcTopic and produce to DestTopic).
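For reference, here is a minimal PySpark sketch of the second variant (the broker address, checkpoint path, and column handling are assumptions based on the description; the only difference from the first run is the repartition(6) before the sink):

# Read from SrcTopic, capped at 10K records per trigger
src = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "SrcTopic")
       .option("maxOffsetsPerTrigger", 10000)
       .load())

# Repartition to 6 before producing to DestTopic
query = (src.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
         .repartition(6)
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "DestTopic")
         .option("checkpointLocation", "/tmp/checkpoints/dest")  # placeholder path
         .trigger(processingTime="5 minutes")
         .start())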
Now my doubt is:
For the first and second runs, the number of partitions is exactly the same.
Both runs have 6 partitions in SrcTopic and DestTopic.
I checked the record count of each partition (0, 1, 2, 3, 4, 5); it is the same in both cases (with and without repartition).
Both applications run with exactly the same configuration.
What extra work is repartition doing here, so that it takes 10 times less time than the run without repartition?
Can you help me understand the process?
Related
I have a dataset of 8 billion records stored in parquet files in Azure Data Lake Gen 2.
I wanted to separate out a sample dataset of 2 billion records into a different location for some benchmarking needs, so I did the following:
# Read the full 8-billion-row parquet dataset
df = (spark.read.option('inferSchema', 'true').format('parquet')
      .option('badRecordsPath', f'/tmp/badRecords/').load(read_path))

# Take the first 2 billion rows and write them out as parquet
(df.limit(2000000000).write.option('badRecordsPath', f'/tmp/badRecords/')
   .format('parquet').save(f'{write_path}/advertiser/2B_parquet'))
This job is running on 8 nodes of 8-core, 28 GB RAM machines [8 worker nodes + 1 master node]. It has been running for over an hour without a single file being written yet. The load did finish within 2 seconds, so I know the limit + write action is what's causing the bottleneck [although load only infers the schema and builds a list of files; it does not actually read the data].
So I started inspecting the Spark UI for clues, and here are my observations:
Spark created 2 jobs.
The first job took 35 minutes. Here's the DAG
The second job has been running for about an hour now with no progress at all, and it has two stages in it.
If you notice, stage 3 has one running task, but if I open the stages panel, I can't see any details of the task. I also don't understand why it's trying to do a shuffle when all I have is a limit on my DF. Does limit really need a shuffle? Even if it's shuffling, it seems like 1hr is awfully long to shuffle data around.
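As a side note, one quick way to see how the limit is planned (and whether it introduces an exchange) is to print the physical plan; this is only a sketch, and the exact operator names vary by Spark version:

# How many partitions does the scan produce?
print(df.rdd.getNumPartitions())

# The plan shows whether the limit becomes CollectLimit or
# GlobalLimit + Exchange SinglePartition + LocalLimit (the latter funnels
# all surviving rows through a single task before the write).
df.limit(2000000000).explain()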
Also, if that shuffle is what's really performing the limit, what did the first job actually do? Just read the data? 35 minutes for that also seems too long, but for now I'd settle for the job simply completing.
Stage 4, which I believe is the actual writing stage, is just stuck, presumably waiting for this shuffle to end.
I am new to Spark and I'm kind of clueless about what's happening here. Any insights into what I'm doing wrong would be very useful.
I am new to big data and Spark. I have to work with real-time data and historical data from the past 2 years. There are around a million rows for each day. I am using PySpark and Databricks. The data is partitioned on created date. I have to perform some transformations and load it into a database.
For the real-time data, I will be using Spark Streaming (readStream to read, perform the transformations, and then writeStream).
How do I work with the data from the past 2 years? I tried filtering 30 days of data and got good throughput. Should I run the process on all 2 years of data at once, or should I do it in batches? If I process in batches, does Spark provide a way to batch it, or do I do it in Python? Also, do I run these batches in parallel or in sequence?
It is kind of open-ended, but let me try to address your concerns.
How do I work with the data from the past 2 years? I tried filtering 30 days of data and got good throughput. Should I run the process on all 2 years of data at once, or should I do it in batches?
Since you are new to Spark, do it in batches: start by running 1 day at a time, then 1 week, and so on. Get your program to run successfully, then optimize. As you increase the batch size you can increase your cluster size, using PySpark DataFrames (not pandas). Once your job is verified and efficient, you can run monthly, bi-monthly, or larger batches (smaller jobs are better in your case).
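For example, a day-at-a-time backfill could look roughly like this (the created_date column, read_path, and process_and_load are placeholders for your actual names and logic):

from datetime import date, timedelta
from pyspark.sql import functions as F

# Process one day per batch, sequentially; widen the range once this is verified
day = date(2021, 1, 1)
while day <= date(2021, 1, 7):
    daily_df = (spark.read.format("parquet").load(read_path)
                .where(F.col("created_date") == day.isoformat()))
    process_and_load(daily_df)   # your transformations + database write
    day += timedelta(days=1)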
If I process in batches, does Spark provide a way to batch it, or do I do it in Python? Also, do I run these batches in parallel or in sequence?
You can pass the date range as parameters to your Databricks job and use Databricks to schedule your jobs to run back to back. Sure, you can run them in parallel on different clusters, but the whole idea with Spark is to use Spark's distributed capability and run your job on as many worker nodes as it requires. Again, get one small job to work and validate your results, then validate a larger set, and so on. If you feel confident, start a large cluster (many, fat workers) and run a large date range.
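A rough sketch of passing the date range as job parameters in Databricks (the widget names, column name, and JDBC details are assumptions):

from pyspark.sql import functions as F

# Parameters supplied by the Databricks job definition
start_date = dbutils.widgets.get("start_date")   # e.g. "2020-01-01"
end_date   = dbutils.widgets.get("end_date")     # e.g. "2020-01-31"

batch_df = (spark.read.format("parquet").load(read_path)
            .where(F.col("created_date").between(start_date, end_date)))

(apply_transformations(batch_df).write           # apply_transformations() is your existing logic
    .format("jdbc")                              # or whatever writer your database needs
    .option("url", jdbc_url)
    .option("dbtable", "target_table")
    .mode("append")
    .save())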
It is not an easy task for a newbie, but it should be a lot of fun. Best wishes.
I am running a Spark Streaming application where data comes in every 1 minute. The number of repartitions I am doing is 48. It is running on 12 executors with 4 GB executor memory and executor-cores=4.
Below are the streaming batch processing times.
Here we can see that some of the batches take around 20 seconds, but some take around 45 seconds.
I further drilled down into one of the batches that takes less time. Below is the image.
And here is the one that takes more time.
Here we can see that more time is spent in the repartitioning task, whereas the batch above did not spend much time on repartitioning. This happens every 3-4 batches. The data is coming from a Kafka stream and has only a value, no key.
Is there any reason related to the Spark configuration?
Try reducing "spark.sql.shuffle.partitions" size, the default value is 200 which is an overkill. Reduce the values and analyse the performance.
I'm trying to run some tests regarding processing times for a Spark Streaming application, in local mode on my 4-core machine.
Here is my code:
SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("sparkstreaminggetjson");
// 1-second batch interval
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
// Receiver-based socket stream (one JSON message arrives per second)
JavaReceiverInputDStream<String> streamData1 = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
    StorageLevels.MEMORY_AND_DISK_SER);
streamData1.print();
ssc.start();
ssc.awaitTermination();
I am receiving 1 JSON message per second.
So, I test this for 4 different scenarios:
1) setMaster(...local[2]) and 1 partition
2) setMaster(...local[*]) and 1 partition
3) setMaster(...local[2]) and 4 partitions (using streamData1.repartition(4))
4) setMaster(...local[*]) and 4 partitions (using streamData1.repartition(4))
When I check the average processing times in the UI, this is what I get for each scenario:
1) 30 ms
2) 28 ms
3) 72 ms
4) 75 ms
My question is: why are the processing times pretty much the same for 1 and 2, and for 3 and 4?
I realize that the increase from 2 to 4, for example, is normal, because repartition is a shuffle operation. What I don't get is, for example in 4), why the processing time is so similar to 3). Shouldn't it be much smaller, since I am increasing the level of parallelization and I have more cores to distribute the tasks to?
I hope that wasn't confusing.
Thank you so much in advance.
Some of this depends on what your JSON message looks like; I'll assume each message is a single string without line breaks. In that case, with 1 message per second and a batch interval of 1 second, each batch will give you an RDD with just a single item. You can't split that up into multiple partitions, so when you repartition you still have the same situation data-wise, but with the added overhead of the repartition step.
Even with larger amounts of data I would not expect much of a difference when all you do with the data is print() it: this takes the first 10 items of your data, and if they can come from just one partition, I would expect Spark to optimize the job to compute only that one partition. In any case, you will get more representative numbers if you significantly increase the amount of data per batch and do some actual processing on the whole set, at a minimum something like streamData1.count().print().
To get a better understanding of what happens, it is also useful to dig into the other parts of Spark's UI, like the Stages tab, which can tell you how much of the execution time goes to shuffling, serialization, etc. rather than actual execution, and into things that affect performance, such as DAGs that show which bits may be cached and which tasks Spark was able to skip.
I have a simple Spark Streaming WordCount application which reads data from a Kafka topic. In this application, checkpointing is enabled to calculate the accumulated word count. The batch interval is 1000 ms. The following picture shows a table (delay, execution time, total delay, events) of the micro-batches in this streaming application. What confuses me is that every 10 seconds there is a micro-batch which takes around 4 seconds to execute, much longer than the execution time of the other micro-batches. Why does this happen? My application is just a very simple word count program.
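For context, a minimal sketch of the kind of application described (Spark 2.x DStream API; the broker, topic name, and state update logic are assumptions):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils   # DStream Kafka connector (Spark 2.x)

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 1)                     # 1000 ms batch interval
ssc.checkpoint("/tmp/wordcount-checkpoint")       # required for stateful operations

stream = KafkaUtils.createDirectStream(ssc, ["words"],
                                       {"metadata.broker.list": "localhost:9092"})

def update(new_counts, running):
    return sum(new_counts) + (running or 0)

counts = (stream.flatMap(lambda kv: kv[1].split())
                .map(lambda w: (w, 1))
                .updateStateByKey(update))        # accumulated word count
counts.pprint()

ssc.start()
ssc.awaitTermination()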