I am looking into a Spark batch job that always takes almost an hour to process only a few files. There are at most 32 files, but even then most of the time is spent in the file-listing phase.
There are other batch jobs running as well and they run fine, so there is no disk space or other resource issue in the environment.
How can I improve this job? What approach should I use to handle it?
Related
I have a dataset of 8 billion records stored in parquet files in Azure Data Lake Gen 2.
I wanted to separate out a sample dataset of 2 billion records in a different location for some benchmarking needs, so I did the following:
df = spark.read.option('inferSchema', 'true').format('parquet').option('badRecordsPath', f'/tmp/badRecords/').load(read_path)
df.limit(2000000000).write.option('badRecordsPath', f'/tmp/badRecords/').format('parquet').save(f'{write_path}/advertiser/2B_parquet')
This job is running on 8 nodes of 8-core, 28 GB RAM machines [ 8 worker nodes + 1 master node ]. It's been running for over an hour with not a single file written yet. The load did finish within 2s, so I know the limit + write action is what's causing the bottleneck [ although load just infers the schema and creates a list of files, without actually reading the data ].
So I started inspecting the Spark UI for some clues and here are my observations
2 Jobs have been created by Spark
The first job took 35 mins. Here's the DAG
The second job has been running for about an hour now with no progress at all; it has two stages in it.
If you notice, stage 3 has one running task, but if I open the stages panel, I can't see any details of that task. I also don't understand why it's trying to do a shuffle when all I have is a limit on my DF. Does limit really need a shuffle? Even if it is shuffling, 1 hr seems awfully long to shuffle the data around.
Also, if this is what's really performing the limit, what did the first job do? Just read the data? 35 mins for that also seems too long, but for now I'd settle for the job just completing.
Stage 4 is just stuck; I believe it is the actual writing stage and is waiting for this shuffle to end.
I am new to Spark and I'm kinda clueless about what's happening here. Any insight into what I'm doing wrong would be very helpful.
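If an exact, deterministic 2B-row cut is not required for the benchmarking, one way to sidestep a global limit is a fraction-based sample, which is evaluated per partition and needs no shuffle. This is only a minimal sketch of that alternative, assuming ~8 billion input rows so that a fraction of about 0.25 yields roughly 2 billion rows (the fraction and seed are assumed values; the paths are the same placeholders as above):
df = spark.read.format('parquet').load(read_path)
# sample() is applied per partition, so no global shuffle is needed;
# the resulting row count is approximate rather than exactly 2,000,000,000
sample_df = df.sample(fraction=0.25, seed=42)
sample_df.write.format('parquet').save(f'{write_path}/advertiser/2B_parquet')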
Configuration:
Spark 3.0.1
Cluster: Databricks (driver c5x.2xlarge, 2 workers of the same type as the driver)
Source: S3
Format: Parquet
Size: 50 MB
File count: 2000 (too many small files, as they are dumped from a Kinesis stream with a 1-minute batch because we cannot tolerate more latency)
Problem statement: I have 10 jobs with a similar configuration, processing a similar volume of data to the above. When I run them individually, they take 5-6 mins each, including cluster spin-up time.
But when I run them together, they all seem to get stuck at the same point in the code and take 40-50 mins to complete.
When I check the Spark UI, I see that all the jobs spend 90% of their time taking the source count:
df = spark.read.parquet('s3a://....')
df.cache()
df.count()   # ----- problematic step
# ... more code logic
Now I know that taking the count without caching should be faster for parquet files, but the jobs were taking even more time when I didn't cache the dataframe before taking the count, probably because of the huge number of small files.
But what I fail to understand is why the jobs run so much faster when run one at a time.
Is S3 my bottleneck? They are all reading from the same bucket, but different paths.
Note: I'm using Privacera tokens for authentication.
They'll all be using the same s3a filesystem class instances on the worker nodes. There are some options there for the number of HTTP connections to keep open: fs.s3a.connection.maximum, default 48. If all the work is against the same bucket, set it to at least 2x the number of worker threads. Do the same for fs.s3a.max.total.tasks.
If you are using Hadoop 2.8+ binaries, switch the s3a client into random IO mode, which delivers the best performance when seeking around Parquet files: fs.s3a.experimental.fadvise = random.
Change #2 should deliver a speedup on single workloads too, so do it anyway; both changes are sketched below.
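A minimal sketch of wiring those options into the Spark session (the values shown are assumptions; Hadoop options take the spark.hadoop. prefix when set through the Spark config):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # ~2x the number of worker threads; 96 is an assumed value
         .config("spark.hadoop.fs.s3a.connection.maximum", "96")
         .config("spark.hadoop.fs.s3a.max.total.tasks", "96")
         # random IO mode for seeking around Parquet files (Hadoop 2.8+)
         .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
         .getOrCreate())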
Throttling would surface as 503 responses, which are handled inside the AWS SDK and don't get collected or reported. I'd recommend that, at least for debugging this, you turn on S3 bucket logging and scan the logs for 503 responses, which indicate that throttling is taking place. It's what I do. Tip: set up a rule to delete old logs to keep costs down; 1-2 weeks of logs is generally enough for me.
Finally, lots of small files are bad on HDFS and awful with object stores, as the time to list/open is so high. Try to make coalescing files step #1 in processing the data, for example as sketched below.
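A minimal compaction sketch, assuming the 1-minute Kinesis dumps can be periodically rewritten into a handful of larger files (both paths and the output file count are hypothetical):
# read the many small files once and rewrite them as a few larger ones
small_df = spark.read.parquet("s3a://bucket/landing/")        # hypothetical source prefix
(small_df
    .coalesce(8)                                              # 8 output files is an assumed value
    .write.mode("overwrite")
    .parquet("s3a://bucket/compacted/"))                      # hypothetical target prefix
# downstream jobs would then read the compacted path instead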
I have ORC data on HDFS (non-partitioned), ~8 billion rows, 250 GB in size.
I am reading the data into a DataFrame and writing it out without any transformations, using partitionBy.
ex:
df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
As I monitored the job status in the Spark UI, the job and its stage complete in 20 minutes, but the "SQL" tab in the Spark UI shows 40 minutes.
After running the job in debug mode and going through the Spark log, I realised the tasks writing to "_temporary" complete in 20 minutes.
After that, the merge of "_temporary" into the actual output path takes another 20 minutes.
So my question is: is the driver process merging the data from "_temporary" to the output path sequentially, or is it done by executor tasks?
Is there anything I can do to improve the performance?
You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which has been known to create a bottleneck. But frankly, people usually observe this problem only with a much larger number of files than in your case. Depending on the version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details.
On a separate note, having 8 cores per executor is not recommended, as it might saturate disk IO when all 8 tasks are writing output at once.
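A sketch of how both suggestions might look when building the session (whether the v2 committer is honoured depends on your Spark/Hadoop versions, per SPARK-20107, and the executor-core count is an assumed value):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # v2 moves task output to the destination at task commit time,
         # avoiding the sequential job-commit merge on the driver
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         # fewer concurrent writers per executor so disk IO isn't saturated
         .config("spark.executor.cores", "4")
         .getOrCreate())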
I am new to Apache Airflow. I have created an Airflow DAG in which a couple of image processing tasks run in parallel. Basically, I am reading a PDF (consisting of 10 pages), converting each page into an image in parallel, doing some image processing on each page, and dumping the output into JSON (after combining the output for every single page). Check the image below:
So far I have used the local executor (default configuration) running on an 8-core CPU (single machine). The process took around 40 mins to complete.
I have also tweaked the configuration by changing max_threads to 8, parallelism to 8, and dag_concurrency to 8. That brought it down to around 20 mins.
I expect the whole process to complete within 5-10 mins for the same number of pages. Is that possible with the current executor configuration?
Thanks a lot.
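For reference, a minimal sketch of the DAG-side knobs that pair with those settings, assuming Airflow 1.10-style imports and the LocalExecutor; the dag_id, task_ids, and process_page callable are hypothetical:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import path

def process_page(page_no):
    # per-page image processing would go here
    pass

dag = DAG(
    dag_id="pdf_image_processing",   # hypothetical
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    concurrency=8,                   # at most 8 tasks of this DAG run at once
    max_active_runs=1,
)

# one task per PDF page so the pages can be processed in parallel
page_tasks = [
    PythonOperator(
        task_id="process_page_{}".format(i),
        python_callable=process_page,
        op_kwargs={"page_no": i},
        dag=dag,
    )
    for i in range(1, 11)
]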
When my Spark program executes, it creates 1000 stages. However, I have seen that the recommended number is only 200. I have two actions at the end that write data to S3, and after that I unpersist the dataframes. Now, when my Spark program writes the data to S3, it still runs for almost 30 mins more. Why is that? Is it due to the large number of dataframes I have persisted?
P.S. I am running the program for only 5 input records.
Probably the cluster takes a longer time to append data to an existing dataset. In particular, when all the Spark jobs have finished but your command has not, it is because the driver node is moving the output files of the tasks from the job's temporary directory to the final destination one by one, which is slow with cloud storage. Try setting the configuration mapreduce.fileoutputcommitter.algorithm.version to 2.