Spark limit + write is too slow - apache-spark

I have a dataset of 8 billion records stored in Parquet files in Azure Data Lake Gen 2.
I wanted to copy a sample of 2 billion records to a different location for some benchmarking needs, so I did the following:
df = spark.read.option('inferSchema', 'true').format('parquet').option('badRecordsPath', f'/tmp/badRecords/').load(read_path)
df.limit(2000000000).write.option('badRecordsPath', f'/tmp/badRecords/').format('parquet').save(f'{write_path}/advertiser/2B_parquet')
This job is running on 8 worker nodes (8 cores, 28 GB RAM each) plus 1 master node. It has been running for over an hour and not a single file has been written yet. The load finished within 2s, so I know the limit + write action is the bottleneck (although load only infers the schema and lists the files without actually reading the data).
So I started inspecting the Spark UI for some clues and here are my observations
2 Jobs have been created by Spark
The first job took 35 mins. Here's the DAG
The second job has been running for about an hour now with no progress at all. The second job has two stages in it.
If you notice, stage 3 has one running task, but if I open the stages panel, I can't see any details of the task. I also don't understand why it's trying to do a shuffle when all I have is a limit on my DF. Does limit really need a shuffle? Even if it's shuffling, it seems like 1hr is awfully long to shuffle data around.
Also if this is what's really performing the limit, what did the first job really do? Just read the data? 35mins for that also seems too long, but for now I'd just settle on the job being completed.
Stage 4, which I believe is the actual write stage, is just stuck; I assume it is waiting for this shuffle to finish.
I am new to spark and I'm kinda clueless about what's happening here. Any insights on what I'm doing wrong will be very useful.
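On the shuffle question: as far as I know, limit(n) on a DataFrame is planned as a per-partition limit followed by an exchange that moves the surviving rows into a single partition for the global limit, which would explain the single long-running task in stage 3. If an approximate sample size is acceptable, a fraction-based sample avoids that single-partition step. A minimal sketch, assuming the row counts from the question; the helper names sampling_fraction and write_sample are made up for illustration, and df is assumed to be the DataFrame loaded above:

```python
def sampling_fraction(target_rows, total_rows):
    """Fraction of rows to keep so the sample has roughly target_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, target_rows / total_rows)


def write_sample(df, out_path, target_rows, total_rows, seed=42):
    """Write an approximate sample of df as parquet (hypothetical helper).

    df is assumed to be a pyspark.sql.DataFrame. sample() filters rows inside
    each partition, so the rows are not funnelled into a single partition,
    but the resulting row count is only approximately target_rows.
    """
    fraction = sampling_fraction(target_rows, total_rows)
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)
    sampled.write.format("parquet").save(out_path)
    return fraction


# 2 billion out of 8 billion rows: keep about a quarter of the data
fraction = sampling_fraction(2_000_000_000, 8_000_000_000)
print(fraction)
```

With the DataFrame from the question this could be called as write_sample(df, f'{write_path}/advertiser/2B_parquet', 2_000_000_000, 8_000_000_000). The trade-off is that the output will contain roughly, not exactly, 2 billion rows.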

Related

Is there a more systematic way to resolve a slow AWS Glue + PySpark execution stage?

I have this code snippet that I ran locally in standalone mode using 100 records only:
from awsglue.context import GlueContext
glue_context = GlueContext(sc)
glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
df = glue_df.toDF()
print(df.count())
The schema contains 89 columns all having string data type except 5 columns that have array of struct data type. The data size is 3.1 MB.
Also, here is some info about the environment used to run the code:
spark.executor.cores: 2
spark.executor.id: driver
spark.driver.memory: 1000M
The problem is that I can't work out why stage 1 took 12 minutes to finish when it only has to count 100 records. I also can't find out what the "Scan parquet" and "Exchange" tasks shown in this image mean:
Stage 1 DAG Visualization
My question is: is there a more systematic way to understand what those tasks mean? As a beginner, I relied heavily on the Spark UI, but it doesn't give much information about the tasks it has executed. I was able to find which task took the most time, but I have no idea why that is the case or how to resolve it systematically.
The running time of a Spark job is made up of the cluster start-up time, the DAG scheduler's optimisation time, and the time spent running the stages. In your case, the issue could be caused by the following:
The number of Parquet files. To test this easily, read the table and write it back as a single Parquet file. You are querying a table, but behind the scenes Spark reads the physical Parquet files, so the number of files is something to consider.
The number of Spark partitions. The number of partitions should be in proportion to the computing resources you have. For example, in your case you have 2 cores and a small table, so it is more efficient to use just a few partitions instead of the default of 200 shuffle partitions.
To get more clarity on the Spark stages, use the explain function and read the resulting plan. Its output lets you compare the Analyzed Logical Plan, Optimized Logical Plan, and Physical Plan produced by the internal optimiser.
To find a more detailed description of the explain function please visit this LINK
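As a rough illustration of the last two points, the sketch below lowers the shuffle partition count and then prints the plans for a DataFrame. It assumes Spark 3.0 or later (where explain accepts a mode argument); the helper name inspect_and_tune and the value 4 are made up for illustration and are not a recommendation.

```python
def inspect_and_tune(spark, df, shuffle_partitions=4):
    """Reduce the shuffle partition count, then print the query plans for df.

    spark is assumed to be a SparkSession and df a pyspark.sql.DataFrame.
    """
    # spark.sql.shuffle.partitions defaults to 200; a tiny table on 2 cores
    # needs far fewer. 4 is an arbitrary illustrative value.
    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))
    # "extended" prints the parsed, analyzed and optimized logical plans
    # followed by the physical plan.
    df.explain(mode="extended")
    return shuffle_partitions
```

For the code in the question this would be called as inspect_and_tune(glue_context.spark_session, df) before df.count(), assuming the Glue context exposes the session that way in your environment.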

Large number of stages in my spark program

When my Spark program executes, it creates 1000 stages. However, I have seen that only 200 is recommended. I have two actions at the end that write data to S3, and after that I unpersist the dataframes. When my Spark program writes the data to S3, it still runs for almost 30 minutes more. Why is that? Is it due to the large number of dataframes I have persisted?
P.S. I am running the program for 5 input records only.
The cluster probably takes longer to append data to an existing dataset. In particular, if all of the Spark jobs have finished but your command has not, it is because the driver node is moving the output files of the tasks from the job's temporary directory to the final destination one by one, which is slow with cloud storage. Try setting the configuration mapreduce.fileoutputcommitter.algorithm.version to 2.
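When the setting is passed through Spark, Hadoop options are given the spark.hadoop. prefix. A minimal sketch of how that could look in PySpark; the helper names and the commented-out session setup are illustrative, and whether version 2 is appropriate depends on your storage and failure-handling requirements:

```python
def committer_options(algorithm_version=2):
    """Spark options that select the FileOutputCommitter algorithm version."""
    key = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"
    return {key: str(algorithm_version)}


def apply_options(builder, options):
    """Apply each option to a SparkSession builder and return the builder."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder


options = committer_options()
print(options)
# With PySpark installed, the options could be applied like this:
#   from pyspark.sql import SparkSession
#   spark = apply_options(SparkSession.builder, options).getOrCreate()
```

The same key and value can also be passed on the command line with --conf when submitting the job.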

Processing Pipeline using Spark SQL- jobs, stages and DAG sizes

I have a processing pipeline built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) to produce the functional output. There are quite a lot of these operations (more than 100), which means I am running around 50 to 60 Spark SQL queries in a single pipeline. While the application completes successfully without any issues, my focus has shifted to optimizing the overall process. I have been able to speed up execution by tuning spark.sql.shuffle.partitions, changing the executor memory, and reducing spark.memory.fraction from the default 0.6 to 0.2. These changes brought great benefits: the overall execution time dropped from 20-25 mins to around 10 mins. The data volume is around 100k rows (source side).
The observations that I have from the Cluster are:
-The number of jobs triggered as a part of the application id is 235.
-The total number of stages across all the jobs is around 600.
-8 executors are used in a two node cluster (64 GB RAM in total with 10 cores).
-The resource manager UI of Yarn (for an application id) becomes very slow to retrieve the details of jobs/stages.
In one of the videos on Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum, and that the DAG size should also be smaller. What are the guidelines for doing this? How can I find the number of shuffles that are happening (my SQL queries have many joins and group by clauses)?
I would like suggestions on what I can do in the above scenario to improve performance and handle data skew in SQL queries that are JOIN/GROUP BY heavy.
Thanks
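One rough way to count shuffles is to look for Exchange operators in the physical plan that explain() prints, since each one marks a shuffle boundary. A minimal sketch, assuming the plan text has already been captured as a string; the sample plan below is hand-written for illustration and is not output from a real query, and count_shuffles is a made-up helper name:

```python
import re

# Hand-written, illustrative plan text (not real output).
SAMPLE_PLAN = """\
== Physical Plan ==
*(6) HashAggregate(keys=[region#3], functions=[count(1)])
+- Exchange hashpartitioning(region#3, 200)
   +- *(5) BroadcastHashJoin [dim_id#5], [id#9], Inner, BuildRight
      :- *(5) SortMergeJoin [k#1], [k#7], Inner
      :  :- *(2) Sort [k#1 ASC NULLS FIRST], false, 0
      :  :  +- Exchange hashpartitioning(k#1, 200)
      :  +- *(4) Sort [k#7 ASC NULLS FIRST], false, 0
      :     +- Exchange hashpartitioning(k#7, 200)
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, int, true]))
"""


def count_shuffles(plan_text):
    """Count operators named exactly 'Exchange' in a physical plan string.

    The lookbehind skips names that merely end in 'Exchange', such as
    BroadcastExchange, which is a broadcast rather than a shuffle.
    """
    return len(re.findall(r"(?<![A-Za-z])Exchange\b", plan_text))


shuffles = count_shuffles(SAMPLE_PLAN)
print(shuffles)
```

In PySpark the plan text could be captured, for example, with contextlib.redirect_stdout around df.explain(). Operator names vary between Spark versions and with adaptive query execution, so treat the count as an estimate and cross-check it against the stage boundaries in the Spark UI.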

Apache Spark Delay Between Jobs

As you can see, my small application has 4 jobs which run for a total duration of 20.2 seconds; however, there is a big delay between job 1 and 2, causing the total time to be over a minute. Job number 1 runJob at SparkHadoopMapReduceWriter.scala:88 is performing a bulkupload of HFiles into a HBase table. Here is the code I used to load the files:
val outputDir = new Path(HBaseUtils.getHFilesStorageLocation(resolvedTableName))
val job = Job.getInstance(hBaseConf)
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, resolvedTableName)
job.setOutputFormatClass(classOf[HFileOutputFormat2])
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
val connection = ConnectionFactory.createConnection(job.getConfiguration)
val hBaseAdmin = connection.getAdmin
val table = TableName.valueOf(Bytes.toBytes(resolvedTableName))
val tab = connection.getTable(table).asInstanceOf[HTable]
val bulkLoader = new LoadIncrementalHFiles(job.getConfiguration)
preBulkUploadCallback.map(callback => callback())
bulkLoader.doBulkLoad(outputDir, hBaseAdmin, tab, tab.getRegionLocator)
If anyone has any ideas, I would be very grateful.
I can see there are 26 tasks in job 1, which is based on the number of HFiles created. Even though job 2 shows as completed in 2s, it takes some time to copy these files to the target location, and that is why you are getting a delay between job 2 and 3. This can be avoided by reducing the number of tasks in job 1.
Decrease the number of regions for the output table in HBase, which will reduce the number of tasks for your second job.
TableOutputFormat determines the split based on the number of regions for a given table in HBase.
Job number 1 runJob at SparkHadoopMapReduceWriter.scala:88 is performing a bulkupload
This is not quite true. This job merely creates HFiles outside of HBase. The gap you see between this job and the next one could be explained by the actual bulk loading at bulkLoader.doBulkLoad. This operation involves only a metadata transfer and usually completes quickly (in my experience), so you should check the driver logs to see where it hangs.
Thanks for your input guys, I lowered the number of HFiles created in task 0. This has decreased the lag by about 20%. I used
HFileOutputFormat2.configureIncrementalLoad(job, tab, tab.getRegionLocator)
which automatically calculates the number of reduce tasks to match the current number of regions for the table. I will say that we are using HBase backed by S3 in AWS EMR instead of the classical HDFS. I'm going to investigate now whether this could be contributing to the lag.

Spark write to CSV fails even after 8 hours

I have a dataframe with roughly 200-600 GB of data that I am reading, manipulating, and then writing to CSV using the Spark shell (Scala) on an Elastic MapReduce cluster.
Here's how I'm writing to CSV:
result.persist.coalesce(20000).write.option("delimiter",",").csv("s3://bucket-name/results")
The result variable is created through a mix of columns from some other dataframes:
var result=sources.join(destinations, Seq("source_d","destination_d")).select("source_i","destination_i")
Now, I am able to read the csv data it is based on in roughly 22 minutes. In this same program, I'm also able to write another (smaller) dataframe to csv in 8 minutes. However, for this result dataframe it takes 8+ hours and still fails ... saying one of the connections was closed.
I'm also running this job on 13 x c4.8xlarge instances on ec2, with 36 cores each and 60 gb of ram, so I thought I'd have the capacity to write to csv, especially after 8 hours.
Many stages required retries or had failed tasks and I can't figure out what I'm doing wrong or why it's taking so long. I can see from the Spark UI that it never even got to the write CSV stage and was busy with persist stages, but without the persist function it was still failing after 8 hours. Any ideas? Help is greatly appreciated!
Update:
I've run the following command to repartition the result variable into 66K partitions:
val r2 = result.repartition(66000) // confirmed with numpartitions
r2.write.option("delimiter",",").csv("s3://s3-bucket/results")
However, even after several hours, the jobs are still failing. What am I doing wrong still?
note, I'm running spark shell via spark-shell yarn --driver-memory 50G
Update 2:
I've tried running the write with a persist first:
r2.persist(StorageLevel.MEMORY_AND_DISK)
But I had many stages fail, either with 'Job aborted due to stage failure: ShuffleMapStage 10 (persist at <console>:36) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3' or with 'Connection from ip-172-31-48-180.ec2.internal/172.31.48.180:7337 closed'.
Executors page
Spark web UI page for a node returning a shuffle error
Spark web UI page for a node returning an ec2 connection closed error
Overall Job Summary page
I can see from the Spark UI that it never even got to the write CSV
stage and was busy with persist stages, but without the persist
function it was still failing after 8 hours. Any ideas?
This is a FetchFailedException, i.e. a failure to fetch a shuffle block.
Since you can handle the smaller dataframe and it only fails on the huge one, I strongly suspect there are not enough partitions.
The first thing to do is print sources.rdd.getNumPartitions(), destinations.rdd.getNumPartitions() and result.rdd.getNumPartitions().
You need to repartition after the data is loaded in order to distribute the data (via a shuffle) to the other nodes in the cluster. This will give you the parallelism you need for faster processing without failures.
Furthermore, to verify the other configurations that are applied,
print all of the config like this and adjust the values as needed:
sc.getConf.getAll
Also have a look at
SPARK-5928
Spark-TaskRunner-FetchFailedException Possible reasons : OOM or Container memory limits
Repartition both sources and destinations before joining, with a number of partitions such that each partition is 10 MB to 128 MB (try to tune this); there is no need to make it 20000 (IMHO too many).
Then join on those two columns and write, without repartitioning (i.e. the output partitions should be the same as those from the repartitioning before the join).
If you still have trouble, try doing the same thing after converting both dataframes to RDDs (there are some differences between the APIs, especially regarding repartitioning, key-value RDDs, etc.).
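To make the sizing advice concrete, the arithmetic for picking a partition count from a target partition size can be sketched as follows. The helper name partitions_for is made up, and the 200 and 600 figures are the bounds quoted in the question, interpreted here as GiB:

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2


def partitions_for(total_bytes, target_partition_bytes):
    """Smallest partition count that keeps each partition at or under the
    target size, assuming the data is spread evenly across partitions."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


low = partitions_for(200 * GIB, 128 * MIB)
high = partitions_for(600 * GIB, 128 * MIB)
print(low, high)
```

So at 128 MB per partition the data would need a few thousand partitions rather than 20000 or 66000. The chosen number would then be passed to repartition on both inputs before the join; skewed keys can still produce uneven partitions, so this is only a starting point.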
