I am having trouble running a Databricks notebook (Scala). The job shows a very large shuffle write size and has already been running for over an hour.
Any idea how to find out why?
Shuffle write: 35.5 GB / 1796240509
What do 35.5 GB and 1796240509 mean?
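For context, the Spark UI labels this column "Shuffle Write Size / Records", so the first number should be the amount of data written to shuffle files and the second the number of records written. A quick sketch of what that works out to per record, assuming the UI reports sizes in binary units (1 GB = 2**30 bytes):

```python
# Assumption: the Spark UI reports sizes in binary units (1 GB = 2**30 bytes).
shuffle_write_bytes = 35.5 * 2**30
shuffle_write_records = 1_796_240_509

# Average size of one shuffled record, in bytes.
bytes_per_record = shuffle_write_bytes / shuffle_write_records
print(f"{bytes_per_record:.1f} bytes per record")
```

A small per-record size combined with a record count in the billions suggests the volume comes from the number of rows being shuffled rather than from wide rows.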
I have a dataset of 8 billion records stored in parquet files in Azure Data Lake Gen 2.
I wanted to copy a sample of 2 billion records to a different location for some benchmarking, so I did the following:
df = spark.read.option('inferSchema', 'true').format('parquet').option('badRecordsPath', f'/tmp/badRecords/').load(read_path)
df.limit(2000000000).write.option('badRecordsPath', f'/tmp/badRecords/').format('parquet').save(f'{write_path}/advertiser/2B_parquet')
This job is running on 8 nodes, each an 8-core machine with 28 GB RAM (8 worker nodes + 1 master node). It has been running for over an hour without a single file being written. The load finished within 2 s, so I know the limit + write action is the bottleneck (although load only infers the schema and builds a list of files without actually reading the data).
So I started inspecting the Spark UI for clues, and here are my observations:
2 Jobs have been created by Spark
The first job took 35 mins. Here's the DAG
The second job has been running for about an hour now with no progress at all. The second job has two stages in it.
If you notice, stage 3 has one running task, but if I open the stages panel, I can't see any details of the task. I also don't understand why it's trying to do a shuffle when all I have is a limit on my DF. Does limit really need a shuffle? Even if it's shuffling, it seems like 1hr is awfully long to shuffle data around.
Also if this is what's really performing the limit, what did the first job really do? Just read the data? 35mins for that also seems too long, but for now I'd just settle on the job being completed.
Stage 4, which I believe is the actual write stage, is just stuck; I think it is waiting for this shuffle to finish.
I am new to Spark and I'm kind of clueless about what's happening here. Any insight into what I'm doing wrong would be very useful.
I have written a PySpark job, and it is running longer than expected. I want to analyze the job execution and fix the part of the code that is causing the slowness. Because of an access issue with the Spark history UI, I cannot inspect the job plan, so I have to work around it in code to find out which section Spark spends the most time on.
I have tried running count on the DataFrame, but that does not help much in understanding the slowness.
Below are the steps in my code:
Step 1: read from the Cassandra table:
cassandra_data = spark_session.read \
    .format('org.apache.spark.sql.cassandra') \
    .options(table=table, keyspace=keyspace) \
    .load()
return cassandra_data
Step 2: add a column to the DataFrame read from Cassandra that holds the MD5 hash of the entire row.
data_wth_hash = prepare_data_md5(cassandra_data)
Step 3: write to an AWS S3 folder.
The job takes much longer while writing to S3, and I do not have access to the Spark history UI to see where the time is going.
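Without the history UI, one workaround is to time each action from the driver. The following is a minimal sketch, not a definitive approach: `timed` and `timings` are hypothetical names, and the body of the with-block is a stand-in for a real Spark action. Because Spark evaluates lazily, wrapping a transformation alone (such as adding the MD5 column) measures almost nothing; the time only shows up when an action such as a write or count runs.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record the wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

timings = {}
with timed("example step", timings):
    # Stand-in for a Spark action such as a write; only actions
    # (write, count, collect, ...) trigger work worth timing.
    sum(range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

To separate the cost of reading from Cassandra from the cost of writing to S3, one option is to time an action on the raw read first and then time the full write, keeping in mind that the second measurement includes the read again unless the data is cached.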
I have this code snippet that I ran locally in standalone mode using 100 records only:
from awsglue.context import GlueContext
glue_context = GlueContext(sc)
glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
df = glue_df.toDF()
The schema contains 89 columns all having string data type except 5 columns that have array of struct data type. The data size is 3.1 MB.
Also, here is some info about the environment used to run the code:
spark.executor.cores: 2
spark.executor.id: driver
spark.driver.memory: 1000M
The problem is that I can't figure out why stage 1 took 12 minutes to finish when it only has to count 100 records. I also can't find out what the "Scan parquet" and "Exchange" tasks mean, as shown in this image:
Stage 1 DAG Visualization
My question is: is there a more systematic way to understand what those tasks mean? As a beginner, I rely heavily on the Spark UI, but it doesn't give much information about the tasks it has executed. I was able to find which task took the most time, but I have no idea why that is the case or how to resolve it systematically.
The running time of a Spark job is made up of the cluster start-up time, the DAG scheduler's optimisation time, and the time spent running the stages. In your case, the slowness could be caused by the following:
The number of parquet files. To test this easily, read the table and write it back as a single parquet file. You are querying a table, but behind the scenes Spark reads the physical parquet files, so the number of files is something to consider.
The number of Spark partitions. The number of partitions should be in proportion to the computing resources you have. In your case, for example, you have 2 cores and a small table, so it is more efficient to use just a few partitions instead of the default of 200 shuffle partitions.
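As a rough illustration of the second point, the shuffle partition count can be lowered through the `spark.sql.shuffle.partitions` setting. The sketch below assumes an existing SparkSession is passed in as `spark`; the helper names and the "two tasks per core" rule of thumb are hypothetical, not an official recommendation.

```python
def suggest_shuffle_partitions(total_cores, tasks_per_core=2):
    """Hypothetical rule of thumb: a small multiple of the available cores."""
    return max(1, total_cores * tasks_per_core)

def apply_shuffle_partitions(spark, total_cores):
    """Set spark.sql.shuffle.partitions on an existing SparkSession."""
    n = suggest_shuffle_partitions(total_cores)
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    return n

# With the 2 cores from the question this suggests 4 partitions.
print(suggest_shuffle_partitions(2))
```

Calling `apply_shuffle_partitions(spark, 2)` before the count should make the Exchange stage run with 4 tasks instead of 200.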
To learn more about the Spark stages, call the explain function on the DataFrame and read the resulting plan. Its output lets you compare the Analyzed Logical Plan, the Optimized Logical Plan, and the Physical Plan produced by the internal optimiser.
To find a more detailed description of the explain function please visit this LINK
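A minimal sketch of that call, assuming `df` is the DataFrame from the question (`show_plans` is a hypothetical wrapper name):

```python
def show_plans(df):
    """Print the logical plans and the physical plan of a Spark DataFrame."""
    # Passing True asks for the extended output: parsed, analyzed and
    # optimized logical plans in addition to the physical plan.
    df.explain(True)
```

In the physical plan, an Exchange operator marks a shuffle between stages, and a parquet scan operator is the step that reads the underlying files, which should correspond to the boxes shown in the DAG visualization.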
I have ORC data on HDFS (non partitioned), ~8billion rows, 250GB in size.
I am reading the data into a DataFrame and writing it back without any transformations, using partitionBy:
df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
When I monitored the job in the Spark UI, the job and stage completed in 20 minutes, but the "SQL" tab showed 40 minutes.
After running the job in debug mode and going through the Spark log, I realised that the tasks writing to "_temporary" complete in 20 minutes.
After that, merging "_temporary" into the actual output path takes another 20 minutes.
So my question is: does the driver process merge the data from "_temporary" into the output path sequentially, or is it done by executor tasks?
Is there anything I can do to improve the performance?
You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which is known to create a bottleneck. But frankly, people usually observe this problem only with a much larger number of files than in your case. Depending on your version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details.
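As a sketch of how that setting could be passed, assuming the job is launched with spark-submit (`my_job.py` is a placeholder and `to_submit_args` is a hypothetical helper):

```python
# Assumption: the job is launched with spark-submit; "my_job.py" is a placeholder.
COMMITTER_CONF = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}

def to_submit_args(conf):
    """Render a dict of Spark settings as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

print(" ".join(["spark-submit", *to_submit_args(COMMITTER_CONF), "my_job.py"]))
```

Note that with version 2 each task moves its output into the destination as it commits, so a job that fails midway can leave partial output behind; whether that trade-off is acceptable depends on the pipeline.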
On a separate note, having 8 cores per executor is not recommended as it might saturate disk IO when all 8 tasks are writing output at once.
I just started the work on the qualification of a big data platform, and I would like to have proposals on how to test the performance of reading and writing on hdfs.
If you run Spark jobs for the read and write operations, you can see the job time in the YARN ResourceManager UI (localhost:8088 by default; localhost:50070 is the HDFS NameNode UI in Hadoop 2). If you are using spark-shell, you have to measure the time manually, for example with the time function.
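For the manual measurement, a small sketch that turns elapsed time into throughput could look like this. `measure_throughput` is a hypothetical helper, and the sleep is a stand-in for a real read or write action whose data size is known.

```python
import time

def measure_throughput(action, num_bytes):
    """Run `action` once and return (elapsed seconds, MB per second)."""
    start = time.perf_counter()
    action()
    elapsed = time.perf_counter() - start
    megabytes = num_bytes / 1e6
    rate = megabytes / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate

# Stand-in workload; with Spark this would be a callable wrapping a
# read or write action on HDFS, with num_bytes set to the dataset size.
elapsed, mb_per_s = measure_throughput(lambda: time.sleep(0.01), num_bytes=1_000_000)
print(f"{elapsed:.3f}s, {mb_per_s:.1f} MB/s")
```

Running the same action several times and comparing the results helps separate one-off start-up costs from steady-state read and write speed.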