Method to optimize PySpark dataframe saving time - python-3.x

I'm running a notebook on Azure databricks using a multinode cluster with 1 driver and 1-8 workers(each with 16 cores and 56 gb ram). Reading the source data from Azure ADLS which has 30K records. Notebook is consist of few transformation steps, also using two UDFs which are necessary for code implementation. While my entire transformation steps are running within 12 minutes(which is expected), it is taking more than 2 hours to save the final dataframe to ADSL Delta table. I'm providing some code snippet here(can't provide the entire code), suggest me ways to reduce this dataframe saving time.
# All the data reading and transformation code
# only one display statement before saving it to delta table. Up to this statement it is taking 12 minutes to run
data.display()
# Persisting the data frame
from pyspark import StorageLevel
data.persist(StorageLevel.MEMORY_ONLY)
# Finally writing the data to delta table
# This part is taking more than 2 hours to run
# Persist Brand Extraction Output
(
data
.write
.format('delta')
.mode('overwrite')
.option('overwriteSchema', 'true')
.saveAsTable('output_table')
)
Another save option tried but not much improvement
mount_path = "/mnt/********/"
table_name = "********"
adls_path = mount_path + table_name
(data.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(adls_path))

Related

How to optimally use my Spark resources in Azure Synapse when calling a notebook reference in a loop?

I'm running a data transformation in Synapse and would like to speed it up.
My pool is configured as "4 vCores, 28GB of memory with dynamic executors from 1..7".
My data in ADL Gen2 consists of roughly 300 directories. Every directory holds between 100 and 70,000 JSON files.
These JSON files need to be converted into parquet format, then the parquet is transformed (splitting out a nested array), and the results are again stored in ADL using parquet.
I created a notebook that accepts a single directory as input, creates the intermediary parquet data and transforms that data into the final structure. For the largest directory (70k files) this takes about 10 minutes, most directories complete within seconds to minutes.
Another notebook gets all directories and executes the notebook responsible for processing a single directory:
from notebookutils import mssparkutils
for folder in folders:
print('Processing folder ' + folder + '(' + str(folders.index(folder) + 1) + ' of ' + str(len(folders)) + ')')
mssparkutils.notebook.run("RawJson/CopyToDataLake_SingleFolder", 1800, {
"folderKey": folder,
"outputSubfolder": outputSubfolder
})
Almost all time is spent converting the thousands of JSON files into parquet (see below, excerpt from the notebook called in the loop above), the final transformation (flattening nested arrays) is almost neglectable:
from notebookutils import mssparkutils
jsonDataInputPath = '{}/{}'.format(dataLakeRoot, folderKey)
repartitionedDataOutputPath = '{}/repartitioned/{}'.format(dataLakeRoot, folderKey)
# Read JSON.
input_df = spark \
.read \
.schema(jsonSchema) \
.json('{}/*/*.json'.format(jsonDataInputPath), multiLine=True)
# Reduce to a max of 16 partitions.
partitioned_df = input_df.coalesce(16)
# Store as parquet.
partitioned_df \
.write \
.mode("overwrite") \
.parquet(repartitionedDataOutputPath)
The job runs for hours and I would like to speed it up. If I check the session configuration while it is running I always see 0% utilization and although my pool is configured for dynamic executors, only 2 are used:
Can I run multiple notebook executions parallel?

Partition pruning not working on Hudi dataset

We have created a Hudi dataset which has two level partition like this
s3://somes3bucket/partition1=value/partition2=value
where partition1 and partition2 is of type string
When running a simple count query using Hudi format in spark-shell, it takes almost 3 minutes to complete
spark.read.format("hudi").load("s3://somes3bucket").
where("partition1 = 'somevalue' and partition2 = 'somevalue'").
count()
res1: Long = ####
attempt 1: 3.2 minutes
attempt 2: 2.5 minutes
Here is also the metrics in Spark UI where ~9000 tasks (which is approximately equivalent to the total no of files in the ENTIRE dataset s3://somes3bucket) are used for computation. Seems like spark is reading the entire dataset instead of partition pruning....and then filtering the dataset based on the where clause
Whereas, if I use the parquet format to read the dataset, the query only takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
spark.read.parquet("s3://somes3bucket").
where("partition1 = 'somevalue' and partition2 = 'somevalue'").
count()
res2: Long = ####
~ 30 seconds
Here is the spark UI, where only 1361 files are scanned (vis-a-vis ~9000 files in Hudi) and takes only 15 seconds
Any idea why partition pruning is not working when using Hudi format? Wondering if I am missing any configuration during the creation of the dataset?
PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0

Performance tuning for PySpark data load from a non-partitioned hive table to a partitioned hive table

We have a requirement to ingest data from a non-partitioned EXTERNAL hive table work_db.customer_tbl to a partitioned EXTERNAL hive table final_db.customer_tbl through PySpark, previously done through hive query. The final table is partitioned by the column load_date (format of load_date column is yyyy-MM-dd).
So we have a simple PySpark script which uses an insert query (same as the hive query which was used earlier), to ingest the data using spark.sql() command. But we have some serious performance issues because the table we are trying to ingest after ingestion has around 3000 partitions and each partitions has around 4 MB of data except for the last partition which is around 4GB. Total table size is nearly 15GB. Also, after ingestion each partition has 217 files. The final table is a snappy compressed parquet table.
The source work table has a single 15 GB file with filename in the format customers_tbl_unload.dat.
Earlier when we were using the hive query through a beeline connection it usually takes around 25-30 minutes to finish. Now when we are trying to use the PySpark script it is taking around 3 hours to finish.
How can we tune the spark performance to make the ingestion time less than what it took for beeline.
The configurations of the yarn queue we use is:
Used Resources: <memory:5117184, vCores:627>
Demand Resources: <memory:5120000, vCores:1000>
AM Used Resources: <memory:163072, vCores:45>
AM Max Resources: <memory:2560000, vCores:500>
Num Active Applications: 45
Num Pending Applications: 45
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:5120000, vCores:1000>
Reserved Resources: <memory:0, vCores:0>
Max Running Applications: 200
Steady Fair Share: <memory:5120000, vCores:474>
Instantaneous Fair Share: <memory:5120000, vCores:1000>
Preemptable: true
The parameters passed to the PySpark script is:
num-executors=50
executor-cores=5
executor-memory=10GB
PySpark code used:
insert_stmt = """INSERT INTO final_db.customers_tbl PARTITION(load_date)
SELECT col_1,col_2,...,load_date FROM work_db.customer_tbl"""
spark.sql(insert_stmt)
Even after nearly using 10% resources of the yarn queue the job is taking so much time. How can we tune the job to make it more efficient.
You need to reanalyze your dataset and look if you are using the correct approach by partitioning yoir dataset on date column or should you be probably partitioning on year?
To understand why you end up with 200 plus files for each partition, you need to understand the difference between the Spark and Hive partitions.
A direct approach you should try first is to read your input dataset as a dataframe and partition it by the key you are planning to use as a partition key in Hive and then save it using df.write.partitionBy
Since the data seems to be skewed too on date column, try partitioning it on additional columns which might have equal distribution of data. Else, filter out the skewed data and process it separately

How to avoid re-evaluation of each transformation on pyspark data frame again and again

I have a spark data frame. I'm doing multiple transformations on the data frame. My code looks like this:
df = df.withColumn ........
df2 = df.filter......
df = df.join(df1 ...
df = df.join(df2 ...
Now I have around 30 + transformations like this. Also I'm aware of persisting of a data frame. So if I have some transformations like this:
df1 = df.filter.....some condition
df2 = df.filter.... some condtion
df3 = df.filter... some other conditon
I'm persisting the data frame "df" in the above case.
Now the problem is spark is taking too long to run (8 + mts) or sometimes it fails with Java heap space issue.
But after some 10+ transformations if I save to a table (persistent hive table) and read from table in the next line, it takes around 3 + mts to complete. Its not working even if I save it to a intermediate in memory table.
Cluster size is not the issue either.
# some transformations
df.write.mode("overwrite").saveAsTable("test")
df = spark.sql("select * from test")
# some transormations ---------> 3 mts
# some transformations
df.createOrReplaceTempView("test")
df.count() #action statement for view to be created
df = spark.sql("select * from test")
# some more transformations --------> 8 mts.
I looked at spark sql plan(still do not completely understand it). It looks like spark is re evaluating same dataframe again and again.
What I'm i doing wrong? I don have to write it to intermediate table.
Edit: I'm working on azure databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
Edit2: The issue is rdd long lineage. It looks like my spark application is getting slower and slower if the rdd lineage is increasing.
You should use caching.
Try using
df.cache
df.count
Using count to force caching all the information.
Also I recommend you take a look at this and this

Incremental Data loading and Querying in Pyspark without restarting Spark JOB

Hi All I want to do incremental data query.
df = spark .read.csv('csvFile', header=True) #1000 Rows
df.persist() #Assume it takes 5 min
df.registerTempTable('data_table') #or createOrReplaceTempView
result = spark.sql('select * from data_table where column1 > 10') #100 rows
df_incremental = spark.read.csv('incremental.csv') #200 Rows
df_combined = df.unionAll(df_incremental)
df_combined.persist() #It will take morethan 5 mins, I want to avoid this, because other queries might be running at this time
df_combined.registerTempTable("data_table")
result = spark.sql('select * from data_table where column1 > 10') # 105 Rows.
read a csv/mysql Table data into spark dataframe.
Persist that dataframe in memory Only(reason: I need performance & My dataset can fit to memory)
Register as temp table and run spark sql queries. #Till this my spark job is UP and RUNNING.
Next day i will receive a incremental Dataset(in a temp_mysql_table or a csv file). Now I want to run same query on a Total set i:e persisted_prevData + recent_read_IncrementalData. i will call it mixedDataset.
*** there is no certainty that when incremental data comes to system, it can come 30 times a day.
Till here also I don't want the spark-Application to be down,. It should always be Up. And I need performance of querying mixedDataset with same time measure as if it is persisted.
My Concerns :
In P4, Do i need to unpersist the prev_data and again persist the union-Dataframe of prev&Incremantal data?
And my most important concern is i don't want to restart the Spark-JOB to load/start with Updated Data(Only if server went down, i have to restart of course).
So, on a high level, i need to query (faster performance) dataset + Incremnatal_data_if_any dynamically.
Currently i am doing this exercise by creating a folder for all the data, and incremental file also placed in the same directory. Every 2-3 hrs, i am restarting the server and my sparkApp starts with reading all the csv files present in that system. Then queries running on them.
And trying to explore hive persistentTable and Spark Streaming, will update here if found any result.
Please suggest me a way/architecture to achieve this.
Please comment, if anything is not clear on Question, without downvoting it :)
Thanks.
Try streaming instead it will be much faster since the session is already running and it will be triggered everytime you place something in the folder:
df_incremental = spark \
.readStream \
.option("sep", ",") \
.schema(input_schema) \
.csv(input_path)
df_incremental.where("column1 > 10") \
.writeStream \
.queryName("data_table") \
.format("memory") \
.start()
spark.sql("SELECT * FROM data_table).show()

Resources