Performance tuning for PySpark data load from a non-partitioned hive table to a partitioned hive table - apache-spark

We have a requirement to ingest data from a non-partitioned EXTERNAL Hive table work_db.customer_tbl into a partitioned EXTERNAL Hive table final_db.customer_tbl through PySpark; this was previously done through a Hive query. The final table is partitioned by the column load_date (format yyyy-MM-dd).
So we have a simple PySpark script which runs the same insert query as the earlier Hive query through the spark.sql() command. But we are hitting serious performance issues: after ingestion the target table has around 3000 partitions, each holding around 4 MB of data, except for the last partition which is around 4 GB. Total table size is nearly 15 GB. Also, after ingestion each partition contains 217 files. The final table is a Snappy-compressed Parquet table.
The source work table has a single 15 GB file with filename in the format customers_tbl_unload.dat.
Earlier, when we ran the Hive query through a beeline connection, it usually took around 25-30 minutes to finish. Now the PySpark script is taking around 3 hours.
How can we tune the Spark job so the ingestion takes less time than it did through beeline?
The configuration of the YARN queue we use is:
Used Resources: <memory:5117184, vCores:627>
Demand Resources: <memory:5120000, vCores:1000>
AM Used Resources: <memory:163072, vCores:45>
AM Max Resources: <memory:2560000, vCores:500>
Num Active Applications: 45
Num Pending Applications: 45
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:5120000, vCores:1000>
Reserved Resources: <memory:0, vCores:0>
Max Running Applications: 200
Steady Fair Share: <memory:5120000, vCores:474>
Instantaneous Fair Share: <memory:5120000, vCores:1000>
Preemptable: true
The parameters passed to the PySpark script are:
num-executors=50
executor-cores=5
executor-memory=10GB
PySpark code used:
insert_stmt = """INSERT INTO final_db.customers_tbl PARTITION(load_date)
SELECT col_1,col_2,...,load_date FROM work_db.customer_tbl"""
spark.sql(insert_stmt)
Even while using nearly 10% of the YARN queue's resources, the job takes this long. How can we tune it to be more efficient?

You need to re-analyze your dataset and check whether partitioning on the date column is the right approach, or whether you should perhaps be partitioning on year instead.
To understand why you end up with 200-plus files in each partition, you need to understand the difference between Spark partitions and Hive partitions.
A direct approach you should try first is to read your input dataset as a DataFrame, repartition it by the key you plan to use as the Hive partition key, and then save it using df.write.partitionBy.
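A minimal PySpark sketch of that approach (table and column names come from the question; the HDFS location is a placeholder, and the repartition("load_date") is what brings the output down from ~217 files per partition to one):

# Read the non-partitioned source table as a DataFrame
df = spark.table("work_db.customer_tbl")

# Repartition by the intended Hive partition key so each load_date is written
# by a single task, i.e. one output file per partition instead of one per task
(df.repartition("load_date")
   .write
   .partitionBy("load_date")
   .option("compression", "snappy")
   .mode("overwrite")
   .parquet("/data/final_db/customer_tbl"))  # placeholder path for the external table's location

Because the target is an EXTERNAL table, the freshly written partition directories would still need to be registered in the metastore (for example via MSCK REPAIR TABLE), or you could write through saveAsTable/insertInto instead.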
Since the data also seems to be skewed on the date column, try partitioning on additional columns that might have a more even distribution of data. Otherwise, filter out the skewed data and process it separately.
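A hedged sketch of the "filter out the skewed data and process it separately" suggestion; the date literal, the parallelism of 8, and the output path are illustrative placeholders:

from pyspark.sql import functions as F

df = spark.table("work_db.customer_tbl")
skewed_date = "2021-12-31"  # placeholder for the load_date that holds ~4 GB

# Evenly sized dates: one task (and one file) per load_date
(df.filter(F.col("load_date") != skewed_date)
   .repartition("load_date")
   .write.partitionBy("load_date")
   .mode("append")
   .parquet("/data/final_db/customer_tbl"))  # placeholder path

# Skewed date: spread the ~4 GB across several tasks so one executor is not the straggler
(df.filter(F.col("load_date") == skewed_date)
   .repartition(8)  # illustrative; roughly partition size / desired file size
   .write.partitionBy("load_date")
   .mode("append")
   .parquet("/data/final_db/customer_tbl"))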

Related

What amount of data read is normal for spark reading parquet on S3 during a select statement?

We have a table of 130 GB and 4000 columns. When we select 2 of these columns, our Spark UI reports a total of 30 GB read. However, if we select those two columns and store them as a separate dataset, the total size of that dataset is just 17 MB. Given that Parquet is columnar storage, something appears not to be working properly. I've found this question but I'm unsure how to diagnose further and what steps to take to reduce the amount of I/O required.
It was my understanding that the benefit of columnar storage is that each column can be read more or less independently of each other.
We're running on Hadoop 2.7.X on Databricks. It occurs on both the 6.X and 7.X Databricks runtimes (Spark 2.4/3.0).
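Not part of the original post, but one way to check whether column pruning is actually reaching the Parquet reader is to inspect the physical plan (path and column names below are placeholders):

# If pruning works, the FileScan node's ReadSchema should list only the two selected columns,
# not the full 4000-column struct.
df = spark.read.parquet("s3://bucket/path/to/table")  # placeholder path
df.select("col_a", "col_b").explain(True)             # placeholder column names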

Alter table to add partition taking long time on Hive external table

I'm trying to execute a Spark job on an EMR cluster with 6 nodes (8 cores and 56 GB memory on each node). The Spark job does an incremental load into the partitions of a Hive table, and at the end it runs a refresh table to update the metadata.
The refresh command takes as long as 3-6 hours to complete, which is too long.
Nature of data in Hive:
27 GB of data located on S3.
Stored in Parquet.
Partitioned on 2 columns (e.g. s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12).
Note: it's a date-wise partition and cannot be changed.
Spark config used:
Num-executors= 15
Executor-memory =16Gb
Executor-cores = 2
Driver-memory= 49Gb
Spark-shuffle-partitions=48
Hive.exec.dynamic.partition.mode=nonstrict
Spark.sql.sources.partitionOverwriteMode=dynamic.
Things tried:
Tuning the spark cores/memory/executors but no luck.
Refresh table command.
Alter table add partition command.
Hive CLI takes 3-4 hours to complete MSCK REPAIR TABLE tablename.
All the above had no effect on reducing the time to refresh the partition on Hive.
Some questions:
Am I missing any tuning parameter, given that the data is stored on Amazon S3?
Currently the number of partitions on the table is close to 10k; is this an issue?
Any help will be much appreciated.
If possible, reduce the partitioning to a single column. Multi-level (multi-column) partitioning kills performance.
Use R-type instances. They provide more memory than M-type instances at the same price.
Use coalesce to merge the source files if there are many small files (see the sketch after this list).
Check the number of mapper tasks. The more tasks there are, the worse the performance.
Use EMRFS rather than S3 to keep the metadata info.
Use the following EMR configuration:
{
  "Classification": "spark",
  "Properties": {
    "maximizeResourceAllocation": "true"
  }
}
Follow some of the instructions from the link below.
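A minimal sketch of the coalesce suggestion above; the paths and the target file count are placeholders:

# Compact many small source files into a handful of larger ones before the incremental load
df = spark.read.parquet("s3a://bucket-name/source/small-files/")  # placeholder input path

# coalesce() only merges existing partitions (no shuffle); pick a count that yields
# files of roughly 128-512 MB each
df.coalesce(10).write.mode("overwrite").parquet("s3a://bucket-name/source/compacted/")  # placeholder output path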

How Spark SQL reads Parquet partitioned files

I have a Parquet file of around 1 GB. Each data record is a reading from an IoT device, capturing the energy consumed by the device in the last minute.
Schema: houseId, deviceId, energy
The parquet file is partitioned on houseId and deviceId. A file contains the data for the last 24 hours only.
I want to execute some queries on the data in this Parquet file using Spark SQL. An example query finds the average energy consumed per device for a given house in the last 24 hours.
Dataset<Row> df4 = ss.read().parquet("/readings.parquet");
df4.as(encoder).registerTempTable("deviceReadings");
ss.sql("Select avg(energy) from deviceReadings where houseId=3123").show();
The above code works well. I want to understand how Spark executes this query.
Does Spark read the whole Parquet file in memory from HDFS without looking at the query? (I don't believe this to be the case)
Does Spark load only the required partitions from HDFS as per the query?
What if there are multiple queries that need to be executed? Will Spark look at multiple queries while preparing an execution plan? One query may work with just one partition, whereas a second query may need all of the partitions, so a consolidated plan would load the whole file from disk into memory (if memory limits allow it).
Will it make a difference in execution time if I cache df4 dataframe above?
Does Spark read the whole Parquet file in memory from HDFS without looking at the query?
It shouldn't scan all data files, but in general it might access the metadata of all files.
Does Spark load only the required partitions from HDFS as per the query?
Yes, it does.
Will Spark look at multiple queries while preparing an execution plan?
It does not. Each query has its own execution plan.
Will it make a difference in execution time if I cache df4 dataframe above?
Yes, at least for now, it will make a difference - Caching dataframes while keeping partitions
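A quick way to see both behaviours yourself, in PySpark rather than the question's Java (values mirror the question):

df4 = spark.read.parquet("/readings.parquet")
df4.createOrReplaceTempView("deviceReadings")

# The physical plan contains a PartitionFilters entry on houseId, i.e. only the
# houseId=3123/... directories are listed and read.
spark.sql("SELECT avg(energy) FROM deviceReadings WHERE houseId = 3123").explain(True)

# Caching keeps the scanned data in memory, so repeated queries skip the Parquet read
# (at the cost of memory for whatever the first scan touched).
df4.cache()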

spark behavior on hive partitioned table

I use Spark 2.
Actually I am not the one executing the queries so I cannot include query plans. I have been asked this question by the data science team.
We have a Hive table partitioned into 2000 partitions and stored in Parquet format. When this table is used in Spark, exactly 2000 tasks are executed across the executors. But we have a block size of 256 MB, so we were expecting roughly (total size / 256 MB) partitions, which we were sure would be much lower than 2000. Is there any internal logic by which Spark uses the physical structure of the data to create partitions? Any reference/help would be greatly appreciated.
UPDATE: It is the other way around. Our table is actually very large, about 3 TB with 2000 partitions. 3 TB / 256 MB would come to roughly 11720, but we get exactly the same number of partitions as the table is physically partitioned into. I just want to understand how the tasks are generated based on data volume.
In general Hive partitions are not mapped 1:1 to Spark partitions. 1 Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple hive-partitions.
The number of Spark partitions when you load a hive-table depends on the parameters:
spark.sql.files.maxPartitionBytes (default 128 MB)
spark.sql.files.openCostInBytes (default 4 MB)
You can check the partitions e.g. using
spark.table("yourtable").rdd.partitions
This will give you an Array of FilePartitions which contain the physical path of your files.
That you got exactly 2000 Spark partitions from your 2000 Hive partitions seems a coincidence to me; in my experience this is very unlikely to happen. Note that the situation in Spark 1.6 was different: there the number of Spark partitions resembled the number of files on the filesystem (1 Spark partition for 1 file, unless the file was very large).
I just want to understand how the tasks are generated on data volume.
Tasks are a runtime artifact and their number is exactly the number of partitions.
The number of tasks does not correlate to data volume in any way. It's a Spark developer's responsibility to have enough partitions to hold the data.
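If you want to see how those settings drive the partition (and therefore task) count, a quick check in PySpark might look like this (the table name is a placeholder, and the setting only takes effect when Spark uses its native Parquet reader for the Hive table, which is the default):

# Lower maxPartitionBytes -> more, smaller Spark partitions; raise it -> fewer, larger ones
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)  # 256 MB, matching the HDFS block size

df = spark.table("your_db.your_table")   # placeholder table name
print(df.rdd.getNumPartitions())         # partition count = task count of the scan stage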

increasing number of partitions in spark

I was using Hive to execute SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables using this configuration, with a certain date column as the partition.
Now I wanted to use Spark SQL to benchmark the same queries operating on the same data.
I executed the following query
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatekey group by col1,col2,col3")
In Hive this query takes 90 seconds. But Spark takes 21 minutes for the same query, and on looking at the job I found that Spark creates 2 stages, and the first stage has only 7 tasks, one for each of the 7 blocks of ORC data within the given partition. The blocks are of different sizes, one is 5 MB while another is 45 MB, and because of this the stragglers take more time, making the whole job too slow.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and thereby the number of tasks in stage 1, even though there are only 7 physical blocks for the given range of the query?
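No answer is quoted here, but one knob that is often suggested (an assumption on my part, and it only applies when Spark uses its native ORC reader rather than the Hive SerDe) is to shrink the maximum split size so the larger blocks are read by several tasks; PySpark syntax:

# Smaller maximum split size -> the 45 MB ORC block is read by several tasks
# instead of one straggler (8 MB is illustrative, not a recommendation from the thread)
spark.conf.set("spark.sql.files.maxPartitionBytes", 8 * 1024 * 1024)

q1 = spark.sql(
    "select col1,col2,col3,sum(col4),sum(col5) from mytable "
    "where date_key=somedatekey group by col1,col2,col3"
)
# The effect shows up as a higher task count in the first (scan) stage of the Spark UI.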
