Partition pruning not working on Hudi dataset - apache-spark

We have created a Hudi dataset with a two-level partitioning scheme like this
s3://somes3bucket/partition1=value/partition2=value
where partition1 and partition2 are of type string.
When running a simple count query using the Hudi format in spark-shell, it takes almost 3 minutes to complete:
spark.read.format("hudi").load("s3://somes3bucket").
where("partition1 = 'somevalue' and partition2 = 'somevalue'").
count()
res1: Long = ####
attempt 1: 3.2 minutes
attempt 2: 2.5 minutes
Here are the metrics from the Spark UI, where ~9000 tasks (roughly the total number of files in the ENTIRE dataset s3://somes3bucket) are used for the computation. It looks like Spark is reading the entire dataset instead of pruning partitions, and only then filtering it based on the where clause.
Whereas if I use the Parquet format to read the dataset, the query takes only ~30 seconds (vis-a-vis 3 minutes with the Hudi format):
spark.read.parquet("s3://somes3bucket").
where("partition1 = 'somevalue' and partition2 = 'somevalue'").
count()
res2: Long = ####
~ 30 seconds
Here is the Spark UI, where only 1361 files are scanned (vis-a-vis ~9000 files with Hudi) and the scan takes only 15 seconds.
Any idea why partition pruning is not working when using the Hudi format? Am I missing some configuration during the creation of the dataset?
PS: I ran this query on emr-6.3.0, which ships Hudi version 0.7.0.
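For what it's worth, one workaround sometimes suggested for older Hudi releases is to point the reader at the partition path itself (or a glob over it) so that only the files under that prefix are listed. A rough PySpark sketch with placeholder values; whether this actually restores pruning depends on the Hudi version:
# Sketch only: load a single partition's path instead of the table base path,
# so the file listing is limited to that prefix.
df = (spark.read.format("hudi")
      .load("s3://somes3bucket/partition1=somevalue/partition2=somevalue"))
df.count()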

Related

Method to optimize PySpark dataframe saving time

I'm running a notebook on Azure Databricks using a multi-node cluster with 1 driver and 1-8 workers (each with 16 cores and 56 GB RAM), reading the source data from Azure ADLS, which has 30K records. The notebook consists of a few transformation steps and also uses two UDFs that are necessary for the implementation. While all of the transformation steps run within 12 minutes (which is expected), it takes more than 2 hours to save the final dataframe to the ADLS Delta table. I'm providing some code snippets here (I can't provide the entire code); please suggest ways to reduce this dataframe saving time.
# All the data reading and transformation code
# Only one display statement before saving to the delta table. Up to this statement it takes 12 minutes to run
data.display()
# Persisting the data frame
from pyspark import StorageLevel
data.persist(StorageLevel.MEMORY_ONLY)
# Finally writing the data to delta table
# This part is taking more than 2 hours to run
# Persist Brand Extraction Output
(
data
.write
.format('delta')
.mode('overwrite')
.option('overwriteSchema', 'true')
.saveAsTable('output_table')
)
Another save option I tried, but without much improvement:
mount_path = "/mnt/********/"
table_name = "********"
adls_path = mount_path + table_name
(data.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(adls_path))
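One variation that may be worth trying is sketched below. persist() is lazy, so unless the dataframe is materialized before the write, the UDF-heavy transformations run again inside the save; forcing materialization and controlling the number of output files at least separates transformation time from write time. The MEMORY_AND_DISK level and repartition(64) are illustrative choices, not values from the original post.
from pyspark import StorageLevel
# Materialize the UDF-heavy transformations once, before the write.
# MEMORY_AND_DISK avoids recomputing partitions that don't fit in memory.
data = data.persist(StorageLevel.MEMORY_AND_DISK)
data.count()
(
data
.repartition(64)  # illustrative: control the number of files written
.write
.format('delta')
.mode('overwrite')
.option('overwriteSchema', 'true')
.saveAsTable('output_table')
)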

Performance tuning for PySpark data load from a non-partitioned hive table to a partitioned hive table

We have a requirement to ingest data from a non-partitioned EXTERNAL Hive table work_db.customer_tbl into a partitioned EXTERNAL Hive table final_db.customer_tbl through PySpark, which was previously done through a Hive query. The final table is partitioned by the column load_date (the format of the load_date column is yyyy-MM-dd).
So we have a simple PySpark script which uses an insert query (the same as the Hive query used earlier) to ingest the data via the spark.sql() command. But we have some serious performance issues: after ingestion the table has around 3000 partitions, each partition holding around 4 MB of data except for the last one, which is around 4 GB. The total table size is nearly 15 GB. Also, after ingestion each partition has 217 files. The final table is a snappy-compressed Parquet table.
The source work table has a single 15 GB file with a filename in the format customers_tbl_unload.dat.
Earlier, when we were using the Hive query through a beeline connection, it usually took around 25-30 minutes to finish. Now, with the PySpark script, it takes around 3 hours.
How can we tune Spark to bring the ingestion time below what it took through beeline?
The configuration of the YARN queue we use is:
Used Resources: <memory:5117184, vCores:627>
Demand Resources: <memory:5120000, vCores:1000>
AM Used Resources: <memory:163072, vCores:45>
AM Max Resources: <memory:2560000, vCores:500>
Num Active Applications: 45
Num Pending Applications: 45
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:5120000, vCores:1000>
Reserved Resources: <memory:0, vCores:0>
Max Running Applications: 200
Steady Fair Share: <memory:5120000, vCores:474>
Instantaneous Fair Share: <memory:5120000, vCores:1000>
Preemptable: true
The parameters passed to the PySpark script are:
num-executors=50
executor-cores=5
executor-memory=10GB
PySpark code used:
insert_stmt = """INSERT INTO final_db.customers_tbl PARTITION(load_date)
SELECT col_1,col_2,...,load_date FROM work_db.customer_tbl"""
spark.sql(insert_stmt)
Even while using nearly 10% of the YARN queue's resources, the job takes this long. How can we tune the job to make it more efficient?
You need to re-analyze your dataset and check whether partitioning on the date column is the right approach, or whether you should perhaps be partitioning on year instead.
To understand why you end up with 200-plus files for each partition, you need to understand the difference between Spark partitions and Hive partitions.
A direct approach you should try first is to read your input dataset as a dataframe, partition it by the key you are planning to use as the partition key in Hive, and then save it using df.write.partitionBy (see the sketch below).
Since the data also seems to be skewed on the date column, try partitioning it on additional columns which might give a more even distribution of data. Otherwise, filter out the skewed data and process it separately.
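A minimal sketch of that dataframe approach (table and column names are taken from the question; the write path is a placeholder for the external table's location):
# Read the non-partitioned source table as a dataframe.
df = spark.table("work_db.customer_tbl")
(
df
# All rows for a given load_date land in the same shuffle partition, so each
# Hive partition is written as one file instead of ~217 small ones.
.repartition("load_date")
.write
.mode("overwrite")
.option("compression", "snappy")
.partitionBy("load_date")
.parquet("/path/to/final_db.db/customer_tbl")  # placeholder location
)
Since the files go directly to the external table's location, the new partitions still have to be registered in the metastore afterwards (for example with MSCK REPAIR TABLE final_db.customer_tbl), and the skewed ~4 GB load_date will still be written by a single task.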

Spark (pyspark) speed test

I am connected via JDBC to a DB having 500,000,000 rows and 14 columns.
Here is the code used:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
properties = {'jdbcurl': 'jdbc:db:XXXXXXXXX','user': 'XXXXXXXXX', 'password': 'XXXXXXXXX'}
data = spark.read.jdbc(properties['jdbcurl'], table='XXXXXXXXX', properties=properties)
data.show()
The code above took 9 seconds to display the first 20 rows of the DB.
Later I created a SQL temporary view via
data[['XXX','YYY']].createOrReplaceTempView("ZZZ")
and I ran the following query:
sqlContext.sql('SELECT AVG(XXX) FROM ZZZ').show()
The code above took 1355.79 seconds (circa 23 minutes). Is this ok? It seems to be a large amount of time.
In the end I tried to count the number of rows of the DB
sqlContext.sql('SELECT COUNT(*) FROM ZZZ').show()
It took 2848.95 seconds (circa 48 minutes).
Am I doing something wrong or are these amounts standard?
When you read a JDBC source with this method you lose parallelism, the main advantage of Spark. Please read the official Spark JDBC guidelines, especially regarding partitionColumn, lowerBound, upperBound and numPartitions. This allows Spark to run multiple JDBC queries in parallel, resulting in a partitioned dataframe (see the sketch below).
Also, tuning the fetchsize parameter may help for some databases.
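A minimal sketch of such a partitioned read; the partition column, bounds and counts below are placeholders, and partitionColumn needs to be a numeric, date or timestamp column:
# Spark issues numPartitions parallel queries, splitting the
# lowerBound..upperBound range on partitionColumn.
data = (spark.read.format("jdbc")
        .option("url", "jdbc:db:XXXXXXXXX")
        .option("dbtable", "XXXXXXXXX")
        .option("user", "XXXXXXXXX")
        .option("password", "XXXXXXXXX")
        .option("partitionColumn", "id")   # placeholder column
        .option("lowerBound", "1")
        .option("upperBound", "500000000")
        .option("numPartitions", "32")
        .option("fetchsize", "10000")      # rows fetched per round trip
        .load())
Note that the bounds only control how the ranges are split across tasks; rows outside them are still read.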

Apache Ignite takes forever to save values on Spark

I am using Apache Ignite with Spark to save results from Spark; however, when I execute saveValues, it takes a very long time and the computer's CPU and fan speed go insane. I have a 3 GHz CPU and 16 GB of memory.
I have an RDD into which I map the final DataFrame:
val visitsAggregatedRdd :RDD[VisitorsSchema] = aggregatedVenuesDf.rdd.map(....)
println("COUNT: " + visitsAggregatedRdd.count().toString())
visitsCache.saveValues(visitsAggregatedRdd)
The total count of rows is 71, which means Spark has already finished processing the data, and it is quite small: 71 rows, each one a small object with a few numbers and very short strings. So why is visitsCache.saveValues taking this seemingly infinite amount of time and processing?
It turned out to be a problem with the Spark DataFrame's partitions: Spark was saving those 71 rows across 6000 partitions! A simple solution is to reduce the number of partitions before saving to Ignite:
df = df.coalesce(1)

increasing number of partitions in spark

I was using Hive for executing SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables using this configuration, with a certain date column as the partition.
Now I wanted to use Spark SQL to benchmark the same queries operating on the same data.
I executed the following query
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatkye group by col1,col2,col3")
In Hive this query takes 90 seconds. But Spark takes 21 minutes for the same query, and on looking at the job, I found the issue was that Spark creates 2 stages, and in the first stage it has only 7 tasks, one for each of the 7 blocks of data within the given partition of the ORC file. The blocks are of different sizes, one is 5 MB while another is 45 MB, and because of this the stragglers take more time, making the whole job take too long.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and thus the number of tasks in stage 1, even though there are only 7 physical blocks for the given range of the query?
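Not from the original post, but two knobs that are sometimes tried for this kind of scan skew, assuming a Spark 2.x+ setup that uses the native ORC reader (they do not apply when the table is read through the Hive SerDe, and whether they help depends on the ORC stripe layout):
# Sketch only; values are illustrative, not tuned recommendations.
# Cap the input split size so large blocks are split into more scan tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(16 * 1024 * 1024))  # 16 MB
# More tasks for the shuffle/aggregation stage.
spark.conf.set("spark.sql.shuffle.partitions", "200")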
