Pyspark insert overwrite with dynamic partition is very slow

Pyspark insert overwrite with dynamic partition is very slow - apache-spark

I am reading a 60gb sized csv file using pyspark, doing few basic transformations and loading it into hive dynamic partition table. Hdfs block size is 128mb, so 400+ partitions are created in spark. Transformation is completing in few minutes. But while loading it's taking nearly an hour. Hive execution load is on tez. Tried to load the unpartitioned table, taking less than 4 minutes. How can i improve the performance in this scenario?
I'm using hive warehouse connector.

Related

Spark SQL data storage life cycle

I recently had a issue with with one of my spark jobs, where I was reading a hive table having several billion records, that resulted in job failure due to high disk utilization, But after adding AWS EBS volume, the job ran without any issues. Although it resolved the issue, I have few doubts, I tried doing some research but couldn't find any clear answers. So my question is?
when a spark SQL reads a hive table, where the data is stored for processing initially and what is the entire life cycle of data in terms of its storage , if I didn't explicitly specify anything? And How adding EBS volumes solves the issue?

Spark will read the data, if it does not fit in memory, it will spill it out on disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion will start as soon as you read(), it is not part of the DAG, you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In latest versions of Spark, you can push down the filter (for example if you filter right after the ingestion, Spark will know and optimize the ingestion), I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are seen/accessible from the cluster where the executors/workers are running, vs. the node where the driver is running.

Initially the data is in table location in HDFS/S3/etc. Spark spills data on local storage if it does not fit in memory.
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.

Whenever spark reads data from hive tables, it stores it in RDD. One point i want to make clear here is hive is just a warehouse so it is like a layer which is above HDFS, when spark interacts with hive , hive provides the spark the location where the hdfs loaction exists.
Thus, Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop (whatever the InputFormat used to read this file. ex: if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (note:the split between partitions would be done on line split, not the exact block split), unless you have a compressed file format like Avro/parquet.
If you manually add rdd.repartition(x) it would perform a shuffle of the data from N partititons you have in rdd to x partitions you want to have, partitioning would be done on round robin basis.
If you have a 10GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (256MB) it would be stored in 40blocks, which means that the RDD you read from this file would have 40partitions. When you call repartition(1000) your RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000 partitions only when you will execute an action on top of this RDD (lazy execution concept)
Now its all up to spark that how it will process the data as Spark is doing lazy evaluation , before doing the processing, spark prepare a DAG for optimal processing. One more point spark need configuration for driver memory, no of cores , no of executors etc and if the configuration is inappropriate the job will fail.
Once it prepare the DAG , then it start processing the data. So it divide your job into stages and stages into tasks. Each task will further use specific executors, shuffle , partitioning. So in your case when you do processing of bilions of records may be your configuration is not adequate for the processing. One more point when we say spark load the data in RDD/Dataframe , its managed by spark, there are option to keep the data in memory/disk/memory only etc ref -storage_spark.
Briefly,
Hive-->HDFS--->SPARK>>RDD(Storage depends as its a lazy evaluation).
you may refer the following link : Spark RDD - is partition(s) always in RAM?

Performance tuning for PySpark data load from a non-partitioned hive table to a partitioned hive table

We have a requirement to ingest data from a non-partitioned EXTERNAL hive table work_db.customer_tbl to a partitioned EXTERNAL hive table final_db.customer_tbl through PySpark, previously done through hive query. The final table is partitioned by the column load_date (format of load_date column is yyyy-MM-dd).
So we have a simple PySpark script which uses an insert query (same as the hive query which was used earlier), to ingest the data using spark.sql() command. But we have some serious performance issues because the table we are trying to ingest after ingestion has around 3000 partitions and each partitions has around 4 MB of data except for the last partition which is around 4GB. Total table size is nearly 15GB. Also, after ingestion each partition has 217 files. The final table is a snappy compressed parquet table.
The source work table has a single 15 GB file with filename in the format customers_tbl_unload.dat.
Earlier when we were using the hive query through a beeline connection it usually takes around 25-30 minutes to finish. Now when we are trying to use the PySpark script it is taking around 3 hours to finish.
How can we tune the spark performance to make the ingestion time less than what it took for beeline.
The configurations of the yarn queue we use is:
Used Resources: <memory:5117184, vCores:627>
Demand Resources: <memory:5120000, vCores:1000>
AM Used Resources: <memory:163072, vCores:45>
AM Max Resources: <memory:2560000, vCores:500>
Num Active Applications: 45
Num Pending Applications: 45
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:5120000, vCores:1000>
Reserved Resources: <memory:0, vCores:0>
Max Running Applications: 200
Steady Fair Share: <memory:5120000, vCores:474>
Instantaneous Fair Share: <memory:5120000, vCores:1000>
Preemptable: true
The parameters passed to the PySpark script is:
num-executors=50
executor-cores=5
executor-memory=10GB
PySpark code used:
insert_stmt = """INSERT INTO final_db.customers_tbl PARTITION(load_date)
SELECT col_1,col_2,...,load_date FROM work_db.customer_tbl"""
spark.sql(insert_stmt)
Even after nearly using 10% resources of the yarn queue the job is taking so much time. How can we tune the job to make it more efficient.

You need to reanalyze your dataset and look if you are using the correct approach by partitioning yoir dataset on date column or should you be probably partitioning on year?
To understand why you end up with 200 plus files for each partition, you need to understand the difference between the Spark and Hive partitions.
A direct approach you should try first is to read your input dataset as a dataframe and partition it by the key you are planning to use as a partition key in Hive and then save it using df.write.partitionBy
Since the data seems to be skewed too on date column, try partitioning it on additional columns which might have equal distribution of data. Else, filter out the skewed data and process it separately

Spark hbase bulk load generates more than 15X data

I have Spark dataframe with just 2 columns like { Key| Value}. And this dataframe has 10 million records. I am inserting this into HBase table (has 10 pre-split regions) using bulk load approach from Spark. This works fine and loads the data successfully. When I checked the size table it was like 151GB (453 gb with 3x hadoop replication). I ran major compaction on that table, and table size got reduced to 35GB (105gb with 3x replication).
I am trying to run the same code and same data in a different cluster. But here I have quota limitation of 2TB to my namespace. My process fails while loading HFiles to HBase saying its quota limit exceeded.
I would like to know whether Spark creates much more data files than the required 151GB during the bulk load? If so how to avoid that? or is there better approach to load the same?
The question is that if actual data is around 151gb (before major_compact), then why 2TB size is not enough?

Alter table to add partition taking long time on Hive external table

I'm trying to execute a spark job through EMR cluster with 6 nodes(8 cores and 56GB memory on each node). Spark job does an incremental load on partitions on Hive table and at the end it does a refresh table in order to update the metadata.
Refresh command takes as long as 3-6 hours to complete which is too long.
Nature of data in Hive:
27Gb of data located on S3.
Stored in parquet.
Partitioned on 2 columns.(ex: s3a//bucket-name/table/partCol1=1/partCol2=2020-10-12).
Note- Its a date wise partition and cannot be changed.
Spark config used:
Num-executors= 15
Executor-memory =16Gb
Executor-cores = 2
Driver-memory= 49Gb
Spark-shuffle-partitions=48
Hive.exec.dynamic.partition.mode=nonstrict
Spark.sql.sources.partitionOverwriteMode=dynamic.
Things tried:
Tuning the spark cores/memory/executors but no luck.
Refresh table command.
Alter table add partition command.
Hive cli taking 3-4 hours to complete MSCK repair table tablename
All the above had no effect on reducing the time to refresh the partition on Hive.
Some assumptions:
Am I missing any parameter in tuning as the data is stored in Amazon-S3.?
Currently number of partitions on table are close to 10k is this an issue.?
Any help will be much appreciated.

incase possible, make the partitions to 1 column. It kills when we have multi level (multi column partitions)
use R type instance. It provides more memory compared to M type instances at same price
use coalesce to merge the files in source if there are many small files.
Check the number of mapper tasks. The more the task, lesser the performance
use EMRFS rather than S3 to keep the metadata info
use below
{
"Classification": "spark",
"Properties": {
"maximizeResourceAllocation": "true"
}
}
Follow some of the instructions from below Link

How Spark SQL reads Parquet partitioned files

I have a parquet file of around 1 GB. Each data record is a reading from an IOT device which captures the energy consumed by the device in the last one minute.
Schema: houseId, deviceId, energy
The parquet file is partitioned on houseId and deviceId. A file contains the data for the last 24 hours only.
I want to execute some queries on the data residing in this parquet file using Spark SQL An example query finds out the average energy consumed per device for a given house in the last 24 hours.
Dataset<Row> df4 = ss.read().parquet("/readings.parquet");
df4.as(encoder).registerTempTable("deviceReadings");
ss.sql("Select avg(energy) from deviceReadings where houseId=3123).show();
The above code works well. I want to understand that how spark executes this query.
Does Spark read the whole Parquet file in memory from HDFS without looking at the query? (I don't believe this to be the case)
Does Spark load only the required partitions from HDFS as per the query?
What if there are multiple queries which need to be executed? Will Spark look at multiple queries while preparing an execution plan? One query may be working with just one partition whereas the second query may need all the partitions, so a consolidated plan shall load the whole file from disk in memory (if memory limits allow this).
Will it make a difference in execution time if I cache df4 dataframe above?

Does Spark read the whole Parquet file in memory from HDFS without looking at the query?
It shouldn't scan all data files, but it might in general, access metadata of all files.
Does Spark load only the required partitions from HDFS as per the query?
Yes, it does.
Does Spark load only the required partitions from HDFS as per the query?
It does not. Each query has its own execution plan.
Will it make a difference in execution time if I cache df4 dataframe above?
Yes, at least for now, it will make a difference - Caching dataframes while keeping partitions

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string