Saving Spark DataFrame to Parquet partitioned by Date - apache-spark

I have a huge dataframe with several columns, one of which is call_date (DateType). I want to save that dataframe to Parquet on S3 and partition it by this call_date column. This will be the initial load for our project (containing historical data); afterwards in production, once a day finishes, it should add a new partition without deleting the older ones.
When I omit the .partitionBy method, the job finishes in 12 minutes. Action example:
allDataDF.write.mode("overwrite").parquet(resultPath)
On the other hand when I do this:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
allDataDF.write.mode("overwrite").partitionBy("call_date").parquet(resultPath)
the job doesn't finish even in 30 minutes. I am not doing any repartition before partitionBy, so I guessed the speed should be somewhat similar, as every executor should save its own partition to the specific date. What am I missing here?
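One commonly suggested variant (a sketch, not part of the original question) is to repartition on the partition column before the write, so that each date's rows land in a single task and each task writes into only one partition directory:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# repartition("call_date") co-locates each date's rows in one task,
# so a task writes files only into its own call_date directory
allDataDF.repartition("call_date") \
    .write.mode("overwrite") \
    .partitionBy("call_date") \
    .parquet(resultPath)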

Related

Spark Job stuck writing dataframe to partitioned Delta table

Running Databricks to read CSV files and then save them as a partitioned Delta table.
The file has 179,619,219 records in total. It is being split on column A (8,419 unique values), year (10 years), and month.
df.write.partitionBy("A", "year", "month").format("delta") \
    .mode("append").save(path)
The job gets stuck on the write step and aborts after running for 5-6 hours.
This is a very bad partitioning scheme. You simply have too many unique values for column A, and the additional partitioning creates even more partitions. Spark will need to create at least 90k partitions, and this requires writing a separate (small) file for each, etc. Small files harm performance.
For non-Delta tables, partitioning is primarily used to perform data skipping when reading data. But for Delta Lake tables, partitioning may not be so important, as Delta on Databricks includes things like data skipping, the ability to apply ZORDER, etc.
I would recommend using a different partitioning scheme, for example year + month only, and running OPTIMIZE with ZORDER on column A after the data is written. This will lead to only a few partitions with bigger files.
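A rough sketch of that recommendation, reusing the df and path from the question (the OPTIMIZE syntax assumes Databricks Delta):
df.write.partitionBy("year", "month").format("delta") \
    .mode("append").save(path)
# compact the small files and co-locate rows on column A for data skipping
spark.sql("OPTIMIZE delta.`{0}` ZORDER BY (A)".format(path))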

How to merge partitions in HDFS?

Assume I have a partitioned table in HDFS that gets new data all the time. New data is partitioned by day by default, while all of the older files are partitioned by month. How can I merge partitions so that, in this example, all the daily partitions that arrived in the last month become a single month partition? Is there a way to repartition only some of the table's partitions? I'd like to repartition only some of my partitions, so that only partitions that are small enough get merged.
Also, is it even possible to merge partitions, or should I read them, delete them, and write them again as one partition? I'm thinking of something like concatenating the files.
I'd like to know the best way to merge only some partitions of a table.
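For what it's worth, the usual approach is to read, coalesce, and rewrite the affected partitions. A hypothetical sketch (paths, layout, and column names are assumptions, not from the question):
# read one month's daily partitions; partition discovery adds a `day` column
# from the day=* subdirectories
daily = spark.read.parquet("/data/table/year=2023/month=01")
# drop the day column and rewrite the month as one compacted partition
# (writing to a staging path and swapping directories afterwards is safer)
daily.drop("day").coalesce(1) \
    .write.mode("overwrite") \
    .parquet("/staging/table/year=2023/month=01")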

how to speed up saving partitioned data with only one partition?

The Spark save operation is quite slow when:
the dataframe df is partitioned by date (year, month, day), and df contains data from exactly one day, say 2019-02-14.
If I save the df by:
df.write.partitionBy("year", "month", "day").parquet("/path/")
It will be slow because all the data belongs to one partition, which is (apparently) processed by one task.
If saving df with explicit partition path:
df.write.parquet("/path/year=2019/month=02/day=14/")
It works well, but it creates the _metadata, _common_metadata and _SUCCESS files in "/path/year=2019/month=02/day=14/"
instead of "/path/". Dropping the partition columns is required to keep the same fields as when using partitionBy.
So, how can I speed up saving data with only one partition, without changing the location of the metadata files, so that they can still be updated on each operation?
Is it safe to use an explicit partition path instead of partitionBy?
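For reference, the explicit-path variant described above would look roughly like this (a sketch; the drop keeps the schema identical to what a partitionBy write produces):
df.drop("year", "month", "day") \
    .write.mode("overwrite") \
    .parquet("/path/year=2019/month=02/day=14/")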

Repartition to avoid large number of small files

Currently I have an ETL job that reads a few tables, performs certain transformations, and writes them back to the daily table.
I use the following query in Spark SQL:
"INSERT INTO dbname.tablename PARTITION(year_month)
SELECT * from Spark_temp_table "
The target table into which all these records are inserted is partitioned at the year x month level. The number of records generated daily is not that large, hence I am partitioning at the year x month level.
However, when I check the partition, it has a small ~50 MB file for each day my code runs (the code has to run daily), and eventually I will end up with around 30 files in my partition totalling ~1500 MB.
I want to know if there is a way for me to create just one file (or maybe 2-3 files, as per block-size restrictions) in one partition while appending my records on a daily basis.
The way I think I can do it is to read everything from the concerned partition into my Spark dataframe, append the latest records, and repartition it before writing back. How do I ensure I only read data from the concerned partition, and that only that partition is overwritten with a smaller number of files?
You can use the DISTRIBUTE BY clause to control how the records are distributed across files inside each partition.
To have a single file per partition, you can use DISTRIBUTE BY year, month.
To have 3 files per partition, you can use DISTRIBUTE BY year, month, day % 3.
The full query:
INSERT INTO dbname.tablename
PARTITION(year_month)
SELECT * from Spark_temp_table
DISTRIBUTE BY year, month, day % 3
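A roughly equivalent sketch using the DataFrame API instead of SQL (column and table names follow the question; the modulo expression is the same trick for capping files per partition at three):
from pyspark.sql import functions as F

df.repartition(F.col("year"), F.col("month"), F.col("day") % 3) \
    .write.mode("append") \
    .insertInto("dbname.tablename")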

Writing Spark Dataframe directly to HIVE is taking too much time

I am writing two dataframes from Spark directly to Hive using PySpark. The first df has only one row and 7 columns. The second df has 20M rows and 20 columns. It took 10 minutes to write the first df (1 row) and around 30 minutes to write 1M rows of the second df. I don't know how long it would take to write the entire 20M; I killed the code before it could complete.
I have tried two approaches to writing the df. I also cached the df to see if it would make the write faster, but it didn't seem to have any effect:
df_log.write.mode("append").insertInto("project_alpha.sends_log_test")
2nd Method
#df_log.registerTempTable("temp2")
#df_log.createOrReplaceTempView("temp2")
sqlContext.sql("insert into table project_alpha.sends_log_test select * from temp2")
In the 2nd approach I tried both registerTempTable() and createOrReplaceTempView(), but there was no difference in run time.
Is there a way to write it faster or more efficiently? Thanks.
Are you sure the final tables are cached? The issue might be that Spark recomputes the whole pipeline before writing the data. You can check that in the terminal/console where Spark runs.
Also, please check that the table you append to in Hive is not a temporary view; in that case the issue could be recalculating the view before appending new rows.
When I write data to Hive I always use:
df.write.saveAsTable('schema.table', mode='overwrite')
Please try:
df.write.saveAsTable('schema.table', mode='append')
It's a bad idea (or design) to do an insert into a Hive table. You should save it as a file and create a table on top of it, or add it as a partition to an existing table.
Can you please try that route?
Try repartitioning to a smaller number of partitions, say .repartition(2000), and then write to Hive. A large number of partitions in Spark sometimes takes a long time to write.
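Applying that suggestion to the question's first approach would look roughly like this (2000 is the answer's example figure, not a tuned value):
# reduce the number of write tasks/output files before inserting into Hive
df_log.repartition(2000) \
    .write.mode("append") \
    .insertInto("project_alpha.sends_log_test")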
