How to partition and write DataFrame in Spark without deleting partitions with no new data? - apache-spark

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this:
dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path)
As mentioned in this question, partitionBy will delete the full existing hierarchy of partitions at path and replace them with the partitions in dataFrame. Since new incremental data for a particular day will come in periodically, what I want is to replace only those partitions in the hierarchy that dataFrame has data for, leaving the others untouched.
To do this it appears I need to save each partition individually using its full path, something like this:
singlePartition.write.mode(SaveMode.Overwrite).parquet(path + "/eventdate=2017-01-01/hour=0/processtime=1234567890")
However I'm having trouble understanding the best way to organize the data into single-partition DataFrames so that I can write them out using their full path. One idea was something like:
dataFrame.repartition("eventdate", "hour", "processtime").foreachPartition ...
But foreachPartition operates on an Iterator[Row] which is not ideal for writing out to Parquet format.
I also considered using a select...distinct eventdate, hour, processtime to obtain the list of partitions, and then filtering the original data frame by each of those partitions and saving the results to their full partitioned path. But the distinct query plus a filter for each partition doesn't seem very efficient since it would be a lot of filter/write operations.
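For concreteness, that approach would look roughly like the sketch below (assuming a SparkSession named spark, that the partition values compare cleanly as strings, and that the list of distinct partition tuples is small enough to collect):
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Collect the distinct partition tuples (assumed to be a small list).
val partitions = dataFrame
  .select("eventdate", "hour", "processtime")
  .distinct()
  .collect()

// Write each partition's slice of the DataFrame to its full partition path.
partitions.foreach { row =>
  val Seq(eventdate, hour, processtime) = row.toSeq.map(_.toString)
  dataFrame
    .filter($"eventdate" === eventdate && $"hour" === hour && $"processtime" === processtime)
    .write
    .mode(SaveMode.Overwrite)
    .parquet(s"$path/eventdate=$eventdate/hour=$hour/processtime=$processtime")
}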
I'm hoping there's a cleaner way to preserve existing partitions for which dataFrame has no data?
Thanks for reading.
Spark version: 2.1

This is an old topic, but I was having the same problem and found another solution: just set your partition overwrite mode to dynamic by using:
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
So, my spark session is configured like this:
spark = SparkSession.builder.appName('AppName').getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
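With that setting in place (available since Spark 2.3), an overwrite write like the one in the original question should replace only the partitions present in the DataFrame. A sketch in the question's Scala style (I have not run this exact snippet):
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dataFrame.write
  .mode(SaveMode.Overwrite)
  .partitionBy("eventdate", "hour", "processtime")
  .parquet(path)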

The mode option Append has a catch!
df.write.partitionBy("y","m","d")
.mode(SaveMode.Append)
.parquet("/data/hive/warehouse/mydbname.db/" + tableName)
I've tested this and saw that it will keep the existing partition files. However, the problem this time is the following: if you run the same code twice (with the same data), it will create new parquet files instead of replacing the existing ones for the same data (Spark 1.6). So, instead of using Append, we can still solve this problem with Overwrite. Instead of overwriting at the table level, we should overwrite at the partition level.
df.write.mode(SaveMode.Overwrite)
.parquet("/data/hive/warehouse/mydbname.db/" + tableName + "/y=" + year + "/m=" + month + "/d=" + day)
See the following link for more information:
Overwrite specific partitions in spark dataframe write method
(I've updated my reply after suriyanto's comment. Thnx.)

I know this is very old. As I cannot see any solution posted, I will go ahead and post one. This approach assumes you have a Hive table over the directory you want to write to.
One way to deal with this problem is to create a temp view from dataFrame, which should be added to the table, and then use a normal Hive-style insert overwrite table ... command:
dataFrame.createOrReplaceTempView("temp_view")
spark.sql("insert overwrite table table_name partition ('eventdate', 'hour', 'processtime')select * from temp_view")
It preserves old partitions while (over)writing to only new partitions.
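If the Hive table does not exist yet, a rough sketch of creating it over the existing directory could look like the following (the column list, path, and the dynamic-partition setting are assumptions for illustration, not requirements taken from the answer above):
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS table_name (value STRING)
  PARTITIONED BY (eventdate STRING, hour INT, processtime BIGINT)
  STORED AS PARQUET
  LOCATION '/path/to/existing/data'
""")
spark.sql("MSCK REPAIR TABLE table_name")                      // register partitions already on disk
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")  // may be needed for the dynamic insert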

Related

Spark - Stream kafka to file that changes every day?

I have a Kafka stream that I will be processing in Spark. I want to write the output of this stream to a file. However, I want to partition these files by day, so every day it will start writing to a new file. Can something like this be done? I want this to be left running, and when a new day occurs, it will switch to writing to a new file.
val streamInputDf = spark.readStream.format("kafka")
.option("kafka.bootstrapservers", "XXXX")
.option("subscribe", "XXXX")
.load()
val streamSelectDf = streamInputDf.select(...)
streamSelectDf.writeStream.format("parquet")
.option("path", "xxx")
???
Partitioning output from Spark can be done with partitionBy, provided by DataFrameWriter for non-streamed data or by DataStreamWriter for streamed data.
Below are the signatures:
public DataFrameWriter<T> partitionBy(scala.collection.Seq<String> colNames)
DataStreamWriter<T> partitionBy(scala.collection.Seq<String> colNames)
DataStreamWriter<T> partitionBy(String... colNames)
Each of these partitions the output by the given columns on the file system.
Description:
partitionBy
public DataStreamWriter<T> partitionBy(String... colNames)
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
Parameters: colNames. Since: 2.0.0
So if you partition the data by year, month and day, Spark will save it to folders like:
year=2019/month=01/day=05
year=2019/month=02/day=05
Option 1 (Direct write):
You mentioned Parquet - you can save in Parquet format with:
df.write.partitionBy('year', 'month','day').format("parquet").save(path)
Option 2 (insert into Hive using the same partitionBy):
You can also insert into a Hive table like:
df.write.partitionBy('year', 'month', 'day').insertInto(tableName)
Getting all Hive partitions:
Spark SQL supports the Hive query syntax, so you can use SHOW PARTITIONS to get the list of partitions in a specific table.
sparkSession.sql("SHOW PARTITIONS partitionedHiveParquetTable")
Conclusion:
I would suggest option 2, since the advantage is that later you can query the data by partition (i.e., query the raw data to know what you have received), and the underlying files can be Parquet or ORC.
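For example, once the data sits in a partitioned table you can prune directly on the partition columns (table and column names follow the example above):
sparkSession.sql("SELECT count(*) FROM partitionedHiveParquetTable WHERE year = 2019 AND month = 1").show()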
Note:
Just make sure you have .enableHiveSupport() when you are creating the session with SparkSession.builder, and also make sure hive-site.xml etc. are configured properly.
Based on this answer, Spark should be able to write to a folder based on the year, month and day, which seems to be exactly what you are looking for. I have not tried it in Spark Streaming, but hopefully this example gets you on the right track:
df.write.partitionBy("year", "month", "day").format("parquet").save(outPath)
If not, you might be able to put in a variable filepath based on current_date()
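For example, an untested sketch of the streaming write, assuming the Kafka timestamp column is kept in the select and using a hypothetical checkpoint path (structured streaming requires one):
import org.apache.spark.sql.functions._

// Derive a date column from the Kafka message timestamp and partition the sink by it,
// so each day's data lands in its own folder.
val streamWithDate = streamSelectDf.withColumn("eventdate", to_date(col("timestamp")))

val query = streamWithDate.writeStream
  .format("parquet")
  .option("path", "xxx")
  .option("checkpointLocation", "xxx/_checkpoints")
  .partitionBy("eventdate")
  .start()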

Overwrite only some partitions in a partitioned spark Dataset

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job and only overwriting last week of data.
Default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written.
Since Spark 2.3.0 this is an option when overwriting a table. To overwrite it, you need to set the new spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode overwrite.
Example in scala:
spark.conf.set(
"spark.sql.sources.partitionOverwriteMode", "dynamic"
)
data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
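A sketch of that pre-2.3 pattern, with a hypothetical partition spec:
spark.sql("ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (dt = '2019-01-01')")
data.write.mode("append").insertInto("partitioned_table")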
Just FYI, for PySpark users: make sure to set overwrite=True in insertInto, otherwise the mode will be changed to append.
from the source code:
def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode("overwrite" if overwrite else "append").insertInto(tableName)
This is how to use it:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
data.write.insertInto("partitioned_table", overwrite=True)
Or the SQL version works fine:
INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement
For the documentation, look here.
This also works for me, since it is easier and more straightforward:
df.write.partitionBy('dt').mode('overwrite').format('parquet').option(
"partitionOverwriteMode", "dynamic").save(path)
Source: https://kontext.tech/article/1067/spark-dynamic-and-static-partition-overwrite

Need less parquet files

I am doing the following process
rdd.toDF.write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
However, under each partition there are too many parquet files and each of them is very small, which makes loading all the parquet files in my following steps very slow. Is there a better way to make fewer parquet files under each partition and increase the single parquet file size?
You can repartition before saving:
rdd.toDF.repartition(col("Some Column")).write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
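If you also want to cap the size of individual output files, a related knob (my addition, not part of the answer above; available since Spark 2.2 as far as I know) is the maxRecordsPerFile write option:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

rdd.toDF
  .repartition(col("Some Column"))
  .write
  .mode(SaveMode.Append)
  .option("maxRecordsPerFile", 1000000L)  // limit rows per output file
  .partitionBy("Some Column")
  .parquet(output_path)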
I used to have this problem.
Actually you can't directly control how many files end up in each partition, because it depends on how the executors write the data.
The way to work around it is to use coalesce to reduce the number of output partitions; you can choose how many partitions you want, but it's not an efficient approach and you also need enough memory to handle this operation.
df.coalesce(numPartitions).write.partitionBy("yyyyy").parquet("xxxx")
I also faced this issue. The problem is that if you use coalesce, each partition gets the same number of parquet files. Since different partitions have different sizes, ideally I would need a different coalesce setting for each partition.
It's going to be really quite expensive if you open a lot of small files, say 1k files whose sizes are far from the value of your parquet.block.size.
Here are my suggestions:
Create a job that will first merge your input parquet files into a smaller number of files whose sizes are near or equal to parquet.block.size. The default block size is 128 MB, though it's configurable by updating parquet.block.size. Spark works best when your parquet files are close to or equal to the value of parquet.block.size. The block size is the size of a row group being buffered in memory.
Or update your Spark job to read only a limited number of files.
Or, if you have a big machine and/or resources, just do the right tuning.
Hive queries have a way to merge small files into larger ones. This is not available in Spark SQL. Also, reducing spark.sql.shuffle.partitions won't help with the DataFrame API.
I tried the solution below and it generated a smaller number of parquet files (from 800 parquet files down to 29).
Suppose the data is loaded to a dataframe df
Create a temporary table in hive.
df.createOrReplaceTempView("tempTable")
spark.sql("CREATE TABLE test_temp LIKE test")
spark.sql("INSERT INTO TABLE test_temp SELECT * FROM tempTable")
The test_temp will contain small parquet files.
Populate final hive table from temporary table
spark.sql("INSERT INTO test SELECT * FROM test_temp")
The final table will contain fewer files. Drop the temporary table after populating the final table.
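For completeness, the cleanup step (table name follows the example above):
spark.sql("DROP TABLE IF EXISTS test_temp")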

Spark dataframe saveAsTable vs save

I am using Spark 1.6.1 and I am trying to save a dataframe in ORC format.
The problem I am facing is that the save method is very slow, taking about 6 minutes for a 50 MB ORC file on each executor.
This is how I am saving the dataframe
dt.write.format("orc").mode("append").partitionBy("dt").save(path)
I tried using saveAsTable to a Hive table which also uses the ORC format, and that seems to be about 20% to 50% faster, but this method has its own problems - it seems that when a task fails, retries will always fail because the file already exists.
This is how I am saving the dataframe
dt.write.format("orc").mode("append").partitionBy("dt").saveAsTable(tableName)
Is there a reason save method is so slow?
Am I doing something wrong?
The problem is due to the partitionBy method. partitionBy reads the values of the specified column and then segregates the data for every value of the partition column.
Try saving it without partitionBy; there would be a significant performance difference.
See my previous comments above regarding cardinality and partitionBy.
If you really want to partition it, and it's just one 50MB file, then use something like
dt.write.format("orc").mode("append").repartition(4).saveAsTable(tableName)
repartition will create 4 roughly even partitions, rather than partitioning on the dt column, which could end up writing a lot of ORC files.
The choice of 4 partitions is a bit arbitrary. You're not going to get much performance/parallelizing benefit from partitioning tiny files like that. The overhead of reading more files is not worth it.
Use save() to save at a particular location, for example some blob storage location.
Use saveAsTable() to save the dataframe as a Spark SQL table.
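A minimal illustration of the difference, with hypothetical path and table names: save() only writes files to a location, while saveAsTable() also registers the result in the metastore so it can be queried by name later.
dt.write.format("orc").mode("append").save("/mnt/blob/output/orc")
dt.write.format("orc").mode("append").saveAsTable("mydb.my_table")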

Spark - Saving data to Parquet file in case of dynamic schema

I have a JavaPairRDD of the following typing:
Tuple2<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>>
that denotes the following object:
(Table_name, Iterable(Tuple_ID, Iterable(Column_name, Column_value)))
This means each record in the RDD will create one Parquet file.
The idea is, as you may have guessed, to save each object as a new Parquet table called Table_name. In this table, there is one column called ID that stores the value Tuple_ID, and each column Column_name stores the value Column_value.
The challenge I'm facing is that the table's columns (the schema) are collected on the fly on runtime, AND, as it is not possible to create nested RDDs in Spark, I can't create an RDD within the previous RDD (for each record) and save it finally to a Parquet file --after converting it to a DataFrame of course.
And I can't just convert the previous RDD to a DataFrame, for the obvious reason (need to iterate to get column/value).
As a temporary workaround, I flattened the RDD into a list of the same type as the RDD using collect(), but this is not the proper way, as the data could be larger than the available memory on the driver machine, causing an out-of-memory error.
Any advice on how to achieve this? please let me know if the question is not clear enough.
Take a look at the answer to this question: Writing RDD partitions to individual parquet files in its own directory. I used that answer to create separate (one or more) parquet files for each partition. I believe you can use the same technique to create separate files, each with a different schema, if you like.
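A rough sketch of that idea in Scala, assuming the number of distinct table names is small and the data has been flattened into a DataFrame rowsDf with columns (table_name, tuple_id, column_name, column_value) - all of these names are hypothetical stand-ins for the RDD described in the question:
import org.apache.spark.sql.functions._

// Only the table names (not the data) are collected to the driver.
val tableNames = rowsDf.select("table_name").distinct().collect().map(_.getString(0))

tableNames.foreach { table =>
  rowsDf
    .filter(col("table_name") === table)
    .groupBy("tuple_id")
    .pivot("column_name")            // schema is discovered at runtime
    .agg(first("column_value"))
    .write
    .mode("overwrite")
    .parquet(s"/output/$table")      // one Parquet table per table_name
}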
