When writing to HDFS, how do I overwrite only the necessary folders? - apache-spark

So, I have this folder, let's call it /data.
And it has partitions in it, e.g.:
/data/partition1, /data/partition2.
I read new data from Kafka, and imagine I only need to update /data/partition2. I do:
dataFrame
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("date", "key")
  .option("header", "true")
  .format(format)
  .save("/data")
and it successfully updates /data/partition2, but /data/partition1 is gone... How can I force Spark's SaveMode.Overwrite not to touch HDFS partitions that don't need to be updated?

You are using SaveMode.Overwrite, which deletes the previously existing directories. You should instead use SaveMode.Append.
NOTE: The append operation is not without cost. When you call save in append mode, Spark needs to ensure uniqueness of the file names so that it won't overwrite an existing file by accident. The more files you already have in the directory, the longer the save operation takes. If you are talking about a handful of files, it's a very cost-effective operation. But if you have many terabytes of data in thousands of files in the original directory (which was my case), you should use a different approach.
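For reference, a minimal sketch of the append variant, mirroring the snippet from the question (same columns, format variable, and /data path):
import org.apache.spark.sql.SaveMode

// Append only adds new files under the partitions that receive new data;
// existing directories such as /data/partition1 are left untouched.
dataFrame
  .write
  .mode(SaveMode.Append)
  .partitionBy("date", "key")
  .option("header", "true")
  .format(format)
  .save("/data")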

Related

With Delta Lake, how to remove original files after compaction

Basically, I have a Spark Streaming job (with Delta) writing a small file to HDFS every 5 minutes. I also have a compaction job that runs every day to compact the previous day's data into a few big files (the number of files depends on the job's repartition number). The big files are in the same directory as the original small files. Is there any way to effectively remove the original small files, since they are useless?
I have tried the vacuum function for Delta tables, but that basically removes all data outside the retention period, regardless of whether it has been compacted or not.
Here's how I compact my data (I'm using Java):
spark.read()
    .format("delta")
    .load(path)                           // hdfs path of the data
    .where(whereCondition)                // my data is partitioned by date, so here it should be "date = '2021-06-29'"
    .repartition(repartitionNum)
    .write()
    .option("dataChange", "false")
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", whereCondition)
    .save(path);
It would be great if anyone could tell me:
whether I'm doing the compaction correctly;
how to remove the original small files, which shouldn't be referenced by Delta anymore.
Any comment is appreciated, thanks!
a) You may also consider using coalesce instead of repartition, since coalesce is more efficient: it avoids a full shuffle, although it can only decrease the number of files, whereas repartition can either decrease or increase it. In compaction we always need to decrease the number of files, so I believe coalesce will work better than repartition here (see the sketch after this answer).
b) If you are using Databricks, you may also consider using the OPTIMIZE command for compaction.
For deleting the old files, you need to use vacuum; there is no other way to do it.
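For reference, a hedged Scala sketch of the coalesce-based compaction from point a), followed by the vacuum step; path and whereCondition mirror the question's variables, while targetFileNum is a made-up name for the desired number of output files:
import io.delta.tables.DeltaTable

// Compaction with coalesce: unlike repartition, coalesce avoids a full
// shuffle when it only has to reduce the number of output files.
spark.read
  .format("delta")
  .load(path)                         // hdfs path of the Delta table
  .where(whereCondition)              // e.g. "date = '2021-06-29'"
  .coalesce(targetFileNum)
  .write
  .option("dataChange", "false")      // mark the rewrite as compaction only
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", whereCondition)
  .save(path)

// The old small files are no longer referenced by the Delta log, but they
// stay on disk until vacuum removes them once they fall outside the
// retention period (7 days by default).
DeltaTable.forPath(spark, path).vacuum()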

Why does the Apache Spark save function create a folder containing multiple sub-files?

When saving a Spark DataFrame, Spark writes multiple files inside a folder instead of only one file.
df.write.format("json") \
.option("header", "true") \
.save('data.json', mode='append')
When I run this code, data.json ends up being a folder name instead of a file name.
I want to know what the advantages of that are.
When you write a DataFrame or RDD, Spark uses the Hadoop API underneath.
The actual data that contains the result is in the part- files, which are created in the same number as the partitions of the DataFrame. If you have n partitions, it creates n part files.
The main advantage of multiple part files is that multiple workers can access and write them in parallel.
Other files, like _SUCCESS, indicate that the write completed successfully, and the .crc files are for the checksum.
Hope this helps you.
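If a single output file is really needed, a common workaround (sketched here in Scala; the PySpark calls are analogous) is to reduce the DataFrame to one partition before writing, at the cost of losing write parallelism:
// The number of part- files equals the number of partitions at write time.
println(df.rdd.getNumPartitions)

// One partition -> one part- file (plus the _SUCCESS and .crc files).
df.coalesce(1)
  .write
  .format("json")
  .mode("append")
  .save("data.json")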

How to load lots of files into one RDD in Spark

I use the saveAsTextFile method to save an RDD, but it is not saved as a single file; instead there are many part files, as shown in the picture.
So, my question is how to reload these files into one RDD.
My guess is that you are using Spark locally rather than in a distributed manner. When you use saveAsTextFile, it just saves the data using Hadoop's file writer, creating one file per RDD partition. If you want a single file, one thing you could do is coalesce to one partition before writing. But if you go up one folder you will find that the folder's name is the path you saved to, so you can simply call sc.textFile on that same path and it will pull everything back into partitions again (see the sketch below).
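A minimal sketch of that round trip, assuming the RDD was saved to /output:
// saveAsTextFile writes one part- file per partition under /output.
rdd.saveAsTextFile("/output")

// textFile on the directory reads every part- file back into a single RDD.
val reloaded = sc.textFile("/output")

// If you really want one partition, coalesce after loading.
val single = reloaded.coalesce(1)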
You know what? I just found a very elegant way:
say your files are all in the /output directory; just use the following command to merge them into one, and then you can easily reload it as one RDD:
hadoop fs -getmerge /output /local/file/path
Not a big deal, I'm Leifeng.

PySpark: Writing input files to separate output files without repartitioning

I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If for a single file (for example, 2012-06-01) I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit from using the cluster.
I tried reading in a chunk of files at once and using partitionBy to write the output to daily files like this (for example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read in different executors like I want, but the executors later die and the process fails. I believe that because the files are so large, and partitionBy is somehow using unnecessary resources (a shuffle?), the tasks are crashing.
I don't actually need to re-partition my dataframe since this is just a 1:1 mapping. Is there any way to make each individual task write to a separate, explicitly named parquet output file?
I was thinking something like
def write_file(date):
    # get input/output locations from date
    dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
    dataframe.write.parquet(output_location)

spark.sparkContext.parallelize(my_dates).foreach(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing an unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2 (see the sketch after this list).
Unpack the files to a temporary storage before submitting the job.
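For illustration, here is a rough sketch of the first two options in Scala (the PySpark calls are analogous); the schema, bucket, and file names below are placeholders standing in for the question's my_schema and s3://mybucket paths:
import org.apache.spark.sql.types._

// Placeholder schema standing in for the question's my_schema.
val mySchema = StructType(Seq(
  StructField("dayColumn", StringType),
  StructField("value", StringType)
))

// bzip2 is a splittable codec, so a single large file can be split across tasks.
val df = spark.read
  .schema(mySchema)
  .option("header", "true")
  .csv("s3://mybucket/input/20120601.csv.bz2")

// Parquet applies its own compression (snappy by default), so no extra
// output compression needs to be configured.
df.write.parquet("s3://mybucket/output/20120601")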
If you are concerned about the resources used by partitionBy (it might open a larger number of files for each executor thread), you can actually shuffle to improve performance - see DataFrame partitionBy to a single Parquet file (per partition). A single file per partition is probably too much, but
dataframe \
    .repartition(n, 'dayColumn', 'someOtherColumn') \
    .write.partitionBy('dayColumn') \
    .save(...)
where someOtherColumn can be chosen to get a reasonable cardinality, should improve things.

Do Parquet Metadata Files Need to be Rolled-back?

When Parquet data is written with partitioning on its date column, we get a directory structure like:
/data
    _common_metadata
    _metadata
    _SUCCESS
    /date=1
        part-r-xxx.gzip
        part-r-xxx.gzip
    /date=2
        part-r-xxx.gzip
        part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or a file browser, etc.), do any of the metadata files need to be rolled back to the state when there was only the partition date=1?
Or is it ok to delete partitions at will and rewrite them (or not) later?
If you're using the DataFrame API, there is no need to roll back the metadata files.
For example:
You can write your DataFrame to S3
df.write.partitionBy("date").parquet("s3n://bucket/folderPath")
Then, manually delete one of your partitions (the date=1 folder in S3) using an S3 browser (e.g. CloudBerry).
Now you can
Load your data and see that the data is still valid, except for the data you had in partition date=1:
sqlContext.read.parquet("s3n://bucket/folderPath").count
Or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode
df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
You can also take a look at this question from the Databricks forum.
