After I sorted all the entries and use write() function to S3, I want to re-load the data with exactly the same order and same partitions.
I tried to use read() and load() function but none of these work. Do we have a way to load the partitioned parquet files with same order and partitions?
if read() and load() did not help i would suggest to read file names from S3 and order it in fashion you need and then read back those files in the order in the spark. You can always build up your DataFrame (if you are and keep adding data to it from these partitions that you just read)
Related
We all know parquet is column-oriented so we can get only columns we desired and reduce IO.
But what if the parquet file is stored in HDFS, should we download the entire file first, and then apply column filter locally?
For example, if we use spark to read a parquet column from HDFS/Hive:
spark.sql("select name from wide_table")
Still we must download the entire parquet file, is that right?
Or maybe there is a way we can filter the columns just before the network transfer?
Actually "predicate pushdown" which is a feature of Spark SQL will try to use column filters to reduce the amount of information that is processed by spark. Technically the entire hdfs block is still read into memory, but it uses smart logic to only return relevant results. This is normally called out in the physical plan. You can read this by using .explain() on your query to see if the feature is being used. (Not all versions of hdfs support this.)
Hi I want to save my spark dataframe to a file with custom file format,
such that it partitions data to different files while writing to the file.
Also I need single part file for each partition key.
I have tried extending TextBasedFileFormat and change writer to suit my needs.
The data is getting partitioned while writing to file without shuffle.
But I feel each rdd partition will write data to different part file
When you write the dataframe, each partition of underlying RDD will be written by separate tasks. Now each of these RDD partitions might correspond to data which belongs to different partition key. So each task will end up creating multiple part files.
To solve this, you have to repartition your dataframe by the partitionKey. This will involve a shuffle and all the data corresponding to same partitionKey will come into same RDD partition. This can be done by -
val newDf = df.repartition("partitionKey")
Now this RDD can be written to any file format (say parquet, csv etc) and their should be 1 file per partition. If the file size is going big, it might create multiple files. This can be controlled by config "spark.sql.files.maxRecordsPerFile".
val newDf = df.repartition("partitionKey")
newDf.write.partitionBy("partitionKey").parquet("<directory_path>")
I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If for a single file (in example, 2012-06-01) I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit of using the cluster.
I tried reading in a chunk of files at once, and using partitionBy to write the output to daily files like this (in example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read in different executors like I want, but the executors later die and the process fails. I believe since the files are so large, and the partitionBy is somehow using unnecessary resources (a shuffle?) it's crashing the tasks.
I don't actually need to re-partition my dataframe since this is just a 1:1 mapping. Is there anyway to make each individual task write to a separate, explicitly named parquet output file?
I was thinking something like
def write_file(date):
# get input/output locations from date
dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
dataframe.write.parquet(output_location)
spark.sparkContext.parallelize(my_dates).for_each(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing a unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2.
Unpack the files to a temporary storage before submitting the job.
If you are concerned about resources used by partitionBy (it might open larger number of files for each executor thread) you can actually shuffle to improve performance - DataFrame partitionBy to a single Parquet file (per partition). Single file is probably to much but
dataframe \
.repartition(n, 'dayColumn', 'someOtherColumn') \
.write.partitionBy('dayColumn') \
.save(...)
where someOtherColumn can be chosen to get reasonable cardinality, should improve things.
I am writing this code
val inputData = spark.read.parquet(inputFile)
spark.conf.set("spark.sql.shuffle.partitions",6)
val outputData = inputData.sort($"colname")
outputData.write.parquet(outputFile) //write on HDFS
If I want to read the content of the file "outputFile" from HDFS, I don't find the same number of partitions and the data is not sorted. Is this normal?
I am using Spark 2.0
This is an unfortunate deficiency of Spark. While write.parquet saves files as part-00000.parquet, part-00001.parquet, ... , it saves no partition information, and does not guarantee that part-00000 on disk is read back as the first partition.
We have added functionality for our project to a) read back partitions in the same order (this involves doing some somewhat-unsafe partition casting and sorting based on the contained filename), and b) serialize partitioners to disk and read them back.
As far as I know, there is nothing you can do in stock Spark at the moment to solve this problem. I look forward to seeing a resolution in future versions of Spark!
Edit: My experience is in Spark 1.5.x and 1.6.x. If there is a way to do this in native Spark with 2.0, please let me know!
You should make use of the repartition() instead. This would write the parquet file the way you want it:
outputData.repartition(6).write.parquet("outputFile")
Then, it would be the same if you try to read it back .
Parquet preserves the order of rows. You should use take() instead of show() to check the contents. take(n) returns the first n rows and the way it works is by first reading the first partition to get an idea of the partition size and then getting the rest of the data in batches..
I have a job that needs to save the result in parquet/avro format from all the worker nodes. Can I do a separate parquet file for each of the individual partition and read all the resulting files as a single table? Or is there a better way of going about this?
Input is divided into 96 partitions and result needs to be saved on HDFS. When I tried to save it as a file it created over a million small files.
You can do a repartition (or coalesce if you always want fewer partitions) to the desired number of partitions just before you call write. Your data will then be written into the same number of files. When you want to read in the data, you simply point to the folder with the files rather than to a specific file. Like this:
sqlContext.read.parquet("s3://my-bucket/path/to/files/")