I have a spark-submit job I wrote that reads an in directory of json docs, does some processing on them using data frames, and then writes to an out directory. For some reason, though, it creates individual avro, parquet or json files when I use df.save or df.write methods.
In fact, I even used the saveAsTable method and it did the same thing with parquet.gz files in the hive warehouse.
It seems to me that this is inefficient and negates the use of a container file format. Is this right? Or is this normal behavior and what I'm seeing just an abstraction in HDFS?
If I am right that this is bad, how do I write the data frame from many files into a single file?
As #zero323 told its normal behavior due to many workers(to support fault tolerance).
I would suggest you to write all the records in parquet or avro file which has avro generic record using something like this
dataframe.write().mode(SaveMode.Append).
format(FILE_FORMAT).partitionBy("parameter1", "parameter2").save(path);
but it wont write in to single file but it will group similar kind of Avro Generic records to one file(may be less number of medium sized) files
Related
Typically spark files are saved in multiple parts, allowing each worker to read different files.
is there a similar solution when working on a single files?
s3 provides the select API that should allow this kind of behaviour.
spark appears to support this API (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html), but this appears to relate only for optimising queries, not for parallelising reading
S3 Select is unrelated to your use case.
S3 Select: have SQL select and project done in the S3 store, so that the client gets the prefiltered data. Result is returned as CSV with the header stripped, or JSON. You cannot then have >1 worker target this. (you could try, but each worker would have to read in and discard all the data in the runup to its offset, and predicting the ranges each worker can process is essentially impossible)
You: have > 1 worker process different parts of a file which has been partitioned
Partitioning large files into smaller parts for parallel processing is exactly what Spark (and mapreduce, hive etc) do for any format where it makes sense.
CSV files are easily partitioned provided they are compressed with a splittable compression format (none, snappy -but not gzip)
All that's needed is to tell spark what the split threshold is. For S3a, set the value fs.s3a.block.size to a value which it can then split up on, then your queries against CSV, Avro, ORC, Parquet and similar will all be split up amongst workers.
Unless your workers are doing a lot of computation per row, there's a minimum block size before it's even worth doing this. Experiment.
Edit: this is now out of date and depends on the type of CSV. Some CSV's allow new lines within columns. These are un splitable. CSVs that do not an guarantee that a newlines only represent a new row can be split
FYI csv's are inherently single threaded. There is no extra information in a csv file that tells the reader where any row starts without reading the whole file from the start.
If you want multiple readers on the same file use a format like Parquet which has row groups with an explicitly defined start position defined in the footer that can be read by independent readers. When spark goes to read the parquet file it will split out row groups into separate tasks. Ultimately having appropriately sized files is very important for spark performance.
I have a job that reads csv files , converts it into data frames and writes in Parquet. I am using append mode while writing the data in Parquet. With this approach, in each write a separate Parquet file is getting generated. My questions are :
1) If every time I write the data to Parquet schema ,a new file gets
appended , will it impact read performance (as the data is now
distributed in varying length of partitioned Parquet files)
2) Is there a way to generate the Parquet partitions purely based on
the size of the data ?
3) Do we need to think to a custom partitioning strategy to implement
point 2?
I am using Spark 2.3
It will affect read performance if
spark.sql.parquet.mergeSchema=true.
In this case, Spark needs to visit each file and grab schema from
it.
In other cases, I believe it does not affect read performance much.
There is no way generate purely on data size. You may use
repartition or coalesce. Latter will created uneven output
files, but much performant.
Also, you have config spark.sql.files.maxRecordsPerFile or option
maxRecordsPerFile to prevent big size of files, but usually it is
not an issue.
Yes, I think Spark has not built in API to evenly distribute by data
size. There are Column
Statistics
and Size
Estimator may help with this.
I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If for a single file (in example, 2012-06-01) I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit of using the cluster.
I tried reading in a chunk of files at once, and using partitionBy to write the output to daily files like this (in example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read in different executors like I want, but the executors later die and the process fails. I believe since the files are so large, and the partitionBy is somehow using unnecessary resources (a shuffle?) it's crashing the tasks.
I don't actually need to re-partition my dataframe since this is just a 1:1 mapping. Is there anyway to make each individual task write to a separate, explicitly named parquet output file?
I was thinking something like
def write_file(date):
# get input/output locations from date
dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
dataframe.write.parquet(output_location)
spark.sparkContext.parallelize(my_dates).for_each(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing a unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2.
Unpack the files to a temporary storage before submitting the job.
If you are concerned about resources used by partitionBy (it might open larger number of files for each executor thread) you can actually shuffle to improve performance - DataFrame partitionBy to a single Parquet file (per partition). Single file is probably to much but
dataframe \
.repartition(n, 'dayColumn', 'someOtherColumn') \
.write.partitionBy('dayColumn') \
.save(...)
where someOtherColumn can be chosen to get reasonable cardinality, should improve things.
I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If i'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines if the file is splittable by checking if the code is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so you at least have more than one of your transformations process parts of the data.
2. Ask for multiple files instead of just a single GZIP file
3. Write an application that decompresses and splits the files into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in java, but i'm sure there are equivalent Python/Scala versions of them.
First I suggest you have to used ORC format with zlib compression so you get almost 70% compression and as per my research ORC is the most suitable file format for fastest data processing. So you have to load your file and simply write it into orc format with repartition.
df.repartition(500).write.option("compression","zlib").mode("overwrite").save("testoutput.parquet")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file in the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However I am not sure about how long it will take for S3DistCp to perform the operation (which is an Hadoop Map/Reduce over S3. It benefits from optimised S3 libraries when run from an EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the Map tasks).
I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They all are related to serialization but I've seen them used together so they can't be doing the same thing.
Parquet describes its self as a columnar storage format and I kind of get that but when I'm saving a parquet file can Arvo or Kryo have anything to do with it? Or are they only relevant during the spark job, ie. for sending objects over the network during a shuffle or spilling to disk? How do Arvo and Kryo differ and what happens when you use them together?
Parquet works very well when you need to read only a few columns when querying your data. However if your schema has lots of columns (30+) and in your queries/jobs you need to read all of them then record based formats (like AVRO) will work better/faster.
Another limitation of Parquet is that it is essentially write-once format. So usually you need to collect data in some staging area and write it to a parquet file once a day (for example).
This is where you might want to use AVRO. E.g. you can collect AVRO-encoded records in a Kafka topic or local files and have a batch job that converts all of them to Parquet file at the end of the day. This is fairly easy to implement thanks to parquet-avro library that provides tools to convert between AVRO and Parquet formats automatically.
And of course you can use AVRO outside of Spark/BigData. It is fairly good serialization format similar to Google Protobuf or Apache Thrift.
This very good blog post explains the details for everything but Kryo.
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Kryo would be used for fast serialization not involving permanent storage, such as shuffle data and cached data, in memory or on disk as temp files.