PySpark: Writing input files to separate output files without repartitioning - apache-spark

I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If for a single file (for example, 2012-06-01) I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit from using the cluster.
I tried reading in a chunk of files at once and using partitionBy to write the output to daily files, like this (for example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read by different executors as I want, but the executors later die and the process fails. I believe that because the files are so large, and because partitionBy is somehow using unnecessary resources (a shuffle?), it's crashing the tasks.
I don't actually need to re-partition my dataframe, since this is just a 1:1 mapping. Is there any way to make each individual task write to a separate, explicitly named parquet output file?
I was thinking something like
def write_file(date):
    # get input/output locations from date
    dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
    dataframe.write.parquet(output_location)

spark.sparkContext.parallelize(my_dates).foreach(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?

Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing an unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2.
Unpack the files to a temporary storage before submitting the job.
If you are concerned about the resources used by partitionBy (it might open a larger number of files for each executor thread), you can actually shuffle to improve performance - see DataFrame partitionBy to a single Parquet file (per partition). A single file is probably too much, but
dataframe \
    .repartition(n, 'dayColumn', 'someOtherColumn') \
    .write.partitionBy('dayColumn') \
    .save(...)
where someOtherColumn can be chosen to get reasonable cardinality, should improve things.
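A minimal sketch of that pattern, reusing the paths and schema from the question (the partition count and someOtherColumn are placeholders you would tune for your data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('gzip-to-daily-parquet').getOrCreate()

# my_schema is assumed to be defined, as in the question
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)

# One shuffle spreads each day over a handful of well-sized partitions;
# partitionBy then routes rows into per-day output directories without a second shuffle.
(dataframe
    .repartition(200, 'dayColumn', 'someOtherColumn')
    .write
    .partitionBy('dayColumn')
    .parquet('s3://mybucket/output/'))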

Related

Reading Millions of Small JSON Files from S3 Bucket in PySpark Very Slow

I have a folder (path = mnt/data/*.json) in s3 with millions of json files (each file is less than 10 KB). I run the following code:
df = (spark.read
      .option("multiline", True)
      .option("inferSchema", False)
      .json(path))
display(df)
The problem is that it is very slow. Spark creates a job for this with one task. The task appears to have no more executors running it, which usually signifies the completion of a job (right?), but for some reason the command cell in Databricks is still running. It's been stuck like this for 10 minutes. I feel something as simple as this should take no more than 5 minutes.
Notes to consider:
Since there are millions of json files, I can't say with confidence that they will have the same exact structure (there could be some discrepancies)
The json files were web-scraped from the same REST API
I read somewhere that inferSchema = False can help reduce runtime, which is why I used it
AWS s3 bucket is already mounted so there is absolutely no need to use boto3
Apache Spark is very good at handling large files, but when you have tens of thousands of small files (millions in your case) in a directory or distributed across several directories, that will have a severe impact on processing time (potentially tens of minutes to hours), since Spark has to open and read each of these tiny files.
An ideal file size is between 128 MB and 1 GB on disk; anything less than 128 MB (the default for spark.sql.files.maxPartitionBytes) causes this tiny-files problem and becomes the bottleneck.
You can rewrite the data in Parquet format at an intermediate location, either as one large file using coalesce or as multiple even-sized files using repartition.
You can then read the data from this intermediate location for further processing, and this should prevent the bottlenecks that come with the tiny-files problem.
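A rough sketch of that intermediate compaction step, with hypothetical paths and a partition count you would size so each output file lands in the 128 MB to 1 GB range:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = '/mnt/data/*.json'            # the millions of tiny files
compact_path = '/mnt/data_compacted/'    # hypothetical intermediate location

# One expensive pass over the tiny files...
df = spark.read.json(raw_path)

# ...rewritten as a small number of evenly sized Parquet files.
df.repartition(64).write.mode('overwrite').parquet(compact_path)

# Downstream jobs read the compacted copy instead of the raw tiny files.
df_fast = spark.read.parquet(compact_path)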
My approach was very simple thanks to Anand pointing out the "small file problem." My problem was that I could not extract ~2 million json files, each ~10 KB in size, so there was no way I was able to read and then store them in parquet format as an intermediary step. I was given an s3 bucket with raw json files scraped from the web.
At any rate, Python's zipfile module came in handy. I used it to bundle multiple json files so that each archive was at least 128 MB and at most 1 GB. Worked pretty well!
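A rough sketch of that zipfile bundling idea (the paths and size threshold are illustrative):

import os
import zipfile
from glob import glob

SOURCE_DIR = 'raw_json/'            # local copy of the small json files
TARGET_SIZE = 128 * 1024 * 1024     # aim for at least ~128 MB per archive

archive_index = 0
current_size = 0
current_zip = None

for path in sorted(glob(os.path.join(SOURCE_DIR, '*.json'))):
    # start a new archive once the current one has reached the target size
    if current_zip is None or current_size >= TARGET_SIZE:
        if current_zip is not None:
            current_zip.close()
        archive_index += 1
        current_zip = zipfile.ZipFile('bundle_%05d.zip' % archive_index, 'w')
        current_size = 0
    current_zip.write(path, arcname=os.path.basename(path))
    current_size += os.path.getsize(path)

if current_zip is not None:
    current_zip.close()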
There is also another way you can do this using AWS Glue, but of course that requires IAM role authorization and can be expensive; the advantage is that you can convert those files into parquet directly.
zipfile solution: https://docs.python.org/3/library/zipfile.html
AWS Glue solution: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
Really good blog posts explaining the small file problem:
https://mungingdata.com/apache-spark/compacting-files/
https://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/?unapproved=252&moderation-hash=5a657350c6169448d65209caa52d5d2c#comment-252

is it possible in spark to read large s3 csv files in parallel?

Typically spark files are saved in multiple parts, allowing each worker to read different files.
is there a similar solution when working on a single file?
s3 provides the select API that should allow this kind of behaviour.
spark appears to support this API (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html), but it appears to relate only to optimising queries, not to parallelising reading
S3 Select is unrelated to your use case.
S3 Select: have the SQL select and project done in the S3 store, so that the client gets the prefiltered data. The result is returned as CSV with the header stripped, or as JSON. You cannot then have more than one worker target this. (You could try, but each worker would have to read in and discard all the data in the run-up to its offset, and predicting the ranges each worker can process is essentially impossible.)
You: have > 1 worker process different parts of a file which has been partitioned
Partitioning large files into smaller parts for parallel processing is exactly what Spark (and mapreduce, hive etc) do for any format where it makes sense.
CSV files are easily partitioned provided they are compressed with a splittable compression format (none or snappy, but not gzip).
All that's needed is to tell Spark what the split threshold is. For S3A, set fs.s3a.block.size to a value it can then split on, and your queries against CSV, Avro, ORC, Parquet and similar will all be split up amongst workers.
Unless your workers are doing a lot of computation per row, there's a minimum block size before it's even worth doing this. Experiment.
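A minimal sketch of setting that split size when building the session (the 128 MB value is just a starting point to experiment with, and the path is hypothetical):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('csv-parallel-read')
         .config('spark.hadoop.fs.s3a.block.size', str(128 * 1024 * 1024))
         .getOrCreate())

# with a splittable (or uncompressed) CSV the read should now fan out into multiple tasks
df = spark.read.csv('s3a://mybucket/big.csv', header=True)
print(df.rdd.getNumPartitions())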
Edit: this is now out of date and depends on the type of CSV. Some CSVs allow newlines within columns; these are unsplittable. CSVs that guarantee a newline only ever represents a new row can be split.
FYI, CSVs are inherently single-threaded. There is no extra information in a CSV file that tells the reader where any row starts without reading the whole file from the start.
If you want multiple readers on the same file, use a format like Parquet, which has row groups with explicitly defined start positions recorded in the footer that can be read by independent readers. When Spark reads the parquet file, it will split the row groups out into separate tasks. Ultimately, having appropriately sized files is very important for Spark performance.
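A sketch of that conversion, with hypothetical paths: one serial pass over the CSV, after which reads can be split across row groups by independent tasks:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the serial read of the single CSV file
csv_df = spark.read.csv('s3a://mybucket/big.csv', header=True)

# written once as Parquet, whose footer records where each row group starts
csv_df.write.mode('overwrite').parquet('s3a://mybucket/big_parquet/')

# subsequent reads are split into tasks along row group / file boundaries
parquet_df = spark.read.parquet('s3a://mybucket/big_parquet/')
print(parquet_df.rdd.getNumPartitions())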

Importing a large text file into Spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If I'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines whether the file is splittable by checking whether the codec is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so that more than one task can process parts of the data in your transformations (see the sketch after this list).
2. Ask for multiple files instead of just a single GZIP file
3. Write an application that decompresses and splits the files into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
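A minimal sketch of option 1, assuming a single gzipped, pipe-delimited file like the one in the question (the path and partition count are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName='gzip-repartition')

# the gzip file itself is still decompressed by a single task...
lines = sc.textFile('s3://mybucket/huge_file.gz')

# ...but an explicit repartition spreads the records across the cluster
# before the expensive transformations run
rows = lines.repartition(360).map(lambda line: line.split('|'))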
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.
First, I suggest using ORC format with zlib compression: you get almost 70% compression, and per my research ORC is the most suitable file format for fast data processing. So load your file and simply write it out in ORC format with a repartition.
df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("testoutput.orc")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360 GB file into the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR cluster).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long it will take for S3DistCp to perform the operation (which is a Hadoop Map/Reduce job over S3; it benefits from optimised S3 libraries when run from EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the map tasks).

Spark repartition zipped file input on read

Using sc.textFile(path, partitions) I'm able to partition uncompressed files as they are being read in. Unfortunately, for RDDs that are in zipped parts, the number of partitions doesn't change when you set it. Since downstream processing has to be in more parts, I do sc.textFile(path, partitions).repartition(partitions), but that does a lot of shuffling. Is there a way to just repartition locally? That is, have every node break the parts it already has into smaller pieces, rather than randomly shuffling across the cluster.

How the input data is split in Spark?

I'm coming from a Hadoop background. In Hadoop, if we have an input directory that contains lots of small files, each mapper task picks one file each time and operates on a single file (we can change this behaviour and have each mapper pick more than one file, but that's not the default behaviour). I'd like to know how that works in Spark. Does each Spark task pick files one by one, or...?
Spark behaves the same way as Hadoop working with HDFS, as in fact Spark uses the same Hadoop InputFormats to read the data from HDFS.
But your statement is wrong. Hadoop will take files one by one only if each of your files is smaller than a block size or if all the files are text and compressed with non-splittable compression (like gzip-compressed CSV files).
So Spark would do the same: for each of the small input files it would create a separate "partition", and the first stage executed over your data would have the same number of tasks as the number of input files. This is why, for small files, it is recommended to use the wholeTextFiles function, as it creates far fewer partitions.
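A small sketch of the difference, with a hypothetical directory of small files:

from pyspark import SparkContext

sc = SparkContext(appName='small-files')

# textFile: roughly one partition per small input file
per_file = sc.textFile('hdfs:///data/small_files/')
print(per_file.getNumPartitions())

# wholeTextFiles: (path, content) pairs, packing many small files into far fewer partitions
packed = sc.wholeTextFiles('hdfs:///data/small_files/', minPartitions=16)
print(packed.getNumPartitions())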
