Importing a large text file into Spark - apache-spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file to allow multiple worker nodes to operate on the data, which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple Spark worker nodes to work on my one compressed file in order to partition it?
Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark, so I think partitioning the file is necessary.
I'm using Spark inside of Amazon Glue, so I know that it can scale to a large number of machines. Also, I'm using Python (pyspark).
Thanks.

If I'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines whether the file is splittable by checking if the codec is an instance of SplittableCompressionCodec.
Since GZIP is not splittable, Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition right after SparkContext.textFile so that at least your subsequent transformations process parts of the data in parallel (a minimal sketch follows this list).
2. Ask for multiple files instead of just a single GZIP file.
3. Write an application that decompresses and splits the file into multiple output files before running your Spark application on it.
4. Write your own splittable compression codec for GZIP (this is a little more complex).
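For option 1, a minimal PySpark sketch (the bucket/key and the partition count of 500 are placeholders, not from the original question):
rdd = sc.textFile("s3://my-bucket/huge_file.gz")   # gzip is not splittable, so this starts as 1 partition
rdd = rdd.repartition(500)                         # one shuffle so downstream transformations run in parallel
parsed = rdd.map(lambda line: line.split("|"))     # the file is pipe delimited
print(parsed.getNumPartitions())                   # should now report 500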
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.

First, I suggest you use the ORC format with zlib compression: you get roughly 70% compression, and in my experience ORC is one of the most suitable file formats for fast data processing. So load your file and simply write it out in ORC format with a repartition.
df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("testoutput.orc")
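As a quick check (path reused from the line above), you can read the ORC output back and confirm it now loads as many partitions:
df_orc = spark.read.orc("testoutput.orc")
print(df_orc.rdd.getNumPartitions())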

One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file in the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long it will take S3DistCp to perform the operation (it is a Hadoop Map/Reduce job over S3; it benefits from optimised S3 libraries when run from EMR, but I am concerned that Hadoop will face the same limitation as Spark when generating the map tasks).
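For reference, an invocation along these lines should do it; the paths are placeholders and you should check the exact flags supported by your EMR release (--outputCodec=none simply decompresses, which makes the resulting text file splittable on HDFS):
s3-dist-cp --src s3://my-bucket/huge_file.gz --dest hdfs:///data/input --outputCodec=none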

Related

PySpark S3 file read performance consideration

I am a newbie to pyspark.
I just want to understand how large the files I write into S3 should be so that Spark can read and process them.
I have around 400 to 500GB of total data; I need to first upload it to S3 using some tool.
Just trying to understand how big each file in S3 should be so that Spark can read and process it efficiently.
And how will Spark distribute the S3 file data to multiple executors?
Any good reading links?
Thanks
Try 64-128 MB, though it depends on the format.
Spark treats S3 data as independent of location, so it doesn't use locality in its placement decisions; work simply goes to whichever workers have capacity for extra work.
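A rough illustration of the file-size point (the path is a placeholder; spark.sql.files.maxPartitionBytes defaults to 128 MB, so 64-128 MB objects map to roughly one task per object):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  # max data per input partition
         .getOrCreate())
df = spark.read.csv("s3a://my-bucket/input/")   # directory of many 64-128 MB objects
print(df.rdd.getNumPartitions())                # roughly one partition (and one task) per object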

Dilemma about Spark partitions

I am working on a project where I have to read S3 files (each about 3MB, zipped) using boto3. I have a small pyspark script that runs every hour to process the file and generate two types of output data, which are written back to S3. The pyspark script uses the 'xmltodict' Python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster v5.28 running with 1 Master and 1 Core. This might be excessive but is not my main concern right now.
Questions:
1. How do I know 'IF' I should partition the data? I have read articles on how many partitions to create, etc., but couldn't find anything on IF and WHEN. What are the criteria that drive partitioning: the number of rows, columns, or data types in the source file, or the actions taken in the script? I read the source file into an RDD, convert it to a DF, and perform various operations such as adding columns, grouping data, and counting data. How does Spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node, as I have decided to stick with the standalone cluster manager. The 'xmltodict' library is installed on this node, but is not installed on the Core node. It doesn't seem like it needs to be installed (or even Python 3 configured) on the Core node, since I am not seeing any errors. Is that correct, and can somebody shed some light on this confusion? I tried to install the Python libraries via a shell file as a bootstrap action when I created the cluster, but it failed, and quite frankly, after trying it a few times, I gave up.
3. Related to partitioning, I am slightly confused about whether or not to use coalesce() or collect(). Again, the question is when to use each and when not to.
Sorry for the many questions. Now that I have the pyspark script written, I am trying to work on efficiency.
Thanks
Partitioning is the mechanism by which data is divided into optimally sized chunks, and based on that, multiple tasks are run, each processing one piece of the data. As you can see, this is the core of parallelism, and without it there is no significant benefit to Spark (or any big-data processing framework). Most file formats are splittable, and some remain splittable when compressed, like Avro, Parquet, ORC, etc. Other formats are not splittable when compressed, like zip, gzip, etc. Based on the size of the files being processed and their ability to be split, Spark automatically creates multiple partitions and processes the data in parallel. In your case, since the data is zipped, one file will be one partition, and no more than one CPU can work on it at once. If the zip is small then that's OK, but if it is big then its processing will be slow.
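To answer the 'IF' part concretely, you can inspect how many partitions Spark actually created and only repartition when that number is too low for your cluster (the threshold of 8 below is illustrative):
num_parts = df.rdd.getNumPartitions()   # 1 for a single zipped input file
print(num_parts)
if num_parts < 8:                       # e.g. the number of cores you want to keep busy
    df = df.repartition(8)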

Spark 2.x - gzip vs snappy compression for parquet files

I am (for the first time) trying to repartition the data my team is working with to enhance our querying performance. Our data is currently stored in partitioned .parquet files compressed with gzip. I have been reading that using snappy instead would significantly increase throughput (we query this data daily for our analysis). I still wanted to benchmark the two codecs to see the performance gap with my own eyes. I wrote a simple (Py)Spark 2.1.1 application to carry out some tests. I persisted 50 million records in memory (deserialized) in a single partition, wrote them into a single parquet file (to HDFS) using the different codecs, and then imported the files again to assess the difference. My problem is that I can't see any significant difference for either read or write.
Here is how I wrote my records to HDFS (same thing for the gzip file, just replace 'snappy' with 'gzip'):
persisted_records.write\
.option('compression', 'snappy')\
.mode("overwrite")\
.partitionBy(*partition_cols)\
.parquet('path_to_dir/test_file_snappy')
And here is how I read my single .parquet file (same thing for the gzip file, just replace 'snappy' with 'gzip'):
df_read_snappy = spark.read\
.option('basePath', 'path_to_dir/test_file_snappy')\
.option('compression', 'snappy')\
.parquet('path_to_dir/test_file_snappy')\
.cache()
df_read_snappy.count()
I looked at the durations in the Spark UI. For information, the persisted (deserialized) 50 million rows amount to 317.4 MB. Once written into a single parquet file, the file weighs 60.5 MB using gzip and 105.1 MB using snappy (this is expected, as gzip is supposed to have a better compression ratio). Spark spends 1.7 min (gzip) and 1.5 min (snappy) writing the file (a single partition, so a single core has to carry out all the work). Reading times amount to 2.7 min (gzip) and 2.9 min (snappy) on a single core (since we have a single file / HDFS block). This is what I do not understand: where is snappy's higher performance?
Have I done something wrong? Is my "benchmarking protocol" flawed? Is the performance gain there, but am I not looking at the right metrics?
I must add that I am using the default Spark conf. I did not change anything aside from specifying the number of executors, etc.
Many thanks for your help!
Note: the Spark Parquet jar version is 1.8.1.

Spark write.avro creates individual avro files

I have a spark-submit job I wrote that reads an 'in' directory of JSON docs, does some processing on them using DataFrames, and then writes to an 'out' directory. For some reason, though, it creates individual Avro, Parquet or JSON files when I use the df.save or df.write methods.
In fact, I even used the saveAsTable method and it did the same thing with parquet.gz files in the hive warehouse.
It seems to me that this is inefficient and negates the use of a container file format. Is this right? Or is this normal behavior and what I'm seeing just an abstraction in HDFS?
If I am right that this is bad, how do I write the data frame from many files into a single file?
As #zero323 said, this is normal behavior due to the many workers (to support fault tolerance).
I would suggest you write all the records into a Parquet or Avro file (holding Avro generic records) using something like this:
dataframe.write().mode(SaveMode.Append).format(FILE_FORMAT).partitionBy("parameter1", "parameter2").save(path);
but it won't write to a single file; it will group similar kinds of Avro generic records into each file (so you may end up with a smaller number of medium-sized files).
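If a single output file is genuinely required, one common pattern is to coalesce to a single partition before writing; note that this funnels all the data through one task, so it only makes sense for modestly sized output (Parquet shown here just as an example format, path is a placeholder):
df.coalesce(1).write.mode("overwrite").parquet("path/to/single_file_dir")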

How the input data is split in Spark?

I'm coming from a Hadoop background. In Hadoop, if we have an input directory that contains lots of small files, each mapper task picks one file at a time and operates on that single file (we can change this behaviour and have each mapper pick more than one file, but that's not the default). I wonder how that works in Spark. Does each Spark task pick files one by one, or..?
Spark behaves the same way as Hadoop working with HDFS, as in fact Spark uses the same Hadoop InputFormats to read the data from HDFS.
But your statement is wrong. Hadoop will take files one by one only if each of your files is smaller than a block size or if all the files are text and compressed with non-splittable compression (like gzip-compressed CSV files).
So Spark does the same: for each of the small input files it creates a separate "partition", and the first stage executed over your data has the same number of tasks as the number of input files. This is why, for small files, it is recommended to use the wholeTextFiles function, as it creates far fewer partitions.
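A small comparison to see the difference (the directory path is a placeholder):
per_line = sc.textFile("hdfs:///data/small_files/")        # roughly one partition per small file
per_file = sc.wholeTextFiles("hdfs:///data/small_files/")  # (filename, content) pairs, packed into far fewer partitions
print(per_line.getNumPartitions(), per_file.getNumPartitions())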
