How to reduce no of partition without changing in spark code - apache-spark

I have a code zip file executing through spark submit and it produce 200 output file now the question is without changing in as its a zip file
how to reduce no of output files?

If you are using config file and your code doing repartition by getting number of partitions from config file dynamically then you can change the value in your config file, no need change zip file.
Another option would be using --conf spark.sql.shuffle.partitions=<number of partitions> in your spark-submit then your spark job will create number of files specified number.
NOTE: Setting up this param will degrade performance as this controls number of partitions whole spark program only advised to use if spark job is not processing millions of records.

Related

Why apache spark save function with a folder contain multi sub-file?

When save spark dataframe, spark save to multi file inside a folder instead only one file.
df.write.format("json") \
.option("header", "true") \
.save('data.json', mode='append')
When run this code, data.json will be folder name instead file name.
And I want to know what are the advantages for that ?
When you write the dataframe or rdd the spark uses HadoopAPI underneath
The actual data that contains result is in the part- files which are created as the same number of partition on dataframe. If you have n numbers of partition then it creates n number of part files.
The main advantage of multiple part file is that if you have multiple workers can access and write the file in parallel.
Other files like _SUCCESS is to indicate that it has completed successfully and .crc is for the ckeck.
Hope this helps you.

How to load lots of files into one RDD in Spark

I use saveAsTextFile method to save RDD, but it is not in a file, instead there are many parts files as the following picture.
So, my question is how to reload these files into one RDD.
You are trying to use Spark locally, rather than in a distributed manner is my guess. When you use saveAsTextFile it is just saving these using Hadoop's file writer and creating a file per RDD partition. One thing you could do is coalesce the partition to 1 file before writing if you want a single file. But if you go up one folder you will find that the folder's name is that which you saved. So you can just sc.textFile using that same path and it will pull everything into the partitions once again.
you know what? I just found it very elegant:
say your files are all in the /output directory, just use the following command to merge them into one, and then you can easily reload as one RDD:
hadoop fs -getmerge /output /local/file/path
Not a big deal, I'm Leifeng.

Importing a large text file into Spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If i'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines if the file is splittable by checking if the code is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so you at least have more than one of your transformations process parts of the data.
2. Ask for multiple files instead of just a single GZIP file
3. Write an application that decompresses and splits the files into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in java, but i'm sure there are equivalent Python/Scala versions of them.
First I suggest you have to used ORC format with zlib compression so you get almost 70% compression and as per my research ORC is the most suitable file format for fastest data processing. So you have to load your file and simply write it into orc format with repartition.
df.repartition(500).write.option("compression","zlib").mode("overwrite").save("testoutput.parquet")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file in the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However I am not sure about how long it will take for S3DistCp to perform the operation (which is an Hadoop Map/Reduce over S3. It benefits from optimised S3 libraries when run from an EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the Map tasks).

How the input data is split in Spark?

I'm coming from a Hadoop background, in hadoop if we have an input directory that contains lots of small files, each mapper task picks one file each time and operate on a single file (we can change this behaviour and have each mapper picks more than one file but that's not the default behaviour). I wonder to know how that works in Spark? Does each spark task picks files one by one or..?
Spark behaves the same way as Hadoop working with HDFS, as in fact Spark uses the same Hadoop InputFormats to read the data from HDFS.
But your statement is wrong. Hadoop will take files one by one only if each of your files is smaller than a block size or if all the files are text and compressed with non-splittable compression (like gzip-compressed CSV files).
So Spark would do the same, for each of the small input files it would create a separate "partition" and the first stage executed over your data would have the same amount of tasks as the amount of input files. This is why for small files it is recommended to use wholeTextFiles function as it would create much less partitions

What are the files generated by Spark when using "saveAsTextFile"?

When I run a Spark job and save the output as a text file using method "saveAsTextFile" as specified at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD :
here are the files that are created :
Is the .crc file a Cyclic Redundancy Check file ? and so is used to check that the content of each generated file IS correct ?
The _SUCCESS file is always empty, what does this signify ?
The files that do not have an extension in above screenshot contain the actual data from the RDD but why are many files generated instead of just one ?
Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().
part- files: These are your output data files.
You will have one part- file per partition in the RDD you called saveAsTextFile() on. Each of these files will be written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means you will write your output much faster that it would be written out if it were all put in a single file, assuming your storage layer can handle the bandwidth.
You can check the number of partitions in your RDD, which should tell you how many part- files to expect, as follows:
# PySpark
# Get the number of partitions of my_rdd.
my_rdd._jrdd.splits().size()
_SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.
.crc files: I have not seen the .crc files before, but yes, presumably they are checks on the part- files.

Resources