Partitioning a large binary file using pyspark - apache-spark

I am trying to process a large binary file with PySpark, but I always get an OutOfMemoryError. I have tried everything I could think of, such as increasing executor/driver memory and repartitioning the RDD. Does Spark partition a single large binary file? If not, how can binary files be processed? The binary file I am currently using is larger than 2 GB.

Related

Memory difference between pyspark and spark?

I have been trying to get a PySpark job to work that creates an RDD from a bunch of binary files and then uses a flatMap operation to process the binary data into rows. This has led to a series of out-of-memory errors, and after playing around with memory settings for a while I decided to get the simplest thing possible working, which is just counting the number of files in the RDD.
This also fails with an OOM error. So I opened both spark-shell and PySpark and ran the commands in the REPL/shell with default settings; the only additional parameter was --master yarn. The spark-shell version works, while the PySpark version shows the same OOM error.
Is there that much overhead to running PySpark? Or is this a problem with binaryFiles being new? I am using Spark version 2.2.0.2.6.4.0-91.
The difference:
Scala loads records as PortableDataStream. This means the process is lazy: unless you call toArray on the values, it won't load the data at all.
Python calls the Java backend but loads the data as a byte array. This part is eager-ish, so it might fail on either side.
Additionally, PySpark will use at least twice as much memory, for the Java copy and the Python copy.
Finally, binaryFiles (like wholeTextFiles) is very inefficient and doesn't perform well if individual input files are large. In a case like this it is better to implement a format-specific Hadoop input format.
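One way to avoid materialising a whole file as a single byte array is to hand out (offset, length) ranges and let each task read only its own slice. The sketch below is plain Python, not a Spark API: `byte_ranges` and `read_range` are hypothetical helper names, and it assumes the format can be cut at fixed byte offsets and that the path is readable from every worker.

```python
import os
import tempfile


def byte_ranges(file_size, chunk_size):
    """Yield (offset, length) pairs that cover file_size bytes in order."""
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        yield offset, length
        offset += length


def read_range(path, offset, length):
    """Read only the requested slice of the file, never the whole file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Small demonstration: a temporary 10-byte file read in 4-byte chunks.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    demo_path = tmp.name

ranges = list(byte_ranges(os.path.getsize(demo_path), 4))
chunks = [read_range(demo_path, off, n) for off, n in ranges]
os.remove(demo_path)
```

In a PySpark job the ranges could be distributed with sc.parallelize(ranges) and mapped through read_range. For files made of fixed-length records, sc.binaryRecords(path, recordLength) may be a simpler built-in alternative.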
Since you are reading multiple binary files with binaryFiles(), note that starting with Spark 2.1 the minPartitions argument of binaryFiles() is ignored.
1. Try to repartition the input files as follows:
rdd = sc.binaryFiles(Path to the binary file , minPartitions = ).repartition()
2. You may try reducing the partition size to 64 MB or less, depending on the size of your data, using the configs below:
spark.files.maxPartitionBytes, default 128 MB
spark.files.openCostInBytes, default 4 MB
spark.default.parallelism
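As a rough illustration of how these three settings might interact, the sketch below computes an estimated maximum split size in plain Python. The formula is an assumption based on how Spark is commonly described as sizing binaryFiles splits, it may differ between versions, and `estimated_max_split_bytes` is a hypothetical helper, not a Spark function.

```python
MB = 1024 * 1024


def estimated_max_split_bytes(file_sizes, default_parallelism,
                              max_partition_bytes=128 * MB,
                              open_cost_in_bytes=4 * MB):
    """Estimate the split size from file sizes and the three settings.

    Assumed formula: each file is padded by the open cost, the total is
    divided across the default parallelism, and the result is clamped
    between open_cost_in_bytes and max_partition_bytes.
    """
    total = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


one_big_file = [2048 * MB]  # a single 2 GB input file

# With the defaults, the split size is capped by maxPartitionBytes.
default_split = estimated_max_split_bytes(one_big_file, default_parallelism=8)

# Lowering maxPartitionBytes to 64 MB lowers the cap accordingly.
smaller_split = estimated_max_split_bytes(
    one_big_file, default_parallelism=8, max_partition_bytes=64 * MB)
```

Note that binaryFiles returns each file as one (path, content) record, so a smaller split size mainly changes how many files are grouped into a partition. A single 2 GB file still arrives as one record.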

Concatenate ORC partition files on disk?

I am using Spark 2.3 to convert some CSV data to ORC for use with Amazon Athena; it is working fine! Athena works best with files that are not too small so, after manipulating the data a bit, I am using Spark to coalesce the partitions into a single partition before writing to disk, like so:
df.coalesce(1).write.orc("out.orc", compression='zlib', mode='append')
This results in a single ORC file that is an optimal file size for use with Athena. However, the coalesce step takes a very long time. It adds about 33% to the total amount of time to convert the data!
This is obviously due to the fact that Spark cannot parallelize the coalesce step when saving to a single file. When I create the same number of partitions as there are CPUs available, the ORC write out to disk is much faster!
My question is: can I parallelize the ORC write to disk and then concatenate the files somehow? This would allow me to parallelize the write and merge the files without having to compress everything on a single CPU.

Importing a large text file into Spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If I'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines whether the file is splittable by checking if the codec is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so that at least the transformations that follow can process parts of the data in parallel.
2. Ask for multiple files instead of just a single GZIP file
3. Write an application that decompresses and splits the files into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.
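Option 3 from the list above (decompress and split before running Spark) can be sketched with the Python standard library alone. This is a minimal single-machine illustration, assuming line-oriented UTF-8 text; `split_gzip_by_lines` and the `part-NNNNN.gz` naming are hypothetical choices, and a real 360GB input would need far larger parts than the tiny demo below.

```python
import gzip
import os
import tempfile


def split_gzip_by_lines(src_path, out_dir, lines_per_file):
    """Stream-decompress src_path and write numbered .gz parts.

    Each part holds at most lines_per_file lines, so the input is never
    fully loaded into memory.
    """
    part_paths = []
    out = None
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:
                # Close the previous part and start a new one.
                if out is not None:
                    out.close()
                part_path = os.path.join(
                    out_dir, f"part-{len(part_paths):05d}.gz")
                part_paths.append(part_path)
                out = gzip.open(part_path, "wt", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return part_paths


# Tiny demonstration: a 5-line pipe-delimited file split into parts of 2 lines.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "input.gz")
with gzip.open(src_path, "wt", encoding="utf-8") as f:
    for n in range(5):
        f.write(f"a|b|{n}\n")

parts = split_gzip_by_lines(src_path, work_dir, lines_per_file=2)
```

Because each part is its own gzip file, Spark can read the parts in parallel, one partition per file, even though no individual part is splittable.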
First, I suggest you use the ORC format with zlib compression, which gives you almost 70% compression; as per my research, ORC is the most suitable file format for fast data processing. So load your file and simply write it out in ORC format with a repartition.
df.repartition(500).write.format("orc").option("compression","zlib").mode("overwrite").save("testoutput.orc")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file into the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR cluster).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long it will take S3DistCp to perform the operation, which is a Hadoop Map/Reduce job over S3. It benefits from optimised S3 libraries when run from EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the Map tasks.

pyspark split load uniformly across all executors

I have a 5-node cluster. I am loading a CSV file with 100k records into a DataFrame using PySpark, performing some ETL operations, and writing the output to a Parquet file.
When I load the DataFrame, how can I divide the dataset uniformly across all executors so that each executor processes 20k records?
If possible, make sure that the input data is split into smaller files.
That way each executor will read and process a single file.
If you can't modify the input files, you can call df.repartition(5), but keep in mind that it will cause an expensive shuffle operation.
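To see why df.repartition(5) gives an even split, it helps to know that repartition without column arguments is generally understood to distribute rows round-robin. The sketch below is a plain-Python illustration of that idea, not Spark itself; `round_robin_partition_ids` is a hypothetical helper, and it assumes a single input partition and a fixed start index.

```python
from collections import Counter


def round_robin_partition_ids(num_rows, num_partitions, start=0):
    """Assign each row index to a partition id in round-robin order."""
    return [(start + i) % num_partitions for i in range(num_rows)]


# 100k rows spread over 5 partitions, as in the question.
sizes = Counter(round_robin_partition_ids(100_000, 5))
partition_sizes = [sizes[p] for p in range(5)]
```

In an actual PySpark session, the real partition sizes can be inspected with df.rdd.glom().map(len).collect().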

Spark to read a big file as inputstream

I know Spark's built-in textFile method can partition a huge file, read it in chunks, and distribute it as an RDD.
However, I am reading from a customized encrypted filesystem that Spark does not support natively. One way I can think of is to read from an input stream instead, load multiple lines at a time, and distribute them to the executors, continuing until the whole file is loaded, so that no executor blows up with an out-of-memory error. Is it possible to do this in Spark?
You can try lines.take(n) for different n to find the limit of your cluster.
Or:
spark.readStream.option("sep", ";").csv("filepath.csv")
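The idea from the question (read a bounded number of lines from the stream at a time) can be sketched as a small generator. This is plain Python and only covers the driver-side batching; `line_batches` is a hypothetical helper, and the in-memory stream stands in for whatever the custom filesystem's input stream returns.

```python
import io
from itertools import islice


def line_batches(stream, batch_size):
    """Yield lists of at most batch_size lines until the stream is exhausted."""
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch


# Demonstration with an in-memory stream of five lines, two lines per batch.
demo_stream = io.StringIO("r1\nr2\nr3\nr4\nr5\n")
batches = list(line_batches(demo_stream, 2))
```

Each batch could then be handed to sc.parallelize(batch) and processed before the next one is read. Note that with this approach all data still flows through the driver, which may become the bottleneck.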
