Spark to read a big file as inputstream - apache-spark

I know Spark's built-in textFile method can partition a huge file, read it in chunks, and distribute it as an RDD.
However, I am reading the file from a customized encrypted filesystem which Spark does not support natively. One way I can think of is to read an InputStream instead, load multiple lines at a time, and distribute them to the executors, and keep reading until the whole file is loaded, so that no executor blows up with an out-of-memory error. Is it possible to do this in Spark?

You can try lines.take(n) for different n to find the limit of your cluster.
Or:
spark.readStream.option("sep", ";").csv("filepath.csv")
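As a rough, non-authoritative sketch of the chunked approach described in the question: it assumes a hypothetical openEncryptedStream() helper that returns a java.io.InputStream for the custom filesystem, and the path and batch size are placeholders. The driver reads a bounded number of lines at a time and parallelizes each batch, so no single executor has to hold the whole file.
import java.io.{BufferedReader, InputStreamReader}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// sc is an existing SparkContext; openEncryptedStream() is a placeholder for
// whatever API the custom encrypted filesystem exposes to obtain an InputStream.
val reader = new BufferedReader(new InputStreamReader(openEncryptedStream("/secure/bigfile.txt")))
val batchSize = 100000                                   // tune to what the driver can buffer
val batches = ArrayBuffer.empty[RDD[String]]
val buffer = ArrayBuffer.empty[String]
var line = reader.readLine()
while (line != null) {
  buffer += line
  if (buffer.size >= batchSize) {
    batches += sc.parallelize(buffer.toList)             // distribute this chunk to the executors
    buffer.clear()
  }
  line = reader.readLine()
}
if (buffer.nonEmpty) batches += sc.parallelize(buffer.toList)
reader.close()
val allLines = sc.union(batches)                         // one RDD over all the chunks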

Related

Partitioning a large binary file using pyspark

I am trying to process a large binary file using PySpark, but I always get an OutOfMemoryError. I have tried all the usual remedies, such as increasing executor/driver memory and repartitioning the RDD. Will a single large binary file get partitioned in Spark? If not, how can we process binary files? The binary file I am currently working with is more than 2GB.

Partitioning strategy in Parquet and Spark

I have a job that reads CSV files, converts them into data frames, and writes them out as Parquet. I am using append mode while writing the data, so each write generates a separate Parquet file. My questions are:
1) If a new file gets appended every time I write the data, will it impact read performance (as the data is now spread across Parquet files of varying size)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need a custom partitioning strategy to implement point 2?
I am using Spark 2.3.
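For reference, a minimal sketch of the append-mode Parquet write described above; the DataFrame name and output path are placeholders.
// each call appends a new set of Parquet files under the same path
df.write.mode("append").parquet("/data/output/parquet")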
It will affect read performance if spark.sql.parquet.mergeSchema=true, because in that case Spark needs to visit each file and read its schema. In other cases, I believe it does not affect read performance much.
There is no way to generate partitions purely based on data size. You may use repartition or coalesce; the latter creates uneven output files, but is more performant because it avoids a full shuffle.
You also have the config spark.sql.files.maxRecordsPerFile, or the writer option maxRecordsPerFile, to keep individual files from growing too large, but usually this is not an issue.
Yes, I think Spark has no built-in API to distribute data evenly by size. Column Statistics and SizeEstimator may help with this.
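A hedged sketch of the maxRecordsPerFile knob mentioned above (the option exists since Spark 2.2); the record count and output path are placeholders.
// per-writer option: cap each output file at roughly one million records
df.write
  .mode("append")
  .option("maxRecordsPerFile", 1000000)
  .parquet("/data/output/parquet")

// or as a session-wide config
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000L)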

Importing a large text file into Spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If I'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines whether the file is splittable by checking if the codec is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so that at least the rest of your transformations process parts of the data in parallel (a minimal sketch of this appears after the links below).
2. Ask for multiple files instead of just a single GZIP file.
3. Write an application that decompresses and splits the file into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.
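A minimal sketch of option 1 above, assuming a placeholder S3 path; the partition count is illustrative and depends on your cluster.
// gzip is not splittable, so textFile yields a single partition
val lines = sc.textFile("s3://my-bucket/huge-file.gz")
// shuffle once so the expensive transformations downstream run in parallel
val repartitioned = lines.repartition(200)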
First, I suggest using the ORC format with zlib compression: you get almost 70% compression and, as per my research, ORC is the most suitable file format for fast data processing. So load your file and simply write it back out in ORC format with a repartition:
df.repartition(500).write.format("orc").option("compression","zlib").mode("overwrite").save("testoutput.orc")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file into the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR cluster).
A nice thing about S3DistCp is that you can change the codec of the output file and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long it will take S3DistCp to perform the operation (it is a Hadoop Map/Reduce job over S3; it benefits from optimised S3 libraries when run from an EMR cluster, but I am concerned that Hadoop will face the same limitations as Spark when generating the map tasks).

Spark - loading a directory with 500G of data

I am fairly new to Spark and distributed computing. It's all very straightforward to load a csv or text file that can fit into your driver memory.
But here I have a real scenario and I am finding it difficult to figure out the approach.
I am trying to access around 500G of data in S3, made up of zip files.
Since these are zip files, I am using ZipFileInputFormat as detailed here. It makes sure the files are not split across partitions.
Here is my code
val sc = new SparkContext(conf)
val inputLocation = args(0)
// ZipFileInputFormat yields one record per zip entry: (file name, file bytes)
val emailData = sc.newAPIHadoopFile(inputLocation, classOf[ZipFileInputFormat], classOf[Text], classOf[BytesWritable])
// keep only the .txt entries and decode their contents to strings
val filesRDD = emailData.filter(_._1.toString().endsWith(".txt")).map(x => new String(x._2.getBytes))
This runs fine on an input of a few hundred MB, but as soon as it crosses the memory limit of my cluster I get the OutOfMemory issue.
What is the correct way to approach this?
- Should I create an RDD for each zip file, save the output to a file, and load all the outputs into a separate RDD later?
- Is there a way to load the base directory into the Spark context and have it partitioned?
I have an HDP cluster with 5 nodes and a master, each having 15G of memory.
Any answers/pointers/links are highly appreciated
Zip files are not splittable, so processing individual files won't do you any good. If you want it to scale out, you should avoid them completely, or at least put a hard limit on the size of the archives.

high cache size for RDD in apache Spark

I am reading a text file of ~20MB consisting of rows of space-separated integers, converting it to an RDD and caching it. On caching, I observed that it consumed ~200MB of RAM!
I don't understand why caching consumes such a high amount of RAM (10x the file size).
val filea = sc.textFile("a.txt")
// split each row on spaces and parse the tokens to Int
val fileamapped = filea.map(_.split(" ").map(_.toInt))
// cache the parsed RDD as deserialized Java objects in memory
fileamapped.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
// force evaluation (and caching) by pulling the data to the driver
fileamapped.collect()
I am running Spark in local interactive mode (spark-shell) and reading datafile from HDFS.
Questions
What are the reasons behind the high RAM use for caching?
Is there a way I can read integers directly from the file? sc.textFile gives me an RDD[String].
I checked fileamapped with the estimate() method and it reports ~64MB; would that be the size of the Java objects?
thanks,
