Spark - loading a directory with 500G of data - apache-spark

I am fairly new to Spark and distributed computing. It's all very straightforward to load a CSV or text file that can fit into your driver memory.
But here I have a real scenario and I am finding it difficult to figure out the approach.
I am trying to access around 500G of data in S3, made up of zip files.
Since these are zip files, I am using ZipFileInputFormat as detailed here. It makes sure the files are not split across partitions.
Here is my code
val sc = new SparkContext(conf)
val inputLocation = args(0)
val emailData = sc.newAPIHadoopFile(inputLocation, classOf[ZipFileInputFormat], classOf[Text], classOf[BytesWritable]);
val filesRDD = emailData.filter(_._1.toString().endsWith(".txt")).map( x => new String(x._2.getBytes))
This runs fine on an input of a few hundred MB, but as soon as the input exceeds the memory limit of my cluster, I get an OutOfMemory error.
What is the correct way to approach this issue?
- Should I create an RDD for each zip file, save the output to a file, and load all the outputs into a separate RDD later?
- Is there a way to load the base directory into the Spark context and have it partitioned?
I have an HDP cluster with 5 nodes and a master, each with 15G of memory.
Any answers/pointers/links are highly appreciated

Zip files are not splittable, so processing individual files won't do you any good. If you want this to scale out, you should avoid zip archives completely, or at least hard-limit the size of the archives.
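If the archives themselves are reasonably sized, one hedged workaround (a minimal sketch, assuming the zips live under a path Spark can list, e.g. an s3a:// prefix) is to read each archive as a single binary stream and unzip it on the executors, so the extracted text never has to sit on the driver; each task still needs enough memory for one archive's contents, which is why bounding the archive size matters:

import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.input.PortableDataStream

// Hypothetical input path; one (path, stream) pair per zip archive.
val zips = sc.binaryFiles("s3a://mybucket/emails/*.zip")

val filesRDD = zips.flatMap { case (path, stream: PortableDataStream) =>
  val zis = new ZipInputStream(stream.open())
  // Walk the entries of this archive and keep only the .txt files.
  Iterator.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .filter(_.getName.endsWith(".txt"))
    .map(_ => Source.fromInputStream(zis, "UTF-8").mkString) // reads the current entry
    .toList // force reading before the underlying stream goes away
}

This sidesteps the splittability problem across archives, but it does nothing for one huge zip, which still lands in a single task.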

Related

How to write outputs of spark streaming application to a single file

I'm reading data from Kafka using Spark Streaming and passing each record to a Python file for prediction. It returns the predictions as well as the original data. It saves the original data with its predictions to a file; however, it creates a separate file for each RDD.
I need a single file containing all the data collected up until I stop the program.
I have tried writeStream; it does not create even a single file.
I have tried saving to Parquet with append, but it creates multiple files, one for each RDD.
I tried writing with append mode and still got multiple files as output.
The code below creates a folder output.csv and puts all the files into it:
def main(args: Array[String]): Unit = {
  val ss = SparkSession.builder()
    .appName("consumer")
    .master("local[*]")
    .getOrCreate()
  val scc = new StreamingContext(ss.sparkContext, Seconds(2))
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "group.id" -> "group5"
  )
  // mappedData and pyPath are defined elsewhere in the application
  mappedData.foreachRDD(
    x =>
      x.map(y =>
        ss.sparkContext.makeRDD(List(y)).pipe(pyPath).toDF().repartition(1)
          .write.format("csv").mode("append").option("truncate", "false")
          .save("output.csv")
      )
  )
  scc.start()
  scc.awaitTermination()
}
I need to get just one file with all the records collected one by one while streaming.
Any help will be appreciated; thank you in advance.
You cannot modify a file in HDFS once it has been written. If you wish to write to the file in real time (appending the blocks of data from the streaming job to the same file every 2 seconds), that simply isn't allowed, as HDFS files are immutable. I suggest you instead write read logic that reads from multiple files, if possible.
However, if you must read from a single file, I suggest one of the two approaches below, after you have written the output to a single CSV/Parquet folder with "Append" SaveMode (which creates part files for each block you write every 2 seconds).
You can create a Hive table on top of this folder and read the data from that table, as sketched below.
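A minimal sketch of that first option (assuming Hive support is enabled on the SparkSession, and using a hypothetical folder and schema matching whatever columns your job writes):

// One-time setup: an external table that simply overlays the streaming output folder.
ss.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS stream_output (
    message STRING,
    prediction DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'hdfs:///data/output.csv'
""")

// Every query sees all part files written so far, as if they were one file.
ss.sql("SELECT * FROM stream_output").show(false)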
You can write simple logic in Spark to read this folder of multiple files, write it to another HDFS location as a single file using repartition(1) or coalesce(1), and then read the data from that location. See below:
spark.read.csv("oldLocation").coalesce(1).write.csv("newLocation")
repartition - it is recommended to use repartition when increasing the number of partitions, because it involves shuffling all the data.
coalesce - it is recommended to use coalesce when reducing the number of partitions. For example, if you have 3 partitions and you want to reduce to 2, coalesce will move the 3rd partition's data into partitions 1 and 2; partitions 1 and 2 remain in the same container. Repartition, by contrast, shuffles data across all partitions, so network usage between executors is high and it hurts performance.
Performance-wise, coalesce performs better than repartition when reducing the number of partitions.
So when writing, use coalesce.
For example: df.coalesce(1).write.csv("newLocation")

Spark driver running out of memory when reading multiple files

My program works like this:
Read in a lot of files as dataframes. Among those files there is a group of about 60 files with 5k rows each, where I create a separate Dataframe for each of them, do some simple processing and then union them all into one dataframe which is used for further joins.
I perform a number of joins and column calculations on a number of dataframes, which finally results in a target dataframe.
I save the target dataframe as a Parquet file.
In the same spark application, I load that Parquet file and do some heavy aggregation followed by multiple self-joins on that dataframe.
I save the second dataframe as another Parquet file.
The problem
If I have just one file instead of 60 in the group of files I mentioned above, everything works with the driver having 8g of memory. With 60 files, the first 3 steps work fine, but the driver runs out of memory when preparing the second file. Things improve only when I increase the driver's memory to 20g.
The Question
Why is that? When calculating the second file I do not use the DataFrames that were used to calculate the first file, so their number and content should not really matter if the size of the first Parquet file remains constant, should it? Do those 60 dataframes get cached somehow and occupy the driver's memory? I don't do any caching myself. I also never collect anything. I don't understand why 8g of memory would not be sufficient for the Spark driver.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//you have to use serialization configuration if you are using MEMORY_AND_DISK_SER
val rdd1 = sc.textFile("some data")
rdd1.persist(storageLevel.MEMORY_AND_DISK_SER) // marks rdd as persist
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.persist(storageLevel.MEMORY_AND_DISK_SER)
rdd3.persist(storageLevel.MEMORY_AND_DISK_SER)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
rdd2.unpersist()
rdd3.unpersist()
For tuning your code, follow this link.
Caching and persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results are kept, as RDDs, in memory (the default) or in more durable storage such as disk, and/or replicated.
RDDs can be cached using the cache operation. They can also be persisted using the persist operation.
The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
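A small illustration of that equivalence (a sketch only; the storage level of an RDD cannot be changed once set, so pick one call per RDD):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("some data")

// These two are equivalent ways to mark the RDD for in-memory caching:
lines.cache()
// lines.persist(StorageLevel.MEMORY_ONLY)

// persist also accepts other levels, e.g. spilling to disk when memory is tight:
// lines.persist(StorageLevel.MEMORY_AND_DISK_SER)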
Refer to the use of persist and unpersist.

Spark to read a big file as inputstream

I know Spark's built-in textFile method can partition a huge file, read it in chunks, and distribute it as an RDD.
However, I am reading from a customized encrypted filesystem which Spark does not support natively. One way I can think of is to read an InputStream instead, load multiple lines, and distribute them to the executors, and keep reading until the whole file is loaded, so that no executor blows up with an out-of-memory error. Is it possible to do this in Spark?
You can try lines.take(n) for different values of n to find the limit of your cluster.
or
spark.readStream.option("sep", ";").csv("filepath.csv")
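If the encrypted filesystem only hands you a plain java.io.InputStream on the driver, one hedged approach (a sketch; openDecryptedStream and processLine are placeholders for your own decryption and per-line logic) is to read the stream in bounded batches of lines and ship each batch to the cluster with parallelize, so neither the driver nor any single executor ever holds the whole file:

import scala.io.Source

// Placeholder: however your filesystem exposes an InputStream for the encrypted file.
val in = openDecryptedStream("/secure/bigfile.txt")
val lines = Source.fromInputStream(in, "UTF-8").getLines()

val batchSize = 100000 // tune to what the driver can comfortably buffer

lines.grouped(batchSize).zipWithIndex.foreach { case (batch, i) =>
  // Each batch becomes a small RDD that is processed and written out, then dropped,
  // so memory use stays bounded on both the driver and the executors.
  sc.parallelize(batch)
    .map(processLine) // placeholder for your per-line transformation
    .saveAsTextFile(s"hdfs:///output/batch-$i")
}

The trade-off is that the batches are read sequentially on the driver, so this bounds the memory footprint rather than improving read throughput.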

Is reading a CSV file from S3 into a Spark dataframe expected to be so slow?

I am building an application that needs to load data sets from S3. The functionality is working correctly, but the performance is surprisingly slow.
The datasets are in CSV format. There are approximately 7M records (lines) in each file, and each file is 600-700MB.
val spark = SparkSession
  .builder()
  .appName("MyApp")
  .getOrCreate()

val df = spark
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(inFileName: _*)
// inFileName is a list that currently contains 2 file names
// e.g. s3://mybucket/myfile1.csv

val r = df.rdd.flatMap { row =>
  /*
   * Discard poorly formatted input records
   */
  try {
    totalRecords.add(1)
    // this extracts several columns from the dataset
    // each tuple of indexColProc specifies the index of the column to
    // select from the input row, and a function to convert
    // the value to an Int
    val coords = indexColProc.map { case (idx, func) => func(row.get(idx).toString) }
    List((coords(0), coords))
  }
  catch {
    case e: Exception => {
      badRecords.add(1)
      List()
    }
  }
}

println("Done, row count " + r.count)
I ran this on an AWS cluster of 5 machines, each an m3.xlarge. The maximizeResourceAllocation parameter was set to true, and this was the only application running on the cluster.
I ran the application twice: the first time with 'inFileName' pointing at the files on S3, and the second time pointing at a local copy of the files in the Hadoop filesystem.
When I look at the Spark history server and drill down to the job that corresponds to the final r.count action, I see that it takes 2.5 minutes when accessing the files on S3, and 18s when accessing the files locally on HDFS. I've gotten proportionally similar results when I run the same experiment on a smaller cluster or in master=local configuration.
When I copy the s3 files to the cluster using
aws s3 cp <file>
It only takes 6.5s to move one 600-700MB file, so it doesn't seem that the raw I/O of the machine instance is contributing much to the slowdown.
Is this kind of slow performance when accessing S3 expected? If not, could someone please point out where I'm going wrong? If it is expected, are there other ways to do this that would have better performance? Or do I need to develop something to simply copy the files over from S3 to HDFS before the application runs?
After some more digging I discovered that using the S3 native filesystem makes a huge difference. I just changed the URI prefix to s3n:// and the performance for the job in question went from 2.5 minutes down to 21s. So only a 3s penalty for accessing S3 vs HDFS, which is pretty reasonable.
When searching for this topic there are many posts that mention s3n has a max file size limit of 5GB. However, I came across this, which says that the max file size limit was increased to 5TB in Hadoop 2.4.0.
"Using the S3 block file system is no longer recommended."
We faced the exact same issue about a couple of months ago, except that our data was 1TB so the issue was more pronounced.
We dug into it and finally came to the following conclusion:
Since we had 5 instances with 30 executors each, every time a stage was scheduled the first thing every task would do was fetch data from S3, so all of these tasks would be bottlenecked on network bandwidth; then they would all move to the compute part of the task at the same time and contend for CPU simultaneously.
So basically, because the tasks are all doing the same thing at the same time, they are always contending for the same resources.
We figured out that allowing only k tasks at any point would let them finish downloading quickly and move to the compute part, and the next set of k tasks could then come in and start downloading. This way, k tasks (as opposed to all of them) get full bandwidth, while some tasks are simultaneously doing something useful on CPU or I/O without waiting on each other for some common resource.
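One hedged way to cap that number k (a sketch with made-up values, not our exact settings) is to give each executor fewer cores, so fewer tasks hit S3 at the same time:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyApp")
  // Hypothetical sizing: 5 executors x 2 cores = at most 10 tasks downloading
  // from S3 concurrently, instead of every core on every instance at once.
  .set("spark.executor.instances", "5")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "8g")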
Hope this helps.
Did you try the spark-csv package? It has a lot of optimizations for reading CSV, and you can use mode=DROPMALFORMED to drop the bad lines you are trying to filter out. You can read from S3 directly like this:
csv_rdf <- read.df(sqlContext, "s3n://xxxxx:xxxxx@foldername/file1.csv", source = "com.databricks.spark.csv")
More details can be found here https://github.com/databricks/spark-csv

How does Spark parallelize the processing of a 1TB file?

Imaginary problem
A gigantic CSV log file, let's say 1 TB in size; the file is located on a USB drive.
The log contains activity logs of users around the world; let's assume each line contains 50 columns, among them Country.
We want a line count per country, in descending order.
Let's assume the Spark cluster has enough nodes with RAM to process the entire 1TB in memory (20 nodes, 4 cores CPU, each node has 64GB RAM)
My Poorman's conceptual solution
Using SparkSQL & Databricks spark-csv
$ ./spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
val dfBigLog = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/media/username/myUSBdrive/bogusBigLog1TB.log")
dfBigLog.select("Country")
.groupBy("Country")
.agg(count($"Country") as "CountryCount")
.orderBy($"CountryCount".desc).show
Question 1: How does Spark parallelize the processing?
I suppose the majority of the execution time (99% ?) of the above solution is to read the 1TB file from the USB drive into the Spark cluster. Reading the file from the USB drive is not parallelizable. But after reading the entire file, what does Spark do under the hood to parallelize the processing?
How many nodes used for creating the DataFrame? (maybe only one?)
How many nodes used for groupBy & count? Let's assume there are 100+ countries (but Spark doesn't know that yet). How would Spark partition to distribute the 100+ country values on 20 nodes?
Question 2: How to make the Spark application the fastest possible?
I suppose the area of improvement would be to parallelize the reading of the 1TB file.
Convert the CSV File into a Parquet file format + using Snappy compression. Let's assume this can be done in advance.
Copy the Parquet file onto HDFS. Let's assume the Spark cluster is within the same Hadoop cluster and the datanodes are independent from the 20-node Spark cluster.
Change the Spark application to read from HDFS. I suppose Spark would now use several nodes to read the file as Parquet is splittable.
Let's assume the Parquet file compressed by Snappy is 10x smaller, size = 100GB, HDFS block = 128 MB in size. Total 782 HDFS blocks.
But then how does Spark manage to use all the 20 nodes for both creating the DataFrame and the processing (groupBy and count)? Does Spark use all the nodes each time?
Question 1: How does Spark parallelize the processing (of reading a file from a USB drive)?
This scenario is not possible.
Spark relies on a Hadoop-compliant filesystem to read a file. When you mount the USB drive, you can only access it from the local host. Attempting to execute
.load("/media/username/myUSBdrive/bogusBigLog1TB.log")
will fail in cluster configuration, as executors in the cluster will not have access to that local path.
It would be possible to read the file with Spark in local mode (master=local[*]), in which case you would only have one host, and hence the rest of the questions would not apply.
Question 2: How to make the Spark application the fastest possible?
Divide and conquer.
The strategy outlined in the question is good. Using Parquet will allow Spark to do a projection on the data and read only the Country column (.select("Country")), further reducing the amount of data that has to be ingested and hence speeding things up.
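A minimal sketch of that read path (assuming the Parquet copy lives at a hypothetical HDFS location); because Parquet is columnar, only the Country column actually gets scanned:

import org.apache.spark.sql.functions.count
import sqlContext.implicits._ // for the $"..." column syntax outside spark-shell

val dfBigLog = sqlContext.read
  .parquet("hdfs:///data/bogusBigLog1TB.parquet")

dfBigLog.select("Country")
  .groupBy("Country")
  .agg(count($"Country") as "CountryCount")
  .orderBy($"CountryCount".desc)
  .show()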
The cornerstone of parallelism in Spark is the partition. Again, as we are reading from a file, Spark relies on the Hadoop filesystem. When reading from HDFS, the partitioning will be dictated by the splits of the file on HDFS. Those splits will be evenly distributed among the executors. That's how Spark will initially distribute the work across all available executors for the job.
I'm not deeply familiar with the Catalyst optimizations, but I think I can assume that .groupBy("Country").agg(count($"Country")) will become something similar to rdd.map(country => (country, 1)).reduceByKey(_ + _).
The map operation will not affect partitioning, so it can be applied in place.
The reduceByKey will first combine partial results locally within each partition, and those partial results will then be combined across executors in a shuffle; only the small final result (one row per country) reaches the driver when show is called. So most of the counting happens distributed in the cluster, and only the final collection is centralized.
Reading the file from the USB drive is not parallelizable.
USB drive or any other data source, the same rules apply: either the source is accessible from the driver and all worker machines, and the data is accessed in parallel (up to the source's limits), or the data is not accessed at all and you get an exception.
How many nodes used for creating the DataFrame? (maybe only one?)
Assuming that the file is accessible from all machines, it depends on the configuration. For starters, you should take a look at the split size.
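For example (a hedged sketch, assuming a Spark 2.x session named spark; the exact knob depends on how you read the data):

// DataFrame/Dataset file sources: cap the bytes read into a single partition.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024) // 128 MB

// RDD-based Hadoop input formats use the Hadoop split size settings instead.
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 128 * 1024 * 1024)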
How many nodes used for the GroupBy & Count?
Once again, it depends on the configuration.