Spark wholeTextFiles(): java.lang.OutOfMemoryError: Java heap space - apache-spark

I'm processing a file of 400MB with spark.wholeTextFiles() and I keep getting out of memory error. I first used this API with a folder of files which has 40MB in total and I would like to know if my code works with large file, that's where comes the big file.
This is the configuration and I think I offered enough RAM for heap but still no luck and I'm just reading the folder and then write down with
files.saveAsTextFile("data/output/no")
and the command is
spark-submit --driver-memory 4G --driver-java-options -Xms4096m
--executor-memory 4G target/scala-2.11/mz_2.11-1.0.jar
I compared spark sql, sc.hadoopFile and sc.wholeTextFiles and wholeTextFiles is the fastest and I think that's because wholeTextFiles tries to load the whole folder into the memory of one node, the master I guess and everything happens at RAM, so it is fast.
HadoopFile() load by partition, which will be as many as files number, even if the files are small and this read action is expensive.
spark sql will load folder to partitions, the size of partition could be defined with
spark.conf.set("spark.sql.files.maxPartitionBytes", 32000000)
but if the files are small, it takes time to charge the files to each partition.
Q1. why do I keep getting out of memory error?
Q2. when spark load folder/big file by partition and return RDD, how
many partition has been read into the RAM? maybe non, and spark wait
for an action to load as many partitions as the number of
executor(or cores?) each time to treat? in that case, maybe we should
load big partition like 64MB or 128MB instead of small partition like
32kb?

Can you please the entire the code ?
The wholeTextFile() is used when the filePath and fileContent would be required.
Something like key -> filePath (C:\\fileName) and value -> actual fileContent.
The number of partitions when wholeTextFile() is used depends on how many executor cores you have.
Here the number of partitions will be 1 or more .
Unless an action is called spark job isn't triggered.
It's a bottom-top approach / lazy evaluation .

Related

PySpark OOM for multiple data files

I want to process several idependent csv files of similar sizes (100 MB) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one csv (note I used 35 different window functions):
logData = spark.read.csv("TypeA.csv", header=False,schema=schema)
// Compute moving avg. I used 35 different moving averages.
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
// Some other simple operations... No Agg, no sort
logData.write.parquet("res.pr")
This works great. However, i had two issues with scaling this job:
I tried to increase number of window functions to 50 the job OOMs. Not sure why PySpark doesn't spill to disk in this case, since window functions are independent of each other
I tried to run the job for 2 CSV files, it also OOMs. It is also not clear why it is not spilled to disk, since the window functions are basically partitioned by CSV files, so they are independent.
The question is why PySpark doesn't spill to disk in these two cases to prevent OOM, or how can I hint the Spark to do it?
If your machine cannot run all of these you can do that in sequence and write the data of each bulk of files before loading the next bulk.
I'm not sure if this is what you mean but you can try hint spark to write some of the data to your disk instead of keep it on RAM with:
df.persist(StorageLevel.MEMORY_AND_DISK)
Update if it helps
In theory, you could process all these 600 files in one single machine. Spark should spill to disk when meemory is not enough. But there're some points to consider:
As the logic involves window agg, which results in heavy shuffle operation. You need to check whether OOM happened on map or reduce phase. Map phase process each partition of file, then write shuffle output into some file. Then reduce phase need to fetch all these shuffle output from all map tasks. It's obvious that in your case you can't hold all map tasks running.
So it's highly likely that OOM happened on map phase. If this is the case, it means the memory per core can't process one signle partition of file. Please be aware that spark will do rough estimation of memory usage, then do spill if it thinks it should be. As the estatimation is not accurate, so it's still possible OOM. You can tune partition size by below configs:
spark.sql.files.maxPartitionBytes (default 128MB)
Usaually, 128M input needs 2GB heap with total 4G executor memory as
executor JVM heap execution memory (0.5 of total executor memory) =
(total executor memory - executor.memoryOverhead (default 0.1)) * spark.memory.storageFraction (0.6)
You can post all your configs in Spark UI for further investigation.

setting tuning parameters of a spark job

I'm relatively new to spark and I have a few questions related to the tuning optimizations with respect to the spark submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to define the number of cores, memory and cores when I have a relatively smaller operation to do as if I give maximum resources, it is going to be underutilised .
For instance,
if I have to just do a merge job (read files from hdfs and write one single huge file back to hdfs using coalesce) for about 60-70 GB (assume each file is of 128 mb in size which is the block size of HDFS) of data(in avro format without compression), what would be the ideal memory, no of executor and cores required for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't understand the concept of how much memory will be used up by the entire job provided there are no joins, aggregations etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data combining it and writing it out, then you will need very little memory per cpu because the dataset is never fully materialized before writing it out. If you're doing joins/group-by/other aggregate operations all of those will require much ore memory. The exception to this rule is that spark isn't really tuned for large files and generally is much more performant when dealing with sets of reasonably sized files. Ultimately the best way to get your answers is to run your job with the default parameters and see what blows up.

Spark - 54 GB CSV file transform to single JSON in 16 GB RAM single machine

I want to take a CSV file and transform into single JSON, I have written and verified the code. I have a CSV file of 54 GB and I want to transform and export this single file into single JSON, I want to take this data in Spark and it will design the JSON using SparkSQL collect_set(struct built-in functions.
I am running Spark job in Eclipse IDE in a single machine only. The machine configuration has 16 GB RAM, i5 Processor, 600 GB HDD.
Now when I have been trying to run the spark program it is throwing java.lang.OutOfMemory and insufficient heap size error. I tried to increase the spark.sql.shuffle.partitions value 2000 to 20000 but still the job is failing after loading and during the transformation due to the same error I have mentioned.
I don't want to split the single CSV into multiple parts, I want to process this single CSV, how can I achieve that? Need help. Thanks.
Spark Configuration:
val conf = new SparkConf().setAppName("App10").setMaster("local[*]")
// .set("spark.executor.memory", "200g")
.set("spark.driver.memory", "12g")
.set("spark.executor.cores", "4")
.set("spark.driver.cores", "4")
// .set("spark.testing.memory", "2147480000")
.set("spark.sql.shuffle.partitions", "20000")
.set("spark.driver.maxResultSize", "500g")
.set("spark.memory.offHeap.enabled", "true")
.set("spark.memory.offHeap.size", "200g")
Few observations from my side,
When you collect data at the end on driver it needs to have enough memory to hold your complete json output. 12g is not sufficient memory for that IMO.
200g executor memory is commented then how much was allocated? Executors too need enough memory to process/transform this heavy data. If driver was allocated with 12g and if you have total of 16 then only available memory for executor is 1-2gb considering other applications running on system. It's possible to get OOM. I would recommend find whether driver or executor is lacking on memory
Most important, Spark is designed to process data in parallel on multiple machines to get max throughput. If you wanted to process this on single machine/single executor/single core etc. then you are not at all taking the benefits of Spark.
Not sure why you want to process it as a single file but I would suggest revisit your plan again and process it in a way where spark is able to use its benefits. Hope this helps.

very small Batch processing with spark

We are working on a project, where we need to process some dataset which is very small, in fact, less than 100 rows in csv format. There are around 20-30 such jobs that process these kinds of datasets. But the load can grow in future, and it can reach into big data category. Is it fine to start with spark for these extra-small load, so that system remains scalable tomorrow? Or should we write a normal program for now in java/c# that runs on schedule? And in future if load of some of these tasks becomes really high, switch to spark?
Thanks in advance.
Absolutely fine,One thing to remember before running Job is to check memory and allocating memory based on size of data.
Say you have 10 cores , 50GB ram and initially you have csv files of 3kb or 1MB in size.Giving 50Gb ram and 10cores for 1Mb Files is a false approach,
Before you tigger the Job you should be carefull in allocating memory and number of executors.
For above csv files of 3Mb data you can give 2-cores at maximum and 5Gb of RAM to get job done.With the increase of size in data you can increase of usage of cores and memory.
Before you open sparkshell(Here I am using Pyspark and yarn as resource manager).This Can be done by example:
pyspark --master yarn --num-executors-memory <512M ,2G>
Thank you.

Spark container failed. Shall I trust the results I got?

I have a quite long Spark job only consisting of a map operation.
I tried to launch it several times with different number of partitions, executors, and the maximum amount of memory I could give (16G + 2G of overhead).
During my last attempt few executors were killed because of memory overheads, however, the output was produced and it seems ok (obviously, I couldn't check all the rows of my dataframe, though).
Moreover, I found a _SUCCESS file in the output directory.
Shall I trust the output I got?
I think output will be correct because you have _SUCCESS file and also if some of you executor dies because of out of memory spark is fault tolerant so the work load will be transfer to the other executor.

Resources