Will spark load data into in-memory if data is 10 gb and RAM is 1gb - apache-spark

If i have cluster of 5 nodes, each node having 1gb ram, now if my data file is 10gb distributed in all 5 nodes, let say 2gb in each node, now if i trigger
val rdd = sc.textFile("filepath")
rdd.collect
will spark load data into the ram and how spark will deal with this scenario
will it straight away deny or will it process it.

Lets understand the question first #intellect_dp you are asking, you have a cluster of 5 nodes (here the term "node" I am assuming machine which generally includes hard disk,RAM, 4 core cpu etc.), now each node having 1 GB of RAM and you have 10 GB of data file which is distributed in a manner, that 2GB of data is residing in the hard disk of each node. Here lets assume that you are using HDFS and now your block size at each node is 2GB.
now lets break this :
each block size = 2GB
RAM size of each node = 1GB
Due to lazy evaluation in spark, only when "Action API" get triggered, then only it will load your data into the RAM and execute it further.
here you are saying that you are using "collect" as an action api. Now the problem here is that RAM size is less than your block size, and if you process it with all default configuration (1 block = 1 partition ) of spark and considering that no further node will going to add up, then in that case it will give you out of memory exception.
now the question - is there any way spark can handle this kind of large data with the given kind of hardware provisioning?
Ans - yes, first you need to set default minimum partition :
val rdd = sc.textFile("filepath",n)
here n will be my default minimum partition of block, now as we have only 1gb of RAM, so we need to keep it less than 1gb, so let say we take n = 4,
now as your block size is 2gb and minimum partition of block is 4 :
each partition size will be = 2GB/4 = 500mb;
now spark will process this 500mb first and will convert it into RDD, when next chunk of 500mb will come, the first rdd will get spill to hard disk (given that you have set the storage level "MEMORY_AND_DISK_ONLY").
In this way it will process your whole 10 GB of data file with the given cluster hardware configuration.
Now I personally will not recommend the given hardware provisioning for such case,
as it will definitely process the data, but there are few disadvantages :
firstly it will involve multiple I/O operation making whole process very slow.
secondly if any lag occurs in reading or writing to the hard disk, your whole job will get discarded, you will go frustrated with such hardware configuration. In addition to it you will never be sure that will spark process your data and will be able to give result when data will increase.
So try to keep very less I/O operation, and
Utilize in memory computation power of spark with an adition of few more resources for faster performance.

When you use collect all the data send is collected as array only in driver node.
From this point distribution spark and other nodes does't play part. You can think of it as a pure java application on a single machine.
You can determine driver's memory with spark.driver.memory and ask for 10G.
From this moment if you will not have enough memory for the array you will probably get OutOfMemory exception.

In the otherhand if we do so, Performance will be impacted, we will not get the speed we want.
Also Spark store only results in RDD, so I can say result would not be complete data, any worst case if we are doing select * from tablename, it will give data in chunks , what it can affroad....

Related

PySpark OOM for multiple data files

I want to process several idependent csv files of similar sizes (100 MB) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one csv (note I used 35 different window functions):
logData = spark.read.csv("TypeA.csv", header=False,schema=schema)
// Compute moving avg. I used 35 different moving averages.
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
// Some other simple operations... No Agg, no sort
logData.write.parquet("res.pr")
This works great. However, i had two issues with scaling this job:
I tried to increase number of window functions to 50 the job OOMs. Not sure why PySpark doesn't spill to disk in this case, since window functions are independent of each other
I tried to run the job for 2 CSV files, it also OOMs. It is also not clear why it is not spilled to disk, since the window functions are basically partitioned by CSV files, so they are independent.
The question is why PySpark doesn't spill to disk in these two cases to prevent OOM, or how can I hint the Spark to do it?
If your machine cannot run all of these you can do that in sequence and write the data of each bulk of files before loading the next bulk.
I'm not sure if this is what you mean but you can try hint spark to write some of the data to your disk instead of keep it on RAM with:
df.persist(StorageLevel.MEMORY_AND_DISK)
Update if it helps
In theory, you could process all these 600 files in one single machine. Spark should spill to disk when meemory is not enough. But there're some points to consider:
As the logic involves window agg, which results in heavy shuffle operation. You need to check whether OOM happened on map or reduce phase. Map phase process each partition of file, then write shuffle output into some file. Then reduce phase need to fetch all these shuffle output from all map tasks. It's obvious that in your case you can't hold all map tasks running.
So it's highly likely that OOM happened on map phase. If this is the case, it means the memory per core can't process one signle partition of file. Please be aware that spark will do rough estimation of memory usage, then do spill if it thinks it should be. As the estatimation is not accurate, so it's still possible OOM. You can tune partition size by below configs:
spark.sql.files.maxPartitionBytes (default 128MB)
Usaually, 128M input needs 2GB heap with total 4G executor memory as
executor JVM heap execution memory (0.5 of total executor memory) =
(total executor memory - executor.memoryOverhead (default 0.1)) * spark.memory.storageFraction (0.6)
You can post all your configs in Spark UI for further investigation.

Spark OOM error explanation and alleviation

Sometimes, you will get an OutOfMemoryError not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
I think it this way, please correct me if I am wrong.
Suppose there are 2 Data Nodes to process the Dataset and both these nodes collectively has a memory of 32GB(16 GB per Data Node). The data set size is 100 GB and let us suppose this data, when read by spark, is partitioned into 10 partitions of 10GB each. It is obvious that the 100GB file cannot be fit into 32 GB RAM at a time. so the partitions have to be loaded into memory and processed in a iterative manner. so I assume as below.
first iteration, 2 partitions, 10GB each are loaded into memory on each data node.
second iteration, 2 partitions, 10GB each are loaded into memory on each data node.
....
....
Fifth iteration, 2 partitions, 10GB each are loaded into memory on each data node.
If this is how the spark is processing, during every iteration, only 2 partitions are loaded into memory. Does that mean, the other partitions which were unable to be accommodated in memory, were read but spilled to disk and they are waiting for the memory to be freed? or those partitions are not read at all and they will be read only when the resources are available. which is true?
During processing if there is a need to groupby/reduceby/join, then it mandates a shuffle. so if one of the shuffle partition is greater than RAM size then the job will fail with OOM error. Example, 10 partitions were processed and shuffled. Now the shuffle partitions are only 4 partitions with 25GB each.
(Default shuffle partitions are 200, but only 4 partitions have the total data remaining are empty.) since the shuffle partition size is greater than 16MB RAM, will the spark job fail? Is my understanding correct?
I understand that, you do not really need that your data fit in memory. Spark processes the data on partition basis. But My question is what if the partition itself is not fitting in memory. Would it still spill the data to disk and start processing or it will fail with OOM error?
The second question I have is, If another spark job(Job2) is triggered during the above spark job(job1) is under execution, and suppose this is also having 100GB file to process with 10 partitions of 10GB each. so when job1 Iteration1 is under execution, there is only 6 MB free slot available in the memory. The job2's partition of 10GB cannot be loaded into memory for processing job2. so will the Job2 wait till the memory is freed up? or will this job also fail with OOM error?
The explanation (bold) is correct.
On your comments:
Unless you explicitly repartition, your partitions will be HDFS block size related, the 128MB size and as many that make up that file.
Then you have number of executors, say 2, per Worker / Data Node. Then max 4 tasks / partitions will be active at any given time.
What would be the point of loading all partitions to memory if you can service at most 4? You would be clogging up the system to the detriment of other Spark Apps. This is all like a normal OS then.
Of course it is a bit more complicated, e.g. if you have 10 Data Nodes and allocation only 2 Executors, there is traffic to move stuff about. Just keeping it simple.
OOM errors only occur if a partition exceeds max partition size. For the rest disk space is needed for spilling.

pyspark java.lang.OutOfMemoryError: GC overhead limit exceeded

I'm trying to process, 10GB of data using spark it is giving me this error,
java.lang.OutOfMemoryError: GC overhead limit exceeded
Laptop configuration is: 4CPU, 8 logical cores, 8GB RAM
Spark configuration while submitting the spark job.
spark = SparkSession.builder.master('local[6]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", 1)
spark.conf.set("spark.executor.cores", 5)
After searching internet about this error, I have few questions
If answered that would be a great help.
1) Spark is in memory computing engine, for processing 10 gb of data, the system should have 10+gb of RAM. Spark loads 10gb of data into 10+ gb RAM memory and then do the job?
2) If point 1 is correct, how big companies are processing 100s of TBs of data, are they processing 100TB of data by clustering multiple systems to form 100+TB RAM and then process 100TB of data?
3) Is their no other way to process 50gb of data with 8gb RAM and 8Cores, by setting proper spark configurations? If it is what is the way and what should be the spark configurations.
4) What should be ideal spark configuration if the system properites are 8gb RAM and 8 Cores? for processing 8gb of data
spark configuration to be defined in spark config.
spark = SparkSession.builder.master('local[?]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", ?)
spark.conf.set("spark.executor.cores", ?)
spark.executors.cores = ?
spark.executors.memory = ?
spark.yarn.executor.memoryOverhead =?
spark.driver.memory =?
spark.driver.cores =?
spark.executor.instances =?
No.of core instances =?
spark.default.parallelism =?
I hope the following will help if not clarify everything.
1) Spark is an in-memory computing engine, for processing 10 GB of data, the system should have 10+gb of RAM. Spark loads 10gb of data into 10+ GB RAM memory and then do the job?
Spark being an in-memory computation engine take the input/source from an underlying data lake or distributed storage system. The 10Gb file will be broken into smaller blocks (128Mb or 256Mb block size for Hadoop based data lake) and Spark Driver will get many executor/cores to read them from the cluster's worker node. If you try to load 10Gb data with laptop or with a single node, it will certainly go out of memory. It has to load all the data either in one machine or in many slaves/worker-nodes before it is processed.
2) If point 1 is correct, how big companies are processing 100s of TBs of data, are they processing 100TB of data by clustering multiple systems to form 100+TB RAM and then process 100TB of data?
The large data processing project design the storage and access layer with a lot of design patterns. They simply don't dump GBs or TBs of data to file system like HDFS. They use partitions (like sales transaction data is partition by month/week/day) and for structured data, there are different file formats available (especially columnar) which helps to lad only those columns which are required for processing. So right file format, partitioning, and compaction are the key attributes for large files.
3) Is their no other way to process 50gb of data with 8gb RAM and 8Cores, by setting proper spark configurations? If it is what is the way and what should be the spark configurations.
Very unlikely if there is no partition but there are ways. It also depends on what kind of file it is. You can create a custom stream file reader that can read the logical block and process it. However, the enterprise doesn't read 50Gb of one file which is one single unit. Even if you load an excel file of 10Gb in your machine via Office tool, it will go out of memory.
4) What should be ideal spark configuration if the system properties are 8gb RAM and 8 Cores? for processing 8gb of data
Leave 1 core & 1-Gb or 2-GB for OS and use the rest of them for your processing. Now depends on what kind of transformation is being performed, you have to decide the memory for driver and worker processes. Your driver should have 2Gb of RAM. But laptop is primarily for the playground to explore the code syntax and not suitable for large data set. Better to build your logic with dataframe.sample() and then push the code to bigger machine to generate the output.

How Spark internally works when reading HDFS files

Say I have a file of 256 KB is stored on HDFS file system of one node (as two blocks of 128 KB each). This file internally contains two blocks
of 128 KB each. Assume I have two nodes cluster of each 1 core only. My understanding is that spark during transformation will read complete file
on one node in memory and then transfer one file block memory data to other node so that both nodes/cores can parallely execute it ? Is that correct ?
What if both nodes had two core each instead of one core ? In that case two cores on single node could do the computation ? Is that right ?
val text = sc.textFile("mytextfile.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
You question is a little hypothetical as it is unlikely you would have an Hadoop Cluster with HDFS existing with only one Data Node and 2 Worker Nodes - one being both Worker and Data Node. That is to say, the whole idea of Spark (and MR) with HDFS is to bring the processing to the data. The Worker Nodes are in fact the Data Nodes in the standard Hadoop set up. This is the original intent.
Some variations to answer your question:
Assuming the case as per above described, each Worker Node would process one partition and subsequent transformations on the newer generated RDDs until finished. You may of course repartition the data and what happens depends on the number of partitions and number of Executors per Worker Node.
In a nutshell: if you have N blocks / partitions initially and less than N Executors allocated - E - on a Hadoop Cluster with HDFS, then you will get some transfer of blocks (not a shuffle as is talked about elsewhere) to the Workers assigned, from Workers where no Executor was allocated to you Spark App, otherwise the block is assigned to be processed to that Data / Worker Node, obviously. Each block / partition is processed in some way, shuffled and the next set of Partitions or Partition read in and processed, depending on speed of processing for your transformation(s).
In the case of AWS S3 and Mircosoft's and gooogle's equivalent Cloud Storage which leave aside the principle of data locality as in the above case - i.e. compute power is divorced from storage, with the assumption that the network is not the bottleneck - which was exactly the Hadoop classic reason to bring the processing to the data, then it works similarly to the aforementioned, i.e. transfer of S3 data to Workers.
All of this assume an Action has been invoked.
I leave aside the principles of Rack Awareness, etc. as it becomes all quite complicated, but the Resource Managers understand these things and decide accordingly.
In the first case, Spark will usually load 1 partition on the first node and then if it cannot find an empty core, it will load the 2nd partition on the 2nd node after waiting for spark/locality.wait (default 3 seconds).
In the 2nd case both partitions will be loaded on the same node unless it does not have both cores free.
Many circumstances can cause this to change if you play with the default configurations.

Apache Spark running out of memory with smaller amount of partitions

I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?
The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.

Resources