how does Spark handle more memory than its capacity - apache-spark

Say my Spark cluster has 100G memory, during the Spark computing process, more data (new dataframes, caches) with a size of 200G are generated. In this case, will Spark store some of this data on Disk or it will just OOM?

Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads in data in partitions - the number of concurrently loaded partitions depend on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task.
If you apply no transformation but only do for instance a count, Spark will still read in the data in partitions, but it will not store any data in your cluster and if you do the count again it will read in all the data once again. To avoid reading in data several times, you might call cache or persist in which case Spark will try to store the data in you cluster. On cache (which is the same as persist(StorageLevel.MEMORY_ONLY) it will store all partitions in memory - if it doesn't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK) it will store as much as it can in memory and the rest will be put on disk. If data doesn't fit on disk either the OS will usually kill your workers.
In Apache Spark if the data does not fits into the memory then Spark simply persists that data to disk. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
The persist method in Apache Spark provides six persist storage level to persist the data.
MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER
(Java and Scala), MEMORY_AND_DISK_SER
(Java and Scala), DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, OFF_HEAP.
The OFF_HEAP storage is under experimentation.

Related

Spark SQL data storage life cycle

I recently had a issue with with one of my spark jobs, where I was reading a hive table having several billion records, that resulted in job failure due to high disk utilization, But after adding AWS EBS volume, the job ran without any issues. Although it resolved the issue, I have few doubts, I tried doing some research but couldn't find any clear answers. So my question is?
when a spark SQL reads a hive table, where the data is stored for processing initially and what is the entire life cycle of data in terms of its storage , if I didn't explicitly specify anything? And How adding EBS volumes solves the issue?
Spark will read the data, if it does not fit in memory, it will spill it out on disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion will start as soon as you read(), it is not part of the DAG, you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In latest versions of Spark, you can push down the filter (for example if you filter right after the ingestion, Spark will know and optimize the ingestion), I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are seen/accessible from the cluster where the executors/workers are running, vs. the node where the driver is running.
Initially the data is in table location in HDFS/S3/etc. Spark spills data on local storage if it does not fit in memory.
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Whenever spark reads data from hive tables, it stores it in RDD. One point i want to make clear here is hive is just a warehouse so it is like a layer which is above HDFS, when spark interacts with hive , hive provides the spark the location where the hdfs loaction exists.
Thus, Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop (whatever the InputFormat used to read this file. ex: if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (note:the split between partitions would be done on line split, not the exact block split), unless you have a compressed file format like Avro/parquet.
If you manually add rdd.repartition(x) it would perform a shuffle of the data from N partititons you have in rdd to x partitions you want to have, partitioning would be done on round robin basis.
If you have a 10GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (256MB) it would be stored in 40blocks, which means that the RDD you read from this file would have 40partitions. When you call repartition(1000) your RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000 partitions only when you will execute an action on top of this RDD (lazy execution concept)
Now its all up to spark that how it will process the data as Spark is doing lazy evaluation , before doing the processing, spark prepare a DAG for optimal processing. One more point spark need configuration for driver memory, no of cores , no of executors etc and if the configuration is inappropriate the job will fail.
Once it prepare the DAG , then it start processing the data. So it divide your job into stages and stages into tasks. Each task will further use specific executors, shuffle , partitioning. So in your case when you do processing of bilions of records may be your configuration is not adequate for the processing. One more point when we say spark load the data in RDD/Dataframe , its managed by spark, there are option to keep the data in memory/disk/memory only etc ref -storage_spark.
Briefly,
Hive-->HDFS--->SPARK>>RDD(Storage depends as its a lazy evaluation).
you may refer the following link : Spark RDD - is partition(s) always in RAM?

How does spark read data behind the scenes?

I am slightly confused as to how does spark reads the data from s3 for example. Let's say there is 100 GB of data to be read from s3 and the spark cluster has a total memory of 30 GB. Will spark read all 100 GB of the data once an action is triggered and store the maximum number of partitions in memory and spill the rest to disk or will it read only the partitions that it can store in memory process them and then read the rest of the data? Any link to some documentation will be highly appreciated.
There is a question on Spark FAQ about this:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
MEMORY_AND_DISK
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

How does Spark handle partial cache/persist results?

If you cache a very large dataset that cannot be all stored in memory or on disk how does spark handle the partial cache? how does it know which data needs to be recomputed when you go to use that dataframe again?
Example:
Read 100 GB dataset into memory df1
Compute new dataframe df2 based on df1
cache df2
If spark can only fit 50GB of Cache for df2 what happens if you go to reuse df2 for the next steps? How would spark know which data it doesn't need to recompute and which is does? Will it need to re-read that data again that it couldn't persist?
UPDATE
What happens if you have 5GB memory and 5GB disk and try to cache a 20GB dataset? What happens to the other 10GB of data that can't be cached and how does spark know which data it needs to recompute and which it doesn't?
Spark has this default option for DF and DS:
MEMORY_AND_DISK – This is the default behavior of the DataFrame or
Dataset. In this Storage Level, The DataFrame will be stored in JVM
memory as a deserialized objects. When required storage is greater
than available memory, it stores some of the excess partitions into
local disk and reads the data from local disk when it required. It is slower as
there is I/O involved.
However, to be more specific:
Spark's unit of processing is a partition = 1 task. So the discussion
is more about partition or partitions fitting into memory and/or local
disk.
If a partition of the DF doesn't fit in memory and disk when using
StorageLevel.MEMORY_AND_DISK, then the OS will fail, aka kill, the
Executor / Worker. Eviction of other partitions than your own DF may occur, but
not for your own DF. The .cache is either successful or not, there is no re- reading in this case.
I base this on the fact that partition eviction does not occur for partitions belonging to the same underlying RDD. Not well explained all this stuff, but see here: How does Spark evict cached partitions?. In the end other RDD partitions may be evicted and re-computed but in the end also yo need enough local disk as well as memory.
A good read is: https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/

How can Spark process data that is way larger than Spark storage?

Currently taking a course in Spark and came across the definition of an executor:
Each executor will hold a chunk of the data to be processed. This
chunk is called a Spark partition. It is a collection of rows that
sits on one physical machine in the cluster. Executors are responsible
for carrying out the work assigned by the driver. Each executor is
responsible for two things: (1) execute code assigned by the driver,
(2) report the state of the computation back to the driver
I am wondering what will happen if the storage of the spark cluster is less than the data that needs to be processed? How executors will fetch the data to sit on the physical machine in the cluster?
The same question goes for streaming data, which unbound data. Do Spark save all the incoming data on disk?
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Although Spark uses all available memory by default, it could be configured to run the jobs only with disk.
In section 2.6.4 Behavior with Insufficient Memory of Matei's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact due to the reduced amount of memory available.
In practice, you don't usually persist the source dataframe of 100TB, but only the aggregations or intermediate computations that are reused.

Apache Spark ---- how spark reads large partitions from source when there is no enough memory

Suppose my data source contains data in 5 partitions each partition size is 10gb ,so total data size 50gb , my doubt here is ,when my spark cluster doesn't have 50gb of main memory how spark handles out of memory exceptions , and what is the best practice to avoid these scenarios in spark.
50GB is data that can fit in memory and you probably don't need Spark for this kind of data - it would run slower than other solutions.
Also depending on the job and data format, a lot of times, not all the data needs to be read into memory (e.g. reading just needed columns from columnar storage format like parquet)
Generally speaking - when the data can't fit in memory Spark will write temporary files to disk. you may need to tune the job to more smaller partitions so each individual partition will fit in memory. see Spark Memory Tuning
Arnon

Resources