If I have a dataset with 10GB of size and there is only 10GB of resource(executors) available in the spark cluster then how would it process programmatically?
You seem to assume that the memory available to Spark has to equal or exceed the size of your data. That is not the case. Spark will spill to disk as needed.
Furthermore, compression will shrink the memory footprint of your data.
Bottom line: Proceed without persisting the data (.cache()).
Related
I know there's a way to configure a Spark Application based in your cluster resources ("Executor memory" and "number of Executor" and "executor cores") I'm wondering if exist a way to do it considering the data input size?
What would happen if data input size does not fit into all partitions?
Example:
Data input size = 200GB
Number of partitions in cluster = 100
Size of partitions = 128MB
Total size that partitions could handle = 100 * 128MB = 128GB
What about the rest of the data (72GB)?
I guess Spark will wait to have free the resources free due to is designed to process batches of data Is this a correct assumption?
Thank in advance
I recommend for best performance, don't set spark.executor.cores. You want one executor per worker. Also, use ~70% of the executor memory in spark.executor.memory. Finally- if you want real-time application statistics to influence the number of partitions, use Spark 3, since it will come with Adaptive Query Execution (AQE). With AQE, Spark will dynamically coalesce shuffle partitions. SO you set it to an arbitrarily-large number of partitions, such as:
spark.sql.shuffle.partitions=<number of cores * 50>
Then just let AQE do its thing. You can read more about it here:
https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
There are 2 aspects to your question. The first is regarding storage of this data, & the second is regarding data execution.
With regards to storage, when you say Size of partitions = 128MB, I assume you use HDFS to store this data & 128M is your default block size. HDFS itself internally decides how to split this 200GB file & store in chunks not exceeding 128M. And your HDFS cluster should have more than 200GB * replication factor of combined storage to persist this data.
Coming to the Spark execution part of the question, once you define spark.default.parallelism=100, it means that Spark will use this value as the default level of parallelism while performing certain operations (like join etc). Please note that the amount of data being processed by each executor is not affected by the block size (128M) in any way. Which means each executor task will work on 200G/100 = 2G of data (provided the executor memory is sufficient for the required operation being performed). In case there isn't enough capacity in the spark cluster to run 100 executors in parallel, then it will launch as many executors it can in batches as and when resources are available.
I am slightly confused as to how does spark reads the data from s3 for example. Let's say there is 100 GB of data to be read from s3 and the spark cluster has a total memory of 30 GB. Will spark read all 100 GB of the data once an action is triggered and store the maximum number of partitions in memory and spill the rest to disk or will it read only the partitions that it can store in memory process them and then read the rest of the data? Any link to some documentation will be highly appreciated.
There is a question on Spark FAQ about this:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
MEMORY_AND_DISK
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
Say my Spark cluster has 100G memory, during the Spark computing process, more data (new dataframes, caches) with a size of 200G are generated. In this case, will Spark store some of this data on Disk or it will just OOM?
Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads in data in partitions - the number of concurrently loaded partitions depend on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task.
If you apply no transformation but only do for instance a count, Spark will still read in the data in partitions, but it will not store any data in your cluster and if you do the count again it will read in all the data once again. To avoid reading in data several times, you might call cache or persist in which case Spark will try to store the data in you cluster. On cache (which is the same as persist(StorageLevel.MEMORY_ONLY) it will store all partitions in memory - if it doesn't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK) it will store as much as it can in memory and the rest will be put on disk. If data doesn't fit on disk either the OS will usually kill your workers.
In Apache Spark if the data does not fits into the memory then Spark simply persists that data to disk. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
The persist method in Apache Spark provides six persist storage level to persist the data.
MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER
(Java and Scala), MEMORY_AND_DISK_SER
(Java and Scala), DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, OFF_HEAP.
The OFF_HEAP storage is under experimentation.
Suppose my data source contains data in 5 partitions each partition size is 10gb ,so total data size 50gb , my doubt here is ,when my spark cluster doesn't have 50gb of main memory how spark handles out of memory exceptions , and what is the best practice to avoid these scenarios in spark.
50GB is data that can fit in memory and you probably don't need Spark for this kind of data - it would run slower than other solutions.
Also depending on the job and data format, a lot of times, not all the data needs to be read into memory (e.g. reading just needed columns from columnar storage format like parquet)
Generally speaking - when the data can't fit in memory Spark will write temporary files to disk. you may need to tune the job to more smaller partitions so each individual partition will fit in memory. see Spark Memory Tuning
Arnon
I am running a Spark streaming application with 2 workers.
Application has a join and an union operations.
All the batches are completing successfully but noticed that shuffle spill metrics are not consistent with input data size or output data size (spill memory is more than 20 times).
Please find the spark stage details in the below image:
After researching on this, found that
Shuffle spill happens when there is not sufficient memory for shuffle data.
Shuffle spill (memory) - size of the deserialized form of the data in memory at the time of spilling
shuffle spill (disk) - size of the serialized form of the data on disk after spilling
Since deserialized data occupies more space than serialized data. So, Shuffle spill (memory) is more.
Noticed that this spill memory size is incredibly large with big input data.
My queries are:
Does this spilling impacts the performance considerably?
How to optimize this spilling both memory and disk?
Are there any Spark Properties that can reduce/ control this huge spilling?
Learning to performance-tune Spark requires quite a bit of investigation and learning. There are a few good resources including this video. Spark 1.4 has some better diagnostics and visualisation in the interface which can help you.
In summary, you spill when the size of the RDD partitions at the end of the stage exceed the amount of memory available for the shuffle buffer.
You can:
Manually repartition() your prior stage so that you have smaller partitions from input.
Increase the shuffle buffer by increasing the memory in your executor processes (spark.executor.memory)
Increase the shuffle buffer by increasing the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2. You need to give back spark.storage.memoryFraction.
Increase the shuffle buffer per thread by reducing the ratio of worker threads (SPARK_WORKER_CORES) to executor memory
If there is an expert listening, I would love to know more about how the memoryFraction settings interact and their reasonable range.
To add to the above answer, you may also consider increasing the default number (spark.sql.shuffle.partitions) of partitions from 200 (when shuffle occurs) to a number that will result in partitions of size close to the hdfs block size (i.e. 128mb to 256mb)
If your data is skewed, try tricks like salting the keys to increase parallelism.
Read this to understand spark memory management:
https://0x0fff.com/spark-memory-management/
https://www.tutorialdocs.com/article/spark-memory-management.html