Spark consumes more heap memory. Is it true? - apache-spark

With respect to heap memory: does Spark consume more heap memory compared to Hadoop?
Please advise.

Starting with Apache Spark 1.6.0, the memory management model has changed. The old memory management model is implemented by the StaticMemoryManager class, and it is now called “legacy”. “Legacy” mode is disabled by default, which means that running the same code on Spark 1.5.x and 1.6.0 results in different behavior, so be careful with that. For compatibility, you can enable the “legacy” model with the spark.memory.useLegacyMode parameter, which is turned off by default.
It also depends on your heap configuration.
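If you do need the pre-1.6 behavior, here is a minimal sketch of enabling legacy mode when building the SparkConf; the app name and fraction values are illustrative, not recommendations, and as far as I recall the flag was removed again in Spark 3.x:
import org.apache.spark.{SparkConf, SparkContext}

// Opt back into the pre-1.6 StaticMemoryManager ("legacy") model.
val conf = new SparkConf()
  .setAppName("legacy-memory-demo")
  .set("spark.memory.useLegacyMode", "true")    // default is false
  // The legacy fractions below are only honored when useLegacyMode is true.
  .set("spark.storage.memoryFraction", "0.6")
  .set("spark.shuffle.memoryFraction", "0.2")
val sc = new SparkContext(conf)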

Related

Difference between "spark.yarn.executor.memoryOverhead" and "spark.memory.offHeap.size"

I am running Spark on YARN. I don't understand the difference between the following settings: spark.yarn.executor.memoryOverhead and spark.memory.offHeap.size. Both seem to be settings for allocating off-heap memory to the Spark executor. Which one should I use? Also, what is the recommended setting for executor off-heap memory?
Many thanks!
TL;DR: For Spark 1.x and 2.x, Total Off-Heap Memory = spark.executor.memoryOverhead (spark.memory.offHeap.size is included within it)
For Spark 3.x, Total Off-Heap Memory = spark.executor.memoryOverhead + spark.memory.offHeap.size (credit from this page)
Detailed explanation:
spark.executor.memoryOverhead is used by resource managers like YARN, whereas spark.memory.offHeap.size is used by Spark core (the memory manager). The relationship is a bit different depending on the version.
Spark 2.4.5 and before:
spark.executor.memoryOverhead should include spark.memory.offHeap.size. This means that if you specify offHeap.size, you need to manually add this portion to memoryOverhead for YARN. As you can see from the code below from YarnAllocator.scala, when YARN requests resources, it does not know anything about offHeap.size:
private[yarn] val resource = Resource.newInstance(
  executorMemory + memoryOverhead + pysparkWorkerMemory,
  executorCores)
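For example, here is a sketch of what this means in practice on Spark 2.x (the sizes are made up for illustration): if you ask for 2 GB of off-heap memory, you must grow memoryOverhead yourself so the YARN container is large enough.
import org.apache.spark.SparkConf

// Spark 2.x sketch: memoryOverhead has to be sized to cover offHeap.size manually.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2147483648")    // 2 GB in bytes
  // The default overhead would be max(0.10 * 8g, 384m) = ~819m; add the 2 GB of off-heap on top.
  .set("spark.executor.memoryOverhead", "3g")
// Container size requested from YARN: 8g (heap) + 3g (overhead, including the off-heap) = 11g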
However, this behavior changed in Spark 3.0:
spark.executor.memoryOverhead does not include spark.memory.offHeap.size anymore. YARN will include offHeap.size for you when requesting resources. From the new documentation:
Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of container to running executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory.
And from the code you can also tell:
private[yarn] val resource: Resource = {
  val resource = Resource.newInstance(
    executorMemory + executorOffHeapMemory + memoryOverhead + pysparkWorkerMemory,
    executorCores)
  ResourceRequestHelper.setResourceRequests(executorResourceRequests, resource)
  logDebug(s"Created resource capability: $resource")
  resource
}
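And a matching sketch for Spark 3.x with the same made-up sizes; here you leave memoryOverhead alone and YARN adds the off-heap size to the container request itself:
import org.apache.spark.SparkConf

// Spark 3.x sketch: YARN adds executorOffHeapMemory to the container request for you.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2147483648")    // 2 GB in bytes
  // spark.executor.memoryOverhead stays at its default of max(0.10 * 8g, 384m) = ~819m.
// Container size requested from YARN: 8g + 2g (off-heap) + ~819m (overhead) = ~10.8g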
For more details of this change you can refer to this Pull Request.
For your second question (what is the recommended setting for executor off-heap memory?), it depends on your application and you need to do some testing. I found this page helpful in explaining it further:
Off-heap memory is a great way to reduce GC pauses because it's not in the GC's scope. However, it brings an overhead of serialization and deserialization. The latter, in turn, means that off-heap data can sometimes be put onto the heap and hence be exposed to GC. Also, the new data format brought by Project Tungsten (an array of bytes) helps to reduce the GC overhead. These two reasons mean that the use of off-heap memory in Apache Spark applications should be carefully planned and, especially, tested.
BTW, spark.yarn.executor.memoryOverhead is deprecated and has been replaced by spark.executor.memoryOverhead, which is common to YARN and Kubernetes.
spark.yarn.executor.memoryOverhead is used with the StaticMemoryManager. This is used in older Spark versions like 1.2.
The amount of off heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
You can find this in older Spark docs, like the Spark 1.2 docs:
https://spark.apache.org/docs/1.2.0/running-on-yarn.html
spark.memory.offHeap.size is used by the UnifiedMemoryManager, which is the default after version 1.6.
The absolute amount of memory in bytes which can be used for off-heap allocation. This setting has no impact on heap memory usage, so if your executors' total memory consumption must fit within some hard limit then be sure to shrink your JVM heap size accordingly. This must be set to a positive value when spark.memory.offHeap.enabled=true.
You can find this in later Spark docs, like the Spark 2.4 docs:
https://spark.apache.org/docs/2.4.4/configuration.html

Default configuration for 'spark.shuffle.consolidateFiles'

What is the default behavior for map-side shuffling in newer versions of Spark?
I learned that the spark.shuffle.consolidateFiles configuration is used to reduce the memory cost of write buffers. But I cannot find the configuration anymore. I checked the configuration docs, and as of Spark 1.6.0 this configuration has been removed. So what is the default behavior for map-side shuffling in newer versions of Spark?
I learned that hash-based shuffle has been replaced by sort-based shuffle, so this configuration no longer makes sense.
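For context, a hedged sketch of the old knob (the exact versions are from memory, so double-check against the release notes for your version): sort-based shuffle became the default via spark.shuffle.manager around Spark 1.2, and the hash implementation was removed entirely in Spark 2.0, which is why consolidateFiles no longer appears.
import org.apache.spark.SparkConf

// Pre-2.0 versions only (illustrative): picking the shuffle implementation explicitly.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")    // the default since ~1.2; "hash" was removed in 2.0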

What is the equivalent configuration between Spark and Drill?

I want to compare query performance between Spark and Drill. Therefore, the configuration of these two systems has to be identical. What are the parameters I have to consider, like driver memory and executor memory for Spark, or max direct memory and planner.memory.max_query_memory_per_node for Drill, etc.? Can someone give me an example configuration?
It is possible to get a close comparison between Spark and Drill for specific overlapping use cases. I will first describe how Spark and Drill are different, what the overlapping use cases are, and finally how you could tune Spark's memory settings to match Drill as closely as possible for those overlapping use cases.
Comparison of Functionality
Both Spark and Drill can function as a SQL compute engine. My definition of a SQL compute engine is a system that can do the following:
Ingest data from files, databases, or message queues.
Execute SQL statements provided by a user on the ingested data.
Write the results of a user's SQL statement to a terminal, file, database table, or message queue.
Drill is only a SQL compute engine while Spark can do more than just a SQL compute engine. The extra things that Spark can do are the following:
Spark has APIs to manipulate data with functional programming operations, not just SQL.
Spark can save the results of operations to Datasets. Datasets can be efficiently reused in other operations and are efficiently cached both on disk and in memory.
Spark has stream processing APIs.
So to accurately compare Drill and Spark you can only consider their overlapping functionality, which is executing a SQL statement.
Comparison of Nodes
A running Spark job is comprised of two types of nodes: Executors and a Driver. An Executor is like a worker node that is given simple compute tasks and executes them. The Driver orchestrates a Spark job. For example, if you have a SQL query or a Spark job written in Python, the Driver is responsible for planning how the work for the SQL query or Python script will be distributed to the Executors. The Driver then monitors the work being done by the Executors. The Driver can run in a variety of modes: on your laptop like a client, or on a separate dedicated node or container.
Drill is slightly different. The two participants in a SQL query are the Client and the Drillbits. The Client is essentially a dummy command-line terminal for sending SQL commands and receiving results. The Drillbits are responsible for doing the compute work for a query. When the Client sends a SQL command to Drill, it picks one Drillbit to be the Foreman. There is no restriction on which Drillbit can be a Foreman, and a different Foreman can be selected for each query. The Foreman performs two functions during the query:
It plans the query and orchestrates the rest of the Drillbits to divide up the work.
It also participates in the execution of the query and does some of the data processing as well.
The functions of Spark's Driver and Executors are very similar to Drill's Foreman and Drillbits, but not quite the same. The main difference is that a Driver cannot function as an Executor simultaneously, while a Foreman also functions as a Drillbit.
When constructing a cluster comparing Spark and Drill I would do the following:
Drill: Create a cluster with N nodes.
Spark: Create a cluster with N Executors and make sure the Driver has the same amount of memory as the Executors.
Comparison of Memory Models
Spark and Drill both use the JVM. Applications running on the JVM have access to two kinds of memory: on-heap memory and off-heap memory. On-heap memory is normal garbage-collected memory; for example, if you do new Object(), the object will be allocated on the heap. Off-heap memory is not garbage collected and must be explicitly allocated and freed. When applications consume a large amount of heap memory (16 GB or more), they can tax the JVM garbage collector. In such cases garbage collection can incur a significant compute overhead and, depending on the GC algorithm, computation can pause for several seconds while garbage collection is done. In contrast, off-heap memory is not subject to garbage collection and does not incur these performance penalties.
Spark stores everything on the heap by default. It can be configured to store some data in off-heap memory, but it is not clear to me when it will actually store data off-heap.
Drill stores all its data in off-heap memory, and only uses on-heap memory for the general engine itself.
Another difference is that Spark reserves some of its memory to cache Datasets, while Drill does not cache data in memory after a query is executed.
In order to compare Spark and Drill apples to apples, we have to configure Spark and Drill to use the same amount of off-heap and on-heap memory for executing a SQL query. In the following example we will walk through how to configure Drill and Spark to use 8 GB of on-heap memory and 8 GB of off-heap memory.
Drill Memory Config Example
Set the following in your drill-env.sh file on each Drillbit:
export DRILL_HEAP="8G"
export DRILL_MAX_DIRECT_MEMORY="8G"
Once these are configured, restart your Drillbits and try your query. Your query may still run out of memory because Drill's memory management is under active development. To give you an out, you can manually control Drill's memory usage for a query with the planner.width.max_per_node and planner.memory.max_query_memory_per_node options. These options are set in your drill-override.conf. Note that you must change these options on all your nodes and restart your Drillbits for them to take effect. A more detailed explanation of these options can be found here.
Spark Memory Config Example
Create a properties file, myspark.conf, and pass it to the spark-submit command. The properties file should include the following config:
# 8gb of heap memory for executor
spark.executor.memory 8g
# 8gb of heap memory for driver
spark.driver.memory 8g
# Enable off heap memory and use 8gb of it
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 8000000000
# Do not set aside memory for caching data frames.
# Haven't tested whether 0.0 works; if it doesn't, make this
# as small as possible.
spark.memory.storageFraction 0.0
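To use it, pass the file to spark-submit; the executor count, class name, and jar below are placeholders for your own job:
spark-submit \
  --master yarn \
  --num-executors 4 \
  --properties-file myspark.conf \
  --class com.example.MyQueryJob \
  my-query-job.jar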
Summary
Create a Drill cluster with N nodes and a Spark cluster with N Executors plus a dedicated Driver, try the memory configurations provided above, and run the same or a similar SQL query on both clusters. Hope this helps.

Spark - Changing memory fraction dynamically

I have a Spark job which needs a large portion of executor memory in the first half and a large portion of user memory in the second half. Is there any way to change the Spark memory fraction dynamically at runtime?
Short: spark.* configuration options cannot be changed at runtime.
Longer: there should be no need to. If you use a recent Spark version (1.6 or later), the memory fraction settings are deprecated. You can set spark.memory.useLegacyMode and Spark will do the rest.
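A minimal sketch of why (the values are illustrative): the memory manager reads these settings when the SparkContext is created, so later changes to the conf never reach the running application.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")          // fixed for the lifetime of the application
  .set("spark.memory.storageFraction", "0.5")
val sc = new SparkContext(conf)

// This later change has no effect on the already-created memory manager:
conf.set("spark.memory.fraction", "0.3")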

Is Tachyon by default implemented by the RDDs in Apache Spark?

I'm trying to understand Spark's in-memory feature. In the process I came across Tachyon,
which is basically an in-memory data layer that provides fault tolerance without replication by using lineage,
and reduces re-computation by check-pointing datasets. Where I got confused is that all of these features are also achievable with Spark's standard RDD system. So I wonder: do RDDs implement Tachyon behind the curtains to provide these features? If not, then what is the use of Tachyon, when everything it does can be done by standard RDDs? Or am I making some mistake in relating the two? A detailed explanation or a link to one would be a great help. Thank you.
What is in the paper you linked does not reflect the reality of what is in Tachyon as a released open source project; parts of that paper have only ever existed as research prototypes and were never fully integrated into Spark/Tachyon.
When you persist data at the OFF_HEAP storage level via rdd.persist(StorageLevel.OFF_HEAP), it uses Tachyon to write that data into Tachyon's memory space as a file. This removes it from the Java heap, giving Spark more heap memory to work with.
It does not currently write the lineage information, so if your data is too large to fit into your configured Tachyon cluster's memory, portions of the RDD will be lost and your Spark jobs can fail.
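For reference, here is a minimal sketch of the API being described, given an existing SparkContext sc on a Spark 1.x-era cluster that is already configured to talk to a Tachyon deployment:
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)
// The blocks are written into Tachyon's memory space as files, off the JVM heap.
rdd.persist(StorageLevel.OFF_HEAP)
rdd.count()   // materializes the RDD so the persisted blocks are actually written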
