Spark - Changing memory fraction dynamically - apache-spark

I have a Spark job which needs a large portion of executor memory in the first half and a large portion of user memory in the second half. Is there any way to change the Spark memory fraction dynamically at runtime?

Short answer: spark.* configuration options cannot be changed at runtime.
Longer answer: there should be no need to. With recent Spark (1.6 or later) the fixed memory fractions are deprecated: the unified memory manager lets execution and storage memory borrow from each other at runtime, so Spark rebalances the split for you. You can set spark.memory.useLegacyMode to bring the old fixed fractions back, but that is not recommended.
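Since these options are fixed for the lifetime of the application, they have to be set before the SparkContext starts. A minimal sketch in PySpark (app name and memory values are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

# Memory-related spark.* options must be set before the SparkContext starts;
# they cannot be changed while the application is running.
spark = (
    SparkSession.builder
    .appName("memory-config-example")           # hypothetical app name
    .config("spark.executor.memory", "8g")      # example value, size for your cluster
    # Unified memory manager (Spark 1.6+): execution and storage share this
    # fraction of the heap and borrow from each other as needed.
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)
```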

Related

Why caching small Spark RDDs takes big memory allocation in Yarn?

The RDDs that are cached (8 in total) are not big, only around 30 GB, yet on the Hadoop UI it shows that the Spark application is taking a lot of memory (with no active jobs running), i.e. 1.4 TB. Why so much?
Why does it show around 100 executors (i.e. vCores here) even when there are no active jobs running?
Also, if the cached RDDs are stored across 100 executors, are those executors held so that no other Spark apps can use them to run tasks any more? To rephrase the question: will keeping a small amount of memory (.cache) in executors prevent other Spark apps from leveraging their idle compute resources?
Is there any Spark config / Zeppelin config that could cause this phenomenon?
UPDATE 1
After checking the Spark conf (Zeppelin), it seems there is a default setting (configured by the administrator) of spark.executor.memory=10G, which is probably the reason.
However, here's a new question: is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially set spark.executor.memory=10G?
Perhaps you can try to repartition(n) your RDD to a smaller number of partitions, n < 100, before caching. A ~30 GB RDD would probably fit into the storage memory of ten 10 GB executors. A good overview of Spark memory management can be found here. This way, only the executors that hold cached blocks will be "pinned" to your application, while the rest can be reclaimed by YARN via Spark dynamic allocation after spark.dynamicAllocation.executorIdleTimeout (default 60s).
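A sketch of that idea, assuming dynamic allocation is available on the cluster; the input path and partition count are hypothetical:

```python
from pyspark.sql import SparkSession

# Dynamic allocation lets YARN reclaim idle executors; only executors that
# hold cached blocks stay pinned to the application.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")              # required for dynamic allocation on YARN
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)

rdd = spark.sparkContext.textFile("hdfs:///data/input")   # hypothetical path

# Shrink to ~10 partitions before caching so the ~30 GB of cached blocks
# land on a handful of executors instead of being spread over 100.
cached = rdd.repartition(10).cache()
cached.count()   # materialize the cache
```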
Q: Is it possible to keep only the memory needed for the cached RDDs in each executors and release the rest, instead of holding always the initially set memory spark.executor.memory=10G?
When Spark uses YARN as its execution engine, YARN allocates the containers of a specified (by application) size -- at least spark.executor.memory+spark.executor.memoryOverhead, but may be even bigger in case of pyspark -- for all the executors. How much memory Spark actually uses inside a container becomes irrelevant, since the resources allocated to a container will be considered off-limits to other YARN applications.
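A quick worked example of what YARN actually reserves per executor, assuming the default overhead of max(384 MiB, 10% of spark.executor.memory):

```python
# Rough YARN container request per executor for spark.executor.memory=10G,
# assuming the default memory overhead of max(384 MiB, 10% of executor memory).
executor_memory_mb = 10 * 1024
overhead_mb = max(384, int(0.10 * executor_memory_mb))
container_mb = executor_memory_mb + overhead_mb
print(container_mb)   # 11264 MiB requested from YARN, regardless of actual usage
```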
Spark assumes that your data is equally distributed across all executors and tasks; that's why you set memory per task. So to make Spark consume less memory, your data has to be evenly distributed:
If you are reading from Parquet files or CSVs, make sure they have similar sizes. Running repartition() causes shuffling, which, if the data is heavily skewed, may cause other problems when executors don't have enough resources.
Caching won't help release memory on the executors because it doesn't redistribute the data.
Check the "Event Timeline" on the Stages page: how big are the green bars? Normally that reflects the data distribution, so it's a way to see how much data is loaded (proportionally) on each task and how much work each one does. Since all tasks are assigned the same memory, you can see graphically whether resources are being wasted (mostly tiny bars and a few big ones). A sample of wasted resources can be seen in the image below.
There are different ways to create evenly distributed files for your process. I'll mention some possibilities, but there are surely more:
Using Hive and the DISTRIBUTE BY clause: you need to distribute by a field that is evenly balanced in order to create as many files (and of the right size) as expected.
If the process creating those files is a Spark job reading from a database, try to create as many connections as files you need and use a suitable field to populate the Spark partitions. That is achieved, as explained here and here, with the partitionColumn, lowerBound, upperBound and numPartitions properties (see the sketch after this list).
repartition() may work, but also check whether coalesce() makes sense in your process, or in the previous one that generates the files you are reading from.
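A sketch of the partitioned JDBC read mentioned above; the URL, table, credentials and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a JDBC table into N evenly sized partitions by splitting the value
# range of a numeric column across numPartitions queries.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")   # hypothetical database
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "secret")
    .option("partitionColumn", "order_id")    # must be numeric, date or timestamp
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "100")           # one DB connection / Spark partition each
    .load()
)

# Write evenly sized files; coalesce() could reduce the count without a shuffle.
df.write.parquet("hdfs:///data/orders_evenly_partitioned")   # hypothetical path
```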

setting tuning parameters of a spark job

I'm relatively new to Spark and I have a few questions about tuning optimizations with respect to the spark-submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to get the maximum resources out of my Spark cluster.
However, I was recently asked how to choose the number of executors, memory and cores when I have a relatively small operation to do, since if I request maximum resources they will be underutilised.
For instance,
if I just have to do a merge job (read files from HDFS and write one single huge file back to HDFS using coalesce) on about 60-70 GB of data in Avro format without compression (assume each file is 128 MB, the HDFS block size), what would be the ideal memory, number of executors and cores required for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't work out how much memory will be used by the entire job, given that there are no joins, aggregations, etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data, combining it and writing it out, then you will need very little memory per CPU, because the dataset is never fully materialized before being written out. If you're doing joins/group-bys/other aggregate operations, all of those will require much more memory. One caveat is that Spark isn't really tuned for very large files and is generally much more performant when dealing with sets of reasonably sized files. Ultimately the best way to get your answer is to run your job with the default parameters and see what blows up.
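For a plain read-coalesce-write job on ~60-70 GB, a modest starting point might look like the sketch below. The sizing numbers, app name and paths are assumptions, not recommendations, and the Avro data source is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

# Illustrative sizing for read -> coalesce -> write with no joins or aggregations:
# executors mostly need room to buffer a few 128 MB input partitions at a time.
# (Roughly equivalent spark-submit flags: --num-executors 10 --executor-cores 4
#  --executor-memory 6g)
spark = (
    SparkSession.builder
    .appName("avro-merge")                         # hypothetical job name
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "6g")
    .getOrCreate()
)

# Assumes the spark-avro package is available (built-in "avro" format in Spark 2.4+).
df = spark.read.format("avro").load("hdfs:///data/avro_input/*")   # hypothetical path

# coalesce(1) avoids a shuffle but funnels the final write through one task,
# which is exactly the "large single file" case the answer warns about.
df.coalesce(1).write.format("avro").save("hdfs:///data/avro_merged")
```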

What will happen if the single file is larger than a executor in a map operation in YARN - SPARK?

I'm working on a solution where the driver program reads an XML file, takes an HDFS file path from it, and that file is read inside a map operation. I have a few questions here.
The map operation will be performed in containers (containers are allocated when the job starts).
What if the single input file is bigger than an executor? Since the file is not read in the driver program, the driver cannot allocate more resources for it, or will the application master request more memory from the resource manager?
Any help is highly appreciated.
What if the single input file is bigger than an executor?
As the file is in HDFS, Spark will create one partition per HDFS block. Every partition will be processed by a worker.
If the file has more blocks than can be computed at once, Spark makes sure the pending partitions are computed once resources are free (after completing the transformations within a stage).
A loaded file appears as an RDD. An RDD is a collection of pieces called partitions, which reside across the cluster. Reading the file is not the problem, but after a transformation it can throw an OOM exception, depending on the executor memory limits, because some shuffle operations require transferring partitions to one place. By default executor memory is set to 512 MB, but for processing a large amount of data you should set a custom memory parameter.
Spark reserves parts of that memory for cached data storage and for temporary shuffle data. With the legacy (pre-1.6) memory manager, you set the heap split with spark.storage.memoryFraction (default 0.6) and spark.shuffle.memoryFraction (default 0.2). Because these parts of the heap can grow before Spark can measure and limit them, two additional safety parameters apply: spark.storage.safetyFraction (default 0.9) and spark.shuffle.safetyFraction (default 0.8). The safety parameters lower the memory fraction by the amount specified. The actual part of the heap used for storage is therefore 0.6 × 0.9 (safety fraction times storage memory fraction), i.e. 54%. Similarly, the part of the heap used for shuffle data is 0.2 × 0.8 (safety fraction times shuffle memory fraction), i.e. 16%. That leaves about 30% of the heap for other Java objects and resources needed to run tasks, though you should count on only 20%.
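The arithmetic above, worked through for a hypothetical 10 GB executor heap:

```python
# Worked example of the legacy (pre-1.6) memory layout described above,
# for an illustrative 10 GB executor heap.
heap_mb = 10 * 1024

storage_fraction, storage_safety = 0.6, 0.9   # spark.storage.memoryFraction / .safetyFraction
shuffle_fraction, shuffle_safety = 0.2, 0.8   # spark.shuffle.memoryFraction / .safetyFraction

storage_mb = heap_mb * storage_fraction * storage_safety   # 54% of the heap ~= 5530 MB
shuffle_mb = heap_mb * shuffle_fraction * shuffle_safety   # 16% of the heap ~= 1638 MB
other_mb = heap_mb - storage_mb - shuffle_mb               # ~30% left for task objects

print(f"storage: {storage_mb:.0f} MB, shuffle: {shuffle_mb:.0f} MB, other: {other_mb:.0f} MB")
```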

spark spilling independent of executor memory assigned

I've noticed strange behavior when running a PySpark application with Spark 2.0. In the first step of my script, which involves a reduceByKey (and thus a shuffle), I observe that the amount of shuffle write is roughly in line with my expectations, but that much more spilling occurs than I had expected. I tried to avoid these spills by increasing the memory assigned per executor up to 8x the original amount, but I see basically no difference in the amount spilled. Strangely, I also see that while this stage is running, hardly any of the assigned storage memory is used (as reported in the Executors tab of the Spark web UI).
I saw this earlier question, which led me to believe that increasing executor memory might help avoid the spills: How to optimize shuffle spill in Apache Spark application. This leads me to believe that some hard limit is causing the spills, rather than the spark.shuffle.memoryFraction parameter. Does such a hard limit exist, possibly among HDFS parameters? Otherwise, what could be done to avoid spills besides increasing executor memory?
Many thanks, R
Spilling behavior in PySpark is controlled using spark.python.worker.memory:
Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
which defaults to 512m. Moreover, PySpark uses its own reducing mechanism with External(GroupBy|Sorter|Merger) and exhibits slightly different behavior than its native counterpart.
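So the knob that delays PySpark's own spilling is the Python worker budget, not the JVM executor heap. A minimal sketch, with illustrative values and hypothetical paths:

```python
from pyspark import SparkConf, SparkContext

# Raising the Python worker's aggregation budget (not the JVM executor heap)
# is what delays PySpark's spilling during reduceByKey-style aggregation.
conf = (
    SparkConf()
    .setAppName("reduce-by-key-job")             # hypothetical app name
    .set("spark.python.worker.memory", "2g")     # default is 512m
    .set("spark.executor.memory", "4g")          # JVM heap; a separate budget
)
sc = SparkContext(conf=conf)

pairs = (sc.textFile("hdfs:///data/pairs")       # hypothetical input
           .map(lambda line: (line.split(",")[0], 1)))
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///data/counts")     # hypothetical output
```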

Can sparkSQL dataframe exceed the memory?

I'm using SparkSQL to do some calculations. Every 5 minutes a new dataframe comes in, and I need to run the calculation over the most recent week of dataframes.
This means I need to merge 12 × 24 × 7 = 2016 dataframes into one big one and run the calculation.
The size is going to exceed my RAM. All the nodes in my Spark cluster have 128 GB of memory in total, which is not enough.
So I want to know what will happen if the dataframe is too big to fit in memory. Will Spark swap it out to disk temporarily? Do I need to explicitly ask Spark to spill, or will it be done automatically?
Do you have 2016 input files that you need to read? If so, Spark's read functions accept wildcards, so you can read them all at once instead of setting up some loop/read/merge logic. And depending on your input files, the size of the dataframe in memory could be much smaller than the size of your saved files, so it's possible your dataframe will fit into memory.
To answer your question, Spark will automatically spill to disk as needed if it runs out of memory.
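A sketch of the wildcard read, with a disk-spillable storage level for the case where the data doesn't fit in memory; the path layout, file format and the placeholder calculation are hypothetical:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a week's worth of 5-minute files in one call instead of looping and
# unioning 2016 dataframes.
week_df = spark.read.parquet("hdfs:///data/events/week_*/")   # hypothetical layout

# MEMORY_AND_DISK lets Spark spill partitions that don't fit in memory to
# local disk instead of failing.
week_df.persist(StorageLevel.MEMORY_AND_DISK)

result = week_df.groupBy("key").count()   # placeholder for the real weekly calculation
result.show()
```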
