How does Spark handle memory in local mode?

I'm trying to get a grasp about how Spark (2.1.1) handles memory in local mode.
As far as I understand, when I launch spark-shell with --driver-memory 3g:
- 300MB is reserved
- 60% (the default of spark.memory.fraction) of the rest is usable, shared between execution and storage - about 1.7GB
Presumably some of this is also shared with spark-shell and the Spark UI.
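As a sanity check on that arithmetic, here is a rough sketch of how Spark 2.x's unified memory manager sizes the shared pool (the 300MB constant and the 0.6 default match the Spark documentation; treat the exact figures as approximate):

// Back-of-the-envelope version of Spark 2.x's unified memory sizing
// for a JVM launched with --driver-memory 3g. Approximate sketch only.
val heapBytes = 3L * 1024 * 1024 * 1024     // -Xmx set by --driver-memory
val reservedBytes = 300L * 1024 * 1024      // fixed reserved memory (300MB)
val memoryFraction = 0.6                    // spark.memory.fraction default
val unifiedBytes = ((heapBytes - reservedBytes) * memoryFraction).toLong
// ~1.66GB shared by execution and storage; the rest of the heap is left
// for user objects and Spark's internal metadata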
Looking at running processes, I see the java.exe process for spark-shell using about 1GB of RAM after a fresh launch.
If I then read in a 900MB CSV file using:
val data: DataFrame = spark.read.option("header", value = true).csv("data.csv")
And then repeatedly call data.count, I can see the java.exe process creep up each time, until it caps out at about 2GB of RAM.
A few questions:
- Why does it cap at 2GB? Is that number the 1.7GB usable + 300MB reserved?
- What is actually in that memory, since I'm not caching anything?
- When it hits that 2GB cap, what's happening on subsequent calls to data.count? It clearly ate more memory on previous calls, so why does it not need more once it hits that cap?

Related

Spark - 54 GB CSV file transform to single JSON in 16 GB RAM single machine

I want to take a CSV file and transform it into a single JSON file. I have written and verified the code. I have a 54 GB CSV file that I want to transform and export into a single JSON file; I want to load this data into Spark and build the JSON using the Spark SQL collect_set(struct(...)) built-in functions.
I am running the Spark job in the Eclipse IDE on a single machine only. The machine configuration is 16 GB RAM, an i5 processor, and a 600 GB HDD.
Now when I try to run the Spark program, it throws a java.lang.OutOfMemoryError (insufficient heap size). I tried increasing the spark.sql.shuffle.partitions value from 2000 to 20000, but the job still fails after loading, during the transformation, with the same error I have mentioned.
I don't want to split the single CSV into multiple parts; I want to process it as a single CSV. How can I achieve that? Need help. Thanks.
Spark Configuration:
val conf = new SparkConf().setAppName("App10").setMaster("local[*]")
  // .set("spark.executor.memory", "200g")
  .set("spark.driver.memory", "12g")
  .set("spark.executor.cores", "4")
  .set("spark.driver.cores", "4")
  // .set("spark.testing.memory", "2147480000")
  .set("spark.sql.shuffle.partitions", "20000")
  .set("spark.driver.maxResultSize", "500g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "200g")
A few observations from my side:
When you collect data onto the driver at the end, it needs enough memory to hold your complete JSON output. 12g is not sufficient memory for that, IMO.
The 200g executor memory is commented out, so how much was actually allocated? Executors also need enough memory to process/transform this heavy data. If the driver was allocated 12g and you have 16 in total, then, accounting for other applications running on the system, only 1-2gb is left for the executor, and an OOM is quite possible. I would recommend finding out whether the driver or the executor is short on memory.
Most important, Spark is designed to process data in parallel on multiple machines for maximum throughput. If you process this on a single machine / single executor / single core, you are not taking advantage of Spark at all.
I'm not sure why you want to produce a single file, but I would suggest revisiting your plan and processing the data in a way that lets Spark use its strengths; a sketch follows below. Hope this helps.
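To make that last point concrete, here is a hedged sketch (not the asker's code) of letting Spark build and write the JSON in parallel with collect_set(struct(...)) instead of collecting everything to the driver; the column names and paths are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, struct}

val spark = SparkSession.builder.appName("CsvToJson").master("local[*]").getOrCreate()
// Hypothetical columns: "key" groups the rows, "colA"/"colB" form the nested objects
val df = spark.read.option("header", "true").csv("input.csv")
df.groupBy("key")
  .agg(collect_set(struct("colA", "colB")).as("items"))
  .write.json("output_json")   // one part file per partition, written by the executors

Note that write.json produces a directory of part files; if a single physical file is truly required, the parts can be concatenated afterwards outside Spark, which avoids funnelling 54 GB through one JVM.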

Spark Memory Usage Concentrated on Driver / Master

I'm currently developing a Spark (v 2.2.0) Streaming application and am running into issues with the way Spark seems to be allocating work across the cluster. This application is submitted to AWS EMR using client mode, so there is a driver node and a couple of worker nodes. Here is a screenshot of Ganglia that shows memory usage in the last hour:
The left-most node is the "master" or "driver" node, and the other two are worker nodes. There are spikes in the memory usage for all three nodes that correspond to workloads coming in through the stream, but the spikes are not equal (even when scaled to % memory usage). When a large workload comes in, the driver node appears to be overworked, and the job will crash with an error regarding memory:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000053e980000, 674234368, 0) failed; error='Cannot allocate memory' (errno=12)
I've also run into this when the master runs out of memory:
Exception in thread "streaming-job-executor-10" java.lang.OutOfMemoryError: Java heap space
This is equally confusing, as my understanding is that "client" mode would not use the driver / master node as an executor.
Pertinent details:
- As mentioned earlier, this application is submitted in client mode: spark-submit --deploy-mode client --master yarn ....
- Nowhere in the program am I running collect or coalesce.
- Any work that I suspect of being run on a single node (jdbc reads mainly) is repartition'd after completion.
- There are a couple of very, very small datasets persisted into memory.
- 1 x Driver specs: 4 cores, 16GB RAM (m4.xlarge instance)
- 2 x Worker specs: 4 cores, 30.5GB RAM (r3.xlarge instances)
- I have tried both allowing Spark to choose executor size / cores and specifying them manually. Both cases behave the same. (I manually specified 6 executors, 1 core, 9GB RAM.)
I'm certainly at a loss here. I'm not sure what is going on in the code to be triggering the driver to hog the workload like this.
The only suspect I can think of is a code snippet similar to the following:
val scoringAlgorithm = HelperFunctions.scoring(_: Row, batchTime)
val rawScored = dataToScore.map(scoringAlgorithm)
Here, a function is being loaded from a static object, and used to map over the Dataset. It is my understanding that Spark will serialize this function across the cluster (re: http://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#passing-functions-to-spark). However perhaps I am mistaken and it is simply running this transformation on the driver.
If anyone has any insight to this issue, I would love to hear it!
I ended up solving this issue. Here's how I addressed it:
I made an incorrect assertion in stating the problem: there was a collect statement at the beginning of the Spark program.
I had a transaction that required collect() to run as it was designed. My assumption was that calling repartition(n) on the resulting data would split the data back amongst the executors in the cluster. From what I can tell, this strategy does not work. Once I re-wrote this line, Spark started behaving as I expected and farming jobs out to worker nodes.
My advice to any lost soul who stumbles across this issue: don't collect unless it's the end of your Spark program. You cannot recover from it. Find another way to perform your task. (I ended up switching a SQL transaction from where col in (,,,) syntax to a join on the database; a sketch of the pattern follows below.)
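For anyone wanting the shape of that rewrite, a hedged sketch in Spark; the names factsDf, keysDf and the column "id" are hypothetical stand-ins:

// Anti-pattern: collect keys onto the driver and splice them into a WHERE ... IN (...)
// val keys = keysDf.collect().map(_.getString(0)).mkString("','")
// val filtered = factsDf.where(s"id IN ('$keys')")

// Driver-friendly: express the same filter as a join, which runs on the executors
val filtered = factsDf.join(keysDf, Seq("id"))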

GC overhead limit exceeded while reading data from MySQL on Spark

I have a >5GB table in MySQL. I want to load that table into Spark as a dataframe and create a parquet file from it.
This is my python function to do the job:
from pyspark.sql import SparkSession

def import_table(tablename):
    spark = SparkSession.builder.appName(tablename).getOrCreate()
    df = spark.read.format('jdbc').options(
        url="jdbc:mysql://mysql.host.name:3306/dbname?zeroDateTimeBehavior=convertToNull",
        driver="com.mysql.jdbc.Driver",
        dbtable=tablename,
        user="root",
        password="password"
    ).load()
    df.write.parquet("/mnt/s3/parquet-store/%s.parquet" % tablename)
I am running the following script to run my spark app:
./bin/spark-submit ~/mysql2parquet.py --conf "spark.executor.memory=29g" --conf "spark.storage.memoryFraction=0.9" --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" --driver-memory 29G --executor-memory 29G
When I run this script on an EC2 instance with 30 GB of RAM, it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
Meanwhile, I am only using 1.42 GB of the total memory available.
Here is full console output with stack trace: https://gist.github.com/idlecool/5504c6e225fda146df269c4897790097
I am not sure if I am doing something wrong or spark is not meant for this use-case. I hope spark is.
A somewhat crude explanation of Spark's memory management is provided below; you can read more about it in the official documentation, but here is my take:
I believe the option "spark.storage.memoryFraction=0.9" is problematic in your case. Roughly speaking, an executor has three types of memory that can be allocated. First is the storage memory, which you have set to 90% of the executor memory, i.e. about ~27GB, and which is used to keep persistent datasets.
Second is the heap memory, which is used to perform computations and is typically set high for cases where you are doing machine learning or lots of calculations. This is what is insufficient in your case: your program needs more heap memory, and that is what causes this error.
The third type is the shuffle memory, which is used for communication between different partitions. It needs to be set high in cases where you are doing a lot of joins between dataframes/RDDs, or in general anything that requires a large amount of network overhead. It can be configured via the setting "spark.shuffle.memoryFraction".
So basically you set the memory fractions using these two settings, and whatever remains after the shuffle and storage memory goes to the heap.
Since you have such a high storage fraction, the heap memory available to the program is extremely small. You will need to play with these parameters to find optimal values. Since you are outputting a parquet file, you will usually need more heap space, because the program has to compute the compression. I would suggest the following settings. The idea is that you are not doing any operations that require a lot of shuffle memory, so it can be kept small; you also do not need such a large amount of storage memory.
"spark.storage.memoryFraction=0.4"
"spark.shuffle.memoryFraction=0.2"
More about this can be read here:
https://spark.apache.org/docs/latest/configuration.html#memory-management
Thanks to Gaurav Dhama for the good explanation. You may need to set spark.executor.extraJavaOptions to -XX:-UseGCOverheadLimit too.

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5 node Spark cluster on AWS EMR each sized m3.xlarge (1 master 4 slaves). I successfully ran through a 146Mb bzip2 compressed CSV file and ended up with a perfectly aggregated result.
Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:
16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm confused as to why I'm getting a ~10.5GB memory limit on a ~75GB cluster (15GB per m3.xlarge instance)...
Here is my EMR config:
[
  {
    "classification": "spark-env",
    "properties": {},
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34"
        },
        "configurations": []
      }
    ]
  },
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  }
]
From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster, i.e. I should have ~75GB of memory available... So why am I getting a ~10.5GB memory limit error?
Here is the code I'm running:
import statistics
import sys

import arrow
import pyspark
import pyspark.sql.functions

def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))
    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))
    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(
                     pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = time_diff.withColumn(
        "session_id",
        pyspark.sql.functions.concat_ws(
            "_", "user_id", "site_id",
            pyspark.sql.functions.sum("new_session").over(window)))
    return sessions

def aggregate_sessions(sessions):
    median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
    aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
        pyspark.sql.functions.first("site_id").alias("site_id"),
        pyspark.sql.functions.first("user_id").alias("user_id"),
        pyspark.sql.functions.count("id").alias("hits"),
        pyspark.sql.functions.min("timestamp").alias("start"),
        pyspark.sql.functions.max("timestamp").alias("finish"),
        median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
    )
    return aggregated

spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)
raw_data = spark_session.read.csv(sys.argv[1],
                                  header=True,
                                  inferSchema=True)

# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)
raw_data = raw_data.withColumn("timestamp",
                               convert_to_unix(pyspark.sql.functions.col("timestamp")))

sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)
Basically, nothing more than windowing and a groupBy to aggregate the data.
It starts with a few of those errors, and the same error grows steadily in frequency until the job halts.
I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
I feel your pain..
We had similar issues of running out of memory with Spark on YARN. We have five 64GB, 16-core VMs, and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks -- they would eventually die no matter how much memory we gave them. And this was a relatively straightforward Spark application that was causing it to happen.
We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
Doing more research, I found the answer to why this happens here: http://web.archive.org/web/20190806000138/https://mapr.com/blog/best-practices-yarn-resource-management/
Since on Centos/RHEL 6 there are aggressive allocation of virtual memory due to OS behavior, you should disable virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
That page had a link to a very useful page from IBM: https://web.archive.org/web/20170703001345/https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
In summary, glibc > 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.
Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
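For reference, the MALLOC_ARENA_MAX change discussed in that bug report is typically a one-line addition to hadoop-env.sh; 4 is the value commonly suggested there, so treat it as a starting point rather than a definitive setting:

# hadoop-env.sh: limit glibc malloc arenas to curb virtual memory growth
export MALLOC_ARENA_MAX=4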
I recommend reading through both pages -- the information is very handy.
If you're not using spark-submit, and you're looking for another way to specify the yarn.nodemanager.vmem-check-enabled parameter mentioned by Duff, here are 2 other ways:
Method 2
If you're using a JSON Configuration file (that you pass to the AWS CLI or to your boto3 script), you'll have to add the following configuration:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
Method 3
If you use the EMR console, add the following configuration:
classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]
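As a usage example for Method 2, a hedged sketch of passing that JSON file to the AWS CLI when creating the cluster (every other parameter here is hypothetical):

aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-5.11.0 \
  --applications Name=Spark \
  --instance-type m4.xlarge \
  --instance-count 5 \
  --use-default-roles \
  --configurations file://./config.json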
See, I had the same problem on a huge cluster that I'm working on now. The problem will not be solved by adding memory to the worker. Sometimes during aggregation Spark will use more memory than it has, and the Spark jobs will start to use off-heap memory.
One simple example: if you have a dataset that you need to reduceByKey, it will sometimes aggregate more data on one worker than on the others, and if this data exceeds the memory of that worker, you get that error message.
Setting the option spark.yarn.executor.memoryOverhead will help; set it to 50% of the memory used by the worker (just as a test, to see if it works; you can lower it with more tests).
But you need to understand how Spark handles memory allocation in the cluster:
Most commonly, Spark uses 75% of the machine's memory; the rest goes to the OS.
Spark has two types of memory during execution: one part is for execution and the other is storage. Execution is used for shuffles, joins, aggregations, etc.; storage is used for caching and propagating data across the cluster.
One good thing about memory allocation: if you are not using cache in your execution, you can configure Spark to use that storage space for execution, partly avoiding OOM errors. You can see this in the Spark documentation:
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
But how can we use that?
You can change some configurations: add the memoryOverhead configuration to your job call, but consider this too: change spark.memory.fraction to 0.8 or 0.85, and reduce spark.memory.storageFraction to 0.35 or 0.2.
Other configurations can help, but that needs to be checked in your case. See all these configurations here.
Now, what helped in my case:
I have a cluster with 2.5K workers and 2.5TB of RAM, and we were facing OOM errors like yours. We just increased spark.yarn.executor.memoryOverhead to 2048 and enabled dynamic allocation. When we call the job, we don't set the memory for the workers; we leave that for Spark to decide and just set the overhead.
For some tests on my small cluster, changing the sizes of the execution and storage memory solved the problem; a sketch of such a submission follows below.
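Put together, a hedged sketch of a job submission with those settings; the values and the application jar are illustrative, not prescriptive:

spark-submit \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.memory.fraction=0.8 \
  --conf spark.memory.storageFraction=0.2 \
  my-app.jar
# note: dynamic allocation on YARN also needs the external shuffle service
# (spark.shuffle.service.enabled=true) available on the nodes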
Try repartition. It worked in my case.
The dataframe was not that big at the very beginning, when it was loaded with read.csv(). The data file amounted to 10 MB or so, which might require, say, a few hundred MB of memory in total for each processing task in an executor.
I checked the number of partitions at the time: it was 2.
Then it grew like a snowball during the following operations (joining with other tables, adding new columns), and at a certain step I ran into the memory-limit-exceeded issue.
I checked the number of partitions: it was still 2, derived from the original dataframe I guess.
So I tried to repartition it at the very beginning, and there was no problem anymore.
I have not read much material about Spark and YARN yet. What I do know is that there are executors in nodes, and an executor can handle many tasks depending on its resources. My guess is that one partition is atomically mapped to one task, and its volume determines the resource usage; Spark cannot slice a partition if it grows too big.
A reasonable strategy is to determine the node and container memory first, either 10GB or 5GB. Ideally, either could serve any data-processing job; it's just a matter of time. Given the 5GB memory setting, once testing shows a reasonable number of rows per partition, say 1000 (such that no step fails during processing), we could do it as in the following pseudocode:
RWS_PER_PARTITION = 1000
input_df = spark.read.csv("file_uri", *other_args)
total_rows = input_df.count()
original_num_partitions = input_df.rdd.getNumPartitions()
num_partitions = max(total_rows // RWS_PER_PARTITION, original_num_partitions)
input_df = input_df.repartition(num_partitions)
Hope it helps!
I had the same issue on a small cluster running a relatively small job on Spark 2.3.1.
The job reads a parquet file, removes duplicates using groupBy/agg/first, then sorts and writes a new parquet file. It processed 51 GB of parquet files on 4 nodes (4 vcores, 32GB RAM each).
The job was constantly failing at the aggregation stage. I wrote a bash script to watch the executors' memory usage and found that in the middle of the stage one random executor starts taking double the memory for a few seconds. When I correlated that moment with the GC logs, it matched a full GC that freed a large amount of memory.
At last I understood that the problem was somehow related to GC. ParallelGC and G1 cause this issue constantly, but ConcMarkSweepGC improves the situation. The issue appears only with a small number of partitions. I ran the job on EMR, where OpenJDK 64-Bit (build 25.171-b10) was installed. I don't know the root cause of the issue; it could be related to the JVM or to the operating system. But it is definitely not related to heap or off-heap usage in my case.
UPDATE1
I tried Oracle HotSpot; the issue is reproduced there too.
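For anyone wanting to try the same experiment, a hedged example of switching the JVMs to ConcMarkSweepGC via Spark conf (the script name is hypothetical):

spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  my-job.py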

How does Spark occupy the memory

If my server has 50GB of memory and HBase is using 40GB of it, and I then run Spark with --executor-memory 30G, will Spark grab some memory from HBase, since there is only 10GB left?
Another question: if Spark only needs 1GB of memory but I gave Spark 10GB, will Spark occupy all 10GB?
The behavior differs depending on the deployment mode. If you are using local mode, then --executor-memory will not change anything, as you only have one executor and it is your driver, so you need to increase the memory of your driver instead; see the example after this answer.
If you are using standalone mode and submitting your job in cluster mode, then the following applies:
--executor-memory is the memory required per executor; it is the executor's heap size. By default, 60% of the configured --executor-memory is used to cache RDDs, and the remaining 40% is available for objects created during task execution. This is equivalent to -Xms and -Xmx, so if you request more memory than is available, your executors will show errors regarding insufficient memory.
When you give a Spark executor 30G of memory, the OS will not immediately give it actual physical memory. But as and when your executor requires real memory, for caching or processing, it will push your other processes, like HBase, into swap. If your system's swap is set to zero, you will face an OOM error.
The OS swaps out the idle parts of a process, which can make that process behave very slowly.
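To illustrate the local-mode point with a hedged example: since the driver and the single executor share one JVM in local mode, the heap is sized through the driver setting, not --executor-memory:

# local mode: --executor-memory has no effect; size the driver's heap instead
spark-shell --master local[*] --driver-memory 8g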
