How much memory can be allocated to Cassandra in DSE with Spark enabled?

Currently my DSE Cassandra uses up all of the memory, and therefore, after some time and with a growing amount of data, the whole system crashes. But Spark, OpsCenter, the DataStax agent, etc. also need several GB of memory. I am now trying to allocate only half of the memory to Cassandra, but I'm not sure if that will work.
This is my error message:
kernel: Out of memory: Kill process 31290 (java) score 293 or sacrifice child

By default DSE sets the executor memory to (Total RAM) * 0.7 - (RAM used by C*). This should be OK for most systems. With this setup Spark shouldn't be able to OOM C*, or vice versa. If you want to change that multiplier (0.7), it's set in the dse.yaml file as
initial_spark_worker_resources: 0.7
If I were choosing a minimum amount of memory for the system, it would be 16 GB, but I would recommend at least 32 GB if you are serious. This should be increased even more if you are doing a lot of in-memory caching.
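As a rough illustration of that formula (the RAM figures below are made-up examples, not defaults):

# Illustrative arithmetic only; the numbers are hypothetical examples.
total_ram_gb = 64                      # total RAM on the node
cassandra_ram_gb = 8                   # RAM already used by C*
initial_spark_worker_resources = 0.7   # the multiplier set in dse.yaml

spark_worker_memory_gb = total_ram_gb * initial_spark_worker_resources - cassandra_ram_gb
print(spark_worker_memory_gb)          # 64 * 0.7 - 8 = 36.8 GB left for Spark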

Related

Spark is not using all of the configured storage memory capacity

My Spark task uses image data for prediction. I am working on a standalone Spark cluster, but I have an issue utilizing all of the available memory capacity: the available storage memory is 2.7 GB (which makes sense, coming from an executor memory configured at 5 GB: 5 GB * 0.6 * 0.9 = 2.7 GB), but the used memory only ever reaches 342 MB, beyond which my Spark session crashes, and I don't know why it is this specific value.
I have tested my application both locally and in standalone cluster mode; in addition, whatever executor memory value I configure, the memory limit for execution is always 342 MB. As shown in the screenshots below, a data size of 290691 KB led to the crash of my Spark session, and it works fine if I decrease the number of images.
The following screenshots illustrate the issue:
This output shows the error that crashed with a data size of 290691 KB.
Here my Spark UI storage memory did not exceed 342 MB.
So is there any advice, or what is the correct Spark configuration?
It's a warning, initially.
The general gist here is that you need to repartition to get more, but smaller, partitions, so as to get more parallelism and higher throughput. You can find many such issues out there on the Internet.
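For example, a minimal PySpark sketch of that repartitioning (the input path, partition count, and column name are placeholders, not taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")    # placeholder input path
df = df.repartition(200)                  # more, smaller partitions -> more parallelism
# or repartition by a well-distributed column to also reduce skew:
df = df.repartition(200, "some_column")   # "some_column" is a placeholder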

Structured Streaming runs out of memory after running for a while even with spark.catalog.clearCache()

spark version: 2.4.0
My application is a simple pyspark-kafka structured streaming application using spark-sql-kafka-0-10_2.11:2.4.0
To avoid this possible memory leak problem, I am also using foreachBatch for each microbatch.
Also, each microbatch is supposed to be composed of <= 1000 rows, meaning it is very unlikely to cause an out-of-memory issue as long as caches are cleared properly. To be extra cautious, I called spark.catalog.clearCache() at the end of each microbatch to ensure all caches are cleared.
However, after having it run for a while (~ 30 mins) it raises the following issue.
22/01/11 10:39:36 ERROR Client: Application diagnostics message: Application application_1641893608223_0002 failed 2 times due to AM Container for appattempt_1641893608223_0002_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: Container [pid=17995,containerID=container_1641893608223_0002_02_000001] is running beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 4.4 GB of 6.9 GB virtual memory used. Killing container.
Even though 1.4 GB is a small amount of memory, each microbatch itself is pretty small as well so it shouldn't be a problem.
Also, there are a lot of tasks stacked up in the Kafka queue. In order to prevent overload in Spark streaming, I have set spark.streaming.blockInterval to 40000 ms and maxOffsetsPerTrigger to 10.
What could possibly be causing this out-of-memory issue?
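For reference, a rough sketch of the setup described above (the broker address, topic, paths, and sink are placeholders, not taken from the original application):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-foreachbatch-sketch").getOrCreate()

def process_batch(batch_df, batch_id):
    # Each microbatch is small (<= 1000 rows); process it, then clear caches.
    batch_df.write.format("parquet").mode("append").save("/tmp/out")  # placeholder sink
    spark.catalog.clearCache()

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
             .option("subscribe", "my_topic")                    # placeholder
             .option("maxOffsetsPerTrigger", 10)
             .load())

query = (stream_df.writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "/tmp/checkpoint")        # placeholder
         .start())
query.awaitTermination()

(Note that spark.streaming.blockInterval is a legacy DStream setting and has no effect on Structured Streaming queries like this one.)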

Any tips for scaling Spark horizontally

Does anybody have any tips when moving Spark execution from a few large nodes to many, smaller nodes?
I am running a system with 4 executors, each with 24 GB of RAM and 12 cores. If I try to scale that out to 12 executors with 4 cores and 8 GB of RAM each (same total RAM, same total cores, just distributed differently), I run into out-of-memory errors:
Container killed by YARN for exceeding memory limits. 8.8 GB of 8.8 GB physical memory used.
I have increased the number of partitions by a factor of 3 to create more (yet smaller) partitions, but this didn't help.
Does anybody have any tips & tricks when trying to scale spark horizontally?
This is a pretty broad question, executor sizing in Spark is a very complicated kind of black magic, and the rules of thumb which were correct in 2015 for example are obsolete now, as will whatever I say be obsolete in 6 months with the next release of Spark. A lot comes down to exactly what you are doing and avoiding key skew in your data.
This is a good place to start to learn and develop your own understanding:
https://spark.apache.org/docs/latest/tuning.html
There are also a multitude of presentations on Slideshare about tuning Spark, try and read / watch the most recent ones. Anything older than 18 months be sceptical of, and anything older than 2 years just ignore.
I will make the assumption that you are using at least Spark 2.x.
The error you're encountering is indeed because of poor executor sizing. What is happening is that your executors are attempting to do too much at once, and running themselves into the ground as they run out of memory.
All other things being equal these are the current rules of thumb as I apply them:
The short version
3 - 4 virtual (hyperthreaded) cores and 29GB of RAM is a reasonable default executor size (I will explain why later). If you know nothing else, partition your data well and use that.
You should normally aim for a data partition size (in memory) on the order of ~100MB to ~3GB
The formulae I apply
Executor memory = number of executor cores * partition size * 1.3 (safety factor)
Partition size = (size of data on disk / number of partitions) * deserialisation ratio
The deserialisation ratio is the ratio between the size of the data in memory and its size on disk. The Java in-memory representation of the same data tends to be a fair bit larger than on disk.
You also need to account for whether your data is compressed, many common formats like Parquet and ORC use compression like gzip or snappy.
For snappy compressed text data (very easily compressed), I use ~10X - 100X.
For snappy compressed data with a mix of text, floats, dates etc I see between 3X and 15X typically.
number of executor cores = 3 to 4
Executor cores totally depends on how compute vs memory intensive your calculation is. Experiment and see what is best for your use case. I have never seen anyone informed on Spark advocate more than 6 cores.
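As a worked example of those formulae (all of the input numbers below are made up, purely to show the arithmetic):

# Hypothetical inputs, only to illustrate the formulae above.
size_on_disk_gb = 100        # e.g. snappy-compressed columnar data
num_partitions = 1000
deser_ratio = 5              # mixed text/floats/dates, within the 3x-15x range above

partition_size_gb = size_on_disk_gb / num_partitions * deser_ratio        # 0.5 GB per partition
executor_cores = 4
safety_factor = 1.3
executor_memory_gb = executor_cores * partition_size_gb * safety_factor   # 2.6 GB per executor
print(partition_size_gb, executor_memory_gb)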
Spark is smart enough to take advantage of data locality, so the larger your executor, the better chance that your data is PROCESS_LOCAL
More data locality is good, up to a point.
When a JVM gets too large (> 50 GB), it begins to operate outside what it was originally designed to do, and depending on your garbage collection algorithm, you may begin to see degraded performance and high GC time.
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
There also happens to be a performance trick in Java that if your JVM is smaller than 32GB, you can use 32 bit compressed pointers rather than 64 bit pointers, which saves space and reduces cache pressure.
https://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
It also so happens that YARN adds 7% or 384 MB of RAM (whichever is larger) to your executor size for overhead / safety factor, which is where the 29 GB rule of thumb comes from: 29 GB + 7% ≈ 31 GB, which keeps you just under the 32 GB compressed-pointer threshold.
You mentioned that you are using 12-core, 24 GB RAM executors. This sends up a red flag for me.
Why?
Because every "core" in an executor is assigned one "task" at a time. A task is equivalent to the work required to compute the transformation of one partition from "stage" A to "stage" B.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-taskscheduler-tasks.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-DAGScheduler-Stage.html
If your executor has 12 cores, then it is going to try to do 12 tasks simultaneously with a 24 GB memory budget. 24 GB / 12 cores = 2 GB per core. If your partitions are greater than 2 GB, you will get an out-of-memory error. If a particular transformation doubles the size of the input (even intermediately), then you need to account for that as well.

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result.
Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:
16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm confused as to why I'm getting a ~10.5 GB memory limit on a ~75 GB cluster (15 GB per m3.xlarge instance)...
Here is my EMR config:
[
  {
    "classification": "spark-env",
    "properties": {},
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34"
        },
        "configurations": []
      }
    ]
  },
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  }
]
From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all the resources available on the cluster. I.e., I should have ~75 GB of memory available... So why am I getting a ~10.5 GB memory limit error?
Here is the code I'm running:
import sys
import statistics

import arrow                      # used below to convert timestamp strings to Unix time
import pyspark
import pyspark.sql.functions

# SESSION_TIMEOUT and save_session are defined elsewhere in the original script.

def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))
    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))
    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws("_", "user_id", "site_id", pyspark.sql.functions.sum("new_session").over(window))))
    return sessions

def aggregate_sessions(sessions):
    median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
    aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
        pyspark.sql.functions.first("site_id").alias("site_id"),
        pyspark.sql.functions.first("user_id").alias("user_id"),
        pyspark.sql.functions.count("id").alias("hits"),
        pyspark.sql.functions.min("timestamp").alias("start"),
        pyspark.sql.functions.max("timestamp").alias("finish"),
        median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
    )
    return aggregated

spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)
raw_data = spark_session.read.csv(sys.argv[1],
                                  header=True,
                                  inferSchema=True)

# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)
raw_data = raw_data.withColumn("timestamp",
                               convert_to_unix(pyspark.sql.functions.col("timestamp")))

sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)
Basically, nothing more than windowing and a groupBy to aggregate the data.
It starts with a few of those errors, and the same error increases in frequency until the job halts.
I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
I feel your pain..
We had similar issues of running out of memory with Spark on YARN. We have five 64 GB, 16-core VMs, and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks -- they would eventually die no matter how much memory we gave them. And this was a relatively straightforward Spark application that was causing this to happen.
We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
Doing more research, I found the answer to why this happens here: http://web.archive.org/web/20190806000138/https://mapr.com/blog/best-practices-yarn-resource-management/
Since on Centos/RHEL 6 there are aggressive allocation of virtual memory due to OS behavior, you should disable virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
That page had a link to a very useful page from IBM: https://web.archive.org/web/20170703001345/https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
In summary, glibc > 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.
Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
I recommend reading through both pages -- the information is very handy.
If you're not using spark-submit, and you're looking for another way to specify the yarn.nodemanager.vmem-check-enabled parameter mentioned by Duff, here are 2 other ways:
Method 2
If you're using a JSON Configuration file (that you pass to the AWS CLI or to your boto3 script), you'll have to add the following configuration:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
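If you build the cluster from a boto3 script instead, the same classification can be passed programmatically. A rough sketch (the cluster name, release label, instance types, and roles below are placeholders, not values from the question):

import boto3

emr = boto3.client("emr", region_name="us-east-1")   # placeholder region

# The same yarn-site classification as above, passed directly to run_job_flow.
configurations = [
    {
        "Classification": "yarn-site",
        "Properties": {"yarn.nodemanager.vmem-check-enabled": "false"},
    }
]

emr.run_job_flow(
    Name="my-cluster",                       # placeholder
    ReleaseLabel="emr-5.12.0",               # placeholder
    Configurations=configurations,
    Instances={
        "MasterInstanceType": "m3.xlarge",   # placeholders
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 5,
    },
    JobFlowRole="EMR_EC2_DefaultRole",       # default EMR roles
    ServiceRole="EMR_DefaultRole",
)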
Method 3
If you use the EMR console, add the following configuration:
classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]
See, I had the same problem in a huge cluster that I'm working on now. The problem will not be solved by adding memory to the worker. Sometimes during aggregation Spark will use more memory than it has, and the Spark jobs will start to use off-heap memory.
One simple example is:
If you have a dataset that you need to reduceByKey, it will sometimes aggregate more data in one worker than in another, and if this data exceeds the memory of that worker you get that error message.
Adding the option spark.yarn.executor.memoryOverhead will help if you set it to 50% of the memory used by the worker (just as a test, to see if it works; you can then lower it with more testing).
But you need to understand how Spark works with the Memory Allocation in the cluster:
In the most common setup, Spark uses 75% of the machine's memory. The rest goes to the OS.
Spark has two types of memory during execution. One part is for execution and the other is for storage. Execution memory is used for shuffles, joins, aggregations, etc. Storage memory is used for caching and propagating data across the cluster.
One good thing about memory allocation: if you are not using cache in your execution, you can set Spark to use that storage space for execution, which partly avoids the OOM error. As you can see in the Spark documentation:
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
But how can we use that?
You can change some configurations: add the memoryOverhead configuration to your job call, but consider adding these too: change spark.memory.fraction to 0.8 or 0.85 and reduce spark.memory.storageFraction to 0.35 or 0.2.
Other configurations can help, but you need to check what works in your case. See all these configurations here.
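A sketch of how those settings could be passed when building a session (the values are just the ones suggested above, not universally correct ones; on YARN, spark.yarn.executor.memoryOverhead is normally passed via spark-submit --conf instead of in code):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-tuning-sketch")
         .config("spark.yarn.executor.memoryOverhead", "2048")  # MB of off-heap overhead per executor
         .config("spark.memory.fraction", "0.8")
         .config("spark.memory.storageFraction", "0.2")
         .getOrCreate())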
Now, what helped in my case:
I have a cluster with 2.5K workers and 2.5 TB of RAM, and we were facing an OOM error like yours. We just increased spark.yarn.executor.memoryOverhead to 2048, and we enabled dynamic allocation. When we call the job, we don't set the memory for the workers; we leave that for Spark to decide. We just set the overhead.
But for some tests on my small cluster, I changed the sizes of the execution and storage memory. That solved the problem.
Try repartitioning. It worked in my case.
The dataframe was not so big at the very beginning, when it was loaded with read.csv(). The data file amounted to 10 MB or so, which might require, say, a total of several hundred MB of memory for each processing task in an executor.
I checked the number of partitions to be 2 at the time.
Then it grew like a snowball during the following operations, joining with other tables and adding new columns. And then I ran into the memory-exceeding-limits issue at a certain step.
I checked the number of partitions; it was still 2, derived from the original dataframe I guess.
So I tried to repartition it at the very beginning, and there was no problem anymore.
I have not read much material about Spark and YARN yet. What I do know is that there are executors on the nodes, and an executor can handle many tasks depending on its resources. My guess is that one partition is atomically mapped to one task, and its volume determines the resource usage; Spark cannot slice it up if one partition grows too big.
A reasonable strategy is to determine the node and container memory first, say 10 GB or 5 GB. Ideally, either could serve any data processing job; it is just a matter of time. Given the 5 GB memory setting, and a reasonable number of rows per partition that you find by testing, say 1000 (such that no step fails during processing), we could do it as in the following pseudocode:
import math

RWS_PER_PARTITION = 1000  # rows per partition that testing showed to be safe

input_df = spark.read.csv("file_uri", *other_args)
total_rows = input_df.count()
original_num_partitions = input_df.rdd.getNumPartitions()  # a DataFrame exposes this via its underlying RDD
# repartition() needs an integer, so round up and never go below the original count
numPartitions = max(math.ceil(total_rows / RWS_PER_PARTITION), original_num_partitions)
input_df = input_df.repartition(numPartitions)
Hope it helps!
I had the same issue on a small cluster running a relatively small job on Spark 2.3.1.
The job reads a Parquet file, removes duplicates using groupBy/agg/first, then sorts and writes a new Parquet file. It processed 51 GB of Parquet files on 4 nodes (4 vcores, 32 GB RAM).
The job was constantly failing at the aggregation stage. I wrote a bash script to watch executor memory usage and found out that in the middle of the stage one random executor starts taking double the memory for a few seconds. When I correlated the time of this moment with the GC logs, it matched a full GC that frees a large amount of memory.
At last I understood that the problem is somehow related to GC. ParallelGC and G1 cause this issue constantly, but ConcMarkSweepGC improves the situation. The issue appears only with a small number of partitions. I ran the job on EMR, where OpenJDK 64-Bit (build 25.171-b10) was installed. I don't know the root cause of the issue; it could be related to the JVM or the operating system. But it is definitely not related to heap or off-heap usage in my case.
UPDATE 1
I tried Oracle HotSpot; the issue is reproduced there as well.

Cassandra, the available RAM is used to a maximum

I want to make a simple query over approximately 10 million rows.
I have 32 GB of RAM (20 GB is free), and Cassandra is using so much memory that the available RAM is used to the maximum and the process is killed.
How can I optimize Cassandra? I have read about "Tuning Java resources" and changing the Java heap sizing, but I still have no solution.
Cassandra will use up as much memory as is available to it on the system. It's a greedy process and will use any available memory for caching, similar to the way the kernel page cache works. Don't worry if Cassandra is using all your host's memory; it will just be held in cache and will be released to other processes if necessary.
If your query is suffering from timeouts this will probably be from reading too much data from a single partition so that the query doesn't return in under read_request_timeout_in_ms. If this is the case you should look at making your partition sizes smaller.
