Container crashes after one hour due to OOM - apache-spark

i'm running spark using docker on DC/OS. When i submit the spark jobs, using the following memory configurations
Driver 2 Gb
Executor 2 Gb
Number of executors are 3.
The spark submit works fine, after 1 hour the docker container(worker container) crashes due to OOM (exit code 137). but my spark logs shows that 1Gb+ of memory is available.
The strange thing is the same jar which is running in the container , runs normally for almost 20+ hours in the standalone mode.
Is it the normal behaviour of the Spark contianers, or is there Something im doing wrong.Or are there any extra configuraton do I need to use for the docker container.
Thanks

It looks like I have a similar issue. Have you looked at the cache/buffer memory usage on the OS?
Using the command below you can get some info on the type of memory usage on the OS:
free -h
In my case the buffer / cache kept on growing until there was no more memory available in the Container. In my case the VM was a CentOS machine running on AWS and it crashed entirely when this happened.

Is your spark calling REST end point, if yes, try closing connections

Related

emr spark master node runs out of memory in yarn cluster mode

I am new to EMR and I am running an EMR cluster, with 1 master (32gb) and 5 core nodes (16gb). I launch 11 apps. The apps have to be separated in case one of them fail (all of them are streaming apps). I must mention that I also got ElasticSearch running on the cluster.
After some time the master node is running out of memory and stops responding and some apps starting to fail. In the process overview I found many smaller hadoop processes with that occupy 1-1.3GB of RAM. I guess these are the driver processes from each app. I tried to reduce the the driver memory under "spark.driver.memory" to 512MB, but it's still at 1.3GB after relaunching the apps. Is this because of yarn?
ES just allocates ca. 6.5 GB of RAM of the master node
I had to specify the driver memory in spark-submit command like this:
spark-submit --driver-memory 500M
because to specify it inside the python file is too late, when you run the driver in client mode, because it allocates the memory before

How to start Cassandra and Spark services assigning half of the RAM memory on each program?

I have a 4 nodes cluster on which there are installed Spark and Cassandra on each node.
Spark version 3.1.2 and Cassandra v3.11
Let me say that each nodes have 4GB of RAM and I want to run my "spark+cassandra" program all over the cluster.
How can I assign 2GB of RAM for Cassandra execution and 2GB for Spark execution?
I noted that.
If my Cassandra cluster is up and I run start-worker.sh command on a worker node to make my spark cluster up, suddenly Cassandra service stops but spark still works. Basically, Spark steals RAM resources to Cassandra. How can I avoid also this?
On Cassandra logs of the crashed node I read the message:
There is insufficient memory for the Java Runtime Environment to continue.
In fact typing top -c and then shift+M i can see Spark Service at the top of column Memory
Thanks for any suggestions.
By default, Spark workers take up the total RAM less 1GB. On a 4GB machine, the worker JVM consumes 3GB of memory. This is the reason the machine runs out of memory.
You'll need to configure the SPARK_WORKER_MEMORY to 1GB to leave enough memory for the operating system. For details, see Starting a Spark cluster manually.
It's very important to note as Alex Ott already pointed out, a machine with only 4GB of RAM is not going to be able to do much so expect to run into performance issues. Cheers!

Docker containers freezing

I'm currently trying to deploy a node.js app on docker containers. I need to deploy 30 of them but they begin to have a weird behavior at some point, some of them freeze.
I am currently running Docker version for windows 18.03.0-ce, build 0520e24302, my computer specs (cpu and memory):
I5 4670 K
24 GB of ram
My docker default machine resource allocation is the following :
Allocated RAM : 10 Gb
Allocated vCPUs : 4
My node application is running on apline3.8 and node.js 11.4 and mostly do http requests every 2-3 seconds.
When i try to deploy 20 containers everything is running like a charm, my application do the job and i can see that there is an activity on every on my containers through the logs, activity stats.
The problem comes when i try to deploy more containers, more than 20, i can notice that some of the previously deployed containers start to stop their activities (0% cpu using, logs freezing). When everything is deployed (30 containers), Docker start to block the activity of some of them and unblock them at some point to block some others (blocking/unblocking is random). It seems to be sequential. I tried to wait and see what happened and the result is that some of the containers are able to poursue their activities and some others are stuck forever (still running but no more activity).
It's important to notice that i used the following resources restrictions on each of my containers :
MemoryReservation : 160mb
Memory soft limit : 160mb
NanoCPUs : 250000000 (0.25 cpus)
I had to increase my docker default machine resource allocation and decrease container's ressource allocation because it was using almost 100% of my cpu, maybe i did a mistake in my configuration. I tried to tweak those values, but no success i still have some containers freezing.
I'm kind of lost right know.
Any help would be appreciated even a little one, thank you in advance !

Buffer/cache exhaustion Spark standalone inside a Docker container

I have a very weird memory issue (which is what a lot of people will most
likely say ;-)) with Spark running in standalone mode inside a Docker
container. Our setup is as follows: We have a Docker container in which we have a Spring boot application that runs Spark in standalone mode. This Spring boot app also contains a few scheduled tasks (managed by Spring). These tasks trigger Spark jobs. The Spark jobs scrape a SQL database, shuffles the data a bit and then writes the results to a different SQL table (writing the results doesn't go through Spark). Our current data set is very small (the table contains a few million rows).
The problem is that the Docker host (a CentOS VM) that runs the Docker
container crashes after a while because the memory gets exhausted. I currently have limited the Spark memory usage to 512M (I have set both executor and driver memory) and in the Spark UI I can see that the largest job only takes about 10 MB of memory. I know that Spark runs best if it has 8GB of memory or more available. I have tried that as well but the results are the same.
After digging a bit further I noticed that Spark eats up all the buffer / cache memory on the machine. After clearing this manually by forcing Linux to drop caches (echo 2 > /proc/sys/vm/drop_caches) (clearing the dentries and inodes) the cache usage drops considerably but if I don't keep doing this regularly I see that the cache usage slowly keeps going up until all memory is used in buffer/cache.
Does anyone have an idea what I might be doing wrong / what is going on here?
Big thanks in advance for any help!

Cannot run memory intense program on Spark through YARN

I'm trying to benchmark a program on an Azure cluster using Spark. We previously ran this on EC2 and know that 150 GB of RAM is sufficient. I have tried multiple setups for the executors and given them 160-180GB of RAM but regardless of what I do, the program dies due to executors requesting more memory.
What can I do? Are there more launch options I should consider, I have tried every conceivable executor setup and nothing seems to want to work. I'm at a total loss.
For your command, you specified 7 executor and each with 40g of memory. That's 280G of memory in total, but you said your cluster has only 160-180 G of memory? If only 150G of memory is needed, why the spark-submit is configured that way?
What's your HDI cluster node type and how many of them you created?
Were you using YARN previously on EC2 as well? In that case, are the configuration the same?

Resources