Dask distributed workers always leak memory when running many tasks

What are some strategies to work around or debug this?
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 26.17 GB -- Worker memory limit: 32.66 GB
Basically, I am just running lots of parallel jobs on a single machine with a dask-scheduler, and I have tried various numbers of workers. Any time I launch a large number of jobs, memory gradually creeps up over time and only goes down when I bounce the cluster.
I am trying to use fire_and_forget. Will calling .release() on the futures help? I am typically launching these tasks via client.submit from the REPL and then terminating the REPL.
I would be happy to occasionally bounce workers and add some retry patterns if that is the correct way to use Dask with leaky libraries.
UPDATE:
I have tried limiting worker memory to 2 GB, but I am still getting this error. When the error happens, the worker seems to go into some sort of unrecoverable loop, continually printing the error, and no compute happens.

Dask isn't leaking the memory in this case; something else is. Dask is just telling you about it. Something in the code that you are running with Dask appears to be leaking memory.
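Since the leak is in the task code rather than in Dask itself, the workaround the question already hints at (drop futures, allow retries, and bounce workers between batches) is a reasonable one. Below is a minimal sketch of that pattern, assuming a local cluster; leaky_task, the batch sizes and the memory limit are placeholders of mine, not the poster's code:

from distributed import Client, wait

def leaky_task(x):
    # stand-in for a call into a third-party library that leaks memory
    return x * 2

if __name__ == "__main__":
    # hypothetical local cluster; the question uses a dask-scheduler instead
    client = Client(n_workers=4, memory_limit="4GB")

    for batch in range(10):
        futures = [client.submit(leaky_task, i, retries=2, pure=False)
                   for i in range(1000)]
        wait(futures)     # let the batch finish
        del futures       # drop our references so Dask can release the results
        client.restart()  # bounce the workers, reclaiming whatever leaked

If you never need the results, fire_and_forget(futures) (or calling .release() on futures you still hold) lets Dask forget them as soon as they complete, and if your version of distributed supports it, the --lifetime / --lifetime-restart options of dask-worker can automate the periodic restart instead of calling client.restart() by hand.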

Related

How to avoid out of memory error on Slurm clusters

What is the best way to avoid Slurm killing a process due to an OOM error, without running the process multiple times to test out different memory constraints? Is there a soft memory limit that I can set so that Slurm dynamically allocates more memory?
The best thing I came up with is to use a large memory limit and allow the process to share the resources, but I was wondering if there are better ways to prevent a process from being killed by an OOM error.
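As far as I know there is no soft limit that Slurm will grow on demand; the memory you request with --mem is what the job gets. One way to avoid hand-tuning the constraint is to automate the trial: submit with a modest request and resubmit with a larger one only if Slurm reports the job ran out of memory. A rough sketch of that idea (my own illustration, not a Slurm feature; job.sh is a placeholder script, and the OUT_OF_MEMORY state needs a reasonably recent Slurm):

import subprocess
import time

def submit(mem_gb):
    # submit the (hypothetical) job.sh with an explicit memory request
    out = subprocess.run(
        ["sbatch", "--parsable", f"--mem={mem_gb}G", "job.sh"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip().split(";")[0]  # --parsable prints "jobid[;cluster]"

def state(job_id):
    # job state as reported by sacct (empty until accounting has recorded it)
    out = subprocess.run(
        ["sacct", "-j", job_id, "-X", "--noheader", "--format=State"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

mem_gb = 4
while True:
    job_id = submit(mem_gb)
    s = state(job_id)
    while s in ("", "PENDING", "RUNNING", "COMPLETING"):
        time.sleep(30)        # wait for the job to finish
        s = state(job_id)
    if s.startswith("OUT_OF_MEMORY"):
        mem_gb *= 2           # escalate the request and try again
    else:
        break                 # completed, or failed for some other reason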

One application, multiple instances, different memory usage

I have a Node.js server running two instances in cluster mode (via pm2).
The two instances are obviously identical: they execute the same code and load the same data.
Yet memory usage differs by over 100%:
Instance 1: 303,592kB
Instance 2: 614,404kB
Is there any reason the OS (Linux) could cause this behavior? The machine has plenty of RAM, so I would rule out memory shortage.
Have the two servers been running for the same amount of time? Did they answer the same requests?
Node.js is a garbage-collected runtime. Memory use over time is not constant. The garbage collector kicks in depending on allocation behavior, heap size and limit, idleness, and possibly other factors. Maybe your instance 1 has just done a major round of garbage collection, and instance 2 is about to do one? Have you watched their memory usage over time?
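To act on that last question, a quick way to watch both instances over time is to sample their resident set size every few seconds. A small sketch (my own, in Python with the third-party psutil package; the PIDs are placeholders you would take from pm2 list):

import time
import psutil

PIDS = [12345, 12346]  # hypothetical PIDs of instance 1 and instance 2

procs = {pid: psutil.Process(pid) for pid in PIDS}

while True:
    for pid, proc in procs.items():
        rss_kb = proc.memory_info().rss // 1024   # resident set size in kB
        print(f"pid={pid} rss={rss_kb} kB")
    time.sleep(10)                                # one sample every 10 seconds

If the two curves converge after major GC cycles, the difference above is probably just garbage-collector timing rather than a leak.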

NodeJS, PM2, GC, Grafana - better understanding

I would like to understand the GC process a little bit better in Node.js/V8.
Could you provide some information for the following questions:
When GC is triggered, does this block the event loop of Node.js?
Is GC running in its own process, or is it just part of the event loop?
When spawning Node.js processes via PM2 (cluster mode), does each instance really have its own process, or is the GC shared between the instances?
For logging purposes I am using Grafana (https://github.com/RuntimeTools/appmetrics-statsd). Can someone explain the differences / more details about these gauges:
gc.size: the size of the JavaScript heap in bytes.
gc.used: the amount of memory used on the JavaScript heap in bytes.
Are there any scenarios where GC does not free memory (gc.used), particularly under stress tests?
The questions are related to an issue that I am currently facing: the used GC memory keeps rising and is never released (a classic memory leak). The problem is that it only appears when we have a lot of requests.
I played around with max-old-space-size to avoid pm2 restarts, but it looks like the GC is no longer freeing memory and the whole application is getting really slow...
Any ideas?
OK, some questions I already figured out:
gc.size = Total Heap Size (https://nodejs.org/api/v8.html -> getHeapStatistics),
gc.used = used_heap_size
It looks like it is expected that once gc.size hits a plateau it never goes down again =>
Memory usage doesn't decrease in node.js? What's going on?
Why is garbage collection expensive? The V8 JavaScript engine employs a stop-the-world garbage collector mechanism. In practice, it means that the program stops execution while garbage collection is in progress.
https://blog.risingstack.com/finding-a-memory-leak-in-node-js/

How much memory can/will w3wp use

On Win2k12, running IIS 8, with 12 GB of RAM, how much memory will a single w3wp process use with no other pressures?
I am attempting to isolate where I have an issue. It seems like there is a leak somewhere, since it grows to about 10 GB, then crashes and starts up again. It usually hovers between 800 MB and 2 GB. Then, all of a sudden, it starts increasing over a 2-5 minute period and crashes.
As I begin to isolate the issue, I was wondering: all things being equal, how much would the process use with no other pressure, just one site on this server? Similar to the way SQL Server will use up all the RAM if you let it, will w3wp do the same?

JVM process killed by OS

I've implemented a web service using Camel's Jetty component through Akka (endpoint), which forwards received messages to an actor pool with the following setup:
def receive = _route()
def lowerBound = 5
def upperBound = 20
def rampupRate = 0.1
def partialFill = true
def selectionCount = 1
def instance() = Actor.actorOf[Processor]
And Processor is a class that processes the received message and replies with the result. The app has been working normally and flawlessly on my local machine; however, after deploying it on an EC2 micro instance (512 MB of memory, CentOS-like OS), the OS (oom-killer) kills the process due to running out of memory (not a JVM OOM) after 30 calls or so, regardless of the frequency of calls.
Profiling the application locally doesn't show any significant memory leaks, if there are any at all. Due to some difficulties I could not perform proper profiling on the remote machine, but monitoring top's output, I observed something interesting: the available free memory stays around 400 MB after the app is initialized, and afterwards it bounces between 380 MB and 400 MB, which seems pretty natural (GC, etc.). But the interesting part is that after receiving the 30th or so call, it suddenly goes from there to 5 MB of free memory and boom, it's killed. The oom-killer log in /var/log/messages verifies that this was done by the OS due to lack of memory/free swap.
Now, this is not entirely Akka-related, but after 3 days of hopeless wrestling I finally decided I should seek some advice from you guys.
Thanks for any leads.
I have observed that when a lot of small objects are created, which should be garbage collected immediately, the Java process is killed, perhaps because the memory limit is reached before the temporary objects are reclaimed by the GC.
Try running it with the concurrent mark-and-sweep garbage collector:
java -XX:+UseConcMarkSweepGC
My general observation is that the JVM uses a lot of memory beyond the Java heap. I don't know exactly for what, but can only speculate that it might be using the normal C heap for compilation, compiled-code storage, other permgen stuff, or whatnot. Either way, I have found it difficult to control its usage.
Unless you're really pressed for disk space, you may want to simply create a swap file of a GB or two so that the JVM has somewhere to overflow. In my experience, the memory it uses outside the Java heap isn't referenced very often anyway and can just lie swapped out safely without causing much I/O.
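For the swap-file suggestion, the standard sequence on Linux is dd, mkswap, swapon. A sketch of my own, driving those commands from Python; it assumes root privileges and a hypothetical /swapfile path:

import subprocess

SWAPFILE = "/swapfile"   # hypothetical path
SIZE_MB = 1024           # roughly 1 GB of swap

# create the backing file, restrict its permissions, then format and enable it
subprocess.run(["dd", "if=/dev/zero", f"of={SWAPFILE}", "bs=1M", f"count={SIZE_MB}"], check=True)
subprocess.run(["chmod", "600", SWAPFILE], check=True)
subprocess.run(["mkswap", SWAPFILE], check=True)
subprocess.run(["swapon", SWAPFILE], check=True)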
