Memory profiling in DirectRunner Spark mode - apache-spark

I am doing memory profiling with YourKit and, to simplify matters for a Spark application, I am running the app in DirectRunner mode. The machine I am testing on has 32 cores. The captured snapshot shows the following:
The "direct-runner-worker" pool has 32 threads, and it seems I was under the false assumption that the direct runner occupies just one thread. My question is: shouldn't there be a limit on the number of parallel worker threads? In the snapshot, each thread occupies between 250 and 350 MB, and this will inevitably blow up.
Another question: I am not sure whether I should follow http://spark.apache.org/developer-tools.html#profiling for my case. That documentation seems to be aimed at an application running on a Spark cluster, but since I am using DirectRunner (for debugging purposes), maybe whatever I am doing is good enough. Does anyone have experience with this?
Any pointers are appreciated! :)
PS: my mind is boggled by the creation of 215 million objects, but that should go down with the thread count. However, ~6 million objects per thread still seems like a lot.
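For what it's worth, the DirectRunner's parallelism is configurable, so the thread count need not track the core count. Below is a minimal sketch, assuming this is Apache Beam's DirectRunner (the "direct-runner-worker" thread name suggests so) and using the Beam Python SDK for illustration; the Java SDK's equivalent knob is DirectOptions.setTargetParallelism, which, as far as I know, defaults to roughly one worker per available core, consistent with the 32 threads observed here:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Cap the DirectRunner's worker pool instead of letting it scale with cores.
# direct_num_workers / direct_running_mode are DirectRunner options in the
# Beam Python SDK; the pipeline contents below are placeholders.
opts = PipelineOptions(
    runner="DirectRunner",
    direct_num_workers=4,
    direct_running_mode="multi_threading",
)

with beam.Pipeline(options=opts) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)  # placeholder pipeline
```

With the pool capped, the per-thread footprint (250 to 350 MB here) stops multiplying across all 32 cores while profiling.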

Related

Pygmo 2: control of memory allocation

I'm happily running Pygmo 2 to solve an 18-parameter problem using self-adaptive differential evolution.
Everything runs fine, but at a high cost: Pygmo hugely overallocates memory, requesting about 170G while actually using about 10G.
I'm running on a shared cluster with a total of 500G, so I can't run multiple instances at the same time without affecting server performance for other users. As it takes 2-3 hours to complete one run, this is somewhat limiting for exploratory analysis and objective-function optimization.
I looked at the documentation, other SO questions, and git threads, but I have to say I didn't find much about memory usage.
So, my questions are:
Is this memory-greedy behaviour normal for problems with many parameters? Or is it due to how the objective function is coded? (I'd post the code, but it's a 600-line piece of code describing a thermodynamic biochemical equilibrium; unless necessary, I'd rather not clog the post.)
If this overallocation is normal, what purpose does it serve?
Is there a way to limit the memory pygmo allocates?
Tips/tricks/experiences/suggestions?
A few details about the setup:
pygmo 2.8
18-parameter problem
archipelago with 4 islands
population of 40 parents (there is an interesting statement at http://www1.icsi.berkeley.edu/~storn/code.html about the lack of performance gains from greatly increasing the number of parents, regardless of the number of parameters)
Thanks!
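For concreteness, here is a minimal sketch of the setup described above using pygmo 2's archipelago API, with pg.rosenbrock as a hypothetical stand-in for the 600-line objective. Note that each island holds its own copy of the problem and its own population, so a heavyweight objective multiplies its footprint across islands:

```python
import pygmo as pg

# Hypothetical placeholder for the 18-parameter thermodynamic problem.
prob = pg.problem(pg.rosenbrock(dim=18))

# Self-adaptive differential evolution, as in the post.
algo = pg.algorithm(pg.sade(gen=100))

# 4 islands, 40 individuals each, mirroring the setup above. Dropping the
# island count is the most direct way to shrink the resident footprint.
archi = pg.archipelago(n=4, algo=algo, prob=prob, pop_size=40)
archi.evolve()
archi.wait()
print(archi.get_champions_f())
```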

What to expect in terms of performance from my Spark Streaming Application in local mode?

I realize this might be a very broad question, but here is my issue: I developed a Spark application in Java which uses an algorithm to analyse several JSON messages (about 1 kB each) received through a socket connection at one-second intervals.
I'm only using 6 map methods, but the functions inside have several loops that can run up to 1000 times each (there are even cases where I have a loop inside a loop, which leads to 1000*1000 iterations in total).
I'm running the application in local mode, that is, with just one node (I assume) performing the Spark tasks and jobs.
The problem is that it takes up to 7 minutes to process one of these messages, which is an insane amount of time and causes great scheduling delays.
Is this normal given the complexity of my algorithm + running in local mode + possibly some memory leakage?
If so, how can I proceed to improve the throughput?
Don't know if it helps, but here are some specifications of my computer:
Processor: Intel Core i5, 2.60GHz
RAM: 3.87GB usable memory
64 bit operating system
Thank you so much.
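As a point of comparison, here is a minimal PySpark streaming sketch of the same shape of job (Python for illustration; the post's app is Java, where the same knobs exist). In local mode the master string controls how many threads Spark gets, and repartitioning the stream is what spreads the heavy per-message work across them; analyse() is a hypothetical stand-in for the algorithm:

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def analyse(msg):
    # Hypothetical stand-in for the expensive per-message algorithm.
    return msg

# "local[*]" gives Spark one worker thread per logical core; plain "local"
# would run everything on a single thread and serialize the heavy map stages.
conf = SparkConf().setMaster("local[*]").setAppName("json-analyser")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=1)  # 1-second batches, as in the post

lines = ssc.socketTextStream("localhost", 9999)
# Repartitioning spreads the per-message work across all local cores.
results = lines.repartition(sc.defaultParallelism).map(analyse)
results.pprint()

ssc.start()
ssc.awaitTermination()
```

On a laptop-class i5 with under 4 GB of RAM, though, even perfect parallelism only buys a small constant factor; profiling the inner 1000*1000 loops is likely to pay off more.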

Mono 4.2.2 garbage collection really slow/leaking on Linux with multiple threads?

I have an app that processes 3+ GB of data into 300 MB of data. If I run each independent dataset sequentially on the main thread, its memory usage tops out at about 3.5 GB and it works fine.
If I run each dataset concurrently on 10 threads, I see the following:
Virtual memory usage climbs steadily until allocations fail and it crashes (I can see the GC trying to run in the stack trace).
CPU utilization is 1000% for periods, then goes down to 100% for minutes, and then cycles back up. The app is easily 10x slower when run with multiple threads, even though they are completely independent.
This is the Mono 4.2.2 build for Linux with large-heap support, running on 128 GB of RAM with 40 logical processors. I am running mono-sgen and have tried all the custom GC settings I could think of (concurrent mark-and-sweep, max heap size, etc.).
These problems do not happen on Windows. If I rewrite code to do significant object pooling, I get farther in the dataset before running OOM, but the fate is the same. I have verified that I have no memory leaks using multiple tools and good-old printf-debugging.
My best theory is that lots of allocations across lots of threads are a case the GC handles poorly, and that most of the wall-clock time is spent with my worker threads suspended.
Does anyone have any experience with this? Is there a way I can help the GC get out of that 100% rut it gets stuck in, and avoid running out of memory?

Profiling resource usage - CPU, memory, hard-drive - of a long-running process on Linux?

We have a process that takes about 20 hours to run on our Linux box. We would like to make it faster, and as a first step need to identify bottlenecks. What is our best option to do so?
I am thinking of sampling the process's CPU, RAM, and disk usage every N seconds. So unless you have other suggestions, my specific questions would be:
How large should N be?
Which tool can provide accurate readings of these stats, with minimal interference or disruption from the fact that the tool itself is running?
Any other tips, nuggets of wisdom, or references to other helpful documents would be appreciated, since this seems to be one of those tasks where a newbie can make a lot of time-consuming mistakes and false starts.
First of all, what you want and what you are asking for are two different things.
Monitoring is what you need when running the process for the first time, i.e. when you don't yet know its resource utilization (CPU, memory, disk, etc.).
You can follow the procedure below to drill down to the bottleneck:
Monitor system resources (generally a 10-20 second interval is fine with Munin, Ganglia, or a similar tool).
From this you should be able to identify whether your hardware is the bottleneck, i.e. whether you are running out of resources: for example 100% CPU utilization, very little free memory, or high I/O wait.
If that is the case, think about upgrading the hardware or tuning what you have.
Then tune your application/utility. Use profilers/loggers to find out which method or process is taking the time, and try to tune it. If your code is single-threaded, consider parallelism. If a database is involved, try to tune your queries and DB parameters.
Then run the test again with monitoring to drill down further :)
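If you would rather script the sampling than deploy a monitoring stack, the question's "sample every N seconds" idea is only a few lines with psutil. A minimal sketch; the PID and log path are hypothetical:

```python
import time
import psutil

PID = 12345        # hypothetical: PID of the long-running process
INTERVAL = 10      # seconds; 10-20 s is plenty of resolution for a 20-hour job

proc = psutil.Process(PID)
with open("usage.log", "w") as log:
    while proc.is_running():
        cpu = proc.cpu_percent(interval=None)   # % of one core since last call
        rss = proc.memory_info().rss / 2**20    # resident set size, MB
        io = proc.io_counters()                 # cumulative bytes (Linux only)
        log.write(f"{time.time():.0f} cpu={cpu:.1f}% rss={rss:.0f}MB "
                  f"read={io.read_bytes} write={io.write_bytes}\n")
        log.flush()
        time.sleep(INTERVAL)
```

The sampler itself costs almost nothing, which addresses the "minimal interference" concern; the interval mainly trades log size against resolution.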
I think a graphical view will be helpful for your problem, and I recommend Munin.
It's a resource-monitoring tool with a web interface. By default it monitors disk I/O, memory, CPU, load average, network usage, and more. It's light and easy to install, and it's also easy to develop your own plugins and set alert thresholds.
http://munin-monitoring.org/
Here is an example of what you can get from Munin: http://demo.munin-monitoring.org/munin-monitoring.org/demo.munin-monitoring.org/

Memory of type "root-set" reallocation error - Erlang

I have been running a crypto-intensive application that generates pseudo-random strings with special structure and mathematical requirements. It has generated around 1.7 million voucher numbers per node over the last 8 days. The generation process was CPU-intensive, with very low memory requirements.
Mnesia running on OTP-14B02 was the storage database, and the generation was done within each virtual machine. I had 3 nodes in the cluster, with all mnesia tables of type disc_only_copies. Suddenly, as activity on the Solaris boxes increased (other users logged on remotely and were starting webservers, FTP sessions, and other tasks), my bash shell started reporting a fork: not enough space error.
My Erlang VMs also went down with the error below:
Crash dump was written to: erl_crash.dump
temp_alloc: Cannot reallocate 8388608 bytes of memory (of type "root_set").
Usually we get memory allocation errors rather than memory reallocation errors, and normally memory of type "heap" is the problem. This time, the reported memory type is "root_set".
Qn 1. What is this "root-set" memory?
Qn 2. Does it have anything to do with CPU-intensive activity? (The reason I ask is that when I start the task, the machine's response to mouse or keyboard input becomes very slow, meaning either the CPU is too busy or there is some other problem I cannot explain for now.)
Qn 3. Can such an error be avoided, and how?
The fork: not enough space message suggests this is a problem with the operating system setup, but:
Q1 - The Root Set
The Root Set is what the garbage collector uses as a starting point when it searches for data that is live in the heap. It usually starts from the registers of the VM and from the stack, if the stack has references to heap data that still need to be live. There may be other roots in Erlang I am not aware of, but these are the basic starting points.
That it is a reallocation error of exactly 8 megabytes could mean one of two things: either you don't have 8 megabytes free, or the heap is fragmented beyond recognition, so that while 8 megabytes are free in total, there is no contiguous block of that size.
Q2 - CPU activity impact
The problem has nothing to do with the CPU per se. You are running out of memory. A large root set could indicate that you have some very deep recursions going on where you keep around a lot of pointers to data. You may be able to rewrite the code such that it is tail-calling and uses less memory while operating.
You should be more worried about the slow response times from the keyboard and mouse. That could indicate something is not right. Does a vmstat 1, a sysstat, a htop, a dstat or similar show anything odd while the process is running? You are also on the hunt to figure out if the kernel or the C libc is doing something odd here due to memory being constrained.
Q3 - How to fix
I don't know how to fix it without knowing more about what the application is doing. Since you have a crash dump, your first instinct should be to open it in the crash dump viewer and look around. The goal is to find a process using a lot of memory, or one that has a deep stack. From there, you can seek to limit the amount of memory that process uses: either by rewriting the code so it gives memory up earlier, by tuning the garbage collection setup for the process (see the spawn options in the Erlang man pages), or by adding more memory to the system.
