I'm happily running Pygmo2 to solve an 18-parameter problem using self-adaptive differential evolution.
Everything runs fine, but at a high cost: Pygmo hugely overallocates memory, requesting about 170G while actually using about 10G.
I'm running on a shared cluster with a total of 500G, so I can't run multiple instances at the same time without affecting server performance for other users. As one run takes 2-3 hours to complete, this is somewhat limiting for exploratory analysis and objective function optimization.
I looked at the documentation, other SO questions, and git threads, but I have to say I didn't find much about memory usage.
So, my questions are:
Is this memory-greedy behaviour normal for problems with multiple parameters? Or is it due to how the objective function is coded? (I'd post the code, but it is a 600-line piece describing a thermodynamic biochemical equilibrium, so I would rather not clog the post unless necessary.)
If this overallocation is normal, what purpose does it serve?
Is there a way to limit the memory pygmo allocates?
Tips/tricks/experiences/suggestions?
A few details about the setup:
pygmo 2.8
18-parameter problem
archipelago with 4 islands
population of 40 parents (there is an interesting statement at http://www1.icsi.berkeley.edu/~storn/code.html that greatly increasing the number of parents brings no performance gain, regardless of the number of parameters)
Thanks!
I am doing memory profiling using YourKit and, to simplify matters for a Spark application, I am running the app in DirectRunner mode. The machine I am testing on has 32 cores. The captured snapshot looks like:
The "direct-runner-worker" has 32 threads and it seems like I was under the false assumption that direct runner occupies just one thread. My question is - shouldn't there be a limit on the number of parallelization threads? In the snapshot a thread occupies between 250 and 350 MB and this will inevitably blow up.
Another question: I am not sure whether I should follow http://spark.apache.org/developer-tools.html#profiling in my case. The documentation seems to be written for an application running on a Spark cluster, but since I am using DirectRunner (for debugging purposes), maybe what I am doing is good enough. Does anyone have experience with this?
Any pointers are appreciated! :)
PS: my mind is boggled by the creation of 215 million objects but that should go down with the thread count. However, ~6 million objects per thread seem like a lot.
I realize this might be a very broad question, but this is my issue: I developed a Spark application in Java which uses an algorithm to analyse several JSON messages (about 1 kB each) which are received through a socket connection at one-second intervals.
I'm only using 6 map methods, but the functions inside have several loops that can run up to 1000 times each (there are even cases where I have a loop inside a loop which leads to them being run 1000*1000 times in total).
I'm running the application in local mode, that is, with just one node (I assume) to perform the Spark tasks and jobs.
The problem here is that I am taking up to 7 minutes to process one of these messages, which is an insane amount of time, and causes great scheduling delays.
Is this normal given the complexity of my algorithm, running in local mode, and possibly some memory leakage?
If so, how can I proceed to improve the throughput?
Don't know if it helps, but here are some specifications of my computer:
Processor: Intel Core i5, 2.60GHz
RAM: 3.87GB usable memory
64 bit operating system
Thank you so much.
We have a process that takes about 20 hours to run on our Linux box. We would like to make it faster, and as a first step need to identify bottlenecks. What is our best option to do so?
I am thinking of sampling the process's CPU, RAM, and disk usage every N seconds. So unless you have other suggestions, my specific questions would be:
How much should N be?
Which tool can provide accurate readings of these stats, with minimal interference or disruption from the fact that the tool itself is running?
Any other tips, nuggets of wisdom, or references to other helpful documents would be appreciated, since this seems to be one of these tasks where you can make a lot of time-consuming mistakes and false-starts as a newbie.
First of all, what you want and what you are asking for are completely different things.
Monitoring is required when you are running the process for the first time, i.e. when you don't know its resource utilization (CPU, memory, disk, etc.).
You can follow the procedure below to drill down to the bottleneck:
Monitor system resources (an interval of 10-20 seconds is generally fine, using Munin, Ganglia, or another tool).
From this you should be able to tell whether your hardware is the bottleneck, i.e. whether you are running out of resources (for example 100% CPU utilization, very little free memory, high I/O).
If that is your case, consider upgrading the hardware or tuning the existing setup.
Then tune your application/utility. Use profilers/loggers to find out which method or process is taking the time, and try to tune it. If the code is single-threaded, consider parallelizing it. If a database is involved, try tuning your queries and DB parameters.
Then run the test again with monitoring to drill down further :)
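As a rough illustration of the first step, here is a minimal sampling sketch for a Linux box. It is an assumption-laden stand-in for a real tool such as Munin: the function names are made up for this example, it only reads the aggregate `cpu` line of `/proc/stat` and one key of `/proc/meminfo`, and the 15-second default is just a value inside the 10-20 second range mentioned above.

```python
import time

def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            # Fields: user nice system idle iowait irq softirq steal
            fields = [int(v) for v in line.split()[1:9]]
            idle = fields[3] + fields[4]  # idle + iowait count as "not busy"
            total = sum(fields)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")

def parse_meminfo_kb(meminfo_text, key):
    """Return the value (in kB) of one key from /proc/meminfo, e.g. 'MemFree'."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name == key:
            return int(rest.split()[0])
    raise KeyError(key)

def cpu_busy_percent(prev, cur):
    """CPU utilization between two (busy, total) samples, in percent."""
    busy = cur[0] - prev[0]
    total = cur[1] - prev[1]
    return 100.0 * busy / total if total else 0.0

def sample_forever(interval_s=15):
    """Hypothetical driver loop (Linux only): print one line per interval."""
    def read(path):
        with open(path) as f:
            return f.read()
    prev = parse_cpu_times(read("/proc/stat"))
    while True:
        time.sleep(interval_s)
        cur = parse_cpu_times(read("/proc/stat"))
        free_kb = parse_meminfo_kb(read("/proc/meminfo"), "MemFree")
        print(f"cpu={cpu_busy_percent(prev, cur):.1f}% memfree={free_kb} kB")
        prev = cur
```

Because the sampler sleeps for almost the whole interval and reads two small files, its own footprint should be small, but that is worth verifying on your system before trusting the numbers.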
I think a graphical representation would help you solve your problem, and I recommend Munin.
It's a resource monitoring tool with a web interface. By default it monitors disk IO, memory, cpu, load average, network usage... It's light and easy to install. It's also easy to develop your own plugins and set alert thresholds.
http://munin-monitoring.org/
Here is an example of what you can get from Munin: http://demo.munin-monitoring.org/munin-monitoring.org/demo.munin-monitoring.org/
I'm seeing a huge (~200++) faults/sec number in my mongostat output, though very low lock %:
My Mongo servers are running on m1.large instances on the Amazon cloud, so they each have 7.5GB of RAM:
root:~# free -tm
             total       used       free     shared    buffers     cached
Mem:          7700       7654         45          0          0       6848
Clearly, I do not have enough memory for all the caching mongo wants to do (which, btw, results in huge CPU usage %, due to disk IO).
I found this document that suggests that in my scenario (high fault, low lock %), I need to "scale out reads" and "more disk IOPS."
I'm looking for advice on how to best achieve this. Namely, there are LOTS of different potential queries executed by my node.js application, and I'm not sure where the bottleneck is happening. Of course, I've tried
db.setProfilingLevel(1);
However, this doesn't help me that much, because the outputted stats just show me slow queries, but I'm having a hard time translating that information into which queries are causing the page faults...
As you can see, this is resulting in a HUGE (nearly 100%) CPU wait time on my PRIMARY mongo server, though the 2x SECONDARY servers are unaffected...
Here's what the Mongo docs have to say about page faults:
Page faults represent the number of times that MongoDB requires data not located in physical memory, and must read from virtual memory. To check for page faults, see the extra_info.page_faults value in the serverStatus command. This data is only available on Linux systems.
Alone, page faults are minor and complete quickly; however, in aggregate, large numbers of page fault typically indicate that MongoDB is reading too much data from disk and can indicate a number of underlying causes and recommendations. In many situations, MongoDB’s read locks will “yield” after a page fault to allow other processes to read and avoid blocking while waiting for the next page to read into memory. This approach improves concurrency, and in high volume systems this also improves overall throughput.
If possible, increasing the amount of RAM accessible to MongoDB may help reduce the number of page faults. If this is not possible, you may want to consider deploying a shard cluster and/or adding one or more shards to your deployment to distribute load among mongod instances.
So, I tried the recommended command, which is terribly unhelpful:
PRIMARY> db.serverStatus().extra_info
{
    "note" : "fields vary by platform",
    "heap_usage_bytes" : 36265008,
    "page_faults" : 4536924
}
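The `page_faults` value is a running total, so a single reading says little. A minimal sketch of turning two readings into a rate follows; the helper name and the second reading are hypothetical, and only the first counter value comes from the output above.

```python
def faults_per_second(prev_faults, cur_faults, elapsed_s):
    """Average page-fault rate between two cumulative counter readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (cur_faults - prev_faults) / elapsed_s

# First value is the reading shown above; the second is a made-up reading
# taken 10 seconds later, purely for illustration.
rate = faults_per_second(4536924, 4538924, 10)
print(rate)
```

Taking such readings before and after running one suspect query at a time on a quiet server is one possible way to narrow down which queries fault the most, though it does not attribute faults to queries directly.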
Of course, I could increase the server size (more RAM), but that is expensive and seems to be overkill. I should implement sharding, but I'm actually unsure what collections need sharding! Thus, I need a way to isolate where the faults are happening (what specific commands are causing faults).
Thanks for the help.
We don't really know what your data/indexes look like.
Still, an important rule of MongoDB optimization:
Make sure your indexes fit in RAM. http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-MakesureyourindexescanfitinRAM.
Consider that the smaller your documents are, the higher your key/document ratio will be, and the higher your RAM/Disksize ratio will need to be.
If you can adjust your schema a bit to lump some data together, and reduce the number of keys you need, that might help.
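To make the "indexes fit in RAM" rule concrete, here is a small sketch that compares an index size with physical memory. It assumes you pass in a dict shaped like the output of `db.stats()` (which reports an `indexSize` field in bytes); the function name and the sample numbers are hypothetical.

```python
def index_ram_report(db_stats, ram_bytes):
    """Compare total index size against physical RAM (both in bytes)."""
    index_bytes = db_stats["indexSize"]
    return {
        "indexes_fit": index_bytes <= ram_bytes,
        "index_fraction_of_ram": index_bytes / ram_bytes,
        # RAM left over for documents and everything else on the box
        "working_room_bytes": max(ram_bytes - index_bytes, 0),
    }

MIB = 1024 * 1024
# Hypothetical numbers: 7700 MiB of RAM (as in the free output) and 3850 MiB of indexes.
report = index_ram_report({"indexSize": 3850 * MIB}, 7700 * MIB)
print(report)
```

Even when the indexes fit, the remaining room has to cover the documents being read, so a fraction close to 1 still points to trouble.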
We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and as multiple processes running single threads; the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it is a cache issue. The evidence is as follows:
CPU usage continues to climb from 50 to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk i/o as the bottleneck.
The commit charge is only about 4 GB, so we should be well below the threshold at which paging would become an issue.
The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the windows kernel goes from taking about 6% of the cpu time with 12 threads to 80-90% of CPU time when using 24 threads. A vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:
Instructions: 5.56 (times more)
Clock cycles: 10.39
Memory operations: 4.58
Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
Avg cache miss latency: 8.92
Total cache miss latency: 6.69
Mem bank load conflict: 11.32
Mem bank store conflict: 2.73
Mem forwarded: 7.42
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?
I'm not sure I understand the issues well enough to offer you a solution, but from what you've explained I may have some alternative viewpoints that could help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious, but standard deviation also has its place in that it can help you decide whether a large maximum value should be researched or not. Standard deviation only needs to be calculated when it needs to be read out: until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding synchronizing instruction. Make sure you quantify the timing overhead, if only to be able to rule it out as insignificant.
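A minimal sketch of such a statistics structure, written in Python for brevity rather than C. The class and method names are made up for this example, and it computes the population standard deviation; the components (n, sum x, sum x^2) are the ones named above.

```python
import math

class RunningStats:
    """Accumulates timing samples; stddev is derived only when read out."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, x):
        # Cheap per-sample update: two additions, one multiply, two compares.
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def mean(self):
        return self.sum_x / self.n

    def stddev(self):
        # Population variance: E[x^2] - (E[x])^2, clamped against rounding error.
        variance = self.sum_x2 / self.n - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))

stats = RunningStats()
for sample in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(sample)
print(stats.mean(), stats.stddev(), stats.minimum, stats.maximum)
```

One caveat with this formula: when the samples are large and nearly equal, subtracting two big numbers loses precision, so for cycle counts it may help to subtract a fixed offset from each sample first.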
When I program multi-threaded I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things which require unnecessary memory access. Avoiding unnecessary memory access usually means as much inline code as possible and as little OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate, so I try to keep calls to it to a minimum. In the same manner, but usually with a smaller performance impact, I try to avoid calling application functions: if they must be called I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several I add them together into one and then subdivide that one big allocation into smaller ones. This will help later allocations in that they will need to loop through fewer blocks before finding the block returned. I only block initialize when absolutely necessary.
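The "one big allocation, subdivided" idea can be sketched as a simple bump allocator. This is a Python stand-in for what would be pointer arithmetic in C: the `Arena` class is hypothetical, and unlike an uninitialized C allocation, `bytearray` zero-fills the buffer.

```python
class Arena:
    """Hands out consecutive slices of one pre-allocated buffer."""

    def __init__(self, size):
        self._buffer = bytearray(size)        # the single big allocation
        self._view = memoryview(self._buffer)
        self._offset = 0                      # next free byte

    def carve(self, nbytes):
        """Return a writable view of the next nbytes; no new allocation of storage."""
        end = self._offset + nbytes
        if end > len(self._buffer):
            raise MemoryError("arena exhausted")
        block = self._view[self._offset:end]
        self._offset = end
        return block

arena = Arena(16)
header = arena.carve(4)    # first 4 bytes of the big block
payload = arena.carve(12)  # the remaining 12 bytes
header[:] = b"ABCD"        # writes land in the shared buffer
```

The blocks are never freed individually; the whole arena is released at once, which is what keeps the allocator's bookkeeping short.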
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcpy/memset, which are usually both optimized for larger blocks and not especially limited in size.
I've only recently begun using spinlocks, but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections; they are fast, but local spinlocks are faster. Because critical sections perform additional processing, they keep the application from doing its own work during that time. This is the implementation:
inline void spinlock_init (SPINLOCK *slp)
{
    slp->lock_part=0;
}
inline char spinlock_failed (SPINLOCK *slp)
{
    return (char) __xchg (&slp->lock_part,1);
}
Or more elaborate (but not overly so):
inline char spinlock_failed (SPINLOCK *slp)
{
    if (__xchg (&slp->lock_part,1)==1) return 1;
    slp->count_part=1;
    return 0;
}
And to release
inline void spinlock_leave (SPINLOCK *slp)
{
    slp->lock_part=0;
}
Or
inline void spinlock_leave (SPINLOCK *slp)
{
    if (slp->count_part==0) __breakpoint ();
    if (--slp->count_part==0) slp->lock_part=0;
}
The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
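For readers who want to see how the failed/leave pair is meant to be used, here is a rough Python analogue. It is only an illustration under stated assumptions: `threading.Lock.acquire(blocking=False)` stands in for the atomic `__xchg`, an exception replaces `__breakpoint`, and the class and helper names are invented for this sketch.

```python
import threading
import time

class SpinLock:
    def __init__(self):
        self._flag = threading.Lock()  # plays the role of lock_part
        self.count = 0                 # plays the role of count_part

    def failed(self):
        """Return True if the lock was already taken (mirrors spinlock_failed)."""
        if not self._flag.acquire(blocking=False):
            return True
        self.count = 1
        return False

    def leave(self):
        """Release the lock (mirrors the counting spinlock_leave)."""
        if self.count == 0:
            raise RuntimeError("leave() called without holding the lock")
        self.count -= 1
        if self.count == 0:
            self._flag.release()

def run_workers(n_threads=4, n_increments=1000):
    """Increment a shared counter from several threads under the spinlock."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(n_increments):
            while lock.failed():
                time.sleep(0)  # yield so the current holder can make progress
            total[0] += 1      # critical section
            lock.leave()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The caller spins on `failed()` until it returns false, does its short piece of work, and calls `leave()`; the critical section should stay as short as possible, since every waiting thread burns CPU while it spins.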
I'm also a big fan of IOCPs for their efficiency in handling IO events and threads but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.
To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.
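The arithmetic behind point 1 can be checked with a tiny sketch; the helper name is made up, and the per-core figures are the idealized case of 12 saturated and 12 idle cores.

```python
def total_cpu_percent(per_core_percent):
    """Machine-wide CPU usage as the average of the per-core figures."""
    return sum(per_core_percent) / len(per_core_percent)

# 12 cores fully busy, 12 cores idle on a 24-core machine.
usage = total_cpu_percent([100.0] * 12 + [0.0] * 12)
print(usage)
```

The same average would reach 100% with 24 threads that merely spin on a lock, which is why a high reading by itself does not rule out a synchronization problem.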