Performance Problems with Node.js (Mac OSX) - Processes - Linux

I hope to find a little help here.
We are using Node, MongoDB, supertest, mocha and spawn in our test env.
We've tried to improve our mocha test env to run tests in parallel, because our test cases now take almost 5 minutes to run! (600 cases)
We spawn, for example, 4 processes and run the tests in parallel. This works very well, but only on Linux.
On my Mac the tests still run very slowly; it seems like the different processes are not really running in parallel.
Test times:
Mac OSX:
- running 9 tests in parallel: 37s
- running 9 tests not in parallel: 41s
Linux:
- running 9 tests in parallel: 16s
- running 9 tests not in parallel: 25s
Mac OSX (early 2011):
10.9.2
16GB RAM
Core i7 2.2GHz
physical processors: 1
cores: 4
threads: 8
Linux (Dell):
Ubuntu
8GB RAM
Core i5-2520M 2.5GHz
physical processors: 1
cores: 2
threads: 4
My questions are:
Are there any tips for improving process performance on Mac OSX
(other than ulimit and launchctl (maxfiles))?
Why do the tests run so much faster on Linux?
Thanks,
Kate

I copied my comment here, as it describes most of the answer.
Given the specs of your machines, I doubt the extra 8 gigs of RAM and the additional processing power affect things much, especially given Node's single-process model and the fact that you're only launching 4 processes. Nor do I think the Linux machine's 8 gigs, 2.5GHz, and 4 threads are a bottleneck at all. As such, I would expect the time the processor spends running your tests to be roughly equivalent on both machines. I'd be more interested in your disk I/O performance, given that you're running Mongo. Your disk I/O has the most potential to slow things down. What are the specs there?
Your specs:
Mac OSX: Toshiba, 5400RPM, 8MB cache
Linux: Seagate, 7200RPM, 16MB cache
Your Linux drive is significantly faster (1.33x) than your Mac drive, and it has a significantly larger cache. For database-backed applications, hard drive performance is crucial: most of the time spent in your application will be waiting for I/O, particularly with Node's single-process way of doing work. I would suggest this as the culprit for 90% of the performance difference, and chalk the rest up to the fact that Linux probably has less going on in the background, which further exacerbates your Mac's disk performance issues.
Furthermore, launching multiple Node processes isn't likely to help. Since processor time isn't your bottleneck, launching too many processes will just slow your disk down. Further evidence that the disk is the problem: the performance gain from multiple processes on Linux is proportionally better than on the Mac. One process nearly maxes out the performance of your 5400RPM drive, so you don't see a significant increase from running multiple processes, whereas the multiple Linux Node processes use the disk to its full potential. You would likely see diminishing returns on Linux too if you launched many more processes, unless of course you upgraded to an SSD.
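If you'd rather measure than infer from the spec sheets, timing dd on both boxes gives a quick read, or you could run a crude sequential-write sketch like the one below on both machines (POSIX; the 256MB size and /tmp path are arbitrary choices of mine, not anything from your setup):

```c
/* Crude sequential-write benchmark sketch: times writing 256MB and
 * reports MB/s. Compile with: cc io_bench.c -o io_bench
 * Remember to delete /tmp/io_test afterwards. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (1 << 20)   /* 1MB per write() */
#define CHUNKS 256        /* 256MB total */

int main(void)
{
    static char buf[CHUNK];
    memset(buf, 'x', sizeof(buf));

    int fd = open("/tmp/io_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < CHUNKS; i++)
        if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); return 1; }
    fsync(fd);            /* make sure the data actually hit the disk */
    gettimeofday(&t1, NULL);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d MB in %.2fs (%.1f MB/s)\n", CHUNKS, secs, CHUNKS / secs);
    return 0;
}
```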

Related

How do I force MPI to not run on all cores if I have more threads than cores?

Context: I'm debugging a simulation code that requires that the number of MPI threads does not change when continuing the simulation from a restart file. This code was running on a large cluster, but I'm debugging it on a smaller local machine so that I don't have to wait to submit the job to a queue. The code requires 72 threads, which is more than the number of cores on the local machine. This is not a problem in itself - I can run with more threads than cores, and just take the performance hit, which is not a major issue when debugging.
The Problem: I want to leave some cores free for other tasks and other users. For instance, if my small local computer has 48 cores, I want to run my 72 threads on, say, 36 cores, and leave 12 cores free. I want to debug my large code locally without completely taking over the machine.
Assuming I'm willing to deal with the memory and performance issues of running more threads than cores, how do I actually do this? Do I have to get into the back end of the scheduler somehow? Does it depend on whether I'm using MPICH or Open MPI, etc.?
I'm essentially looking for something like mpirun -np 72 --cpus-per-proc 0.5, if that were possible.
taskset -c 0-35 mpiexec -np 72 ./a.out should do the trick if the processes are all to be launched on the same host, and it should work with basically all MPI distributions (Open MPI, MPICH, Intel MPI, etc.). Also, make sure to disable any process binding by the MPI library, i.e. pass --bind-to none for Open MPI 1.8+, -bind-to none for MPICH with Hydra, or -genv I_MPI_PIN=0 for Intel MPI.
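To verify the pinning actually took effect, a minimal sketch like this (assuming glibc's sched_getcpu is available; the file name is arbitrary) prints which core each rank is currently running on - every reported number should fall within 0-35:

```c
/* check_affinity.c - report the core each MPI rank runs on.
 * Build and run (cores 0-35, 72 ranks, as in the answer above):
 *   mpicc check_affinity.c -o check_affinity
 *   taskset -c 0-35 mpiexec -np 72 ./check_affinity */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* With binding disabled, ranks may migrate, but only within 0-35. */
    printf("rank %d on cpu %d\n", rank, sched_getcpu());
    MPI_Finalize();
    return 0;
}
```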

What to expect in terms of performance from my Spark Streaming Application in local mode?

I realize this might be a very broad question, but this is my issue: I developed a Spark application in Java which uses an algorithm to analyse several JSON messages (1kB each) received through a socket connection at one-second intervals.
I'm only using 6 map methods, but the functions inside have several loops that can run up to 1000 times each (there are even cases where I have a loop inside a loop, which leads to them being run 1000*1000 times in total).
I'm running the application in local mode, that is, with just one node (I assume) to perform the Spark tasks and jobs.
The problem here is that it takes up to 7 minutes to process one of these messages, which is an insane amount of time and causes great scheduling delays.
Is this normal given the complexity of my algorithm, plus running in local mode, plus possibly some memory leakage?
If so, how can I proceed to improve the throughput?
Don't know if it helps, but here are some specifications of my computer:
Processor: Intel Core i5, 2.60GHz
RAM: 3.87GB usable memory
64 bit operating system
Thank you so much.

Why does the performance improvement of a CPU-intensive task differ between Windows and Linux when using multiple processes?

Here is my situation:
My company needs to run tests on tons of test samples. But if we start a single process on a Windows PC, the test can last for hours, even days, so we tried splitting the test set and starting a process to test each one of the slices on a multi-core Linux server.
We expected a linear performance improvement from the server solution, but the truth is we could only observe a 2-3x improvement when the test task was finished by 10-20 processes.
I tried several means to locate the problem:
disable hyper-threading;
use the max-performance power policy;
use taskset to pin each process to a different core;
but no luck - the problem remains.
Why does this happen? What is the root cause: our code, the OS, or the hardware?
Here is the info for my PC and server:
PC: OS: Win10; CPU: i5-4570, 2 physical cores; Mem: 16GB
Server: OS: RedHat 6.5; CPU: E5-2630 v3, 2 physical CPUs; Mem: 32GB
Edit:
About the CPU: the server has 2 processors, and each of them has 8 physical cores. Check this link for more information.
About my test: it's handwriting-recognition related (that's why it's a CPU-sensitive task).
About I/O: the performance check points do not involve much I/O, if logging doesn't count.
We expected a linear performance improvement from the server solution, but the truth is we could only observe a 2-3x improvement when the test task was finished by 10-20 processes.
This seems very logical considering there are only 2 cores on the system. Starting 10-20 processes will only add some overhead due to task switching.
Also, I/O could be a bottleneck here too, if multiple processes are reading from disk at the same time.
Ideally, the number of running threads should not exceed 2x the number of cores.
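As a rough sketch of that sizing rule (POSIX-only; run_slice is a hypothetical stand-in for one slice of the real test set), the worker count can be derived from the core count instead of being fixed at 10-20:

```c
/* Fork one worker per online core rather than a fixed 10-20. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void run_slice(int slice)
{
    /* Placeholder for testing one slice of the sample set. */
    printf("worker %d handling slice %d\n", (int)getpid(), slice);
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncores < 1)
        ncores = 1;

    for (long i = 0; i < ncores; i++) {
        pid_t pid = fork();
        if (pid == 0) {          /* child: process one slice, then exit */
            run_slice((int)i);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)       /* parent: wait for all workers */
        ;
    return 0;
}
```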

Matlabpool number of threads vs cores

I have a laptop running Ubuntu on an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz. According to the Intel website for the above processor (located here), this processor has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer, thanks to a quick lesson from @RodyOldenhuis in the comments.
Hyper-threading means a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously; the idea goes more like this: if one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, therefore with one Matlab thread (i.e. one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a Core i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower, due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by other external factors such as what other programs you might be running simultaneously to Matlab too.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.

Linux per-process resource limits - a deep Red Hat Mystery

I have my own multithreaded C program which scales in speed smoothly with the number of CPU cores. I can run it with 1, 2, 3, etc. threads and get linear speedup, up to about 5.5x on a 6-core CPU on an Ubuntu Linux box.
I had an opportunity to run the program on a very high-end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads.
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads, and they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads run about 1.7x faster than 1, but 3, 4, 8, 10, and 16 threads all run at just 1.9x! I can see all the threads are running (not stalled or sleeping); they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
- My program does not access the disk or network. It's CPU-limited. Its speed scales linearly on a single-CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads; 6 threads is effectively a 6x speedup.
- My program never runs faster than a 2x speedup on this 16-core Sunfire Xeon box, for any number of threads from 2-16.
- Running 16 copies of my program single-threaded works perfectly: all 16 run at once at full speed.
- top shows 1600% of CPUs allocated, and /proc/cpuinfo shows all 16 cores running at the full 2.9GHz speed (not the low-frequency idle speed of 1.6GHz).
- There's 48GB of RAM free; it is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!
My initial guess would be shared-memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Red Hat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64-bit SMP kernels in both tests.
It's probably not the case that the motherboard peaks at utilizing 2 CPUs; after all, you have another machine with multiple cores that has provided better performance. Do you have hyper-threading turned on on the new machine? (And how does that compare to the old machine?) You're not, by chance, running in a virtualized environment?
Overall, your evidence points to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the software. Test one by changing the other, and you'll narrow down your possibilities quickly.
Do some research on rlimit - it's quite possible the shell/user account you're running in has some RH-default or admin-set resource limits in place.
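A quick way to check is `ulimit -a` in the shell the program starts from; the same limits can also be dumped from inside the process via POSIX getrlimit, roughly like this sketch (which limits to print is my arbitrary pick):

```c
/* Print a few per-process resource limits via getrlimit(). */
#include <stdio.h>
#include <sys/resource.h>

static void show(const char *name, int resource)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) == 0)
        printf("%-14s soft=%llu hard=%llu\n", name,
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
}

int main(void)
{
    show("RLIMIT_CPU", RLIMIT_CPU);     /* CPU seconds */
    show("RLIMIT_AS", RLIMIT_AS);       /* address space */
    show("RLIMIT_NPROC", RLIMIT_NPROC); /* processes/threads per user */
    return 0;
}
```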
When you see this kind of odd scaling behaviour, especially when problems are seen with multiple threads but not with multiple processes, one thing to start looking at is the impact of lock contention and other synchronisation primitives, which can cause threads running on different processors to have to wait for each other, potentially forcing multiple cores to flush their caches to main memory.
This means the memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, in the single-CPU case, locking operations likely don't need to hit main memory at all - everything is likely handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
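To make the contention effect concrete, here's an illustrative sketch (not the OP's program) where every thread fights over one mutex; timing it with 1 thread versus 16 will typically show wall-clock time getting worse, not better, as threads are added - the same flatline-or-worse pattern described in the question:

```c
/* Lock-contention demo. Compile: cc contention.c -pthread
 * Compare: time ./a.out 1   versus   time ./a.out 16 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 10000000L

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_count = 0;

static void *contended(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);   /* every increment bounces the lock */
        shared_count++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 4;
    if (n < 1)
        n = 1;
    pthread_t *tids = malloc(sizeof(*tids) * (size_t)n);

    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, contended, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);

    printf("threads=%d count=%ld\n", n, shared_count);
    free(tids);
    return 0;
}
```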
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see if the scaling up to 4 threads improves when the process affinity is set to lock it to a single physical CPU. Even better would be to profile the application itself to see where it is spending its time. As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html
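For that affinity experiment, a Linux-specific sketch would pin the whole process before any worker threads are created (the assumption that cores 0-3 share one physical package is mine; check /proc/cpuinfo for the real topology):

```c
/* Pin the process to cores 0-3 so all threads stay on one physical CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; cpu++)   /* cores 0-3: adjust to topology */
        CPU_SET(cpu, &set);

    /* pid 0 = calling process; threads created later inherit the mask. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to cores 0-3; now create threads and re-measure scaling\n");
    /* ... spawn worker threads and time the run here ... */
    return 0;
}
```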
