How many concurrently-running threads can a WOW64 process have? - multithreading

Let me start by clarifying two aspects:
(1) by concurrently-running, I mean executing on the hardware at any one point in time, rather than being in some other OS state such as ready or waiting; and
(2) assume that the hardware has a sufficiently large number of hardware threads (aka logical processors), so that that's not the limiting factor. E.g. 4096 hardware threads. (Obviously I don't have such a machine, yet.)
I've read that 32-bit Windows only supports 32 concurrently-running threads, and that a 64-bit process (on 64-bit Windows) can have 64 concurrently-running threads per processor group and up to 20 processor groups (when using multiple groups) on Windows 10.
But I've been unable to find anything relevant about WOW64. I've found lots of information on the maximum number of threads that can be created, but nothing on concurrently-running threads.
So, how many concurrently-running threads can a WOW64 process have (on Windows 10)?
Is it?
(a) 32, for compatibility with 32-bit Windows; or
(b) 64, because processor groups aren't accessible by 32-bit code, so all threads run in the default processor group; or
(c) a larger number, because a WOW64 process is partly 64-bit code, and that (Microsoft) code can use multiple processor groups. (I don't think that this is likely, but include it as another possibility.)
Edit.
This is not a duplicate of any of the following questions on Stack Overflow, because their answers focus mainly on thread maximums due to address space limits.
What is the maximum number of threads a process can have in windows [closed]
What's the maximum number of threads in Windows Server 2003?
What's the maximum number of threads possible for a threads in Windows 8.1?
The maximum number of thread [duplicate]
Similarly, the following two oft-cited articles also are concerned with address space limits.
Pushing the Limits of Windows: Processes and Threads
Does Windows have a limit of 2000 threads per process?

Related

How to execute an application using a specific core or cores?

I'm writing an application that needs to be executed on a specific core of a processor.
For Example:
If we have 4 cores and i want to execute code on 2nd core only. I need help how to do this.
I'm writing an application that needs to be executed on a specific core of a processor.
This is extremely improbable on most platforms (since most multi-core processors are homogeneous). You really need to explain, motivate and justify such an unusual requirement.
You can't do that in general. And if you could do that, how exactly you should proceed is operating system specific, and platform specific. Most multi-core processors are homogeneous (all the cores are the same), some are not.
On Linux/x86-64, the kernel scheduler sees all cores as the same, and, since scheduling is preemptive, will move a task (e.g. a thread of a multi-threaded process) from one core to another at arbitrary moments.
On some processors, periodically moving a task from one core to another (e.g. dozens of times per second) is actually recommended (and done automagically by the kernel, or the firmware - e.g. SMM) to avoid overheating of that core. Read about dark silicon.
Some unusual hardware (e.g. ARM big.LITTLE) has two sets of different cores (e.g. 2 high-end ARM cores with 2 low-end ones, all sharing the same memory). If your platform is such, please state that in your question, and ask how to achieve processor affinity on your specific platform. Very likely your OS has appropriate system calls for that purpose.
Some high-end motherboards are multi-socket. In such a case, a RAM module is closer to one socket (in timing) than to another. You then care about non-uniform memory access.
So read more about processor affinity and non-uniform memory access. Most OSes have some support for both. On Linux, see pthread_setaffinity_np(3), sched_setaffinity(2), numa(7) etc...
To learn more about OSes, read Operating Systems: Three Easy Pieces.
Notice that by pinning a thread to a fixed core, you might lower the performance of your program, since processor affinity is rarely useful.
The programmer can prescribe his/her own affinities (hard affinities), but the rule of thumb is: use the default scheduler unless there is a good reason not to.
Here is a C/C++ function to assign a thread to a certain core
Kernel scheduler API (glibc wrapper)
#define _GNU_SOURCE
#include <sched.h>
int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
sets the affinity mask of the thread whose ID is 'pid' to *mask ('pid' 0 means the calling thread)
'cpusetsize' is the size in bytes of the data pointed to by mask, normally sizeof(cpu_set_t)
To query affinity of a running process:
[~]$ taskset -p 3935
pid 3935's current affinity mask: f
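As a minimal sketch of the affinity call discussed above, assuming Linux with glibc: the helper names pin_to_cpu, allowed_cpu_count and first_allowed_cpu are made up for this illustration and are not part of any library. It pins the calling thread to one logical CPU and reads the mask back, which is roughly what taskset does from the shell.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for cpu_set_t, CPU_* macros and sched_setaffinity */
#endif
#include <sched.h>

/* Hypothetical helper: pin the calling thread to one logical CPU.
 * Returns 0 on success, -1 if the index is out of range or the call fails. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);        /* start from an empty mask */
    CPU_SET(cpu, &set);    /* allow exactly one CPU */
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: number of CPUs the calling thread may run on,
 * or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Hypothetical helper: lowest-numbered CPU in the current mask, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    int i;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &set))
            return i;
    return -1;
}
```

Calling pin_to_cpu(first_allowed_cpu()) is safer than hard-coding core 1, because a container or an earlier taskset may already have restricted which CPUs the process is allowed to use.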

How many threads does my machine really have?

I have a Mac Pro with 12 cores and 24 threads (2.7 GHz 12-Core Intel Xeon E5), but when I go into the terminal and type the "top" command, it says there are 1621 threads. How can this even be possible? Is the word "thread" being used differently by top? I thought there were only 24. It seems there are far more because, aside from what I've said about top, I can compile with several dozen threads. When I type "make -j60", for example, the computer has no issue with launching 60 different compile processes, each working independently, compiling its own object file (or at least that's how it appears).
Thanks in advance,
-AA
The CPU can execute 24 threads at the same time, two in each of its 12 cores. There are currently 1,621 threads created by the software on your system. The vast majority of them currently have nothing to do. As those threads become ready to run, the system's scheduler will cause some of them to be executed, subject to the 24 at a time maximum the CPU is capable of.
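To make the distinction concrete, here is a small C sketch, assuming a system where sysconf supports _SC_NPROCESSORS_ONLN (Linux and macOS do, although it is not required by POSIX). The function names are invented for illustration. It separates the number of threads that exist from the number that can execute at the same instant.

```c
#include <unistd.h>

/* Logical processors currently online: the upper bound on how many
 * software threads can execute at the same instant. */
long hardware_thread_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Of `runnable` threads that are ready to run, how many can actually be
 * executing simultaneously on `hw_threads` logical processors. */
long running_at_once(long runnable, long hw_threads)
{
    return runnable < hw_threads ? runnable : hw_threads;
}
```

With the numbers from the question, running_at_once(1621, 24) is 24: the other threads stay in the ready or waiting state until the scheduler gives them a turn.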

Curious about how to specify the number of core for MPI in order to get the fastest scientific computation

I have been running several scientific program package in conjunction with MPI by using the following command
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the value of N and M depend on the physical CPU cores of my machine. For example, my machine has the specification like this
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively, and found that the latter clearly gave the fastest computation.
I was wondering why setting N and M to 12 and 2 gives higher performance than the first case?
there is no absolute rule on how to run MPI+OpenMP application.
the only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today, most systems are NUMA)
then just experiment.
some apps run best in flat MPI (e.g. one thread per task), while some other work best with one MPI task per socket, and all available cores for OpenMP.
last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
for example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up doing time sharing and performance will be horrible.
or if you run with 12 OpenMP threads, and MPI task was bound to 12 cores, make sure the 12 cores are on the same socket (and not 6 on each socket)
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
indeed there is no absolute rule on how to run MPI+OpenMP application.
I agree with all Gilles said.
so I want to talk about the CPU in your case.
in the specification you give, it shows the system enables hyper-thread.
but this does not always help. your computer has 12 physical cores in fact.
so I advise you to try some combinations where M * N is between 12 and 24,
like 12*1, 6*2, 6*3
which one is best depends on your application.
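The arithmetic behind these suggestions can be sketched in C. This is only an illustration under the assumption that the lscpu output above is accurate (2 sockets, 6 cores per socket, 2 threads per core). The struct and function names are hypothetical.

```c
/* CPU topology as reported by lscpu. */
struct topology {
    int sockets;
    int cores_per_socket;
    int threads_per_core;
};

/* Physical cores in the machine. */
int physical_cores(struct topology t)
{
    return t.sockets * t.cores_per_socket;
}

/* Logical CPUs (hardware threads) in the machine. */
int logical_cpus(struct topology t)
{
    return physical_cores(t) * t.threads_per_core;
}

/* Classify a layout of n_ranks MPI tasks with m_threads OpenMP threads each:
 *  -1  fewer software threads than physical cores (cores left idle)
 *   0  between the physical and logical counts (the range suggested above)
 *   1  more software threads than logical CPUs (time sharing) */
int classify_layout(struct topology t, int n_ranks, int m_threads)
{
    int total = n_ranks * m_threads;

    if (total < physical_cores(t))
        return -1;
    if (total > logical_cpus(t))
        return 1;
    return 0;
}

/* 1 if the OpenMP threads of one MPI task fit on the physical cores of a
 * single socket, so the task need not span NUMA nodes; 0 otherwise. */
int fits_in_one_socket(struct topology t, int m_threads)
{
    return m_threads <= t.cores_per_socket;
}
```

For this machine physical_cores gives 12 and logical_cpus gives 24, so 12*1, 6*2 and 6*3 all classify as 0. The check says nothing about which layout is fastest, which still has to be measured.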

Matlabpool number of threads vs core

I have a laptop running Ubuntu on Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz. According to Intel website for the above processor (located here), this processor has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer thanks to a quick lesson from @RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, therefore with one Matlab thread (ie one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a core-i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by other external factors such as what other programs you might be running simultaneously to Matlab too.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.

Linux per-process resource limits - a deep Red Hat Mystery

I have my own multithreaded C program which scales in speed smoothly with the number of CPU cores.. I can run it with 1, 2, 3, etc threads and get linear speedup.. up to about 5.5x speed on a 6-core CPU on a Ubuntu Linux box.
I had an opportunity to run the program on a very high end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads..
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads, they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads runs about 1.7x faster than 1, but 3, 4, 8, 10, 16 threads all run at just net 1.9x! I can see all the threads are running (not stalled or sleeping), they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
My program does not access the disk or network. It's CPU limited.
Its speed scales linearly on a single CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads. 6 threads is effectively 6x speedup.
My program never runs faster than 2x speedup on this 16 core Sunfire Xeon box, for any number of threads from 2-16.
Running 16 copies of my program single threaded runs perfectly, all 16 running at once at full speed.
top shows 1600% of CPUs allocated. /proc/cpuinfo shows all 16 cores running at full 2.9GHz speed (not low frequency idle speed of 1.6GHz)
There's 48GB of RAM free, it is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!
My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Redhat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64 bit SMP kernels across both tests.
It's probably not possible that the motherboard would peak at utilizing 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyperthreading turned on with the new machine? (and how does that answer compare to the old machine?). You're not, by chance, running in a virtualized environment?
Overall, your evidence is pointing to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the software. Test one by changing the other, and you'll narrow down your possibilities quickly.
Do some research on rlimit - it's quite possible the shell/user acct you're running in has some RH-default or admin-set resource limits in place.
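One way to check this from inside the program is a short C sketch like the following, assuming Linux/glibc. The helper names are invented for this example. It prints the soft and hard values of a few limits via getrlimit(2). Note that these limits cap totals (CPU seconds, number of processes/threads for the user, address space, stack size) and do not slow down individual threads, so they would show up as failures rather than as poor scaling.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print one value of a limit, using "unlimited" for RLIM_INFINITY. */
static void print_value(const char *label, rlim_t v)
{
    if (v == RLIM_INFINITY)
        printf(" %s=unlimited", label);
    else
        printf(" %s=%llu", label, (unsigned long long)v);
}

/* Hypothetical helper: print the soft and hard value of one resource limit.
 * Returns 0 on success, -1 if getrlimit fails. */
int print_limit(const char *name, int resource)
{
    struct rlimit lim;

    if (getrlimit(resource, &lim) != 0)
        return -1;
    printf("%-13s", name);
    print_value("soft", lim.rlim_cur);
    print_value("hard", lim.rlim_max);
    printf("\n");
    return 0;
}

/* Hypothetical helper: print the limits most likely to matter for a
 * multithreaded, CPU-bound program. Returns 0 if every query succeeded. */
int print_thread_related_limits(void)
{
    int rc = 0;

    rc |= print_limit("RLIMIT_CPU", RLIMIT_CPU);      /* total CPU seconds */
    rc |= print_limit("RLIMIT_NPROC", RLIMIT_NPROC);  /* processes/threads per user */
    rc |= print_limit("RLIMIT_AS", RLIMIT_AS);        /* address space in bytes */
    rc |= print_limit("RLIMIT_STACK", RLIMIT_STACK);  /* stack size in bytes */
    return rc;
}
```

Running ulimit -a in the same shell shows the same information without writing any code.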
When you see this kind of odd scaling behaviour, especially if problems are seen with multiple threads, but not multiple processes, one thing to start looking at is the impacts of lock contention and other synchronisation primitives, which can cause threads running on different processors to have to wait for each other, potentially forcing multiple cores to flush their cache to main memory.
This means memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single CPU case likely isn't needing to hit main memory for locking operations at all - everything is likely being handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see if the scaling up to 4 threads improves if the process affinity is set to lock it to a single physical CPU. Even better though would be to profile the application itself to see where it is spending its time. As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html

Resources