Processor Utilization by threads spawned by a program - multithreading

I have a java program which spawns multiple threads say, 10-20 threads. This program is scheduled to be run on a machine that has 32 processors.
I am keen to know if all the processors' power would be utilized by these threads.
Solaris is the environment; does that make any difference?

A good profiler should tell you this. If the threads are compute bound, then yes you will use as many cores as you have threads, if you are blocked doing I/O or on contention it will be less than that.
Given that you aren't on Windows, the following doesn't apply, but a decent profiler should still be able to provide a measurement of CPU cycles burned by your process for a give period of time...
If you are on Windows a good free tool to use is the Windows Performance Toolkit (xperf) which is now part of the platform sdk. It will show you the processor cycles burned for each thread or process for a period of time (as opposed to just elapsed times).

Related

Purpose of multiprocessors and multi-core processor

I do want to clarify things in my head and model concrete knowledge. dual-core with one processor system, only two threads within the one process can be executed concurrently by each core. Uni-core with two processor system, two different process can be executed by each CPU.
So can we say, each processor can execute processes concurrently. While multi-core processor execute threads within the process concurrently?
I think you have a fundamental misunderstanding of what a process and thread are and how they relate to the hardware itself.
A CPU core can only execute 1 machine level instruction per clock cycle (so essentially, just 1 assembly instruction). CPU's are typically measured by the number of clock cycles they go through in a second. So a 2.5 GHz core can execute 2.5 billion instructions per second.
The OS (the operating system, like Windows, Linux, macOS, Android, iOS, etc.) is responsible for launching programs and giving them access to the hardware resources. Each program can be considered a "process".
Each process can launch multiple threads.
To ensure that multiple processes can share the same hardware resources, the idea of pre-emptive computing came about over 40 years ago.
In a nut-shell, pre-emptive computing, or time-slicing, is a function of the OS. It basically gives a few milliseconds to each thread that is running, regardless of which process that thread is a part of, and keeps the "context" of each thread so that the state of each thread can be handled appropriately when it's time for that thread to run; that's also known as a context switch.
A dual, quad, or even 128 core CPU does not change that, nor will the amount of CPU's in the system (e.g. 4 CPU's each with 128 cores). Each core can only execute 1 instruction per clock cycle.
What changes is how many instructions can be run in true parallel. If my CPU has 16 cores, then that means it can execute 16 instructions per clock cycle, and thus run 16 separate threads of execution without any context switching being necessary (though it does still happen, but that's a different issue).
This doesn't cover hyper-threading, in which 1 core can execute 2 instructions per cycle, essentially doubling your CPU count, and doesn't cover the idea of cache-misses or other low-level ideas in which extra cycles could be spent on a thread, but it covers the general idea of CPU scheduling.

cpu time jumps a lot in virtual machine

I have a C++ program running with 20 threads (boost threads) on one of the RHEL6.5 systems virtualized in dell server. The result is deterministic, but the cpu time and wall time varies a lot in different runs. Sometimes, it takes 200s cpu time to finish, sometimes it may take up to 300s cpu time to finish. This bothers me as performance is a criterion for our testing.
I've changed the originally used boost::timer::cpu_timer for wall/cpu time calc and use sys apis 'clock_gettime' and 'getrusage'. It doesn't help.
Is it because of the 'steal time' by hypervisor (Vmware)? Is steal time included in the user/sys time collected by 'getrusage'?
Anyone have knowledge on this? Many Thanks.
It would be useful if you provided some extra information. For example are your threads dependent? meaning is there any synchronization going among them?
Since you are using a virtual machine, how is your CPU shared with other users of the server. It might be that even the same single CPU core is shared, thus not each time you have the same allocation of CPU resources [this is the steal time you mention above].
Also you mention that CPU time is different: this is the time spent in user code. If you have sync among threads (such as a mutex, etc) then depending on how operating system wakes up threads etc, the over all time might vary.

is multi-threading dependent on the architecture of the machine?

I have been reading lately about system architecture and the topic of multi-threading has not been covered in detail with latest improvements in technology. I did my part of search, but could not find answers for the following:
The questions have are
1) Is multi-threading dependent on the system architecuture (CPU). do all CPU (single core) support multi-threading? If it does not, what happens to multi-threaded applications when run on those machines
It is cited here that
Intel CPUs support multithreading, but only two threads per CPU.
AMD CPUs do not support multithreading and AMD often sites Microsoft's
recommendations to turn off Hyperthreading on Intel CPUs when running applications
like peoplesoft and Exchange.
2) so what does it mean it say only two threads per CPU here. At any given time, CPU (single core) can process only thread. and the other thread is waiting to be processed correct?
3) how is it different from an application that spawns, say, 10 threads and waiting for them to be executed. If the CPU at the most can tackle only two threads, shouldn't programmer keep that fact in consideration when writing multi-threaded applications.
Even with multi-core processors (say quad-core) at the most 8 threads can be queued, but only 4 threads can be processed at the same time.
P.S: I have a read a little about hyper-threading but I am not sure if that is relevant here and if
all processors support hyper-threading
1) It depends on the operating system more than anything. Even for single core architectures, multi-threading can be supported, but the threads are not executing in parallel - The OS will context-switch between them.
2) Intel usually supports two-way hardware threading ( also called simultaneous multi-threading), where each thread is allocated a pipeline. So if you have a process with two threads they can both execute on the same core simultaneously.
3) See 1. Basically the operating system is going to allocate as many threads as it can to hardware before it plans to context-switch between the threads it couldn't allocate. This process is dependent on the OS's scheduler, and you can read about the Linux one to get a good idea of what's going on.
Edit: Hypethreading is basically the hardware threading feature I mentioned.
In your question CPU means core.
1) It does. I believe memory access on ARMs is in words, so write to char is not atomic
Also memory ordering differs Modern OSes (anything but DOS) support context switching: while one thread executes, others wait. Total number of threads in all Windows processes is about 1000. Common time quant (time to load CPU) is 1-10 ms. One core multithreading don't improve computational power but allows asynchronous tasks. For example GUI doesn't freeze during network activity. One threads waits net, another one responds to user activity.
2) Yes
3) It is common practice to spawn number of threads equal to number of (virtual) cores, ie number of cores in system for AMD and twice for Intel. It is true only for computational threads. Web server threads usually wait net and don't load CPU a lot, so it is better to spawn thousands of threads.
Hyperthreading is cool for tasks that wait RAM. While one thread waits data another one executes. For math it usually not increase performance. It is good for work with data that is not cache-friendly: lists, trees, hash tables that don't fit into cache.

What is the relationship between threads (in a Java or a C++ program) and number of cores in the CPU?

Can someone shed some light on it?
An i7 processor can run 8 threads but I am pretty sure we can create more than 8 threads in a JAVA or C++ program(not sure though). I have an i5 processor and while studying concurrency I have created 10 threads for assignments. I am just trying to understand how Core rating of CPU is related to threads.
The thread you are refering to is called a software thread; and you can create as many software threads as you need, as long as your operating system allows it. Each software thread, or code snippet, can run concurrently from the others.
For each core, there is at least one hardware thread to which the operating system can assign a software thread. If you have 8 cores, for example, then you have a hardware thread pool of capacity 8. You can map tens or hundreds of software threads to this 8-slot pool, where only 8 threads are actually running on hardware at the same time, i.e. in parallel.
Software threads are like people sharing the same computer. Each one can use this computer up to some time, not necessarily have his task completed, then give it up to another.
Hardware threads are like people having a computer for each of them. All of them can proceed with their tasks at the same time.
Note: For i7, there are two hardware threads (so called hyper-threading) in each core. So you can have up to 16 threads running in parallel.
There are already a couple of good answers talking about the hardware side of things, but there isn't much talk about the software side of things.
The essential fact that I believe you're missing is that not all threads have to be executing all the time. When you have thousands of threads on an 8 core machine, only a few of them are actually running at any given time. The others are sitting around doing nothing until some processor time becomes free. This has huge advantages because threads might be waiting on other resources, too. For example, if I have one thread trying to read a file from disk, then there's no reason for it to be taking up CPU time while it's waiting for the hard disk data to load into RAM. Another example is when the thread is waiting for a response from some other machine (such as a web request over the internet). When you have more threads than your processor can handle at once, the operating system and/or runtime (It depends on the OS and runtime implementation.) are responsible for deciding which threads should get available processor time. This kind of set up lets you maximize your machine's productivity because CPU cycles are doing something useful almost all the time.
A "thread" is a software abstraction which defines a single, self-consistent path of execution through the program: in most modern systems, the number of threads is basically only limited by memory. However, only a comparatively small number of threads can be run simultaneously by the CPU. Broadly speaking, the "core count" is how many threads the CPU can run truly in parallel: if there are more threads that want to run than there are cores available, the operating system will use time-slicing of some sort to let all the threads get some time to execute.
There are a whole bunch of terms which are thrown around when it comes to "cores:"
Processor count: the number of physical CPU chips on a system's motherboard. This used to be the only number that mattered, until CPUs with multiple cores became available.
Logical core count: the number of threads that the system hardware can run in parallel
Physical core count: the number of copies of the CPU execution hardware that the system has -- this is not always equal to the logical core count, due to features like SMT ("simultaneous multithreading") which use a single piece of hardware to run multiple threads in parallel
Module count: Recent (Bulldozer-derived) AMD processors have used an architecture which is a hybrid between SMT and the standard one-logical-core-per-physical-core model. On these CPUs, there is a separate copy of the integer execution hardware for each logical core, but two logical cores share a floating-point unit and the fetch-and-decode frontend; AMD calls the unit containing two logical cores a module.
It should also be briefly mentioned that GPUs, graphics cards, have enormous core counts and run huge numbers (thousands) of threads in parallel. The trade-off is that GPU cores have very little memory and often a substantially restricted programming model.
Threads are handled by the OS's scheduler. The number of cores in a CPU determine how many threads it can run at the same time.
Note that threads are constantly switched in and out by the scheduler to give the "illusion" that everything is running at the same time.
More here, if you're interested.
no, no, no... Your I7 has eight execution threads and can run 8 threads at once.
1000 threads or more can be waiting for processor time.
calling thread.sleep moves a thread off the execution core and back into memory where it waits until woken.

Running mlutiple threads in CPU

All we know that JVM schedules the user threads in a single CPU based machine .Why cant a single CP run mltiple process/threads in parallel,What is the constrain stops that capability
Also JVM is like a another software which is running in any machine,There may be thousands of other programs may waiting for the CPU cycle at a given time between this how JVM threads get the schedules from the CPU What is the parameter which gives the speed/possibility of the allocation of cycles for any process in any machine.
This is not really a Java question, but a cpu architecture question.
And some CPUs DO run multiple threads in parallel per core. Look at Intel and Hyperthreading.. a 4 core machine with 8 threads, does the opposite of what you suggest.
traditional single-core processors can only process one instruction at a time, meaning that they can only work in a single thread at any one point in time.
multithread support is achieved synthetically by giving threads 'turns' on the cpu so that they appear to be running concurrently.
multi-core processors can process an instruction per CPU at any one point in time.
this question is more in relation to CPU hardware design than programming and especially not a single language ie java as the restriction is across the board.

Resources