Is there a way to determine how many CPU cycles my single-threaded program requires to execute?

Is there a way to determine how many CPU cycles my single-threaded program requires to execute? - cpu-cycles

I have a single-threaded, computationally intensive program and I would like to know if there is a way (such as a tool/profiler) that I can use to determine the number of CPU cycles my program requires to execute.

Related

What's the difference between interactive and non-interactive thread? And performance in different CPU Scheduler?

A scheduler that approximates SRTF, like a multi-level feedback queue design, will tend to favor interactive programs that perform short CPU bursts. Linux's Completely Fair Scheduler sometimes does so, but since it has a different scheduling goal, it often wil not. In which of the following scenarios is CFS likely to result in much worse performance for the interactive thread than an MLFQ-like scheduler that approximates SRTF?
running one interactive thread with short CPU bursts that, if running alone, would use very little CPU time and one very CPU-intensive thread that never does I/O
running one interactive thread with short CPU bursts that, if running alone, would use very little CPU time and one non-interactive thread with much longer CPU bursts that performs disk I/O frequently
running one interactive thread with frequent short CPU bursts that, if running alone, would use most of the available CPU time, and one very CPU-intensive thread that never does I/O
running one interactive thread with short CPU bursts and a very large number of CPU-intensive threads that never do I/O
The correct answers are 3 and 4.
Why 3 & 4 are correct? What's the difference between interactive and non-interactive thread?

In this context, an interactive thread is one that tends to spend most of its time waiting for I/O, only doing small amounts of computation in between. That is, it mostly responds quickly to inputs rather than doing longer computations.
More broadly speaking, when we speak of interactive programs, we usually mean ones that are primarily responding to some external input. A common scheduling goal is to provide programs like these with higher priority than normal programs to provide at least the appearance of better performance to users waiting for the machine to do something. When thinking about interactivity this way, exact definitions vary --- there are different notions of what counts as an "external input".
For answering this question in particular, we don't actually need to use any definition of "interactive". The reason the question specifies that one thread is interactive is to motivate the question --- this is a case where SRTF-like schedulers can do better than CFS by identifying interactive threads by their tendency to have short CPU bursts. Rather than relying on us saying the thread is "interactive", we can understand how the SRTF scheduling policy will work based on the CPU burst lengths, which we are told explicitly. We can understand how the CFS policy will apply by considering that it splits the CPU time approximately fairly between the available threads.
For 1 and 2:
since the interactive thread doesn't use much CPU time overall, it will tend to be run first by CFS, but it will also tend to be run first by SRTF since it has the shortest CPU bursts
For 3:
CFS will end up giving the interactive thread about half the available CPU time (fairly splitting CPU time between the two available threads), but under SRTF, it would would always be run first (whenever it could run) because of its shorter CPU burst and would end up getting much more than half the time (since "running alone, [it] would use most of the available CPU")
For 4:
CFS will end up giving the interactive thread about 1/N of the available CPU time where N is the total number of threads and we are told that N is very large. Under SRTF, the thread would always run first, so it would almost certainly get more than the small sliver of CPU time that 1/N represents
--answer from my professor

How does multithreading utilizes multiple cores?

So recently I've learned some basic knowledge about multithreading. What I've understood is that thread is a lightweight process that runs under processes by sharing memory, while one process is running under one CPU core.
Yet by this perspective I couldn't understand some saying that threads utilizes multiple cores and make the whole program executes more effective. From what I've known, threads created by one process should run only under that specific process, which means that it should only run under that very one CPU core. If we want to utilize multiple cores, we should actually use multiprocess to run parallelly. Most of what I've researched is only about the conclusion, i.e multithreading utilizes multiple cores, but none of them explains my question. Did I think anything wrong? Thanks!

Your confusion lies here:
[...] while one process is running under one CPU core.
[...] threads created by one process should run only under that specific process, which means that it should only run under that very one CPU core.
This is not true. I think what the various explanations you have read meant that any process have at least one thread (where a 'thread' is a sequence of instructions ran by a CPU core).
If you have a multithreaded program, the process will have several threads (sequences of instructions ran by a CPU core) that can run concurrently on different CPU cores.
There are many processes executing on your computer at any given time. The Operating System (OS) is the program that allocates the hardware resources (CPU cores) to all these processes and decides which process can use which cores for what amount of time before another process gets to use the CPU. Whether or not a process gets to use multiple cores is not entirely up to the process. More confusing still, multithreaded programs can use more threads than there are cores on the computer's CPU. In that case you can be certain that all your threads do not run in parallel.
One more thing:
[...] threads utilizes multiple cores and make the whole program executes more effective
I am going to sound very pedantic, but it is more complicated than that. It depends on what you mean by "effective". Are we talking about total computation time, energy consumption ..?
A sequential (1 thread) program may be very effective in terms of power consumption but taking a very long time to compute. If you are able to use multiple threads, you may be able to reduce that computation time but it will probably incur new costs (synchronization between threads, additional protection mechanisms against concurrent accesses ...).
Also, multithreading cannot help for certain tasks that fall outside of the CPU realm. For example, unless you have some very specific hardware support, reading a file from the hard-drive with 2 or more concurrent threads cannot be parallelized efficiently.

Improvement on execution time from an aplication done with multi-threading is limited by the number of physical cores?

I was doing some testing with multi-threading on a linux virtual machine, and I implemented a benchmark with 10 threads (in this application each instruction would be executed 10x times more than in the single-thread scenario) and i was tweaking with the number of "physical cores" from the VM settings and with the single thread case I obtain 3s on average independently of the number of physical cores, If the number of cores is set to 1, and I run the multi-thread version, the execution time will be 30s. If I run it with 2 cores I obtain 15s and with 8 cores (the maximum number I can set) I obtain 6s, I obtain this dependancy due to the fact that I´m executing 10x times each instruction or is always like this?

If you have N threads running on N cores, and if they are all doing pure computation (i.e., not waiting for any I/O devices), and if they are all completely independent of each other, then they should be able to do N times as much work in a given amount of time as a single thread can do in the same amount of time.
But, that's if they are completely independent. That's a hard thing to achieve. For example, if the threads can't each do all of their work in their own, independent cache (e.g., in L1 cache,) then they will compete with each other for access to the main memory. They will sometimes have to wait for one another, because only one core can access main memory at any given moment. So, if the threads need to use memory, then the speedup will be somewhat less than N times.
If the threads need to share data in main memory, then it gets worse because then they will need to use mutual exclusion locks. One thread may keep a lock locked while it executes dozens of instructions, and any other thread that wants the same lock will have to wait until it is finished.
If the threads need to synchronize with each other/communicate with each other, then it gets worse still because unless their work loads are carefully balanced, a thread with less work to do may spend long periods of time awaiting signals from threads that have more work to do.
It's not unusual for a novice programmer to invent a multi-threaded version of some single-threaded algorithm, and find out that the multi-threaded version actually is slower than the single-threaded version.
There are some algorithms, for which even an expert programmer can't get much speed up by throwing more threads at it.

Performance of multi-threading exceeding cores

If I have a process that starts X amount of threads, will there ever be a performance gain having X higher than the number of CPU cores (assuming all the threads are working synchronously without async calls to storage/network)?
E.G. If I have a two cores CPU, will I just slow down the application starting 3+ constantly working threads?

It really depends on what your code does. it is too broad.
Having more threads than cores might speed up the program for example if some of the threads sleep or try to block on a lock. in this case, the OS scheduler can wake different thread and that thread will work while the other thread is sleeping.
Having more threads than the number of cores may also decrease the program execution time because the OS scheduler has to do more work to switch between the threads execution and that scheduling might be a heavy operation.
As always, benchmarking your application with different amount of threads is the best way to achieve maximum performance. there are also algorithms (like Hill-Climbing) which may help the application fine tune the best number of threads on runtime.

It is possible that such a thing happens.
Both Intel and AMD currently implement forms of SMT in their CPUs. This means that, in general, one single thread of execution may not be able to exploit 100% of the computing resources.
This happens because modern CPUs execute instructions in multiple pipelined steps, so that the clock frequency can be increased (less stuff gets done in every cycle, so you can do more cycles). The downside of this approach is that, if you have two consecutive instructions A and B, with the latter depending on the result of the former, you may have to wait some clock cycles without doing anything, just waiting for instruction A to complete. So, they came up with SMT, which allows the CPU to interleave instructions from two different threads/processes on the same pipeline, in order to fill such gaps.
Note: it is not exactly like this, CPUs don't just wait. They try to guess the result of the first operation and execute the second assuming that result. If their guess is wrong, they cancel the pending instructions and start over. Also, they have some feedback circuits that allow tighter execution of interdependent instructions. And nowadays branch predictors are surprisingly good. Things get better for the pipeline if you can just fill gaps with instructions from some other process, rather than going with a guess, but this potentially halves the amount of cache each executing thread can use.

It makes sense to run more threads if your threads make read/write/send/recv syscalls or similar, or sleep on locks, etc.
If your threads are pure computation threads, adding more of them will slow down system because of context switches.
If you still need more threads by design, you might want to look into the cooperative multitasking. Both Windows and Linux have API for that and that will work faster than the context switches. In Windows it called fibers:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661(v=vs.85).aspx
In Linux it is a set of functions make/get/swapcontext():
http://man7.org/linux/man-pages/man3/makecontext.3.html

This question: Optimal number of threads per core might help you.
In the thread I wrote an answer describing a scenario when having higher number of threads than the available number of cores boosts performance.

fork vs thread on one single core

Imagine that I have two tasks, each of them needs 2 seconds to finish its job.
In this case, if I create two threads for each of them and my PC is single-core, this won't save any time. Am I right ?
What if I use fork to create two processes (the machine is still single-core) and each process takes charge of one task ? Can this save any time ?
If not, I have a question:
In current modern machine (including multi-core), if I have several heavy tasks, which method should I use ?
fork ?
thread ?
fork + thread, meaning that create some processes and
each process contains more than one thread ?

Even with a single core having two threads may speed up execution. If your routine is purely CPU bound then two threads won't improve anything, indeed the performance will be worse because of context switching overhead. But if the routine has to wait for memory, disk or or network (which is usually the case) then two threads will provide performance gains even with a single core.
About fork vs threads, threads require less resources so, in principle, should be the first choice. But there are two caveats: 1) maybe you want to be able to terminate a parallel routine, this is much safer to do with processes than with threads and 2) some languages (notably Python and Ruby) provide pseudo-thread libraries which do not use real threads but switch between routines using the same thread. This simulated threading can be very useful for example when waiting for network requests but it must be taken into account that it's not real multithreading.
Amendment: As commented by Sergio Tulentsev, Ruby and Python do indeed provide real threads and not only coroutines.

"job takes 2 seconds" - If those 2 seconds are fully occupying the CPU (100% load), you won't gain anything with either thread nor fork if you have no cores to share. The single-core CPU is simply busy and you cannnot make it more busy.
In case this 2 seconds include waiting time (for example on I/O, storage, whatever) you could gain something, even with a single core. The amount of gain depends on the CPU working vs. CPU waiting ratio and the overhead of your multiprocessing. Most non-trivial programs have at least some amount of "CPU waiting", so multithreading is often useful even on single-core CPUs.
This overhead for setting up a coroutine and context switching can be considerable and needs to be measured. Obviously, the shorter the run time of your actiual task is, the larger will be the ratio of overhead (for setting up a thread or process, etc.) and the smaller will be you multi-processing gain.
Traditionally, threads used to have considerably less overhead than processes (after all, that was why they were invented), but the "considerably" has maybe vanished over time - On modern Linux systems, processes are only a tad slower to set up than threads (actually, both use the same system calls). You rather decide between thread or process based on the requirements related to amount of protection (or sharing) of data than execution speed.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string