Calculating execution time for 2-threaded CPUs? - multithreading

Given three programs P0, P1, and P2, and two CPUs, each CPU having 2 threads. The running times of the programs are 5, 10 and 20 msecs respectively. How long will it take to execute all 3 programs, assuming they do not change CPUs and do not block during execution?
My answer is 20 msecs, because no matter how we organize the programs on the CPUs, everything completes as soon as the slowest program (P2) does, hence 20 msecs. However, the solutions manual gives the answers 20, 25 and 30. Can anyone tell me how those answers came to be?
It says
If P0 and P2 are scheduled on the same CPU and P1 is scheduled on the other CPU, it will take 25 msecs.
The question is why. Shouldn't the first CPU take P2's time (20 msecs) and the second one P1's, and given that P2 takes longer and both CPUs run concurrently, shouldn't the answer be 20 msecs as well?

p0 - 5 ms
p1 - 10 ms
p2 - 20 ms
Scenario 1:
| P0 | | |
| P1 | | p2 |
\_____/ \____/
CPU0 CPU1
CPU0 takes 15 ms to finish both of its jobs, while CPU1 runs for 20 ms before P2 is done. They run in parallel, so everything is done in 20 ms.
Scenario 2:
| P2 | | |
| P0 | | p1 |
\_____/ \____/
CPU0 CPU1
CPU0 takes 25 ms to finish both of its jobs, while CPU1 runs for only 10 ms. They run in parallel, but CPU1 goes idle after 10 ms and CPU0 needs an extra 15 ms to finish. Thus, 25 ms.
Scenario 3:
| P2 | | |
| P1 | | p0 |
\_____/ \____/
CPU0 CPU1
CPU0 takes 30 ms to finish both of its jobs, while CPU1 runs for only 5 ms. Again, they run in parallel, so while CPU1 goes idle after 5 ms, CPU0 needs an extra 25 ms to finish. Thus, 30 ms.
Pay attention: according to your question, jobs remain on the same CPU they were scheduled on. If one job needs 20 ms and the other needs 10 ms, it will take 30 ms to finish them both on that CPU. If your jobs could run on different cores of the same CPU, then it would be 20 ms anyway (that is the best case), but that is not guaranteed to be the situation.
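To make the arithmetic explicit, here is a small bash sketch (not part of the original question) that computes the finish time for each possible pairing: each CPU runs its jobs back to back, and the overall time is the longer of the two per-CPU totals.
#!/bin/bash
# Runtimes in ms from the question: P0=5, P1=10, P2=20.
p0=5; p1=10; p2=20

makespan() {                     # makespan <CPU0 total> <CPU1 total>
    local a=$1 b=$2
    (( a > b )) && echo "$a" || echo "$b"
}

echo "P0+P1 on CPU0, P2 alone: $(makespan $((p0+p1)) $p2) ms"   # 20 ms
echo "P0+P2 on CPU0, P1 alone: $(makespan $((p0+p2)) $p1) ms"   # 25 ms
echo "P1+P2 on CPU0, P0 alone: $(makespan $((p1+p2)) $p0) ms"   # 30 ms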

Related

Performance degrading when more threads/processes take cores, even when not all cores are used

To explore CPU performance on Ubuntu, I'm using sysbench's cpu test. I'm also using taskset to bind each process to specific CPUs.
The script I used to run the experiment is below:
for concur in {1..24}
do
    end=$((concur - 1))
    echo "--> Create $concur jobs"
    for i in $(seq 0 $end)
    do
        j=$((i * 2))    # bind each job to a pair of adjacent CPUs: j and j+1
        taskset -c $j-$((j + 1)) sysbench --test=cpu --cpu-max-prime=200000 --num-threads=2 run > sysbench-cpu-$concur-$i.out &
        pids[${i}]=$!
    done
    echo "--> Waiting for ${pids[*]}"
    for pid in ${pids[*]}; do
        wait $pid
    done
    echo "--> Clean up ${pids[*]}"
    unset pids
done
I have a 48-core server, so the maximum number of threads does not exceed the number of cores.
Now the results are:
Concurrent Jobs | Average Time For Each Job (s)
              1 | 331.8708
              2 | 342.1978
              3 | 354.9352333
              4 | 360.505625
              5 | 363.86354
              6 | 363.632
              7 | 363.9364429
              8 | 363.761375
              9 | 365.1571667
             10 | 365.81912
             11 | 366.2094909
             12 | 366.56025
             13 | 377.9281923
             14 | 388.7300214
             15 | 398.2705067
             16 | 406.7716438
             17 | 414.7786471
             18 | 422.2813444
             19 | 429.1379421
             20 | 434.907065
             21 | 440.5540571
             22 | 445.3167864
             23 | 450.0088783
             24 | 456.5647417
As the results show, the time increases as more jobs run. But why?
The jobs are purely CPU-bound: in sysbench's cpu test each thread just computes primes up to --cpu-max-prime, so there is no locking, and it doesn't even use memory to store an array. I know that when too many cores are taken, threads can be scheduled off in favour of other system threads, but here, even with a moderate number of jobs, for example 10 jobs taking 20 cores, the performance is still worse than with 1 job.
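One sanity check worth running alongside the experiment (not part of the original script) is to confirm that each sysbench process really keeps the CPU affinity it was given, which taskset itself can report:
# While the jobs are running, list each sysbench process and its allowed CPUs.
for pid in $(pgrep -x sysbench); do
    taskset -cp "$pid"    # prints e.g. "pid 12345's current affinity list: 4,5"
done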

Cannot allocate exclusively a CPU for my process

I have a simple single-threaded application that does almost pure processing:
It uses two int buffers of the same size.
It reads all the values of the first buffer one by one;
each value is a random index into the second buffer.
It reads the value at that index in the second buffer.
It sums all the values taken from the second buffer.
It repeats all the previous steps for bigger and bigger buffers.
At the end, I print the number of voluntary and involuntary CPU context switches.
If the buffers become quite big, my PC starts to slow down: why? I have 4 cores with hyper-threading, so 3 cores remain free and only one is 100% busy. Is it because my process saturates the "RAM bus"?
Then, I created a CPU-set that I want to dedicate to my process (my CPU-set contains both CPU-threads of the same core)
$ cat /sys/devices/system/cpu/cpu3/topology/core_id
3
$ cat /sys/devices/system/cpu/cpu7/topology/core_id
3
$ cset set -c 3,7 -s my_cpuset
$ cset set -l
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0-7 y 0 y 934 1 /
my_cpuset 3,7 n 0 n 0 0 /my_cpuset
It seems that absolutely no task at all is running on my CPU-set. I can relaunch my process and while it is running, I launch:
$ taskset -c 7 ./TestCpuset # Here, I launch my process
...
$ ps -mo pid,tid,fname,user,psr -p 25244 # 25244 being the PID of my process
PID TID COMMAND USER PSR
25244 - TestCpus phil -
- 25244 - phil 7
PSR = 7: my process is indeed running on the expected CPU-thread. I hope it is the only one running on it, but at the end my process displays:
Number of voluntary context switch: 2
Number of involuntary context switch: 1231
Since I had involuntary context switches, it means that other processes are running on my core: how is that possible? What must I do to get the number of involuntary context switches down to 0?
Last question: When my process is running, if I launch
$ cset set -l
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0-7 y 0 y 1031 1 /
my_cpuset 3,7 n 0 n 0 0 /my_cpuset
Once again I get 0 tasks on my CPU-set. But I know that there is a process running on it: it seems that a task is not a process?
If the buffers become quite big, my PC starts to slow down: why? I have 4 cores with hyper-threading, so 3 cores remain free and only one is 100% busy. Is it because my process saturates the "RAM bus"?
You have reached the hardware performance limit of a single-threaded application, that is, 100% CPU time on the single CPU your program is allocated to. Your application thread will not run on more than one CPU at a time (reference).
What must I do in order to get Number of involuntary context switch = 0?
Aren't you missing the --cpu_exclusive option in the cset set command?
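For example, a sketch along those lines (--cpu_exclusive is a documented cset set flag; the cset proc invocation is how I would expect the running process to be moved into the set, so treat the exact flags as an assumption):
# Recreate the cpuset so that CPUs 3 and 7 belong to it exclusively,
# then move the running process (PID 25244 from the question) into it.
sudo cset set -c 3,7 -s my_cpuset --cpu_exclusive
sudo cset proc --move --pid=25244 --toset=my_cpuset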
By the way, if you want to achieve a lower execution time, I suggest you make the application multithreaded and let the operating system and the hardware beneath it parallelize the execution instead. Locking a process to a CPU set and preventing it from context-switching may degrade overall operating system performance and is not a portable solution.

How to get percentage of processor use with bash?

How do I get my processor usage as a percentage, from 0% to 100%?
I'd like to know what percentage of my processor I'm using, preferably in bash, although any method that yields a percentage is fine.
I have this script that I found on Google, but it is very imprecise.
I tried to improve it but could not. Does anyone know a method to get CPU utilization as a percentage, 0-100?
My script:
NUMCPUS=`grep ^proc /proc/cpuinfo | wc -l`; FIRST=`cat /proc/stat | awk '/^cpu / {print $5}'`; sleep 1; SECOND=`cat /proc/stat | awk '/^cpu / {print $5}'`; USED=`echo 2 k 100 $SECOND $FIRST - $NUMCPUS / - p | dc`; echo ${USED}% CPU Usage
Processor use or utilization is a measurement over time. One way to measure utilization as a percentage is to compute it from two successive reads of /proc/stat. A simple bash script to do that:
#!/bin/bash

# Read /proc/stat (first datapoint)
read cpu user nice system idle iowait irq softirq steal guest < /proc/stat

# compute active and total time so far
cpu_active_prev=$((user+system+nice+softirq+steal))
cpu_total_prev=$((user+system+nice+softirq+steal+idle+iowait))

sleep 0.05

# Read /proc/stat again (second datapoint)
read cpu user nice system idle iowait irq softirq steal guest < /proc/stat

# compute active and total time again
cpu_active_cur=$((user+system+nice+softirq+steal))
cpu_total_cur=$((user+system+nice+softirq+steal+idle+iowait))

# CPU utilization (%) over the sampling interval
cpu_util=$(( 100 * (cpu_active_cur - cpu_active_prev) / (cpu_total_cur - cpu_total_prev) ))

printf " Current CPU Utilization : %s\n" "$cpu_util"
exit 0
use/output:
$ bash procstat-cpu.sh
Current CPU Utilization : 10
output over 5 iterations:
$ ( declare -i cnt=0; while [ "$cnt" -lt 5 ]; do bash procstat-cpu.sh; ((cnt++)); done )
Current CPU Utilization : 20
Current CPU Utilization : 18
Current CPU Utilization : 18
Current CPU Utilization : 18
Current CPU Utilization : 18
top -bn1 | sed -n '/Cpu/p'
gives the following line
Cpu(s): 15.4%us, 5.3%sy, 0.0%ni, 78.6%id, 0.5%wa, 0.0%hi, 0.1%si, 0.0%st
You can pull out any CPU field; the following takes the user CPU (us):
top -bn1 | sed -n '/Cpu/p' | awk '{print $2}' | sed 's/..,//'
Output:
15.4%
If you want another field, like system CPU (sy), change the awk field from $2 to $3:
top -bn1 | sed -n '/Cpu/p' | awk '{print $3}' | sed 's/..,//'
Output:
5.3%
The other CPU fields are (a combined example follows the list):
us: user, CPU used by user processes
sy: system, CPU used by system/kernel processes
ni: nice, CPU used by processes that were reniced
id: idle, CPU not in use
wa: I/O wait, essentially idle CPU waiting on I/O devices
hi: hardware IRQ, CPU used to service hardware IRQs
si: software IRQ, CPU used to service soft IRQs
st: steal time, CPU time which the hypervisor dedicated (or 'stole') for other guests in the system
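Building on the same idea, a rough sketch of overall usage as 100 minus the idle field (this assumes the older top output format shown above, where the idle value appears in the fourth column as something like 78.6%id,):
top -bn1 | sed -n '/Cpu/p' | awk '{ idle = $4; sub(/%id,?/, "", idle); print 100 - idle "%" }'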
To get the total usage percentage since the system was brought up:
awk '/cpu /{print 100*($2+$4)/($2+$4+$5)}' /proc/stat
To get the usage percentage over the last second:
awk -v a="$(awk '/cpu /{print $2+$4,$2+$4+$5}' /proc/stat; sleep 1)" '/cpu /{split(a,b," "); print 100*($2+$4-b[1])/($2+$4+$5-b[2])}' /proc/stat
Explanation
From man 5 proc, the meaning of the first four numbers on the cpu line in /proc/stat is given by:
cpu 3357 0 4313 1362393
The amount of time, measured in units of USER_HZ (1/100ths of a second on most architectures; use sysconf(_SC_CLK_TCK) to obtain the right value), that the system spent in user mode, user mode with low priority (nice), system mode, and the idle task, respectively. The last value should be USER_HZ times the second entry in the uptime pseudo-file.
To get the CPU usage, we add the user and system times and divide by the total of user, system, and idle time.
Let's look again at the calculation for total CPU usage since system up:
awk '/cpu /{print 100*($2+$4)/($2+$4+$5)}' /proc/stat
By requiring that the line match cpu, we get system totals. The second column is user time, the fourth is system time, and the fifth is idle time. The ratio is multiplied by 100 to get a percentage.
Now, let's consider the recent CPU usage:
awk -v a="$(awk '/cpu /{print $2+$4,$2+$4+$5}' /proc/stat; sleep 1)" '/cpu /{split(a,b," "); print 100*($2+$4-b[1])/($2+$4+$5-b[2])}' /proc/stat
This reads /proc/stat twice, a second apart. The first time, the user+system and user+system+idle totals are saved in the variable a. sleep is called to delay for a second. Then /proc/stat is read a second time. The old user+system total is subtracted from the new total and divided by the change in the total of all times. The result is multiplied by 100 to convert it to a percentage and printed.
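The same two-sample arithmetic works per CPU as well; here is a hypothetical variant (not from the original answer) that matches the cpu0, cpu1, ... lines instead of the aggregate cpu line:
awk -v a="$(awk '/^cpu[0-9]/{print $1, $2+$4, $2+$4+$5}' /proc/stat; sleep 1)" '
/^cpu[0-9]/ {
    # a holds one "cpuN active total" line per CPU from the first sample
    n = split(a, rows, "\n")
    for (i = 1; i <= n; i++) {
        split(rows[i], f, " ")
        if (f[1] == $1)
            print $1, 100*($2+$4-f[2])/($2+$4+$5-f[3]) "%"
    }
}' /proc/stat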
Using vmstat, the command is short, moderately accurate, and takes one second:
vmstat 1 2 | awk 'END { print 100 - $15 }'
Very simple script that considers only System, Idle and User.
The benefit over the other answers is that it doesn't require monitoring utilities such as top, and it also displays fractions, which the current top answer does not.
#!/bin/bash
read u1 s1 i1 <<< $(grep 'cpu ' /proc/stat | awk '{print $2" "$4" "$5}' )
sleep 1
read u2 s2 i2 <<< $(grep 'cpu ' /proc/stat | awk '{print $2" "$4" "$5}' )
u=$(echo "scale=4;$u2-$u1" | bc)
s=$(echo "scale=4;$s2-$s1" | bc)
i=$(echo "scale=4;$i2-$i1" | bc)
cpu=$(echo "scale=4;($u+$s)*100/($u+$s+$i)" | bc)
echo $cpu
Brief description: we pull data from /proc/stat, from the line that starts with 'cpu'. We then parse out the second token, which is user time, the fourth token, which is system time, and the fifth token, which is idle time.
At this point, you may be tempted to just do the math, but all that will give you is the utilization since boot time. We need one more data point.
We sleep 1 second and we pull the data from /proc/stat again. Now we get the difference between the first pull and the second pull. This is the CPU utilization for that 1 second while we slept.
We take the difference for each of the variables, and then do the math on the differences. The 'scale=4' in front of each calculation forces a floating-point answer with 4 digits of precision.

Threads, Coro, Anyevent confusion

I am relatively new to Perl and even newer to threading in Perl. I have a Perl script that takes input from 3 different sources (2 LDAP queries and a file that isn't always there). Because some parts can take longer than others, I decided to use threads and queues. During development, testing individual components of the script worked very well, but after putting it all together the performance seems to degrade.
Basic structure is this
2 threads (read file or read AD entries) -> Queue1 -> 2 threads (scrub data) -> Queue2 -> 3-4 threads (compare against existing local LDAP entries). Several threads report statistics back to the main script, and once all threads are done an email is sent with all the stats and the status of that run.
I am using dequeue_nb and I thought that would help but no luck.
The performance hit seems to be in the queues. While looking for tips to improve performance I've run into several articles saying Perl threads are no good and to use Coro, POE, AnyEvent, IO::Async, etc.
This doesn't seem like an "event" problem, so I didn't think AnyEvent or POE would be the way to go, and from what I'm seeing Coro only uses one CPU at a time, so I'm not sure that would work either. I thought about using a combination of them, but then my head started to hurt. Does anyone have any suggestions on how to fix/troubleshoot my script, or how to implement one of the other modules?
A problem with parallelism is synchronization. It is a performance killer, it is bad, it is to be avoided if possible.
OP's architecture
Let's look at your architecture:
+--------------+--------------+
| Input 1 | Input 2 |
+--------------+--------------+
| QUEUE A |
+--------------+--------------+
| Scrub 1 | Scrub 2 |
+--------------+--------------+
| QUEUE B |
+---------+---------+---------+
| Compare | Compare | Compare |
+---------+---------+---------+
Discussion
Queue A has to be synchronized across four threads; Queue B across 5-6. Only one thread can access the Queue at any time, so most of the time your threads will be waiting, not working!
Parallel Pipeline Architecture
A somewhat different architecture could look like this:
+-----------+ +-----------+
| Input 1 | | Input 2 |
+-----------+ +-----------+
| QUEUE 1A | | QUEUE 2A |
+-----------+ +-----------+
| Scrub 1 | | Scrub 2 |
+-----------+ +-----------+
| QUEUE 1B | | QUEUE 2B |
+-----+-----+ +-----+-----+
| Cmp | Cmp | | Cmp | Cmp |
+-----+-----+ +-----+-----+
Discussion
Here, the A Queues are only shared between two threads (less waiting), the B Queues only among three. This architecture should perform faster for inputs of similar size/complexity. If Input 2 were considerably shorter, the whole Pipeline 2 would finish before Pipeline 1 is even half done. It is, however, far better than using a single process for each Pipeline.
The Lawn Sprinkler Architecture
Concept
An even better architecture would try to distribute the output of a process across multiple Queues. (The reverse, having threads fetch their input from multiple queues, is bad when a queue is empty.)
Each successive queue write should go to a different queue:
+-----------+ +-----------+
| Input 1 | | Input 2 |
+-----------+ +-----------+
| \ / |
+-----------+ +-----------+
| QUEUE 1A | | QUEUE 2A |
+-----------+ +-----------+
| Scrub 1 | | Scrub 2 |
+-----------+ +-----------+
/ | \ \ / / | \
+-------+-------+-------+-------+
| Q. 1B | Q. 2B | Q. 3B | Q. 4B |
+-------+-------+-------+-------+
| Cmp | Cmp | Cmp | Cmp |
+-------+-------+-------+-------+
This makes sure each thread has the same workload, but it cannot make sure that all threads finish at the same time.
Discussion
All Queues are shared among 3 Threads. The problem is that two Threads will block each other when writing to a queue. If the time between Queue write accesses is significantly larger than the write duration, this should be no problem, else the second architecture can be mixed in.
So whether this architecture makes sense depends on your exact requirements.
It is slower for evenly sized inputs, but performs better on irregular input.
Appendices
On implementing:
Which framework is used is secondary to the architecture. If you only pass around text strings, I strongly advise using pipes. If you have to pass Perl data types or objects, you probably have to accept the additional overhead of a real Queue: when adding an unshared variable to a queue, a deep copy has to be made (see Leon Timmermans' answer) in addition to all the synchronization overhead.
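As a rough illustration of the pipes suggestion, here is a hypothetical shell sketch of the parallel-pipeline layout (the second architecture above), where each stage is a separate process and the queues are ordinary pipes; read_input_1, read_input_2, scrub and compare are placeholder commands, not part of the original script:
#!/bin/bash
# Pipeline 1: Input 1 -> Scrub 1 -> Compare
read_input_1 | scrub | compare > results-1.txt &

# Pipeline 2: Input 2 -> Scrub 2 -> Compare
read_input_2 | scrub | compare > results-2.txt &

wait    # both pipelines run in parallel; block until both have finished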
On scalability:
Architectures 1 and 3 are not fixed in the number of threads. I strongly suggest using this flexibility to benchmark different compositions. A rule of thumb is to use n to 2n threads, where n is the number of processors (or hardware threads). This can be seen as the maximum sensible number of threads for one stage. Above that, you only pay a memory penalty with no speedup. A performance saturation point may be reached earlier, when a stage can process the input faster than it is supplied.
What kind of data are you putting in the queues? AFAIK simple data is cheaper than complex structures, since it needs to be cloned and copied at least twice. I've been planning to write a faster queue implementation (most of the work is already done, actually), but haven't published it yet.

How to find the processor queue length in linux

Trying to determine the Processor Queue Length (the number of processes that are ready to run but aren't currently running) on a Linux machine. There is a WMI call in Windows for this metric, but not knowing much about Linux I'm trying to mine /proc and top for the information. Is there a way to determine the queue length for the CPU?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."
sar -q will report queue length, task list length and three load averages.
Example:
matli@tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (might vary a little on different machines):
vmstat|tail -1|cut -d" " -f2
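A variant with awk instead of cut avoids depending on the exact column spacing, at the cost of taking a fresh one-second sample (it simply prints the r column of the last line):
vmstat 1 2 | awk 'END { print $1 }'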
The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, you want field 8. To find the run queue length, you would (a sketch follows after these steps):
Observe field 8 for each CPU and record the value.
Wait for some interval.
Observe field 8 for each CPU again, and calculate how much the value has increased.
Dividing that difference by the length of the time interval waited (the documentation says it's in jiffies, but it's actually in nanoseconds since the addition of CFS), by Little's Law, yields the mean length of the scheduler run queue over the interval.
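A minimal sketch of that procedure (assuming the cpu<N> line layout quoted above, with field 8 in nanoseconds; awk's field numbers are shifted by one because the leading cpuN token counts as $1):
#!/bin/bash
interval=5    # sampling interval in seconds

sample() {
    # Sum the "time spent waiting to run" counter (documentation field 8,
    # awk field $9) over all CPUs, in nanoseconds.
    awk '/^cpu[0-9]/ { sum += $9 } END { print sum }' /proc/schedstat
}

before=$(sample)
sleep "$interval"
after=$(sample)

# Little's Law: mean run-queue length = accrued waiting time / elapsed time
awk -v d=$((after - before)) -v t="$interval" \
    'BEGIN { printf "mean number of tasks waiting to run: %.2f\n", d / (t * 1e9) }'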
Unfortunately, I'm not aware of any utility to automate this process which is usually installed or even packaged in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately refers to a domain that is no longer resolvable. Fortunately, it's available on the wayback machine.
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.
uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is, hopefully it's close enough to this?
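For scripting, the same numbers can be read straight from /proc/loadavg on Linux; the fourth field (currently runnable entities / total entities) is the closest instantaneous analogue to a run-queue count. The values below are purely illustrative:
cat /proc/loadavg    # e.g. "0.81 0.53 0.47 2/341 12345"
                     # 1/5/15-minute load averages, runnable/total tasks, last PID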
