I'm trying to calculate the number of instruction cycles and delay cycles for the HCS12. I have some information about the HCS12:
The HCS12 uses the bus clock (E clock) as a timing reference.
The frequency of the E clock is half that of the on-board clock oscillator (oscillator: 48 MHz, E clock: 24 MHz).
Execution times of the instructions are also measured in E clock cycles.
Is the 24 MHz the crystal frequency? If so, only half of the crystal oscillator's frequency is used for CPU instruction timing. So, should it be halved?
How can I make 100-ms time delay for a demo board with a 24-MHz bus
clock?
In order to create a 100-ms time delay, we need to repeat the preceding instruction sequence 60,000 times [100 ms ÷ (40 ÷ 24,000,000 s) = 60,000]. The following instruction sequence will create the desired delay:
There is an example, but I don't understand how the 60,000 and 40 values are calculated.
ldx #60000
loop psha ; 2 E cycles
pula ; 3 E cycles
psha ; 2 E cycles
pula ; 3 E cycles
psha ; 2 E cycles
pula ; 3 E cycles
psha ; 2 E cycles
pula ; 3 E cycles
psha ; 2 E cycles
pula ; 3 E cycles
psha ; 2 E cycles
pula ; 3 E cycles
psha ; 2 E cycles
pula ; 3 E cycles
nop ; 2 E cycles
nop ; 3 E cycles
dbne x,loop
Your first section explains that if the internal oscillator (or external crystal) is 48 MHz, the E clock is 24 MHz. So if you want a delay of 100 ms, that is 24,000,000 * 100 / 1,000 E clocks, namely 2,400,000 instruction cycles.
The maximum register size available is 16 bits, so a loop counter value is chosen that is <= 65535.
Conveniently, 60,000 divides 2,400,000 exactly (2,400,000 = 60,000 * 40), so the inner loop is contrived to take 40 cycles. However, the timing comments on the last three lines are incorrect; they should be:
nop ; 1 E cycle
nop ; 1 E cycle
dbne x,loop ; 3 E cycles
This gives the required 40-cycle execution time.
Note that if you have interrupts or other processes, this hard-coded method is not very accurate; a timer interrupt would be better.
I'm currently using eBPF maps. Whenever I try to set the value associated with a key in a hash-table map to a variable that I increment at the end of the eBPF program (so that the value is incremented every time the function is run), the verifier throws an error:
invalid indirect read from stack R3 off -128+6 size 8
processed 188 insns (limit 1000000) max_states_per_insn 1 total_states 11 peak_states 11 mark_read 8
The main goal is to directly take the value and increment it every time the function is run.
I was under the impression that this would work.
bpf_spin_lock(&read_value->semaphore);
read_value->value += 1;
bpf_spin_unlock(&read_value->semaphore);
But this also throws the following error:
R1 type=inv expected=map_value
processed 222 insns (limit 1000000) max_states_per_insn 1 total_states 15 peak_states 15 mark_read 9
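One common cause of "expected=map_value" errors is dereferencing the pointer returned by bpf_map_lookup_elem() (or passing it to bpf_spin_lock()) before the verifier has seen a NULL check. Below is a minimal, hedged sketch of that pattern; the map name counter_map, the struct layout, the section name, and the function name are my assumptions, not taken from the original program:

```
/* Hypothetical sketch: counter_map, struct count_val and count_runs
 * are assumed names, not from the original post. */
struct count_val {
    struct bpf_spin_lock semaphore;
    __u64 value;
};

SEC("tracepoint/syscalls/sys_enter_read")
int count_runs(void *ctx)
{
    __u32 key = 0;
    struct count_val *read_value;

    read_value = bpf_map_lookup_elem(&counter_map, &key);
    if (!read_value)        /* the verifier must see this NULL check */
        return 0;

    bpf_spin_lock(&read_value->semaphore);
    read_value->value += 1;
    bpf_spin_unlock(&read_value->semaphore);
    return 0;
}
```

Without the if (!read_value) branch, the register holding the lookup result is not a plain map_value in the verifier's eyes, and it rejects the program wherever one is expected.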
Is it legal/valid to access program global variables from an internal subroutine called from an OpenMP task?
ifort 2021.7.0 20220726 doesn't report an error, but appears to produce random results depending on compiler options. Example:
program test1
  implicit none
  integer :: i, j, g
  g = 42
  !$OMP PARALLEL DEFAULT(SHARED)
  !$OMP SINGLE
  i = 0
  j = 1
  do while (j < 60)
    i = i + 1
    !$OMP TASK DEFAULT(SHARED) FIRSTPRIVATE(i,j)
    call sub(i,j)
    !$OMP END TASK
    j = j + j
  end do
  !$OMP END SINGLE
  !$OMP END PARALLEL
  stop
contains
  subroutine sub(i,j)
    implicit none
    integer i,j
    !$OMP CRITICAL(unit6)
    write(6,*) i,j,g
    !$OMP END CRITICAL(unit6)
  end subroutine sub
end program test1
Compiled with: ifort -o test1 test1.f90 -qopenmp -warn all -check all
Expected result:
5 16 42
4 8 42
6 32 42
3 4 42
2 2 42
1 1 42
Obtained result:
2 2 -858993460
5 16 -858993460
4 8 -858993460
6 32 -858993460
1 1 -858993460
3 4 -858993460
Note: the order of output lines doesn't matter --- just the number in the third column should be 42.
Different unexpected results are obtained by changing compiler options. For example, with "ifort -o test1 test1.f90 -qopenmp -warn all -O0", the third column is 256 and with "ifort -o test1 test1.f90 -qopenmp -O0" it is -740818552.
Of course g could be passed to sub() as an argument, but the program I'm helping with has dozens of shared global variables (that don't change in the parallel part), and subroutine calls go several layers deep.
Thanks, Peter McGavin.
Please try the oneAPI compiler package 2022.2 or 2022.3.
/iusers/xtian/temp$ ifx -qopenmp jimtest.f90
/iusers/xtian/temp$ ./a.out
2 2 42
1 1 42
3 4 42
5 16 42
4 8 42
6 32 42
I want to extract relevant data about a traffic junction and its connections from a log file. Example log:
SCN DD1251 At Glasgow Road - Kilbowie Road
Modified By ________
Type CR
Region WS Subregion
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 C DD1271 R
DD1351 D DD1351 B
E
Stage Suffix for Offset Optimizer 1
Double Cycle Initially ? N Force Single / Double Cycling status ? N
Double Cycle Group 00 Double Cycle Ignore ? N
Allow Link Max Saturation N Link Max Sat Override N
Stages 1 2 3 4
Fixed N N N Y
LRT stage N N N N
Skip allowed N N N N
Ped stage N N N N
Ped invite N N N N
Ghost stage N N N N
Offset authority pointer 0 Split authority pointer 0
Offset opt emiss weight 000 I/green feedback inhibit N
Bus Authority 00 ACIS node 00000
Bus Mode - Central extensions N Local extensions N Recalls N
Stage skipping N Stage truncation N Cancels N
Bus Priority Selection - Multiple buses N Queue Calculation N
Hold recall if faulty N Disable recall N Disable long jtim N Real Cancel N
Bus recall recovery type 0 Bus extension recovery type 0
Offset Bus authority pointer 0 Split Bus authority pointer 0
Bus skip recovery 0 Skip importance factor 0
Bus priority status OFF
LRT sat 1 000 LRT sat 2 000 LRT sat 3 000
PEDESTRIAN FACILITIES
Ped Node N Num Ped Wait Imp Factor 000
Ped Priority 0 Max Ped Priority Freq 00
Ped Lower Sat Threshold 000 Ped Upper Sat Threshold 000
Max Ped Wait Time 000
PEDESTRIAN VARIABLE INVITATION TO CROSS
Allow Ped Invite N Ped Priority Auto 000
Ped Invite Upper Sat 000 Prio Level 1 2 3 4
Max Ped Priority Smoothed Time 000 000 000 000
Max Ped Priority Increase Length 00 00 00 00
CYCLE TIME FACILITIES
Allow Node Independence N Operator Node Independence 0
Ghost Demand Stage N Num Ghost Assessment Cycles 15
Upper Trigger Ghost 04 Lower Trigger Ghost 0
SCN DD1271 At Glasgow Road - Hume Street
Modified 13-OCT-15 15:06 By BDAVIDSON
Type CR
Region WS Subregion
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
Stage Suffix for Offset Optimizer 1
Double Cycle Initially ? N Force Single / Double Cycling status ? N
Double Cycle Group 00 Double Cycle Ignore ? N
Allow Link Max Saturation N Link Max Sat Override N
Stages 1 2 3
Fixed N Y Y
LRT stage N N N
Skip allowed N N N
Ped stage N N N
Ped invite N N N
Ghost stage N N N
Offset authority pointer 0 Split authority pointer 0
Offset opt emiss weight 000 I/green feedback inhibit N
Bus Authority 00 ACIS node 00000
Bus Mode - Central extensions N Local extensions N Recalls N
Stage skipping N Stage truncation N Cancels N
Bus Priority Selection - Multiple buses N Queue Calculation N
Hold recall if faulty N Disable recall N Disable long jtim N Real Cancel N
Bus recall recovery type 0 Bus extension recovery type 0
Offset Bus authority pointer 0 Split Bus authority pointer 0
Bus skip recovery 0 Skip importance factor 0
Bus priority status OFF
LRT sat 1 000 LRT sat 2 000 LRT sat 3 000
PEDESTRIAN FACILITIES
Ped Node N Num Ped Wait Imp Factor 000
Ped Priority 0 Max Ped Priority Freq 00
Ped Lower Sat Threshold 000 Ped Upper Sat Threshold 000
Max Ped Wait Time 000
PEDESTRIAN VARIABLE INVITATION TO CROSS
Allow Ped Invite N Ped Priority Auto 000
Ped Invite Upper Sat 000 Prio Level 1 2 3 4
Max Ped Priority Smoothed Time 000 000 000 000
Max Ped Priority Increase Length 00 00 00 00
CYCLE TIME FACILITIES
Allow Node Independence N Operator Node Independence 0
Ghost Demand Stage N Num Ghost Assessment Cycles 15
Upper Trigger Ghost 04 Lower Trigger Ghost 0
I can already extract the first relevant line using the following Bash script:
grep SCN* LOG.TXT > JUNCTIONS.txt
Which creates a list of all the junctions like so:
SCN DD1251 At Glasgow Road - Kilbowie Road
SCN DD1271 At Glasgow Road - Hume Street
SCN DD1301 At Glasgow Road - Argyll Road - Cart Street
SCN DD1351 At Kilbowie Road - Chalmers Street
...
However, I want to extract the lines immediately after each link title, down to the final link of the node (just before a large amount of whitespace), without capturing anything from "Stage Suffix" onwards until the next link.
Is there a way to modify my Bash script to include an additional number of lines after each matching instance it finds?
Sounds like you want sed -n -e '/SCN/,/^\s*$/p'
which will print all lines between the line matching /SCN/ and the first line that is all whitespace.
If you have a fixed number of lines you want to match and have the appropriate grep, you might also try:
grep -A 9 SCN
which prints lines that match SCN and the 9 lines after a match.
With perl you can print the matching sections like this:
perl -ne 'print if(/^SCN / .. /^\s*$/);' LOG.TXT
-n assume while (<>) { ... } loop around program
-e <program> (the code to execute)
print each line from a line starting with SCN until a line with only whitespace characters (zero or more) is encountered.
This will be repeated for all SCN sections in LOG.TXT and would produce the following output from your example:
SCN DD1251 At Glasgow Road - Kilbowie Road
Modified By ________
Type CR
Region WS Subregion
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 C DD1271 R
DD1351 D DD1351 B
E
SCN DD1271 At Glasgow Road - Hume Street
Modified 13-OCT-15 15:06 By BDAVIDSON
Type CR
Region WS Subregion
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
Or awk (very similar to William's sed solution):
$ awk '/SCN/,/^\s*$/' LOG.TXT
SCN DD1251 At Glasgow Road - Kilbowie Road
Modified By ________
Type CR
Region WS Subregion
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 C DD1271 R
DD1351 D DD1351 B
E
SCN DD1271 At Glasgow Road - Hume Street
Modified 13-OCT-15 15:06 By BDAVIDSON
Type CR
Region WS Subregion
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
For the sake of clarity: I don't think this is a script, it's a single command.
grep SCN* LOG.TXT > JUNCTIONS.txt
But you can run grep with two context options, like this:
grep -A2 -B2 SCN
A = lines After the match
B = lines Before the match.
To make it a script, put the commands in a file such as script.sh and give it execute permission:
chmod +x script.sh
#!/usr/bin/env bash
pattern=$1
m=$2
n=$3
grep -A "$m" -B "$n" "$pattern" LOG.TXT > somefile
Then run the following command:
bash script.sh SCN 2 2
I wrote this simple program that multiplies matrices. I can specify how many OS threads are used to run it with the environment variable OMP_NUM_THREADS. It slows down a lot when the thread count gets larger than the number of hardware threads my CPU has.
Here's the program.
static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
c[DIMENSION][DIMENSION];
#pragma omp parallel for schedule(static)
for (unsigned i = 0; i < DIMENSION; i++)
    for (unsigned j = 0; j < DIMENSION; j++)
        for (unsigned k = 0; k < DIMENSION; k++)
            c[i][k] += a[i][j] * b[j][k];
My CPU is i7-8750H. It has 12 threads. When the matrices are large
enough, the program is fastest on around 11 threads. It is 4 times as
slow when the thread count reaches 17. Then run time stays about the
same as I increase the thread count.
Here are the results. The top row is DIMENSION. The left column is the thread count. Times are in seconds. The column with * is when compiled with -fno-loop-unroll-and-jam.
1024 2048 4096 4096* 8192
1 0.2473 3.39 33.80 35.94 272.39
2 0.1253 2.22 18.35 18.88 141.23
3 0.0891 1.50 12.64 13.41 100.31
4 0.0733 1.13 10.34 10.70 82.73
5 0.0641 0.95 8.20 8.90 62.57
6 0.0581 0.81 6.97 8.05 53.73
7 0.0497 0.70 6.11 7.03 95.39
8 0.0426 0.63 5.28 6.79 81.27
9 0.0390 0.56 4.67 6.10 77.27
10 0.0368 0.52 4.49 5.13 55.49
11 0.0389 0.48 4.40 4.70 60.63
12 0.0406 0.49 6.25 5.94 68.75
13 0.0504 0.63 6.81 8.06 114.53
14 0.0521 0.63 9.17 10.89 170.46
15 0.0505 0.68 11.46 14.08 230.30
16 0.0488 0.70 13.03 20.06 241.15
17 0.0469 0.75 20.67 20.97 245.84
18 0.0462 0.79 21.82 22.86 247.29
19 0.0465 0.68 24.04 22.91 249.92
20 0.0467 0.74 23.65 23.34 247.39
21 0.0458 1.01 22.93 24.93 248.62
22 0.0453 0.80 23.11 25.71 251.22
23 0.0451 1.16 20.24 25.35 255.27
24 0.0443 1.16 25.58 26.32 253.47
25 0.0463 1.05 26.04 25.74 255.05
26 0.0470 1.31 27.76 26.87 253.86
27 0.0461 1.52 28.69 26.74 256.55
28 0.0454 1.15 28.47 26.75 256.23
29 0.0456 1.27 27.05 26.52 256.95
30 0.0452 1.46 28.86 26.45 258.95
Code inside the loop compiles to this on gcc 9.3.1 with
-O3 -march=native -fopenmp. rax starts from 0 and increases by 64
each iteration. rdx points to c[i]. rsi points to b[j]. rdi
points to b[j+1].
vmovapd (%rsi,%rax), %ymm1
vmovapd 32(%rsi,%rax), %ymm0
vfmadd213pd (%rdx,%rax), %ymm3, %ymm1
vfmadd213pd 32(%rdx,%rax), %ymm3, %ymm0
vfmadd231pd (%rdi,%rax), %ymm2, %ymm1
vfmadd231pd 32(%rdi,%rax), %ymm2, %ymm0
vmovapd %ymm1, (%rdx,%rax)
vmovapd %ymm0, 32(%rdx,%rax)
I wonder why the run time increases so much when the thread count
increases.
My estimate says this shouldn't be the case when DIMENSION is 4096.
Here is what I thought before I remembered that the compiler does 2 j loops at a time. Each iteration of the j loop needs rows c[i] and b[j].
They are 64KB in total. My CPU has a 32KB L1 data cache and a 256KB L2
cache per 2 threads. The four rows the two hardware threads are working
with don't fit in L1 but fit in L2. So when j advances, c[i] is
read from L2. When the program is run on 24 OS threads, the number of
involuntary context switches is around 29371. Each thread gets
interrupted before it has a chance to finish one iteration of the j
loop. Since 8 matrix rows can fit in the L2 cache, the other software
thread's 2 rows are probably still in L2 when it resumes. So the
execution time shouldn't be much different from the 12 thread case.
However, measurements say it's 4 times as slow.
Now I have realized that 2 j loops are done at a time. This way each j iteration works on 96 KB of memory, so 4 of them can't fit in the 256 KB L2 cache. To verify that this is what slows the program down, I compiled the program with -fno-loop-unroll-and-jam. I got:
vmovapd ymm0, YMMWORD PTR [rcx+rax]
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdx+rax]
vmovapd YMMWORD PTR [rdx+rax], ymm0
The results are in the table. They are like when 2 rows are done at a time, which makes me wonder even more. When DIMENSION is 4096, 4 software threads' 8 rows fit in the L2 cache when each thread works on 1 row at a time, but 12 rows don't fit in the L2 cache when each thread works on 2 rows at a time. Why are the run times similar?
I thought maybe it's because the CPU warmed up when running with fewer threads and had to slow down. I ran the tests multiple times, both in order of increasing thread count and decreasing thread count. They yield similar results, and dmesg doesn't contain anything related to thermal or clock throttling.
I tried separately changing 4096 columns to 4104 columns and setting
OMP_PROC_BIND=true OMP_PLACES=cores, and the results are similar.
This problem seems to come from either the CPU caches (due to the bad memory locality) or the OS scheduler (due to more threads than the hardware can simultaneously execute).
I cannot exactly reproduce the same effect on my i5-9600KF processor (with 6 cores and 6 threads) and with a matrix of size 4096x4096. However, similar effects occur.
Here are performance results (with GCC 9.3 using -O3 -march=native -fopenmp on Linux 5.6):
#threads | time (in seconds)
----------------------------
1 | 16.726885
2 | 9.062372
3 | 6.397651
4 | 5.494580
5 | 4.054391
6 | 5.724844 <-- maximum number of hardware threads
7 | 6.113844
8 | 7.351382
9 | 8.992128
10 | 10.789389
11 | 10.993626
12 | 11.099117
24 | 11.283873
48 | 11.412288
We can see that the computation time starts to significantly grow between 5 and 12 cores.
This problem is due to a lot more data being fetched from the RAM. Indeed, 161.6 GiB are loaded from memory with 6 threads, while 424.7 GiB are loaded with 12 threads! In both cases, 3.3 GiB are written to the RAM. Because my memory throughput is roughly 40 GiB/s, the RAM accesses represent more than 96% of the overall execution time with 12 threads!
If we dig deeper, we can see that the number of L1 cache references and L1 cache misses are the same whatever the number of threads used. Meanwhile, there are a lot more L3 cache misses (as well as more references). Here are L3-cache statistics:
With 6 threads: 4.4 G loads
1.1 G load-misses (25% of all LL-cache hits)
With 12 threads: 6.1 G loads
4.5 G load-misses (74% of all LL-cache hits)
This means that the locality of the memory access is clearly worse with more threads. I guess this is because the compiler is not clever enough to do high-level cache-based optimizations that could reduce RAM pressure (especially when the number of threads is high). You have to do tiling yourself in order to improve the memory locality. You can find a good guide here.
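As a rough illustration of the tiling this answer recommends (a sketch under assumptions, not the poster's benchmark: DIM, BLOCK, and the function name are chosen arbitrarily, and the loop order mirrors the question's i/j/k code):

```c
/* Blocked (tiled) variant of c[i][k] += a[i][j] * b[j][k]: the j and k
 * loops are split into BLOCK-sized tiles so a BLOCK x BLOCK tile of b
 * stays resident in cache while it is reused. Sizes are illustrative,
 * and DIM is a multiple of BLOCK so no remainder loops are needed. */
#include <stddef.h>

#define DIM   256
#define BLOCK 64

static double a[DIM][DIM], b[DIM][DIM], c[DIM][DIM];

void matmul_tiled(void)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < DIM; i++)
        for (size_t jj = 0; jj < DIM; jj += BLOCK)
            for (size_t kk = 0; kk < DIM; kk += BLOCK)
                for (size_t j = jj; j < jj + BLOCK; j++)
                    for (size_t k = kk; k < kk + BLOCK; k++)
                        c[i][k] += a[i][j] * b[j][k];
}
```

A production version (or a BLAS call) would handle arbitrary sizes, tune the tile size to the actual cache hierarchy, and pack tiles into contiguous buffers.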
Finally, note that using more threads than the hardware can simultaneously execute is generally not efficient. One problem is that the OS scheduler often places threads on cores badly and frequently moves them. The usual way to fix that is to bind software threads to hardware threads using OMP_PROC_BIND=TRUE and to set the OMP_PLACES environment variable. Another problem is that the threads are executed using preemptive multitasking with shared resources (e.g. caches).
PS: please note that BLAS libraries (e.g. OpenBLAS, BLIS, Intel MKL, etc.) are much more optimized than this code, as they already include clever optimizations such as manual vectorization for the target hardware, loop unrolling, multithreading, tiling, and fast matrix transpositions when needed. For a 4096x4096 matrix, they are about 10 times faster.
This is the output of blktrace. I could not understand what "N 0 (00 ..) [multipathd]" means. I'm testing the write I/O performance of the FS.
I have 2 doubts:
N is an action, but I don't find any usage of it in blktrace.pdf.
What is the difference between iostat and blktrace?
blktrace o/p:
8,128 7 11 85.638053443 4009 I N 0 (00 ..) [multipathd]
8,128 7 12 85.638054275 4009 D N 0 (00 ..) [multipathd]
8,128 2 88 89.861199377 5210 A W 384 + 8 <- (253,0) 384
8,128 2 89 89.861199876 5210 Q W 384 + 8 [i_worker_0]
8,128 2 90 89.861202645 5210 G W 384 + 8 [i_worker_0]
8,128 2 91 89.861204604 5210 P N [i_worker_0]
8,128 2 92 89.861205587 5210 I WA 384 + 8 [i_worker_0]
8,128 2 93 89.861210869 5210 D WA 384 + 8 [i_worker_0]
8,128 2 94 89.861499857 0 C WA 384 + 8 [0]
8,128 2 95 99.845910681 5230 A W 384 + 8 <- (253,0) 384
8,128 2 96 99.845911148 5230 Q W 384 + 8 [i_worker_20]
8,128 2 97 99.845913846 5230 G W 384 + 8 [i_worker_20]
8,128 2 98 99.845915910 5230 P N [i_worker_20]
8,128 2 99 99.845917081 5230 I WA 384 + 8 [i_worker_20]
8,128 2 100 99.845922597 5230 D WA 384 + 8 [i_worker_20]
There is an introduction to blktrace here: http://duch.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf
On the difference between iostat and blktrace, check slides 5 and 6:
The iostat utility does provide information pertaining to request queues associated with specific devices
– Average I/O time on queue, number of merges, number of blocks read/written, ...
– However, it does not provide detailed information on a per-I/O basis
Blktrace: a low-overhead, configurable kernel component which emits events for specific operations performed on each I/O entering the block I/O layer
So, iostat is a generic tool that outputs statistics, while blktrace is a tool that captures and outputs more detailed information about every I/O request served while the tool was active.
Slide 11 has some decoding intro
8,128 7 11 85.638053443 4009 I N 0 (00 ..) [multipathd]
maj/min cpu seq# timestamp_s.ns pid ACT RWBS blocks process
multipathd is a kernel daemon; its name is enclosed in the [] brackets.
The default format is described in the blktrace.pdf (here is source of the pdf: http://git.kernel.org/cgit/linux/kernel/git/axboe/blktrace.git/tree/doc/blktrace.tex)
"%D %2c %8s %5T.%9t %5p %2a %3d "
%D Displays the event's device major/minor as: \%3d,\%-3d.
%2c CPU ID (2-character field).
%8s Sequence number
%5T.%9t 5-character field for the seconds portion of the time stamp and a 9-character field for the nanoseconds in the time stamp.
%5p 5-character field for the process ID.
%2a 2-character field for one of the actions.
%3d 3-character field for the RWBS data.
Actions
C -- complete
D -- issued
I -- inserted
Q -- queued
B -- bounced
M -- back merge
F -- front merge
G -- get request
S -- sleep
P -- plug
U -- unplug
T -- unplug due to timer
X -- split
A -- remap
m -- message
RWBS
'R' - read,
'W' - write
'D' - block discard operation
'B' for barrier operation or
'S' for synchronous operation.
So, for multipathd we have the "I" action ("inserted") and N for RWBS, and the N is strange: there is no N in the doc, nor even in the source (blkparse_fmt.c, fill_rwbs()). Why? Because both the doc and the source are old.
In a modern kernel, for example 3.12, there is an N case in fill_rwbs: http://sources.debian.net/src/linux/3.12.6-2/kernel/trace/blktrace.c?hl=1038#L1038
if (t->action == BLK_TN_MESSAGE) {
rwbs[i++] = 'N';
goto out;
}
And the blktrace_api.h declares BLK_TN_MESSAGE as
#define BLK_TN_MESSAGE (__BLK_TN_MESSAGE | BLK_TC_ACT(BLK_TC_NOTIFY))
* Trace categories
BLK_TC_NOTIFY = 1 << 10, /* special message */
* Notify events.
__BLK_TN_MESSAGE, /* Character string message */
So, 'N' is a notify action with a string message; I think the message is shown in place of the "blocks" field. I was able to find the patch which added BLK_TN_MESSAGE, but there was no corresponding update of the documentation (just as expected in a bazaar-model project like Linux): http://lkml.org/lkml/2009/3/27/31 "[PATCH v2 6/7] blktrace: print out BLK_TN_MESSAGE properly" (2009)