Is there any way to obtain the area, energy consumption, or timing delay of a mapped circuit using Yosys?
This is my synthesis script:
read_verilog UBBKA_15_0_15_0.v
hierarchy -top UBBKA_15_0_15_0
prep; flatten; synth
clean -purge
dfflibmap -liberty NanGate15nm.lib
abc -liberty NanGate15nm.lib
clean -purge
write_verilog -noattr -noexpr netlist.v
You can use stat, or stat -liberty <liberty_file> with a liberty file, for area/gate usage information.
When given a delay target with -D, abc will print some delay information about the mapping (at a minimum, whether it was able to meet that target).
I am not aware of a way to do power analysis with Yosys at present.
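For the area and delay parts, a minimal sketch of what could be appended to the script above (the 1000 ps delay target passed to abc -D is an arbitrary example value, and stat's -liberty option may depend on your Yosys version):
# map with a delay target so abc reports timing information
abc -liberty NanGate15nm.lib -D 1000
# report cell counts and area using the liberty cell areas
stat -liberty NanGate15nm.lib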
The sampling rate for the perf record command can be set using -F. I want to know what the sampling rate is for the intel_pt event, i.e., for the command
perf record -e intel_pt// -- ./a.out
With -F in user mode, the maximum sampling rate allowed is 8000. While perf record may store the trace a few thousand times per second, the trace events recorded using perf record -e intel_pt// occur at a much higher frequency.
In other words, with the intel_pt event a trace of the application's execution is collected. Does perf record work differently when recording the intel_pt event, i.e., in some non-sampling mode?
Yes, the intel_pt mode of perf record is different: it is not the same as sampling (statistical) profiling with software (cpu-clock) or hardware (cycles) events. Sampling collects around 4000 samples of the current EIP per second and gives you a basic, inexact view of code execution. intel_pt is a hardware-based tracing technique which generates a lot of data about every control flow instruction (in the default perf intel_pt mode), allowing full control flow to be reconstructed, but it has bigger overhead. So the frequency of Intel PT is the same as the number of calls, branches, and returns executed per second by the program code (hundreds of millions).
With sampling on hardware events, perf record will ask the hardware PMU to count some event like CPU cycles and to generate an overflow interrupt after, for example, 2 million such events. On such an interrupt the perf_events subsystem in the kernel will record the current OS timestamp, the pid/tid of the current thread, and the EIP instruction pointer into a ring buffer, and reset the PMU counter for a new period. The perf subsystem limits the maximum frequency of interrupts by autotuning the value, and the -F option can be used to change the desired frequency of interrupts. When the ring buffer (around several megabytes in size) is filled, the perf user-space tool dumps its contents into the perf.data file; you can view the raw data with perf script or perf script -D, or just build histograms with perf report (which sorts EIPs by how often an interrupt occurred at that EIP instruction address, which is proportional to the time taken by that code). This mode produces around 4 thousand events per second of thread execution (check with perf report --header | grep sample_freq), at 48 bytes per sample, or about 192 kilobytes per second. The overhead is low, but the sampling is not exact.
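For comparison, a classic sampled-profiling session looks like this (a sketch, assuming the ./a.out workload from the question):
# ~4000 instruction-pointer samples per second on the cycles event
perf record -F 4000 -e cycles -- ./a.out
# show the configured sample frequency from the perf.data header
perf report --header | grep sample_freq
# dump individual raw samples (comm, tid, time, EIP, symbol)
perf script | head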
The perf wiki has a separate page for Intel Processor Trace (intel_pt): https://perf.wiki.kernel.org/index.php/Perf_tools_support_for_Intel%C2%AE_Processor_Trace
Control flow tracing is different from other kinds of performance analysis and debugging. It provides fine-grained information on branches taken in a program, but that means there can be a vast amount of trace data. Such an enormous amount of trace data creates a number of challenges, but it raises the central question: how to reduce the amount of trace data that needs to be captured. That inverts the way performance analysis is normally done. Instead of taking a test case and creating a trace of it, you first need to create a test case that is suitable for tracing.
So, intel_pt is a tracing (logging) module integrated into the CPU hardware, and when armed it will generate "hundreds of megabytes of trace data per CPU per second", depending on the settings used. With some settings it may even generate tracing data (the packet log) faster than it can be written to disk or even to RAM ("overflow packets"). According to the https://lwn.net/Articles/648154/ article, perf_events (kernel-mode) in intel_pt mode will just save the full packet log into a separate (bigger?) ring buffer, and the perf tool (user-space) will periodically save data from the ring buffer into the file for offline filtering, parsing and decoding. (The period of saving the aux/ring mmap into the file is not the same as the overflow-interrupt frequency option -F.) A PT decoder is then used to reconstruct the PT packet log into perf-compatible samples. The log data volume is huge, and the overhead is 1%, 5%, 10% or more, depending on the branch frequency of the executed code.
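If you run into the "overflow packets" mentioned above, one knob to try is the AUX area buffer size, which is the second value of perf record's -m option (a sketch; 128M is an arbitrary example size):
# a larger AUX buffer gives the packet log more room before overflowing
perf record -e intel_pt// -m ,128M -- ./a.out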
Documentation of intel_pt is in the manpage man perf-intel-pt and in a long text file inside the Linux kernel source code at
https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-intel-pt.txt
Intel PT is first supported in Intel Core M and 5th generation Intel Core processors that are based on the Intel micro-architecture code name Broadwell.
Trace data is collected by 'perf record' and stored within the perf.data file. ... Trace data must be 'decoded' which involves walking the object code and matching the trace data packets. ... Decoding is done on-the-fly. The decoder outputs samples in the same format as samples output by perf hardware events, for example as though the "instructions" or "branches" events had been recorded. Presently 3 tools support this: 'perf script', 'perf report' and 'perf inject'. ...
The main distinguishing feature of Intel PT is that the decoder can determine the exact flow of software execution. Intel PT can be used to understand why and how did software get to a certain point, or behave a certain way. ...
A limitation of Intel PT is that it produces huge amounts of trace data (hundreds of megabytes per second per core) which takes a long time to decode.
By default, perf record -e intel_pt// is the same as -e intel_pt/tsc=1,noretcomp=0/. The config terms section of the manpage man perf-intel-pt describes the default settings:
tsc: Always supported. Produces TSC timestamp packets to provide timing information. In some cases it is possible to decode without timing information, for example a per-thread context that does not overlap executable memory maps.
noretcomp: Always supported. Disables "return compression" so a TIP packet is produced when a function returns. Causes more packets to be produced but might make decoding more reliable.
pt: Specifies pass-through which enables the branch config term.
branch: Enable branch tracing. Branch tracing is enabled by default.
To represent software control flow, "branches" samples are produced. By default a branch sample is synthesized for every single branch.
As it says, intel_pt in its default mode is used to produce a control flow log, by asking the hardware to generate log packets for every control flow instruction like call, branch, and return, and to add timestamps to synchronize the PT log with service perf samples (like exec or mmap, to find the actual code being loaded into memory). It tries not to generate too much: for example, a single bit is used per conditional branch (TNT; see https://conference.hitb.org/hitbsecconf2017ams/materials/D1T1 - Richard Johnson - Harnessing Intel Processor Trace on Windows for Vulnerability Discovery.pdf#page=12) and several bytes per indirect branch, but there are hundreds of millions of branches per second in many programs.
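Putting this together, a minimal trace-and-decode session might look like the following sketch (decoding one sample per branch is slow and can produce a huge amount of output):
# collect the full control-flow packet log
perf record -e intel_pt// -- ./a.out
# decode it, synthesizing one sample per branch
perf script --itrace=b | head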
Some useful and short slides on perf + intel_pt:
Andi Kleen, 2015 https://halobates.de/pt-tracing-summit15.pdf (PT modes current: Full trace mode, Snapshot mode; Upcoming: Sampling mode, Core dump, System crash mode)
Andi Kleen's posts on PT: https://halobates.de/blog/p/category/pt
Suchakrapani Datt Sharma, POLYTECHNIQUE MONTREAL, 2015 https://hsdm.dorsal.polymtl.ca/system/files/10Dec2015_0.pdf (trace packets overview - PSB (Packet Stream Boundary), TNT (Taken Not-Taken), TIP (Target IP) at branches, non-default CYC Packets : Cycle counter data for IPC, MTC (Mini Timestamp Counter), ...)
Jack Henschel, 2017 about design and use-cases https://blog.cubieserver.de/publications/Henschel_Intel-PT_2017.pdf
Alexander Shishkin, Intel, 2013, Efficient and Large Scale Program Flow Tracing in Linux: https://events.static.linuxfound.org/sites/events/files/slides/lcna13_kleen.pdf ("What is it good for? • Profiling / performance measurement • Functional debugging • Code coverage analysis")
About generic difference between sampling and (software) tracing: https://danluu.com/perf-tracing/
Update: While the Intel PT trace log has the full trace (there are packets inside for every branch/call/return), perf report runs a conversion from the PT log into a sample set like in classic perf.data, and there is a sampling rate for that sample set. This is configured with the --itrace option of perf report (iNNTT, where NN is the amount and TT is the type: i/t/ms/us/ns), as described in the man page of perf-report:
--itrace
Options for decoding instruction tracing data. The options are:
i synthesize instructions events
g synthesize a call chain (use with i or x)
The default is all events i.e. the same as --itrace=ibxwpe,
In addition, the period (default 100000, ...)
for instructions events can be specified in units of:
i instructions
t ticks
ms milliseconds
us microseconds
ns nanoseconds (default)
So it seems that by default perf report will convert the full trace log into instruction samples at a sampling rate of 100000 instructions (1 perf sample generated per 100 thousand instructions). It can be changed to a higher rate, but processing time will increase.
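For example, to synthesize one sample per 1000 instructions instead of the default 100000 (a sketch; expect the decode step to take longer):
perf record -e intel_pt// -- ./a.out
perf report --itrace=i1000i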
The manpage of perf-intel-pt gives more examples of --itrace option usage:
Because samples are synthesized after-the-fact, the sampling period can be selected for reporting, e.g. sample every microsecond:
sudo perf report pt_ls --itrace=i1usge
See the sections below for more information about the --itrace option.
Beware the smaller the period, the more samples that are produced, and the longer it takes to process them. Also note that the coarseness of Intel PT timing information will start to distort the statistical value of the sampling as the sampling period becomes smaller.
To see every possible IPC value, "instructions" events can be used, e.g. --itrace=i0ns
--itrace=i10us
sets the period to 10us i.e. one instruction sample is synthesized for each 10 microseconds of trace. Alternatives to "us" are "ms" (milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions).
For Intel PT, the default period is 100us. Setting it to a zero period means "as often as possible". In the case of Intel PT that is the same as a period of 1 and a unit of instructions (i.e. --itrace=i1i).
http://halobates.de/blog/p/410 has some additional examples of complex conversions:
perf script --ns --itrace=cr
Record program execution and display function call graph.
perf script by default "samples" the data (it only dumps a sample every 100us). This can be configured using the --itrace option (see reference below).
perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64
Show every assembly instruction executed with disassembler.
perf report --itrace=g32l64i100us --branch-history
Print hot paths every 100us as call graph histograms
perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg
Generate flame graph from execution, sampled every 100us
I am trying very hard to find the documentation for uinput, but the only thing I have found is linux/uinput.h. I have also found some tutorials on the internet, but no documentation at all!
For example, I would like to know what UI_SET_MSCBIT does, but I can't find anything about it.
How do people know how to use uinput?
Well, it takes some investigation effort for such subtle things. From the drivers/input/misc/uinput.c and include/uapi/linux/uinput.h files you can see the bits for the UI_SET_* definitions, like this:
MSC
REL
LED
etc.
Run the next command in the kernel sources directory:
$ git grep --all-match -e 'MSC' -e 'REL' -e 'LED' -- Documentation/*
or use regular grep if your kernel tree doesn't have a .git directory:
$ grep -rl MSC Documentation/* | xargs grep -l REL | xargs grep -l LED
You'll get this file: Documentation/input/event-codes.txt, from which you can see:
EV_MSC: Used to describe miscellaneous input data that do not fit into other types.
EV_MSC events are used for input and output events that do not fall under other categories.
A few EV_MSC codes have special meaning:
MSC_TIMESTAMP: Used to report the number of microseconds since the last reset. This event should be coded as an uint32 value, which is allowed to wrap around with no special consequence. It is assumed that the time difference between two consecutive events is reliable on a reasonable time scale (hours). A reset to zero can happen, in which case the time since the last event is unknown. If the device does not provide this information, the driver must not provide it to user space.
I'm afraid this is the best you can find out there for UI_SET_MSCBIT.
I am working on simulations of digital logic built in Verilog and need to restart a simulation very often to see changes. I am using Cadence SimVision to review the waveforms.
Is there a way to write commands in Verilog for the SimVision environment? I mean things like probes and parameters.
It is not Verilog, but you can create a Tcl file.
shm.tcl:
database -open waves -shm
probe -create your_top_level -depth all -all -shm -database waves
run
exit
Now to run your simulation use:
irun -access +r testcase.sv -input shm.tcl
It's not standard Verilog, but the Cadence tools (ncvlog, ncsim, Incisive) will allow you to set probes from within the Verilog/SV source using system tasks.
Check the documentation for $shm_open and $shm_probe.
initial begin
$shm_open("waves.shm");
$shm_probe("AS");
end
That said, the answer from @Morgan is the recommended way to do it, so that you can control it at runtime.
Brendan D. Gregg (author of the DTrace book) has an interesting variant of profiling: "Off-CPU" profiling (and the Off-CPU Flame Graph; slides from 2013, pp. 112-137), to see where the thread or application was blocked (not executing on a CPU, but waiting for I/O, a page-fault handler, or descheduled due to a shortage of CPU resources):
This time reveals which code-paths are blocked and waiting while off-CPU, and for how long exactly. This differs from traditional profiling which often samples the activity of threads at a given interval, and (usually) only examine threads if they are executing work on-CPU.
He can also combine Off-CPU and On-CPU profile data together: http://www.brendangregg.com/FlameGraphs/hotcoldflamegraphs.html
The examples given by Gregg are made using DTrace, which is not usually available on Linux. But there are some similar tools (ktap, systemtap, perf), and perf, I think, has the widest installed base. Usually perf generates On-CPU profiles (which functions were executed more often on the CPU).
How can I translate Gregg's Off-CPU examples to the perf profiling tool in Linux?
PS: There is a link to a SystemTap variant of Off-CPU flame graphs in the slides from LISA13, p. 124: "Yichun Zhang created these, and has been using them on Linux with SystemTap to collect the profile data. See: http://agentzh.org/misc/slides/off-cpu-flame-graphs.pdf" (CloudFlare Beer Meeting on 23 August 2013)
The perf technique I published[1] was a high-overhead workaround, until perf gains BPF support for doing this.
Right now, the lowest-cost way of generating an off-CPU flame graph on Linux is on a 4.6+ kernel (which has BPF stack trace support) with bcc/BPF. I wrote a tool for it, offcputime[2], which can be run with the -f option for "folded output", suitable for feeding into flamegraph.pl. This offcputime tool does the timing and stack counting all in kernel context, and dumps a report that is then printed with symbols.
One day, I expect that perf itself will be able to do this as well: run a BPF program that does the in-kernel counting, and dump a report.
In the meantime, we can use bcc/BPF. If for some reason you can't use bcc, you can, right now, take that offcputime program and write it in C. A more complicated version is available in the Linux source, as samples/bpf/offwaketime*. With the new BPF features on Linux, if there's a will, there's a way.
[1] http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
[2] https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt
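A minimal sketch of the bcc/BPF approach described above (the tool path is an assumption for a typical bcc install; offcputime's folded output counts microseconds):
# trace off-CPU stacks system-wide for 30 seconds, folded output
/usr/share/bcc/tools/offcputime -f 30 > out.folded
# render the off-CPU time flame graph
./flamegraph.pl --colors=io --countname=us < out.folded > offcpu.svg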
Brendan Gregg published instructions on generating Off-CPU flame graphs:
http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
and https://github.com/brendangregg/FlameGraph/issues/47#
Off-CPU time flame graphs may solve (say) 60% of the issues, with the remainder requiring walking the thread wakeups to find root cause. I explained off-CPU time flame graphs, this wakeup issue, and additional work, in my LISA13 talk on flame graphs (slides, youtube).
Here I'll show one way to do off-CPU time flame graphs using Linux perf_events.
# perf record -e sched:sched_stat_sleep -e sched:sched_switch \
-e sched:sched_process_exit -a -g -o perf.data.raw sleep 1
# perf inject -v -s -i perf.data.raw -o perf.data
# perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
./stackcollapse.pl | \
./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg
stackcollapse.pl and flamegraph.pl from Gregg are used to draw the flame graph.
The perf options used here require kernels 3.17 and newer...
I have a problem with mixing sounds with delay.
I run this:
sox -M f1.wav f1.wav f1.wav f1.wav out.wav delay 3 3 4 4 5 5
In the final file the volume of the sound changes (decreases). How can I avoid this?
You can also control the volume of each of the signals using -v. You hadn't asked for this, but in my experience you might use it at some point, and it took me a while to find this option online.
sox -m -v 1 file1.wav -v 0.5 file2.wav out.wav
Hope someone finds this useful.
man sox does not describe automatic attenuation for merge (-M), which surprises me. The following applies to "mix" mode (-m):
Unlike the other methods, 'mix' combining has the potential to cause clipping in the combiner if no balancing is performed. So here, if manual volume adjustments are not given, to ensure that clipping does not occur, SoX will automatically adjust the volume (amplitude) of each input signal by a factor of 1/n, where n is the number of input files. If this results in audio that is too quiet or otherwise unbalanced then the input file volumes can be set manually as described above; using the norm effect on the mix is another alternative.
So even though the manual explicitly excludes all modes other than "mix", I would certainly try the 'norm' effect or otherwise specify volume adjustments, since I cannot see how one would merge signals whilst avoiding clipping without attenuating somewhere.
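As a sketch of those two suggestions applied to the original command (the volume factors and the -3 dB norm level are just example values):
# give every input an explicit volume so SoX performs no automatic adjustment
sox -M -v 1 f1.wav -v 1 f1.wav -v 1 f1.wav -v 1 f1.wav out.wav delay 3 3 4 4 5 5
# or normalize the combined result to avoid clipping
sox -M f1.wav f1.wav f1.wav f1.wav out.wav delay 3 3 4 4 5 5 norm -3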