Rocket Chip (RISC-V) accelerator performance evaluation

I have implemented an accelerator on the Rocket Chip generator using RoCC. How do I compute the performance of the accelerator and compare it with a C implementation? I have written the C implementation and compute the cycle count as "Cycles = End - Begin", where End and Begin are calls to read_csr(mcycle). I use it by including dhrystone.h from the riscv-tests GitHub repository. Is this the right way to count cycles?
I am thinking of using this technique for both the accelerator and the C implementation.
Can I use the CSRs for this purpose?

This is a great use of the RISC-V Hardware Performance Monitors (HPMs). If you are running your benchmark in machine mode, you can read mcycle to measure the passage of cycles. If you are running in user mode, you have instructions like rdcycle to give you user-level access to the cycle counter.
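For reference, here is a minimal sketch of that approach in C, assuming a bare-metal RV64 target and GCC-style inline assembly (the read_cycles helper and the function under test are placeholders, not part of the question's code):

#include <stdint.h>

// Read the cycle counter. In user mode the rdcycle pseudo-instruction reads
// the cycle CSR; in machine mode you could use "csrr %0, mcycle" instead.
static inline uint64_t read_cycles(void) {
    uint64_t c;
    asm volatile ("rdcycle %0" : "=r"(c));
    return c;
}

// Usage: wrap the region of interest and subtract.
// uint64_t begin  = read_cycles();
// run_kernel();                      // hypothetical accelerator or C version
// uint64_t end    = read_cycles();
// uint64_t cycles = end - begin;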

Related

How do I monitor the amount of SIMD instruction usage

How can I monitor the amount of SIMD (SSE, AVX, AVX2, AVX-512) instruction usage of a process? For example, htop can be used to monitor general CPU usage, but not specifically SIMD instruction usage.
I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE).
See How to characterize a workload by obtaining the instruction type breakdown? and How do I determine the number of x86 machine instructions executed in a C program?. Specifically, sde64 -mix -- ./my_program prints the instruction mix for your program for that run; see the example output in "libsvm compiled with AVX vs no AVX".
I don't think there's a good way to make this work like top / htop, if it's even possible to safely attach to already-running processes, especially multi-threaded ones.
It might also be possible to get dynamic instruction counts using last-branch-record stuff to record / reconstruct the path of execution and count everything, but I don't know of tools for that. In theory that could attach to already-running programs without much danger, but it would take a lot of computation (disassembling and counting instructions) to do it on the fly for all running processes. Not like just asking the kernel for CPU usage stats that it tracks anyway on context switches.
You'd need hardware instruction-counting support for this to be really efficient the way top is.
For SIMD floating point math specifically (not FP shuffles, just real FP math like vaddps), there are perf counter events.
e.g. from perf list output:
fp_arith_inst_retired.128b_packed_single
[Number of SSE/AVX computational 128-bit packed single precision
floating-point instructions retired. Each count represents 4
computations. Applies to SSE* and AVX* packed single precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT
DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as
they perform multiple calculations per element]
So it's not even counting uops, it's counting FLOPS. There are other events for ...pd packed double, and 256-bit versions of each. (I assume on CPUs with AVX512, there are also 512-bit vector versions of these events.)
You can use perf to count their execution globally across processes and on all cores, or for a single process:
## count math instructions only, not SIMD integer, load/store, or anything else
perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u ./my_program
# (the events are written out in full because bash brace expansion would separate them with spaces rather than commas)
(Intentionally omitting fp_arith_inst_retired.scalar_{double,single} because you only asked about SIMD and scalar instructions on XMM registers don't count, IMO.)
(You can attach perf to a running process by using -p PID instead of a command, or use perf top as suggested in Ubuntu - how to tell if AVX or SSE is currently being used by a CPU app?)
You can run perf stat -a to monitor globally across all cores, regardless of what process is executing. But again, this only counts FP math, not SIMD in general.
Still, it is hardware-supported and thus could be cheap enough for something like htop to use without wasting a lot of CPU time if you leave it running long-term.
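As a rough sketch of how a monitoring tool could read such a counter programmatically (my assumption, not something the answer above shows), the Linux perf_event_open(2) syscall is the interface perf stat itself is built on. The example below counts retired user-space instructions for the calling process; counting the fp_arith_inst_retired.* events specifically would need PERF_TYPE_RAW with a PMU-specific config value, which is not shown here.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   // swap for a raw PMU event to count SIMD FP
    attr.disabled = 1;
    attr.exclude_kernel = 1;                    // user space only, like the :u modifier

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... workload under measurement ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}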

What is the difference between a soft core (Nios) and a hard core?

I have recently been learning FPGAs. I have tried to use SDRAM, and somebody recommended that I access it through Nios II. But I have seen some articles saying that using an IP core through Nios II (C/C++) may be slower than writing it in Verilog. Why? Is it because hardware is fast and parallel while software is not?
What is a soft CPU? FPGAs are composed of, among other things, reconfigurable logic blocks (LUTs), memory, and multipliers/DSPs. A soft CPU is a CPU made up of the FPGA's configurable logic. Nios II is Altera/Intel's flavour of a soft CPU. This differs from a hardened CPU like the ARM cores included in many Altera/Intel and Xilinx SoC FPGAs. In those cases, the ARM cores are made of fixed transistors instead of FPGA fabric, and cannot be reconfigured for other purposes.
Why have hardened CPUs? They're typically faster than soft CPUs, take up less space, and don't consume any of the valuable FPGA routing. Since many designs use some sort of CPU, hardening one (as is done with many popular I/O interfaces) produces an overall net gain. (If you don't need a CPU, you can simply buy a non-SoC FPGA.)
As for using a CPU vs pure logic/hardware, there are also tradeoffs. Writing software is typically easier than writing Verilog, and your CPU will be set up to manage things like response times and other memory quirks. However, you'll be restricted by the CPU speed (Nios II is typically 100-200 MHz, depending on your FPGA), the extra latency of needing to interface with a CPU, and the CPU's instruction execution speed.
In a similar vein to why FPGAs are gaining popularity, pure-hardware circuits have a degree of specialization that can allow them to operate faster than a more general-purpose CPU (either soft or hardened). The tradeoff for that speed boost is the extra work involved in writing timing-accurate Verilog.

Are RISC-V instruction execution durations standardized for the sake of cryptographic security?

Some cryptographic functions require a consistent execution duration to avoid timing attacks. I have read that such functions targeting x86 are hard to write, for reasons potentially including the emulated nature of the ISA and out-of-order processing. Preventing timing attacks on x86 is therefore not easy, because the timing depends on complex and/or unknown factors at any given moment.
In a standard RISC-V core, are instruction timings predictably consistent relative to each other? What about in the case of a standard core with out-of-order processing, or proprietary implementations of the base ISA?
RISC-V could be implemented in a machine with deterministic latencies; this has more to do with the implementation than with the ISA.
See this project for a RISC-V implementation that supports predictable-latency execution: https://github.com/pretis/flexpret. It was developed for the embedded space, but would seem to be suitable for your proposed application as well.
It is important to differentiate an ISA from an implementation of it. Nothing in the RISC-V spec mandates instruction execution latencies. Most implementations will do whatever gives them the highest performance. A security-paranoid processor could be designed to have consistent latencies for all instructions and yet still conform to the RISC-V spec.
A nice feature of RISC-V is that plenty of opcode space was intentionally left unused to make room for ISA extensions. There appear to be no publicly announced plans for a crypto extension yet, so such a requirement could be incorporated into a crypto extension if and when one is defined.
I'm not sure about the core itself, but I've read that in the RISC-V Cryptography Extensions Volume I (riscv-crypto-spec-scalar-v1.0.1.pdf), the following is required of cryptographic instructions:
This instruction must always be implemented such that its execution latency does not depend on the data being operated on.
So in the context of cryptographic-specific instructions, yes.
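As an aside (my illustration, not part of the spec text), this is what data-independent timing looks like one level up, in software: a constant-time comparison touches every byte and avoids data-dependent branches, unlike a typical memcmp that returns early on the first mismatch.

#include <stddef.h>
#include <stdint.h>

// Returns 1 if the two buffers are equal, 0 otherwise, in time that depends
// only on len, not on the contents of a or b.
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];        // accumulate differences without branching
    return diff == 0;
}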
"is there a standard for how long each instruction should take to complete relative to other operations?"
No.
Such behavior is consistent with all other major ISAs, as far as I am aware.
An out-of-order processor will execute instructions as their dependencies resolve. Cache misses and the potentially random nature of issue select will mean that successive loop iterations will behave differently with regards to when instructions execute relative to one another. Any number of other micro-architecture issues get in the way, including instruction fetch misses, dcache misses, resource stalls causing replays, etc. Even a typical in-order core will face such issues.
how does the RISC-V team plan to address potential standard or non-standard complexity that a cryptographic library developer must find some way to address?
I can't speak for the RISC-V team, but if I may hazard a guess, I suspect that these (and similar) areas will be discussed and addressed by the wider community.

Implementation of clock()

I was going through stackoverflow threads on various mechanisms for computing CPU time of a process.
How is clock() internally implemented? Does it use rdtsc()? (If that's the case, then it is sensitive to migration between cores.)
Also, how is getrusage() implemented? Does it also depend on the TSC?
Thanks in advance
The kernel keeps track of CPU utilization for processes in units of ticks.
Both clock() and getrusage() are based on these.
Ticks are accumulated for processes by the kernel using a sampling method, in which the kernel receives a hardware interrupt from the clock and executes the clock handler, which charges the tick to the currently running process. At least, this is how it worked the last time I looked.
So rdtsc does not come into play at all, which is a good thing, since rdtsc does not measure accurately across CPUs.
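A minimal sketch of the two interfaces discussed above (standard C and POSIX calls; the busy loop is just placeholder work to measure):

#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

int main(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)   // placeholder CPU-bound work
        x += 1e-9;

    // clock(): process CPU time in CLOCKS_PER_SEC units (tick-based underneath).
    clock_t c = clock();
    printf("clock(): %.3f s of CPU time\n", (double)c / CLOCKS_PER_SEC);

    // getrusage(): user and system CPU time reported separately.
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("getrusage(): user %ld.%06ld s, sys %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}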
You could easily glance at some libc code. Here is the time/ directory of musl-libc
In several libraries, some low-level timing syscalls use the vDSO to avoid paying the cost of a real syscall (the transition from user space to the kernel and back), and so may end up using RDTSC under the hood.
But I am surprised that you ask. If it is curiosity, just study the source code of a free-software implementation. Otherwise, trust the specifications and the implementations.
The gory details can be complex, since they are implementation- and system-specific. The real implementation could be dynamically tuned at run time (e.g. through vDSO set-up by the kernel).

Time Stamp Counter

I am using the time stamp counter in my C++ program by querying the register. However, one problem I encounter is that the function that acquires the time stamp may read it from a different CPU each time. How can I ensure that my function always acquires the timestamp from the same CPU, or is there any way to synchronize the CPUs? By the way, my program is running on a 4-core server under Fedora 13 64-bit.
Thanks.
Look at the following excerpt from the Intel manual. According to section 16.12, I think the "newer processors" below refers to any processor newer than the Pentium 4. You can simultaneously and atomically determine the TSC value and the core ID using the rdtscp instruction if it is supported. I haven't tried it though. Good luck.
Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3 (3A & 3B): System Programming Guide:
Chapter 16.12.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred
to as invariant TSC. Processor’s support for invariant TSC is indicated by
CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is
the architectural behavior moving forward. On processors with invariant TSC
support, the OS may use the TSC for wall clock timer services (instead of ACPI or
HPET timers). TSC reads are much more efficient and do not incur the overhead
associated with a ring transition or access to a platform resource.
Intel also has a guide on code execution benchmarking that discusses cpu association with rdtsc - http://download.intel.com/embedded/software/IA/324264.pdf
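For completeness, a sketch of the rdtscp approach mentioned above, using the GCC/Clang __rdtscp intrinsic (assumes an x86 CPU that supports RDTSCP; the aux value comes from IA32_TSC_AUX, which the OS typically programs with processor ID information):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void) {
    unsigned aux_start, aux_end;
    uint64_t start = __rdtscp(&aux_start);   // TSC value plus IA32_TSC_AUX

    /* ... code being timed ... */

    uint64_t end = __rdtscp(&aux_end);
    if (aux_start != aux_end)
        printf("warning: migrated between CPUs, the delta may be meaningless\n");
    else
        printf("elapsed: %llu TSC ticks (aux id %u)\n",
               (unsigned long long)(end - start), aux_start);
    return 0;
}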
In my experience, it is wise to avoid TSC altogether, unless you really want to measure individual clock cycles on individual cores/CPUs.
Potential problems with TSC:
Frequency scaling. Counter does not increment linearly with time...
Different clocks on different CPUs/cores (I would not rule out different frequency scaling on different CPUs, or even differently clocked CPUs - though the latter should be rare).
Unsynchronized counters on different CPUs/cores (even if they use the same frequency).
This basically boils down to the fact that you can only use the TSC to measure elapsed CPU cycles (not elapsed time) on a single CPU in a single-threaded application, and only if you force the thread's affinity.
The preferred alternative is to use system functions. The most portable (on Unix/Mac) is gettimeofday(), which is usually very accurate. A more appropriate function might be clock_gettime(), but check if it is supported on your system first. Under Windows you can safely use QueryPerformanceCounter().
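A minimal sketch of the clock_gettime() route (POSIX; CLOCK_MONOTONIC is chosen here, and older glibc versions may need linking with -lrt):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* ... code being timed ... */

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.9f s\n", elapsed);
    return 0;
}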
You can use sched_setaffinity, or the cpuset feature, which lets you create a cpuset and assign tasks to it.
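A sketch of that affinity approach (Linux-specific; pinning to CPU 0 is an arbitrary choice for illustration):

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                          // allow this thread on CPU 0 only
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... TSC-based timing now always reads the same core's counter ... */
    return 0;
}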
