how to get a critical path / bottleneck analysis of FIRRTL code? - riscv

I have some FIRRTL and I want to get a critical path / bottleneck analysis of the code so as to maximize the clock rate by minimizing the critical path.
I can write a weighted topological sort myself, but I do not know what weights I should use for the various circuit components, or how to account for fanout slowdown.
I have heard the RISC-V grad students speak of running a critical path analysis when optimizing their chips, so the Chisel / RISC-V infrastructure must provide one. I would expect this to be a flag on the firrtl tool, but I see no such flag.

This is normally done through a synthesis/P&R tool such as Genus/Innovus. While you could look at the FIRRTL/RTL to get a rough picture of your design, there are many factors that ultimately affect the timing.
For example, all other things being equal, a 4-gate combinational path will run faster than a 5-gate combinational path. The issue is that where the gates are placed, their drive strength (and your power requirements), routing, etc. will likely contribute more to timing than the gates themselves.
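That said, if you only want a first-order, pre-synthesis estimate, the weighted topological sort mentioned in the question is straightforward to sketch. Below is a minimal Python sketch under made-up assumptions: the gate delays, the fanout penalty, and the toy netlist are placeholder values, not numbers from any real cell library, so treat the output as a relative comparison at best.

```python
from collections import defaultdict

# Hypothetical per-gate delays (arbitrary units) and fanout penalty.
# Real weights would come from a cell library / synthesis timing report.
GATE_DELAY = {"and": 1.0, "or": 1.0, "xor": 1.4, "mux": 1.6, "reg": 0.0}
FANOUT_PENALTY = 0.1  # extra delay per driven input (made-up weight)

def critical_path(nodes, edges):
    """Longest weighted path through a combinational DAG.

    nodes: dict mapping node name -> gate type
    edges: list of (driver, sink) pairs
    """
    fanout, preds, succs = defaultdict(int), defaultdict(list), defaultdict(list)
    for src, dst in edges:
        fanout[src] += 1
        preds[dst].append(src)
        succs[src].append(dst)

    # Kahn-style topological traversal, propagating arrival times.
    indeg = {n: len(preds[n]) for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    arrival, best_pred = {}, {}
    while ready:
        n = ready.pop()
        delay = GATE_DELAY[nodes[n]] + FANOUT_PENALTY * fanout[n]
        arrival[n] = max((arrival[p] for p in preds[n]), default=0.0) + delay
        if preds[n]:
            best_pred[n] = max(preds[n], key=lambda p: arrival[p])
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # Trace back from the slowest endpoint to recover the critical path.
    end = max(arrival, key=arrival.get)
    path = [end]
    while path[-1] in best_pred:
        path.append(best_pred[path[-1]])
    return arrival[end], list(reversed(path))

# Toy netlist for illustration only.
nodes = {"a": "reg", "b": "reg", "g1": "xor", "g2": "and", "g3": "mux", "q": "reg"}
edges = [("a", "g1"), ("b", "g1"), ("a", "g2"),
         ("g1", "g3"), ("g2", "g3"), ("g3", "q")]
length, path = critical_path(nodes, edges)
print(f"critical path length {length:.2f}: " + " -> ".join(path))
```

This arrival-time propagation is essentially what a static timing engine does, only with far more accurate delay models, which is why the synthesis/P&R report remains the number to trust.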

Treat signal as a clock in Verilog

For example, here is the diagram from a previous question that I want to ask about.
If I treat a data signal as a clock, as in the diagram, and write it into Verilog code, what are the disadvantages?
Will we perhaps run into problems during synthesis or implementation in the tools that we use?
It actually works fine when I program the code above onto my FPGA.
Short Answer
Unreliable Sporadic Behaviour!
Long Answer
FPGA & ASIC designs use what is sometimes called synchronous design methodology. The basic idea is that clock pins are always driven by a clock. This allows synthesis tools to perform an analysis called 'static timing', which gives a degree of confidence that the design will operate properly, because all the delays have been analyzed to be within the designer's constraints.
In the design shown, the delay on the Q output of the first stage is a determining factor in the correct operation of the circuit. Designers want to reduce this dependence on delay, restricting the timing concerns to those that can be checked by static timing analysis.
The style shown is used in older references (my college digital design textbook in the 90's had these) and is sometimes part of what is called a 'ripple counter'. This was a popular method of digital design prior to the prevalence of FPGAs and ASICs. In those days digital circuits were built using discrete logic on a printed circuit board, and the design concerns were different.
It's a bit difficult to find information on this topic. This post discusses the same topic a bit but does not go deep into the main point.
https://electronics.stackexchange.com/questions/115967/what-is-a-ripple-clock
One reason that it's difficult to find information is that the term 'asynchronous design' has different meanings, and the more ubiquitous meaning pertains to the design of digital circuits where feedback around combinational logic is used: the logic settles or 'latches' into a stable state. This is different from the discussion here, whose main idea is 'always drive clock pins with a clock'.
Another bad practice that was part of asynchronous design was using the asynchronous reset pin of a flip-flop as control logic. In synchronous design the asynchronous reset pin is often not used, and when it is used, it's asserted asynchronously, de-asserted synchronously, and used mostly for global power-on resets.
This is a reply to a similar issue discussed on the Xilinx question & answer forum.
https://support.xilinx.com/s/question/0D52E0000757EsGSAU/that-dangerous-asynchronous-reset?language=en_US
The author (Xilinx engineer Ken Chapman) used the phrase 'Unreliable Sporadic Behavior' in the answer.
Another (good) synchronous design practice is to use very low-skew clock resources to distribute the clock, so that the clock is effectively changing at the same time everywhere in the physical part.
Use synchronous design techniques & static timing as part of verification and to save debug effort for more important issues.
The term 'synchronous design' has largely been forgotten since the 90's and is not widely used; it's just the way designs are done now. Searching for 'static timing' would be helpful for understanding these concepts. A complete answer to 'what is static timing analysis?' is beyond the scope of this question.
Do the following as a basis for synchronous design:
Drive clock pins with a clock
Use a clock buffer or clock tree to distribute the clock
Have a corresponding reset for each clock
Don't use asynchronous reset pins as control
Learn how to cross clock domains
Specify a timing constraint for each clock
Perform static timing analysis, understand the results

Use SimPy to simulate Chord distributed system

I am doing some research on several distributed systems such as Chord, and I would like to be able to write algorithms and run simulations of the distributed system with just my desktop.
In the simulation, I need each node to execute independently and communicate with the others, while I manually induce elements such as lag, packet loss, random crashes, etc., and then collect data to estimate the performance of the system.
After some searching, I found SimPy to be a good candidate for my purpose.
Would SimPy be a suitable library for this task?
If yes, what are some suggestions/caveats for implementing such a system?
I would say yes.
I used SimPy (version 2) for simulating arbitrary communication networks as part of my doctorate. You can see the code here:
https://github.com/IncidentNormal/CommNetSim
It is, however, a bit dense and not very well documented. Also it should really be translated to SimPy version 3, as 2 is no longer supported (and 3 fixes a bunch of limitations I found with 2).
Some concepts/ideas I found to be useful:
Work out what you want out of the simulation before you start implementing it; communication network simulations are incredibly sensitive to small design changes, as you are effectively trying to monitor/measure emergent behaviours from the system.
It's easy to start over-engineering the simulation; using native SimPy objects is almost always sufficient once you strip away the noise from your design.
Use Stores to simulate mediums for transferring packets/payloads (see the sketch after these notes). There is an example like this for simulating latency in the SimPy docs: https://simpy.readthedocs.io/en/latest/examples/latency.html
Events are tricky, as they can only fire once per simulation step, so they can often be a source of bugs: behaviour is effectively lost if multiple things fire the same event in one step. For robustness, try not to use them to represent behaviour in communication networks (you rarely need something that low-level); as mentioned above, use Stores instead, as these act like queues by design.
Pay close attention to the probability distributions you use to generate randomness. Expovariate distributions are usually closer to how natural systems behave than uniform distributions, but make sure to sanity-check every distribution you use. Network traffic generation usually follows a Poisson process, for example, and data volume often follows a power-law (Pareto) distribution.
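To make the Store and distribution points above concrete, here is a minimal SimPy 3/4-style sketch of a lossy, delayed link fed by Poisson (expovariate) traffic. The Link/sender/receiver names and the LINK_DELAY, LOSS_PROB, ARRIVAL_RATE parameters are invented for illustration; only Environment, Store, timeout, process and run are actual SimPy API.

```python
import random
import simpy

# Made-up parameters for illustration only.
LINK_DELAY = 5.0    # mean one-way latency (time units)
LOSS_PROB = 0.1     # probability a packet is silently dropped
ARRIVAL_RATE = 0.5  # packets per time unit (Poisson arrivals)

class Link:
    """A lossy, delayed medium built on a simpy.Store (acts like a queue)."""
    def __init__(self, env):
        self.env = env
        self.store = simpy.Store(env)

    def _deliver(self, packet):
        # Exponentially distributed one-way delay, mean LINK_DELAY.
        yield self.env.timeout(random.expovariate(1.0 / LINK_DELAY))
        self.store.put(packet)

    def send(self, packet):
        if random.random() >= LOSS_PROB:  # drop some packets at random
            self.env.process(self._deliver(packet))

    def recv(self):
        return self.store.get()

def sender(env, link):
    seq = 0
    while True:
        yield env.timeout(random.expovariate(ARRIVAL_RATE))  # Poisson traffic
        link.send((seq, env.now))
        seq += 1

def receiver(env, link, stats):
    while True:
        seq, sent_at = yield link.recv()
        stats.append(env.now - sent_at)  # one-way latency sample

env = simpy.Environment()
link = Link(env)
latencies = []
env.process(sender(env, link))
env.process(receiver(env, link, latencies))
env.run(until=1000)
print(f"delivered={len(latencies)} mean_latency={sum(latencies)/len(latencies):.2f}")
```

For a Chord-style overlay, one such link per direction between communicating nodes, plus one process per node, is usually enough structure; a crash could be modelled by a node process simply stopping its receive loop.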

Any benefits from implementing CSA versus just using multiplication symbol when synthesizing?

I am synthesizing some multiplication units in Verilog, and I was wondering whether you generally get better results in terms of area/power savings if you implement your own CSA using Booth encoding when multiplying, or if you just use the * symbol and let the synthesis tool take care of the problem for you?
Thank you!
Generally, I tend to trust the compiler tools I use and don't fret so much about the results as long as they meet my timing and area budgets.
That said, with multipliers that need to run at fast speeds I find I get better results (in DC, at least) if I create a Verilog module containing the multiply (*) and a retiming register or two, and push down into this module to synthesise it before popping up to toplevel synthesis. It seems as if the compiler gets 'distracted' by other timing paths if you try to do everything at once, so making it focus on a multiplier that you know is going to be tricky seems to help.
You have this question tagged with "FPGA." If your target device is an FPGA, then it may be advisable to use the FPGA's multiplier megafunction (I don't remember what Xilinx calls it these days).
This way, you can be sure that the tool utilizes whatever internal hardware structure you intend to use, irrespective of the synthesis tool, and you get a solution that is predictable from a timing and latency standpoint.
Additionally, you don't have to test it for all the corner cases, which is especially important if you are doing signed multiplication, depending on what coding guidelines you follow.
I agree with #Marty in that I would use *. I have previously built my own low-power adder structures, which then ran into problems when the design shifted process or had to run at a higher frequency. Hard-coded architectures like this remove quite a bit of portability from the code.
Using the directives is nice in trials to see the different sizes (area) of the architectures, but I leave the decision to the synthesis tool to make the best call based on the timing constraints and available area. I am not sure how power-aware the tools are by default. Previously we ended up getting an extra license which added a lot of power-aware knowledge to the synthesis.

Measuring the scaling behaviour of multithreaded applications

I am working on an application which supports many-core MIMD architectures (on consumer/desktop computers). I am currently worried about the scaling behaviour of the application. It's designed to be massively parallel and to address next-gen hardware. That's actually my problem: does anyone know any software to simulate/emulate many-core MIMD processors with >16 cores at the machine-code level? I've already implemented a software-based thread scheduler with the ability to simulate multiple processors, using simple timing techniques.
I was curious whether there's any software which could do this kind of simulation at a lower level, preferably at the assembly-language level, to get better results. I want to emphasize once again that I'm only interested in MIMD architectures. I know about OpenCL/CUDA/GPGPU, but that's not what I'm looking for.
Any help is appreciated and thanks in advance for any answers.
You will rarely find all-purpose testing tools that are ALSO able to target very narrow (high-performance) corners, for a simple reason: the overhead of the "general-purpose" approach defeats that goal in the first place.
This is especially true with parallelism, where locality and scheduling have a huge impact.
All this to say that I am afraid you will have to write your own testing tool to target your exact usage pattern.
That's the price to pay for relevance.
If you are writing your application in C/C++, I might be able to help you out.
My company has created a tool that will take any C or C++ code and simulate its run-time behavior at bytecode level to find parallelization opportunities. The tool highlights parallelization opportunities and shows how parallel tasks behave.
The only mismatch is that our tool will also provide refactoring recipes to actually get to parallelization, whereas it sounds like you already have that.

How can the reliability of Software be checked through analysis?

How can we analyze software reliability? How do we check the reliability of an application or product?
First try to define "software reliability" and the way to quantify it.
If you accomplish this task, you will probably be able to "check" this characteristic.
The most effective way to check reliability is going to be to run your software and gather statistics on its actual reliability. There are too many variables in play, both at the hardware and software levels, to realistically analyze reliability prior to execution, with the possible exception of groups with massive resources like NASA.
There are various methods for determining whether a piece of software meets a specification, but most of the really productive ones do this by construction, i.e., by constraining the way in which the software is written so that it can be easily shown to be correct. Check out VDM, Z and the B toolkit for schemes for doing this sort of thing. Note that these tend to be expensive ways to program if you're not in a safety-critical systems environment.
Proving the correctness of the specification itself is really non-trivial!
Reliability is about continuity of correct service.
The best approach to assess reliability of a software is by dynamic analysis, in other words: testing.
In order to reduce your testing time, you may want to apply input profiles different from the operational one.
Apply various input distributions and measure how long your software runs without failure. Then determine how far your input distributions are from the operational profile and draw your conclusion about how long the software would have run under the operational profile.
This involves modeling techniques such as Markov chains or stochastic Petri nets.
For further digging, useful keywords are: fault forecasting and statistical testing.
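As a rough illustration of the profile re-weighting idea above (not of the Markov-chain or Petri-net machinery, which adds the time dimension), here is a small Python sketch. The input classes, the operational profile, and the "true" failure probabilities are all invented placeholders; in practice the per-class failure rates come from actually running the software under test.

```python
import random

# Invented input classes and profiles, for illustration only.
INPUT_CLASSES = ["small_request", "large_request", "malformed_request"]
OPERATIONAL_PROFILE = {"small_request": 0.70, "large_request": 0.25, "malformed_request": 0.05}

# Hypothetical per-class failure probabilities (unknown in reality; the point
# of testing is to estimate them).
TRUE_FAILURE_PROB = {"small_request": 1e-4, "large_request": 1e-3, "malformed_request": 5e-2}

def run_once(input_class):
    """Stand-in for executing the real software on one input of this class."""
    return random.random() < TRUE_FAILURE_PROB[input_class]

# Step 1: hammer each input class during testing (a non-operational profile)
# and estimate its failure rate.
TRIALS_PER_CLASS = 20000
estimated = {}
for cls in INPUT_CLASSES:
    failures = sum(run_once(cls) for _ in range(TRIALS_PER_CLASS))
    estimated[cls] = failures / TRIALS_PER_CLASS

# Step 2: re-weight by the operational profile to predict field behaviour.
operational_failure_rate = sum(
    OPERATIONAL_PROFILE[cls] * estimated[cls] for cls in INPUT_CLASSES
)
print(f"estimated failure probability per operational input: {operational_failure_rate:.2e}")
print(f"=> roughly {1.0 / operational_failure_rate:,.0f} inputs between failures")
```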
