For example, here is the diagram from my previous question that I want to ask about.
If I treat data like the diagram here and write it into Verilog code, what are the disadvantages? Thanks for any answers.
Maybe we will encounter some problems during synthesis or implementation in the tools we use?
But it actually works fine when I program the code above into my FPGA.
Short Answer
Unreliable Sporadic Behaviour!
Long Answer
FPGA & ASIC designs use what is sometimes called synchronous design methodology. The basic idea is that clock pins are always driven by a clock. This allows synthesis tools to perform an analysis called 'static timing', which gives a degree of confidence that the design will operate properly because the delays have all been analyzed to be within the designer's constraints.
In the design shown, the delay on the Q output of the first stage will be a determining factor in the correct operation of the circuit. Designers want to reduce the dependence on delay, limiting the timing concerns to those that can be checked by static timing analysis.
The style shown is used in older references (my college digital design textbook in the 90's had these) and is sometimes part of what is called a 'ripple counter'. This was a popular method of digital design prior to the prevalence of FPGA and ASIC. In those days digital circuits were done using discrete logic on a printed circuit board, and the design concerns were different.
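As an illustration (the original diagram isn't reproduced here, so this is a hypothetical ripple-clock divider in the same style), each stage's Q output clocks the next stage, versus the synchronous equivalent that keeps one clock everywhere:

// Ripple style: stage 1 is clocked by a flip-flop output, so its timing
// depends on the Q delay of stage 0 and escapes static timing analysis.
module ripple_div (input wire clk, output reg q0 = 1'b0, output reg q1 = 1'b0);
    always @(posedge clk) q0 <= ~q0;   // clocked by the real clock: fine
    always @(posedge q0)  q1 <= ~q1;   // clocked by a data output: risky
endmodule

// Synchronous equivalent: one clock everywhere, divide with an enable.
module sync_div (input wire clk, output reg q0 = 1'b0, output reg q1 = 1'b0);
    always @(posedge clk) begin
        q0 <= ~q0;
        if (q0) q1 <= ~q1;             // toggles at half the rate of q0
    end
endmodule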
It's a bit difficult to find information on this topic. This post discusses the same topic a bit but does not go deep on the main point.
https://electronics.stackexchange.com/questions/115967/what-is-a-ripple-clock
One reason that it's difficult to find information is that the term 'asynchronous design' has different meanings, and the more ubiquitous meaning pertains to the design of digital circuits where feedback around combinational logic is used. The logic settles or 'latches' into a stable state. This is different from the discussion here, whose main idea is 'always drive clock pins with a clock'.
Another bad practice that was part of asynchronous design was to use the asynchronous reset pin of a flip-flop as control logic. In synchronous design the asynchronous reset pin is often not used, and when it is used, it's asserted asynchronously, de-asserted synchronously, and used mostly for global power-on resets.
This is a reply to a similar issue discussed on the Xilinx question & answer forum.
https://support.xilinx.com/s/question/0D52E0000757EsGSAU/that-dangerous-asynchronous-reset?language=en_US
The author (Xilinx engineer Ken Chapman) used the phrase 'Unreliable Sporadic Behavior' in the answer.
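As a minimal sketch of the reset style described above (assert asynchronously, de-assert synchronously; the module and signal names are mine, not from the linked post):

module reset_sync (
    input  wire clk,
    input  wire arst_n,   // asynchronous active-low reset input
    output wire rst_n     // reset synchronized to this clock domain
);
    reg [1:0] sync;
    always @(posedge clk or negedge arst_n)
        if (!arst_n) sync <= 2'b00;           // assert immediately
        else         sync <= {sync[0], 1'b1}; // release only on a clock edge
    assign rst_n = sync[1];
endmodule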
Another (good) synchronous design practice is to use very low skew clock resources to distribute the clock, so that the clock effectively is changing at the same time everywhere in the physical part.
Use synchronous design techniques & static timing as part of verification and to save debug effort for more important issues.
The term 'synchronous design' has kind of been forgotten since the 90's and is not widely used; it's just the way designs are done. Googling 'static timing' would be helpful to understand these concepts. A complete answer to 'what is static timing analysis' is beyond the scope of this question.
Do the following as a basis for synchronous design:
Drive clock pins with a clock
Use a clock buffer or clock tree to distribute the clock
Have a corresponding reset for each clock
Don't use asynchronous reset pins as control
Learn how to cross clock domains
Specify a timing constraint for each clock
Perform static timing analysis, understand the results
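As a sketch of the last three items (a common two-flop synchronizer for a single-bit clock-domain crossing, plus the kind of per-clock constraint you would write; the names and the vendor attribute are illustrative, not prescriptive):

module bit_sync (
    input  wire clk_dst,   // destination-domain clock
    input  wire d_async,   // level signal from another clock domain
    output wire q_sync
);
    (* ASYNC_REG = "TRUE" *)   // Xilinx attribute; other vendors differ
    reg [1:0] meta;
    always @(posedge clk_dst) meta <= {meta[0], d_async};
    assign q_sync = meta[1];
endmodule

// One timing constraint per clock (SDC/XDC style), then run STA:
//   create_clock -name sys_clk -period 10.0 [get_ports clk]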
Related
I have some FIRRTL and I want to get a critical path / bottleneck analysis of the code so as to maximize the clock rate by minimizing the critical path.
I can write a weighted topological sort myself, but I do not know the weights that I should use for the various circuit components, nor how to account for fanout slowdown.
I have heard the RISC-V grad students speak of running a critical path analysis when optimizing their chips, so the Chisel / RISC-V infrastructure must provide one. I would expect this to be a flag on the firrtl tool, but I see no such flag.
This is normally done through a synthesis/PnR tool such as Genus/Innovus. While you could look at the FIRRTL/RTL to get a generalization regarding your design, there would be many factors that ultimately affect the timing.
For example, all other things being equal, a 4-gate combinational path would run faster than a 5-gate combinational path. The issue is that where gates are placed, their drive strength (and your power requirements), routing, etc. would likely contribute more towards timing than the gates themselves.
SystemVerilog introduced some very useful constructs to improve coding style. However, as one of my coworkers always says, "You are not writing software, you are describing hardware." With that in mind, what features of the language should be avoided when the end result needs to be synthesized? This paper shows what features are currently synthesizable by the Synopsys tools, but to be safe I think one should only use the features that are synthesizable by all of the major vendors. Also, what constructs will produce strange results in the netlist which will be difficult to follow in an ECO?
In summary: I like compact and easy to maintain code, but not if it causes issues in the back end. What should I avoid?
Edit: In response to the close vote I want to try to make this a bit more specific. This question was inspired by this answer. I am a big fan of using the 'sugar' as Dave calls it to reduce the code complexity, but not if some synthesis tools are going to mangle signal names and make the result difficult to deal with. I am looking for more examples like this.
Theoretically, if you can write software that is synthesized into machine code to run on a piece of hardware, that software can be synthesized into hardware. And conversely, there are hardware constructs in Verilog-1995 that are not considered synthesizable simply because none of the major vendors ever got around to supporting them (e.g. assign/deassign). We still have people using // synopsys translate_on/off because it took so long for them to support `ifdef SYNTHESIS.
Most of what I consider to be safe for synthesis in SystemVerilog is what I call syntactic sugar for Verilog. This is just more convenient ways of writing the same Verilog code with a lot less typing. Examples would be:
data types: typedef, struct, enum, int, byte
use of those types as ports, arguments and function return values
assignment operators: ++ -- +=
type casting and bit-streaming
packages
interfaces
port connection shortcuts
defaults for function/tasks/macro arguments, and port connections
Most of the constructs that fall into this category are taken from C and don't really change how the code gets synthesized. It's just more convenient to define and reference signals.
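A small sketch of several of these constructs together (my own illustrative example, not from the answer's source), all of which falls in the synthesizable subset:

package alu_pkg;
  typedef enum logic [1:0] {ADD, SUB, AND_OP, OR_OP} opcode_t;
  typedef struct packed {
    opcode_t    op;
    logic [7:0] a, b;
  } cmd_t;
endpackage

module alu
  import alu_pkg::*;         // package import
 (input  cmd_t       cmd,    // struct used directly as a port
  output logic [7:0] y);
  always_comb
    unique case (cmd.op)
      ADD:    y = cmd.a + cmd.b;
      SUB:    y = cmd.a - cmd.b;
      AND_OP: y = cmd.a & cmd.b;
      OR_OP:  y = cmd.a | cmd.b;
    endcase
endmodule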
The place it gets difficult to synthesize is where there is dynamically allocated storage. This would be class objects, queues, dynamic arrays, and strings, as well as dynamically created processes with fork/join.
I think some people have a misconception about SystemVerilog, thinking it is only for Verification, when in fact the first version of the standard was the synthesizable subset, and Intel was one of the first users of it as a language for Design.
SystemVerilog(SV) can be used both as a HDL (Hardware Description Language) and HVL (Hardware Verification Language) and that is why it is often termed an "HDVL".
There are several interesting design constructs in SV which are synthesizable and can be used instead of older Verilog constructs, and which are helpful in optimizing code and achieving faster results.
enum of SV vs parameter of Verilog while modelling FSMs.
Use of logic instead of reg and wire.
Use of always_ff, always_comb, always_latch in place of single always blocks in Verilog.
Use of the unique and priority statements instead of Verilog's full_case and parallel_case pragmas.
Wide range of data types available in SV.
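For instance, a minimal FSM sketch (hypothetical) combining several of these: an enum for the state, always_ff for the state register, always_comb for next-state logic, and unique case:

module blinker (
  input  logic clk, rst_n, go,
  output logic led
);
  typedef enum logic [1:0] {IDLE, ON, OFF} state_t;
  state_t state, next;

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) state <= IDLE;
    else        state <= next;

  always_comb begin
    next = state;
    unique case (state)
      IDLE: if (go) next = ON;
      ON:           next = OFF;
      OFF:          next = ON;
    endcase
  end

  assign led = (state == ON);
endmodule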
Now what I have discussed above are those constructs of SystemVerilog which are used in RTL design.
But, the constructs which are used in the Verification Environment are non-synthesizable. They are as follows:
Dynamic arrays and associative arrays.
Program Blocks and Clocking blocks.
Mailboxes
Semaphores
Classes and all their related features.
Tasks containing delays or event controls.
Chandle data types.
Queues.
Constrained random features.
Delay, wait, and event control statements.
I am synthesizing some multiplication units in Verilog and I was wondering: do you generally get better results in terms of area/power savings if you implement your own CSA with Booth encoding when multiplying, or if you just use the * symbol and let the synthesis tool take care of the problem for you?
Thank you!
Generally, I tend to trust the compiler tools I use and don't fret so much about the results as long as they meet my timing and area budgets.
That said, with multipliers that need to run at fast speeds I find I get better results (in DC, at least) if I create a Verilog module containing the multiply (*) and a retiming register or two, and push down into this module to synthesise it before popping up to toplevel synthesis. It seems as if the compiler gets 'distracted' by other timing paths if you try to do everything at once, so making it focus on a multiplier that you know is going to be tricky seems to help.
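A sketch of that pattern (my own paraphrase of the approach, not the author's actual code): wrap the * in its own module with a retiming register or two, synthesize it bottom-up, then preserve it during top-level synthesis.

module mult_retime #(parameter W = 16) (
  input  logic             clk,
  input  logic [W-1:0]     a, b,
  output logic [2*W-1:0]   p
);
  logic [2*W-1:0] p_q;
  always_ff @(posedge clk) begin
    p_q <= a * b;   // let the tool pick the multiplier architecture
    p   <= p_q;     // retiming register the tool can push into the multiply
  end
endmodule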
You have this question tagged with "FPGA." If your target device is an FPGA, then it may be advisable to use the FPGA's multiplier megafunction (I don't remember what Xilinx calls it these days).
This way, you will be sure that the tool utilizes whatever internal hardware structure you intend to use, irrespective of the synthesis tool. You will be sure to get an optimum solution that is also predictable from a timing and latency standpoint.
Additionally, you don't have to test it for all the corner cases, which is especially important if you are doing signed multiplication, whatever coding guidelines you follow.
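One hedged example of steering the mapping (this uses Xilinx Vivado's use_dsp synthesis attribute; other vendors have their own attributes or IP generators):

(* use_dsp = "yes" *)   // ask Vivado to map this onto dedicated DSP blocks
module mult_dsp #(parameter W = 18) (
  input  logic signed [W-1:0]   a, b,
  output logic signed [2*W-1:0] p
);
  assign p = a * b;
endmodule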
I agree with @Marty in that I would use *. I have previously built my own low-power adder structures, which then ran into problems when the design shifted process or had to be run at a higher frequency. Hard-coded architectures like this remove quite a bit of portability from the code.
Using the directives is nice in trials to see the different sizes (area) of architectures, but I leave the decision to the synthesis tool to make the best call based on the timing constraints and available area. I am not sure how power-aware the tools are by default. Previously we ended up getting an extra license which added a lot of power-aware knowledge to the synthesis.
I've noticed that all designs I have come across can be multi-threaded using the actor model - separating each work module into a different actor and using a message queue (for me a .NET ConcurrentQueue) to pass messages. What other good multi-threaded models exist?
Communicating Sequential Processes is, I think, a far better model for concurrency than the actor model. It addresses a number of problems with the actor model (and other models) such as deadlock, livelock, starvation. Take a look at this and, more practically useful, this.
The main difference is as follows. In the actor model a message is sent asynchronously. However in CSP messages are sent synchronously; the sender cannot send until the receiver is ready to receive.
This one simple restriction makes the world of difference. If you've got an incorrect design with deadlock potential then in the actor model it may or may not occur (and it usually occurs only when demo-ing to the boss...). However in CSP the deadlock will always occur, leaving you in no doubt that your design is incorrect. Ok, so you've still got to fix it but that's OK; fixing problems you know are there is much easier than attempting to exhaustively test for the absence of problems (your only choice in the actor model).
The strictly synchronous approach of CSP seems like it will cause problems with response times; for example one fears that a GUI thread can't move on because it's not been able to send a message to a busy worker thread that's not got as far as its 'read'. What you have to do is to ensure that the workload is spread across enough threads so that they can all get back to waiting for new messages within an acceptable period of time. CSP doesn't let you get away with it. The actor model does, however don't be deceived; you're just building up future problems.
In .NET a ConcurrentQueue is not the right primitive for CSP, not unless you layer a synchronising mechanism on top. I've added strict synchronisation on top of TCP sockets too. In fact I generally end up writing some sort of library that abstracts both sockets and pipes so that it becomes immaterial as to whether a 'Process' (as they're known in CSP parlance) is a thread on this machine or a whole other process on another machine at the end of a network connection. Nice - scalability built in from the very beginning.
I've been doing it the CSP way for 23 years now, I won't do it any other way. Built some big systems with thousands of threads that way.
==EDIT==
It seems this answer is still attracting some attention, so I thought I'd add to it. For Windows developers there is the DataFlow namespace for the Task Parallel Library. It has to be separately downloaded. Microsoft describes it thus: "This dataflow model promotes actor-based programming by providing in-process message passing for coarse-grained dataflow and pipelining tasks." Excellent! It uses classes like BufferBlock as communications channels. The important thing is that a BufferBlock has a BoundedCapacity property that defaults to Unbounded, which fits the actor model. Set this to a value of 1, and you have now transformed it into a CSP-style communication channel.
To add to my last, there are various other multi threading models beyond CSP. This Wikipedia page lists several others like CCS, ACP, and LOTOS. Reading those articles hints at a deep and dark cavern where academics roam, waiting to pounce on a stray software developer.
The problem is that academic obscurity often means a complete lack of tools and libraries at the practical, usable level. It takes a lot of effort to convert a sound, proven academic study into a set of libraries and tools. There's little real incentive for the wider software community to take up a theoretical paper and turn it into a practical reality.
I like CSP because it's actually dead simple to implement your own CSP library based on select() or pselect(). I've done that several times now (I must learn about code re-use), plus the nice people at Kent University put together JCSP for those who like Java. I don't recommend developing in Occam (though it's still just about possible); support and maintainability are going to be issues going forward. CSP is probably the easiest one to get into, and given its good characteristics it's well worthwhile.
@JeremyFriesner
Future Problems
To expand on what I meant by "future problems", I was referring to the fact that in an asynchronous system the sender of messages has no knowledge as to whether the receiver is actually keeping up with the demand. The sender doesn't know because all it knows is that some message buffer has accepted the message. The transport underneath (e.g. tcp) then gets on with the job of pushing the message over as and when the receiver is willing to accept it.
Thus it might be that when under stress the system fails to perform as required, because the message transport will inevitably have a limited capacity to absorb messages that the receiver can't accept yet. The sender only finds this out after the problem has already begun to develop, by which time it might be too late to do anything about it.
Testing of course can reveal this problem, but you have to be careful that the testing really has exhausted the transport's ability to absorb messages. Just a quick blast at full speed might be deceiving.
Of course, a synchronous system imposes an overhead ("are you ready yet?", "no, not yet", "now?", "yes!", "here you are then") which just doesn't happen in an asynchronous system. So on average the asynchronous system will be more efficient and might actually have a higher throughput, etc. This is why most of the world's systems are actually asynchronous, but it is also the reason why systems don't always reach the full capacity that the raw network bandwidths / processing times might suggest. When approaching full capacity, asynchronous systems tend not to limit gracefully, in my opinion. Token Bus (nb not Token Ring) was a good example of a synchronous network with totally dependable and deterministic throughput, but it was just a little bit slower than Ethernet and Token Ring...
Having always been blessed with a surfeit of bandwidth in my problems I've chosen the synchronous route for certainty-of-success reasons; I'm not really losing out much on bandwidth, but I am losing tons of risk, which is good.
Convert from Synchronous to Asynchronous
Maybe, but it's possibly of little value. In a synchronous system it only works as per the requirement if you have successfully balanced the division of labour between threads. That is, there are enough threads doing the slow bits so that the fast bits aren't held back. Get that wrong and the system definitely isn't quick enough.
But having done that, you have a system where every component is able to send messages onwards with no delay, because everything it is sending to is ready and waiting (because of your skill and judgement at balancing out the workloads). So if you did then convert to an asynchronous message transport, all you're doing is saving fractionally small amounts of time in the transport of those messages. You're not making changes that will result in the workloads getting processed quicker. However, if saving bandwidth is the goal then perhaps it's worthwhile.
Of course, doing this balancing can be a difficult thing, and variabilities like HDD access times, networks, etc. can be difficult to overcome. I've often had to implement a 'next available' workload sharing scheme. But certainly in real-time signal processing systems like the ones I play with, you're basically dealing with a very dependable transport like OpenVPX's RapidIO, you're only doing sums on the data (not dealing with databases, disks, etc), and the data rates are very high (1 GByte/sec is perfectly doable these days, and in fact I was handling data rates that high 13 years ago; that was haaard work). Being strictly synchronous means that you're either definitely keeping up with the data rate or definitely not. With asynchronous, it's more of a maybe...
Real Time OS for Everyone!
Having a real-time OS is an essential component too, and these days it seems to be the PREEMPT_RT patch set for Linux that does the job for a lot of people in the trade. Red Hat do a prepackaged spin of that (Red Hat MRG), but for a freebie, Scientific Linux from the nice people at CERN is good! I strongly suspect that a lot of systems would work much more smoothly near their capacity limits if PREEMPT_RT was used - it does a good job of smoothing things out.
Concurrency is a fascinating topic with a lot of approaches to implementation with the fundamental question being - "How do I coordinate parallel computations?".
Some models of concurrency are:
Futures
Futures, also known as Promises or Tasks, are objects that act as proxies for an asynchronously calculated result. When the value is actually needed for a calculation, the thread freezes until the calculation is complete, and thus synchronization is achieved.
Futures are the preferred concurrency model for .NET and ES6.
Software Transactional Memory
Software Transactional Memory (STM) synchronizes access to shared memory (much like locks) by grouping actions into transactions. Any single transaction only sees a single view of the shared memory and is atomic. This is conceptually similar to how many databases deal with concurrency.
STM is the preferred concurrency model for Clojure and Haskell.
The Actor Model
The Actor Model focuses on message passing. An actor receives a message and can decide to send a message in response, spawn other actors, make local changes, etc. This is probably the least tightly coupled model of those discussed, as actors exchange messages only and nothing else.
The Actor Model is the preferred concurrency model for Erlang and Rust.
Note that, unlike the languages mentioned above, most languages don't have a canonical or preferred concurrency model, and even those languages that show a strong preference for one model usually have the other ones implemented as libraries.
My personal opinion is that Futures outclass STM and Actors in simplicity of use and reasoning, but none of these models is inherently "wrong" and I can think of no fundamental disadvantages for any of them. You could use whichever you preferred with no consequences.
The most general model for parallel processing is Petri Nets. It represents computation as a pure data dependency graph, which expresses maximum parallelism. All other models stem from it.
The Dataflow Computing model (http://www.cs.colostate.edu/cameron/dataflow.html, http://en.wikipedia.org/wiki/Dataflow_programming) is almost as powerful. It restricts Petri Net places to have only one output arc. In practice, this is useful, as places with multiple output arcs are hard to implement, cause indeterminism, and are rarely needed.
The Actor model is a dataflow model where nodes may have only 2 input edges - one for input messages and one for the actor's state. This is a serious restriction if you want to program functions with side effects and more than one argument.
I have read "Nonblocking Assignments in Verilog Synthesis, Coding Styles that Kill!" by Clifford Cummings. He says that the code at the bottom of this question is "guaranteed" to be synthesised into a three flip-flop pipeline, but it is not guaranteed to simulate correctly (example pipeb3, page 10; the "guaranteed" comment is on page 12). The document won a best paper award, so I assume the claim is true. http://www.sunburst-design.com/papers/CummingsSNUG2000SJ_NBA.pdf
My question: How is the correctness of Verilog synthesis defined if not by reference to the simulation semantics? Many thanks.
I suppose the bonus points question is: give the simplest possible Verilog program that has well-defined synthesis semantics and does not have well defined simulation semantics, assuming it is not the code below. Thanks again.
In fact, can someone give me a piece of Verilog that is well defined when both simulated and synthesised, yet the two produce different results?
The code:
module pipeb3 (q3, d, clk);
output [7:0] q3;
input  [7:0] d;
input        clk;
reg [7:0] q3, q2, q1;
always @(posedge clk) q1=d;
always @(posedge clk) q3=q2;
always @(posedge clk) q2=q1;
endmodule
PS: in case anyone cares, I thought a plausible definition of a correct synthesis tool might be along the lines of "the synthesised hardware will do something that a correct simulator could". But this is inconsistent with the paper.
[I now think the paper is not right. Section 5.2 of the 1364-2001 standard clearly says that the meaning of a Verilog program is defined by the simulation semantics that the standard then proceeds to define (non-determinism and all). There is no mention whatsoever of any "guarantees" that synthesis tools must provide over and above simulators.
There is another standard 1364.1-2002 that describes the synthesisable subset. There is no obvious mention that the semantics of synthesised hardware should somehow differ from simulation. Section 5.2.2 "Modelling edge-sensitive storage devices" says that non-blocking assignments should be used to model flip-flops. In standard-speak that means that the use of anything else is unsupported.
As a final note, the section referred to in the previous paragraph says that blocking assignments can be used to calculate the RHS of the non-blocking assignment. This appears to violate Cummings' recommendation #5.
Cliff Cummings is listed as a member of the working group of the 1364.1-2002 standard. This standard is listed as replaced on the IEEE website but I cannot tell what it was replaced by.]
All -
Time for me to chime in with useful background information and my own opinions.
First - The IEEE-1364.1-2002 Verilog RTL Synthesis Standard was never fully implemented by any vendor, which is why none of us were in any hurry to update the standard or to provide a SystemVerilog version of the synthesis standard. To my knowledge, the standard was not "replaced," and has just expired. To my knowledge, the attributes described in the Standard were never fully implemented by any vendor. The only useful feature in the Standard that I believe was implemented by all vendors was that a vendor is supposed to set the macro `define SYNTHESIS before reading any user code, so that you can now use `ifndef SYNTHESIS - `endif as a generic replacement for the vendor-specific // synopsys translate_on - // synopsys translate_off pragma-comments.
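For instance, the one portable guard that did come out of it (a small sketch of my own; the macro is set by the synthesis tool, as described above):

module tb_demo;
`ifndef SYNTHESIS
  // Simulation-only: synthesis tools define SYNTHESIS before reading
  // user code, so this block never reaches the netlist.
  initial $display("simulation only");
`endif
endmodule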
Verilog was invented as a simulation language and was never intended to be a synthesis language. In the late 1980's, Synopsys recognized that engineers really liked this Verilog-simulation language and started to define a subset of the language that they (Synopsys) would recognize and convert through synthesis into hardware. We now refer to this as the RTL synthesis subset, and that subset can grow over time as synthesis tool vendors discover unique and creative ways to convert a new type of description into hardware.
There really is no "correctness of Verilog synthesis defined." Don Mills and I wrote a paper in 1999 entitled, "RTL Coding Styles That Yield Simulation and Synthesis Mismatches," to warn engineers about legal Verilog coding styles that could infer synthesized hardware with different behavior.
http://www.sunburst-design.com/papers/CummingsSNUG1999SJ_SynthMismatch.pdf
Consider this, if synthesized results always matched the behavior of Verilog simulations, there would be no need to run gate simulations. The design, as RTL-simulated, would be correct. Because there is no guaranteed match, engineers run gate-sims to prove that the gate behavior matches the RTL behavior, or they try to run equivalence checking tools to mathematically prove that the pre-synthesis RTL code is equivalent to the post-synthesis gate models, so that gate-sims are not required.
As for the bonus question, this is really hard, because Verilog semantics are rather well defined, even if the definition is that it is a legal race condition.
As far as well-defined code in simulation and synthesis with different results, consider:
module code1c (output reg o, input a, b);
  always
    o = a & b;
endmodule
In simulation, you never get past time 0. Simulation will loop forever because of the missing sensitivity list. Synthesis tools do not even consider the sensitivity list when inferring combinational logic, so you will get a 2-input AND gate and a warning about missing sensitivity list items that could cause a mismatch between pre- and post-synthesis simulations. In Verilog-2001 we added always @* to avoid this common problem, and in SystemVerilog we added always_comb to remove the sensitivity list and inform the synthesis tool of the designer-intended logic.
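A sketch of the same logic with the fixes just described (either form simulates and synthesizes consistently):

module code1c_fixed (output logic o, input logic a, b);
  always_comb        // or: always @* in Verilog-2001
    o = a & b;
endmodule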
As far as whether the paper should offer guarantees on correct synthesis behavior, it probably should not, but the guarantees described in my paper define what an engineer can expect from a synthesis tool based on experience with multiple synthesis tools.
"As a final note, the section referred to in the previous paragraph says that blocking
assignments can be used to calculate the RHS of the non-blocking assignment. This
appears to violate Cummings' recommendation #5."
You are correct, this does violate coding guideline #5 and in my opinion should not be used.
Coding guideline #5 is frequently violated in VHDL designs because VHDL variables cannot trigger another process. I find the VHDL-camp evenly divided on this issue. Half say that you should not use variable assignments and the other half use variables to improve simulation performance but then are required to mix variable assignments with a final signal assignment to trigger other processes.
If you violate coding guideline #5 and if your code is correct, the simulation will work and the synthesis will also work, but if you have any mistakes in your code, it is very difficult to debug designs that violate coding guideline #5 because the waveform display for the combinational piece does not make sense. The output of the combinational logic in a waveform display only updates when reset is not asserted and on a clock edge, which is not how real combinational hardware behaves, and this has proven to be a difficult issue when debugging these designs using waveform displays (I did not include this information in the paper).
Regards - Cliff Cummings - Verilog & SystemVerilog Guru
I believe the reason this will synthesize correctly is that in real silicon there's no difference between 'blocking' and 'nonblocking'.
Synthesis will read that and create three flip flops chained back to back, as you've described.
This won't be a problem in synthesis (assuming you're not violating flop hold time), because real gates exhibit delays. On the rising edge of clk, it will take several ns for the value d to propagate to q1. By the time d propagates to q1, q1 will have already been sampled by the second flop; similarly with q2 and q3.
The reason this doesn't work in simulation is because there are no gate delays. On the positive edge of clock, q1 will be instantly replaced with d, possibly before q1 was sampled by the second flop. In a real circuit (with proper setup and hold time), q1 is guaranteed to be sampled on the positive edge of clock before the first flop can change its output value.
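For reference, the fix the Cummings paper recommends - nonblocking assignments - makes simulation independent of process ordering, so it matches the three-flop hardware (a fragment, reusing the declarations from the question's module):

always @(posedge clk) q1 <= d;
always @(posedge clk) q2 <= q1;
always @(posedge clk) q3 <= q2;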
I know this is 3 years old, but your post was just flagged up when someone tried to edit it. Cliff's answer is, of course, comprehensive, but it doesn't really answer your question. The other answer is also plain wrong.
My question: How is the correctness of Verilog synthesis defined if not by reference to the simulation semantics?
You're right, of course. Synthesis is only 'correct' if (a) the result (output) simulates in the same way as the original (input), after possibly making some allowance for timing/etc issues, and/or (b) the synthesiser output can be formally proved to be equivalent to the synthesiser input.
give the simplest possible Verilog program that has well-defined synthesis semantics and does not have well defined simulation semantics
In principle, this shouldn't be possible. The synthesiser vendors tried to define templates that were based on code that had well-defined simulation semantics. However, Verilog was (and is) poorly defined, and NBAs didn't initially exist in the language, so you have oddities like the pipeline example. Best to forget about them.
In fact, can someone give me a piece of Verilog that is well defined when both simulated and synthesised, yet the two produce different results?
The only definition of 'well defined' (as opposed to 'correct') in synthesis is that multiple vendors will produce exactly the same incorrect result. This is pretty unlikely. I guess the classical async reset and async set clocked F/F would be close.