How to drive a clock to a single clock domain? - delay

I have a project to do in VHDL on an FPGA (Cyclone IV). The majority of my entities work with a single clock. I know that clock gating is not a good solution (see image) because it causes timing violations. Can someone tell me the good practice rules for this kind of thing? (I obviously did some research on the Internet, but every link I found talks about clock domain crossing.)
Thank you

The diagram isn't a demonstration of clock gating. It is showing the impact of skew between different portions of a clock distribution tree. This is a reality of synthesizing any practical design even if you aren't gating the clock. A gated clock introduces additional skew on nodes within its fanout in addition to that which naturally occurs in clock distribution.
In non-trivial devices the clocks are passed through a tree of buffers to minimize capacitive load. In FPGAs the clocks are usually routed on carefully designed global clock nets that have been optimized to minimize skew. In ASICs balanced clock trees will be synthesized and physical timing constraints will guide placement of the buffers to minimize skew at the flip-flops. Unless you are doing something exotic, the backend tools will take care of getting you the best clock trees possible.
As a designer you primarily deal with the skew problem by setting up proper timing constraints and using static timing analysis to verify you meet setup and hold requirements under worst case (and best case) conditions. An FPGA has already had its delays characterized for use in static timing. With an ASIC, the delays in your design will be estimated before and after place-and-route and fed into the analyzer. A gated clock will introduce skew that reduces the available timing budget on the affected data paths. The timing analyzer will account for this if you have set up the constraints properly. Once you pass static timing on the final design, your job is done.
If you have timing failures due to combinational delays in your datapath, you have to do one of the following:
Reduce the failing combinational path by reorganizing the logic or inserting pipeline stages
Use a slower clock
Define a multi-cycle delay if you don't need valid results on every cycle
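For the multi-cycle option, the constraint looks roughly like this in XDC (a hedged sketch - the register names and the cycle count here are assumptions for illustration, not from the original design):
set_multicycle_path 2 -setup -from [get_cells src_reg] -to [get_cells dst_reg]
set_multicycle_path 1 -hold -from [get_cells src_reg] -to [get_cells dst_reg]
The -hold line with a multiplier of N-1 is the usual companion to a setup multiplier of N, so the hold check is not pushed out along with the setup check.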
If a gated clock introduces too much skew, you can treat it as a new clock domain altogether and use clock-domain-crossing techniques to pass signals between the separate domains.
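As a practical note on the original question: on an FPGA, if part of the design only needs to update at a lower rate, the usual good practice is a clock enable rather than a gated clock - every flip-flop stays on the same global clock net, so there is no extra skew and only one clock domain to constrain. A minimal sketch (written in Verilog here, although the question is about VHDL; the divide ratio and port names are assumptions):
// Clock-enable pattern: one global clock, a periodic one-cycle enable pulse.
// DIV is an illustrative ratio; all flip-flops see the same clock.
module slow_logic #(parameter DIV = 4) (
  input  wire       clk,   // single global clock
  input  wire       rst,
  input  wire [7:0] d,
  output reg  [7:0] q
);
  reg [$clog2(DIV)-1:0] cnt = 0;
  wire en = (cnt == DIV-1);            // asserted one cycle in every DIV

  always @(posedge clk) begin
    if (rst) cnt <= 0;
    else     cnt <= en ? 0 : cnt + 1;
  end

  always @(posedge clk) begin
    if (en) q <= d;                    // "slow" logic updates only when enabled
  end
endmodule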

Related

Realtime CPU clock vs High Frequency Software clock

I am curious to learn about the technology used to generate a software clock in simulators. The frequency of my machine is only ~2.4GHz, but I can generate up to a 500THz clock using a simulator (refer to the SystemVerilog snippet below).
`timescale 1fs/1fs // the minimum time unit and precision, needed to describe a 500THz clock
module temp();
  bit clk_b;
  always #1 clk_b = ~clk_b; // toggles every 1 fs -> 2 fs period -> 500THz
endmodule
Is this higher frequency just a software illusion, or does it have any link with the CPU's crystal oscillator?
The simulation does not "run" in real time. It computes the result for each simulation step, and when it is done, it is done. This means that the ratio between the number of required steps (as well as the problem complexity) and your computer's performance defines how much wall-clock time the simulation needs to finish. The timescale setting of the simulation is just what it says: a way to relate the simulation steps to a time scale.
So it really is an "illusion", if you want to call it that.
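A quick way to convince yourself (a small hedged sketch along the lines of the snippet above): simulated time advances only by the delays written in the code and has no fixed relationship to wall-clock time.
`timescale 1fs/1fs
module tb;
  bit clk_b;
  always #1 clk_b = ~clk_b;            // 2 fs period -> a 500THz *simulated* clock
  initial begin
    #1000;                             // advance 1000 fs of simulated time
    $display("sim time = %0t", $time); // prints 1000 no matter how long the
    $finish;                           // computer actually took to get here
  end
endmodule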
SystemVerilog is an HVL, i.e., a Hardware Verification Language. It is (mostly) used to verify hardware designs.
The main purpose of the language is to provide a platform where one can create logic to verify the DUT by running simulations, i.e., generating different operating conditions for the DUT and checking how it behaves under each condition. But this does not necessarily mean that the DUT is supposed to operate under such extreme conditions generated by the SystemVerilog testbench.
When you generate a 500THz clock from your testbench and check the behaviour of your DUT, you are making sure that the DUT is not (virtually) going to break down even under such extreme conditions. But please note that this is just a virtual environment you have created, not the actual environment under which the DUT, once synthesised, is supposed to operate.
If the maximum frequency of the machine (or DUT) is ~2.4GHz, it is supposed to operate at that frequency in the actual environment, but just out of curiosity you can check the operation of the DUT with different input clock frequencies by running different simulations.
Hope it helps!

Propagation delay in circuits

Which is better for accurate propagation delay: SPICE simulation or calculation using Elmore delay (RC delay modeling)?
SPICE simulation is more accurate than Elmore delay modelling. This is mentioned in the book CMOS VLSI Design by Weste and Harris, page 93, Section 2.6:
Blindly trusting one’s models
Models should be viewed as only approximations to reality, not reality itself, and used within their limitations. In particular, simple models like the Shockley or RC models aren’t even close to accurate fits for the I-V characteristics of a modern transistor. They are valuable for the insight they give on trends (i.e., making a transistor wider increases its gate capacitance and decreases its ON resistance), not for the absolute values they predict. Cutting-edge projects often target processes that are still under development, so these models should only be viewed as speculative. Finally, processes may not be fully characterized over all operating regimes; for example, don’t assume that your models are accurate in the subthreshold region unless your vendor tells you so. Having said this, modern SPICE models do an extremely good job of predicting performance well into the GHz range for well-characterized processes and models when using proper design practices (such as accounting for temperature, voltage, and process variation).
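For reference, the Elmore approximation the question refers to estimates the delay of an RC ladder as the sum, over each resistor, of that resistance times all the capacitance downstream of it. A worked two-segment example with purely illustrative values: t_pd ≈ R1·(C1 + C2) + R2·C2; with R1 = R2 = 1 kΩ and C1 = C2 = 10 fF this gives 1 kΩ × 20 fF + 1 kΩ × 10 fF = 30 ps. SPICE instead solves the nonlinear device equations numerically, which is why it is the more accurate of the two.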

OpenCL GPU Audio

There's not much on this subject, perhaps because it isn't a good idea in the first place.
I want to create a realtime audio synthesis/processing engine that runs on the GPU. The reason is that I will also be using a physics library that runs on the GPU, and the audio output will be determined by the physics state. Is it true that the GPU only carries audio output and can't generate it? Would this mean a large increase in latency if I were to read the data back on the CPU and output it to the sound card? I'm looking for a latency between 10 and 20 ms between synthesis and playback.
Would the GPU accelerate synthesis by any worthwhile amount? I'm going to have a large number of synthesizers running at once, each of which I imagine could take up their own parallel process. AMD is coming out with GPU audio, so there must be something to this.
For what it's worth, I'm not sure that this idea lacks merit. If DarkZero's observation about transfer times is correct, it doesn't sound like there would be much overhead in getting audio onto the GPU for processing, even from many different input channels, and while there are probably audio operations that are not very amenable to parallelization, many are very VERY parallelizable.
It's obvious, for example, that computing sine values for 128 samples of output from a sine source could be done completely in parallel. Working in blocks of that size (128 samples at 44.1 kHz is about 2.9 ms) would permit a latency of only about 3 ms, which is acceptable in most digital audio applications. Similarly, many of the other fundamental oscillators could be effectively parallelized. Amplitude modulation of such oscillators would be trivial. Efficient frequency modulation would be more challenging, but I would guess it is still possible.
In addition to oscillators, FIR filters are simple to parallelize, and a Google search turned up some promising-looking research papers (which I didn't take the trouble to read) suggesting that there are reasonable parallel approaches to IIR filter implementation. These two types of filters are fundamental to audio processing, and many useful audio operations can be understood as such filters.
Wave-shaping is another task in digital audio that is embarrassingly parallel.
Even if you couldn't take an arbitrary software synth and map it effectively to the GPU, it is easy to imagine a software synthesizer constructed specifically to take advantage of the GPU's strengths, and avoid its weaknesses. A synthesizer relying exclusively on the components I have mentioned could still produce a fantastic range of sounds.
While marko is correct to point out that existing SIMD instructions can do some parallelization on the CPU, the number of inputs they can operate on at the same time pales in comparison to a good GPU.
In short, I hope you work on this and let us know what kind of results you see!
DSP operations on modern CPUs with vector processing units (SSE on x86/x64 or NEON on ARM) are already pretty cheap if exploited properly. This is particularly the case with filters, convolution, FFT and so on - which are fundamentally stream-based operations. These are the type of operations where a GPU might also excel.
As it turns out, soft synthesisers have quite a few operations in them that are not stream-like; furthermore, the tendency is to process increasingly small chunks of audio at once to target low latency. These are a really bad fit for the capabilities of a GPU.
The effort involved in using a GPU - particularly getting data in and out - is likely to far exceed any benefit you get. Furthermore, the capabilities of inexpensive personal computers - and also tablets and mobile devices - are more than enough for many digital audio applications. AMD seem to have a solution looking for a problem. For sure, the existing music and digital audio software industry is not about to start producing software that only targets a limited subset of hardware.
Typical transfer times for a few MB to/from the GPU are around 50 µs.
Latency is not your problem (two 50 µs transfers are about 1% of a 10 ms latency budget); however, parallelizing an audio synthesizer on the GPU may be quite difficult. If you don't do it properly, the processing may take more time than the data copy.
If you are going to run multiple synthesizers at once, I would recommend running each synthesizer in a work-group and parallelizing the synthesis process across the available work-items. It will not be worthwhile to map each synthesizer to a single work-item, since it is unlikely you will have thousands of them.
http://arxiv.org/ftp/arxiv/papers/1211/1211.2038.pdf
You might be better off using OpenMP for its lower initialization times.
You could check out the NESS project, which is all about physical modelling synthesis. They are using GPUs for audio rendering because the process involves simulating an acoustic 3D space for a given sound and calculating what happens to that sound within the virtual 3D space (and apparently GPUs are good at working with this sort of data). Note that this is not realtime synthesis because it is so demanding of processing.

Adjusting the operating frequency of a module in Verilog

I am creating a fairly complicated design which involves timing analysis of two modules, each having its own algorithm, but both taking two signed numbers as inputs and outputting a signed number.
I am designing this for an FPGA in Verilog, using Xilinx as my synthesis tool. Now, I understand that Xilinx usually gives the worst-case timing analysis for any module. This means that if most input combinations take 250 picoseconds from input to output, including routing time, but there is even a single set of inputs that takes 400 picoseconds, the timing analysis shown by Xilinx will be 400 picoseconds.
My goal is to find:
1) If Module 1 is faster than Module 2 for any set of numbers.
2) The range of numbers for which Module 1 is faster than Module 2.
The only logical approach I can think of is to increase the operating frequency of the module - that is, to force both modules to give their outputs after, say, 300 picoseconds rather than 400 picoseconds.
Obviously, if I increase the operating frequency, some of the inputs in the testbench will give erroneous outputs. My hypothesis is that the module that starts giving erroneous answers first has the slower algorithm.
So my doubts are:
1) Is it possible to increase the operating frequency of a module in Verilog using Xilinx (some setting I must apply during synthesis or analysis)? If not, is there a better tool that will do my timing analysis?
2) Is this approach viable? Short of doing a gate-level synthesis using Cadence, is there any way I can find out the actual delay for each set of signed numbers through each gate using Verilog?
You are right in assuming that Xilinx always reports worst-case timing for the whole design where clock rates are concerned. Don't take the synthesis results as being very accurate - they can vary by quite a lot once you've placed and routed the design.
I guess you could take the post-PAR Verilog netlist and simulate that with a variety of inputs using different simulated clock speeds - if there are slow paths which are not used for certain inputs, you should be able to run the simulated clock faster for those inputs.
Sounds like a very time consuming task, and I'm not sure what the point is. Where I come from (automotive) "worst-case" is the only number we can look at with any level of confidence!
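For what it's worth, here is a minimal sketch of the simulation idea described above - sweeping the simulated clock period over the post-PAR netlist. The module name dut, its ports, and the period values are assumptions for illustration:
// Hedged sketch: drive the post-PAR netlist at progressively faster clocks
// and watch for the first erroneous output. All names and values are illustrative.
`timescale 1ps/1ps
module tb;
  real half_period = 200;             // start at a 400 ps period
  bit  clk;
  logic signed [15:0] a, b, y;

  always #(half_period) clk = ~clk;

  dut u_dut (.clk(clk), .a(a), .b(b), .y(y));  // hypothetical post-PAR netlist

  initial begin
    a = 16'sd123; b = -16'sd45;       // one example input pair
    repeat (10) @(posedge clk);       // observe results at the 400 ps period
    half_period = 150;                // tighten to a 300 ps period
    repeat (10) @(posedge clk);       // compare y against a golden model here
    $display("y = %0d", y);
    $finish;
  end
endmodule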

What is the point of "create_clock" command in FPGA design?

In FPGA programming, what is the point of using the create_clock command in the XDC (or UCF) file? Let's say I have a clock port CLK that is assigned to a physical pin (which is my clock) in the XDC (or UCF) file. Why can't I just go ahead and use this CLK pin in my top-level HDL? Why do I need to add something like this:
create_clock -name sys_clk_pin -period "XXX" [get_ports "CLK"]
Also, let's say I have a main clock CLK and some other clocks which I generate in HDL. Do I have to use create_clock for all the minor clocks in the XDC too?
I don't get this whole "create_clock" thing. Any help or direction is much appreciated.
Thanks
Design constraints, as the name suggests, are used to define additional constraints on your design which can't be captured in the HDL description.
Let's take the create_clock command as an example. You specified the clock pin in your HDL description; why isn't this enough? The reason is that a clock signal is not a usual signal - it is used as a reference signal by synchronous logic (flip-flops).
I suppose you're familiar with the concept of "propagation delay" (through logic gates). You want to make sure that all signals originating at one flop and sampled at another will be able to propagate within a single clock cycle. Now, the total propagation delay is known right after synthesis, because each logic gate in the FPGA has an associated propagation delay (just sum these up). But how do your analysis tools know the maximal allowed propagation delay? You do not specify these constraints in HDL, right? This is one of the cases where the frequency you specified with the create_clock command is used: it is converted to a period, and the analysis tool will warn you if any combinational path in your design takes longer to propagate than the clock's period.
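To make that concrete (a hedged example - the numbers are purely illustrative): a constraint such as
create_clock -name sys_clk -period 10.000 [get_ports CLK]
declares a 100 MHz clock. If a register-to-register path accumulates 12 ns of logic and routing delay, the analysis tool reports a setup violation with -2 ns of slack, because the data cannot traverse the path within one 10 ns period.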
The above example describes one of the actions performed by Static Timing Analysis (STA) tools in which "design constraints" are employed.
Another kind of tool that makes extensive use of design constraints is the Clock Domain Crossing (CDC) tool. These tools are employed in designs containing more than one clock. The CDC concepts are described brilliantly here
In case you take one clock and generate another one from it (with a clock divider, for example), you want to make the CDC tool aware of this, because the fact that these clocks are related is important. The way to inform the CDC tool that the clocks are related is to use the create_generated_clock constraint.
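As a hedged sketch of what that looks like (the register name clk_div_reg is a hypothetical placeholder for wherever your divider's output flop actually is):
create_generated_clock -name clk_div2 -source [get_ports CLK] -divide_by 2 [get_pins clk_div_reg/Q]
This tells the tools that clk_div2 runs at half the frequency of CLK and is phase-related to it, so paths between the two domains are timed instead of being treated as asynchronous.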
NOTE: the above examples are basic and by no means comprehensive.
