Circuit goes wrong when synthesized with a tight constraint

Circuit goes wrong when synthesized with a tight constraint - verilog

I have a RTL code.
At first, I synthesized the circuit at 10 ns and run post-synthesis simulation. The circuit worked well.
After that, I changed the timing constraint to 7 ns and re-synthesized the code using:
compile_ultra -retime
DC reported that the circuit has met timing requirements (slack = 0) and there is no design rule violation either. However, the netlist couldn't pass post-synthesis simulation. Does anyone know why?

I have found that Xilinx gate level simulation have (had?) a flaw when running at very high frequencies. This was 10+ years ago so things might have changed!
In my case I was simulating logic running at 300MHz. The results where baffling so I pulled in the most important signals in the waveform display.
The problem turned out to be the clock. The delay in the clock tree is simulated by lumping all the delay in the IBUF buffer. The clock tree behaviour is that of a net- or transport delay: The pulse going in will come out after a while. The IBUF delay model should therefore use a non-blocking delay:
always #( I)
O <= #delay_time I;
But it does not. Instead it uses a standard O = I; blocking statement which gets SDF annotated.
Thus if the high/low period of the input frequency to the buffer is longer then the IBUF delay, clock edges get lost and your gate level simulation fails.
I don't know if Xilinx have fixed that but I would say check your clock.

Related

Verilog module generates impossible data when running on FPGA

I’m developing a boolean data logger on a ZYNQ 7000 SoC. The logger takes a boolean input from GPIO and logs the input’s value and the time it takes to flip.
I use a 32-bit register as a log entry, the MSB bit is the boolean value. The 30:0 bits is an unsigned integer which records the time between last 2 flips. The logger should work like the following picture.
Here's my implementation of the logger in Verilog. To read the logged data from the processor, I use an AXI slave interface generated by vivado and inline my logger in the AXI module.
module BoolLogger_AXI #(
parameter BufferDepth = 512
)(
input wire data_in, // boolean input
input wire S_AXI_ACLK, // clock
input wire S_AXI_ARESETN, // reset_n
// other AXI signals
);
wire slv_reg_wren; // write enable of AXI interface
reg[31:0] buff[0:BufferDepth-1];
reg[15:0] idx;
reg[31:0] count;
reg last_data;
always #(posedge S_AXI_ACLK) begin
if((!S_AXI_ARESETN) || slv_reg_wren) begin
idx <= 0;
count <= 1;
last_data <= data_in;
end else begin
if(last_data!=data_in) begin // add an entry only when input flips
last_data <= data_in;
if(idx < BufferDepth) begin // stop logging if buffer is full
buff[idx] <= count | (data_in << 31);
idx <= idx + 1;
end
count <= 1;
end else begin
count <= count + 1;
end
end
end
//other AXI stuff
endmodule
In the AXI module, the 512*32bit logged data is mapped to addresses from 0x43c20000 to 0x43c20800.
In the Verilog code, the logger adds a new entry only when the boolean input flips. In simulation, the module works as expected. But in the FPGA, sometimes the logged data is not valid. There are successive 2 data and their MSB bit is the same, which means the entry is added even when the boolean input stays the same.
The invalid data appear from time to time. I've tried reading from the address programmatically (*(u32*)(0x43c20000+4*idx)), and there are still invalid data. I watch idx in a ILA module and idx is 512, which means the logging finishes when I read the data.
The FPGA clock is 10 MHz. The input signal is 10 Hz. So the typical period is 10e6/10/2=0x7A120, which most of the data is close to, except the invalid data.
I think if the Verilog code is implemented well, there should be no such invalid data. What may be the problem? Is this an issue about timing?
The code

First off, are you absolutely sure you are not issuing an accidental write on the AXI bus, resetting the registers?
If so, have you tried inserting a so-called double-flop on data_in (two flip-flops, delaying the signal two clock ticks)? I suppose that your data_in is not synchronous to the FPGA clock, which will lead to metastability and you having bad days if not accounted for. Have a look here for information by NANDLAND.
Citing the linked source:
If you have ever tried to sample some input to your FPGA, such as a button press, or if you have had to cross clock domains, you have had to deal with Metastability. A metastable state is one in which the output of a Flip-Flop inside of your FPGA is unknown, or non-deterministic. When a metastable condition occurs, there is no way to tell if the output of your Flip-Flop is going to be a 1 or a 0. A metastable condition occurs when setup or hold times are violated.
Metastability is bad. It can cause your FPGA to exhibit very strange behavior.
In that source there is also a link to a white paper from Altera about the topic of metastability, linked here for reference.
Citing from that paper:
When a metastable signal does not resolve in the allotted time, a logic failure can result if the destination logic observes inconsistent logic states, that is, different destination registers capture different values for the metastable signal.
and
To minimize the failures due to metastability in asynchronous signal transfers, circuit designers typically use a sequence of registers (a synchronization register chain or synchronizer) in the destination clock domain to resynchronize the signal to the new clock domain. These registers allow additional time for a potentially metastable signal to resolve to a known value before the signal is used in the rest of the design.
Basically having the asynchronous signal routed to two flip-flops might for example lead to one FF reading a 1 and one FF reading a 0. This in turn could lead to the data point being saved, but the counter not being reset to 0 (hence doubling the measured time) and the bit being saved as 0.
Finally, it seems to me, that you are using the Vivado-generated example AXI core. Dan Gisselquist simply calls it "broken". This might not be the problem here, but you might want to have a look at his posts and his AXI core design.

How to set a signal high X-time before rising edge of clock cycle?

I have a signal that checks if the data is available in memory block and does some computation/logic (Which is irrelevant).
I want a signal called "START_SIG" to go high X-time (nanoseconds) before the first rising edge of the clock cycle that is at 10 MHz Frequency. This only goes high if it detects there is data available and does further computation as needed.
Now, how can this be done? Also, I cannot set a delay since this must be RTL Verilog. Therefore, it must be synthensizable on an FPGA (Artix7 Series).
Any suggestions?

I suspect an XY problem, if start sig is produced by logic in the same clock domain as your processing then timing will likely be met without any work on your part (10MHz is dead slow in FPGA terms), but if you really needed to do something like this there are a few ways (But seriously you are doing it wrong!).
FPGA logic is usually synchronous to one or more clocks,generally needing vernier control within a clock period is a sign of doing it wrong.
Use a {PLL/MCM/Whatever} to generate two clocks, one dead slow at 10Mhz, and something much faster, then count the fast one from the previous edge of the 10MHz clock to get your timing.
Use an MCMPLL or such (platform dependent) to generate two 10Mhz clocks with a small phase shift, then gate one of em.
Use a long line of inverter pairs (attribute KEEP (VHDL But verilog will have something similar) will be your friend), calibrate against your known clock periodically (it will drift with temperature, day of the week and sign of the zodiac), this is neat for things like time to digital converters, possibly combined with option two for fine trimming. Shades of ring oscs about this one, but whatever works.

Fast array inner access in verilog

I have some lines of code below:
wire [WIDTH_PIXEL-1:0] x_vector [0:36];
wire [6-1:0] x_sample [0:511]; // 0 <= x_sample <= 36
reg [WIDTH_PIXEL-1:0] rx_512 [0:511];
genvar p;
generate
for(p=0;p<=511;p=p+1) begin: PPP
always#(posedge clk) begin
if(x_sample[p] == counter2) begin
rx_512[p] <= x_vector[x_sample[p]];
end
end
I want to save 512 x_vector elements whose address is the value of x_sample[p]. The problem is when I synthesize on Quartus, the total LC-combinationals over 50000. I know the problem lies on the line
rx_512[p] <= x_vector[x_sample[p]];
So is there any way for improving the access memory? Thank you.

Keep in mind that Verilog is meant as a hardware emulation language.
This makes that you have to learn to write two different types of code:
Code that gets converted to hardware
Test bench code
For the former there are a lot more restrictions. As you correctly noticed you get 512 comparators each comparing 6 bits plus each conditionally selecting one of 37 PIXELWIDTH values and assigning it to one of 512 PIXELWIDTH destinations. My guess is easily a million gates.
You have to use a divide an conquer approach. As Qiu says make the code sequential: One operation per clock cycle. It will take more clock cycles but a lot less logic. Unfortunately you might find out that you do not have enough time to e.g. process a whole image in that (frame?) time. Then choose to do two or four operations per cycle.
You have to continuously weigh speed versus number of gates & power. Maybe you find out that you can't do the operations at all with the chosen hardware. (Nobody said writing Verilog was easy!)
I don't know if it helps but you can make the compiler/optimizer's life a bit easier if you use:
rx_512[p] <= x_vector[counter2];

Verilog: Common bus implementation issue

I've been coding a 16-bit RISC microprocessor in Verilog, and I've hit yet another hurdle. After the code writing task was over, I tried to synthesize it. Found a couple of accidental mistakes and I fixed them. Then boom, massive error.
The design comprises of four 16-bit common buses. For some reason, I'm getting a multiple driver error for these buses from the synthesis tool.
The architecture of the computer is inspired by and is almost exactly the same as the Magic-1 by Bill Buzzbee, excluding the Page Table mechanism. Here's Bill's schematics PDF: Click Here. Scroll down to page 7 for the architecture.
The control matrix is responsible for handling when the buses and driven, and I am absolutely sure that there is only one driver for each bus at any given instance. I was wondering whether this could be the problem, since the synthesis tool probably doesn't know this.
Tri-state statements enable writing to a bus, for example:
assign io [width-1:0] = (re)?rd_out [width-1:0]:0; // Assign IO Port the value of memory at address add if re is true.
EDIT: I forgot to mention, the io port is bidirectional (inout) and is simply connected to the bus. This piece of code is from the RAM, single port. All other registers other than the RAM have separate input and output ports.
The control matrix updates a 30-bit state every negative edge, for example:
state [29:0] <= 30'b100000000010000000000000100000; // Initiate RAM Read, Read ALU, Write PC, Update Instruction Register (ins_reg).
The control matrix is rather small, since I only coded one instruction to test out the design before I spent time on coding the rest.
Unfortunately, it's illogical to copy-paste the entire code over here.
I've been pondering over this for quite a few days now, and pointing me over to the right direction would be much appreciated.

When re is low, the assign statement should be floating (driving Zs).
// enable ? driving : floating
assign io [width-1:0] = (re) ? rd_out [width-1:0] : {width{1'bz}};
If it is driving any other value then the synthesizer will treat is as a mux and not a tri-state. This is where the conflicting driver message come from.

Asynchronous FIFO Design

I found the following piece of code in the internet , while searching for good FIFO design. From the linkSVN Code FIFO -Author Clifford E. Cummings . I did some research , I was not able to figure out why there are three pointers in the design ?I can read the code but what am I missing ?
module sync_r2w #(parameter ADDRSIZE = 4)
(output reg [ADDRSIZE:0] wq2_rptr,
input [ADDRSIZE:0] rptr,
input wclk, wrst_n);
reg [ADDRSIZE:0] wq1_rptr;
always #(posedge wclk or negedge wrst_n)
if (!wrst_n) {wq2_rptr,wq1_rptr} <= 0;
else {wq2_rptr,wq1_rptr} <= {wq1_rptr,rptr};
endmodule
module sync_w2r #(parameter ADDRSIZE = 4)
(output reg [ADDRSIZE:0] rq2_wptr,
input [ADDRSIZE:0] wptr,
input rclk, rrst_n);
reg [ADDRSIZE:0] rq1_wptr;
always #(posedge rclk or negedge rrst_n)
if (!rrst_n) {rq2_wptr,rq1_wptr} <= 0;
else {rq2_wptr,rq1_wptr} <= {rq1_wptr,wptr};
endmodule

What you are looking at here is what's called a dual rank synchronizer. As you mentioned this is an asynchronous FIFO. This means that the read and write sides of the FIFO are not on the same clock domain.
As you know flip-flops need to have setup and hold timing requirements met in order to function properly. When you drive a signal from one clock domain to the other there is no way to guarantee this requirements in the general case.
When you violate these requirements FFs go into what is called a 'meta-stable' state where there are indeterminate for a small time and then (more or less) randomly go to 1 or 0. They do this though (and this is important) in much less than one clock cycle.
That's why the two layers of flops here. The first has a chance of going meta-stable but should resolve in time to be captured cleanly by the 2nd set of flops.
This on it's own is not enough to pass a multi-bit value (the address pointer) across clock domains. If more than one bit is changing at the same time then you can't be sure that the transition will be clean on the other side. So what you'll see often in these situations is that the FIFO pointers will by gray coded. This means that each increment of the counter changes at most one bit at a time.
e.g. Rather than 00 -> 01 -> 10 -> 11 -> 00 ... it will be 00 -> 01 -> 11 -> 10 -> 00 ...
Clock domain crossing is a deep and subtle subject. Even experienced designers very often mess them up without careful thought.
BTW ordinary Verilog simulations will not show anything about what I just described in a zero-delay sim. You need to do back annotated SDF simulations with real timing models.

In this example, the address is passed through the shift register in order for it to be delayed by one clock cycle. There could have been more “pointers” in order to delay the output even more.
Generally, it is easier to understand what is going on and why if you simulate the design and look at the waveform.
Also, here are some good FIFO implementations you can look at:
Xilinx FIFO Generator IP Core
Altera's single/double-clock FIFOs
OpenCores Generic FIFOs
Hope it helps. Good Luck!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string