Fast array inner access in Verilog

I have some lines of code below:
wire [WIDTH_PIXEL-1:0] x_vector [0:36];
wire [6-1:0] x_sample [0:511]; // 0 <= x_sample <= 36
reg [WIDTH_PIXEL-1:0] rx_512 [0:511];
genvar p;
generate
for(p=0;p<=511;p=p+1) begin: PPP
always @(posedge clk) begin
if(x_sample[p] == counter2) begin
rx_512[p] <= x_vector[x_sample[p]];
end
end
end
endgenerate
I want to store 512 x_vector elements, each addressed by the value of x_sample[p]. The problem is that when I synthesize in Quartus, the total LC combinationals exceed 50000. I know the problem lies in the line
rx_512[p] <= x_vector[x_sample[p]];
So is there any way to improve the memory access? Thank you.

Keep in mind that Verilog is a hardware description language.
This means you have to learn to write two different types of code:
Code that gets converted to hardware
Test bench code
For the former there are a lot more restrictions. As you correctly noticed, you get 512 comparators, each comparing 6 bits, and each of the 512 destinations also conditionally selects one of 37 WIDTH_PIXEL-wide values. My guess is easily a million gates.
You have to use a divide-and-conquer approach. As Qiu says, make the code sequential: one operation per clock cycle. It will take more clock cycles but a lot less logic. Unfortunately, you might find that you do not have enough time to, e.g., process a whole image in that (frame?) time. Then choose to do two or four operations per cycle.
You have to continuously weigh speed versus number of gates & power. Maybe you find out that you can't do the operations at all with the chosen hardware. (Nobody said writing Verilog was easy!)
I don't know if it helps, but you can make the compiler/optimizer's life a bit easier by indexing with counter2, which equals x_sample[p] whenever the assignment executes:
rx_512[p] <= x_vector[counter2];
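Because the index no longer depends on p, the read of x_vector can be done once and shared by all 512 registers. A minimal sketch of that idea follows. The module name, the flattened ports (x_vector_flat, x_sample_flat, rx_512_flat) and the default WIDTH_PIXEL are assumptions made so the example is self-contained in plain Verilog-2001; they are not from the original design.

```verilog
// Sketch only: one shared 37-to-1 read instead of 512 separate ones.
// Port names and the flattened array layout are hypothetical.
module sample_capture #(
    parameter WIDTH_PIXEL = 8            // assumed default
) (
    input  wire                       clk,
    input  wire [5:0]                 counter2,       // expected range 0..36
    input  wire [37*WIDTH_PIXEL-1:0]  x_vector_flat,  // x_vector[0..36], entry 0 in the low bits
    input  wire [512*6-1:0]           x_sample_flat,  // x_sample[0..511], entry 0 in the low bits
    output wire [512*WIDTH_PIXEL-1:0] rx_512_flat     // rx_512[0..511], entry 0 in the low bits
);
    // Single shared multiplexer: selects x_vector[counter2] once.
    // If counter2 exceeds 36 the part-select is out of range and reads as x in simulation.
    wire [WIDTH_PIXEL-1:0] selected =
        x_vector_flat[counter2*WIDTH_PIXEL +: WIDTH_PIXEL];

    genvar p;
    generate
        for (p = 0; p < 512; p = p + 1) begin : capture
            reg [WIDTH_PIXEL-1:0] rx;
            // Each register only needs a 6-bit comparator used as a load enable.
            always @(posedge clk) begin
                if (x_sample_flat[p*6 +: 6] == counter2)
                    rx <= selected;
            end
            assign rx_512_flat[p*WIDTH_PIXEL +: WIDTH_PIXEL] = rx;
        end
    endgenerate
endmodule
```

This keeps the 512 comparators but replaces the 512 per-register multiplexers with one, which is where most of the logic in the original line goes.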

Related

Error "does map to unexpanded memory" in yosys verilog when using indexed part select

I'm having trouble understanding this error I'm getting in Yosys.
I copied the relevant (I think) code below.
reg signed [15:0] wb1 [0:131071];
reg signed [27:0] currentAttrWB [0:4094];
always @(posedge clk)
currentAttrWB <= wb1[attrWBoffset +: 4094];
ERROR (on last line): "currentAttrWB does map to an unexpanded memory!"
What I'm trying to do is select a sub-range of 4095 16-bit words from a long array of 131072 16-bit words, using an indexed part select (+:). This range would be offset within the long array by attrWBoffset. But obviously there is something I am not understanding. At first I thought this was because currentAttrWB was not large enough to contain 4095 16-bit words, but I still get the error after bumping its word width up to 28 bits.
I guess I need to understand what is meant by expanded and unexpanded.
Thanks for your help.
Yosys does not support assigning to a whole memory, i.e. putting anything defined as reg [x:0] mem[0:y]; on the left-hand side of an assignment without an index. I am not sure if this is a Yosys limitation or a Verilog one, but such a pattern makes little sense for an FPGA application, where memories are accessed one element at a time.
If Yosys could implement such a pattern, it would have to map the memory to LUTs and flip-flops rather than use dedicated RAM resources, because such a simultaneous transfer isn't possible with block RAM. There are very few FPGAs with over 2 million flip-flops available, and if you had one you probably wouldn't want to fill it up with something like this.
What you probably want is a counter that counts from 0 to 4094 and copies one entry every clock cycle until it completes.
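A rough sketch of such a counter-driven copy is shown below. The module name and the start, done, busy and idx signals are hypothetical, the destination is kept at 16 bits to match the source words, and how wb1 gets loaded is left out. It also assumes attrWBoffset + 4094 stays within 0..131071.

```verilog
// Sketch only: copy 4095 words, one per clock, instead of all at once.
// start/done/busy/idx are illustrative names, not from the original code.
module attr_copy (
    input  wire        clk,
    input  wire        start,          // pulse to begin a copy
    input  wire [16:0] attrWBoffset,   // first source address
    output reg         done            // high for one cycle when the copy finishes
);
    reg signed [15:0] wb1           [0:131071];  // loading of wb1 is omitted here
    reg signed [15:0] currentAttrWB [0:4094];

    reg [11:0] idx;    // counts 0..4094
    reg        busy;

    initial begin
        busy = 1'b0;
        done = 1'b0;
        idx  = 12'd0;
    end

    always @(posedge clk) begin
        done <= 1'b0;
        if (start && !busy) begin
            busy <= 1'b1;
            idx  <= 12'd0;
        end else if (busy) begin
            // One element per clock: both memories are accessed with an index.
            currentAttrWB[idx] <= wb1[attrWBoffset + idx];
            if (idx == 12'd4094) begin
                busy <= 1'b0;
                done <= 1'b1;
            end else begin
                idx <= idx + 12'd1;
            end
        end
    end
endmodule
```

Block RAM inference usually also needs the read of wb1 to be registered, which would add one pipeline stage; that detail is left out to keep the sketch short.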

How can I improve my code to reduce the synthesis time?

I have written some code in verilog for a median filter using a cumulative histogram method. When I try to synthesize my code in xilinx it's processing up to 1 hour and finally shows an error, "program ran out of memory".
My code is:
//***** MEDIAN FILTER BY USING CUMULATIVE HISTOGRAM METHOD******//
module medianfilter(median_out,clk,a1,a2,a3,a4,a5,a6,a7,a8,a9);
output [7:0] median_out;
input [7:0]a1,a2,a3,a4,a5,a6,a7,a8,a9;
integer i,j;
reg[7:0]b[255:0];
reg [7:0]buff[0:8];
input clk;
reg [7:0]median_out;
always @(negedge clk)
begin
//**************************************************************************//
for(i=0;i<256;i=i+1) // initilize the memory bins with zeros
b[i]=0;
//*************************************************************************//
buff[0]=a1;
buff[1]=a2;
buff[2]=a3;
buff[3]=a4;
buff[4]=a5;
buff[5]=a6;
buff[6]=a7;
buff[7]=a8;
buff[8]=a9;
for(i=0;i<9;i=i+1) // this loop is for cumulative histogram method
begin
b[buff[i]]=b[buff[i]]+1; // incrementing the value in b[i]th memory address
for(j=0;j<256;j=j+1)
if(j>buff[i])
b[j]=b[j]+1; // incrementing the bins below b[i]th bin
end
//**************************************************************************//
for(i=0;i<256;i=i+1) // loop for finding the median
begin
if(b[i]>4) ///////// condition for checking median
begin
b[i]=1;
median_out=i;
i=256; // loop breaks here
end
end
//*************************************************************************//
end
endmodule
How can I make the code synthesizable?
How many adders are generated by your code? I see at least 2,100 8-bit adders which are working in the same cycle.
You should rethink your algorithm: A median filter needs an ordered list of pixel values, so at first you should think about efficient ordering of numbers on FPGAs.
A good approach is a sorting network, such as:
Odd-even-merge sort or
Bitonic sort.
Sorting 9 numbers can't be done in one cycle so you need pipelining. (You can do it, but at very low clock speed.)
Our PoC-Library contains pipelined sorting networks, but I have never tested these networks with a non-power-of-two input size!
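To illustrate what a pipelined sorting network looks like, here is a minimal sketch for three inputs, built from compare-exchange steps with one register stage per step. It is not taken from the PoC-Library, and the module and signal names are made up for this example. A 9-input median network uses the same building block, just with more stages and more compare-exchange elements.

```verilog
// Sketch only: median of three 8-bit values, one compare-exchange per pipeline stage.
// Result is valid three clock cycles after the inputs are applied.
module median3_pipelined (
    input  wire       clk,
    input  wire [7:0] a,
    input  wire [7:0] b,
    input  wire [7:0] c,
    output wire [7:0] median
);
    reg [7:0] s1_lo, s1_hi, s1_c;   // after stage 1
    reg [7:0] s2_lo, s2_mid;        // after stage 2 (the maximum is dropped)
    reg [7:0] s3_mid;               // after stage 3

    always @(posedge clk) begin
        // Stage 1: compare-exchange (a, b); c is only delayed.
        s1_lo <= (a < b) ? a : b;
        s1_hi <= (a < b) ? b : a;
        s1_c  <= c;

        // Stage 2: compare-exchange (s1_hi, s1_c); keep the smaller one,
        // the larger one is the overall maximum and is not needed.
        s2_lo  <= s1_lo;
        s2_mid <= (s1_hi < s1_c) ? s1_hi : s1_c;

        // Stage 3: compare-exchange (s2_lo, s2_mid); the larger one is the median.
        s3_mid <= (s2_lo < s2_mid) ? s2_mid : s2_lo;
    end

    assign median = s3_mid;
endmodule
```

Each stage contains a single 8-bit comparator, so the clock speed is set by one comparison rather than by the whole sort, and a new input set can be accepted every cycle.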
I agree with everything that @Paebbels had to say here. However, there are some additional considerations. How fast is the data coming in? Do you get a new set of 9 values to sort every clock cycle? If not, you could spread the operation over several cycles and use many fewer adders and fewer register stages, even to the point of using a single adder and storing the results in a block RAM (although this would be much slower). Also, you haven't mentioned which FPGA you are using (although I'm guessing it is a small one). Any FPGA design needs to take into account the available resources on the target device. You could also directly instantiate DSP48 multiplier-accumulators as adders if they are not used elsewhere in your design, once again depending on how many you need and how many are available on the device.

Implementing CRC16 in verilog with dynamic data packet length

Thank you for reading this and for all of your help. Anyway... I am trying to implement a CRC16 with polynomial x^16 + x^12 + x^5 + 1 in Verilog. The problem I have encountered is that I don't get the entire packet of data at one point in time. I get a 32-bit word at a time, and the number of words is dynamic but is at least 4 words and can be as high as 16384 words or higher. Time is not much of an issue because I am running on a 150 MHz clk and the input comes in on at most a 33 MHz clk, but possibly as slow as 10 MHz. This does not really affect me because I first accept the data via a FIFO.
I have been trying to develop an FSM but have really hit a roadblock. One idea is to wait for all the data and then input the entire thing as one big data packet; however, this seems really inefficient and I don't think I need to do this. Plus it could take up valuable resources. Another approach I was playing with was to input the first word and do the XOR operation. Then, when the input data has only 1 to 2 bits left that are not XORed (not sure if that is worded correctly), I would input the next word. After that input I would continue to compute the CRC, followed by another input, until the last word has been input into the module.
With this method I would need to implement a counter or a shift register in some fashion. Anyway, any help would be nice. This goes into a command parser/packet parser. Thank you so much for your help.
A CRC calculation doesn't need to be done serially 1-bit at a time. You can essentially "unroll" the calculation to come up with the individual equations for each bit of a parallel CRC generator. With that, you can create a CRC generator that processes 32-bits of input data at a time, matching your datapath width. This should simplify your design as well as make it higher performance (processing each bit serially wouldn't meet your throughput requirements anyway, unless you don't mind holding off incoming data while the hw generates the CRC).
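One way to get the unrolled equations without deriving them by hand is to write the serial update as a loop inside a function and let synthesis unroll it. The sketch below does that for the polynomial x^16 + x^12 + x^5 + 1 (0x1021) with a 32-bit input word. The module and port names, the MSB-first bit order and the 0xFFFF initial value are assumptions; the real protocol may use a different bit order, initial value or final XOR.

```verilog
// Sketch only: CRC-16 (polynomial 0x1021) updated with 32 data bits per clock.
// Bit order (MSB first) and the initial value 16'hFFFF are assumptions.
module crc16_32bit (
    input  wire        clk,
    input  wire        init,        // synchronous: start a new packet
    input  wire        data_valid,  // one 32-bit word is present this cycle
    input  wire [31:0] data,
    output reg  [15:0] crc
);
    // Applies the 1-bit serial update 32 times. The loop bounds are constant,
    // so synthesis flattens this into XOR equations for each CRC bit.
    function [15:0] crc_next;
        input [15:0] crc_in;
        input [31:0] d;
        integer      i;
        reg   [15:0] c;
        reg          fb;
        begin
            c = crc_in;
            for (i = 31; i >= 0; i = i - 1) begin
                fb = c[15] ^ d[i];
                c  = {c[14:0], 1'b0} ^ (fb ? 16'h1021 : 16'h0000);
            end
            crc_next = c;
        end
    endfunction

    always @(posedge clk) begin
        if (init)
            crc <= 16'hFFFF;
        else if (data_valid)
            crc <= crc_next(crc, data);
    end
endmodule
```

With this structure the packet length does not matter: pulse init before the first word, assert data_valid for each word read from the FIFO, and read crc after the last word.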

Asynchronous FIFO Design

I found the following piece of code on the internet while searching for a good FIFO design (from the link "SVN Code FIFO", author Clifford E. Cummings). I did some research, but I was not able to figure out why there are three pointers in the design. I can read the code, but what am I missing?
module sync_r2w #(parameter ADDRSIZE = 4)
(output reg [ADDRSIZE:0] wq2_rptr,
input [ADDRSIZE:0] rptr,
input wclk, wrst_n);
reg [ADDRSIZE:0] wq1_rptr;
always @(posedge wclk or negedge wrst_n)
if (!wrst_n) {wq2_rptr,wq1_rptr} <= 0;
else {wq2_rptr,wq1_rptr} <= {wq1_rptr,rptr};
endmodule
module sync_w2r #(parameter ADDRSIZE = 4)
(output reg [ADDRSIZE:0] rq2_wptr,
input [ADDRSIZE:0] wptr,
input rclk, rrst_n);
reg [ADDRSIZE:0] rq1_wptr;
always @(posedge rclk or negedge rrst_n)
if (!rrst_n) {rq2_wptr,rq1_wptr} <= 0;
else {rq2_wptr,rq1_wptr} <= {rq1_wptr,wptr};
endmodule
What you are looking at here is what's called a dual-rank synchronizer. As you mentioned, this is an asynchronous FIFO. This means that the read and write sides of the FIFO are not in the same clock domain.
As you know, flip-flops need to have their setup and hold timing requirements met in order to function properly. When you drive a signal from one clock domain to another, there is no way to guarantee these requirements in the general case.
When you violate these requirements, FFs go into what is called a 'meta-stable' state, where they are indeterminate for a short time and then (more or less) randomly go to 1 or 0. They do this, though (and this is important), in much less than one clock cycle.
That's why there are two layers of flops here. The first has a chance of going meta-stable but should resolve in time to be captured cleanly by the second set of flops.
This on its own is not enough to pass a multi-bit value (the address pointer) across clock domains. If more than one bit is changing at the same time, you can't be sure that the transition will be clean on the other side. So what you'll often see in these situations is that the FIFO pointers are Gray coded. This means that each increment of the counter changes only one bit at a time.
e.g. Rather than 00 -> 01 -> 10 -> 11 -> 00 ... it will be 00 -> 01 -> 11 -> 10 -> 00 ...
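As a small illustration of Gray-coded pointers, the sketch below keeps a binary counter for addressing and registers its Gray-coded equivalent, which is the value that would be sent through a synchronizer like the ones above. The module and port names are invented for this example; it is not the code from the FIFO design in the question.

```verilog
// Sketch only: binary pointer plus registered Gray-coded copy.
// The Gray value is what would cross into the other clock domain.
module gray_pointer #(parameter ADDRSIZE = 4) (
    input  wire              clk,
    input  wire              rst_n,
    input  wire              inc,    // advance the pointer by one
    output reg  [ADDRSIZE:0] bin,    // binary pointer, used for memory addressing
    output reg  [ADDRSIZE:0] gray    // Gray-coded pointer, safe to synchronize
);
    wire [ADDRSIZE:0] bin_next  = bin + inc;
    // Binary-to-Gray conversion: XOR each bit with the next more significant bit.
    wire [ADDRSIZE:0] gray_next = (bin_next >> 1) ^ bin_next;

    always @(posedge clk or negedge rst_n)
        if (!rst_n) {bin, gray} <= 0;
        else        {bin, gray} <= {bin_next, gray_next};
endmodule
```

Registering the Gray value matters: the conversion logic can glitch, and only a value that comes straight out of flip-flops is guaranteed to change one bit per increment as seen by the receiving domain.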
Clock domain crossing is a deep and subtle subject. Even experienced designers very often mess them up without careful thought.
BTW, ordinary zero-delay Verilog simulations will not show anything of what I just described. You need to do back-annotated SDF simulations with real timing models.
In this example, the address is passed through a two-stage shift register so that it is delayed by two clock cycles. There could have been more "pointers" in order to delay the output even more.
Generally, it is easier to understand what is going on and why if you simulate the design and look at the waveform.
Also, here are some good FIFO implementations you can look at:
Xilinx FIFO Generator IP Core
Altera's single/double-clock FIFOs
OpenCores Generic FIFOs
Hope it helps. Good Luck!

Synthesis error in Verilog

I am trying to implement the FastICA algorithm in Verilog. I have written the whole code, and it simulates without errors, but when I try to synthesize it I get an error stating: expecting ";" instead of ".".
I am using four floating-point modules for arithmetic calculation, and I have generated 1000 instances of sum, sqrt, etc. using a for loop for the in-between calculations. The following is the generate code:
genvar s;
generate
for(s=1;s<=4000;s=(s+1))
begin:cov_mul_ins
Float32Mul cov_mul (.CLK(clk),
.nRST(1'b1),
.leftArg(dummy_14),
.rightArg(dummy_15),
.loadArgs(1'b1)
);
end
endgenerate
Now I am accessing the individual instances using the Dot operator
for(d=1;d<=2;d=(d+1))
begin
for(e=1;e<=2;e=(e+1))
begin
for(c=1;c<=1000;c=(c+1))
begin
if((d==1)&&(e==1))
begin
dummy_14=centered_data_copy[d][c];
dummy_15=Parent.centered_data_float_trans[c][e];
#10 cov_mul_ins[c].cov_mul(.CLK(clk),
.nRST(1'b1),
.leftArg(dummy_14),
.rightArg(dummy_15),
.loadArgs(1'b1),
.product(cov_temp[c][1])
);
I would be grateful if someone could pin point the error I am making.Thanks!
Couple of things to note:
Out-of-module references can't be synthesised. This means that you can't "peek inside" instantiated modules to look at nets or call functions if you want that code to be synthesisable. It's grand for testbenches though.
Your attempted function call has a delay on it, which will be ignored, i.e. #10 cov_mul_ins[c].cov_mul ( ... );
I can see your thinking in a softwarey lets-put-everything-in-a-class-and-call-methods way. This is perfect for testbenches, but synthesis will complain, as you've seen. When it comes to hardware, well, you need to think of the hardware: ask yourself which blocks you need to build to run your algorithm. For example, if your algorithm needs 30 multiplies on each input sample, then you need either 30 instances of a multiplier, or one multiplier with your 30 operations sequenced through it. Or 15 multipliers, each doing 2 multiplications per sample period, or 10 multipliers doing 3, etc.
Try deleting "#10", because I think it is not synthesizable.
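The "one multiplier, operations sequenced through it" option could look roughly like the sketch below. It uses a plain integer multiply and accumulates the products, because the ports and latency of Float32Mul are not shown in the question. The module name, the flattened operand ports and the start/done handshake are all assumptions, and N is assumed to be at most 256.

```verilog
// Sketch only: N multiply-accumulate operations sequenced through one multiplier.
// Integer arithmetic stands in for the floating-point core; all names are illustrative.
module shared_mac #(
    parameter N = 30,    // operations per sample, assumed <= 256
    parameter W = 16     // operand width
) (
    input  wire             clk,
    input  wire             start,   // pulse to begin a new set of N operations
    input  wire [N*W-1:0]   a_flat,  // N left operands, operand 0 in the low bits
    input  wire [N*W-1:0]   b_flat,  // N right operands, operand 0 in the low bits
    output reg  [2*W+7:0]   acc,     // sum of products, 8 guard bits for up to 256 terms
    output reg              done     // high for one cycle when acc is final
);
    reg [7:0] idx;
    reg       busy;

    initial begin
        busy = 1'b0;
        done = 1'b0;
    end

    always @(posedge clk) begin
        done <= 1'b0;
        if (start && !busy) begin
            busy <= 1'b1;
            idx  <= 8'd0;
            acc  <= 0;
        end else if (busy) begin
            // The only multiplier in the design, reused on every cycle.
            acc <= acc + a_flat[idx*W +: W] * b_flat[idx*W +: W];
            if (idx == N-1) begin
                busy <= 1'b0;
                done <= 1'b1;
            end else begin
                idx <= idx + 8'd1;
            end
        end
    end
endmodule
```

The trade-off is the one described above: this takes N clock cycles per result but needs a single multiplier, whereas N parallel instances would produce a result every cycle at N times the area.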
