I have written some code in Verilog for a median filter using a cumulative histogram method. When I try to synthesize it in Xilinx, it runs for up to an hour and finally fails with the error "program ran out of memory".
My code is:
//***** MEDIAN FILTER BY USING CUMULATIVE HISTOGRAM METHOD******//
module medianfilter(median_out,clk,a1,a2,a3,a4,a5,a6,a7,a8,a9);
output median_out;
input [7:0]a1,a2,a3,a4,a5,a6,a7,a8,a9;
integer i,j;
reg[7:0]b[255:0];
reg [7:0]buff[0:8];
input clk;
reg [7:0]median_out;
always @(negedge clk)
begin
//**************************************************************************//
for(i=0;i<256;i=i+1) // initialize the memory bins with zeros
b[i]=0;
//*************************************************************************//
buff[0]=a1;
buff[1]=a2;
buff[2]=a3;
buff[3]=a4;
buff[4]=a5;
buff[5]=a6;
buff[6]=a7;
buff[7]=a8;
buff[8]=a9;
for(i=0;i<9;i=i+1) // this loop is for cumulative histogram method
begin
b[buff[i]]=b[buff[i]]+1; // incrementing the value in b[i]th memory address
for(j=0;j<256;j=j+1)
if(j>buff[i])
b[j]=b[j]+1; // incrementing the bins below b[i]th bin
end
//**************************************************************************//
for(i=0;i<256;i=i+1) // loop for finding the median
begin
if(b[i]>4) ///////// condition for checking median
begin
b[i]=1;
median_out=i;
i=256; // loop breaks here
end
end
//*************************************************************************//
end
endmodule
How can I make the code synthesizable?
How many adders are generated by your code? I see at least 2,100 8-bit adders which are working in the same cycle.
You should rethink your algorithm: A median filter needs an ordered list of pixel values, so at first you should think about efficient ordering of numbers on FPGAs.
A good approach is to use a sorting network, such as:
Odd-even-merge sort or
Bitonic sort.
Sorting 9 numbers can't be done in one cycle so you need pipelining. (You can do it, but at very low clock speed.)
Our PoC-Library contains pipelined sorting networks, but I have never tested them with an input size that is not a power of two!
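As a rough illustration of the pipelining idea, here is a minimal sketch of a 3x3 median in three register stages. It is not an odd-even merge or bitonic network and it is not taken from the PoC-Library; it uses a smaller, well-known structure that works when only the median is needed: sort each group of three, then take the median of (largest of the minima, median of the middles, smallest of the maxima). The module name, the sort3 helper, the use of posedge clk, and the assumption that a new window arrives every clock are all mine.

```verilog
// Sketch only: pipelined median of nine 8-bit values, 3 clocks of latency.
// Assumes a new 3x3 window (a1..a9) is presented on every rising clock edge.
module median9_sketch (
    input  wire       clk,
    input  wire [7:0] a1, a2, a3, a4, a5, a6, a7, a8, a9,
    output reg  [7:0] median_out
);
    // Hypothetical helper: sorts three bytes, result packed as {max, mid, min}.
    function automatic [23:0] sort3;
        input [7:0] x, y, z;
        reg   [7:0] lo, hi;
        begin
            lo = (x < y) ? x : y;
            hi = (x < y) ? y : x;
            if (z < lo)      sort3 = {hi, lo, z};
            else if (z < hi) sort3 = {hi, z, lo};
            else             sort3 = {z, hi, lo};
        end
    endfunction

    // Stage 1 registers: each group of three, sorted.
    reg [23:0] c0, c1, c2;

    // Stage 2 logic: sort the minima, the middles and the maxima.
    wire [23:0] s_min = sort3(c0[7:0],   c1[7:0],   c2[7:0]);
    wire [23:0] s_mid = sort3(c0[15:8],  c1[15:8],  c2[15:8]);
    wire [23:0] s_max = sort3(c0[23:16], c1[23:16], c2[23:16]);
    reg  [7:0]  max_of_min, med_of_mid, min_of_max;

    // Stage 3 logic: the median of the three candidates is the overall median.
    wire [23:0] s_fin = sort3(max_of_min, med_of_mid, min_of_max);

    always @(posedge clk) begin
        c0 <= sort3(a1, a2, a3);
        c1 <= sort3(a4, a5, a6);
        c2 <= sort3(a7, a8, a9);

        max_of_min <= s_min[23:16];
        med_of_mid <= s_mid[15:8];
        min_of_max <= s_max[7:0];

        median_out <= s_fin[15:8];
    end
endmodule
```

Each stage contains only a few 8-bit comparators, so this should synthesize quickly, in contrast to the thousands of adders implied by the histogram loops.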
I agree with everything that @Paebbels had to say here. However, there are some additional considerations. How fast is the data coming in? Do you get a new set of 9 values to sort every clock cycle? If not, you could spread the operation over several clock cycles and reuse the hardware, which needs far fewer adders and register stages, even to the point of using a single adder and storing the results in a block RAM (although this would be much slower). Also, you haven't mentioned which FPGA you are using (although I'm guessing it is a small one). Any FPGA design needs to take into account the resources available on the target device. You could also directly instantiate DSP48 multiplier-accumulators as adders if they are not used elsewhere in your design, again depending on how many you need and how many the device provides.
Related
I'm having trouble understanding this error I'm getting in Yosys.
I copied the relevant (I think) code below.
reg signed [15:0] wb1 [0:131071];
reg signed [27:0] currentAttrWB [0:4094];
always @(posedge clk)
currentAttrWB <= wb1[attrWBoffset +: 4094];
ERROR (on last line): "currentAttrWB does map to an unexpanded memory!"
What I'm trying to do is select a sub-range of 4095 16bit words in a long array of 131072 16bit words, using indexed part select (+:). This range would be offset within the long array using attrWBoffset. But obviously there is something I am not understanding. At first I thought this was because currentAttrWB was not large enough to contain 4095 16bit words. But I still get the error when bumping its register up to 28bits.
I guess I need to understand what is meant by expanded and unexpanded.
Thanks for your help.
Yosys does not support assigning to a whole memory at once, i.e. having anything declared as reg [x:0] mem[0:y]; on the left-hand side of an assignment. I am not sure if this is a Yosys limitation or a Verilog one, but such a pattern makes little sense for an FPGA application, where memories are accessed one element at a time.
If Yosys could implement such a pattern, it would have to map the memory to LUTs and flipflops rather than use dedicated RAM resources, as such a simultaneous transfer isn't possible with block RAM. There are very few FPGAs with over 2 million flipflops available, and if you had one you probably wouldn't want to fill it up with something like this.
What you probably want is a counter that counts from 0 to 4094 and copies one entry every clock cycle until it completes.
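A minimal sketch of such a counter-driven copy is shown below. The start and busy signals, the module name, and the one-cycle registered read are my assumptions, not part of the original design; the ports that would fill wb1 and read back currentAttrWB are left out for brevity, and I have made the destination 16 bits wide to match the source.

```verilog
// Sketch only: copy 4095 words from wb1 into currentAttrWB, one per clock.
module copy_window_sketch (
    input  wire        clk,
    input  wire        start,         // hypothetical: pulse to begin a copy
    input  wire [16:0] attrWBoffset,  // caller must keep offset + 4094 <= 131071
    output reg         busy           // high while read addresses are being issued
);
    reg signed [15:0] wb1           [0:131071]; // assumed to be filled elsewhere
    reg signed [15:0] currentAttrWB [0:4094];

    reg        [11:0] idx;      // index being read this cycle (0..4094)
    reg        [11:0] idx_d;    // same index, delayed to line up with rd_data
    reg               valid_d;  // rd_data holds a word that must be written
    reg signed [15:0] rd_data;  // registered read, as block RAM needs

    initial begin
        busy    = 1'b0;
        valid_d = 1'b0;
    end

    always @(posedge clk) begin
        // Read side: one synchronous read per clock.
        rd_data <= wb1[attrWBoffset + idx];
        idx_d   <= idx;
        valid_d <= busy;

        // Write side: store the word read on the previous clock.
        if (valid_d)
            currentAttrWB[idx_d] <= rd_data;

        // Counter: runs from 0 to 4094, then stops.
        if (start && !busy) begin
            busy <= 1'b1;
            idx  <= 12'd0;
        end else if (busy) begin
            if (idx == 12'd4094)
                busy <= 1'b0;
            else
                idx <= idx + 12'd1;
        end
    end
endmodule
```

Note that the last word is written one clock after busy drops, so the copy is complete once both busy and valid_d are low.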
I've been having trouble the last little while with a project that uses look up table arrays quite a bit and getting yosys to infer them as block ram. Yosys keeps thinking one or the other of my arrays should be implemented using logic cells.
example:
reg signed [11:0] dynamicBuffer [0:2047];
gets inferred as IceStorm LC and thus quickly overflows my logic cell budget.
Info: Device utilisation:
Info: ICESTORM_LC: 83524/ 7680 1087%
Info: ICESTORM_RAM: 18/ 32 56%
Info: SB_IO: 36/ 256 14%
Info: SB_GB: 8/ 8 100%
Info: ICESTORM_PLL: 2/ 2 100%
Info: SB_WARMBOOT: 0/ 1 0%
I've read that an array needs to have a registered output or Yosys does not see it as ram (is this true?) I've tried to rework things such that my arrays are ultimately routed to a register at each clock count. But I still cannot get it working. What is the right way of working with multiple arrays, copying them one to the other, and getting yosys to infer them as block ram? What do I need to avoid?
I've read that an array needs to have a registered output or Yosys does not see it as ram (is this true?)
Yes, it's true. The only way I could get Yosys to infer BRAM was to make it a dual-clock (asynchronous) BRAM with a data input and a registered data output. I took this from my VFD display controller project:
module GRAM(
input R_CLK,
input W_CLK,
input [7:0]GRAM_IN,
output reg [7:0]GRAM_OUT,
input [11:0] GRAM_ADDR_R,
input [11:0] GRAM_ADDR_W,
input G_CE_W,
input G_CE_R);
reg [7:0] mem [3002:0];// change this to what you want.
initial mem[0] <= 255;// fill the first byte to let Yosys infer to BRAM.
always @(posedge R_CLK) begin// reading from RAM sync with reading clock.
if(G_CE_R)
GRAM_OUT <= mem[GRAM_ADDR_R];
end
always @(posedge W_CLK) begin// writing to RAM sync with writing clock.
if(G_CE_W)
mem[GRAM_ADDR_W] <= GRAM_IN;
end
endmodule// GRAM
Assuming you are using a for loop like in your last example, for loops in hardware do not run sequentially like in software but do the entire copy in one clock. You need to use a counter and a state machine for the copy, not a Verilog for loop.
The solution in this case was to implement an asynchronous fifo.
I was crossing clock domains when I connected the two modules so needed to synchronize reading and writing from the array. The lack of coordination between reads and writes to the array was causing yosys to infer that the array was not to be implemented as block ram.
I have some lines of code below:
wire [WIDTH_PIXEL-1:0] x_vector [0:36];
wire [6-1:0] x_sample [0:511]; // 0 <= x_sample <= 36
reg [WIDTH_PIXEL-1:0] rx_512 [0:511];
genvar p;
generate
for(p=0;p<=511;p=p+1) begin: PPP
always @(posedge clk) begin
if(x_sample[p] == counter2) begin
rx_512[p] <= x_vector[x_sample[p]];
end
end
end
endgenerate
I want to save 512 x_vector elements whose address is the value of x_sample[p]. The problem is that when I synthesize in Quartus, the total number of LC combinationals exceeds 50,000. I know the problem lies in the line
rx_512[p] <= x_vector[x_sample[p]];
So is there any way for improving the access memory? Thank you.
Keep in mind that Verilog is meant as a hardware description language.
This means you have to learn to write two different kinds of code:
Code that gets converted to hardware
Test bench code
For the former there are a lot more restrictions. As you correctly noticed, you get 512 comparators, each comparing 6 bits, plus each conditionally selecting one of 37 WIDTH_PIXEL-bit values and assigning it to one of 512 WIDTH_PIXEL-bit destinations. My guess is easily a million gates.
You have to use a divide-and-conquer approach. As Qiu says, make the code sequential: one operation per clock cycle. It will take more clock cycles but a lot less logic. Unfortunately, you might find that you then do not have enough time to, e.g., process a whole image within the (frame?) time. In that case, choose to do two or four operations per cycle.
You have to continuously weigh speed versus number of gates & power. Maybe you find out that you can't do the operations at all with the chosen hardware. (Nobody said writing Verilog was easy!)
I don't know if it helps but you can make the compiler/optimizer's life a bit easier if you use:
rx_512[p] <= x_vector[counter2];
Thank you for reading this and for all of your help. Anyway...I am trying to implement a crc16 with polynomial x^16 + x^12 + x^5 + 1 in verilog. The problem I have encountered is that I don't get the entire packet of data at one point in time. I get a 32 bit word at a time and the number of words is dynamic but is at least 4 words and can be as high as 16384 words or higher. The time is not much of an issue because I am running on a 150 MHz clk and the input is coming in at most a 33 MHz clk but may be a 10 MHz. This does not really affect me because I am first accepting the data via a FIFO.
I have been trying to develop an FSM but have really hit a roadblock. One idea is to wait for all the data and then feed the entire thing in as one big data packet; however, this seems really inefficient and I just don't think I need to do this. Plus it could take up valuable resources. Another approach I was playing with was to input the first word and do the XOR operation. Then, when the input data has only 1 to 2 bits left that have not been XORed (not sure if that is worded correctly), I would input the next word. After that I would continue to compute the CRC, followed by another input, until the last word has been fed into the module.
With this method I would need to implement a counter or a shift register in some fashion. Anyway, any help would be nice. This goes into a command parser/packet parser. Thank you so much for your help.
A CRC calculation doesn't need to be done serially 1-bit at a time. You can essentially "unroll" the calculation to come up with the individual equations for each bit of a parallel CRC generator. With that, you can create a CRC generator that processes 32-bits of input data at a time, matching your datapath width. This should simplify your design as well as make it higher performance (processing each bit serially wouldn't meet your throughput requirements anyway, unless you don't mind holding off incoming data while the hw generates the CRC).
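One way to get the unrolled equations without deriving them by hand is to write the serial update inside a function and loop over it 32 times; synthesis tools unroll the loop into a block of XORs. The sketch below assumes MSB-first bit order and an initial value of 16'hFFFF, neither of which is stated in the question, and the init and data_valid signals are hypothetical names for whatever the FIFO read logic provides.

```verilog
// Sketch only: CRC-16 with polynomial x^16 + x^12 + x^5 + 1 (0x1021),
// consuming one 32-bit word per clock.
module crc16_32_sketch (
    input  wire        clk,
    input  wire        init,        // hypothetical: reload the seed before a packet
    input  wire        data_valid,  // hypothetical: a word was read from the FIFO
    input  wire [31:0] data,
    output reg  [15:0] crc
);
    // Applies the serial CRC step to all 32 bits, most significant bit first.
    // The loop bounds are constant, so this becomes pure combinational logic.
    function [15:0] crc16_next32;
        input [15:0] crc_in;
        input [31:0] d;
        integer      k;
        reg          fb;
        begin
            crc16_next32 = crc_in;
            for (k = 31; k >= 0; k = k - 1) begin
                fb = crc16_next32[15] ^ d[k];
                crc16_next32 = {crc16_next32[14:0], 1'b0}
                             ^ (fb ? 16'h1021 : 16'h0000);
            end
        end
    endfunction

    always @(posedge clk) begin
        if (init)
            crc <= 16'hFFFF;               // assumed seed; use your protocol's value
        else if (data_valid)
            crc <= crc16_next32(crc, data);
    end
endmodule
```

Because the packet length is dynamic, the FSM only needs to assert init before the first word and then data_valid for every word it pops from the FIFO; the CRC is valid on the clock after the last word.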
I am trying to implement the FastICA algorithm in Verilog. I have written the whole code and it simulates without errors, but when I try to synthesize it I get an error stating: expecting ";" instead of ".".
I am using four floating-point modules for the arithmetic, and I have generated 1000 instances of sum, sqrt, etc. using a for loop for the intermediate calculations. Following is the code for the generate block:
genvar s;
generate
for(s=1;s<=4000;s=(s+1))
begin:cov_mul_ins
Float32Mul cov_mul (.CLK(clk),
.nRST(1'b1),
.leftArg(dummy_14),
.rightArg(dummy_15),
.loadArgs(1'b1)
);
end
endgenerate
Now I am accessing the individual instances using the Dot operator
for(d=1;d<=2;d=(d+1))
begin
for(e=1;e<=2;e=(e+1))
begin
for(c=1;c<=1000;c=(c+1))
begin
if((d==1)&&(e==1))
begin
dummy_14=centered_data_copy[d][c];
dummy_15=Parent.centered_data_float_trans[c][e];
#10 cov_mul_ins[c].cov_mul(.CLK(clk),
.nRST(1'b1),
.leftArg(dummy_14),
.rightArg(dummy_15),
.loadArgs(1'b1),
.product(cov_temp[c][1])
);
I would be grateful if someone could pinpoint the error I am making. Thanks!
Couple of things to note:
Out of module references can't be synthesised. This means that you can't "peek inside" instantiated modules to look at nets or call functions if you want that code to be synthesisable. It's grand for testbenches though.
Your attempted function call has a delay on it, which will be ignored, ie #10 cov_mul_ins[c].cov_mul ( ... );
I can see your thinking in a softwarey lets-put-everything-in-a-class-and-call-methods way. This is perfect for testbenches, but synthesis will complain, as you've seen. When it comes to hardware, well, you need to think of the hardware: ask yourself which blocks you need to build to run your algorithm. For example, if your algorithm needs 30 multiplies on each input sample, then you need either 30 instances of a multiplier, or one multiplier with your 30 operations sequenced through it. Or 15 multipliers, each doing 2 multiplications per sample period, or 10 multipliers doing 3, etc.
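To make that concrete: an instance cannot be "called" from procedural code, so each generated multiplier has to be wired to its operands and result where it is instantiated. The sketch below shows one way to do that with indexed slices of flat buses. mul_stub is only a stand-in I made up for Float32Mul (whose internals are not shown in the question), and the bus names, widths and N = 4 are likewise assumptions.

```verilog
// Stand-in for Float32Mul: a registered integer multiply, for illustration only.
module mul_stub (
    input  wire        CLK,
    input  wire [15:0] leftArg,
    input  wire [15:0] rightArg,
    output reg  [31:0] product
);
    always @(posedge CLK)
        product <= leftArg * rightArg;
endmodule

// Sketch only: N multipliers, each permanently wired to its own operands.
module gen_connect_sketch #(
    parameter N = 4
) (
    input  wire            clk,
    input  wire [16*N-1:0] left_flat,    // operand s lives in bits [16*s +: 16]
    input  wire [16*N-1:0] right_flat,
    output wire [32*N-1:0] product_flat  // result s lives in bits [32*s +: 32]
);
    genvar s;
    generate
        for (s = 0; s < N; s = s + 1) begin : cov_mul_ins
            // All connections, including the result, are made here;
            // nothing outside this block reaches into the instance.
            mul_stub cov_mul (
                .CLK     (clk),
                .leftArg (left_flat [16*s +: 16]),
                .rightArg(right_flat[16*s +: 16]),
                .product (product_flat[32*s +: 32])
            );
        end
    endgenerate
endmodule
```

Procedural code then only drives left_flat and right_flat and reads product_flat; if 1000 or 4000 multipliers do not fit on the device, the same wiring works for a handful of instances fed from a counter.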
Try deleting the "#10", because I don't think it is synthesizable.