I am trying to implement the FatICA algorithm in verilog. I have written the whole code and till simulation it shows no error but when I try to synthesize the code it gives an error stating " ";" expecting instead of".""
I am using four floating point modules for arithmetic calculation and I have generated 1000 instances of sum, sqrt ... etc using for loop for the in between calculations.Following is the code for generate
genvar s;
generate
for(s=1;s<=4000;s=(s+1))
begin:cov_mul_ins
Float32Mul cov_mul (.CLK(clk),
.nRST(1'b1),
.leftArg(dummy_14),
.rightArg(dummy_15),
.loadArgs(1'b1)
);
end
endgenerate
Now I am accessing the individual instances using the Dot operator
for(d=1;d<=2;d=(d+1))
begin
for(e=1;e<=2;e=(e+1))
begin
for(c=1;c<=1000;c=(c+1))
begin
if((d==1)&&(e==1))
begin
dummy_14=centered_data_copy[d][c];
dummy_15=Parent.centered_data_float_trans[c][e];
#10 ***cov_mul_ins[c].cov_mul***(.CLK(clk),
.nRST(1'b1),
.leftArg(dummy_14),
.rightArg(dummy_15),
.loadArgs(1'b1),
.product(cov_temp[c][1])
);
I would be grateful if someone could pin point the error I am making.Thanks!
Couple of things to note:
Out of module references can't be synthesised. This means that you can't "peek inside" instantiated modules to look at nets or call functions if you want that code to be synthesisable. It's grand for testbenches though.
Your attempted function call has a delay on it, which will be ignored, ie #10 cov_mul_ins[c].cov_mul ( ... );
I can see your thinking in a softwarey lets-put-everything-in-a-class-and-call-methods way. This is perfect for testbenches, but synthesis will complain, as you've seen. When it comes to hardware, well, you need to think of the hardware - ask youself which blocks you need to build to run your algorithm. For example, if your algorithm needs 30 multiplies on each input sample, then you need either 30 instances of a multiplier, or one multiplier and sequence your 30 operations through it. Or 15 multipliers, each doing 2 multiplications per sample period, or 10 multipliers doing 3 etc...
Try to delete "#10" because I think it is not synthesable.
Related
I'm having trouble understanding this error I'm getting in Yosys.
I copied the relevant (I think) code below.
reg signed [15:0] wb1 [0:131071];
reg signed [27:0] currentAttrWB [0:4094];
always #(posedge clk)
currentAttrWB <= wb1[attrWBoffset +: 4094];
ERROR (on last line): "currentAttrWB does map to an unexpanded memory!"
What I'm trying to do is select a sub-range of 4095 16bit words in a long array of 131072 16bit words, using indexed part select (+:). This range would be offset within the long array using attrWBoffset. But obviously there is something I am not understanding. At first I thought this was because currentAttrWB was not large enough to contain 4095 16bit words. But I still get the error when bumping its register up to 28bits.
I guess I need to understand what is meant by expanded and unexpanded.
Thanks for your help.
Yosys does not support memories, i.e. anything defined as reg [x:0] mem[0:y]; being on the left hand side of an assignment. I am not sure if this is a Yosys limitation or a Verilog one, but such a pattern makes little sense for an FPGA application where memories are accessed one element at a time.
If Yosys could implement such a pattern, it would have to map the memory to LUTs and flipflops rather than use dedicated RAM resources, as such a simultaneous transfer isn't possible with block RAM. There are very few FPGAs with over 2 million flipflops available, and if you had one you probably wouldn't want to fill it up with something like this.
What you probably want is a counter that counts from 0 to 4094 and copies one entry every clock cycle until it completes.
I have some lines of code below:
wire [WIDTH_PIXEL-1:0] x_vector [0:36];
wire [6-1:0] x_sample [0:511]; // 0 <= x_sample <= 36
reg [WIDTH_PIXEL-1:0] rx_512 [0:511];
genvar p;
generate
for(p=0;p<=511;p=p+1) begin: PPP
always#(posedge clk) begin
if(x_sample[p] == counter2) begin
rx_512[p] <= x_vector[x_sample[p]];
end
end
I want to save 512 x_vector elements whose address is the value of x_sample[p]. The problem is when I synthesize on Quartus, the total LC-combinationals over 50000. I know the problem lies on the line
rx_512[p] <= x_vector[x_sample[p]];
So is there any way for improving the access memory? Thank you.
Keep in mind that Verilog is meant as a hardware emulation language.
This makes that you have to learn to write two different types of code:
Code that gets converted to hardware
Test bench code
For the former there are a lot more restrictions. As you correctly noticed you get 512 comparators each comparing 6 bits plus each conditionally selecting one of 37 PIXELWIDTH values and assigning it to one of 512 PIXELWIDTH destinations. My guess is easily a million gates.
You have to use a divide an conquer approach. As Qiu says make the code sequential: One operation per clock cycle. It will take more clock cycles but a lot less logic. Unfortunately you might find out that you do not have enough time to e.g. process a whole image in that (frame?) time. Then choose to do two or four operations per cycle.
You have to continuously weigh speed versus number of gates & power. Maybe you find out that you can't do the operations at all with the chosen hardware. (Nobody said writing Verilog was easy!)
I don't know if it helps but you can make the compiler/optimizer's life a bit easier if you use:
rx_512[p] <= x_vector[counter2];
I have written some code in verilog for a median filter using a cumulative histogram method. When I try to synthesize my code in xilinx it's processing up to 1 hour and finally shows an error, "program ran out of memory".
My code is:
//***** MEDIAN FILTER BY USING CUMULATIVE HISTOGRAM METHOD******//
module medianfilter(median_out,clk,a1,a2,a3,a4,a5,a6,a7,a8,a9);
output median_out;
input [7:0]a1,a2,a3,a4,a5,a6,a7,a8,a9;
integer i,j;
reg[7:0]b[255:0];
reg [7:0]buff[0:8];
input clk;
reg [7:0]median_out;
always#(negedge clk)
begin
//**************************************************************************//
for(i=0;i<256;i=i+1) // initilize the memory bins with zeros
b[i]=0;
//*************************************************************************//
buff[0]=a1;
buff[1]=a2;
buff[2]=a3;
buff[3]=a4;
buff[4]=a5;
buff[5]=a6;
buff[6]=a7;
buff[7]=a8;
buff[8]=a9;
for(i=0;i<9;i=i+1) // this loop is for cumulative histogram method
begin
b[buff[i]]=b[buff[i]]+1; // incrementing the value in b[i]th memory address
for(j=0;j<256;j=j+1)
if(j>buff[i])
b[j]=b[j]+1; // incrementing the bins below b[i]th bin
end
//**************************************************************************//
for(i=0;i<256;i=i+1) // loop for finding the median
begin
if(b[i]>4) ///////// condition for checking median
begin
b[i]=1;
median_out=i;
i=256; // loop breaks here
end
end
//*************************************************************************//
end
endmodule
How can I make the code synthesizable?
How many adders are generated by your code? I see at least 2,100 8-bit adders which are working in the same cycle.
You should rethink your algorithm: A median filter needs an ordered list of pixel values, so at first you should think about efficient ordering of numbers on FPGAs.
A good approach are sorting networks like:
Odd-even-merge sort or
Bitonic sort.
Sorting 9 numbers can't be done in one cycle so you need pipelining. (You can do it, but at very low clock speed.)
Our PoC-Library contains pipelined sorting networks,but I have never test these networks with a non-power of two input size!
I agree with everything that #Paebbels had to say here. However, there are some additional considerations. How fast is the data coming in. Do you get a new set of 10 values to sort every clock cycle? If not, you could pipeline the operation and use many fewer adders and fewer register stages, even to the point of using a single adder and storing the results in a block RAM (although this would be much slower). Also, you haven't mentioned which FPGA you are using (although I'm guessing this is a small one). Any FPGA design needs to take into account the available resources on the target device. You could also directly instantiate DSP48 multiplier-accumulators for adders if they are not used elsewhere in your design, once again depending on how many you need and how many are available on the device.
I found the following piece of code in the internet , while searching for good FIFO design. From the linkSVN Code FIFO -Author Clifford E. Cummings . I did some research , I was not able to figure out why there are three pointers in the design ?I can read the code but what am I missing ?
module sync_r2w #(parameter ADDRSIZE = 4)
(output reg [ADDRSIZE:0] wq2_rptr,
input [ADDRSIZE:0] rptr,
input wclk, wrst_n);
reg [ADDRSIZE:0] wq1_rptr;
always #(posedge wclk or negedge wrst_n)
if (!wrst_n) {wq2_rptr,wq1_rptr} <= 0;
else {wq2_rptr,wq1_rptr} <= {wq1_rptr,rptr};
endmodule
module sync_w2r #(parameter ADDRSIZE = 4)
(output reg [ADDRSIZE:0] rq2_wptr,
input [ADDRSIZE:0] wptr,
input rclk, rrst_n);
reg [ADDRSIZE:0] rq1_wptr;
always #(posedge rclk or negedge rrst_n)
if (!rrst_n) {rq2_wptr,rq1_wptr} <= 0;
else {rq2_wptr,rq1_wptr} <= {rq1_wptr,wptr};
endmodule
What you are looking at here is what's called a dual rank synchronizer. As you mentioned this is an asynchronous FIFO. This means that the read and write sides of the FIFO are not on the same clock domain.
As you know flip-flops need to have setup and hold timing requirements met in order to function properly. When you drive a signal from one clock domain to the other there is no way to guarantee this requirements in the general case.
When you violate these requirements FFs go into what is called a 'meta-stable' state where there are indeterminate for a small time and then (more or less) randomly go to 1 or 0. They do this though (and this is important) in much less than one clock cycle.
That's why the two layers of flops here. The first has a chance of going meta-stable but should resolve in time to be captured cleanly by the 2nd set of flops.
This on it's own is not enough to pass a multi-bit value (the address pointer) across clock domains. If more than one bit is changing at the same time then you can't be sure that the transition will be clean on the other side. So what you'll see often in these situations is that the FIFO pointers will by gray coded. This means that each increment of the counter changes at most one bit at a time.
e.g. Rather than 00 -> 01 -> 10 -> 11 -> 00 ... it will be 00 -> 01 -> 11 -> 10 -> 00 ...
Clock domain crossing is a deep and subtle subject. Even experienced designers very often mess them up without careful thought.
BTW ordinary Verilog simulations will not show anything about what I just described in a zero-delay sim. You need to do back annotated SDF simulations with real timing models.
In this example, the address is passed through the shift register in order for it to be delayed by one clock cycle. There could have been more “pointers” in order to delay the output even more.
Generally, it is easier to understand what is going on and why if you simulate the design and look at the waveform.
Also, here are some good FIFO implementations you can look at:
Xilinx FIFO Generator IP Core
Altera's single/double-clock FIFOs
OpenCores Generic FIFOs
Hope it helps. Good Luck!
I am writing Verilog code using Lattice Diamond for synthesis.
I have binary data in a text file which I want to use as input for my code.
At simulation level we can use $readmemb function to do it. How is this done at synthesis level?
I want to access data present in text file as an input for FPGA.
As suggested by Mr Martin Thompson(answers below) I have written a Verilog code to read data from a file.
Verilog code is given below:-
module rom(clock,reset,o0);
input clock,reset;
output o0;
reg ROM [0:0];
reg o0;
initial
$readmemb("rom.txt",ROM);
always #(negedge clock,negedge reset )
begin
if(reset==0)
begin
o0<=0;
end
else
begin
o0<=ROM[0];
end
end
endmodule
When I am running this code on fpga I am facing the problem below:-
If text file which I want to read have only one bit which is '1' then I am able to assign input output pins to clock,reset and ROM. But if I have one bit which is '0' or more than one bits data in text file I am unable to assign input pins(i.e clock,reset) and a warning is displayed:-
WARNING: IO buffer missing for top level port clock...logic will be discarded.
WARNING: IO buffer missing for top level port reset...logic will be discarded.
I am unable to understand why I am getting this warning and how I can resolve it.
One way is to build the data into the netlist that you have synthesised. You can initialise a read-only memory (ROM) with the data using $readmemb and then access that as a normal memory from within your device.
Here's an introduction to some memory initialisation methods:
http://myfpgablog.blogspot.co.uk/2011/12/memory-initialization-methods.html
And in here:
http://rijndael.ece.vt.edu/schaum/slides/ddii/lecture16.pdf
is an example of a file-initialised RAM on the second to last slide. If you want just a ROM, leave out the if (we) part.
Think of Simulation as an environment not a level. You should just be switching the DUT (Device Under Test) from the RTL code to the synthesised netlist, other than this nothing should change in your simulation environment.
From the block of code you have given it does not look like you are separating out the test and code for the fpga. You should not be trying to synthesise your test I would recommend splitting it between at least 2 separate modules, your test instantiating the code you want to place on the fpga. Pretty sure the $fwrite is also not synthesizable.
A simple test case might look like:
module testcase
//Variables here
reg reg_to_drive_data;
thing_to_test DUT (
.input_ports ( reg_to_drive_data )
.output_ports ()
...
);
//Test
initial begin
#1ns;
reg_to_drive_data = 0;
...
end
endmodule
Via includes, incdir or file lists the code for thing_to_test (DUT) is being pulled in to the simulation, change this to point to the synthesised version.
If what you are trying to do is initialise a ROM, and hold this data in a synthesised design Martin Thompson's answer covers the correct usage of $readmemb for this.