I want to implement a reciprical block on Verilog that will later be synthesized on an FPGA. The input should be a signed 32 bit wordlength with a 16 bit fraction length. The output should have the same format.
Example
input : x ---> output ---> 1/x
I have solved the problem using the inbuilt IP core divider. I'm wondering if there is an elegant/altenative way of solving this by for example by bit shifting or 2's complement with some xor grinds.
I have used the IP core to implement the inverse as it says in the manual but for some reason that i don't really understand the result is wrong and it needs to be shifted to the left by 1. For example; Reciprical of 1 gives 0.5 . Reciprical of 2 gives 1.
Below is a section from the manual and my testbench code
Test bench
module reciprical_tb;
// Inputs
reg clk;
reg [1:0] dividend;
reg [31:0] divisor;
// Outputs
wire rfd;
wire [1:0] quotient;
wire [31:0] fractional;
// Instantiate the Unit Under Test (UUT)
reciprical uut (
.rfd(rfd),
.clk(clk),
.dividend(dividend),
.quotient(quotient),
.divisor(divisor),
.fractional(fractional)
);
// clock
always begin
#5 clk = ~clk;
end
initial begin
// Initialize Inputs
clk = 0;
dividend = 2'b1; // 1
divisor = 2**16;; // = 1 when fraction length is 16bit
// Wait 100 ns for global reset to finish
#100;
// Add stimulus here :: Inverse of 2 should give 0.5
//$display("inv(%g) => %g || inv = %b",$itor(divisor)*2.0**-16, $itor(fractional)*2.0**-16, fractional); //gives zero
$monitor("inv(%d) => q = %d || inv = %b", divisor>>>16,fractional>>>16, fractional); //gives a wrong answer by a factor of 2
// Using the monitor i get inv(1) = 0.5 instead of 1.
#100;
end
endmodule
Manual section (page 4):
...
The divider can be used to implement the reciprocal of X; that is the 1/X function. To do this, the
dividend bit width is set to 2 and fractional mode is selected. The dividend input is then tied to 01 for
both unsigned or signed operation, and the X value is provided via the divisor input.
Ip Core used
Trying to debug part of the question:
can you try :
// Wait 10 clock cycles
repeat (10) begin
#(posedge clk);
end
// Add stimulus here :: Inverse of 2 should give 0.5
$display("dsiplay inv(%g) => %g || inv = %b",$itor(divisor)*2.0**-16, $itor(fractional)*2.0**-16, fractional); //gives zero
$monitor("inv(%d) => q = %d || inv = %b", divisor>>>16,fractional>>>16, fractional); //gives a wrong answer by a factor of 2
// Using the monitor i get inv(1) = 0.5 instead of 1.
// Wait 10 clock cycles
repeat (10) begin
#(posedge clk);
end
$display("dsiplay inv(%g) => %g || inv = %b",$itor(divisor)*2.0**-16, $itor(fractional)*2.0**-16, fractional);
//End simulation
$finish();
As monitor is only issued once, it might be after 200ns that it is actually firing and outputting the updated value.
IP cores are normally pretty efficient, but they will use fabric multipliers and a reasonable amount of logic. Depending on the accuracy that you need it could also be done with a look-up-table stored in block RAM. A RAM that used the raw input as the address would be massively too big for a device. But you can store a reciprocal curve and pre-scale the input and apply shifts at the output. You can reduce the ram usage further by linearly interpolating between two points on the curve.
It depends on what resources you have and what accuracy you need.
Related
I want to instantiate 3 instances of my counter module. However, Xilinx will only instantiate one counter for me, not the three. Does anyone know why this is? In the RTL schematic, the 2nd two counters are connected straight to ground in their block diagrams, i.e. no logic implemented for them. Are my local parameters declared correctly?
I would really really appreciate your help. I have been staring at this problem for several hours.
Thank you so much for your help. It is really appreciated.
//Top Level Module:
`timescale 1ns / 1ps
//Top Level Wrapper module
module topWrapper(input CCLK, input reset, output clk, output max_tick_Green, output max_tick_Red, output max_tick_Amber);
localparam
M_Green = 5;
M_Red = 3;
M_Amber = 2;
//Frequency scaling of CCLK
//clk is used in the traffic light module and is a scaled version of CCLK
//CCLK: Frequency = 50MHz, Period = 20ns
//clk = Frequency = 1Hz, Period = 1s
//clkscale (frequency scaling parameter) = 1s/20ns = 5x10^7
clock clockScalingModule (CCLK, 50000000, clk);
//Counter for green light
//In traffic light sequence, change from green to amber after 12 clock cycles = 120s = 2mins
counter #(.M(M_Green)) countGreen
(.clk(clk), .reset(reset), .state(1), .max_tick(max_tick_Green));
//Counter for red light
//In traffic light sequence, change from green to amber after 12 clock cycles = 120s = 2mins
counter #(.M(M_Red)) countRed
(.clk(clk), .reset(reset), .state(1), .max_tick(max_tick_Red));
//Counter for amber light
//In traffic light sequence, change from green to amber after 12 clock cycles = 120s = 2mins
counter #(.M(M_Amber)) countAmber
(.clk(clk), .reset(reset), .state(1), .max_tick(max_tick_Amber));
endmodule
//Counter module:
//Counter - modulo M counter - counts 0 to M-1, then wraps around
module counter
//Parameters
//M = number of clock cycles the counter counts = max value
#(parameter M = 6)
//I/O signals
(
input wire clk, reset, state,
output wire max_tick
);
//Local parameter
//N = number of bits in counter
//N = ceiling(log2(M)) - definition at end of module
localparam N = clog2(M);
//Internal signal declaration
reg [N-1:0] r_reg;
wire [N-1:0] r_next;
//Body
//Rgister update
always#(posedge clk, posedge reset)
//Restart counter if reset is High or state is Low
//State = Low if this counter's light is not currently on
//State = High if this counter's light is currently on
if(reset)
r_reg <= 0;
//Only increment counter if state is High
//Only one of red, green, amber states is high
else if(state)
r_reg <= r_next;
else
r_reg <= 0;
//Next-state logic
assign r_next = (r_reg == M) ? 0: r_reg + 1;
//Output logic
//Max tick = HI when maximum count value is reached
//Max tick = LO otherwise
assign max_tick = (r_reg == M) ? 1'b1 : 1'b0;
//Ceiling log2() function definition
function integer clog2;
input integer value;
begin
value = value-1;
for (clog2=0; value>0; clog2=clog2+1)
value = value>>1;
end
endfunction
endmodule
Your issue is in your counter module. Your clog2 function returns the number of bits needed to represent inputvalue-1. Thus, if inputvalue is exactly a power of two, you can't represent inputvalue with the returned length.
This is probably not very clear, so let's look at what happens for M_amber = 2. In this case, clog2 return 1, r_reg is range [0:0], but you need 2 bits to represent 2 = 2'b10. And you need to be able to represent 2, since you have the checks r_reg == M in your code. Said check will always fail, and Xilinx remove the logic and connects your logic to ground.
Normally, if you want to count N cycles in HDL, you count from 0 to N-1. Thus, your code will be fine if you replace r_reg == M for r_reg == M-1.
I'm trying to implement as follows to multiplying by 15.
module mul15(
output [10:0] result,
input [3:0] a
);
assign result = a*15;
endmodule
But is there any improve way to multiplying to a by 15?
I think there are 2 ways like this
1.result = a<<4 -1;
2.result = {a,3'b1111_1111};
Ans I think the best way is 2.
but I'm not sure also with aspect to synthesis.
update:
What if I am multiplying 0 at {a,3'b1111_1111}? This is 255 not 0.
Does anyone know the best way?
Update
How about this way?
Case1
result = {a,8'b0}+ {a,7'b0}+ {a,6'b0}+ {a,5'b0}+ {a,4'b0}+ {a,7'b0}+ {a,3'b0}+ {a,2'b0}+ {a,1'b0}+ a;
But it looks 8 adder used.
Case2
result = a<<8 -1
I'm not sure what is the best way else.
There is always a*16 - a. Static multiplications of power of 2 are basically free in hardware; it is just hard-coded 0s to the LSB. So you just need one 11-bit full-subtracter, which is a full adder and some inverters.
other forms:
result = a<<4 - a;
result = {a,4'b0} - a; // unsigned full-subtractor
result = {a,4'b0} + ~a + 1'b1; // unsigned full-adder w/ carry in, 2's complement
result = {{3{a[3]}},a,4'b0} + ~{ {7{a[3]}}, a} + 1'b1; // signed full-adder w/ carry in, 2's complement
The cleanest RTL version is as you have stated in the question:
module mul15(
input [3:0] a
output reg [7:0] result,
);
always #* begin
result = a * 4'd15;
end
endmodule
The Multiplicand 15 in binary is 4'b1111; That is 8 + 4 + 2 + 1.
Instead of a multiplier it could be broken down into the sum of these powers of 2. Powers of 2 are just barrel shifts. This is how a shift and add multiplier would work.
module mul15(
input [3:0] a
output reg [7:0] result,
);
always #* begin
// 8 4 2 1 =>15
result = (a<<3) + (a<<2) + (a<<1) + a;
end
endmodule
To minimise the number of adders required a CSD could be used. making 15 out of 16-1:
module mul15(
input [3:0] a
output reg [7:0] result,
);
always #* begin
// 16 - 1 =>15
result = (a<<4) - a;
end
endmodule
With a modern synthesis tool these should all result in same the thing. Therefore having more readable code which gives a clear instruction to the tool as to what you intended gives it the free rein to optimise as required.
I'm designing a 464 order FIR filter in Verilog for use on the Altera DE0 FPGA. I've got (what I believe to be) a working implementation; however, there's one small issue that's really actually given me quite a headache. The basic operation works like this: A 10 bit number is sent from a micro controller and stored in datastore. The FPGA then filters the data, and lights LED1 if the data is near 100, and off if it's near 50. LED2 is on when the data is neither 100 nor 50, or the filter hasn't filled the buffer yet.
In the specification, the coefficients (which have been pre provided), have been multiplied by 2^15 in order to represent them as integers. Therefore, I need to divide my final output Y by 2^15. I have implemented this using a shift, since it should be (?) the most efficient way. However, this single line causes my number of logic elements to jump from ~11,000 without it, to over 35,000. The Altera DE0 uses a Cyclone III FPGA which only has room for about 15k logic elements. I've tried doing it inside both combinational and sequential logic blocks, both of which have the same exact issue.
Why is this single, seemingly simple operation causing such an inflation elements? I'll include my code, which I'm sure isn't the most efficient, nor the cleanest. I don't care about optimizing this design for performance or area/density at all. I just want to be able to fit it onto the FPGA so it'll run. I'm not very experienced in HDL design, and this is by far the most complex project I've needed to tackle. It's worth noting that I do not remove y completely, I replace the "bad" line with assign YY = y;.
Just as a note: I haven't included all of the coefficients, for sanity's sake. I know there might be a better way to do it than using case statements, but it's the way that it came and I don't really want to relocate 464 elements to a parameter declaration, etc.
module lab5 (LED1, LED2, handshake, reset, data_clock, datastore, bit_out, clk);
// NUMBER OF COEFFICIENTS (465)
// (Change this to a small value for initial testing and debugging,
// otherwise it will take ~4 minutes to load your program on the FPGA.)
parameter NUMCOEFFICIENTS = 465;
// DEFINE ALL REGISTERS AND WIRES HERE
reg [11:0] coeffIndex; // Coefficient index of FIR filter
reg signed [16:0] coefficient; // Coefficient of FIR filter for index coeffIndex
reg signed [16:0] out; // Register used for coefficient calculation
reg signed [31:0] y;
wire signed [7:0] YY;
reg [9:0] xn [0:464]; // Integer array for holding x
integer i;
output reg LED1, LED2;
// Added values from part 1
input reset, handshake, clk, data_clock, bit_out;
output reg [9:0] datastore;
integer k;
reg sent;
initial
begin
sent = 0;
i=0;
datastore = 10'b0000000000;
y=0;
LED1 = 0;
LED2 = 0;
for (i=0; i<NUMCOEFFICIENTS; i=i+1)
begin
xn[i] = 0;
end
end
always#(posedge data_clock)
begin
if(handshake)
begin
if(bit_out)
begin
datastore = datastore >> 1;
datastore [9] = 1;
end
else
begin
datastore = datastore >> 1;
datastore [9] = 0;
end
end
end
always#(negedge clk)
begin
if (!handshake )
begin
if(!sent)
begin
y=0;
for (i=NUMCOEFFICIENTS-1; i > 0; i=i-1) //shifts coeffecients
begin
xn[i] = xn[i-1];
end
xn[0] = datastore;
for (i=0; i<NUMCOEFFICIENTS; i=i+1)
begin
// Calculate coefficient based on the coeffIndex value. Note that coeffIndex is a signed value!
// (Note: These don't necessarily have to be blocking statements.)
case ( 464-i )
12'd0: out = 17'd442; // This coefficient should be multiplied with the oldest input value
12'd1: out = -17'd373;
12'd2: out = -17'd169;
...
12'd463: out = -17'd373; //-17'd373
12'd464: out = 17'd442; //17'd442
// This coefficient should be multiplied with the most recent data input
// This should never occur.
default: out = 17'h0000;
endcase
y = y + (out * xn[i]);
end
sent = 1;
end
end
else if (handshake)
begin
sent = 0;
end
end
assign YY = (y>>>15); //THIS IS THE LINE THAT IS CAUSING THE ISSUE!
always #(YY)
begin
LED1 = 0;
LED2 = 1;
if ((YY >= 40) && (YY <= 60))
begin
LED1 <= 0;
LED2 <= 0;
end
if ((YY >= 90) && (YY <= 110))
begin
LED1 <= 1;
LED2 <= 0;
end
end
endmodule
You're almost certainly seeing the effects of synthesis optimisation.
The following line is the only place that uses y:
assign YY = (y>>>15); //THIS IS THE LINE THAT IS CAUSING THE ISSUE!
If you remove this line, all the logic that feeds into y (including out and xn) will be removed. On Altera you want to look carefully through your map report which will contain (buried amongst a million other things) information about all the logic that Quartus has removed and the reason behind it.
Good places to start are the Port Connectivity Checks which will tell you if any inputs or outputs are stuck high or low or are dangling. The look through the Registers Removed During Synthesis section and Removed Registers Triggering Further Register Optimizations.
You can try to force Quartus not to remove redundant logic by using the following in your QSF:
set_instance_assignment -name preserve_fanout_free_node on -to reg
set_instance_assignment -name preserve_register on -to foo
In your case however it sounds like the correct solution is to re-factor the code rather than try to preserve redundant logic. I suspect you want to investigate using an embedded RAM to store the coefficients.
(In addition to Chiggs' answer, assuming that you are hooking up YY correctly ....)
I would add that, you don't need >>>. It would be simpler to write :
assign YY = y[22:15];
And BTW, initial blocks are ignored for synthesis. So, you want to move that initialization to the respective always blocks in a if (reset) or if (handshake) section.
I'm trying to write a multiplier based on a design. It consists of two 16-bit inputs and the a single adder is used to calculate the partial product. The LSB of one input is AND'ed with the 16 bits of the other input and the output of the AND gate is repetitively added to the previous output. The Verilog code for it is below, but I seem to be having trouble with getting the outputs to work.
module datapath(output reg [31:15]p_high,
output reg [14:0]p_low,
input [15:0]x, y,
input clk); // reset, start, x_ce, y_ce, y_load_en, p_reset,
//output done);
reg [15:0]q0;
reg [15:0]q1;
reg [15:0]and_output;
reg [16:0]sum, prev_sum;
reg d_in;
reg [3:0] count_er;
initial
begin
count_er <= 0;
sum <= 17'b0;
prev_sum <= 17'b0;
end
always#(posedge clk)
begin
q0 <= y;
q1 <= x;
and_output <= q0[count_er] & q1;
sum <= and_output + prev_sum;
prev_sum <= sum;
p_high <= sum;
d_in <= p_high[15];
p_low[14] <= d_in;
p_low <= p_low >> 1;
count_er <= count_er + 1;
end
endmodule
I created a test bench to test the circuit and the first problem I see is that, the AND operation doesn't work as I want it to. The 16-bits of the x-operand are and'ed with the LSB of the y-operand. The y-operand is shifted by one bit after every clock cycle and the final product is calculated by successively adding the partial products.
However, I am having trouble starting from the sum and prev_sum lines and their outputs are being displayed as xxxxxxxxxxxx.
You don't seem to be properly resetting all the signals you need to, or you seem to be confusing the way that nonblocking assignments work.
After initial begin:
sum is 0
prev_sum is 0
and_output is X
After first positive edge:
sum is X, because and_output is X, and X+0 returns X. At this point sum stays X forever, because X + something is always X.
You're creating a register for almost every signal in your design, which means that none of your signals update immediately. You need to make a distinction between the signals that you want to register, and the signals that are just combinational terms. Let the registers update with nonblocking statements on the posedge clock, and let the combinational terms update immediately by placing them in an always #* block.
I don't know the algorithm that you're trying to use, so I can't say which lines should be which, but I really doubt that you intend for it to take one clock cycle for x/y to propagate to q0/q1, another cycle for q to propagate to and_output, and yet another clock cycle to propogate from and_output to sum.
Comments on updated code:
Combinational blocks should use blocking assignments, not nonblocking assignments. Use = instead of <= inside the always #* block.
sum <= and_output + sum; looks wrong, It should be sum = and_output + p_high[31:16] according to your picture.
You're assigning p_low[14] twice here. Make the second statement explicitly set bits [13:0] only:
p_low[14] <= d_in;
p_low[13:0] <= p_low >> 1;
You are mixing blocking and nonblocking assignments in the same sequential always block, which can cause unexpected results:
d_in <= p_high[15];
p_low[14] = d_in;
I want to have a 2 sec counter in my for loop such that there is a gap of 2 seconds between every iteration.I am trying to have a shifting LEDR Display
Code:
parameter n =10;
integer i;
always#(*)
begin
for(i=0;i<n;i=i+1)
begin
LEDR[i]=1'b1;
//2 second counter here
end
end
What is your desired functionality? I assume: shifting which LED is on every 2 seconds, keeping all the other LEDs off? "Sliding LED"...
Also, I am assuming your target is an FPGA-type board.
There is no free "wait for X time" in the FPGA world. The key to what you are trying to do it counting clock cycles. You need to know the clock frequency of the clock that you are using for this block. Once you know that, then you can calculate how many clock rising edges you need to count before "an action" needs to be taken.
I recommend two processes. In one, you will watch rising edge of clock, and run a counter of sufficient size, such that it will roll over once every two seconds. Every time your counter is 0, then you set a "flag" for one clock cycle.
The other process will simply watch for the "flag" to occur. When the flag occurs, you shift which LED is turned on, and turn all other LEDs off.
I think this module implements what Josh was describing. This module will create two registers, a counter register (counter_reg) and a shift register (leds_reg). The counter register will increment once per clock cycle until it rolls over. When it rolls over, the "tick" variable will be equal to 1. When this happens, the shift register will rotate by one position to the left.
module led_rotate_2s (
input wire clk,
output wire [N-1:0] leds,
);
parameter N=10; // depends on how many LEDs are on your board
parameter WIDTH=<some integer>; // depends on your clock frequency
localparam TOP=<some integer>; // depends on your clock frequency
reg [WIDTH-1:0] counter_reg = 0, counter_next;
reg [N-1:0] leds_reg = {{N-1{1'b0}}, 1'b1}, leds_next;
wire tick = (counter_reg == 0);
assign leds = leds_reg;
always #* begin : combinational_logic
counter_next = counter_reg + 1'b1;
if (counter_next >= TOP)
counter_next = 0;
leds_next = leds_reg;
if (tick)
leds_next = {leds_reg[N-2:0], leds_reg[N-1]}; // shift register
end
always #(posedge clk) begin : sequential_logic
counter_reg <= counter_next;
leds_reg <= leds_next;
end
endmodule