How to improve the speed of a multiplier in verilog?
Hi
I want to know about 'How to improve the speed of a multiplier without increasing clock speed in verilog?'
Does anyone know about regarding this?
We don't have much money to buy DesignWare of Synopsys's.
Unfortunately, Also we met some problem regarding speed limit of multiplier. So I trying to find way to improve multiplier without clock speed up. Especially, our ASIC has already approached to timing limit. We don't have timing margin.But We have to change regarding the multiplier logic.
For example, we have already met the all timing clock in synthesis.but we need to change the algorithm some multiplier regarding logic.
Assuming that all surrounding logic has been minimised, inputs from flip-flops output direct to flip-flops.
module multiplier(
WDITH = 24
)(
input clk,
input signed [WIDTH-1:0] a,
input signed [WIDTH-1:0] b,
output logic signed [(WIDTH*2) -1:0] mul
);
logic signed [WIDTH-1:0] a_i;
logic signed [WIDTH-1:0] b_i;
always #(posedge clk) begin
a_i <= a;
b_i <= b;
mul <= a_i * b_i;
end
endmodule
Having the a*b style in RTL allows the synthesis library to choose the best multiplier style (Area/power vs speed ). Assuming the question is a result of synthesis not being able to close timing.
What limits the multiplier speed?
Input width could be minimised to speed up design.
For ASIC design the next process node could be chosen ie going from 22nm to 14nm. For FPGA a more expensive chip supporting a faster multiplier speed.
Alternatively the target clock speed of the Multiplier could be halved and two used in parallel. Multi-cycle clocks could be used in synthesis if actual clock is to remain the same but the result sampled every other clock.
module multiplier(
WDITH = 24
)(
input clk,
input rst_n,
input signed [WIDTH-1:0] a,
input signed [WIDTH-1:0] b,
output logic signed [(WIDTH*2) -1:0] mul
);
logic signed [WIDTH-1:0] a1_i;
logic signed [WIDTH-1:0] b1_i;
logic signed [WIDTH-1:0] a2_i;
logic signed [WIDTH-1:0] b2_i;
logic signed [(WIDTH*2) -1:0] mul1;
logic signed [(WIDTH*2) -1:0] mul2;
logic state;
always #(posedge clk, negedge rst_n) begin
if (~rst_n) begin
state <= 'b0;
end
else begin
state <= ~state;
end
end
always #* begin
mul1_i = a1_i * b1_i;
mul2_i = a2_i * b2_i;
end
always #(posedge clk, negedge rst_n) begin
if (~rst_n) begin
a1_i <= 'b0;
b1_i <= 'b0;
a2_i <= 'b0;
b2_i <= 'b0;
mul <= 'b0
end
else begin
if (state) begin
a1_i <= a;
b1_i <= b;
mul <= mul2_i;
end
else begin
a2_i <= a;
b2_i <= b;
mul <= mul1_i;
end
end
end
endmodule
Where mul1_i and mul2_i; are given multi cycle properties in synthesis, so they have twice the clock period to resolve.
Another possibility is to instantiate a multi-cycle design ware multiplier, using the designware Datapath and building block IP. They have 2,3,4,5 and 6 cycle multipliers.
An example of a 2-Stage Multiplier :
module DW02_mult_2_stage_inst( inst_A, inst_B, inst_TC,
inst_CLK, PRODUCT_inst );
parameter A_width = 8;
parameter B_width = 8;
input [A_width-1 : 0] inst_A;
input [B_width-1 : 0] inst_B;
input inst_TC;
input inst_CLK;
output [A_width+B_width-1 : 0] PRODUCT_inst;
// Instance of DW02_mult_2_stage
DW02_mult_2_stage #(A_width, B_width)
U1 ( .A(inst_A), .B(inst_B), .TC(inst_TC),
.CLK(inst_CLK), .PRODUCT(PRODUCT_inst) );
endmodule
Related
Say I have a submodule and want to "call" it from a top module.
//sub.v
module sub(
input wire clk,
input wire rst_n,
input wire update,//interface is modifiable
input wire [7:0] in_data,
output wire[7:0] out_data
);
reg[7:0] state;
always #(posedge clk) begin
if(!rst_n)
state<=0;
else if(update)
state<=state+in_data;//update drives states to change according to input data
end
assign out_data=state<<1;
endmodule
//top.v
module top(
input wire clk,
input wire rst_n,
output reg[7:0] top_out
);
//Submodule
reg sub_update;
reg[7:0] sub_in;
wire[7:0] sub_out;
sub sub_instance(
.clk(clk),
.rst_n(rst_n),
.update(sub_update),
.in_data(sub_in),
.out_data(sub_out)
);
//Topmodule
reg[7:0] top_state;
localparam STATE0 = 8'd0;
localparam STATE1 = 8'd1;
localparam STATE2 = 8'd2;
localparam STATE3 = 8'd3;
localparam TEMP_STATE = -8'd1;
always #(posedge clk) begin
if(!rst_n) begin
top_state <= STATE0;
sub_update <= 0;
sub_in <= 0;
end else begin
case (top_state)
STATE0:top_state<=STATE1;
STATE1:begin
//want to update the submodule and use its output here
sub_in <= 123;
sub_update <= 1;
top_state <= TEMP_STATE;
end
STATE2:top_state<=STATE3;//proceed from STATE1
TEMP_STATE:begin
sub_update <= 0;
top_out <= sub_out+1;
top_state <= STATE2;
end
default:top_state<=STATE0;
endcase
end
end
endmodule
The main question may divide into 2 smaller questions:
In my module code, the "update" signal is both set and captured at the same edge of a same clock. Is that a data race? Should I use a different edge, or pass a reverse clock to the submodule? Or should I change "update" to other interface?
Is there a better way to get the submodule output than using a temporary state? For this simple example, there is a way to calculate the output when the input is ready: Inline everything, get top_out<=((state+123)<<1)+1; at STATE1. Can this be generalized for more complicated calculation?
there is no problem in setting and capturing 'update' on the same clock edge. Just note that it will be captured with a delay of one clock cycle after it is set. This is the way flops work and your simulation should work same way because you did correctly use NBAs.
This is the question that you should answer for yourself. Better ways or not depend on your requirements and abilities of hardware. You have all these parts of the equation and states for some reason. You can probably collapse them if your algorithm allows it. You can use complex calculations there if they do not violate your design rules.
I have an ice40 that drives the clock and data inputs of an ASIC.
The ice40 drives the ASIC's clock with the same clock that drives the ice40's internal logic. The problem is that the rising clock triggers the ice40's internal logic and changes the ice40's data outputs a few nanoseconds before the rising clock reaches the ASIC, and therefore the ASIC observes the wrong data at its rising clock.
I've solved this issue by using an inverter chain to delay the ice40's internal clock without delaying the clock driving the ASIC. That way, the rising clock reaches the ASIC before the ice40's data outputs change. But that raises a few questions:
Is my strategy -- using an inverter chain to delay the ice40 internal clock -- a good strategy?
To diagnose the problem, I used Lattice's iCEcube2 to analyze the min/max delays between the internal clock and output pins:
Notice that the asic_dataX delays are shorter than the clk_out delay, indicating the problem.
Is there a way to get this information from yosys/nextpnr?
Thank you for any insight!
Instead of tinkering with the delays I would recommend to use established techniques. For example SPI simple clocks the data on the one edge and changes them on the other: .
The logic to implement that is rather simple. Here an example implementation for an SPI slave:
module SPI_slave #(parameter WIDTH = 6'd16, parameter phase = 1'b0,
parameter polarity = 1'b0, parameter bits = 5) (
input wire rst,
input wire CS,
input wire SCLK,
input wire MOSI,
output reg MISO,
output wire data_avbl,
input wire [WIDTH-1:0] data_tx,
output reg [WIDTH-1:0] data_rx
);
reg [bits:0] bitcount;
reg [WIDTH-1:0] buf_send;
assign clk = phase ^ polarity ^ SCLK;
assign int_rst = rst | CS;
assign tx_clk = clk | CS;
assign data_avbl = bitcount == 0;
always #(negedge tx_clk or posedge rst) begin
MISO <= rst ? 1'b0 : buf_send[WIDTH-1];
end
always #(posedge clk or posedge int_rst) begin
if (int_rst) begin
bitcount <= WIDTH;
data_rx <= 0;
buf_send <= 0;
end else begin
bitcount <= (data_avbl ? WIDTH : bitcount) - 1'b1;
data_rx <= { data_rx[WIDTH-2:0], MOSI };
buf_send <= bitcount == 1 ? data_tx[WIDTH-1:0] : { buf_send[WIDTH-2:0], 1'b0};
end
end
endmodule
As one can see the data are captured at the positive edge and changed on the negative edge. If one wants to avoid the mixing of edge sensistivies a doubled clock can be used instead.
Sir,
I have done a 4 bit up counter using verilog. but it was not incrementing during simulation. A frequency divider circuit is used to provide necessory clock to the counter.please help me to solve this. The code is given below
module my_upcount(
input clk,
input clr,
output [3:0] y
);
reg [26:0] temp1;
wire clk_1;
always #(posedge clk or posedge clr)
begin
temp1 <= ( (clr) ? 4'b0 : temp1 + 1'b1 );
end
assign clk_1 = temp1[26];
reg [3:0] temp;
always #(posedge clk_1 or posedge clr)
begin
temp <= ( (clr) ? 4'b0 : temp + 1'b1 );
end
assign y = temp;
endmodule
Did you run your simulation for at least (2^27) / 2 + 1 iterations? If not then your clk_1 signal will never rise to 1, and your counter will never increment. Try using 4 bits for the divisor counter so you won't have to run the simulation for so long. Also, the clk_1 signal should activate when divisor counter reaches its max value, not when the MSB bit is one.
Apart from that, there are couple of other issues with your code:
Drive all registers with a single clock - Using different clocks within a single hardware module is a very bad idea as it violates the principles of synchronous design. All registers should be driven by the same clock signal otherwise you're looking for trouble.
Separate current and next register value - It is a good practice to separate current register value from the next register value. The next register value will then be assigned in a combinational portion of the circuit (not driven by the clock) and stored in the register on the beginning of the next clock cycle (check code below for example). This makes the code much more clear and understandable and minimises the probability of race conditions and unwanted inferred memory.
Define all signals at the beginning of the module - All signals should be defined at the beginning of the module. This helps to keep the module logic as clean as possible.
Here's you example rewritten according to my suggestions:
module my_counter
(
input wire clk, clr,
output [3:0] y
);
reg [3:0] dvsr_reg, counter_reg;
wire [3:0] dvsr_next, counter_next;
wire dvsr_tick;
always #(posedge clk, posedge clr)
if (clr)
begin
counter_reg <= 4'b0000;
dvsr_reg <= 4'b0000;
end
else
begin
counter_reg <= counter_next;
dvsr_reg <= dvsr_next;
end
/// Combinational next-state logic
assign dvsr_next = dvsr_reg + 4'b0001;
assign counter_next = (dvsr_reg == 4'b1111) ? counter_reg + 4'b0001 : counter_reg;
/// Set the output signals
assign y = counter_reg;
endmodule
And here's the simple testbench to verify its operation:
module my_counter_tb;
localparam
T = 20;
reg clk, clr;
wire [3:0] y;
my_counter uut(.clk(clk), .clr(clr), .y(y));
always
begin
clk = 1'b1;
#(T/2);
clk = 1'b0;
#(T/2);
end
initial
begin
clr = 1'b1;
#(negedge clk);
clr = 1'b0;
repeat(50) #(negedge clk);
$stop;
end
endmodule
I'm new to verilog development and am having trouble seeing where I'm going wrong on a relatively simple counter and trigger output type design.
Here's the verilog code
Note the code returns the same result whether or not the reg is declared on the output_signal without the internal_output_buffer
`timescale 1ns / 1ps
module testcounter(
input wire clk,
input wire resetn,
input wire [31:0] num_to_count,
output reg [7:0] output_signal
);
reg [31:0] counter;
initial begin
output_signal = 0;
end
always#(negedge resetn) begin
counter = 0;
end
always#(posedge clk) begin
if (counter == num_to_count) begin
counter = 0;
if (output_signal == 0) begin
output_signal = 8'hff;
end
else begin
output_signal = 8'h00;
end
end
else begin
counter = counter + 1;
end
end
assign output_signal = internal_output_buffer;
endmodule
And the code is tested by
`timescale 1ns / 1ps
module testcounter_testbench(
);
reg clk;
reg resetn;
reg [31:0] num_to_count;
wire [7:0] output_signal;
initial begin
clk = 0;
forever #1 clk = ~clk;
end
initial begin
num_to_count = 20;
end
initial begin
#7 resetn = 1;
#35 resetn = 0;
end
testcounter A1(.clk(clk),.resetn(resetn),.num_to_count(num_to_count),.output_signal(output_signal));
endmodule
Behavioral simulation looks as I expected
But the timing simulation explodes
And for good measure: the actual probed execution blows up and looks like
Any tips would be appreciated. Thanks all.
The difference between the timing and functional simulations is that a timing simulation models the actual delay of logic gates while the functional simulation just checks if values are correct.
For e.g. if you have a simple combinational adder with two inputs a and b, and output c. A functional simulation will tell you that c=a+b. and c will change in the exact microsecond that a or b changes.
However, a timing simulation for the same circuit will only show you the result (a+b) on c after some time t, where t is the delay of the adder.
What is your platform? If you are using an FPGA it is very difficult to hit 500 MHz. Your clock statement:
forever #1 clk = ~clk;
shows that you toggle the clock every 1ns, meaning that your period is 2ns and your frequency is 500MHz.
The combinational delay through FPGA resources such as lookup tables, multiplexers and wire segments is probably more than 2ns. So your circuit violates timing constraints and gives wrong behaviour.
The first thing I would try is to use a much lower clock frequency, for example 100 MHz and test the circuit again. I expect it to produce the correct results.
forever #5 clk = ~clk;
Then to know the maximum safe frequency you can run at, look at your compilation reports in your design tools by running timing analysis. It is available in any FPGA CAD tool.
Your code seems working fine using Xilinx Vivado 14.2 but there is only one error which is the following line
assign output_signal = internal_output_buffer;
You can't assign registers by using "assign" and also "internal_output_buffer" is not defined.
I also personally recommend to set all registers to some values at initial. Your variables "resetn" and "counter" are not assigned initially. Basicly change your code like this for example
reg [31:0] counter = 32'b0;
Here is my result with your code:
Your verilog code in the testcounter looks broken: (a) you're having multiple drivers, and (b) like #StrayPointer notices, you're using blocking assignments for assigning Register (Flip-Flop) values.
I'm guessing your intent was the following, which could fix a lot of simulation mismatches:
module testcounter
(
input wire clk,
input wire resetn,
input wire [31:0] num_to_count,
output reg [7:0] output_signal
);
reg [31:0] counter;
always#(posedge clk or negedge resetn) begin
if (!resetn) begin
counter <= 0;
end else begin
if (counter == num_to_count) begin
counter <= 0;
end else begin
counter <= counter + 1;
end
end
end
assign output_signal = (counter == num_to_count) ? 8'hff : 8'h00;
endmodule
i need a frequency divider in verilog, and i made the code below. It works, but i want to know if is the best solution, thanks!
module frquency_divider_by2 ( clk ,clk3 );
output clk3 ;
reg clk2, clk3 ;
input clk ;
wire clk ;
initial clk2 = 0;
initial clk3 = 0;
always # (posedge (clk)) begin
clk2 <= ~clk2;
end
always # (posedge (clk2)) begin
clk3 <= ~clk3;
end
endmodule
the circuit generated by quartus:
Your block divides the frequency by 4 not 2. There is actually quite a good description of this on Wikipedia Digital Dividers. Your code can be tidied up a bit but only 1 D-Type is required, which is smaller than a JK Flip-flop so is optimal.
module frquency_divider_by2(
input rst_n,
input clk_rx,
output reg clk_tx
);
always # (posedge clk_rx) begin
if (~rst_n) begin
clk_tx <= 1'b0;
end
else begin
clk_tx <= ~clk_tx;
end
end
endmodule
When chaining these types of clock dividers together be aware that there is a clk to q latency which is compounded every time you go through one of these dividers. IF this is not managed correctly by synthesis the clocks can not be considered synchronous.
Example on EDAplayground, should open the waveform when run.