registering and resetting the convolution output in verilog - verilog

so I have a module that does convolution, it takes a data input and the filter input , where input is array of 9 numbers , every posedge of the clk these two inputs are being multiplied and then added accumulatively, i.e I save every new multiplication product into a register. after each 9 iterations I have to save the result and reset it , but I have to do it in one clock cycle, since my new data is coming on the next posedge. So the issue that I am facing is how to not save data and reset the out without losing data? Please help if you have any suggestions. It also need to be mentioned that my conv_module is a sub-module and I will be instantiating it in a top module , so I have to access all the inputs and outputs from uptop.
This is the code that I've written so far, but it does not work the way I want it, cause I cannot tap the array of output numbers from the top module.
module mult_conv( input clk,
input rst,
input signed [4:0] a,
input signed[2:0] b,
output reg signed[7:0] out
);
wire signed [7:0] mult;
reg signed [7:0] sum;
reg [3:0] counter;
reg do_write;
reg [7:0] out_top;
assign mult = {{3{a[4]}},a} * {{5{b[2]}},b};
always #(posedge clk or posedge rst)
begin
if (rst)
begin
counter <= 4'h0;
addr <= 'h0;
sum <= 0;
do_write <= 1'b0;
end // rst
else
begin
if (counter == 4'h8)
begin // we have gathered 9 samples
counter <= 4'h0;
// start again so ignore old sum
sum <= mult;
out <= sum;
out_top <= out;
end
else
begin
counter <= counter + 4'h1;
// Add results
sum <= sum + mult;
out <= 0;
out_top <= out_top;
end
// Write signal has to be set one cycle early
do_write = (counter==4'h7);
end // clocked
end // always
endmodule

You have a plethora of errors in that code.
Apart from that you have a 3Mega bit memory from which you use only 1 in 9 locations.
You write out in two places. That does not work.
You use a %9. That can not be mapped onto hardware.
You have a sel signal which somehow controls your sum.
On top of that I understand you want to bring the whole memory out.
Your code because it needs to be drastically re-written.
But your biggest problem is that you definitely can't make the memory come out. What ever post-processing you want to do you have two choices:
Process the output data as it appears.
Store the data outside the module in a memory and have another process read that memory.
I think only (1) is the correct way because your signal can have infinite length.
As to fixing this code a bit:
Replace the %9 with a counter to count from 0 to 8.
Process out in in clocked section. See below
Move the addr and sel generating logic in here. Keep it all together.
Below is the basic code of how to do a 9-sequence convolution. I have to ignore 'sel' as I have no idea of the timing. I have also added address generation and a write signal so the result can be store in an external memory. But I still think you should process the result on the fly.
always #(posedge clk or posedge rts)
begin
if (rst)
begin
counter <= 4'h0;
addr <= 'h0;
sum <= 0;
do_write <= 1'b0;
end // rst
else
begin
if (counter == 4'h8)
begin // we have gathered 9 samples
counter <= 4'h0;
addr <= addr + 1;
// start again so ignore old sum
sum <= mult;
end
else
begin
counter <= counter + 4'h1;
// Add results
sum <= sum + mult;
end
// Write signal has to be set one cycle early
do_write = (counter==4'h7);
end // clocked
end // always
(Code above was entered on-the fly, may contain syntax, typing or other errors!!)
As you can see the trick is to know when to add the old result or when to ignore the old sum and start again.
(I spend about 3/4 of an hour on that so on my normal tariff you would have to pay me $93.75 :-)
I provided the basic code to let you work out the specifics. I did nothing with out but left that to you.
do_write and addr where for a possible memory to pick up the result. Without memory you can drop addr but do_write should tell you when a new convolution result is available, in which case you might want to give a it a different name. e.g. 'sum_valid'.

Related

How to program a delay in Verilog?

I'm trying to make a morse code display using an led. I need a half second pulse of the light to represent a dot and a 1.5 second pulse to represent a dash.
I'm really stuck here. I have made a counter using an internal 50MHz clock on my FPGA. The machine I have to make will take as input a 3 bit number and translate that to a morse letter, A-H with A being 000, B being 001 and so on. I just need to figure out how to tell the FPGA to keep the led on for the specified time and then turn off for about a second (that would be the delay between a dot pulse and a dash pulse).
Any tips would be greatly appreciated.
Also, it has to be synthesizable.
Here is my code. It's not functioning yet. The error message it keeps giving me is:
Error (10028): Can't resolve multiple constant drivers for net "c3[0]"
at part4.v(149)
module part4 (SELECT, CLK, CLOCK_50, RESET, led);
input [2:0]SELECT;
input RESET, CLK, CLOCK_50;
output reg led=0;
reg [26:0] COUNT=0; //register that keeps track of count
reg [1:0] COUNT2=0; //keeps track of half seconds
reg halfsecflag=0; //goes high every time half second passes
reg dashflag=0; //goes high every time 1 and half second passes
reg [3:0] code; //1 is dot and 0 is dash. There are 4 total
reg [1:0] c3; //keeps track of the index we are on in the code.
reg [3:0] STATE; //register to keep track of states in the state machine
reg done=0; //a flag that goes up when one morse pulse is done.
reg ending=0; //another flag that goes up when a whole morse letter has flashed
reg [1:0] length; //This is the length of the morse letter. It varies from 1 to 4
wire i; // if i is 1, then the state machine goes to "dot". if 0 "dash"
assign i = code[c3];
parameter START= 4'b000, DOT= 4'b001, DASH= 4'b010, DELAY= 4'b011, IDLE=
4'b100;
parameter A= 3'b000, B=3'b001, C=3'b010, D=3'b011, E=3'b100, F=3'b101,
G=3'b110, H=3'b111;
always #(posedge CLOCK_50 or posedge RESET) //making counter
begin
if (RESET == 1)
COUNT <= 0;
else if (COUNT==8'd25000000)
begin
COUNT <= 0;
halfsecflag <= 1;
end
else
begin
COUNT <= COUNT+1;
halfsecflag <=0;
end
end
always #(posedge CLOCK_50 or posedge RESET)
begin
if (RESET == 1)
COUNT2 <= 0;
else if ((COUNT2==2)&&(halfsecflag==1))
begin
COUNT2 = 0;
dashflag=1;
end
else if (halfsecflag==1)
COUNT2= COUNT2+1;
end
always #(RESET) //asynchronous reset
begin
STATE=IDLE;
end
always#(STATE) //State machine
begin
done=0;
case(STATE)
START: begin
led = 1;
if (i) STATE = DOT;
else STATE = DASH;
end
DOT: begin
if (halfsecflag && ~ending) STATE = DELAY;
else if (ending) STATE= IDLE;
else STATE=DOT;
end
DASH: begin
if ((dashflag)&& (~ending))
STATE = DELAY;
else if (ending)
STATE = IDLE;
else STATE = DASH;
end
DELAY: begin
led = 0;
if ((halfsecflag)&&(ending))
STATE=IDLE;
else if ((halfsecflag)&&(~ending))
begin
done=1;
STATE=START;
end
else STATE = DELAY;
end
IDLE: begin
c3=0;
if (CLK) STATE=START;
else STATE=IDLE;
end
default: STATE = IDLE;
endcase
end
always #(posedge CLK)
begin
case (SELECT)
A: length=2'b01;
B: length=2'b11;
C: length=2'b11;
D: length=2'b10;
E: length=2'b00;
F: length=2'b11;
G: length=2'b10;
H: length=2'b11;
default: length=2'bxx;
endcase
end
always #(posedge CLK)
begin
case (SELECT)
A: code= 4'b0001;
B: code= 4'b1110;
C: code= 4'b1010;
D: code= 4'b0110;
E: code= 4'b0001;
F: code= 4'b1011;
G: code= 4'b0100;
H: code= 4'b1111;
default: code=4'bxxxx;
endcase
end
always #(posedge CLK)
begin
if (c3==length)
begin
c3<=0; ending=1;
end
else if (done)
c3<= c3+1;
end
endmodule
I have been reading your code and there are many issues:
The code is not formatted.
You did not provide a test-bench. Did you write one?
"Can't resolve multiple constant drivers for net" Search on stack exchange for the error message. It has been asked many times.
Use always #(*) not e.g. always #(STATE) you are missing signals like i, halfsecflag, ending. But see point 6: You want the STATE in a clocked section.
Where you use always #(posedge CLK) you must use non-blocking assignments: <=.
There are many places where you use always #(posedge CLK) where you want to use always #(*) (e.g. where you set length and code) Opposite you want to use a posedge CLK where you work with your STATE.
Use one clock and one clock only. Do not use CLK and CLOCK_50. Use either one or the other.
Take care of your vector sizes. This 8'd25000000 is wrong as you can no fit 25000000 in 8 bits.
Your usage of halfsecflag is excellent! I have see many times where people think they can use always #(halfsecflag) which is a recipe for disaster!
Below you find a small piece of your code which I have re-written.
All assignments are non-blocking <=
halfsecflag is essential to operate the code only every half a second, so I put that by itself in a separate if at the top. I would use that throughout the code.
All register are reset, both COUNT2 and dashflag.
dashflag was set to 1 but never set back to 0. I fixed that.
I specified the vector sizes. It makes the code "Lint proof".
Here is it:
always #(posedge CLOCK_50 or posedge RESET)
begin
if (RESET == 1'b1)
begin
COUNT2 <= 2'd00;
dashflag <= 1'b0;
end // reset
else if (halfsecflag) // or if (halfsecflag==1'b1)
begin
if (COUNT2==2'd2))
begin
COUNT2 <= 2'd0;
dashflag <=1'b1;
end
else
begin
COUNT2 <= COUNT2+2'd1;
dashflag <=1'b0;
end
end // clocked
end // always
Start fixing the rest of your code the same way. Write a test-bench, simulate and trace on a waveform display where things go wrong.
Normally you would build the finite state machine to produce the output. That machine would have some stages, like reading the input, mapping it to a sequence of morse code element, shifting out the elements to output buffer, waiting for conditions to move to the next morse element. You will need some timer that would produce one morse time unit intervals, and depending on the FSM stage you will wait one, three or seven time units. The FSM will spin in the waiting stage, it doesn't "magically" sleeps in some fpga-produced delay, there's no such things.
Okay a year later, I know exactly what one should do if they want to create a delay in their verilog program! Essentially, what you should do is create a timer using one of the clocks on your FPGA. For me on my Altera DE1-SoC, the timer I could use is the 50MHz clock known as CLOCK_50. What you do is make a timer module that triggers on the positive (or negative, doesn't matter) edge of the 50MHz clock. Set up a count register that holds a constant value. For example, reg [24:0] timer_limit = 25'd25000000; This is a register that can hold 25 bits. I've set this register to hold the number 25 million. The idea is to flip a bit every time the value in this register is exceeded. Here's some pseudocode to help you understand:
//Your variable declarations
reg [24:0] timer_limit = 25'd25000000; //defining our timer limit register
reg [25:0] timer_count = 0; //See note A
reg half_sec_clock;
always#(posedge of CLOCK_50) begin
if timer_count >= timer_limit then begin
reset timer_count to 0;
half_sec_clock = ~half_sec_clock; //toggle your half_sec_clock
end
Note A: Setting it to zero may or may not initialize count, it's always best to include a reset function that clears your count to zero because you don't know what the initial state is when you're dealing with hardware.
This is the basic idea of how to introduce timing into your hardware. You need to use an onboard clock on your device, trigger on the edge of that clock and create your own slower clock to measure things like seconds. The example above will give you a clock that triggers periodically every half second. For me, this allowed me to easily make a morse code light that could flash on either 1 half second count, or 3 half seconds. My best advice to you beginners is to work in a modular fashion. For example build your half second clock and then test it out to see if you can get a light on your FPGA to toggle once every half second (or whatever interval you want). :) I really hope this is the answer that helps you. I know this is what I was looking for when I originally posted this question so long ago.

Instantiate a module based on a condition in Verilog

I have a 1023 bit vector in Verilog. All I want to do is check if the ith bit is 1 and if it is 1 , I have to add 'i' to another variable .
In C , it would be something like :
int sum=0;
int i=0;
for(i=0;i<1023;i++) {
if(a[i]==1) {
sum=sum+i;
}
Of course , the addition that I am doing is over a Galois Field . So, I have a module called Galois_Field_Adder to do the computation .
So, my question now is how do I conditionally check if a specific bit is 1 and if so call my module to do that specific addition .
NOTE: The 1023 bit vector is declared as an input .
It's hard to answer your question without seeing your module, as we can't gage where you are in your Verilog. You always have to think of how your code translates in gates. If we want to translate your C code into synthesizable logic, we can take the same algorithm, go through each bit one after the other, and add to the sum depending on each bit. You would use something like this:
module gallois (
input wire clk,
input wire rst,
input wire [1022:0] a,
input wire a_valid,
output reg [18:0] sum,
output reg sum_valid
);
reg [9:0] cnt;
reg [1021:0] shift_a;
always #(posedge clk)
if (rst)
begin
sum[18:0] <= {19{1'bx}};
sum_valid <= 1'b0;
cnt[9:0] <= 10'd0;
shift_a[1021:0] <= {1022{1'bx}};
end
else
if (a_valid)
begin
sum[18:0] <= 19'd0;
sum_valid <= 1'b0;
cnt[9:0] <= 10'd1;
shift_a[1021:0] <= a[1022:1];
end
else if (cnt[9:0])
begin
if (cnt[9:0] == 10'd1022)
begin
sum_valid <= 1'b1;
cnt[9:0] <= 10'd0;
end
else
cnt[9:0] <= cnt[9:0] + 10'd1;
if (shift_a[0])
sum[18:0] <= sum[18:0] + cnt[9:0];
shift_a[1021:0] <= {1'bx, shift_a[1021:1]};
end
endmodule
You will get your result after 1023 clock cycles. This code needs to be modified depending on what goes around it, what interface you want etc...
Of importance here is that we use a shift register to test each bit, so that the logic adding your sum only takes shift_a[0], sum and cnt as an input.
Code based on the following would also work in simulation:
if (a[cnt[9:0])
sum[18:0] <= sum[18:0] + cnt[9:0];
but the logic adding to sum would in effect take all 1023 bits of a[] as an input. This would be quite hard to turn into actual lookup tables.
In simulation, you can also implement something very crude such as this:
reg [1022:0]a;
reg [9:0] sum;
integer i;
always #(a)
begin
sum[9:0] = 10'd0;
for (i=0; i < 1023; i=i+1)
if (a[i])
sum[9:0] = sum[9:0] + i;
end
If you were to try to synthesize this, sum would actually turn into a chunk of combinatorial logic, as the 'always' block doesn't rely on a clock. This code is in fact equivalent to this:
always #(a)
case(a):
1023'd0: sum[18:0] = 19'd0;
1023'd1: sum[18:0] = 19'd1;
1023'd2: sum[18:0] = 19'd3;
etc...
Needless to say that a lookup table with 1023 input bits is a VERY big memory...
Then if you want to improve your code, and use your FPGA as an FPGA and not like a CPU, you need to start thinking about parallelism, for instance working in parallel on different ranges of your input a. But this is another thread...

Take the sum of 3 adc data for 3 sampling

Hi everyone,
I am a newbie in programming FPGA by verilog language. At the present, I am trying to design the firmware to calculate the sum of adc data at 3 sampling. Firstly, I will explain about one adc at one sampling in my code. When you look at the code, you can see that with rising-edge of clkr clock and adcIfEnb == 1, the adc_data will get the value from adcIfData and this is the data for one sampling. In the next rising-edge of clkr clock and adcIfEnb == 1, this data is stored in iradcTrg. Finally, I will have the 3 data of adc_data for 3 sampling which are stored in iradcTrg and then I summarize 3 these data.
wire adcIfData[79:0];
reg
always #(posedge clkr) begin
if(adcIfEnb) begin
adc_data[9:0] <= adcIfData[9:0];
end
end
reg [29:0] iradcTrg;
reg [9:0] adcTrg;
always #(posedge clkr) begin
if (adcIfEnb) begin
iradcTrg[29:0] <= {adc_trg[19:10],adc_trg[9:0],adc_data[9:0]};
adcTrg[9:0] <= adc_trg[29:20] + adc_trg[19:10] + adc_trg[9:0];
end
end
However, there are 2 problems which I do not know how to solve.
Firstly, at the beginning time, when the first data of adc_data is stored at iradcTrg and adcTrg also take the sum. It means that adcTrg = 0 + 0 + first_adc_data but this sum need to be avoided.
Secondly, according to my design, I see that adc_data is serialized into iradcTrg. It means that the adc_data will be stored like this:
[1 2 3] 4 5 6 => 1 [2 3 4] 5 6=> 1 2 [3 4 5] 6
But in my case, I would like that the adc_data will be stored like this to get the sum
[1 2 3] 4 5 6 => 1 2 3 [4 5 6]
Therefore, how should I repair my code to get the result that I expected or are there any documents can help me in this case ?
To start: make sure your code is correctly indented when you put it on stackexchange. Secondly: I assume you have edited the code before posting it here because that code will not compile e.g. there is a floating 'reg' at the top and no module declaration.Thirdly: you have defined a wire adcIfData[79:0] I am going to assume you meant that to be [9:0].Forthly: You use variables which are not defined: adc_data, adc_trg.
Fifthly: I suggest you give your variables more meaningfull names like: gater_samples, sum_off_samples.
Now lets look at the core of the code. You want to take samples and shift them into a 30 bit register. There is no need to write "adc_trg[19:10],adc_trg[9:0]" adc_trg[19:0] will suffice. Also there is no need to put it in a different register beforehand. I would just use:
always #(posedge clkr)
if (adcIfEnb)
iradcTrg[29:0] <= {iradcTrg[19:0],adcIfData[9:0]};
As to your basic problem of gathering samples and not using the first two: all you have to do is add a counter which counts to three. Then you add the result on the third count. You will need a reset to give the counter a known value at startup but I don't see a reset signal. I always try to use minimal logic so I would make iradcTrg 20 bits wide to only store the intermediate result and at the count of three add it up with the latest sample. Saves another 10 registers. Here is some code. I wrote this without simulating or compiling. It is just a guide of how it all should look like.
reg [ 1:0] count;
reg [19:0] gather_samples;
reg [ 9:0] sum_of_samples;
reg sum_valid;
always #(posedge clkr)
begin
if (some_reset)
count <= 2'd0;
else
if (adcIfEnb)
begin
if (count==2'd2)
begin // third sample arriving, add it to the previous 2
sum_of_samples <= gather_samples[19:10] + gather_samples[9:0] + adcIfData;
count <= 2'd0;
else
begin // intermediate: gather samples
gather_samples <= {gather_samples[9:0],adcIfData};
count <= count + 2'd1;
end
sum_valid <= (count==2'd2);
end // if (adcIfEnb)
end // clocked
Your job will be much easier if you use a state machine. Here's a small (and incomplete) example of a state machine.
parameter FIRST_DATA=0, SECOND_DATA=1, THIRD_DATA=2, OUTPUT=3;
reg [2:0] current_state = FIRST_DATA;
reg [9:0] adc_data1;
reg [9:0] adc_data2;
reg [9:0] adc_data3;
reg [11:0] adc_data_sum;
always # (posedge clk)
begin
// TODO: use proper reset
case (current_state):
FIRST_DATA:
if(adcIfEnb):
current_state <= SECOND_DATA;
SECOND_DATA:
if(adcIfEnb):
current_state <= THIRD_DATA;
THIRD_DATA:
if(adcIfEnb):
current_state <= OUTPUT;
OUTPUT:
if(adcIfEnb):
current_state <= FIRST_DATA;
endcase
end
always # (negedge clk)
begin
if (current_state == FIRST_DATA && adcIfEnb)
adc_data1 <= adcIfData;
end
always # (negedge clk)
begin
if (current_state == SECOND_DATA && adcIfEnb)
adc_data2 <= adcIfData;
end
always # (negedge clk)
begin
if (current_state == THIRD_DATA && adcIfEnb)
adc_data3 <= adcIfData;
end
always # (negedge clk)
begin
if (current_state == OUTPUT)
adc_data_sum <= adc_data1 + adc_data2 + adc_data3;
end
Few comments first:
1) I don't know why do you introduce so many variables with strange names. You only need a adc_buffer and adc_sum. Is iradcTrg the same as adc_trg? Why is there an empty reg statement? Why adcIfData has 80 bits and you only use 8 LSB bits? I'm confused.
2) Since adc_sum will be a sum of 3 (adcTrg in your case), think about possible overflow. What should be the width of adc_sum if you want to add 3 10-bit numbers?
3) Shouldn't you reset your design to a known state using asynchronous or synchronous reset first?
You can use a 2 bit counter with async reset and a logic for wrapping back to 0:
reg [1:0] adc_buffer_counter_reg;
always #(posedge clkr or negedge rst_n) begin
if (!rst_n)
adc_buffer_counter_reg <= 2'd0;
else if (adcIfEnb) begin
if (adc_buffer_counter_reg == 2'd2) //trigger calc of the sum here
adc_buffer_counter_reg <= 2'd0;
else
adc_buffer_counter_reg <= adc_buffer_counter_reg + 2'd1;
end
You can use this counter to trigger a calculation of the sum every 3rd valid data.

How to start a counter using a one clock pulse enable

So I'm trying to implement a counter that takes a one-clock-cycle enable and starts counting from there. It sends a one-clock-cycle expired once the counter finishes counting. The enable will never be sent again until the counter is done counting, and the value input is continuous.
I was thinking of just using a reg variable that is changed to high when enable = 1, and changed to low once the counter finishes counting. I fear this may imply latches that I don't want... is there a better way to do this?
current code:
module Timer(
input [3:0] value,
input start_timer,
input clk,
input rst,
output reg expired
);
//counter var
reg [3:0] count;
//hold variable
reg hold;
//setting hold
always #*
begin
if (start_timer == 1'b1)
hold <= 1'b1;
end
//counter
if (hold == 1'b1)
begin
always # (posedge(clk))
begin
if(count == value - 1)
begin
expired <= 1'b1;
count <= 4'b0;
hold <= 1'b0;
end
else
count <= count + 1'b1;
end
end
endmodule
Well, you seem to be on the right track, but you are correct that your current design will imply latches. You can fix this by making hold its own flipflop and only counting when its set as you have done, clearing it once your counting is complete. Note this will impact the timing of your system, so you have to make sure everything happens on the correct cycle. Its hard to say from your description whether you want expired to be set on the "value"th cycle or "value + 1"th cycle. In order to show you a few of what youll need to do to your code to get it to compile and have the proper timing, Im going to assume a few things:
value is held at a constant number and doesnt not change while counting
expired will pulse on the valueth clock after start_timer is set
value will not be 4'h0 (though this is easy to deal with, as no counting is required based on assumption 1)
So, your module would look like this:
module Timer(
input [3:0] value,
input start_timer,
input clk,
input rst,
output expired
);
//counter var
reg [3:0] count;
//hold variable
reg hold;
// Note you cannot assign hold from multiple blocks, so the other one has been removed
// In order to meet the timing from assumption 2, we need to combinationally determine expired (be sure we only set it while we are counting; thus the dependence on hold)
assign expired = (count == value - 4'd1) && (hold == 1'b1);
//counter
// Note, you cannot have logic outside of a procedural block!
always #(posedge clk) begin
if ((hold == 1'b0) && (start_timer == 1'b1)) begin
hold <= 1'b1; // Hold is part of the register, making it its own flipflop
count <= 4'd0;
end
else if (hold == 1'b1) begin
if (count == value - 4'd1) begin
hold <= 1'b0; // Clear hold, we are done counting
// No need to clear count, it will be cleared at the start of the timer
end
else begin
count <= count + 4'd1;
end
end
end
endmodule

Is there a way to sum multi-dimensional arrays in verilog?

This is something that I think should be doable, but I am failing at how to do it in the HDL world. Currently I have a design I inherited that is summing a multidimensional array, but we have to pre-write the addition block because one of the dimensions is a synthesize-time option, and we cater the addition to that.
If I have something like reg tap_out[src][dst][tap], where src and dst is set to 4 and tap can be between 0 and 15 (16 possibilities), I want to be able to assign output[dst] be the sum of all the tap_out for that particular dst.
Right now our summation block takes all the combinations of tap_out for each src and tap and sums them in pairs for each dst:
tap_out[0][dst][0]
tap_out[1][dst][0]
tap_out[2][dst][0]
tap_out[3][dst][0]
tap_out[0][dst][1]
....
tap_out[3][dst][15]
Is there a way to do this better in Verilog? In C I would use some for-loops, but that doesn't seem possible here.
for-loops work perfectly fine in this situation
integer src_idx, tap_idx;
always #* begin
sum = 0;
for (scr_idx=0; src_idx<4; src_idx=scr_idx+1) begin
for (tap_idx=0; tap_idx<16; tap_idx=tap_idx+1) begin
sum = sum + tap_out[src_idx][dst][tap_idx];
end
end
end
It does unroll into a large combinational logic during synthesis and the results should be the same adding up the bits line by line.
Propagation delay from a large summing logic could have a timing issue. A good synthesizer should find the optimum timing/area when told the clocking constraint. If logic is too complex for the synthesizer, then add your own partial sum logic that can run in parallel
reg [`WIDHT-1:0] /*keep*/ partial_sum [3:0]; // tell synthesis to preserve these nets
integer src_idx, tap_idx;
always #* begin
sum = 0;
for (scr_idx=0; src_idx<4; src_idx=scr_idx+1) begin
partial_sum[scr_idx] = 0;
// partial sums are independent of each other so the can run in parallel
for (tap_idx=0; tap_idx<16; tap_idx=tap_idx+1) begin
partial_sum[scr_idx] = partial_sum[scr_idx] + tap_out[src_idx][dst][tap_idx];
end
sum = sum + partial_sum[scr_idx]; // sum the partial sums
end
end
If timing is still an issue, then you have must treat the logic as multi-cycle and sample the value some clock cycles after the input changed.
In RTL (the level of abstraction you are likely modelling with your HDL), you have to balance parallelism with time. By doing things in parallel, you save time (typically) but the logic takes up a lot of space. Conversely, you can make the adds completely serial (do one add at one time) and store the results in a register (it sounds like you want to accumulate the total sum, so I will explain that).
It sounds like the fully parallel is not practical for your uses (if it is and you want to rewrite it, look up generate statements). So, you'll need to create a small FSM and accumulate the sums into a register. Here's a basic example, which sums an array of 16-bit numbers (assume they are set somewhere else):
reg [15:0] arr[0:9]; // numbers
reg [31:0] result; // accumulated sum
reg load_result; // load signal for register containing result
reg clk, rst_L; // These are the clock and reset signals (reset asserted low)
/* This is a register for storing the result */
always #(posedge clk, negedge rst_L) begin
if (~rst_L) begin
result <= 32'd0;
end
else begin
if (load_result) begin
result <= next_result;
end
end
end
/* A counter for knowing which element of the array we are adding
reg [3:0] counter, next_counter;
reg load_counter;
always #(posedge clk, negedge rst_L) begin
if (~rst_L) begin
counter <= 4'd0;
end
else begin
if (load_counter) begin
counter <= counter + 4'd1;
end
end
end
/* Perform the addition */
assign next_result = result + arr[counter];
/* Define the state machine states and state variable */
localparam IDLE = 2'd0;
localparam ADDING = 2'd1;
localparam DONE = 2'd2;
reg [1:0] state, next_state;
/* A register for holding the current state */
always #(posedge clk, negedge rst_L) begin
if (~rst_L) begin
state <= IDLE;
end
else begin
state <= next_state;
end
end
/* The next state and output logic, this will control the addition */
always #(*) begin
/* Defaults */
next_state = IDLE;
load_result = 1'b0;
load_counter = 1'b0;
case (state)
IDLE: begin
next_state = ADDING; // Start adding now (right away)
end
ADDING: begin
load_result = 1'b1; // Load in the result
if (counter == 3'd9) begin // If we're on the last element, stop incrementing counter, we are done
load_counter = 1'b0;
next_state = DONE;
end
else begin // Otherwise, keep adding
load_counter = 1'b1;
next_state = ADDING;
end
end
DONE: begin // finished adding, result is in result!
next_state = DONE;
end
endcase
end
There are lots of resources on the web explaining FSMs if you are having trouble with the concept, but they can be used to implement your basic C-style for loop.

Resources