Finding columns in a 2-D array in Verilog

I have the following code:
`timescale 1ns / 1ps
//////////////////////////////////////////////////////////////////////////////////
// Company:
// Engineer:
//
// Create Date: 04/07/2019 01:20:06 PM
// Design Name:
// Module Name: data_generator_v1
// Project Name:
// Target Devices:
// Tool Versions:
// Description:
//
// Dependencies:
//
// Revision:
// Revision 0.01 - File Created
// Additional Comments:
//
//////////////////////////////////////////////////////////////////////////////////
module data_generator_v1 #(
// Define parameters
parameter integer MAPPING_NUMBER = 196 // MAPPING NUMBER IS USED TO SET A SPECIFIC PROBABILITY (16 BIT SCALING --> MAX VALUE = 65535 --> MAPPING NUMBER = 65535 * 0.03 == 196)
)
(
input S_AXI_ACLK , // Input clock
input S_AXI_ARESETN, // RESET signal (active low )
input start_twister,
output reg [1022:0] rec_vector = 1023'd0,
output reg start_decoding = 1'b0 ,
output integer random_vector_bit_errors = 0
);
// Mersenne Twister signals ----------------------------------------------------------------------
wire [63:0] output_axis_tdata ;
wire output_axis_tvalid ;
wire output_axis_tready ;
wire busy ;
wire [63:0] seed_val ;
wire seed_start ;
//--------------------------------------------------------------------------------------------------
// Signals ----------------------------------------------------------------------------------------
wire [3:0] random_nibble ;
integer nibble_count = 256 ; // initialize to 256
reg [1023:0] random_vector = 1024'd0;
reg sample_random_vector = 1'b0;
reg [9:0] bit_errors = 10'd0 ;
// -------------------------------------------------------------------------------------------------
// Generate numbers with a specific probability
assign random_nibble[0] = (output_axis_tdata[15:0] < MAPPING_NUMBER) ? 1 : 0 ;
assign random_nibble[1] = (output_axis_tdata[31:16] < MAPPING_NUMBER) ? 1 : 0 ;
assign random_nibble[2] = (output_axis_tdata[47:32] < MAPPING_NUMBER) ? 1 : 0 ;
assign random_nibble[3] = (output_axis_tdata[63:48] < MAPPING_NUMBER) ? 1 : 0 ;
// Generate a random vector ------------------------------------------------------------------------
always @(posedge S_AXI_ACLK) begin
if(S_AXI_ARESETN == 1'b0 ) begin
random_vector <= 1024'd0 ;
sample_random_vector <= 1'b0 ;
nibble_count <= 256 ;
random_vector_bit_errors <= 0 ;
bit_errors <= 0 ;
end
else begin
if(output_axis_tvalid == 1'b1) begin
if(nibble_count == 0 ) begin
random_vector <= random_vector ;
sample_random_vector <= 1'b1 ;
nibble_count <= 256 ;
random_vector_bit_errors <= bit_errors ;
bit_errors <= 0 ;
end
else begin
nibble_count <= nibble_count - 1 ; // 256*4 == 1024 bit vector
sample_random_vector <= 1'b0 ;
random_vector <= (random_vector << 4) ^ random_nibble ;
random_vector_bit_errors <= random_vector_bit_errors ;
if(nibble_count == 256) begin
case(random_nibble[2:0])
3'b000 : bit_errors <= bit_errors ;
3'b001 : bit_errors <= bit_errors + 1 ;
3'b010 : bit_errors <= bit_errors + 1 ;
3'b011 : bit_errors <= bit_errors + 2 ;
3'b100 : bit_errors <= bit_errors + 1 ;
3'b101 : bit_errors <= bit_errors + 2 ;
3'b110 : bit_errors <= bit_errors + 2 ;
3'b111 : bit_errors <= bit_errors + 3 ;
endcase
end
else begin
case (random_nibble)
4'b0000 : bit_errors <= bit_errors ;
4'b0001 : bit_errors <= bit_errors + 1 ;
4'b0010 : bit_errors <= bit_errors + 1 ;
4'b0011 : bit_errors <= bit_errors + 2 ;
4'b0100 : bit_errors <= bit_errors + 1 ;
4'b0101 : bit_errors <= bit_errors + 2 ;
4'b0110 : bit_errors <= bit_errors + 2 ;
4'b0111 : bit_errors <= bit_errors + 3 ; // three bits set
4'b1000 : bit_errors <= bit_errors + 1 ;
4'b1001 : bit_errors <= bit_errors + 2 ;
4'b1010 : bit_errors <= bit_errors + 2 ;
4'b1011 : bit_errors <= bit_errors + 3 ;
4'b1100 : bit_errors <= bit_errors + 2 ;
4'b1101 : bit_errors <= bit_errors + 3 ;
4'b1110 : bit_errors <= bit_errors + 3 ;
4'b1111 : bit_errors <= bit_errors + 4 ;
endcase
end
end
end
end
end
// Sample output for the next block
always @(posedge S_AXI_ACLK) begin
if(S_AXI_ARESETN == 1'b0) begin
rec_vector <= 1023'd0 ;
start_decoding <= 1'b0 ;
end
else begin
if(sample_random_vector) begin
rec_vector <= random_vector[1022:0] ;
start_decoding <= 1'b1 ;
end
else begin
rec_vector <= rec_vector ;
start_decoding <= 1'b0 ;
end
end
end
//---------------------------------------------------------------------------------------------------
// //-------------------------------------------------------------------------------------------------------------------------------------
// // STANDARD CLOCK AND RESET
// //output_axis_tdata contains valid data when output_axis_tvalid is asserted
// // output_axis_tready is input into the mersenne twister and we can use this to accept or stop the generation of new data streams
// // busy is asserted when the mersenne twister is performing some computations
// // seed val is not used . It will start will default seed
// // seed start --> not used
// Mersenne twister signal assignment
assign seed_val = 64'd0 ; // used for seeding purposes
assign seed_start = 1'b0 ; // We do not want to assign a new seed so we proceed with the default one
assign output_axis_tready = (S_AXI_ARESETN == 1'b0 || start_twister == 0 ) ? 1'b0 : 1'b1 ; // knob to turn the twister on and off
// MODULE INSTANTIATION
axis_mt19937_64 AMT19937(S_AXI_ACLK,S_AXI_ARESETN,output_axis_tdata,output_axis_tvalid,output_axis_tready,busy,seed_val,seed_start) ;
// //-------------------------------------------------------------------------------------------------------------------------------------
endmodule
The focus of this question is the variable: output reg [1022:0] rec_vector = 1023'd0
I load this vector using a Mersenne Twister random number generator. The Mersenne Twister provides a 64-bit number that is then mapped to a 4-bit number. 256 such 4-bit numbers are generated to fill one row of the rec_vector variable.
Now I need to select each row in this 2-D array and send it for decoding. This is simple: I can write something like rec_vector[row_index] to get a specific row.
After I run an operation on each of the rows, I need to perform the same operation on the columns as well. How do I get the columns out of this 2-D array?
Please note that a simple approach, such as creating wires and assigning them like this:
codeword_column[0] = {rec_vector[0][0], rec_vector[1][0] ....., rec_vector[1022][0]}
does not work. If I do this, utilization blows up, because I am now doing an asynchronous read on the 2-D array, and the array can no longer be inferred as block RAM, since block RAMs only support synchronous reads.
I would really appreciate any input on this. Thanks for taking the time to read this.

I'll give this as a complete answer rather than a comment, because a similar question popped up a short while ago: Accessing a million bits
What you are really asking is "How can I access a 2-D array in both row and column mode?"
This is only possible if you build the array entirely out of registers.
As soon as you have too many bits to store in registers, you have to fall back on memories. So how do you access both rows and columns in a memory?
The answer is the very unsatisfactory: "You can't."
Unfortunately, memories are implemented as long rows of bits, and the hardware lets you select only one row at a time. To access a column you have to work your way through the addresses, reading one row at a time and picking out the column bit(s) you want. This means it costs one clock cycle to read one column element.
The first way to speed things up is to use dual-ported memories. The memories on the FPGAs I know are all dual-ported, so you can do two reads from different addresses at the same time.
You can also speed up access by storing two rows per memory word. For example, an array of 8x8 bytes can be stored as 16x4, so that each read gives you two rows at once and thus the first two column elements. (This has diminishing returns, though: you end up with one huge row of registers again.)
Combining this with dual-ported access gives you four column elements per clock cycle.
One last warning, which is also mentioned in the link above: FPGAs have two types of memories:
Synchronous write and asynchronous read, for which they have to use LUTs.
Synchronous write and synchronous read, for which they can use the internal memory banks.
The latter offer by far the most storage, so if you write your code to use the former you can quickly find yourself out of resources.
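The row-by-row column walk described above could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the asker's design: the module name, port names, and the small 8x8 geometry are all made up for brevity, and whether a particular tool maps this onto block RAM depends on the tool and the device.

```verilog
// Hypothetical sketch: a small dual-port memory with synchronous reads.
// All names and sizes are illustrative assumptions.
module dp_row_mem #(
    parameter ROWS = 8,  // number of rows (memory depth)
    parameter COLS = 8,  // bits per row (memory width)
    parameter AW   = 3,  // address width, 2**AW >= ROWS
    parameter CW   = 3   // column index width, 2**CW >= COLS
) (
    input  wire            clk,
    // Port A: write and read
    input  wire            we_a,
    input  wire [AW-1:0]   addr_a,
    input  wire [COLS-1:0] din_a,
    // Port B: read only
    input  wire [AW-1:0]   addr_b,
    // Column to pick out of the two rows that were read
    input  wire [CW-1:0]   col_sel,
    output wire            col_bit_a,
    output wire            col_bit_b
);
    reg [COLS-1:0] mem [0:ROWS-1];
    reg [COLS-1:0] row_a;  // row read through port A, valid one clock later
    reg [COLS-1:0] row_b;  // row read through port B, valid one clock later

    // Port A: synchronous write, synchronous (registered) read
    always @(posedge clk) begin
        if (we_a)
            mem[addr_a] <= din_a;
        row_a <= mem[addr_a];
    end

    // Port B: synchronous (registered) read only
    always @(posedge clk) begin
        row_b <= mem[addr_b];
    end

    // The column element is selected from the registered rows,
    // not from the memory array itself.
    assign col_bit_a = row_a[col_sel];
    assign col_bit_b = row_b[col_sel];
endmodule
```

To read one column, hold col_sel constant and step addr_a through the even rows and addr_b through the odd rows. Two column elements then arrive per clock, one cycle after each address is presented, so a full column takes about ROWS/2 clock cycles.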

Related

Verilog expand each bit n times

I want to expand each bit n times.
For example,
// n = 2
5'b10101 -> 10'b1100110011
// n = 3
5'b10101 -> 15'b111000111000111
Is there any simple way (i.e., not using generate block) in Verilog or SystemVerilog?
EDIT 19.02.21
Actually, I'm doing 64bit mask to 512bit mask conversion, but it is different from {8{something}}. My current code is the following:
logic [63 : 0] x;
logic [511 : 0] y;
genvar i;
for (i = 0; i < 64; i = i + 1) begin
always_comb begin
y[(i + 1) * 8 - 1 : i * 8] = x[i] ? 8'hFF : 8'h00;
end
end
I just wonder whether a more "beautiful" way exists.
I think that your method is a good one. You cannot do it without some kind of a loop (unless you want to type all the iterations manually). There might be several variants for implementing it.
For example, you can use the '+:' (indexed part-select) operator instead of computing both bounds, which simplifies it a bit.
genvar i;
for (i = 0; i < 64; i = i + 1) begin
always_comb begin
y[i * 8 +: 8] = x[i] ? 8'hFF : 8'h00;
end
end
The above method still generates 64 always blocks (as in your original one), though the sensitivity list of each block is just a single bit of 'x'.
You can move the for loop inside an always block:
always @* begin
for (int j = 0; j < 64; j++) begin
y[j * 8 +: 8] = x[j] ? 8'hFF : 8'h00;
end
end
This ends up as a single always block, but its sensitivity list includes all bits of 'x'.
If this operation is used multiple times, you can use a function :
function logic [511 : 0] transform(input logic [63 : 0] x);
for (int j = 0; j < 64; j++) begin
transform[j * 8 +: 8] = x[j] ? 8'hFF : 8'h00;
end
endfunction
...
always @* begin
y = transform(x);
end
If n is a parameter you can do:
always_comb begin
y = '0;
for(int idx=0; idx<($bits(y)/n) && idx<$bits(x); idx++) begin
y[idx*n +: n] = {n{x[idx]}};
end
end
If n is a signal you have to assign each bit:
always_comb begin
y = '0;
foreach(y[idx]) begin
y[idx] = x[ idx/n ];
end
end
A variable divisor will add timing and area overhead. Depending on your design target, it may or may not be an issue (synthesis optimization or simulation only).
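For the parameter case, the same loop can be wrapped in a small reusable module. The sketch below is only an illustration: the module name bit_expand, the parameters W and N, and the testbench are assumptions, not part of the original answers. It is written in plain Verilog-2001 and checks the two examples from the question.

```verilog
// Hypothetical sketch: repeat every bit of x N times.
// Module and parameter names are illustrative assumptions.
module bit_expand #(
    parameter W = 5,  // input width
    parameter N = 2   // repetitions per bit
) (
    input  wire [W-1:0]   x,
    output reg  [W*N-1:0] y
);
    integer i;
    always @* begin
        for (i = 0; i < W; i = i + 1)
            y[i*N +: N] = {N{x[i]}};  // N copies of bit i
    end
endmodule

// Simulation-only check using the examples from the question.
module bit_expand_tb;
    reg  [4:0]  x;
    wire [9:0]  y2;
    wire [14:0] y3;

    bit_expand #(.W(5), .N(2)) u2 (.x(x), .y(y2));
    bit_expand #(.W(5), .N(3)) u3 (.x(x), .y(y3));

    initial begin
        x = 5'b10101;
        #1;
        $display("%b %b", y2, y3);  // prints 1100110011 111000111000111
        $finish;
    end
endmodule
```

For the 64-bit to 512-bit mask conversion in the question, this would be instantiated with W = 64 and N = 8.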
My answer might not be the best of the answers, but if I were you, I would do something as below (assuming x and y are registers in your module that will be used in a synchronous design):
// your module name and ports
reg [63:0] x;
reg [511:0] y;
// your initializations
always @(posedge clk) begin
y[0+:8] <= x[0] ? 8'hff : 8'h00;
y[8+:8] <= x[1] ? 8'hff : 8'h00;
y[16+:8] <= x[2] ? 8'hff : 8'h00;
y[24+:8] <= x[3] ? 8'hff : 8'h00;
y[32+:8] <= x[4] ? 8'hff : 8'h00;
*
*
*
y[504+:8] <= x[63] ? 8'hff : 8'h00;
end
For different always conditions:
// your module name and ports
reg [63:0] x;
reg [511:0] y;
// your initializations
always @(/* some sensitivity conditions */) begin
y[0+:8] <= x[0] ? 8'hff : 8'h00;
y[8+:8] <= x[1] ? 8'hff : 8'h00;
y[16+:8] <= x[2] ? 8'hff : 8'h00;
y[24+:8] <= x[3] ? 8'hff : 8'h00;
y[32+:8] <= x[4] ? 8'hff : 8'h00;
*
*
*
y[504+:8] <= x[63] ? 8'hff : 8'h00;
end
However, if I wanted a separate module that inputs x and outputs y, I would do something as below:
module mask_conversion(
input [63:0] x,
output [511:0] y
);
assign y[0+:8] = x[0] ? 8'hff : 8'h00;
assign y[8+:8] = x[1] ? 8'hff : 8'h00;
assign y[16+:8] = x[2] ? 8'hff : 8'h00;
assign y[24+:8] = x[3] ? 8'hff : 8'h00;
assign y[32+:8] = x[4] ? 8'hff : 8'h00;
*
*
*
assign y[504+:8] = x[63] ? 8'hff : 8'h00;
endmodule
It is not that difficult to type all these, you just need to copy and paste, and change numbers manually. As a result you will get guaranteed code that does what you want.

verilog output stuck on last if statement

Problem: I'm synthesizing my code, which reads 1200 16-bit binary vectors, analyzes them, and sets a 2-bit register named classe depending on the outcome of 4 if statements. The problem seems to be that classe is stuck on the last if statement, where classe is set to 2'b11, or 3.
My code worked fine when I was using a testbench.
I think it is stuck because the always block is somehow reading all 1200 vectors at once, as seen in the simulation, instead of one on every clock edge?
I've attached a simulation screenshot here: https://imgur.com/a/No2E9cq
module final_final_code
(
output reg [ 0:1] classe
);
reg [0:15] memory [0:1199];
reg [0:15] vect;
integer i;
//// Internal Oscillator
defparam OSCH_inst.NOM_FREQ = "2.08";
OSCH OSCH_inst
(
.STDBY(1'b0), // 0=Enabled, 1=Disabled also Disabled with Bandgap=OFF
.OSC(osc_clk),
.SEDSTDBY() // this signal is not required if not using SED
);
initial begin
$readmemb("C:/Users/KP/Desktop/data.txt", memory, 0, 1199);
i = 0;
end
always @(posedge osc_clk) begin
vect = memory[i];
if ((memory[i][3] == 1'b0)) begin
classe = 2'b10;
end
if ((memory[i][11] == 1'b0)) begin
classe = 2'b01;
end
if ((memory[i][8] == 1'b1 && memory[i][4] + memory[i][5] + memory[i][6] + memory[i][7] >= 4'b0100)) begin
classe = 2'b00;
end
if ((memory[i][0] + memory[i][1] + memory[i][2] + memory[i][3] + memory[i][4] + memory[i][5] + memory[i][6] + memory[i][7] + memory[i][8] + memory[i][9] + memory[i][10] + memory[i][11] + memory[i][12] + memory[i][13] + memory[i][14] + memory[i][15] <= 1'b1)) begin
classe = 2'b11;
end
i = i + 1'd1;
if (i == 4'd1199) begin
i = 0;
end
end
endmodule
Apart from what john_log says:
Your last if statement is always TRUE. You are adding 1-bit operands and comparing the sum against a 1-bit constant, so the whole expression is evaluated in 1 bit: the sum is either 1'b0 or 1'b1, which is always <= 1'b1.
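A small simulation-only sketch of this width effect, and of one possible fix (accumulating the count in a wider register), is shown here. The module and signal names are illustrative and not taken from the question's code.

```verilog
// Hypothetical demo: 1-bit arithmetic versus a wide ones counter.
module ones_count_demo;
    reg [0:15] v;     // same bit ordering as the question's vectors
    reg [4:0]  ones;  // wide enough to hold a count of 0..16
    integer    b;

    initial begin
        v = 16'hFFFF;  // all 16 bits set

        // Every operand is 1 bit wide, so the sum wraps to a single bit:
        // 1+1+1+1 becomes 0, and 0 <= 1 is true.
        $display("narrow compare: %b",
                 (v[0] + v[1] + v[2] + v[3] <= 1'b1));

        // Accumulate in a 5-bit register so the count cannot wrap.
        ones = 0;
        for (b = 0; b < 16; b = b + 1)
            ones = ones + v[b];

        // The count is 16, so "at most one bit set" is false.
        $display("ones = %0d, wide compare: %b", ones, (ones <= 5'd1));
        $finish;
    end
endmodule
```

With all 16 bits set, the narrow comparison still evaluates to 1 while the wide comparison evaluates to 0, which is the behaviour the last if statement was presumably meant to have.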
You should check if your FPGA tool supports this:
initial begin
$readmemb("C:/Users/KP/Desktop/data.txt", memory, 0, 1199);
i = 0;
end
In particular, check whether the synthesis tool can load a memory from a file. That was not possible the last time I used an FPGA.

How to initialize a wire with constant in verilog ?

In the Verilog code below for a J-K flip-flop, I want to initialize the wires q and q_bar with some value. For example, here I initialize q and q_bar to 0. But in the output, q and q_bar have the unknown value (1'hx). So how do I initialize a wire with a constant?
module JK_FF(j,k,clk,q,q_bar) ;
input j,k,clk ;
output q , q_bar ;
wire s,r,w,z ;
assign w = q ;
assign z = q_bar ;
nand U1(s,j,clk,z) ;
nand U2(r,k,clk,w) ;
nand U3(q,s,z) ;
nand U4(q_bar,r,w) ;
endmodule
/* TEST BENCH */
module JK_FF_TB ;
reg j,k,clk ;
wire q , q_bar ;
assign q = 1'b0 ;
assign q_bar = 1'b0 ;
initial begin
clk = 1'b1 ;
end
JK_FF DUT(j,k,clk,q,q_bar) ;
initial
begin
j = 1'b0 ;
k = 1'b0 ;
#5
j = 1'b0 ;
k = 1'b1 ;
#5
j = 1'b1 ;
k = 1'b0 ;
#5
j = 1'b1 ;
k = 1'b1 ;
end
endmodule
There are several issues to address.
State in Verilog, such as a flip-flop value, is usually kept in a reg, whose value can be initialized using initial. However, the simple flip-flop made of gates contains only wires, which can't be initialized.
In a hardware implementation, the design with the cross-coupled NAND gates settles to a stable value at start-up, even though the wires are initially undefined (1'bX). You can emulate this in simulation by converting 1'bX to 1'b0 or 1'b1 at q and q_bar using assign:
assign w = q !== 1'b0; // 1'bX => 1
assign z = q_bar === 1'b1; // 1'bX => 0
The Verilog implementation will, however, have a race condition, since the clock pulse will always be too long for the immediate changes that occur when this design is simulated. This typically shows up as an infinite iteration during simulation, which hits the iteration limit and results in an error.
So more modifications are required; you can find a great tutorial here: The JK Flip Flop
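For comparison, the reg-based style mentioned at the start of this answer might look like the sketch below. It is a behavioral, edge-triggered model and an assumption on my part: it replaces the gate-level NAND design rather than repairing it, and it is not necessarily what the linked tutorial does.

```verilog
// Hypothetical behavioral JK flip-flop: state is held in a reg,
// so it can be given a start value with an initial block.
module jk_ff_behavioral (
    input  wire j,
    input  wire k,
    input  wire clk,
    output reg  q,
    output wire q_bar
);
    initial q = 1'b0;     // start value for simulation

    assign q_bar = ~q;    // q_bar always follows q

    always @(posedge clk) begin
        case ({j, k})
            2'b00: q <= q;     // hold
            2'b01: q <= 1'b0;  // reset
            2'b10: q <= 1'b1;  // set
            2'b11: q <= ~q;    // toggle
        endcase
    end
endmodule
```

Because q_bar is derived from q, the testbench does not need to drive q or q_bar with assign statements at all. It only needs a toggling clock, for example `always #1 clk = ~clk;`.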

calculation of simulation time in verilog

I want to measure the simulation time taken to calculate one prime number, i.e. the number of clock cycles needed to calculate it. As we know, calculating a large prime number takes more clock cycles than a small one.
I use $time in Verilog to capture the time in a time_s register whenever a prime is calculated, and then calculate the difference when the next prime is found. Here is my code: time_s1 captures the time at which a prime is calculated, and time_s2 holds the difference.
module prime_number_count(
input clk
);
//for count 1
parameter N =100; // size of array
parameter N_bits = 32;
reg [N_bits-1:0] prime_number[0:N-1]; // memory array for prime_number
reg [N_bits-1:0] prime_aftr50 [0:49]; // memory array to get
integer k; // counter variable
integer k1; // counter variable
integer count;
integer test;
integer time_s1;
integer time_s2;
integer check; //Counts 1 to k
localparam S_INC = 2'b01;
localparam S_CHECK = 2'b10;
reg [1:0] state;
initial begin
prime_number[0] = 'd1;
prime_number[1] = 'd2;
//prime_aftr50[0] = 'd0;
state = S_CHECK; //Check set count first
count = 'd3;
k = 'd2; //0,1 preloaded
check = 'd1;
test = 'd1;
time_s1 = 'd0;
time_s2 = 'd0;
k1 = 'd0;
end
always @(posedge clk )
begin
$display ("time of clock %d ", $time );
if(state == S_INC)
begin // if state is 1
//$display("State: Incrementing Number to check %d", count+1);
count <= count+1 ;
state <= S_CHECK ; // chang the state to 2
check <= 'd1; // Do not check against [0] value 1
test <= 'd1; // Safe default
end
else if (state == S_CHECK) begin
if (test == 0) begin
// Failed Prime test (exact divisor found)
$display("Reject %3d", count);
state <= S_INC ;
end
else
if (time_s2>30000)begin
prime_number[k]=prime_number[k-1];
time_s1 <=$realtime ;
state <= S_INC ;
k <= k + 1;
$display("Found %1d th Prime_1 %1d", k, count);
$display("display of simulation time" , time_s2);
end // end of simulation time
else
if (check == k) begin
//Passed Prime check
time_s1 <=$time ;
prime_number[k] <= count;
k <= k + 1;
state <= S_INC ;
$display("Found %1d th Prime_1 %1d", k, count);
$display("display of simulation time" , time_s2);
end
else begin
//$display("Check");
test <= count % prime_number[check] ;
check <= check + 1;
//$display("Checking %1d against %1d prime %1d : %1d", count, check, prime_number[check], count % prime_number[check]);
end
end
end
//////////////////////////////////////////////////////////////////
always @(posedge clk )
begin
if(check==k-1)
begin
time_s2 <=$realtime-time_s1;
// $display("display of simulation time" , time_s2) ;
end
end
always @(posedge clk) begin
if ( k==51+(50*k1)) begin
prime_aftr50[k1] <= count;
k1 <= k1+1;
end
end
endmodule
Background on time
Semantically, I would recommend using time rather than integer for timestamps; both are integer types. But because time holds only whole numbers, it is limited to the accuracy of the timescale time_unit*. Therefore I would suggest you actually use realtime, which is a real behind the scenes.
For displaying time, %t can be used instead of %d (decimal) or %f (for reals). Its formatting can be controlled through $timeformat.
realtime capture = 0.0;
//To change the way (below) is displayed
initial begin
#80.1ns;
capture = $realtime;
$display("%t", capture);
end
To control how %t is displayed :
//$timeformat(unit#, prec#, "unit", minwidth);
$timeformat(-3, 2, " ms", 10); // -3 and " ms" give useful display msg
unit is the base that time is to be displayed in, from 0 to -15
precision is the number of decimal points to display.
"unit" is a string appended to the time, such as " ns".
minwidth is the minimum number of characters that will be displayed.
unit: recommended "unit" text
0 = 1 sec
-1 = 100 ms
-2 = 10 ms
-3 = 1 ms
-4 = 100 us
-5 = 10 us
-6 = 1 us
-7 = 100 ns
-8 = 10 ns
-9 = 1 ns
-10 = 100 ps
-11 = 10 ps
-12 = 1 ps
-13 = 100 fs
-14 = 10 fs
-15 = 1 fs
With these changes (realtime types, $realtime captures, and displaying with %t), analysing simulation time becomes a little easier.
Solution
Now to calculate the time between finding primes:
Add the following to your initial begin block:
$timeformat(-9, 2, " ns", 10);
Then in the state which adds the prime to the list you just need to add the following:
//Passed Prime check
time_s2 = time_s1; //Last Prime
time_s1 = $realtime ;
$display("Found %1d th Prime_1 %1d", k, count);
$display("Found at time : %t", time_s1);
$display("Time Diff : %t", time_s1 - time_s2);
Working example on EDA Playground.
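Independently of the linked example, the capture-and-subtract pattern can be tried on its own with a sketch like the one below. The module name, the delays, and the 1ns / 1ps timescale are arbitrary assumptions chosen only to produce two events at fractional times.

```verilog
`timescale 1ns / 1ps
// Hypothetical stand-alone demo of measuring the time between two events.
module time_diff_demo;
    realtime t_prev;  // timestamp of the previous event
    realtime t_now;   // timestamp of the current event

    initial begin
        // Display %t values in nanoseconds with two decimals.
        $timeformat(-9, 2, " ns", 10);
        t_prev = 0.0;

        // First "event" after 12.5 ns.
        #12.5 t_now = $realtime;
        $display("Found at time : %t", t_now);
        $display("Time Diff     : %t", t_now - t_prev);
        t_prev = t_now;

        // Second "event" 30.25 ns later.
        #30.25 t_now = $realtime;
        $display("Found at time : %t", t_now);
        $display("Time Diff     : %t", t_now - t_prev);
        $finish;
    end
endmodule
```

Because the timestamps are realtime rather than integer, the fractional parts of the delays survive in both the captured times and their difference.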
timescale
*: Time scales for Verilog simulations are set with the `timescale directive shown below. The time_unit sets the decimal point, so any further accuracy from the precision is lost when using time or integer to record timestamps.
`timescale <time_unit>/ <time_precision>
See section 22.7 of IEEE 1800-2012 for more info.

Implementing CRC32 module with verilog for FPGA

I'm fairly new to FPGAs. This summer I have a project in this field: implementing an Ethernet switch with 4 ports. I've coded all the parts that check the preamble, the MAC address, and so on, and they are working correctly,
but I have a serious problem implementing CRC32.
I know the CRC32 algorithm from IEEE 802.3.
I created a frame with 18 bytes of data,
then generated the CRC of my frame with this applet (here's a link).
But for any frame I make, the CRC check for that frame fails (that is, my module reports an error for every frame).
I'd be more than happy to hear your opinion.
Here is the code of my CRC32 module:
module CRC( clk10x, clk, rst, SFD, length, lengthReady, dataIn, hasError//, MACready
);
.
.
// input and outputs and registers are here
.
.
.
initial
begin
CRC <= 32'h04C11DB7;
zeros <= 32'h00000000;
end
always @ ( posedge clk10x )
begin
if ( rst )
begin
counter32bit <= 0;
shiftFlag <= 1;
shift <= 0;
shift2 <= 0;
first32bit <= 0;
state <= 0;
index <= 0;
calcEnd <= 0;
end
else if ( clk )
begin
if ( SFD )
begin
case ( state )
'b00 : begin
first32bit <= ( counter32bit == 32 ) ? 1 : 0;
state <= ( first32bit ) ? 'b01 : 'b00;
{MSB, window} <= {window, ~dataIn}; // shift Register;
counter32bit <= counter32bit + 1;
end
'b01 : begin
{MSB, window} <= ( MSB ) ? ( {window, dataIn} ^ CRC ) : {window, dataIn};
shift <= ( lengthReady && shiftFlag ) ? ( length * 8 ) : shift - 1;
shiftFlag <= ( lengthReady ) ? 0 : shiftFlag;
shift2 <= ( shift == 0 && lengthReady ) ? 32 : shift2 -1;
//shift2 <= ( !shift2 ) ? shift2 - 1 : shift2;
state <= ( shift2 == 2 && lengthReady ) ? 'b10 : 'b01;
end
'b10 : begin
{MSB, window} <= ( MSB && !calcEnd ) ? ( {window, zeros[index]} ^ CRC ) : {window, zeros[index]};
index <= ( index == 32 && !calcEnd ) ? 40 : index + 1;
calcEnd <= ( index == 40 ) ? 1 : 0;
state <= ( calcEnd ) ? 'b11 : state;
end
'b11 : begin
window <= window ^ 32'b11111111_11111111_11111111_11111111;
hasError <= ( window == 0 ) ? 0 : 1;
end
default : begin
//state <= 0;
first32bit <= 0;
//shift <= 0;
end
endcase
// have to assign index 0 again
end
CRC calculations are performed on a per-bit basis. So every input data word (let's say one byte per clock cycle at 125 MHz for gigabit Ethernet) results in 8 CRC calculations per clock cycle, and your code needs an extra loop to do these 8 sub-cycle calculations.
I would also advise splitting your FSM into a control state machine and the CRC calculation (data path).
As Mark Adler noticed, the initial value of the CRC's internal LFSR must be initialized with 0xFFFFFFFF. I can see this in your code.
Why do you use 2 different clocks in your process?
Edit 1:
I'm not very good at coding Verilog, so I'll copy some VHDL code from our VHDL library. I think you will be able to translate the statements into the corresponding Verilog code.
I left out the separate register process with reset and clock enable :)
-- Compute next combinational Value
process(lfsr, din)
variable v : std_logic_vector(lfsr'range);
begin
v := lfsr;
for i in BITS-1 downto 0 loop
v := (v(v'left-1 downto 0) & '0') xor
(GN and (GN'range => (din(i) xor v(v'left))));
end loop;
lfsn <= v;
end process;
BITS is a generic and set to 32
lfsr (linear feedback shift register) is 32 bit wide and stores the current "checksum"
the temp. variable v is initialized by the current register value (lfsr)
the for loop goes over every bit of din (data in) and performs the crc calculation (shift + xor)
=> so 32 CRC calculations are performed per clock cycle
GN is the normalized generator polynomial of CRC32
the result is stored in lfsn (next lfsr value) which is connected to a 32 bit wide D-FF with reset and clock enable
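One possible Verilog rendering of the VHDL process above, together with the register that the answer left out, is sketched below. It is my own translation under stated assumptions: the module and port names are invented, the width is fixed at 32 bits, the generator value is the one from the question's code, and the reset value follows the 0xFFFFFFFF initialization mentioned earlier. Ethernet bit ordering and the final inversion of the result are not handled here.

```verilog
// Hypothetical Verilog translation of the VHDL sketch; names are assumptions.
module crc_lfsr #(
    parameter        BITS = 32,           // input bits consumed per clock
    parameter [31:0] GN   = 32'h04C11DB7  // generator, as in the question's code
) (
    input  wire            clk,
    input  wire            rst,   // synchronous reset
    input  wire            en,    // clock enable
    input  wire [BITS-1:0] din,   // data in
    output reg  [31:0]     lfsr   // current "checksum"
);
    reg [31:0] lfsn;  // next LFSR value
    reg [31:0] v;     // temporary working value
    reg        fb;    // feedback bit for the current step
    integer    i;

    // Compute the next combinational value: one shift + xor per input bit.
    always @* begin
        v = lfsr;
        for (i = BITS-1; i >= 0; i = i - 1) begin
            fb = din[i] ^ v[31];
            v  = {v[30:0], 1'b0} ^ (GN & {32{fb}});
        end
        lfsn = v;
    end

    // Register with reset and clock enable.
    always @(posedge clk) begin
        if (rst)
            lfsr <= 32'hFFFFFFFF;
        else if (en)
            lfsr <= lfsn;
    end
endmodule
```

Setting BITS to 8 gives the byte-per-clock case described at the start of the answer: the loop then performs 8 shift-and-xor steps per clock cycle.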
