I'm working with a serial protocol. Messages are of variable length, but the length is known in advance. On both the transmission and reception sides, I have the message saved in a shift register that is as long as the longest possible message.
I need to calculate the CRC32 of these registers, the same as for Ethernet, as fast as possible. Since messages are of variable length (anything from 12 to 64 bits), I chose a serial implementation that can run in parallel with reception/transmission of the message.
I ran into a problem with organizing the data before the calculation. As specified here, the data needs to be bit-reversed, padded with 32 zeros, and complemented before the calculation.
Even if I forget the part about running in parallel with receiving or transmitting data, how can I efficiently extract only the relevant message bits from the maximum-length register so that I can pad them before the calculation? I know that ideas like
newregister[31:0] <= oldregister[X:0] // X is my variable length
don't work. It's also impossible to have the generate for-loop that I use to bit-reverse the old vector run a variable number of times. I could use a counter to serially move the data to the desired length, but I cannot afford to lose that much time.
Alternatively, is there an operation that would directly give me the padded and complemented result? I don't even know how to start developing such an operation.
Thanks in advance for any insight.
You've misunderstood how to do a serial CRC; the Python question you quote isn't relevant. You only need a 32-bit shift register, with appropriate feedback taps. You'll get a million hits if you do a Google search for "serial crc" or "ethernet crc". There's at least one Xilinx app note that does the whole thing for you. You'll need to be careful to preload the 32-bit register with the correct value, and to decide whether or not to invert the 32-bit result on completion.
EDIT
The first hit on 'xilinx serial crc' is xapp209, which has the basic answer in fig 1. On top of this, you need the taps, the preload value, whether or not to invert the answer, and the value to check against on reception. I'm sure they used to do all this in another app note, but I can't find it at the moment. The basic references are the Ethernet 802.3 spec (3.2.8 Frame Check Sequence field, which was p27 in the original book), and the V42 spec (8.1.1.6.2 32-bit frame check sequence, page 311 in the old CCITT Blue Book). Both give the taps. V42 requires a preload to all 1's and inversion on completion, and gives the test value on reception. Warren has a (new) chapter in Hacker's Delight, which shows the taps graphically; see his website.
You only need the online generators to check your solution. Be careful, though: they will generally have different preload values, and may or may not invert the result, and may or may not be bit-reversed.
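For reference, here is a minimal sketch of what such a bit-serial register can look like in Verilog, using the 802.3 polynomial taps. The module and signal names are mine, and the bit ordering of din (and any final bit reversal) depends on how you frame the data, so treat it as a starting point rather than a drop-in block:

    module crc32_serial (
        input  wire        clk,
        input  wire        init,     // pulse before a new message: preload all 1's
        input  wire        enable,   // one message bit per enabled clock
        input  wire        din,      // serial data bit
        output wire [31:0] crc_out
    );
        // 802.3 generator polynomial x^32 + x^26 + ... + x^2 + x + 1
        localparam [31:0] POLY = 32'h04C1_1DB7;

        reg  [31:0] crc;
        wire        fb = din ^ crc[31];

        always @(posedge clk) begin
            if (init)
                crc <= 32'hFFFF_FFFF;                      // preload to all 1's
            else if (enable)
                crc <= {crc[30:0], 1'b0} ^ (fb ? POLY : 32'h0);
        end

        assign crc_out = ~crc;                             // 802.3 inverts on completion
    endmodule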
Since X is a variable, you will need to do the bit assignments with a for-loop. The for-loop needs to be inside an always block, and the for-loop must statically unroll (i.e. the starting index, ending index, and step value must be constants).
    integer i;
    always @(posedge clk) begin       // clocked here; a combinational always block with = would also work
        for (i = 0; i < 32; i = i + 1) begin
            if (i < X)
                newregister[i] <= oldregister[i];
            else
                newregister[i] <= 1'b0;   // pad with zeros
        end
    end
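Under the same constraint (constant loop bounds so the loop unrolls statically), the variable-length bit reversal from the question can be handled the same way. This is a sketch of my own, not part of the answer above, assuming oldregister is the 64-bit maximum-length register and X is the length signal; the variable bit-select costs a multiplexer per output bit:

    reg [63:0] reversed;
    integer j;
    always @(posedge clk) begin
        for (j = 0; j < 64; j = j + 1) begin
            if (j < X)
                reversed[j] <= oldregister[X - 1 - j];  // reverse only the low X bits
            else
                reversed[j] <= 1'b0;                    // pad the rest with zeros
        end
    end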
Good day!
Please, help me understand how the Delay block works in AnyLogic. Suppose we deal with a multichannel transmission network.
The model has 2 sources. Suppose these sources generate packets every 1 sec. Packets from different sources have different priorities and need different quantities of resources to be served (set with the Priority and Resource_quantity parameters respectively). The Priority_queue in the model is priority-based. The proposed model puts the packets into the channels in accordance with resource availability in the channel. First, it tries to put the packet into the first channel. If there are no available resources, it puts the packet into the second channel. If there are no resources in either channel, it waits (this is realized with a Hold block).
I noticed that if I set the delays in blocks delay1 and delay2 with static parameters (for example, 2 sec), the model works fine. But when I try to calculate the delay before these blocks, the model doesn't take it into account at all, and in this case the model works without any delays.
What did I do wrong here?
I will appreciate any help.
The delay is calculated in the Exit block and is written into the agent's delay variable. I tried adding traceln(agent.delay) right after the delay calculation, as #Jaco-Ben suggested, and it showed zero. In this case it also doesn't seize resources :(
Thanks to #Jaco-Ben for the useful comments.
The delay is zero because the result of division in Java depends on the types of the operands. If both operands are integer, the result will be integer as well. To let Java perform real division (and get a real number as a result), at least one of the operands must be of real type.
So it was my problem.
To solve it, I cast one of the operands to double:
agent.delay = (double)agent.Resource_quantity/ChannelResources1.idle();
However, it is strange why it shows correct values in the database.
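For anyone who wants to see the difference in isolation, here is a small plain-Java illustration (the variable names are made up and have nothing to do with the AnyLogic model):

    int resourceQuantity = 2;
    int idleUnits = 8;

    double wrong = resourceQuantity / idleUnits;          // 0.0  -- integer division truncates first
    double right = (double) resourceQuantity / idleUnits; // 0.25 -- one operand promoted to double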
I have 2 different clocks, one for reading and one for writing. I am using Gray code to synchronize the pointers, with an additional 2 flip-flops for synchronizing the input signal onto the other clock.
The articles that I have read indicate how to determine the full and empty signals using Gray code: compare the 2 MSBs for the full state and check equality for the empty state.
However, I need to get the number of elements in the buffer and not just the full or empty signals. Is this possible to do with gray code?
In a comment you asked about the common clock and mentioned that your depth is not a power of two.
First: Edit your original post and add that question and that information.
Second: In an asynchronous FIFO there is no common clock. The write operations are all run from the write clock. The read operations are all run from the read clock. The critical part is to exchange information between the clock domains. That is where the Gray code comes in.
Third: An asynchronous FIFO uses Gray code because only one bit changes at a time. Importantly, the process is circular, so your last value and your first value also differ by only one bit:
Counter Gray-code
000 000
001 001
010 011
011 010
100 110
101 111
110 101
111 100 <-- Last
000 000 <-- First again
This works if and only if the depth (and thus the counters) is a power of two. Therefore an asynchronous FIFO always has a depth which is a power of two.
If you must have a different depth you can add a synchronous FIFO at the beginning or the end. However, if you think about it, a FIFO is just an elastic buffer. Its behavior when it is, e.g., 16 entries deep rather than 12 is no different, other than that you have the potential to store more values.
Last: As supercat said: You convert from binary to Gray code, cross to the other clock domain, then convert Gray code to binary again.
In the destination clock domain you can safely compare the read and write counters to determine the fill level of the FIFO.
If the level is needed on both read and write side you have to implement this process twice, once in each clock domain.
The most understandable way to compute the difference between two Gray-code values is to synchronize them with a common clock, convert them to binary, and then do an ordinary binary subtraction on them. It may be possible to design a fully combinatorial circuit that computes the difference between two Gray-code values such that, if all bits of one value are stable and one bit in the other value changes, only one bit in the output changes while all others remain stable; however, such a design would be much more complicated than one which simply synchronizes both counters, converts to binary, and subtracts.
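To make the convert / cross / convert-back step concrete, here is a small Verilog sketch meant to live inside the FIFO module. The parameter N and the signal names in the commented usage (wr_gray_sync, rd_bin) are placeholders of mine, not taken from the answers above:

    parameter N = 4;                        // pointer width (example value)

    function [N-1:0] bin2gray (input [N-1:0] b);
        bin2gray = b ^ (b >> 1);            // classic binary-to-Gray mapping
    endfunction

    function [N-1:0] gray2bin (input [N-1:0] g);
        integer k;
        begin
            gray2bin[N-1] = g[N-1];
            for (k = N - 2; k >= 0; k = k - 1)
                gray2bin[k] = gray2bin[k+1] ^ g[k];   // unrolls to an XOR chain
        end
    endfunction

    // Example use on the read side, where wr_gray_sync is the write pointer
    // after the two-flop synchronizer and rd_bin is the local binary read pointer:
    // wire [N-1:0] wr_bin   = gray2bin(wr_gray_sync);
    // wire [N-1:0] level_rd = wr_bin - rd_bin;       // fill level seen by the read side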
I've been trying to read the implementation of a kernel module, and I'm stumbling on this piece of code.
unsigned long addr = (unsigned long) buf;
if (!IS_ALIGNED(addr, 1 << 9)) {
DMCRIT("#%s in %s is not sector-aligned. I/O buffer must be sector-aligned.", name, caller);
BUG();
}
The IS_ALIGNED macro is defined in the kernel source as follows:
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
I understand that data has to be aligned to the size of a data type to work, but I still don't understand what the code does.
It left-shifts 1 by 9, then subtracts 1, which gives 111111111 (nine ones). Then it bitwise-ANDs 111111111 with x.
Why does this code work? How is this checking for byte alignment?
In systems programming it is common to need a memory address to be aligned to a certain number of bytes -- that is, several lowest-order bits are zero.
Basically, IS_ALIGNED(addr, 1 << 9) checks whether addr lies on a 512-byte (2^9) boundary, i.e. whether its last 9 bits are zero; the ! makes the error path fire when it does not. This is a common requirement when erasing flash locations because flash memory is split into large blocks which must be erased or written as a single unit.
Another application of this I ran into: I was working with a certain DMA controller which has a modulo feature. Basically, that means you can allow it to change only the last several bits of an address (the destination address in this case). This is useful for protecting memory from mistakes in the way you use a DMA controller. Problem is, I initially forgot to tell the compiler to align the DMA destination buffer to the modulo value. This caused some incredibly interesting bugs (random variables that have nothing to do with the thing using the DMA controller being overwritten... sometimes).
As far as "how does the macro code work?": if you subtract 1 from a number that ends with all zeroes, you get a number that ends with all ones. For example, 0b00010000 - 0b1 = 0b00001111. This is a way of creating a binary mask from the integer number of required-alignment bytes. This mask has ones only in the bits we are interested in checking for zero. After we AND the address with the mask, we get 0 if and only if the lowest 9 (in this case) bits are zero.
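A small stand-alone illustration of the macro (compile with GCC, since typeof is a GNU extension; the example addresses are made up):

    #include <stdio.h>
    #include <stdint.h>

    #define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)

    int main(void)
    {
        uintptr_t aligned   = 0x1200;   /* low 9 bits are zero */
        uintptr_t unaligned = 0x1201;   /* bit 0 is set        */

        /* (1 << 9) - 1 == 0x1FF, a mask of the 9 lowest-order bits */
        printf("%d\n", IS_ALIGNED(aligned,   1 << 9));   /* prints 1 */
        printf("%d\n", IS_ALIGNED(unaligned, 1 << 9));   /* prints 0 */
        return 0;
    }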
"Why does it need to be aligned?": This comes down to the internal makeup of flash memory. Erasing and writing flash is a much less straightforward process then reading it, and typically it requires higher-than-logic-level voltages to be supplied to the memory cells. The circuitry required to make write and erase operations possible with a one-byte granularity would waste a great deal of silicon real estate only to be used rarely. Basically, designing a flash chip is a statistics and tradeoff game (like anything else in engineering) and the statistics work out such that writing and erasing in groups gives the best bang for the buck.
At no extra charge, I will tell you that you will be seeing a lot of this type of thing if you are reading driver and kernel code. It may be helpful to familiarize yourself with the contents of this article (or at least keep it around as a reference): https://graphics.stanford.edu/~seander/bithacks.html
Thank you for reading this and for all of your help. Anyway... I am trying to implement a CRC16 with polynomial x^16 + x^12 + x^5 + 1 in Verilog. The problem I have encountered is that I don't get the entire packet of data at one point in time. I get a 32-bit word at a time, and the number of words is dynamic but is at least 4 words and can be as high as 16384 words or more. Time is not much of an issue because I am running on a 150 MHz clock and the input is coming in on at most a 33 MHz clock, possibly as slow as 10 MHz. This does not really affect me because I am first accepting the data via a FIFO.
I have been trying to develop an FSM but have really hit a roadblock. One idea is to wait for all the data and then input the entire thing as one big data packet; however, this seems really inefficient and I just don't think I need to do this. Plus it could take up valuable resources. Another approach I was playing with was to input the first word and do the XOR operation. Then, when the input data only has 1 to 2 bits left that are not XORed (not sure if that is worded correctly), I would input the next word. Upon that input I would continue to compute the CRC, followed by another input, until the last word is input into the module.
With this method I would need to implement a counter or a shift register in some fashion. Anyway, any help would be nice. This goes into a command parser/packet parser. Thank you so much for your help.
A CRC calculation doesn't need to be done serially 1-bit at a time. You can essentially "unroll" the calculation to come up with the individual equations for each bit of a parallel CRC generator. With that, you can create a CRC generator that processes 32-bits of input data at a time, matching your datapath width. This should simplify your design as well as make it higher performance (processing each bit serially wouldn't meet your throughput requirements anyway, unless you don't mind holding off incoming data while the hw generates the CRC).
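One convenient way to get those unrolled equations without deriving them by hand is to let synthesis flatten a constant-bound loop. Below is a sketch for the x^16 + x^12 + x^5 + 1 polynomial, consuming one 32-bit word per call; the bit order within the word and the signal names in the commented usage (crc_reg, word_valid, fifo_dout) are assumptions of mine, so adjust them to match your framing:

    // Applies the bit-serial CRC update 32 times combinationally;
    // synthesis flattens this into the parallel per-bit XOR equations.
    function [15:0] crc16_word (input [15:0] crc_in, input [31:0] data);
        integer i;
        reg [15:0] c;
        reg        fb;
        begin
            c = crc_in;
            for (i = 31; i >= 0; i = i - 1) begin      // MSB of the word first
                fb = data[i] ^ c[15];
                c  = {c[14:0], 1'b0} ^ (fb ? 16'h1021 : 16'h0000);
            end
            crc16_word = c;
        end
    endfunction

    // Typical use: update once per word popped from the FIFO.
    // always @(posedge clk)
    //     if (word_valid) crc_reg <= crc16_word(crc_reg, fifo_dout);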
I'm fairly new to hardware design and I'm not sure how to approach this problem. I'm working with a 64 bit wide stream that also has End Of Packet and Start of Packet signals. I need to find a particular byte sequence at an offset from SOP. The goal is to pass the stream to another module, and every time SOP is asserted, a match signal will tell the next module whether or not the byte sequence will be found in the incoming packet.
I think I need to shift the signal into a large shift register (16x64 to fit the search space) and do the comparison on those slices. But then it seems I would also need shift registers for SOP and EOP to keep those signals in sync with the data (match would be asserted along with SOP). Am I on the right track, or is there a better approach?
In that case I think you're onto the right idea. If the downstream module must know if the match exists before receiving the SOP, then I would just make a 16 or 17 stage pipeline of all the data and the two control signals.
If that's too many registers for some kind of area constraint, you might consider using a small RAM to hold the packets while you're waiting to do the check.
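A rough sketch of that pipeline structure follows. Everything here (module name, DEPTH, MATCH_BEAT, TARGET) is an illustrative assumption, not taken from your design, and it only checks a single aligned 64-bit beat; an unaligned byte sequence would need to compare a slice spanning two stages:

    module sop_matcher #(
        parameter        DEPTH      = 17,                   // pipeline stages
        parameter        MATCH_BEAT = 3,                    // 64-bit word offset from SOP to inspect
        parameter [63:0] TARGET     = 64'hDEAD_BEEF_CAFE_F00D
    ) (
        input  wire        clk,
        input  wire [63:0] din,
        input  wire        sop_in,
        input  wire        eop_in,
        output wire [63:0] dout,
        output wire        sop_out,
        output wire        eop_out,
        output reg         match      // valid in the same cycle as sop_out
    );
        reg [63:0] data_pipe [0:DEPTH-1];
        reg        sop_pipe  [0:DEPTH-1];
        reg        eop_pipe  [0:DEPTH-1];
        integer i;

        always @(posedge clk) begin
            data_pipe[0] <= din;
            sop_pipe[0]  <= sop_in;
            eop_pipe[0]  <= eop_in;
            for (i = 1; i < DEPTH; i = i + 1) begin
                data_pipe[i] <= data_pipe[i-1];
                sop_pipe[i]  <= sop_pipe[i-1];
                eop_pipe[i]  <= eop_pipe[i-1];
            end
            // One cycle before the delayed SOP reaches sop_out, the word
            // MATCH_BEAT beats behind it is still inside the pipeline, so it
            // can be compared here and registered in time.
            if (sop_pipe[DEPTH-2])
                match <= (data_pipe[DEPTH-2-MATCH_BEAT] == TARGET);
        end

        assign dout    = data_pipe[DEPTH-1];
        assign sop_out = sop_pipe[DEPTH-1];
        assign eop_out = eop_pipe[DEPTH-1];
    endmodule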