What value to use for the length field on the Freescale PowerPC Security Engine 2.0 when using Link Tables? - security

I am working on the code to use the security engine of my MPC83XX with Openssl.
I can already encrypt/decrypt AES up to 64KByte of data.
The problem comes with data larger than 64KByte, since the maximum value of the length field is 65535.
I can assume the data is always in one contiguous piece in RAM.
So now I am collecting all the data in a Link Table, using the pointer to the table instead of the pointer to the data, and setting the J bit to 1.
Now I am not sure what value I should use for the length field, since 0 would mean the Dword will be ignored.
The real length of the data is also too big for 16 bits.
http://cache.freescale.com/files/32bit/doc/app_note/AN2755.pdf?fpsp=1
Relevant information can be found in Chapter 8.

You set LENGTH to the length of the data. See Page 19:
For any sequence of data parcels accessed by a link table or chain of link tables, the combined lengths of the parcels (the sum of their LENGTH and/or EXTENT fields) must equal the combined lengths of the link table memory segments (SEGLEN fields). Otherwise the channel sets the appropriate error bit in the Channel Pointer Status Register...
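As a rough illustration of that constraint only (a hedged sketch; the field names and layout below are invented and are not the actual SEC 2.0 link-table format), the segment lengths walked through the link table have to sum to the length programmed into the descriptor:

#include <stdint.h>
#include <stddef.h>

/* Illustrative only -- not the real SEC 2.0 link-table entry layout. */
struct link_table_entry {
    uint16_t seglen;   /* SEGLEN: length of this memory segment */
    uint64_t ptr;      /* physical address of the segment       */
};

/* Returns 1 if the combined segment lengths equal the length programmed
 * into the descriptor (LENGTH plus EXTENT), which is the condition the
 * manual quotes above; otherwise the channel flags an error. */
static int link_table_consistent(const struct link_table_entry *lt, size_t n,
                                 uint32_t length, uint32_t extent)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += lt[i].seglen;
    return total == (uint64_t)length + extent;
}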
I'm not sure what mode you're using (and the documentation seems unnecessarily confusing!) but for the usual cipher modes (CBC/CTR/CFB/OFB) the usual method is simply to chain AES invocations, reusing the same context. You might be able to do this by simply setting "Pointer Dword1" and "Pointer Dword5" to the same thing. There's very little documentation, though; I can't work out where it gets the IV from.

AF_XDP: map `(SRC-IP, DST-IP, DST-Port)` to index to `BPF_MAP_TYPE_XSKMAP`

I want to spawn multiple user-space processes with each one processing packets from a single source (triple of (SRC-IP, DST-IP, DST-Port)).
Because a lot of packets are going to pass through the AF_XDP kernel program and time is critical, I thought of a separate map in the kernel program which is populated by a user-space program beforehand.
This map defines a mapping from the previously mentioned triple to an index which is then used in bpf_redirect_map(&xsks_map, index, 0) to send packets to a specific socket in user-space.
My initial idea was to just concatenate src-ip, destination-ip and destination port into a (32 + 32 + 16)-bit value.
Is it possible to define maps with such a large key-size? Which map would be the best fit for this problem? Furthermore, is it possible to fill the map from user-space?
A struct as a key
There are several types of maps that can be used with eBPF. Some of them are generic (hash maps, arrays, ...) and some are very specific (redirect maps, sockmaps, ...).
The case you describe sounds like a perfect use case for a hash map. Such maps take a struct as a key, and another struct as a value. So you could have something like:
struct my_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t dst_port;
};
... and use it as a key. The value, in your case, would be the index for the xskmap, i.e. a simple integer. Hash maps are efficient for retrieving a value from a given key (no linear search as for an array), so you get good performance with that.
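For concreteness, here is a hedged sketch of how the map and the lookup could look in the XDP program, assuming a modern libbpf/BTF-style map definition; the map name flow_map, the sizes, and the packed key layout are illustrative choices, not requirements:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct my_key {
    __u32 src_ip;
    __u32 dst_ip;
    __u16 dst_port;
} __attribute__((packed));      /* pack the key so there are no padding holes */

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, struct my_key);
    __type(value, __u32);       /* index into the xskmap */
} flow_map SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int redirect_by_flow(struct xdp_md *ctx)
{
    struct my_key key = { 0 };  /* fill from the parsed IP/UDP headers */
    __u32 *index = bpf_map_lookup_elem(&flow_map, &key);

    if (index)
        return bpf_redirect_map(&xsks_map, *index, 0);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";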
Key size for hash maps
There are no specific restrictions for the size of the keys or the values, as long as the size fits in a 32-bit integer :) (Note that there may be size restrictions in the case of hardware offload.)
Update from user space
It is perfectly doable to update a hash map from user space (some very specific map types may not allow it, but generic maps like arrays or hash maps are entirely fine). You could do it:
From the command line, with bpftool,
From a C program, with the helpers from libbpf,
From a program in your own language, if suitable bindings exist.
In all three cases, the update itself is done through a call to the bpf() system call.
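Here is a hedged user-space sketch using libbpf's bpf_map_update_elem(); the pin path /sys/fs/bpf/flow_map and the example addresses are assumptions for illustration:

#include <bpf/bpf.h>
#include <linux/types.h>
#include <arpa/inet.h>
#include <stdio.h>

struct my_key {
    __u32 src_ip;
    __u32 dst_ip;
    __u16 dst_port;
} __attribute__((packed));

int main(void)
{
    /* Assumes the loader pinned the map under /sys/fs/bpf. */
    int map_fd = bpf_obj_get("/sys/fs/bpf/flow_map");
    if (map_fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    struct my_key key = {
        .src_ip   = inet_addr("192.0.2.1"),   /* already network byte order */
        .dst_ip   = inet_addr("192.0.2.2"),
        .dst_port = htons(4242),
    };
    __u32 index = 0;    /* slot in the xskmap, i.e. which socket gets this flow */

    if (bpf_map_update_elem(map_fd, &key, &index, BPF_ANY) < 0) {
        perror("bpf_map_update_elem");
        return 1;
    }
    return 0;
}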

What do xfrm_replay_state_esn fields mean?

I'm trying to understand a little bit more about Linux kernel IPSec networking by looking at the kernel source. I understand conceptually that IPSec prevents replay attacks with a sequence number and a replay window, i.e. if a recipient receives a packet with a sequence number that is not within the replay window, or it has received before, then it drops that packet and increments the replay counter.
I'm trying to correlate this to the structure xfrm_replay_state_esn which is defined as such:
struct xfrm_replay_state_esn {
    unsigned int bmp_len;
    __u32 oseq;
    __u32 seq;
    __u32 oseq_hi;
    __u32 seq_hi;
    __u32 replay_window;
    __u32 bmp[0];
};
I've tried searching for documentation, but it's scant and I haven't been able to find a man page for the various functions and structures, so I don't understand what the individual fields relate to.
XFRM is an IPSec implementation for the Linux kernel. The name XFRM stands for "transform" referencing the transformation of IP packets as per the IPSec protocol.
The following RFCs are relevant for IPSec:
RFC4301: Definition of the IPSec protocol.
RFC4302: Definition of the Authentication Header (AH) sub-protocol for ensuring authenticity of IP packets.
RFC4303: Definition of the Encapsulating Security Payload (ESP) sub-protocol for ensuring authenticity and secrecy of IP packets.
The IPSec protocol allows for sequence numbers of size 32 bits or 64 bits. The 64 bit sequence numbers are referred to as Extended Sequence Numbers (ESN).
The anti-replay mechanism is defined in the RFCs for both AH and ESP. The mechanism keeps a window of acceptable sequence numbers of incoming packets. The window extends back from the highest sequence number received so far, defining a lower bound for the acceptable sequence numbers. When receiving a sequence number below that bound, it is rejected. When receiving a sequence number higher than the current highest sequence number, the window is shifted forward. When receiving a sequence number within the window, the mechanism will mark this sequence number in a checklist for ensuring that each sequence number in the window is only received once. If the sequence number has already been marked, it is rejected.
This checklist can be implemented as a bitmap, where each sequence number in the window is represented by a single bit, with 0 meaning this sequence number has not been received yet, and 1 meaning it has already been received.
Based on this information, the meaning of the fields in the xfrm_replay_state_esn struct can be given as follows.
The struct holds the state of the anti-replay mechanism with extended sequence numbers (64 bits):
The highest sequence number received so far is represented by seq and seq_hi. Each is a 32 bit integer, so together they can represent a 64 bit number, with seq holding the lower 32 bits and seq_hi holding the higher 32 bits. The reason for splitting the 64 bit value into two 32 bit values, instead of representing it as a single 64 bit variable, is that the IPSec protocol mandates an optimization where only the lower 32 bits of the sequence number are included in the packet. For this reason, it is more convenient to have the lower 32 bits as a separate variable in the struct, so that it can be accessed directly without resorting to bit operations.
The sequence number counter for outgoing packets is tracked in oseq and oseq_hi. As before, the 64 bit number is represented by two 32 bit variables.
The size of the window is represented by replay_window. The smallest acceptable sequence number is given by the sequence number expressed by seq and seq_hi, minus replay_window, plus one.
The bitmap for checking off received sequence numbers within the window is represented by bmp. It is defined as a zero-sized array, but when the memory for the struct is allocated, additional memory is reserved after the struct, which can then be accessed e.g. with bmp[i] (which is of course just syntactic sugar for *(bmp+i)). The size of the bitmap is held in bmp_len. It is of course related to the window size, i.e. window size divided by 8*sizeof(u32), rounded up. I would speculate that it is stored explicitly to avoid having to recalculate this value frequently.
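As a hedged illustration (not the kernel's actual code), checking off a received sequence number that falls inside the window could look like this, where offset is its distance below the highest sequence number seen so far:

#include <stdint.h>

#define BITS_PER_WORD 32u

/* Returns 1 if the sequence number at 'offset' (0 <= offset < replay_window)
 * is new and marks it as received; returns 0 if it was already seen (replay). */
static int check_and_mark(uint32_t *bmp, uint32_t offset)
{
    uint32_t word = offset / BITS_PER_WORD;
    uint32_t bit  = offset % BITS_PER_WORD;

    if (bmp[word] & (1u << bit))
        return 0;               /* duplicate: drop the packet  */
    bmp[word] |= (1u << bit);   /* first time: mark and accept */
    return 1;
}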

Variable length messages in Verilog (serial CRC-32)

I'm working with a serial protocol. Messages are of variable length that is known in advance. On both transmission and reception sides, I have the message saved to a shift register that is as long as the longest possible message.
I need to calculate CRC32 of these registers, the same as for Ethernet, as fast as possible. Since messages are variable length (anything from 12 to 64 bits), I chose serial implementation that should run already in parallel with reception/transmission of the message.
I ran into a problem with organization of the data before calculation. As specified here, the data needs to be bit-reversed, padded with 32 zeros and complemented before calculation.
Even if I forget the part about running in parallel with receiving or transmitting data, how can I efficiently get only my relevant message from the max-length register so that I can pad it before calculation? I know that ideas like
newregister[31:0] <= oldregister[X:0] // X is my variable length
don't work. It's also impossible to have the generate for-loop clause that I use to bit-reverse the old vector run a variable number of times. I could use a counter to serially move the data to the desired length, but I cannot afford to lose that much time.
Alternatively, is there an operation that would directly give me the padded and complemented result? I do not even have an idea how to start developing such an idea.
Thanks in advance for any insight.
You've misunderstood how to do a serial CRC; the Python question you quote isn't relevant. You only need a 32-bit shift register, with appropriate feedback taps. You'll get a million hits if you do a Google search for "serial crc" or "ethernet crc". There's at least one Xilinx app note that does the whole thing for you. You'll need to be careful to preload the 32-bit register with the correct value, and whether or not you invert the 32-bit data on completion.
EDIT
The first hit on 'xilinx serial crc' is xapp209, which has the basic answer in fig 1. On top of this, you need the taps, the preload value, whether or not to invert the answer, and the value to check against on reception. I'm sure they used to do all this in another app note, but I can't find it at the moment. The basic references are the Ethernet 802.3 spec (3.2.8 Frame check Sequence field, which was p27 in the original book), and the V42 spec (8.1.1.6.2 32-bit frame check sequence, page 311 in the old CCITT Blue Book). Both give the taps. V42 requires a preload to all 1's and an inversion on completion, and it gives the test value to check against on reception. Warren has a (new) chapter in Hacker's Delight, which shows the taps graphically; see his website.
You only need the online generators to check your solution. Be careful, though: they will generally have different preload values, and may or may not invert the result, and may or may not be bit-reversed.
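If it helps, here is a hedged software reference model you could check a hardware implementation against, assuming the standard Ethernet CRC-32 conventions (reflected polynomial 0xEDB88320, preload of all ones, final inversion) with the message bits fed in wire order; it is a sketch, not a drop-in spec:

#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-32 as used by Ethernet: feed the message bits in
 * transmission order (one bit per array element, value 0 or 1). */
static uint32_t crc32_bits(const uint8_t *bits, size_t nbits)
{
    uint32_t crc = 0xFFFFFFFFu;             /* preload with all ones */

    for (size_t i = 0; i < nbits; i++) {
        uint32_t feedback = (crc ^ bits[i]) & 1u;
        crc >>= 1;
        if (feedback)
            crc ^= 0xEDB88320u;             /* reflected 802.3 polynomial */
    }
    return ~crc;                            /* invert on completion */
}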
Since X is a variable, you will need to do the bit assignments with a for-loop. The for-loop needs to be inside an always block, and the for-loop must statically unroll (i.e. the starting index, ending index, and step value must be constants).
for(i=0; i<32; i=i+1) begin
if (i<X)
newregister[i] <= oldregister[i];
else
newregister[i] <= 1'b0; // pad zeros
end

How should one use Disruptor (Disruptor Pattern) to build real-world message systems?

As the RingBuffer up-front allocates objects of a given type, how can you use a single ring buffer to process messages of various different types?
You can't create new object instances to insert into the RingBuffer, as that would defeat the purpose of up-front allocation.
So you could have 3 messages in an async messaging pattern:
NewOrderRequest
NewOrderCreated
NewOrderRejected
So my question is: how are you meant to use the Disruptor pattern for real-world messaging systems?
Thanks
Links:
http://code.google.com/p/disruptor-net/wiki/CodeExamples
http://code.google.com/p/disruptor-net
http://code.google.com/p/disruptor
One approach (our most common pattern) is to store the message in its marshalled form, i.e. as a byte array. Incoming requests, e.g. FIX messages, are quickly pulled off the network as binary messages and placed in the ring buffer. The unmarshalling and dispatch of the different types of messages are handled by EventProcessors (Consumers) on that ring buffer. For outbound requests, the message is serialised into the preallocated byte array that forms the entry in the ring buffer.
If you are using some fixed size byte array as the preallocated entry, some additional logic is required to handle overflow for larger messages. I.e. pick a reasonable default size and if it is exceeded allocate a temporary array that is bigger. Then discard it when the entry is reused or consumed (depending on your use case) reverting back to the original preallocated byte array.
If you have different consumers for different message types you could quickly identify if your consumer is interested in the specific message either by knowing an offset into the byte array that carries the type information or by passing a discriminator value through on the entry.
Also there is no rule against creating object instances and passing references (we do this in a couple of places too). You do lose the benefits of object preallocation, however one of the design goals of the disruptor was to allow the user the choice of the most appropriate form of storage.
There is a library called Javolution (http://javolution.org/) that lets you define objects as structs with fixed-length fields like string[40] etc. that rely on byte buffers internally instead of variable-size objects... that allows the ring buffer to be initialized with fixed-size objects and thus (hopefully) contiguous blocks of memory that allow the cache to work more efficiently.
We are using that for passing events / messages and use standard strings etc. for our business-logic.
Back to object pools.
The following is a hypothesis.
If you will have 3 types of messages (A,B,C), you can make 3 arrays of those pre-allocated. That will create 3 memory zones A, B, C.
It's not like there is only one cache line, there are many and they don't have to be contiguous. Some cache lines will refer to something in zone A, other B and other C.
So the ring buffer entry can have 1 reference to a common ancestor or interface of A & B & C.
The problem is to select the instance in the pools; the simplest is to have the same array length as the ring buffer length. This implies a lot of wasted pooled objects since only one of the 3 is ever used at any entry, ex: ring buffer entry 1234 might be using message B[1234] but A[1234] and C[1234] are not used and unusable by anyone.
You could also make a super-entry with all 3 A+B+C instances inlined and indicate the type with some byte or enum. Just as wasteful in memory size, but it looks a bit worse because of the fatness of the entry. For example, a reader working only on C messages will have less cache locality.
I hope I'm not too wrong with this hypothesis.
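A rough sketch of that super-entry idea, written here in C for concreteness (the message fields are invented for illustration; in the Disruptor itself this would be a Java class):

enum msg_type { NEW_ORDER_REQUEST, NEW_ORDER_CREATED, NEW_ORDER_REJECTED };

struct order_request  { long order_id; long quantity; };
struct order_created  { long order_id; long timestamp; };
struct order_rejected { long order_id; int  reason; };

/* One preallocated entry carries all three variants side by side;
 * the discriminator says which one is meaningful for this slot. */
struct ring_entry {
    enum msg_type         type;      /* discriminator byte/enum         */
    struct order_request  request;   /* only one of these three is live */
    struct order_created  created;   /* per entry, hence the wasted     */
    struct order_rejected rejected;  /* space mentioned above           */
};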

To pad or not to pad - creating a communication protocol

I am creating a protocol to have two applications talk over a TCP/IP stream and am figuring out how to design a header for my messages. Using the TCP header as an initial guide, I am wondering if I will need padding. I understand that when we're dealing with a cache, we want to make sure that data being stored fits in a row of cache so that when it is retrieved it is done so efficiently. However, I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
For example: I want to send over a message header consisting of a 3 byte field followed by a 1 byte padding field for 32 bit alignment. Then I will send over the message data.
In this case, the receiver will just take 3 bytes from the stream, throw away the padding byte, and then start reading the message data. As I see it, he will then be storing the 3 bytes and the message data however he wants anyway.
Without the padding, the receiver just takes the 3 header bytes from the stream and then takes the data bytes. Since the receiver stores these bytes however he wants, how does it matter whether or not the padding is done?
Maybe I'm missing the point of padding.
It's slightly hard to extract a question from this post, but with what I've said you guys can probably point out my misconceptions.
Please let me know what you guys think.
Thanks,
jbu
If word alignment of the message body is of some use, then by all means, pad the message to avoid other contortions. The padding will be of benefit if most of the message is processed as machine words with decent intensity.
If the message is a stream of bytes, for instance xml, then padding won't do you a whole heck of a lot of good.
As far as actually designing a wire protocol, you should probably consider using a plain text protocol with compression (including the header), which will probably use less bandwidth than any hand-designed binary protocol you could possibly invent.
I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
If I'm a receiver, I might pass a buffer (i.e. an array of bytes) to the protocol driver (i.e. the TCP stack) and say, "give this back to me when there's data in it".
What I (the application) get back, then, is an array of bytes which contains the data. Using C-style tricks like "casting" and so on I can treat portions of this array as if it were words and double-words (not just bytes) ... provided that they're suitably aligned (which is where padding may be required).
Here's an example of a statement which reads a DWORD from an offset in a byte buffer:
DWORD getDword(const byte* buffer)
{
    // we want the DWORD which starts at byte-offset 8
    buffer += 8;
    // dereference as if it were pointing to a DWORD
    // (this would fail on some machines if the pointer
    // weren't pointing to a DWORD-aligned boundary)
    return *((DWORD*)buffer);
}
Here's the corresponding code in Intel assembly; note that it's a single opcode, i.e. quite an efficient way to access the data, more efficient than reading and accumulating separate bytes:
mov eax,DWORD PTR [esi+8]
One reason to consider padding is if you plan to extend your protocol over time. Some of the padding can be intentionally set aside for future assignment.
Another reason to consider padding is to save a couple of bits in the length field, i.e. if the length is always a multiple of 4 or 8, you can drop 2 or 3 bits from the length field.
One other good reason that TCP has padding (which probably does not apply to you) is it allows dedicated network processing hardware to easily separate the data from the header. As the data always starts on a 32 bit boundary, it's easier to separate the header from the data when the packet gets routed.
If you have a 3 byte header and align it to 4 bytes, then designate the unused byte as 'reserved for future use' and require the bits to be zero (rejecting messages where they are not as malformed). That leaves you some extensibility. Or you might decide to use the byte as a version number - initially zero, and then incrementing it if (when) you make incompatible changes to the protocol. Don't let the value be 'undefined' and "don't care"; you'll never be able to use it if you start out that way.
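A minimal sketch of that layout, assuming the 3-byte header from the question and repurposing the pad byte as a version field (the names are illustrative, not a prescribed format):

#include <stdint.h>
#include <assert.h>

struct msg_header {
    uint8_t fields[3];   /* the 3 header bytes from the question        */
    uint8_t version;     /* was the padding byte: 0 today, increment on */
                         /* incompatible changes, reject unknown values */
};

/* The body that follows the header then starts on a 32-bit boundary. */
static_assert(sizeof(struct msg_header) == 4, "header must stay 4 bytes");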
