I am trying to wrap my head around how the receiver identifies the endianness of the sender. I know the initial byte is usually the architecture/type of the sender; for example, 0x00 is i386, etc. However, how does the first byte help at all if the receiver has no idea how to interpret it?
Endianness refers to the ordering of bytes into larger numbers, not the order of bits inside a byte. A single byte is always endian-safe; networks transfer byte streams transparently (that is, bytes are received in the same order in which they were sent).
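For example, here is a minimal Python sketch (using the standard struct module) of what byte order actually affects; the 0x12345678 value is arbitrary:

import struct

value = 0x12345678
big    = struct.pack(">I", value)   # b'\x12\x34\x56\x78'  (big-endian / network order)
little = struct.pack("<I", value)   # b'\x78\x56\x34\x12'  (little-endian)
single = struct.pack("B", 0x42)     # b'\x42' on every machine; a single byte has no ordering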
I need to transfer data over a serial port. In order to ensure integrity of the data, I want a small envelope protocol around each protobuf message. I thought about the following:
message type (1 byte)
message size (2 bytes)
protobuf message (N bytes)
(checksum; optional)
The message type will mostly be a mapping to the messages defined in the proto files. However, if a message gets corrupted or some bytes are lost, the message size will not be correct and all subsequent bytes cannot be interpreted anymore. One way to solve this would be the introduction of delimiters between messages, but for that I need to choose something that is not used by protobuf. Is there a byte sequence that is never used by any protobuf message?
I also thought about a different way. If the master finds out that packages are corrupted, it should reset the communication to a clean start. For that I want the master to send a RESTART command to the slave. The slave should answer with an ACK and then start sending complete messages again. All bytes received between RESTART and ACK are to be discarded by the master. I want to encode ACK and RESTART as special messages. But with that approach I face the same problem: I need to find byte sequences for ACK and RESTART that are not used by any protobuf messages.
Maybe I am also taking the wrong approach - feel free to suggest other approaches to deal with lost bytes.
Is there a byte sequence that is never used by any protobuf message?
No; it is a binary serializer and can contain arbitrary binary payloads (especially in the bytes type). You cannot use sentinel values. Length prefix is fine (your "message size" header), and a checksum may be a pragmatic option. Alternatively, you could impose an artificial sentinel to follow each message (maybe a guid chosen per-connection as part of the initial handshake), and use that to double-check that everything looks correct.
One way to help recover packet synchronization after a rare problem is to use synchronization words in the beginning of the message, and use the checksum to check for valid messages.
This means that you put a constant value, e.g. 0x12345678, before your message type field. Then if a message fails checksum check, you can recover by finding the next 0x12345678 in your data.
Even though that value could sometimes occur in the middle of the message, it doesn't matter much. The checksum check will very probably catch that there isn't a real message at that position, and you can search forwards until you find the next marker.
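To make that concrete, here is a rough Python sketch of that recovery strategy, assuming a 4-byte sync word, a 2-byte big-endian length and a CRC-32 trailer (the constants and helper names are made up for illustration, not taken from any library):

import struct
import zlib

SYNC = b'\x12\x34\x56\x78'   # the example sync word from above

def frame(payload):
    # sync word + 2-byte big-endian length + payload + CRC-32 of the payload
    return SYNC + struct.pack(">H", len(payload)) + payload + struct.pack(">I", zlib.crc32(payload))

def find_frames(buffer):
    # Yield every payload whose length and CRC check out; skip anything else.
    i = buffer.find(SYNC)
    while i != -1:
        start = i + len(SYNC)
        if start + 2 > len(buffer):
            break                          # length field not fully received yet
        (length,) = struct.unpack_from(">H", buffer, start)
        end = start + 2 + length
        if end + 4 > len(buffer):
            break                          # payload or CRC not fully received yet
        payload = buffer[start + 2:end]
        (crc,) = struct.unpack_from(">I", buffer, end)
        if crc == zlib.crc32(payload):
            yield payload
            i = buffer.find(SYNC, end + 4)
        else:
            i = buffer.find(SYNC, i + 1)   # false sync word or corrupted frame: keep searching

In a real receiver you would keep the unconsumed tail of the buffer and re-run the scan as more bytes arrive; the sketch only shows the resynchronisation logic.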
I'm a bit confused about the bitfield message in BitTorrent. I have noted my points of confusion in the form of questions below.
Optional vs Required
Bitfield to be sent immediately after the handshaking sequence is completed
I'm assuming this is compulsory, i.e. after the handshake there must follow a bitfield message. Correct?
When to expect bitfield?
The bitfield message may only be sent immediately after the handshaking sequence is completed, and before any other messages are sent
Assuming I read this correctly: although it is an optional message, a peer can still send the bitfield message before any other message (like request, choke, unchoke etc.). Correct?
The high bit in the first byte corresponds to piece index 0
If I'm correct, the bitfield represents the state, i.e. whether or not the peer has a given piece.
Assuming that my bitfield is [1,1,1,1,1,1,1,1,1,1, ...], I establish the fact that the peer has the 10th piece missing, and if the bitfield looks like [1,1,0,1,1,1,1,1,1,1, ...], the peer has the 3rd piece missing. Then what does "the high bit in the first byte corresponds to piece index 0" mean?
Spare bits
Spare bits at the end are set to zero
What does this mean? If a bit at the end is 0, doesn't that mean the peer is missing that piece? Why are the spare bits used?
Most important of all: what is the purpose of the bitfield?
My hunch is that the bitfield makes it easier to find the right peer for a piece, by knowing what is available from that peer. Am I correct on this?
@Encombe, here is how my bitfield payload looks:
\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFE
I'm assuming this is compulsory, i.e. after the handshake there must follow a bitfield message. Correct?
No, the bitfield message is optional, but if a client sends it, it MUST be the first message after the handshake.
Also, both peers must have sent their complete handshakes (i.e. the handshaking sequence is completed) before either of them starts to send any type of regular message, including the bitfield message.
Assuming I read this correctly: although it is an optional message, a peer can still send the bitfield message before any other message (like request, choke, unchoke etc.). Correct?
Yes, see above. If a client sends a bitfield message anywhere else the connection must be closed.
Assuming that my bitfield is [1,1,1,1,1,1,1,1,1,1, ...], I establish the fact that the peer has the 10th piece missing
No. It's unclear to me whether your numbers are bits (0b1111111111) or bytes (0x01010101010101010101).
If they're bits (0b1111111111): it means the peer has pieces 0 to 9.
If they're bytes (0x01010101010101010101): it means the peer has pieces 7, 15, 23, 31, 39, 47, 55, 63, 71 and 79.
if the bitfield looks like [1,1,0,1,1,1,1,1,1,1, ...], the peer has the 3rd piece missing.
No, pieces are zero-indexed. 0b1101111111 means piece 2 is missing.
Then what does "the high bit in the first byte corresponds to piece index 0" mean?
It means that the piece with index 0 is represented by the leftmost bit, i.e. the most significant bit of the first byte.
Eight bits = one byte:
0b10000000 = 0x80   (high bit set, meaning that the client has piece 0)
0b00000001 = 0x01   (low bit set, meaning that the client has piece 7)
Why are the spare bits used?
If the number of pieces in the torrent is not evenly divisible by eight, there will be leftover bits in the last byte of the bitfield that don't represent any pieces. Those bits must be set to zero.
The size of the bitfield in bytes can be calculated this way:
size_bitfield = math.ceil( number_of_pieces / 8 )
and the number of spare bits is:
spare_bits = 8 * size_bitfield - number_of_pieces
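Putting the bit layout and the size formula together, a small illustrative Python sketch (parse_bitfield is a made-up helper, not part of any BitTorrent library) could decode the payload into a set of piece indices:

import math

def parse_bitfield(payload, number_of_pieces):
    # Return the set of piece indices the peer claims to have.
    if len(payload) != math.ceil(number_of_pieces / 8):
        raise ValueError("bitfield has the wrong length")
    have = set()
    for index in range(number_of_pieces):
        if payload[index // 8] & (0x80 >> (index % 8)):   # high bit of byte 0 is piece 0
            have.add(index)
    for index in range(number_of_pieces, len(payload) * 8):
        if payload[index // 8] & (0x80 >> (index % 8)):
            raise ValueError("spare bit set; peer violates the spec")
    return have

For the payload shown above, every bit is set except the very last one (the final 0xFE ends in a zero bit); whether that zero bit means a missing piece or a spare bit depends on the total number of pieces.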
What is the purpose of the bitfield?
The purpose is to tell what pieces the client has, so the other peer knows what pieces it can request.
I'm working with Python 3 and cannot find an answer to my little problem.
My problem is sending a byte greater than 0x7F over the serial port from my Raspberry Pi.
example:
import serial
ser=serial.Serial("/dev/ttyAMA0")
a=0x7F
ser.write(bytes(chr(a), 'UTF-8'))
works fine! The receiver gets 0x7F
if a equals 0x80
a=0x80
ser.write(bytes(chr(a), 'UTF-8'))
the receiver gets two bytes: 0xC2 0x80
If I change the type to UTF-16, the receiver reads
0xFF 0xFE 0x80 0x00
The receiver should get only 0x80!
What's wrong? Thanks for your answers.
The UTF-8 specification says that code points encoded as a single byte/octet start with a 0 bit. Because 0x80 is 10000000 in binary, it has to be encoded as two bytes/octets, 11000010 10000000, i.e. 0xC2 0x80. 0x7F is 01111111, so when reading it, the decoder knows it is only 1 byte/octet long.
UTF-16 represents code points as 2-byte/octet units and has a Byte Order Mark, which essentially tells the reader which one is the most significant octet (i.e. the endianness).
Check the UTF-8 specification for the full details, but essentially you are moving from the end of the 1-byte range to the start of the 2-byte range.
I don't understand why you want to send your own custom 1-byte words, but what you are really looking for is any SBCS (single-byte character set) that has a character for the bytes you specify. UTF-8/UTF-16 are MBCS (multi-byte character sets), which means that when you encode a character, you may get more than a single byte.
Before the UTF encodings came along, everything was SBCS, which meant that any code page you selected was coded using 8 bits. The problem arose when 256 characters were not enough, and code pages like IBM273 (IBM EBCDIC Germany) and ISO-8859-1 (ANSI Latin 1; Western European) were needed to interpret what "0x2C" meant. Both the sender and receiver needed to set their code page identifier to the same value, or they wouldn't understand each other. There is further confusion because these SBCS code pages don't always use the full 256 characters, so "0x7F" may not even exist / have a meaning.
What you could do is encode it to something like codepage 737/IBM 00737, send the "Α" (Greek Alpha) character and it should encode it as 0x80.
If it doesn't work, I'm not sure if you can send the raw byte through pyserial, as the write() method seems to require an encoding; you may need to look into the source code to see the lower-level details.
a=0x80
ser.write(bytes(chr(a), 'ISO-8859-1'))
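For what it's worth, pyserial's write() accepts raw bytes in Python 3, so you can also avoid text encodings entirely; a minimal sketch with the same device as in the question:

import serial

ser = serial.Serial("/dev/ttyAMA0")
a = 0x80
ser.write(bytes([a]))        # sends exactly one byte, 0x80
ser.write(b'\x7f\x80\xff')   # a bytes literal also goes out unmodified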
I have some designing to do for a serial protocol and am running into some questions that I figure must have been considered elsewhere.
So I'm wondering if there are some recommendations for best practices in designing serial protocols. (Please either state a fact that is easily verifiable, or cite a reputable source if you make a claim.) General recommendations for websites/books are also welcome.
In particular I have to deal with issues like
parsing a stream of bytes into packets
verifying a packet is correct (easy with a CRC, for instance)
identifying reasonable types of errors that can occur (e.g. in a point-to-point serial stream, sporadic single bit errors, and dropped series of bytes, are both likely, but extra phantom bytes are unlikely; whereas with a record stored in flash memory or on a disk drive the types of errors that predominate are different)
error correction or recovery (if I detect an error in a packet, can I correct it? If not, can I resync to the boundary of the next packet?)
how to make variable-length packets robust to error correction / recovery.
Any suggestions?
Packet delimiting
For syncing to packet boundaries, typically you have a byte or byte sequence that identifies the packet boundary and cannot occur within the packet itself. If the packet data happens to contain that identifier, then you have to "escape" (aka byte-stuff) it; see the sketch after the examples below.
Examples:
PPP Encapsulation
Consistent Overhead Byte Stuffing (COBS), or maybe COBS/R, which encodes data packets so no zero bytes are present, thus you can use zero bytes for packet delimiting
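As a concrete illustration of escaping, here is a minimal Python byte-stuffing sketch using the byte values SLIP defines (END = 0xC0 as the frame delimiter, ESC = 0xDB as the escape byte); COBS achieves the same goal with a small fixed overhead instead of per-byte escaping:

END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD

def stuff(payload):
    out = bytearray()
    for b in payload:
        if b == END:
            out += bytes([ESC, ESC_END])   # escape the frame delimiter
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])   # escape the escape byte itself
        else:
            out.append(b)
    out.append(END)                        # terminate the frame
    return bytes(out)

def unstuff(frame):
    # `frame` is everything received up to (not including) the END byte
    out = bytearray()
    it = iter(frame)
    for b in it:
        if b == ESC:
            nxt = next(it)                 # a trailing ESC would be malformed; not handled here
            out.append(END if nxt == ESC_END else ESC)
        else:
            out.append(b)
    return bytes(out)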
Packet verification
Various options are:
Checksum
Adler-32
Fletcher
CRC (the more bits the better the check)
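Python's zlib module conveniently exposes two of these for a quick sketch; appending the CRC as a 4-byte big-endian trailer is just one common layout, not the only option:

import struct
import zlib

payload = b"example packet"
adler = zlib.adler32(payload)              # Adler-32: cheap, but weak on short packets
crc = zlib.crc32(payload)                  # CRC-32: the stronger check of the two
framed = payload + struct.pack(">I", crc)  # packet with a 4-byte CRC trailer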
Error correction etc
Good questions. I've not had much experience with that.
Have you considered FEC (Forward Error Correction)?
This procedure is very often used in "physical" level communication protocols such as WDM (Wavelength Division Multiplexing) / OTN (Optical Transport Network).
I am creating a protocol to have two applications talk over a TCP/IP stream and am figuring out how to design a header for my messages. Using the TCP header as an initial guide, I am wondering if I will need padding. I understand that when we're dealing with a cache, we want to make sure that data being stored fits in a row of cache so that when it is retrieved it is done so efficiently. However, I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
For example: I want to send over a message header consisting of a 3 byte field followed by a 1 byte padding field for 32 bit alignment. Then I will send over the message data.
In this case, the receiver will just take 3 bytes from the stream and throw away the padding byte, and then start reading message data. As I see it, he will now be storing the 3 bytes and the message data the way he wants. The whole point of byte alignment is so that data will be retrieved in an efficient manner, but if the retriever doesn't care about the padding, how will it be retrieved efficiently?
Without the padding, the retriever just takes the 3 header bytes from the stream and then takes the data bytes. Since the retriever stores these bytes however he wants, how does it matter whether or not the padding is done?
Maybe I'm missing the point of padding.
It's slightly hard to extract a question from this post, but with what I've said you guys can probably point out my misconceptions.
Please let me know what you guys think.
Thanks,
jbu
If word alignment of the message body is of some use, then by all means, pad the message to avoid other contortions. The padding will be of benefit if most of the message is processed as machine words with decent intensity.
If the message is a stream of bytes, for instance xml, then padding won't do you a whole heck of a lot of good.
As far as actually designing a wire protocol, you could also consider using a plain-text protocol with compression (including the header), which will often use less bandwidth than a hand-designed binary protocol.
I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
If I'm a receiver, I might pass a buffer (i.e. an array of bytes) to the protocol driver (i.e. the TCP stack) and say, "give this back to me when there's data in it".
What I (the application) get back, then, is an array of bytes which contains the data. Using C-style tricks like "casting" and so on, I can treat portions of this array as if they were words and double-words (not just bytes) ... provided that they're suitably aligned (which is where padding may be required).
Here's an example of a statement which reads a DWORD from an offset in a byte buffer:
typedef unsigned char byte;
typedef unsigned int DWORD; /* assuming a 32-bit unsigned int */

DWORD getDword(const byte* buffer)
{
    //we want the DWORD which starts at byte-offset 8
    buffer += 8;
    //dereference as if it were pointing to a DWORD
    //(this would fail on some machines if the pointer
    //weren't pointing to a DWORD-aligned boundary)
    return *((const DWORD*)buffer);
}
Here's the corresponding access in Intel assembly; note that it's a single instruction, i.e. quite an efficient way to access the data, more efficient than reading and accumulating separate bytes:
mov eax,DWORD PTR [esi+8]
One reason to consider padding is if you plan to extend your protocol over time. Some of the padding can be intentionally set aside for future assignment.
Another reason to consider padding is to save a couple of bits in length fields: if lengths are always a multiple of 4 or 8, you save 2 or 3 bits off the length field.
One other good reason that TCP has padding (which probably does not apply to you) is it allows dedicated network processing hardware to easily separate the data from the header. As the data always starts on a 32 bit boundary, it's easier to separate the header from the data when the packet gets routed.
If you have a 3 byte header and align it to 4 bytes, then designate the unused byte as 'reserved for future use' and require the bits to be zero (rejecting messages where they are not as malformed). That leaves you some extensibility. Or you might decide to use the byte as a version number - initially zero, and then incrementing it if (when) you make incompatible changes to the protocol. Don't let the value be 'undefined' and "don't care"; you'll never be able to use it if you start out that way.
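As a hedged sketch of that suggestion (the field layout and the version convention are purely illustrative, not part of any existing protocol), packing and validating such a 4-byte header in Python might look like this:

import struct

HEADER = struct.Struct("!3sB")   # 3-byte field + 1 reserved/version byte = 4-byte aligned

def pack_header(field, version=0):
    # 'field' stands in for the question's 3-byte header field
    return HEADER.pack(field, version)

def parse_header(data):
    field, version = HEADER.unpack(data[:HEADER.size])
    if version != 0:
        raise ValueError("unsupported protocol version")   # reject, don't ignore, per the advice above
    return field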