Some sources say that recv should be given the maximum possible length of a message, like recv(1024):
message0 = str(client.recv(1024).decode('utf-8'))
But other sources say that it should be given the exact number of bytes of the incoming message. If the message is "hello":
message0 = str(client.recv(5).decode('utf-8'))
What is the correct way of using recv()?
Some sources say ... But other sources say ... message ...
Both sources are wrong.
The argument to recv is the maximum number of bytes one wants to read at once.
With a UDP socket this should be the size of the message one wants to read, or larger; a single recv will only ever return a single message anyway. If the given size is smaller than the message, the message is truncated and the rest is discarded.
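For instance, a quick sketch of the UDP case in Python (the loopback address, port and sizes are just examples, not from the answer):

import socket

recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(('127.0.0.1', 0))                 # bind to any free port on loopback
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto(b'hello world', recv_sock.getsockname())

data = recv_sock.recv(5)                         # buffer smaller than the 11-byte datagram
print(data)                                      # b'hello' on Linux/macOS; the rest of the datagram is discarded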
With a TCP socket (the case you ask about) there is no concept of a message in the first place, since TCP is a byte stream only. recv will simply return the number of bytes available to read, up to the given size. Specifically, a single recv in a TCP receiver does not need to match a single send in the sender. It might match, and it often will match if the amount of data is small, but there is no guarantee and one should never rely on it.
... message0 = str(client.recv(5).decode('utf-8'))
Note that calling decode('utf-8') directly on the data returned by recv is a bad idea. One first needs to be sure that all the expected data have been read, and only then call decode('utf-8'). If only part of the data has been read, the end of the read data could fall in the middle of a character, since a single character in UTF-8 might be encoded in multiple bytes (everything except ASCII characters). If decode('utf-8') is called on an incomplete encoded character it will throw an exception and your code will break.
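A minimal sketch of the safer pattern in Python (the helper name and buffer size are my own choices): accumulate raw bytes and decode only once the whole message has arrived, here taken to mean the peer closed the connection.

import socket

def recv_all_text(sock: socket.socket) -> str:
    chunks = bytearray()
    while True:
        data = sock.recv(4096)        # returns at most 4096 bytes, possibly fewer
        if not data:                  # b'' means the peer closed the connection
            break
        chunks.extend(data)
    return chunks.decode('utf-8')     # decode only the complete byte sequence

If the connection stays open, some other framing (a length prefix or a delimiter) is needed to know when the message is complete.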
Related
Let's assume I have two machines, A and B, which run two programs each:
A.1 needs to talk to B.1 (duplex)
A.2 needs to talk to B.2 (duplex)
A TCP socket (duplex) connects A to B.
Upon reception of a message from A, B needs to know where to dispatch the message: should it go to B.1 or B.2? And vice versa.
A naive idea could be to prefix each message with a kind of header identifying the program to map to. Instead of sending "message to B.1", A could send "1: message to B.1". Upon reception, B sees the header "1: ", and knows to send "message to B.1" to B.1.
The problem is that messages can be split and when A sends "1: message to B.1", B could very well receive several chunks:
"1: me"
"ssage"
" to B"
".1"
If the chunks all have the same length, I can split the original messages into small chunks, each prefixed with a header. That adds some overhead, but that's fine.
The problem is: I am not sure I can configure the chunk size. Chunks may be of random lengths. If both A.1 and A.2 write to B.1 and B.2 respectively, I am not sure how B can know how to properly dispatch the chunks to the right recipients.
I'm using nodejs by the way. I'll look into the readableHighWaterMark and writableHighWaterMark options of the Duplex module to see if I can fix the chunk size, but I'm not sure this is how it works.
Any tips / ideas?
The header should be a fixed size and contain both the target program id and the size of the current chunk's data.
On the receiving side, read the fixed-size header, then read exactly nnn bytes of data as specified in the header (you have to call read/recv repeatedly until the entire chunk is received) and dispatch that data to the corresponding program. Then read the next header, and so on.
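A minimal sketch of that receive loop in Python (the question is about node, but the idea is the same; the 1-byte id / 4-byte length layout is just an example, not something the answer prescribes):

import struct

HEADER = struct.Struct('!BI')      # example layout: 1-byte program id, 4-byte big-endian chunk length

def recv_exact(sock, n):
    """Call recv() repeatedly until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('socket closed mid-frame')
        buf.extend(chunk)
    return bytes(buf)

def dispatch_loop(sock, handlers):
    """handlers maps a program id (e.g. 1 -> B.1, 2 -> B.2) to a callback taking the chunk data."""
    while True:
        header = recv_exact(sock, HEADER.size)
        program_id, length = HEADER.unpack(header)
        payload = recv_exact(sock, length)
        handlers[program_id](payload)

The sender prefixes each chunk with HEADER.pack(program_id, len(data)); chunk boundaries on the wire no longer matter, because the receiver reads exactly what each header announces.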
I have a multiplayer game written in Python that uses TCP, so when I send two packets at the same time they get mixed up; for example, if I send "Hello there" and "man" the client receives "Hello thereman".
What should I do to prevent them from getting mixed?
That's the way TCP works. It is a byte stream. It is not message-based.
Consider if you write "Hello there" and "man" to a file. If you read the file, you see "Hello thereman". A socket works the same way.
If you want to make sense of the byte stream, you need other information. For example, add line feeds to the stream to indicate end of line. For a binary file, include data structures such as "2-byte length (big-endian) followed by <length> bytes of data" so you can read the stream and break it into decipherable messages.
Note that socket methods send() and recv() must have their return values checked. recv(1024) for example can return '' (socket closed) or 1-1024 bytes of data. The size is a maximum to be returned. send() can send less than requested and you'll have to re-send the part that didn't send (or use sendall() in the first place).
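For the line-feed variant, a small Python sketch (the helper names and the b'\n' delimiter are my choices, not part of the answer):

def send_line(sock, text):
    sock.sendall(text.encode('utf-8') + b'\n')    # sendall() retries until every byte is written

def iter_lines(sock):
    """Yield one decoded message per line, regardless of how the bytes were chunked by TCP."""
    buf = bytearray()
    while True:
        data = sock.recv(1024)
        if not data:                               # connection closed
            break
        buf.extend(data)
        while b'\n' in buf:
            line, _, rest = bytes(buf).partition(b'\n')
            buf = bytearray(rest)
            yield line.decode('utf-8')

With this, sending "Hello there" and "man" as two lines keeps them separate on the receiving end, even if TCP delivers them in a single chunk.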
Or, use a framework that does all this for you...
I need to transfer data over a serial port. In order to ensure integrity of the data, I want a small envelope protocol around each protobuf message. I thought about the following:
message type (1 byte)
message size (2 bytes)
protobuf message (N bytes)
(checksum; optional)
The message type will mostly be a mapping between messages defined in proto files. However, if a message gets corrupted or some bytes are lost, the message size will not be correct and all subsequent bytes cannot be interpreted anymore. One way to solve this would be the introduction of delimiters between messages, but for that I need to choose something that is not used by protobuf. Is there a byte sequence that is never used by any protobuf message?
I also thought about a different way. If the master finds out that packages are corrupted, it should reset the communication to a clean start. For that I want the master to send a RESTART command to the slave. The slave should answer with an ACK and then start sending complete messages again. All bytes received between RESTART and ACK are to be discarded by the master. I want to encode ACK and RESTART as special messages. But with that approach I face the same problem: I need to find byte sequences for ACK and RESTART that are not used by any protobuf messages.
Maybe I am also taking the wrong approach - feel free to suggest other approaches to deal with lost bytes.
Is there a byte sequence that is never used by any protobuf message?
No; it is a binary serializer and can contain arbitrary binary payloads (especially in the bytes type). You cannot use sentinel values. Length prefix is fine (your "message size" header), and a checksum may be a pragmatic option. Alternatively, you could impose an artificial sentinel to follow each message (maybe a guid chosen per-connection as part of the initial handshake), and use that to double-check that everything looks correct.
One way to help recover packet synchronization after a rare problem is to use a synchronization word at the beginning of each message, and use the checksum to check for valid messages.
This means that you put a constant value, e.g. 0x12345678, before your message type field. Then if a message fails checksum check, you can recover by finding the next 0x12345678 in your data.
Even though that value could sometimes occur in the middle of the message, it doesn't matter much. The checksum check will very probably catch that there isn't a real message at that position, and you can search forwards until you find the next marker.
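A rough Python sketch of that recovery scheme, reusing the 0x12345678 marker from this answer and the type/size layout from the question (CRC-32 as the checksum and the exact field sizes are my assumptions):

import struct, zlib

SYNC = b'\x12\x34\x56\x78'                       # constant marker preceding every frame

def frame(msg_type, payload):
    """SYNC + type (1 byte) + size (2 bytes) + payload + CRC-32 over type/size/payload."""
    body = struct.pack('!BH', msg_type, len(payload)) + payload
    return SYNC + body + struct.pack('!I', zlib.crc32(body))

def extract_frame(buf):
    """Try to pull one valid frame out of buf.
    Returns ((msg_type, payload), rest) on success, or (None, rest) if more bytes are needed."""
    while True:
        start = buf.find(SYNC)
        if start < 0:
            return None, buf[-3:]                # keep a tail in case SYNC itself got split across reads
        buf = buf[start:]                        # drop any garbage before the marker
        if len(buf) < 4 + 3 + 4:                 # marker + type/size + CRC not complete yet
            return None, buf
        msg_type, size = struct.unpack('!BH', buf[4:7])
        end = 4 + 3 + size + 4
        if len(buf) < end:                       # payload not complete yet
            return None, buf
        body = buf[4:4 + 3 + size]
        (crc,) = struct.unpack('!I', buf[end - 4:end])
        if zlib.crc32(body) == crc:
            return (msg_type, body[3:]), buf[end:]
        buf = buf[4:]                            # bad checksum: search for the next marker

A corrupted or truncated frame fails the CRC check, and the scan simply resumes at the next occurrence of the marker, which is the resynchronization described above.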
Server like this:
char buf[10];
memset(buf, 0, 10);
write(sock, "te", 2);
write(sock, "ab", 2);
Client side:
char buf[5] = {0};
read(connfd, buf, 5);
I mean, 5 is no less than 2 + 2, but the result shows that I only received 2 bytes, which is "te".
This link, read() is not blocking in socket programming, has told me that:
When you call N write() at server its not necessary there should be N
read() call at other side.
What is wrong with my understanding or my code? Should I use another system call or something else?
When using stream-oriented sockets, like TCP, there is no guarantee whatsoever about the correspondence between reads and writes. That means one write can be read with multiple reads, and multiple writes can be read with a single read. Usually you will have one read for each write if the writes are short and spaced out, but there is no guarantee. Here you only read 2 bytes. That may happen; you just need one or two more calls to read. Note that if you are using the loopback interface, read and write calls may be closely matched, so you may have exactly one read call for each write, but again you can't be sure. The usual pattern is to keep reading in a loop until you have got the number of bytes required.
If you are using a datagram-oriented socket, like UDP, one write will be exactly one read (if transmission was successful), and if the read buffer supplied is too short, some data is discarded.
read() will return how many bytes were read. In the case where you always know how much data to expect, you want to read(), then check the return to verify the full amount was read. If it wasn't, read again until you've received the rest of the data you were expecting.
I am creating a protocol to have two applications talk over a TCP/IP stream and am figuring out how to design a header for my messages. Using the TCP header as an initial guide, I am wondering if I will need padding. I understand that when we're dealing with a cache, we want to make sure that data being stored fits in a row of cache so that when it is retrieved it is done so efficiently. However, I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
For example: I want to send over a message header consisting of a 3 byte field followed by a 1 byte padding field for 32 bit alignment. Then I will send over the message data.
In this case, the receiver will just take 3 bytes from the stream and throw away the padding byte, and then start reading the message data. As I see it, he will then store the 3 bytes and the message data however he wants anyway. The whole point of byte alignment is so that data can be retrieved in an efficient manner. But if the receiver doesn't care about the padding, how will it be retrieved efficiently?
Without the padding, the retriever just takes the 3 header bytes from the stream and then takes the data bytes. Since the retriever stores these bytes however he wants, how does it matter whether or not the padding is done?
Maybe I'm missing the point of padding.
It's slightly hard to extract a question from this post, but with what I've said you guys can probably point out my misconceptions.
Please let me know what you guys think.
Thanks,
jbu
If word alignment of the message body is of some use, then by all means, pad the message to avoid other contortions. The padding will be of benefit if most of the message is processed as machine words with decent intensity.
If the message is a stream of bytes, for instance xml, then padding won't do you a whole heck of a lot of good.
As far as actually designing a wire protocol, you should probably consider using a plain text protocol with compression (including the header), which will probably use less bandwidth than any hand-designed binary protocol you could possibly invent.
I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
If I'm a receiver, I might pass a buffer (i.e. an array of bytes) to the protocol driver (i.e. the TCP stack) and say, "give this back to me when there's data in it".
What I (the application) get back, then, is an array of bytes which contains the data. Using C-style tricks like "casting" and so on I can treat portions of this array as if it were words and double-words (not just bytes) ... provided that they're suitably aligned (which is where padding may be required).
Here's an example of a statement which reads a DWORD from an offset in a byte buffer:
DWORD getDword(const byte* buffer)
{
    // we want the DWORD which starts at byte-offset 8
    buffer += 8;
    // dereference as if it were pointing to a DWORD
    // (this would fail on some machines if the pointer
    // weren't pointing to a DWORD-aligned boundary)
    return *((DWORD*)buffer);
}
Here's the corresponding access in Intel assembly; note that it's a single opcode, i.e. quite an efficient way to access the data, more efficient than reading and accumulating separate bytes:
mov eax,DWORD PTR [esi+8]
One reason to consider padding is if you plan to extend your protocol over time. Some of the padding can be intentionally set aside for future assignment.
Another reason to consider padding is to save a couple of bits on length fields: if the length is always a multiple of 4 or 8, you can drop 2 or 3 bits from the length field.
One other good reason that TCP has padding (which probably does not apply to you) is it allows dedicated network processing hardware to easily separate the data from the header. As the data always starts on a 32 bit boundary, it's easier to separate the header from the data when the packet gets routed.
If you have a 3-byte header and align it to 4 bytes, then designate the unused byte as 'reserved for future use' and require its bits to be zero (rejecting messages where they are not, as malformed). That leaves you some extensibility. Or you might decide to use the byte as a version number - initially zero, and then incrementing it if (when) you make incompatible changes to the protocol. Don't let the value be 'undefined' and "don't care"; you'll never be able to use it if you start out that way.
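A tiny Python sketch of that convention (the struct layout and names are purely illustrative):

import struct

HEADER = struct.Struct('!3sB')    # 3 header bytes plus 1 reserved/version byte for 4-byte alignment

def pack_header(fields: bytes, version: int = 0) -> bytes:
    return HEADER.pack(fields, version)

def parse_header(data: bytes) -> bytes:
    fields, version = HEADER.unpack(data[:HEADER.size])
    if version != 0:              # reject unknown versions as malformed, per the advice above
        raise ValueError('unsupported protocol version %d' % version)
    return fields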