python3 bytes construct adding spurious bytes - python-3.x

I am new to python 3. I am sending bytes across the wire.
When I send s.send(b'\x8f\x35\x4a\x5f') and look at the trace, I only see 5f4a358f.
However, if I create a variable:
test=(['\x8f\x35\x4a\x5f'])
print(str(''.join(test).encode()))
I receive b'\xc2\x8f5J_'
As you can see, there is an extra byte \xc2.
My question is two-fold:
1) Why does str.encode(), applied to a string that already looks "encoded", add the extra byte \xc2, whereas the literal byte string b'\x8f\x35\x4a\x5f' gets no extra byte added?
2) If I am passing bytes into a variable used as a buffer to send data across a socket, how do I create and send a set of literal bytes (e.g. a b'...' literal) programmatically so that no extra \xc2 byte is added when it goes across the wire?
Thank you all for your time! I really appreciate the help.

Because it's not encoded; it's text consisting of U+008F U+0035 U+004A U+005F. Then when you encode it (as UTF-8 by default), the extra byte is added. Either use bytes in the first place, or encode as Latin-1. But use bytes.
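A minimal sketch of the difference (assuming Python 3; s is whatever socket you already have open):
data = b'\x8f\x35\x4a\x5f'     # bytes literal: exactly these four bytes
text = '\x8f\x35\x4a\x5f'      # str literal: four code points, not bytes

text.encode()                  # UTF-8 by default -> b'\xc2\x8f5J_' (five bytes)
text.encode('latin-1')         # -> b'\x8f5J_' (four bytes, identical to data)

s.send(data)                   # preferred: build the payload as bytes from the start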

Related

I have attempted to supply hand-written shellcode, but it is being read as a string and not as bytes; what next?

How do I get "\x90" to be read as the byte value corresponding to the x86 NOP instruction when it is supplied as a field in the standard argument list on Linux? I have a buffer that I fill to its 10-byte capacity and then overflow into the next 8 bytes with the new return address, or at least that is what I would like to happen. Because the byte sequence being supplied is read as characters rather than as a byte sequence, I do not know how to fix this. What next?
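A minimal sketch of the "build it as bytes, not text" idea (assuming Python 3 on Linux, a hypothetical ./vulnerable target, and a made-up return address); note that argv entries cannot contain NUL bytes, so this sketch feeds the payload through stdin instead:
import struct
import subprocess

nop_sled = b"\x90" * 10                           # ten raw NOP bytes, not the four characters "\x90"
ret_addr = struct.pack("<Q", 0x00007fffffffe000)  # hypothetical little-endian 8-byte return address
payload = nop_sled + ret_addr

# a 64-bit address always contains zero bytes, which argv cannot carry,
# so hand the raw bytes to the target on stdin instead
subprocess.run(["./vulnerable"], input=payload)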

problems sending bytes greater than 0x7F over a python3 serial port

I'm working with Python 3 and cannot find an answer to my little problem.
My problem is sending a byte greater than 0x7F over the serial port from my Raspberry Pi.
example:
import serial
ser=serial.Serial("/dev/ttyAMA0")
a=0x7F
ser.write(bytes(chr(a), 'UTF-8'))
works fine! The receiver gets 0x7F
if a equals 0x80
a=0x80
ser.write(bytes(chr(a), 'UTF-8'))
the receiver gets two bytes: 0xC2 0x80
If I change the encoding to UTF-16, the receiver reads
0xFF 0xFE 0x80 0x00
The receiver should get only 0x80!
What's wrong? Thanks for your answers.
The UTF-8 specification says that code points encoded as a single byte/octet start with a 0 bit. Because 0x80 is 10000000 in binary, it has to be encoded as two bytes/octets, 11000010 10000000, i.e. 0xC2 0x80. 0x7F is 01111111, so a reader knows it is only 1 byte/octet long.
UTF-16 represents every code point in this range as 2 bytes/octets and prepends a Byte Order Mark, which essentially tells the reader which octet is the most significant (i.e. the endianness).
Check the UTF-8 specification for the full details, but essentially you are moving from the end of the 1-byte range to the start of the 2-byte range.
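A quick check in Python 3 makes the boundary visible:
>>> chr(0x7F).encode("utf-8")
b'\x7f'
>>> chr(0x80).encode("utf-8")
b'\xc2\x80'
>>> chr(0x80).encode("utf-16")
b'\xff\xfe\x80\x00'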
I don't understand why you want to send your own custom 1-byte words, but what you are really looking for is any SBCS (Single Byte Character Set) which has a character for those bytes you specify. UTF-8/UTF-16 are MBCS, which means when you encode a character, it may give you more than a single byte.
Before the UTF encodings came along, everything was SBCS, which meant that any code page you selected was coded using 8 bits. The problem arose when 256 characters were not enough, and code pages like IBM273 (IBM EBCDIC Germany) and ISO-8859-1 (ANSI Latin 1; Western European) had to be created to define what a byte such as 0x2C meant. Both the sender and receiver needed to set the same code page identifier, or they wouldn't understand each other. There is further confusion because these SBCS code pages don't always use all 256 positions, so a byte such as 0x7F may not even exist or have a meaning.
What you could do is encode to something like code page 737 (IBM 00737): send the "Α" (Greek capital Alpha) character and it will be encoded as 0x80.
If that doesn't work, I'm not sure whether you can push a raw byte through pyserial this way, since the write() call in your example goes through a text encoding; you may need to look into the source code for the lower-level details.
a=0x80
ser.write(bytes(chr(a), 'ISO-8859-1'))
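If the goal is just to get the raw byte 0x80 onto the wire, a minimal sketch (assuming Python 3, and that pyserial's write() accepts a bytes-like object) skips text encoding entirely; the code-page claim above can also be checked directly:
import serial

ser = serial.Serial("/dev/ttyAMA0")
a = 0x80

# sanity check on the code-page route: Greek capital Alpha really is 0x80 in cp737
assert "Α".encode("cp737") == b"\x80"

# bytes route: build the single raw byte directly, no text encoding involved
ser.write(bytes([a]))          # bytes([0x80]) == b"\x80"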

How many actual bytes of memory node.js Buffer uses internally to store 1 logical byte of data?

The Node.js documentation states that:
A Buffer is similar to an array of integers but corresponds to a raw memory allocation outside the V8 heap.
Am I right that all integers are represented as 64-bit floats internally in JavaScript?
Does it mean that storing 1 byte in Node.js Buffer actually takes 8 bytes of memory?
Thanks
Buffers are simply an array of bytes, so the length of the buffer is essentially the number of bytes that the Buffer will occupy.
For instance, the new Buffer(size) constructor is documented as "Allocates a new buffer of size octets." Here octets clearly identifies the cells as single-byte values. Similarly buf[index] states "Get and set the octet at index. The values refer to individual bytes, so the legal range is between 0x00 and 0xFF hex or 0 and 255.".
While a buffer is absolutely an array of bytes, you may interact with it as integers or other types using the buf.read* class of functions available on the buffer object. Each of these has a specific number of bytes that are affected by the operations.
For more specifics on the internals, Node simply passes the length through to smalloc, which uses malloc, as you'd expect, to allocate the specified number of bytes.

Preon encode() does not fill up remaining bits until the byte boundary is reached

I have a message in which a variable-length run of 7-bit characters is encoded. Unfortunately those characters are stored in the message packed as 7 bits each, which means the last byte of the message is not necessarily aligned to a byte boundary.
Decoding a message with Preon works fine, but when encoding the previously decoded message with Preon and comparing the byte arrays, the arrays do not match in length.
The encoded byte array is one byte smaller than the original one.
I debugged Preon because I assumed a bug, but it works as designed: when a byte boundary is reached, Preon holds the remaining bits until the next write() call to the BitChannel occurs. But for the last byte there is no further call.
The question is, is there a way to tell Preon to flush the remaining buffer?

To pad or not to pad - creating a communication protocol

I am creating a protocol to have two applications talk over a TCP/IP stream and am figuring out how to design a header for my messages. Using the TCP header as an initial guide, I am wondering whether I will need padding. I understand that when we're dealing with a cache, we want the data being stored to fit within a cache line so that it can be retrieved efficiently. However, I do not understand how it makes sense to pad a header, considering that an application will parse a stream of bytes and store it however it sees fit.
For example: I want to send over a message header consisting of a 3-byte field followed by a 1-byte padding field for 32-bit alignment. Then I will send over the message data.
In this case, the receiver will just take 3 bytes from the stream, throw away the padding byte, and then start reading the message data. As I see it, the receiver will then store the 3 bytes and the message data however it wants. The whole point of byte alignment is so that the data can be retrieved efficiently, but if the receiver doesn't care about the padding, how is the retrieval any more efficient?
Without the padding, the receiver just takes the 3 header bytes from the stream and then takes the data bytes. Since the receiver stores these bytes however it wants, how does it matter whether or not the padding is there?
Maybe I'm missing the point of padding.
It's slightly hard to extract a question from this post, but with what I've said you guys can probably point out my misconceptions.
Please let me know what you guys think.
Thanks,
jbu
If word alignment of the message body is of some use, then by all means, pad the message to avoid other contortions. The padding will be of benefit if most of the message is processed as machine words with decent intensity.
If the message is a stream of bytes, for instance xml, then padding won't do you a whole heck of a lot of good.
As far as actually designing a wire protocol, you should probably consider using a plain text protocol with compression (including the header), which will probably use less bandwidth than any hand-designed binary protocol you could possibly invent.
I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
If I'm a receiver, I might pass a buffer (i.e. an array of bytes) to the protocol driver (i.e. the TCP stack) and say, "give this back to me when there's data in it".
What I (the application) get back, then, is an array of bytes which contains the data. Using C-style tricks like "casting" and so on I can treat portions of this array as if it were words and double-words (not just bytes) ... provided that they're suitably aligned (which is where padding may be required).
Here's an example of a statement which reads a DWORD from an offset in a byte buffer:
// assuming, e.g.: typedef unsigned char byte; typedef uint32_t DWORD;
DWORD getDword(const byte* buffer)
{
    // we want the DWORD which starts at byte-offset 8
    buffer += 8;
    // dereference as if it were pointing to a DWORD
    // (this would fail on some machines if the pointer
    //  weren't pointing to a DWORD-aligned boundary)
    return *((const DWORD*)buffer);
}
Here's the corresponding access in Intel assembly; note that it's a single opcode, i.e. quite an efficient way to access the data, more efficient than reading and accumulating separate bytes:
mov eax,DWORD PTR [esi+8]
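For comparison, a sketch of the same read in Python (assuming the little-endian layout the assembly above implies); struct copies the bytes out, so pointer alignment never comes into play:
import struct

def get_dword(buffer):
    # read the 32-bit little-endian value that starts at byte offset 8
    (value,) = struct.unpack_from("<I", buffer, 8)
    return value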
One reason to consider padding is if you plan to extend your protocol over time. Some of the padding can be intentionally set aside for future assignment.
Another reason to consider padding is to save a couple of bits in length fields: if lengths are always a multiple of 4 or 8, you can drop 2 or 3 bits from the length field.
One other good reason that TCP has padding (which probably does not apply to you) is it allows dedicated network processing hardware to easily separate the data from the header. As the data always starts on a 32 bit boundary, it's easier to separate the header from the data when the packet gets routed.
If you have a 3 byte header and align it to 4 bytes, then designate the unused byte as 'reserved for future use' and require the bits to be zero (rejecting messages where they are not as malformed). That leaves you some extensibility. Or you might decide to use the byte as a version number - initially zero, and then incrementing it if (when) you make incompatible changes to the protocol. Don't let the value be 'undefined' and "don't care"; you'll never be able to use it if you start out that way.
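A minimal sketch of that "reserved byte, must be zero" layout (assuming Python's struct module; the message-type and flags fields are hypothetical stand-ins for whatever the real 3 header bytes carry):
import struct

# hypothetical 4-byte header: 2-byte message type, 1-byte flags, 1 reserved/padding byte
HEADER_FMT = "!HBB"                       # network byte order: uint16, uint8, uint8
HEADER_LEN = struct.calcsize(HEADER_FMT)  # == 4

def build_header(msg_type, flags):
    return struct.pack(HEADER_FMT, msg_type, flags, 0)   # reserved byte is always zero

def parse_header(data):
    msg_type, flags, reserved = struct.unpack(HEADER_FMT, data[:HEADER_LEN])
    if reserved != 0:
        raise ValueError("malformed header: reserved byte must be zero")
    return msg_type, flags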
