What does output_binary_int in OCaml do? - io

output_binary_int stdout 1101
prints M to stdout.
The doc says "Write one integer in binary format (4 bytes, big-endian)", but I don't really understand what to do with this function.

If you're looking at stdout in a terminal, you are expecting stdout to consist of a sequence of characters.
The output of output_binary_int is not characters; it is a raw 32-bit integer value, written as 4 bytes (since a byte is 8 bits). But those bytes aren't generally going to be meaningful when viewed as characters.
It will make more sense if you send the output to a file, then look at the contents of the file. Something like this:
$ rlwrap ocaml
OCaml version 4.10.0
# let ochan = open_out "mybfile";;
val ochan : out_channel = <abstr>
# output_binary_int ochan 1101;;
- : unit = ()
# close_out ochan;;
- : unit = ()
# ^D
$ od -tx1 mybfile
0000000 00 00 04 4d
0000004
As you can see, 4 bytes of binary data were written to the file. The 4 bytes represent the value 1101 in big-endian format.
If you're not familiar with binary integer representations (forgive me if you are), this means that the value in the file is the hex value 0x0000044d, which indeed is 1101.
If the character encoding of your terminal is UTF-8 (which is common), the output represents two null characters followed by the character with code 0x04 (Control-D), none of which has a visual representation, followed by the character with code 0x4d, which indeed is M.
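To double-check the bytes without od, here is a small Python 3 sketch (offered purely as a cross-check; it assumes the mybfile written in the session above is in the current directory):
# 1101 = 0x0000044d; int.to_bytes gives the same 4 big-endian bytes OCaml wrote.
expected = (1101).to_bytes(4, 'big')
print(expected)         # b'\x00\x00\x04M'  (0x4d is the letter M)
print(expected.hex())   # 0000044d
with open('mybfile', 'rb') as f:
    print(f.read() == expected)   # True if the file holds those same 4 bytes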

Related

Converting an integer to a bytes object in Python 3 results in a "Q" in PyCharm Debugger?

I am writing a function to implement an LFSR in Python 3. I've figured out the hard part, which is actually stepping through the state using the feedback value to get the key and getting all of the output bytes by ANDing the key and the input data. To make it easier to do bit shifts and OR and XOR operations, I did all of my operations on integers and saved the result as an integer, but the function is required to return bytes. From some Googling, an accepted way to convert an integer to a bytes object seems to be result_bytes = result_bytes.to_bytes((result_bytes.bit_length() + 7) // 8, 'big'). If I run hex_result_bytes = hex(result_bytes) (mind you, at this point result_bytes is still an integer, with the value 3187993425), I get the str result "0xbe04eb51", so clearly I should expect a bytes object that looks like b'\xbe\x04\xeb\x51'.
After running result_bytes = result_bytes.to_bytes((result_bytes.bit_length() + 7) // 8, 'big'), I should expect b'\xbe\x04\xeb\x51'. Instead, the result I get is b'\xbe\x04\xebQ' in the PyCharm debugger. Is there something obvious that I am missing here? I don't even know how I got Q because Q is clearly not something that you can get as a hexadecimal byte.
This isn't specific to the PyCharm debugger; it is how Python itself displays bytes objects: any byte that corresponds to a printable ASCII character is shown as that character rather than as a \x escape. I ran an expression to check whether b'\xbe\x04\xebQ' == b'\xbe\x04\xeb\x51', and it evaluates to True. 51 in hex corresponds to Q in ASCII. Another byte string I am working with right now should be b'\x48\xDC\x40\xD1\x4C', and part of it is likewise shown as ASCII, as b'H\xdc@\xd1L', where hex 48 corresponds to H in ASCII, hex 40 corresponds to @ in ASCII, and hex 4C corresponds to L in ASCII.
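If you want to see the whole value in hex rather than Python's default mixed repr, a short sketch (plain Python 3, nothing PyCharm-specific) makes the equality explicit:
data = (3187993425).to_bytes(4, 'big')
print(data)                          # b'\xbe\x04\xebQ'  (0x51 is the letter Q)
print(data == b'\xbe\x04\xeb\x51')   # True: same bytes, different display
print(data.hex())                    # be04eb51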

Python bytes representation

I'm writing a hex viewer on python for examining raw packet bytes. I use dpkt module.
I assumed that one byte may have a value between 0x00 and 0xFF. However, I've noticed that the Python bytes representation looks different:
b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
I don't understand what these symbols mean. How can I translate these symbols to the original 1-byte values that could be shown in a hex viewer?
The \xhh notation indicates a byte with hex value hh, i.e. it is the Python 3 way of writing 0xhh.
See https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
The b at the start of the literal indicates that the value is of type bytes rather than str. The above link also covers that. The \n is a newline character (the single byte 0x0a).
You can use bytearray to store and access the data. Here's an example using the byte string in your question.
>>> example_bytes = b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
>>> encoded_array = bytearray(example_bytes)
>>> print(encoded_array)
bytearray(b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1')
>>> # Print the value of \x8a, which is 138 in decimal.
>>> print(encoded_array[0])
138
>>> # Show the same value in hex.
>>> print(hex(encoded_array[0]))
0x8a
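If the goal is a hex-viewer-style listing, one possible sketch (just one way of many) formats every byte as two hex digits, whether or not it happens to be printable:
# Minimal hex-viewer-style dump: offset followed by two hex digits per byte.
example_bytes = b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
for offset in range(0, len(example_bytes), 16):
    row = example_bytes[offset:offset + 16]
    print('%08x  %s' % (offset, ' '.join('%02x' % b for b in row)))
# 00000000  8a 0a 1e 2b 1f 84 56 f2 ca 24 b1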
Hope this helps.

Reading-in a binary JPEG-Header (in Python)

I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequences of markers. Each Marker starts with FF xx, where xx is a specific Marker-ID.
So my idea was to simply read in the image in binary format and search for the corresponding byte combinations in the stream. This should enable me to split the header into the corresponding marker fields.
For instance, this is, what I receive, when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does Python not return nice chunks of 2 bytes (in hex format)?
Something like this is what I would expect:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those are not characters of the hex representation I expect from a byte string (only 0-9 and a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question summarizes to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (except a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does python not return me nice chunks of 2 bytes (in hex-format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two-bytes, transform it after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each 2 bytes or use unpack (there are actually several ways):
>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the big-endian specifier > and the unsigned short format H in unpack, because (I assume) that matches how JPEG stores its 2-byte values. Check the struct documentation if you want to deviate from this.
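If you then want to walk the actual marker structure instead of blind 2-byte chunks, a rough sketch along these lines could be a starting point (it assumes a well-formed file named picture.jpg, as in the question, stops at the start-of-scan marker, and is not a complete JPEG parser):
from struct import unpack

with open('picture.jpg', 'rb') as f:
    assert f.read(2) == b'\xff\xd8'        # SOI marker starts every JPEG
    while True:
        marker, = unpack('>H', f.read(2))  # e.g. 0xffe0 (APP0) or 0xffe1 (APP1/Exif)
        if marker == 0xffd9:               # EOI has no payload
            break
        length, = unpack('>H', f.read(2))  # segment length, includes these 2 bytes
        payload = f.read(length - 2)
        print('marker %04x, %d payload bytes' % (marker, len(payload)))
        if marker == 0xffda:               # SOS: entropy-coded data follows, stop here
            break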

Perl strings encode utf 8

I am reading about Perl's Encode and utf8.
The doc says:
$octets = encode_utf8($string);
Equivalent to
$octets = encode("utf8", $string) .
The characters in $string are encoded in Perl's internal format, and
the result is returned as a sequence of octets.
I have no idea what this means. Isn't a string in Perl a sequence of octets (i.e. bytes) anyway?
So what is the difference between:
$string and $octets?
No, a string in Perl is a sequence of characters, not necessarily octets. The chr and ord functions (for transforming between integers and single characters), to name two, can deal with integer values larger than 255. For example
$string = "\x{0421}\x{041F}";
print ord($_)," " for split //, $string;
outputs
1057 1055
When a string is written to a terminal, file, or other output stream, however, the device receiving the string usually requires and expects bytes, and this is where encoding comes in. As you have seen, UTF-8 is a scheme for encoding code points as bytes: values up to 0x7F become a single byte, and values in the range 0x80-0x10FFFF become two to four bytes.
$octets = Encode::encode("utf-8", "\x{0421}\x{041F}");
print ord($_)," " for split //, $octets;
Now the output is
208 161 208 159
and suitable to be stored on a filesystem.
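Purely as a cross-language comparison (the question is about Perl, but the character/octet distinction is the same elsewhere), here is a small Python 3 sketch that produces the same four octets:
# U+0421 and U+041F are two characters; their UTF-8 encoding is four octets.
s = "\u0421\u041f"
print(len(s))                             # 2 (characters)
octets = s.encode("utf-8")
print(len(octets))                        # 4 (octets)
print(' '.join(str(b) for b in octets))   # 208 161 208 159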
Internally, perl (all lower case, meaning the interpreter, as opposed to Perl, the language) often uses UTF-8 to represent strings with "wide" characters, but this is not something you would ever normally have to worry about.

What the heck is a Perl string anyway?

I can't find a basic description of how string data is stored in Perl! It's like all the documentation assumes I already know this for some reason. I know about encode(), decode(), and I know I can read raw bytes into a Perl "string" and output them again without Perl screwing with them. I know about open modes. I also gather Perl must use some internal format to store character strings and can differentiate between character and binary data. Please, where is this documented?
Equivalent question is; given this perl:
$x = decode($y);
Decode to WHAT and from WHAT??
As far as I can figure, there must be a flag on the string data structure that says whether it holds binary or character data (of some internal format which, BTW, is a superset of Unicode - http://perldoc.perl.org/Encode.html#DESCRIPTION). But I'd like it if that were stated in the docs or confirmed/discredited here.
This is a great question. To investigate, we can dive a little deeper by using Devel::Peek to see what is actually stored in our strings (or other variables).
First let's start with an ASCII string:
$ perl -MDevel::Peek -E 'Dump "string"'
SV = PV(0x9688158) at 0x969ac30
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x969ea20 "string"\0
CUR = 6
LEN = 12
Then we can turn on the Unicode IO layers and do the same:
$ perl -MDevel::Peek -CSAD -E 'Dump "string"'
SV = PV(0x9eea178) at 0x9efcce0
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x9f0faf8 "string"\0
CUR = 6
LEN = 12
From there let's try manually adding some wide characters:
$ perl -MDevel::Peek -CSAD -e 'Dump "string \x{2665}"'
SV = PV(0x9be1148) at 0x9bf3c08
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x9bf7178 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
From that you can clearly see that Perl has interpreted this correctly as UTF-8. The problem is that if I don't give the octets using the \x{} escaping, the representation looks more like a plain string:
$ perl -MDevel::Peek -CSAD -E 'Dump "string ♥"'
SV = PV(0x9143058) at 0x9155cd0
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x9168af8 "string \342\231\245"\0
CUR = 10
LEN = 12
All Perl sees is bytes, and it has no way to know that you meant them as a Unicode character, unlike when you entered the escaped octets above. Now let's use decode and see what happens:
$ perl -MDevel::Peek -CSAD -MEncode=decode -E 'Dump decode "utf8", "string ♥"'
SV = PV(0x8681100) at 0x8683068
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x869dbf0 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
Ta-da! Now you can see that the string's internal representation is correct, matching what you entered when you used the \x{} escaping.
The actual answer is that it is "decoding" from bytes to characters, but I think it makes more sense when you see the Peek output.
Finally, you can make Perl treat your source code as UTF-8 by using the utf8 pragma, like so:
$ perl -MDevel::Peek -CSAD -Mutf8 -E 'Dump "string ♥"'
SV = PV(0x8781170) at 0x8793d00
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x87973b8 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
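For comparison only, the same bytes-to-characters step can be seen in a short Python 3 sketch (not part of the Perl answer; it just uses the same UTF-8 octets \342\231\245 that appear in the dumps above):
# The octets 0xE2 0x99 0xA5 are the UTF-8 encoding of U+2665 (the heart).
raw = b"string \xe2\x99\xa5"
print(len(raw))                # 10 bytes, matching CUR = 10 in the dumps
decoded = raw.decode("utf-8")
print(len(decoded))            # 8 characters
print(hex(ord(decoded[-1])))   # 0x2665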
Rather like the fluid string/number status of its scalar variables, the internal format of Perl's strings is variable and depends on the contents of the string.
Take a look at perluniintro, which says this.
Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it uses UTF-8.
What that means is that a string like "I have £ two" is stored as (bytes) I have \x{A3} two. (The pound sign is U+00A3.) Now if I append a character outside that range, such as U+263A - a smiling face - Perl will convert the whole string to UTF-8 before it appends the new character, giving (bytes) I have \xC2\xA3 two\xE2\x98\xBA. Removing this last character again leaves the string UTF-8 encoded, as I have \xC2\xA3 two.
But I wonder why you need to know this. Unless you are writing an XS extension in C the internal format is transparent and invisible to you.
Perl's internal string format is implementation-dependent, but usually a superset of UTF-8. It doesn't matter what it is, because you use decode and encode to convert strings between the internal format and other encodings.
decode converts to Perl's internal format; encode converts from Perl's internal format.
Binary data is stored internally the same way characters 0 through 255 are.
encode and decode just convert between formats. For example, encoding to UTF-8 produces a string in which every character is a single octet (a value from 0 through 255), i.e. the string consists of the UTF-8 octets.
Short answer: It's a mess
Slightly longer: The difference isn't visible to the programmer.
Basically you have to remember if your string contains bytes or characters, where characters are unicode codepoints. If you only encounter ASCII, the difference is invisible, which is dangerous.
Data itself and the representation of such data are distinct, and should not be confused. Strings are (conceptually) a sequence of codepoints, but are represented as a byte array in memory, and represented as some byte sequence when encoded. If you want to store binary data in a string, you re-interpret the number of a codepoint as a byte value, and restrict yourself to codepoints in 0–255.
(E.g. a file has no encoding. The information in that file has some encoding (be it ASCII, UTF-16 or EBCDIC at a character level, and Perl, HTML or .ini at an application level))
The exact storage format of a string is irrelevant, but you can store complete integers inside such a string:
# this will work if your perl was compiled with large integers
my $string = chr 2**64; # this is so not unicode
say ord $string; # 18446744073709551615
The internal format is adjusted accordingly to accommodate such values; normal strings won't take up one integer per character.
Perl can handle more than Unicode can, so it's very flexible. Sometimes you want to interface with something that cannot, so you can use encode(...) and decode(...) to handle those transformations. See http://perldoc.perl.org/utf8.html
