I'm trying to read hex values from a binary file. I don't have a problem with extracting a string and converting letters to hex values, but how can I do that with control characters and other non-printable characters? Is there a way of reading the string directly in hex values without the need of converting it?
Have a look over here:
As a last example, the following program makes a dump of a binary file. Again, the first program argument is the input file name; the output goes to the standard output. The program reads the file in chunks of 10 bytes. For each chunk, it writes the hexadecimal representation of each byte, and then it writes the chunk as text, changing control characters to dots.
local f = assert(io.open(arg[1], "rb"))
local block = 10
while true do
local bytes = f:read(block)
if not bytes then break end
for b in string.gmatch(bytes, ".") do -- string.gmatch was called string.gfind in Lua 5.0
io.write(string.format("%02X ", string.byte(b)))
end
io.write(string.rep(" ", block - string.len(bytes) + 1))
io.write(string.gsub(bytes, "%c", "."), "\n")
end
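For comparison, here is a minimal sketch of the same kind of dump in Python. The function name hex_dump and the sample bytes are made up for illustration. Unlike the Lua version, it pads short chunks so the text column stays aligned, and it replaces every byte outside printable ASCII (not just control characters) with a dot.

```python
def hex_dump(data, block=10):
    """Return one formatted line per chunk of `block` bytes."""
    width = block * 3 - 1  # two hex digits per byte plus separating spaces
    lines = []
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Iterating over bytes yields integers, so no conversion is needed.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{hex_part:<{width}}  {text}")
    return lines

for line in hex_dump(b"Hello\x00\x01world!\n"):
    print(line)
```

To dump a real file, read it with open(path, "rb").read() and pass the result to the function.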
From your question it's not clear what exactly you aim to do, so I'll give two approaches.
Either you have a file full of hex digits, and you convert each pair of digits to the byte it represents:
s='ABCDEF1234567890'
t={}
for val in s:lower():gmatch'(%x%x)' do
-- do whatever you want with the data
t[#t+1]=string.char(tonumber(val,16)) -- turn each pair of hex digits into one byte
end
Or you have a binary file, and you convert it to hex values:
s='kl978331asdfjhvkasdf'
t={s:byte(1,-1)}
I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequence of markers. Each marker starts with FF xx, where xx is a specific marker ID.
So my idea was to simply read in the image in binary format and search for the corresponding byte combinations in the binary stream. This should enable me to split the header into the corresponding marker fields.
For instance, this is, what I receive, when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does Python not return nice chunks of 2 bytes (in hex format)?
I would expect something like this:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those are no characters of a hex representation I expect from a byte string (only: 0-9, a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question summarizes to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (except a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does python not return me nice chunks of 2 bytes (in hex-format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two bytes, transform the data after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each 2 bytes or use unpack (there are actually several ways):
>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the big-endian specifier > and unsigned short H in unpack, because JPEG stores its 2-byte codes most significant byte first. Check the struct documentation if you want to deviate from this.
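To get from there to an actual header parser, a rough sketch could walk the segments one by one. It assumes the usual layout: an FF D8 start-of-image marker, followed by segments that each consist of FF xx and a big-endian 2-byte length which counts itself but not the marker. The name iter_segments is invented for this example, and markers that carry no length field (restart markers, padding FF bytes) are not handled.

```python
import struct

def iter_segments(data):
    """Yield (marker_id, payload) for each length-prefixed segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing start-of-image marker FF D8")
    pos = 2
    while pos + 4 <= len(data):
        prefix, marker, length = struct.unpack(">BBH", data[pos:pos + 4])
        if prefix != 0xFF:
            break  # not positioned on a marker, give up
        # The length includes its own two bytes, so the payload ends
        # at pos + 2 + length.
        yield marker, data[pos + 4:pos + 2 + length]
        if marker == 0xDA:
            break  # start of scan: compressed image data follows
        pos += 2 + length

sample = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
segments = list(iter_segments(sample))
for marker, payload in segments:
    print(f"FF{marker:02X}", len(payload), payload[:5])
```

With the 20 bytes shown earlier this finds a single FFE0 (APP0) segment whose payload starts with JFIF. For a real file you would read more than 20 bytes, since the header is usually longer than that.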
I have a binary vector which I converted to hex. I want this hex data to go to a .bin (or any other format) file. Since this is a vector, I tried to first convert it to a string so that the hex data can be formatted and then output to a file. Please see below the code I was trying to use. Also shown is the snapshot of the problem. As you can see, all the converted hex data is present in one cell. I want each cell to hold one byte.
my_new_vector = binaryVectorToHex(M); %M is my input binary matrix
%cellfun(FormatHexStr, mat2cell(my_new_vector), 'UniformOutput', false)
%new_vector = mat2cell(my_new_vector);
vect2str(my_new_vector); %[matlab file exchange function][2] for converting vector to string
FormatHexStr(my_new_vector,2); %FormatHexStr is a function for formatting hex values which requires a hex string as an input [function is here][3]
[n_rows,n_cols] = size(my_new_vector);
fileID = fopen('my_flipped_data.bin','wt');
for row = 1:n_cols
fprintf(fileID,'%d\n',my_new_vector(:,row));
end
fclose(fileID);
Convert the binary values into blocks of 8 in a uint8 (byte) variable, then write it to file.
nbytes = floor(length(M)/8);
bytevec = zeros(1,nbytes, 'uint8');
for i = 1:8
bytevec = bytevec + uint8((2^(8-i))*M(i:8:8*nbytes)); % stop at 8*nbytes so trailing bits that do not fill a byte are ignored
end
fileID = fopen('my_flipped_data.bin','wb');
fwrite(fileID, bytevec);
fclose(fileID);
This writes the first bit as the MSB of the first byte. Use (2^(i-1)) for first bit as LSB.
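If you want to cross-check the bit order outside MATLAB, the same packing can be sketched in Python. The function pack_bits and its msb_first switch are made-up names for this illustration, and bits is assumed to be a flat list of 0/1 values.

```python
def pack_bits(bits, msb_first=True):
    """Pack a flat sequence of 0/1 values into bytes, 8 bits per byte."""
    nbytes = len(bits) // 8  # trailing bits that do not fill a byte are dropped
    out = bytearray(nbytes)
    for k in range(nbytes):
        chunk = bits[8 * k:8 * k + 8]
        if not msb_first:
            chunk = chunk[::-1]  # first bit becomes the least significant
        value = 0
        for bit in chunk:
            value = (value << 1) | (bit & 1)
        out[k] = value
    return bytes(out)

bits = [1, 0, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 1, 1, 1, 1]
print(pack_bits(bits).hex())                    # first bit as MSB
print(pack_bits(bits, msb_first=False).hex())   # first bit as LSB
```

The result can be written with open('my_flipped_data.bin', 'wb').write(...) and compared byte for byte with the file produced by fwrite.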
New to this python thing.
A little while ago I saved off output from an external device that was connected to a serial port as I would not be able to keep that device. I read in the data at the serial port as bytes with the intention of creating an emulator for that device.
The data was saved one 'byte' per line to a file as example extract below.
b'\x9a'
b'X'
b'}'
b'}'
b'x'
b'\x8c'
I would like to read in each line from the data capture and append what would have been the original byte to a byte array.
I have tried various append() and concatenation operations (+=) on a bytearray but the above lines are python string objects and these operations fail.
Is there an easy way (a built-in way?) to add each of the original byte values of these lines to a byte array?
Thanks.
M
Update
I came across the .encode() string method and have created a crude function to meet my needs.
def string2byte(str):
    # the 'byte' string will consist of 5, 6 or 8 characters including the newline
    # as in b',' or b'\r' or b'\x0A'
    if len(str) == 5:
        return str[2].encode('Latin-1')
    elif len(str) == 6:
        return str[3].encode('Latin-1')
    else:
        return str[4:6].encode('Latin-1')
...well, it is functional.
If anyone knows of a more elegant solution perhaps you would be kind enough to post this.
b'\x9a' is a literal representation of the byte 0x9a in Python source code. If your file literally contains these seven characters b'\x9a', that is wasteful, because the value could have been saved using only one byte. You can convert each literal back to a byte using ast.literal_eval():
import ast
with open('input') as file:
    a = b"".join(map(ast.literal_eval, file)) # assume 1 byte literal per line
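If you want something a little more defensive, here is a sketch that fills a bytearray line by line. It uses an in-memory io.StringIO holding the sample lines from the question in place of the real capture file, skips blank lines, and rejects any line that is not a bytes literal.

```python
import ast
import io

# Stand-in for the capture file; replace with open('input') for real data.
capture = io.StringIO("b'\\x9a'\nb'X'\nb'}'\nb'}'\nb'x'\nb'\\x8c'\n")

data = bytearray()
for number, line in enumerate(capture, start=1):
    line = line.strip()
    if not line:
        continue  # tolerate blank lines
    value = ast.literal_eval(line)  # only evaluates literals, never runs code
    if not isinstance(value, bytes):
        raise ValueError(f"line {number} is not a bytes literal: {line!r}")
    data += value

print(data.hex())
```

This also copes with lines such as b'\r', because the escape is interpreted by the literal parser instead of being picked apart by character position.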
I can't find a basic description of how string data is stored in Perl! It's like all the documentation is assuming I already know this for some reason. I know about encode(), decode(), and I know I can read raw bytes into a Perl "string" and output them again without Perl screwing with them. I know about open modes. I also gather Perl must use some internal format to store character strings and can differentiate between character and binary data. Please, where is this documented???
Equivalent question is; given this perl:
$x = decode($y);
Decode to WHAT and from WHAT??
As far as I can figure there must be a flag on the string data structure that says this is binary XOR character data (of some internal format which BTW is a superset of Unicode -http://perldoc.perl.org/Encode.html#DESCRIPTION). But I'd like it if that were stated in the docs or confirmed/discredited here.
This is a great question. To investigate, we can dive a little deeper by using Devel::Peek to see what is actually stored in our strings (or other variables).
First, let's start with an ASCII string:
$ perl -MDevel::Peek -E 'Dump "string"'
SV = PV(0x9688158) at 0x969ac30
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x969ea20 "string"\0
CUR = 6
LEN = 12
Then we can turn on the Unicode I/O layers and do the same:
$ perl -MDevel::Peek -CSAD -E 'Dump "string"'
SV = PV(0x9eea178) at 0x9efcce0
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x9f0faf8 "string"\0
CUR = 6
LEN = 12
From there, let's try to manually add a wide character:
$ perl -MDevel::Peek -CSAD -e 'Dump "string \x{2665}"'
SV = PV(0x9be1148) at 0x9bf3c08
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x9bf7178 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
From that you can clearly see that Perl has interpreted this correctly as UTF-8. The problem is that if I don't give the character using the \x{} escape, but type it literally, the representation looks more like the regular string:
$ perl -MDevel::Peek -CSAD -E 'Dump "string ♥"'
SV = PV(0x9143058) at 0x9155cd0
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x9168af8 "string \342\231\245"\0
CUR = 10
LEN = 12
All Perl sees is bytes, and it has no way to know that you meant them as a Unicode character, unlike when you entered the escaped code point above. Now let's use decode and see what happens:
$ perl -MDevel::Peek -CSAD -MEncode=decode -E 'Dump decode "utf8", "string ♥"'
SV = PV(0x8681100) at 0x8683068
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x869dbf0 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
Ta-da! Now you can see that the string's internal representation matches what you got when you used the \x{} escape.
The actual answer is it is "decoding" from bytes to characters, but I think it makes more sense when you see the Peek output.
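The same bytes-versus-characters distinction can be seen in a few lines of Python, used here only as an analogy since its bytes and str types make the two sides explicit. The variable names are made up for the example.

```python
# Ten raw octets: "string " followed by the UTF-8 encoding of U+2665.
raw = b"string \xe2\x99\xa5"

# Decoding turns octets into characters.
text = raw.decode("utf-8")

print(len(raw), len(text))          # octet count vs character count
print(text[-1] == "\u2665")         # the last character is the heart
print(text.encode("utf-8") == raw)  # encoding goes back to the same octets
```

The octet count of 10 matches the CUR = 10 shown in the Peek output, while the decoded text has 8 characters.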
Finally, you can make Perl see your source code as UTF-8 by using the utf8 pragma, like so:
$ perl -MDevel::Peek -CSAD -Mutf8 -E 'Dump "string ♥"'
SV = PV(0x8781170) at 0x8793d00
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x87973b8 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
Rather like the fluid string/number status of its scalar variables, the internal format of Perl's strings is variable and depends on the contents of the string.
Take a look at perluniintro, which says this.
Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it uses UTF-8.
What that means is that a string like "I have £ two" is stored as (bytes) I have \x{A3} two. (The pound sign is U+00A3.) Now if I append a character whose code point is above 0xFF, such as U+263A (a smiling face), Perl will convert the whole string to UTF-8 before it appends the new character, giving (bytes) I have \xC2\xA3 two\xE2\x98\xBA. Removing this last character again leaves the string UTF-8 encoded, as I have \xC2\xA3 two.
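The byte sequences quoted above are easy to check with any tool that can encode text. Here is a small sketch in Python, used purely as a byte calculator; it says nothing about how Python or Perl lay out strings in memory.

```python
plain = "I have \u00a3 two"      # U+00A3 is the pound sign
smiley = plain + "\u263a"        # append U+263A, the smiling face

latin1_bytes = plain.encode("latin-1")         # one byte per character
utf8_bytes = smiley.encode("utf-8")            # multi-byte sequences appear
utf8_after_removal = smiley[:-1].encode("utf-8")

print(latin1_bytes)
print(utf8_bytes)
print(utf8_after_removal)
```

The three results correspond to the three byte strings in the paragraph: \xA3 alone in the eight-bit form, and \xC2\xA3 once the string is held as UTF-8.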
But I wonder why you need to know this. Unless you are writing an XS extension in C the internal format is transparent and invisible to you.
Perl's internal string format is implementation dependent, but usually a superset of UTF-8. It doesn't matter what it is, because you use decode and encode to convert strings between the internal format and other encodings.
Decode converts to Perl's internal format; encode converts from Perl's internal format.
Binary data is stored internally the same way characters 0 through 255 are.
Encode and decode just convert between formats. For example, after encoding to UTF-8, every character of the resulting string has a value from 0 through 255, i.e. the string consists of UTF-8 octets.
Short answer: It's a mess
Slightly longer: The difference isn't visible to the programmer.
Basically you have to remember if your string contains bytes or characters, where characters are unicode codepoints. If you only encounter ASCII, the difference is invisible, which is dangerous.
Data itself and the representation of such data are distinct, and should not be confused. Strings are (conceptually) a sequence of codepoints, but are represented as a byte array in memory, and represented as some byte sequence when encoded. If you want to store binary data in a string, you re-interpret the number of a codepoint as a byte value, and restrict yourself to codepoints in 0–255.
(E.g. a file has no encoding. The information in that file has some encoding (be it ASCII, UTF-16 or EBCDIC at a character level, and Perl, HTML or .ini at an application level))
The exact storage format of a string is irrelevant, but you can store complete integers inside such a string:
# this will work if your perl was compiled with large integers
my $string = chr 2**64; # this is so not unicode
say ord $string; # 18446744073709551615
The internal format is adjusted to accommodate such values; normal strings won't take up one integer per character.
Perl can handle more than Unicode can, so it's very flexible. Sometimes you want to interface with something that cannot, so you can use encode(...) and decode(...) to handle those transformations. See http://perldoc.perl.org/utf8.html
I'm reading in a lot of lines of hex data. They come in as strings and I parse them for line_codes which tell me what to do with the rest of the data. One line sets a most significant word of an address (MSW), another line sets the least significant (LSW).
I then need to concatenate those together such that if MSW = "00ff" and LSW = "f10a"
address would be 00fff10a.
This all went fine, but then I was supposed to check if address was between a certain set of values:
if address <= "007FFFh" and address >= "000200h" then
print "I'm in"
end
As you all probably know, Lua is not a fan of this as it gives me an error using <= and >= with strings.
Is there a way I can convert the string into a hex number, such that "FFFF" would become 0xFFFF?
You use tonumber:
local someHexString = "03FFACB"
local someNumber = tonumber(someHexString, 16)
Note that numbers are not in hexadecimal. Nor are they in decimal, octal, or anything else. They're just numbers. The number 0xFF is the same number as 255. "FF" and "255" are string representations of the same number.
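The same idea, sketched in Python for comparison: build the address string from the two words, convert it once with base 16, and compare numbers with numbers. The function name in_window is invented for the example; the bounds are the ones from the question with the trailing "h" dropped.

```python
def in_window(msw, lsw, low=0x000200, high=0x007FFF):
    """Join two hex words into one address and range-check it numerically."""
    address = int(msw + lsw, 16)  # "00ff" + "f10a" is parsed as 0x00fff10a
    return low <= address <= high

print(int("FF", 16) == 255)        # same number, two spellings
print(in_window("00ff", "f10a"))   # 0x00fff10a is above the window
print(in_window("0000", "1234"))   # 0x1234 falls inside it
```

In Lua the equivalent conversion is tonumber(MSW .. LSW, 16), after which <= and >= work because both sides are numbers.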