Understanding the zlib header; CMF (CM, CINFO), FLG, (FDICT/DICTID, FLEVEL); RFC1950 § 2.2. Data format - python-3.x

I am curious about the zlib data format and am trying to understand the zlib header as described in RFC 1950 (https://www.rfc-editor.org/rfc/rfc1950). I am, however, new to this kind of low-level interpretation and seem to have gone astray with some of my conclusions.
I have the following compressed data (from a PDF stream object):
b'h\xdebbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01\x02\x0c\x00!\xa4\x03\xc4'
In python, I have successfully decompressed and re-compressed the data:
b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4'
As I understand the discussion/answer in Deflate and inflate for PDF, using zlib C++,
the difference in the compressed output should not matter; it is simply the effect of different settings being used to compress the data.
Assuming the last four bytes !\xa4\x03\xc4 are the ADLER32 (Adler-32 checksum), my questions pertain to the first two bytes.
  0   1     0   1   2   3                             0   1   2   3
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
|CMF|FLG| |   [DICTID]    | |...compressed data...| |    ADLER32    |
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
CMF
The first byte represents the CMF, which in my two instances would be
chr h = dec 104 = hex 68 = 01101000
and chr x = dec 120 = hex 78 = 01111000
This byte is divided into a 4-bit compression method and a 4-bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
+----|----+ +----|----+ +----|----+
|0000|0000| i.e. |0110|1000| and |0111|1000|
+----|----+ +----|----+ +----|----+
CM |CINFO CM |CINFO CM |CINFO
Where
[CM] identifies the compression method used in the file.
CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG (see references [1] and [2] in Chapter 3 of the RFC).
CM = 15 is reserved.
and
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8.
As I understand it,
the only valid CM is 8
CINFO can be 0-7
Cf https://stackoverflow.com/a/34926305/7742349
You should NOT assume that it's always 8. Instead, you should check it and, if it's not 8, throw a "not supported" error.
Cf https://groups.google.com/forum/#!msg/comp.compression/_y2Wwn_Vq_E/EymIVcQ52cEJ
An exhaustive list of all 64 current possibilities for zlib headers:
COMMON
78 01
78 5e
78 9c
78 da
RARE
08 1d 18 19 28 15 38 11 48 0d 58 09 68 05
08 5b 18 57 28 53 38 4f 48 4b 58 47 68 43
08 99 18 95 28 91 38 8d 48 89 58 85 68 81
08 d7 18 d3 28 cf 38 cb 48 c7 58 c3 68 de
VERY RARE
08 3c 18 38 28 34 38 30 48 2c 58 28 68 24 78 3f
08 7a 18 76 28 72 38 6e 48 6a 58 66 68 62 78 7d
08 b8 18 b4 28 b0 38 ac 48 a8 58 a4 68 bf 78 bb
08 f6 18 f2 28 ee 38 ea 48 e6 58 e2 68 fd 78 f9
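(As a side note, this list can be reproduced by brute force: for CM = 8 there are 8 possible CINFO values, 2 FDICT values and 4 FLEVEL values, and for each combination FCHECK is forced to the single value that makes CMF*256 + FLG divisible by 31, which gives exactly 64 headers. Below is a minimal Python sketch of that enumeration; it is mine, not part of the quoted post.)
# Sketch: enumerate the 64 possible zlib headers for CM = 8.
# FCHECK (bits 0-4 of FLG) is chosen so that (CMF*256 + FLG) % 31 == 0.
headers = []
for cinfo in range(8):               # CINFO: bits 4-7 of CMF
    cmf = (cinfo << 4) | 8           # CM = 8 ("deflate") in bits 0-3
    for fdict in (0, 1):             # FDICT: bit 5 of FLG
        for flevel in range(4):      # FLEVEL: bits 6-7 of FLG
            flg = (flevel << 6) | (fdict << 5)
            flg |= (31 - (cmf * 256 + flg) % 31) % 31   # fill in FCHECK
            headers.append(f"{cmf:02x} {flg:02x}")

print(len(headers))                              # 64
print("78 9c" in headers, "68 de" in headers)    # True True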
Q1 My first question is simply
Why does CINFO come before CM? I.e.,
why is it not 87, 80, 81, 82, 83, ...?
As far as I know, byte order is not an issue here. I suspect it may be related to the least significant bit (RFC1950 § 2.1. Overall conventions), but I cannot quite understand how it would result in, e.g., 78 instead of 87...
Q2 My second question
If CINFO 7 represents "a window size up to 32K", then what does 1-6 correspond to? (assuming 0 means window size 0, as in, no compression applied).
FLG
The second byte represents the FLG
\xde -> 11011110
\xda -> 11011010
[FLG] [...] is divided as follows:
bits 0 to 4 FCHECK (check bits for CMF and FLG)
bit 5 FDICT (preset dictionary)
bits 6 to 7 FLEVEL (compression level)
+-----|-|--+ +-----|-|--+ +-----|-|--+
|00000|0|00| i.e. |11011|1|10| and |11011|0|10|
+-----|-|--+ +-----|-|--+ +-----|-|--+
C |D| L C |D| L C |D| L
Bits 0-4, as far as I can tell, are some form of "checksum" or integrity control?
Bit 5 indicates whether a dictionary is present.
FDICT (Preset dictionary)
If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor.
Q3 My third question
Assuming that "1" indicates "is set"
\xde -> 11011_1_10
\xda -> 11011_0_10
According to the specification, DICTID consists of 4 bytes. The four bytes that follow in the compressed streams I have are
bbd\x10
cbd\x10
Why are the compressed data from the PDF stream object (with FDICT 1) and the compressed data from Python's zlib (with FDICT 0) almost identical?
Granted, I do not understand the function of DICTID, but is it not supposed to be present only if FDICT is set?
Q4 My fourth question
Bits 6-7 set the FLEVEL (Compression level)
These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile.
I would have thought that the flags would be:
0 (00)
1 (01)
2 (10)
3 (11)
However, from What does a zlib header look like?:
01 (00000001) - No Compression/low
[5e (01011110) - Default Compression?]
9c (10011100) - Default Compression
da (11011010) - Best Compression
I note, however, that the two left-most bits do seem to correspond to what I expected; I feel I am obviously failing to comprehend something fundamental about how to interpret the bits...

The RFC says:
CMF (Compression Method and flags)
This byte is divided into a 4-bit compression method and a 4-
bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
The least significant bit of a byte is bit 0. The most significant bit is bit 7. So the diagram you made for mapping CM and CINFO to bits is backwards. 0x78 and 0x68 both have a CM of 8; their CINFOs are 7 and 6, respectively.
CINFO is what the RFC says it is:
CINFO (Compression info)
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
size, minus eight (CINFO=7 indicates a 32K window size).
So a CINFO of 7 means a 32 KiB window, and 6 means a 16 KiB window. CINFO == 0 does not mean no compression; it means a window size of 256 bytes.
For the flag byte, you got it backwards again. FDICT is not set. For both of your examples, the compression level is 11, maximum compression.
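To make the bit positions concrete, here is a small Python sketch (mine, not part of the original answer) that pulls the fields out of the two headers from the question using shifts and masks, and also verifies the FCHECK constraint from the RFC (CMF*256 + FLG must be a multiple of 31):
def parse_zlib_header(cmf, flg):
    # Bit 0 is the least significant bit, so CM sits in the LOW nibble.
    cm     = cmf & 0x0F          # bits 0-3 of CMF
    cinfo  = (cmf >> 4) & 0x0F   # bits 4-7 of CMF
    fcheck = flg & 0x1F          # bits 0-4 of FLG
    fdict  = (flg >> 5) & 0x01   # bit 5 of FLG
    flevel = (flg >> 6) & 0x03   # bits 6-7 of FLG
    if cm != 8:
        raise ValueError("unsupported compression method")
    if (cmf * 256 + flg) % 31 != 0:
        raise ValueError("FCHECK failed")
    window = 1 << (cinfo + 8)    # 2**(CINFO+8) bytes, so CINFO=7 -> 32768
    return cm, cinfo, window, fdict, flevel

print(parse_zlib_header(0x68, 0xde))   # (8, 6, 16384, 0, 3)
print(parse_zlib_header(0x78, 0xda))   # (8, 7, 32768, 0, 3)
Both headers report CM = 8, FDICT = 0 and FLEVEL = 3 (maximum compression); only the window size differs.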

Related

replacing a value in python

I'm writing a bingo game in python. So far I can generate a bingo card and print it.
My problem is after I've randomly generated a number to call out, I don't know how to 'cross out' that number on the card to note that it's been called out.
This is the output; it's a randomly generated card:
B 11 13 14 2 1
I 23 28 26 27 22
N 42 45 40 33 44
G 57 48 59 56 55
O 66 62 75 63 67
I was thinking of using random.sample and pop to generate a number to call out (in bingo the numbers go from 1 to 75):
random_draw_list = random.sample(range(1, 76), 75)
number_drawn = random_draw_list.pop()
How can I write a function that will 'cross out' a number on the card after it's been called?
So, for example, if number_drawn results in 11, it should replace 11 on the card with an X or a zero.
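One possible approach (a minimal sketch, assuming the card is stored as a list of rows, each row being a letter followed by its five numbers, as in the printout above):
# Hypothetical card layout matching the printed example: each row is
# [letter, n1, n2, n3, n4, n5].
card = [
    ['B', 11, 13, 14,  2,  1],
    ['I', 23, 28, 26, 27, 22],
    ['N', 42, 45, 40, 33, 44],
    ['G', 57, 48, 59, 56, 55],
    ['O', 66, 62, 75, 63, 67],
]

def cross_out(card, number_drawn):
    """Replace number_drawn with 'X' wherever it appears on the card."""
    for row in card:
        for i, value in enumerate(row):
            if value == number_drawn:
                row[i] = 'X'

cross_out(card, 11)
for row in card:
    print(*row)        # the B row now starts with: B X 13 14 2 1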

I've had some issue testing SHA256

When using SHA256, it always returned characters from the standard English alphabet (lowercase) and Arabic numerals (0-9), so the character set returned was [a-z] U [0-9].
The reason this confuses me is that I've heard SHA256 should have 2^256 different results; since each bit is "random", each byte should be represented by a completely random ASCII character, not one that fits into a restricted set of 36 characters (26 letters and 10 numerals).
Basically, I want to know whether my SHA256 is behaving properly and, if it is, why it looks like this. I am using the standard sha256sum utility that comes with Linux.
Yes, your assumption is correct. SHA256 will generate a total of 32 bytes (= 256 bits), each byte having an arbitrary value between 0 and 255 (inclusive).
But herein lies the problem: most of those bytes do not represent valid ASCII characters (only 0-127 do), and some of them are invisible (space, tab, linefeed, and several control characters).
To "render" the SHA256, the bytes are encoded in hexadecimal format. A single byte is represented by 2 characters. 00 = 0, 7f = 127, ff = 255.
The SHA256 hash of the empty string is e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, or if each byte is converted to decimal:
227 176 196 66 152 252 28 20 154 251 244 200 153 111 185 36 39 174 65 228 100 155 147 76 164 149 153 27 120 82 184 85
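A short Python sketch (mine) that makes the distinction visible, using hashlib rather than the sha256sum command line:
import hashlib

h = hashlib.sha256(b"")      # hash of the empty string

raw = h.digest()             # 32 raw bytes, values 0-255
hexed = h.hexdigest()        # 64 hex characters drawn only from [0-9a-f]

print(len(raw), len(hexed))  # 32 64
print(hexed)                 # e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
print(list(raw)[:4])         # [227, 176, 196, 66] -- the same bytes, shown as decimals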

Where to place the return statement when defining a function to read in a file using with open(...) as ...?

I have a text file consisting of data that is separated by tab-delimited columns. There are many ways to read data in from the file into python, but I am specifically trying to use a method similar to one outlined below. When using a context manager like with open(...) as ..., I've seen that the general concept is to have all of the subsequent code indented within the with statement. Yet when defining a function, the return statement is usually placed at the same indentation as the first line of code within the function (excluding cases with awkward if-else loops). In this case, both approaches work. Is one method considered correct or generally preferred over the other?
import numpy as np

def read_in(fpath, contents=[], row_limit=np.inf):
    """
    fpath is filelocation + filename + '.txt'
    contents is the initial data that the file data will be appended to
    row_limit is the maximum number of rows to be read (in case one would like to not read in every row).
    """
    nrows = 0
    with open(fpath, 'r') as f:
        for row in f:
            if nrows < row_limit:
                contents.append(row.split())
                nrows += 1
            else:
                break
        # return contents
    return contents
Below is a snippet of the text-file I am using for this example.
1996 02 08 05 17 49 263 70 184 247 126 0 -6.0 1.6e+14 2.7e+28 249
1996 02 12 05 47 26 91 53 160 100 211 236 2.0 1.3e+15 1.6e+29 92
1996 02 17 02 06 31 279 73 317 257 378 532 9.9 3.3e+14 1.6e+29 274
1996 02 17 05 18 59 86 36 171 64 279 819 27.9 NaN NaN 88
1996 02 19 05 15 48 98 30 266 129 403 946 36.7 NaN NaN 94
1996 03 02 04 11 53 88 36 108 95 120 177 1.0 1.5e+14 8.7e+27 86
1996 03 03 04 12 30 99 26 186 141 232 215 2.3 1.6e+14 2.8e+28 99
And below is a sample call.
fpath = "/Users/.../sample_data.txt"
data_in = read_in(fpath)
for i in range(len(data_in)):
    print(data_in[i])
(I realize that it's better to use chunks of pre-defined sizes to read in data, but the number of characters per row of data varies. So I'm instead trying to give user control over the number of rows read in; one could read in a subset of the rows at a time and append them into contents, continually passing them into read_in - possibly in a loop - if the file size is large enough. That said, I'd love to know if I'm wrong about this approach as well, though this isn't my main question.)
If your function needs to do some other things after it is done with the file, you usually do them outside the with block. So essentially you need to return outside the with block too.
However if the purpose of your function is just to read in a file, you can return within the with block, or outside it. I believe none of the methods are preferred in this case.
I don't really understand your second question.
You can also put the return inside the with context.
On exiting the context, the cleanup is done. That is the power of with: you do not need to check every possible exit path. Note that the exit handler is also called if an exception is raised inside the with block.
But if the file is empty (for example), you should still return something, so in that case your code is clearer if it follows the principle of one exit path. If, however, you need to handle reaching the end of the file without finding something important, I would put the normal return inside the with context and handle the special case after it.
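To illustrate the point about cleanup, here is a minimal sketch (mine, not from either answer): returning from inside the with block still closes the file, so both placements are safe, and the choice is mostly about readability and having a single exit path.
def first_line_inside(fpath):
    with open(fpath) as f:
        return f.readline()      # f is closed on the way out, even here

def first_line_outside(fpath):
    with open(fpath) as f:
        line = f.readline()
    return line                  # single exit path after the block

# Both functions behave the same and neither leaks an open file handle;
# an exception raised inside the block would also trigger the cleanup.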

What encoding/compression algorithm is this?

I'm trying to reverse engineer a binary file format, but it has no magic bytes, and no specific extension. I can influence only a single aspect of the file: a short string. By trying different strings, I was able to figure out how data is stored in the file. It seems that the whole file uses some sort of simple encoding. I hope that finding the exact encoding allows me to narrow down my search for the file format. I know the file is generated by a Windows program written in C++.
Now, after much trial-and-error, I found that some sections of the file are encoded in runs. Each run starts with a byte that indicates how many bytes will follow and where the data is retrieved.
000ddddd (1 byte): take the following (ddddd)+1 bytes from the encoded data.
111····· ···ddddd ···bbbbb (3 bytes): go back (bbbbb)+1 bytes in the decoded data, and take the next (ddddd)+9 bytes from it.
ddd····· ··bbbbbb (2 bytes): go back (bbbbbb)+1 bytes in the decoded data, and take the next (ddd)+2 bytes from it.
Here's an example:
This is the start of the file, with the UTF-16 string abracadabra encoded in it:
. . . a . b . r . . c . . d . € .
0C 20 03 04 61 00 62 00 72 20 05 00 63 20 03 00 64 20 03 80 0D
To decode the string:
0C number of Unicode chars: 12 (11 chars + \0)
20 03 . . . ??
04 next 5
61 00 a .
62 00 b .
72 r
20 05 . a . back 6, take 3
00 next 1
63 c
20 03 . a . back 4, take 3
00 next 1
64 d
20 03 . a . back 4, take 3
80 0D b . r . a . back 14, take 6
This results in (UTF-16):
a . b . r . a . c . a . d . a . b . r . a .
61 00 62 00 72 00 61 00 63 00 61 00 64 00 61 00 62 00 72 00 61 00
However, I have no clue as to what encoding/compression algorithm this might be. It looks like some variant of LZ that doesn't use a dictionary (like LZ77), but so far I haven't been able to find any algorithm that matches this description. I'm also not sure whether the entire file is encoded like this, or only portions of it.
Do you know this encoding? Or do you have any hints for things I might look for in the file to identify the encoding?
After your edit I think it's LZF with the following differences to your observations:
The magic header and indication of compressed vs uncompressed has been removed in your example (not too surprising if it's embedded in a file).
You took the block length as one byte, but it is two bytes and big-endian, so the prior 0x00 is part of the length, which still works.
Could be NTFS compression, which is LZNT1. This idea is supported by the platform and the apparent 2-byte structure, along with the byte-alignment of the actual data.
The following elements are specific to this algorithm.
Chunks: Segments of data that are compressed, uncompressed, or that denote the end of the buffer.
Chunk header: The header for a compressed or uncompressed chunk of data.
Flag bytes: A bit flag whose bits, read from low order to high order, specify the formats of the data elements that follow. For example, bit 0 corresponds to the first data element, bit 1 to the second, and so on. If the bit corresponding to a data element is set, the element is a 2-byte compressed word; otherwise, it is a 1-byte literal value.
Flag group: A flag byte followed by zero or more data elements, each of which is a single literal byte or a 2-byte compressed word.
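To make the flag-group idea concrete, here is a rough Python sketch (mine, and deliberately incomplete: it only classifies elements as described in the quoted definition, it does not decode LZNT1's variable length/displacement split):
def walk_flag_group(data, pos):
    """Classify the up-to-8 elements governed by the flag byte at data[pos].

    Returns a list of ('literal', byte) or ('copy', two_byte_word) tuples.
    Decoding the copy word into length/displacement is omitted here.
    """
    flags = data[pos]
    pos += 1
    elements = []
    for bit in range(8):                 # bit 0 = first element, low to high
        if pos >= len(data):
            break
        if flags & (1 << bit):           # set bit -> 2-byte compressed word
            word = int.from_bytes(data[pos:pos + 2], "little")
            elements.append(("copy", word))
            pos += 2
        else:                            # clear bit -> 1-byte literal value
            elements.append(("literal", data[pos]))
            pos += 1
    return elements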

HEX & Decimal conversion

I have a binary file; the definition of its content is below (all data is stored
in little endian, i.e. least significant byte first). The example numbers below are hex.
11 63 39 46 --- Time, UTC in seconds since 1 Jan 1970.
01 00 --- 0001 = No Fix, 0002 = SPS
97 85 ff e0 7b db 4c 40 --- Latitude, as double
a1 d5 ce 56 8d 26 28 40 --- Longitude, as double
f0 37 e1 42 --- Height in meters, as float
fe 2b f0 3a --- Speed in km/h, as float
00 00 00 00 --- Heading (degrees ?), as float
01 00 --- RCR, log reason. 0001=Time, 0004=Distance
59 20 6a f3 4a 26 e3 3f --- Distance in meters, as double,
2a --- ? Don't know
a8 --- Checksum, xor of all bytes above not including 0x2a
The data from the binary file, in hex, is as below:
"F25D39460200269652F5032445401F4228D79BCC54C09A3A2743B4ADE73F2A83"
I would appreciate it if you could help me translate this data line based on the definition above.
Probably wrong, but here's a shot at it using Ruby:
hex = "F25D39460200269652F5032445401F4228D79BCC54C09A3A2743B4ADE73F2A83"
ints = hex.scan(/../).map{ |s| s.to_i(16) }
raw = ints.pack('C*')
fields = raw.unpack( 'VvEEVVVvE')
p fields
#=> [1178164722, 2, 42.2813707974677, -83.1970117467067, 1126644378, 1072147892, nil, 33578, nil]
p Time.at( fields.first )
#=> 2007-05-02 21:58:42 -0600
I'd appreciate it if someone well-versed in #pack and #unpack would show me a better way to accomplish the first three lines.
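For comparison, a rough Python equivalent using struct (my sketch, with the caveat that the sample line is only 32 bytes, so it appears to hold only the time, fix, latitude, longitude, height and speed fields plus the trailing 0x2A and checksum bytes, not the full record described above):
import struct
from datetime import datetime, timezone

hex_line = "F25D39460200269652F5032445401F4228D79BCC54C09A3A2743B4ADE73F2A83"
raw = bytes.fromhex(hex_line)

# '<' = little endian: uint32 time, uint16 fix, double lat, double lon,
# float height, float speed, then the unknown byte and the checksum byte.
time_s, fix, lat, lon, height, speed, unknown, checksum = struct.unpack(
    "<IHddffBB", raw)

print(datetime.fromtimestamp(time_s, tz=timezone.utc))  # 2007-05-03 03:58:42+00:00
print(fix, lat, lon, height, speed)

# Checksum check per the definition: XOR of all bytes before the 0x2A byte.
calc = 0
for b in raw[:-2]:
    calc ^= b
print(hex(checksum), hex(calc))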
My Cygnus Hex Editor could load such a file and, using structure templates, display the data in its native formats.
Beyond that, it's just a matter of going through each value and working out the translation for each byte.
