I've had an issue testing SHA256 - Linux

When using SHA256, it always returned characters from the standard English alphabet (lowercase) and Arabic numerals (0-9), so the character set returned was [a-z] ∪ [0-9].
The reason this confuses me is that I've heard a SHA256 hash should have 2^256 different possible results. Since each bit is "random", each byte should be represented by a completely random ASCII character, not one that fits into a restricted set of 36 characters (26 letters and 10 numerals).
Basically, I want to know whether my SHA256 is behaving properly and, if it is, why it behaves like this. I am using the standard sha256sum utility that comes with Linux.

Yes, your assumption is correct. SHA256 will generate a total of 32 bytes (= 256 bits), each byte having an arbitrary value between 0 and 255 (inclusive).
But herein lies the problem: most of those bytes do not represent valid ASCII characters (ASCII only covers 0-127), and some of them are invisible (space, tab, linefeed, and several control characters).
To "render" the SHA256 hash, the bytes are encoded in hexadecimal, so a single byte is represented by 2 characters: 00 = 0, 7f = 127, ff = 255.
The SHA256 hash of the empty string is e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, or if each byte is converted to decimal:
227 176 196 66 152 252 28 20 154 251 244 200 153 111 185 36 39 174 65 228 100 155 147 76 164 149 153 27 120 82 184 85
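As a minimal illustration (added here, not part of the original answer), Python's hashlib shows both forms side by side:

import hashlib

h = hashlib.sha256(b"")   # hash of the empty string
print(list(h.digest()))   # 32 raw byte values in 0-255 (the decimal list above)
print(h.hexdigest())      # 64 hex characters drawn only from [0-9a-f]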

Related

VBA string to byte array conversion giving incorrect results for certain values only

I have a routine that takes a string and turns it into an array of numbers. This is in VBA running in Excel, as part of Office Professional 2019.
The code below is a demo version that encapsulates the original code and illustrates the problem.
I need to display the numerical equivalent of each character in the string, so I am using CStr(by) elsewhere in the code.
Public Sub TestByteFromString()
    '### vars
    Dim ss As String, i As Integer
    Dim arrBytes() As Byte
    Dim by As Byte
    '###
    ss = ""
    For i = 0 To 127 Step 1
        ss = ss & Chr(Val(i + 126))
    Next i
    arrBytes = ss
    '###
    For i = LBound(arrBytes) To UBound(arrBytes) Step 2
        by = arrBytes(i)
        Debug.Print "Index " & CStr(i) & " Byte " & CStr(by) & " Original " & CStr((i / 2 + 126)) & " Difference = " & CStr(((i / 2 + 126)) - CInt(by))
    Next i
    '###
End Sub
It seems to work fine except for certain values greater than 126, some of which are shown by the demo above.
I am getting these results and cannot see an explanation or a consistent pattern. Does anyone see what is wrong?
Index 0 Byte 126 Original 126 Difference = 0
Index 2 Byte 127 Original 127 Difference = 0
Index 4 Byte 172 Original 128 Difference = -44
Index 6 Byte 129 Original 129 Difference = 0
Index 8 Byte 26 Original 130 Difference = 104
Index 10 Byte 146 Original 131 Difference = -15
Index 12 Byte 30 Original 132 Difference = 102
Index 14 Byte 38 Original 133 Difference = 95
Index 16 Byte 32 Original 134 Difference = 102
Index 18 Byte 33 Original 135 Difference = 102
Index 20 Byte 198 Original 136 Difference = -62
Index 22 Byte 48 Original 137 Difference = 89
Index 24 Byte 96 Original 138 Difference = 42
Index 26 Byte 57 Original 139 Difference = 82
Index 28 Byte 82 Original 140 Difference = 58
Index 30 Byte 141 Original 141 Difference = 0
Index 32 Byte 125 Original 142 Difference = 17
Index 34 Byte 143 Original 143 Difference = 0
Index 36 Byte 144 Original 144 Difference = 0
Index 38 Byte 24 Original 145 Difference = 121
Index 40 Byte 25 Original 146 Difference = 121
Index 42 Byte 28 Original 147 Difference = 119
Index 44 Byte 29 Original 148 Difference = 119
Index 46 Byte 34 Original 149 Difference = 115
Index 48 Byte 19 Original 150 Difference = 131
Index 50 Byte 20 Original 151 Difference = 131
Index 52 Byte 220 Original 152 Difference = -68
Index 54 Byte 34 Original 153 Difference = 119
Index 56 Byte 97 Original 154 Difference = 57
Index 58 Byte 58 Original 155 Difference = 97
Index 60 Byte 83 Original 156 Difference = 73
Index 62 Byte 157 Original 157 Difference = 0
Index 64 Byte 126 Original 158 Difference = 32
Index 66 Byte 120 Original 159 Difference = 39
Index 68 Byte 160 Original 160 Difference = 0
It seems fine for everything beyond 160 and below 126.
I don't think it is the CStr() function. If I multiply the byte value by 2 and use CStr(), I get this kind of result, suggesting the byte's numerical value is the problem.
Index 66 Byte 120 Original 159 Difference = 39
Index 66 2*Byte 240
Other causes I investigated but could not find an explanation in:
- two-byte storage of characters in strings.
- the ASCII character set.
- bytes being decoded as negative numbers if the MSB is set, but this is unlikely as 160 onwards is correct.
There may be much better ways to get the array, and those would be very useful, but if possible I would also like to know what has gone wrong, so that I, and anyone reading, do not make the same mistake again.
Thanks for any assistance, R.
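One way to test the code-page idea from the list above (an illustration added here, not part of the original question): assuming Chr() maps 128-255 through the Windows-1252 ANSI code page, and that assigning a String to a Byte array yields its UTF-16LE bytes, the low byte the demo prints for code points 128-159 can differ from the original value. A short Python sketch of that mapping:

for n in (128, 130, 131, 136, 152, 160):
    ch = bytes([n]).decode("cp1252")   # what Chr(n) would give under Windows-1252
    low = ch.encode("utf-16-le")[0]    # the low UTF-16 byte that ends up in arrBytes(i)
    print(n, hex(ord(ch)), low)        # e.g. 128 -> U+20AC (euro sign), low byte 172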

How to parse shell values into mongoexport

I am working on a shell script that will execute mongoexport and upload the result to an S3 bucket.
The goal is to extract data in a readable JSON format for records that are 45 days old. The script will run every day as a cron job.
So basically the purpose is to archive data older than 45 days.
Normal queries work as intended, but when I try to use variables it results in an error.
The regular form of the code is as follows:
firstdate="$(date -v-46d +%Y-%m-%d)"
afterdate="$(date -v-45d +%Y-%m-%d)"
backup_name=gamebook
colname=test1
mongoexport --uri mongodb+srv://<user>:<pass>#gamebookserver.tvdmx.mongodb.net/$dbname
--collection $colname --query '{"gameDate": {"$gte": {"$date": "2020-09-04T00:00:00:000Z"}, "$lte": {"$date": "2020-09-05T00:00:00.000Z"}}}' --out $backup_name;
The previous code works, but I want to make the dates more dynamic, so I tried the code shown below:
firstdate="$(date -v-46d +%Y-%m-%d)"
afterdate="$(date -v-45d +%Y-%m-%d)"
backup_name=gamebook
colname=test1
mongoexport --uri mongodb+srv://<user>:<pass>#gamebookserver.tvdmx.mongodb.net/$dbname
--collection $colname --query '{"gameDate": {"$gte": {"$date": "$firstdateT00:00:00:000Z"}, "$lte": {"$date": "$afterdateT00:00:00.000Z"}}}' --out $backup_name;
This results in the error:
2020-10-20T15:36:13.881+0700 query '[123 34 103 97 109 101 68 97 116 101 34 58 32 123 34 36 103 116 101 34 58 32 123 34 36 100 97 116 101 34 58 32 36 102 105 114 115 116 100 97 116 101 84 48 48 58 48 48 58 48 48 58 48 48 48 90 125 44 32 34 36 108 116 101 34 58 32 123 34 36 100 97 116 101 34 58 32 34 36 97 102 116 101 114 100 97 116 101 84 48 48 58 48 48 58 48 48 46 48 48 48 90 34 125 125 125]' is not valid JSON: invalid character '$' looking for beginning of value
2020-10-20T15:36:13.881+0700 try 'mongoexport --help' for more information
I've read the documentation and it says:
You must enclose the query document in single quotes ('{ ... }') to ensure that it does not interact with your shell environment.
So my overall question is: is there a way to use values from the shell environment and pass them into the query section?
Or is there a better way that might get me the same result?
I'm still new to MongoDB in general, so any advice would be great.
You can always put together a string by combining interpolating and non-interpolating parts:
For instance,
--query '{"gameDate": {"$gte": {"'"$date"'": "'"$firstdate"'T00:00:00:000Z"}, "$lte": {"$date": "$afterdateT00:00:00.000Z"}}}'
would interpolate the first occurrence of date and the shell variable firstdate, but would pass the rest literally to mongoexport (I've picked two variables for demonstration, because I can't tell from your question which ones you want to expand and which you don't). Basically, a
'$AAAA'"$BBBB"'$CCCCC'
is in effect a single string, but the $BBBB part would undergo parameter expansion. Hence, if
BBBB=foo
you would get the literal string $AAAAfoo$CCCCC out of this.
Since this becomes tedious to work with, an alternative approach is to enclose everything in double quotes, which means all parameters are expanded, and to manually escape the parts you don't want expanded. You could also write the last example as
"\$AAAA$BBBB\$CCCCC"

Where to place the return statement when defining a function to read in a file using with open(...) as ...?

I have a text file consisting of data that is separated by tab-delimited columns. There are many ways to read data in from the file into python, but I am specifically trying to use a method similar to one outlined below. When using a context manager like with open(...) as ..., I've seen that the general concept is to have all of the subsequent code indented within the with statement. Yet when defining a function, the return statement is usually placed at the same indentation as the first line of code within the function (excluding cases with awkward if-else loops). In this case, both approaches work. Is one method considered correct or generally preferred over the other?
import numpy as np

def read_in(fpath, contents=[], row_limit=np.inf):
    """
    fpath is filelocation + filename + '.txt'
    contents is the initial data that the file data will be appended to
    row_limit is the maximum number of rows to be read (in case one would like to not read in every row).
    """
    nrows = 0
    with open(fpath, 'r') as f:
        for row in f:
            if nrows < row_limit:
                contents.append(row.split())
                nrows += 1
            else:
                break
        # return contents
    return contents
Below is a snippet of the text-file I am using for this example.
1996 02 08 05 17 49 263 70 184 247 126 0 -6.0 1.6e+14 2.7e+28 249
1996 02 12 05 47 26 91 53 160 100 211 236 2.0 1.3e+15 1.6e+29 92
1996 02 17 02 06 31 279 73 317 257 378 532 9.9 3.3e+14 1.6e+29 274
1996 02 17 05 18 59 86 36 171 64 279 819 27.9 NaN NaN 88
1996 02 19 05 15 48 98 30 266 129 403 946 36.7 NaN NaN 94
1996 03 02 04 11 53 88 36 108 95 120 177 1.0 1.5e+14 8.7e+27 86
1996 03 03 04 12 30 99 26 186 141 232 215 2.3 1.6e+14 2.8e+28 99
And below is a sample call.
fpath = "/Users/.../sample_data.txt"
data_in = read_in(fpath)
for i in range(len(data_in)):
    print(data_in[i])
(I realize that it's better to use chunks of pre-defined sizes to read in data, but the number of characters per row of data varies. So I'm instead trying to give user control over the number of rows read in; one could read in a subset of the rows at a time and append them into contents, continually passing them into read_in - possibly in a loop - if the file size is large enough. That said, I'd love to know if I'm wrong about this approach as well, though this isn't my main question.)
If your function needs to do some other things after it is done with the file, you usually do that outside the with block, so essentially you need to return outside the with block too.
However, if the purpose of your function is just to read in a file, you can return within the with block or outside it. I believe neither placement is preferred in this case.
I don't really understand your second question.
You can also put the return inside the with context.
When the context is exited, the cleanup is done. That is the power of with: you don't need to check every possible exit path. Note that the context's exit handler is also called when an exception is raised inside the with block.
But if the file is empty (as an example), you should still return something, so in that case your code is clearer if it follows the principle of one exit path. If, however, you have to handle reaching the end of the file without finding something important, I would put the normal return inside the with context and handle the special case after it.
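As a concrete sketch of that last point (added here for illustration; first_match is a hypothetical helper, not from the question): the normal return sits inside the with block, the end-of-file case is handled after it, and the file is closed on either path.

def first_match(fpath, keyword):
    with open(fpath, 'r') as f:
        for line in f:
            if keyword in line:
                return line   # normal exit path, inside the with block
    return None               # special case: end of file reached without a match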

Understanding the zlib header; CMF (CM, CINFO), FLG, (FDICT/DICTID, FLEVEL); RFC1950 § 2.2. Data format

I am curious about the zlib data format and trying to understand the zlib header as described in RFC1950 (https://www.rfc-editor.org/rfc/rfc1950). I am, however, new to this kind of low-level interpretation and seem to have gone astray with some of my conclusions.
I have the following compressed data (from a PDF stream object):
b'h\xdebbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01\x02\x0c\x00!\xa4\x03\xc4'
In python, I have successfully decompressed and re-compressed the data:
b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4'
As I have understood the discussion/answer in Deflate and inflate for PDF, using zlib C++
The difference in the compressed output should not matter, as it is an effect of different methods being applied to compress the data.
Assuming the last four bytes !\xa4\x03\xc4 are the ADLER32 (Adler-32 checksum) my questions pertain to the first 2 bytes.
  0   1     0   1   2   3                              0   1   2   3
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
|CMF|FLG| |   [DICTID]    | |...compressed data...| |    ADLER32    |
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
CMF
The first byte represents the CMF, which in my two instances would be
chr h = dec 104 = hex 68 = 01101000
and chr x = dec 120 = hex 78 = 01111000
This byte is divided into a 4-bit compression method and a 4-bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
+----|----+       +----|----+       +----|----+
|0000|0000|  i.e. |0110|1000|  and  |0111|1000|
+----|----+       +----|----+       +----|----+
  CM |CINFO         CM |CINFO         CM |CINFO
Where
[CM] identifies the compression method used in the file.
CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG.
CM = 15 is reserved.
and
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8.
As I understand it,
the only valid CM is 8
CINFO can be 0-7
Cf https://stackoverflow.com/a/34926305/7742349
You should NOT assume that it's always 8. Instead, you should check it and, if it's not 8, throw a "not supported" error.
Cf https://groups.google.com/forum/#!msg/comp.compression/_y2Wwn_Vq_E/EymIVcQ52cEJ
An exhaustive list of all 64 current possibilities for zlib headers:
COMMON
78 01
78 5e
78 9c
78 da
RARE
08 1d 18 19 28 15 38 11 48 0d 58 09 68 05
08 5b 18 57 28 53 38 4f 48 4b 58 47 68 43
08 99 18 95 28 91 38 8d 48 89 58 85 68 81
08 d7 18 d3 28 cf 38 cb 48 c7 58 c3 68 de
VERY RARE
08 3c 18 38 28 34 38 30 48 2c 58 28 68 24 78 3f
08 7a 18 76 28 72 38 6e 48 6a 58 66 68 62 78 7d
08 b8 18 b4 28 b0 38 ac 48 a8 58 a4 68 bf 78 bb
08 f6 18 f2 28 ee 38 ea 48 e6 58 e2 68 fd 78 f9
Q1 My first question is simply
Why does the CINFO come before the CM? I.e.,
why is the byte not 87, 80, 81, 82, 83, ...?
As far as I know, byte order is not an issue here. I suspect it may be related to the least significant bit (RFC1950 § 2.1. Overall conventions), but I cannot quite understand how it would result in, e.g., 78 instead of 87...
Q2 My second question
If CINFO 7 represents "a window size up to 32K", then what does 1-6 correspond to? (assuming 0 means window size 0, as in, no compression applied).
FLG
The second byte represents the FLG
\xde -> 11011110
\xda -> 11011010
[FLG] [...] is divided as follows:
bits 0 to 4 FCHECK (check bits for CMF and FLG)
bit 5 FDICT (preset dictionary)
bits 6 to 7 FLEVEL (compression level)
+-----|-|--+       +-----|-|--+       +-----|-|--+
|00000|0|00|  i.e. |11011|1|10|  and  |11011|0|10|
+-----|-|--+       +-----|-|--+       +-----|-|--+
   C  |D| L           C  |D| L           C  |D| L
Bits 0-4, as far as I can tell, are some form of "checksum" or integrity control?
Bit 5 indicates whether a preset dictionary is present.
FDICT (Preset dictionary)
If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor.
Q3 My third question
Assuming that "1" indicates "is set"
\xde -> 11011_1_10
\xda -> 11011_0_10
According to the specification, DICTID consists of 4 bytes. The four following bytes in the compressed streams I have are
bbd\x10
cbd\x10
Why are the compressed data from the PDF stream object (with the FDICT 1) and the compressed data with python zlib (with the FDICT 0) almost identical?
Granted that I do not understand the function of the DICTID, but is it not supposed to exist only if FDICT is set?
Q4 My fourth question
Bit 6-7 sets the FLEVEL (Compression level)
These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile.
I would have thought that the flags would be:
0 (00)
1 (01)
2 (10)
3 (11)
However, according to What does a zlib header look like?:
01 (00000001) - No Compression/low
[5e (01011100) - Default Compression?]
9c (10011100) - Default Compression
da (11011010) - Best Compression
I note, however, that the two left-most bits seem to correspond to what I expected; I am obviously failing to comprehend something fundamental about how to interpret bits...
The RFC says:
CMF (Compression Method and flags)
This byte is divided into a 4-bit compression method and a 4-
bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
The least significant bit of a byte is bit 0. The most significant bit is bit 7. So the diagram you made for mapping CM and CINFO to bits is backwards. 0x78 and 0x68 both have a CM of 8. Their CINFO's are 7 and 6 respectively.
CINFO is what the RFC says it is:
CINFO (Compression info)
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
size, minus eight (CINFO=7 indicates a 32K window size).
So a CINFO of 7 means a 32 KiB window, and 6 means a 16 KiB window. CINFO == 0 does not mean no compression; it means a window size of 256 bytes.
For the flag byte, you got it backwards again. FDICT is not set. For both of your examples, the compression level is 11, maximum compression.
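To make the bit positions concrete, here is a small Python sketch (added for illustration, not part of the original answer) that decodes both headers from the question:

# Decode the two zlib headers from the question: 68 de (PDF stream) and 78 da (Python zlib).
for cmf, flg in ((0x68, 0xDE), (0x78, 0xDA)):
    cm = cmf & 0x0F               # bits 0-3: compression method (8 = deflate)
    cinfo = (cmf >> 4) & 0x0F     # bits 4-7: log2(window size) - 8
    window = 1 << (cinfo + 8)     # CINFO=7 -> 32768 bytes, CINFO=6 -> 16384 bytes
    fcheck_ok = (cmf * 256 + flg) % 31 == 0   # FCHECK makes CMF*256+FLG a multiple of 31
    fdict = (flg >> 5) & 1        # bit 5: preset dictionary flag (0 for both headers)
    flevel = (flg >> 6) & 3       # bits 6-7: compression level (3 = maximum)
    print(f"{cmf:02x} {flg:02x}: CM={cm} CINFO={cinfo} window={window} "
          f"FCHECK ok={fcheck_ok} FDICT={fdict} FLEVEL={flevel}")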

How does Excel evaluate FACT(170)/FACT(169) correctly?

170! approaches the limit of a floating point double: 171! will overflow.
However 170! is over 300 digits long.
There is, therefore, no way that 170! can be represented precisely in floating point.
Yet Excel returns the correct answer for 170! / 169!.
Why is this? I'd expect some error to creep in, but it returns an integral value. Does Excel somehow know how to optimise this calculation?
If you find the closest doubles to 170! and 169!, they are
double oneseventy = 5818033100654137.0 * 256;
double onesixtynine = 8761273375102700.0;
times the same power of two. The closest double to the quotient of these is exactly 170.0.
Also, Excel may compute 170! by multiplying 169! by 170.
William Kahan has a paper called "How Futile are Mindless Assessments of Roundoff in Floating-Point Computation?" where he discusses some of the insanity that goes on in Excel. It may be that Excel is not computing 170 exactly, but rather it's hiding an ulp of reality from you.
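As a quick check of the closest-doubles claim above (a sketch added for illustration; Python's float is the same IEEE 754 double format):

from math import factorial

oneseventy = float(factorial(170))    # nearest double to 170!
onesixtynine = float(factorial(169))  # nearest double to 169!
print(oneseventy / onesixtynine)      # per the answer above, the correctly rounded quotient is exactly 170.0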
tmyklebu's answer is already perfect, but I wanted to know more.
What if the implementation of n! were something as trivial as return double(n)*(n-1)! ...
Here is a Smalltalk snippet, but you could translate it into many other languages; that's not the point:
(2 to: 170) count: [:n |
    | num den |
    den := (2 to: n - 1) inject: 1.0 into: [:p :e | p*e].
    num := n*den.
    num / den ~= n].
And the answer is 12
So you have not been particularly lucky: thanks to the good properties of the round-to-nearest-even rounding mode, out of these 169 numbers only 12 don't behave as expected.
Which ones? Replace count: by select: and you get:
#(24 47 59 61 81 96 101 104 105 114 122 146)
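For readers who don't read Smalltalk, a rough Python transcription of the naive check above (added for illustration; same left-to-right float accumulation):

# n in 2..170 for which a naively accumulated float factorial makes n! / (n-1)! differ from n
naive_failures = []
for n in range(2, 171):
    den = 1.0
    for e in range(2, n):   # den accumulates (n-1)! in floating point
        den *= e
    if (n * den) / den != n:
        naive_failures.append(n)
print(len(naive_failures), naive_failures)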
If I had Excel handy, I would ask it to evaluate 146!/145!.
Curiously (but only apparently so), a less naive solution that computes the exact factorial with large-integer arithmetic and then converts to the nearest float does not perform better!
(2 to: 170) reject: [:n |
    n factorial asFloat / (n-1) factorial asFloat = n]
leads to:
#(24 31 34 40 41 45 46 57 61 70 75 78 79 86 88 92 93 111 115 116 117 119 122 124 141 144 147 164)
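And a rough Python equivalent of that exact-factorial variant (added for illustration; Python integers are exact and float() rounds to the nearest double):

from math import factorial

# n in 2..170 for which rounding n! and (n-1)! to doubles and then dividing does not give back n
failures = [n for n in range(2, 171)
            if float(factorial(n)) / float(factorial(n - 1)) != n]
print(len(failures), failures)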

Resources