Where to place the return statement when defining a function to read in a file using with open(...) as ...? - python-3.x

I have a text file consisting of data that is separated by tab-delimited columns. There are many ways to read data in from the file into python, but I am specifically trying to use a method similar to one outlined below. When using a context manager like with open(...) as ..., I've seen that the general concept is to have all of the subsequent code indented within the with statement. Yet when defining a function, the return statement is usually placed at the same indentation as the first line of code within the function (excluding cases with awkward if-else loops). In this case, both approaches work. Is one method considered correct or generally preferred over the other?
import numpy as np

def read_in(fpath, contents=None, row_limit=np.inf):
    """
    fpath is filelocation + filename + '.txt'
    contents is the initial data that the file data will be appended to
    row_limit is the maximum number of rows to be read (in case one would like to not read in every row).
    """
    # None is the default because a mutable default list would be shared across calls.
    if contents is None:
        contents = []
    nrows = 0
    with open(fpath, 'r') as f:
        for row in f:
            if nrows < row_limit:
                contents.append(row.split())
                nrows += 1
            else:
                break
        # return contents
    return contents
Below is a snippet of the text-file I am using for this example.
1996 02 08 05 17 49 263 70 184 247 126 0 -6.0 1.6e+14 2.7e+28 249
1996 02 12 05 47 26 91 53 160 100 211 236 2.0 1.3e+15 1.6e+29 92
1996 02 17 02 06 31 279 73 317 257 378 532 9.9 3.3e+14 1.6e+29 274
1996 02 17 05 18 59 86 36 171 64 279 819 27.9 NaN NaN 88
1996 02 19 05 15 48 98 30 266 129 403 946 36.7 NaN NaN 94
1996 03 02 04 11 53 88 36 108 95 120 177 1.0 1.5e+14 8.7e+27 86
1996 03 03 04 12 30 99 26 186 141 232 215 2.3 1.6e+14 2.8e+28 99
And below is a sample call.
fpath = "/Users/.../sample_data.txt"
data_in = read_in(fpath)
for i in range(len(data_in)):
    print(data_in[i])
(I realize that it's better to use chunks of pre-defined sizes to read in data, but the number of characters per row of data varies. So I'm instead trying to give user control over the number of rows read in; one could read in a subset of the rows at a time and append them into contents, continually passing them into read_in - possibly in a loop - if the file size is large enough. That said, I'd love to know if I'm wrong about this approach as well, though this isn't my main question.)

If your function needs to do some other things after it has finished with the file, you usually do them outside the with block. In that case the return has to go outside the with block too.
However, if the purpose of your function is just to read in a file, you can return inside the with block or outside it. I don't believe either placement is preferred in this case.
I don't really understand your second question.

You can also put the return inside the with block.
The cleanup is done when the context is exited, however that happens. This is the power of with: you do not need to check every possible exit path. Note that the context's exit handler is also called when an exception is raised inside the with block.
But if the file is empty (as an example), you should still return something. In that case your code is clear and follows the principle of a single exit path. If, however, you have to handle reaching the end of the file without finding something important, I would put the normal return inside the with block and handle the special case after it.
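
To make the point about cleanup concrete, here is a minimal sketch comparing both placements. The function names, the use of itertools.islice for the row limit, and the temporary sample file are assumptions made for illustration. The file object is returned only so that its closed state can be inspected afterwards.

```python
import os
import tempfile
from itertools import islice


def read_in_return_inside(fpath, row_limit=None):
    """Return from inside the with block; the file is still closed on exit."""
    with open(fpath, "r") as f:
        rows = [line.split() for line in islice(f, row_limit)]
        return rows, f  # f is returned only so the caller can inspect it


def read_in_return_outside(fpath, row_limit=None):
    """Return after the with block has already closed the file."""
    with open(fpath, "r") as f:
        rows = [line.split() for line in islice(f, row_limit)]
    return rows, f


# Hypothetical sample file with three short rows.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1996 02 08\n1996 02 12\n1996 02 17\n")
    sample_path = tmp.name

rows_in, f_in = read_in_return_inside(sample_path, row_limit=2)
rows_out, f_out = read_in_return_outside(sample_path)
print(f_in.closed, f_out.closed)  # True True
os.remove(sample_path)
```

Both variants leave the file closed, so the choice comes down to readability.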

Related

replacing a value in python

I'm writing a bingo game in python. So far I can generate a bingo card and print it.
My problem is after I've randomly generated a number to call out, I don't know how to 'cross out' that number on the card to note that it's been called out.
This is the output, a randomly generated card:
B 11 13 14 2 1
I 23 28 26 27 22
N 42 45 40 33 44
G 57 48 59 56 55
O 66 62 75 63 67
I was thinking of using random.sample together with list.pop to generate a number to call out (in bingo the numbers go from 1 to 75):
random_draw_list = random.sample(range(1, 76), 75)
number_drawn = random_draw_list.pop()
How can I write a function that will 'cross out' a number on the card after it has been called?
So for example if number_drawn results in 11, it should replace 11 on the card with an x or a zero.
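
One possible sketch of such a function follows. The question does not show how the card is stored, so the dict of lists keyed by letter, the function name cross_out, and the "x" marker are all assumptions.

```python
import random


def cross_out(card, number, mark="x"):
    """Replace `number` on the card with `mark`; return True if it was found."""
    for row in card.values():
        if number in row:
            row[row.index(number)] = mark
            return True
    return False


# Assumed card layout: one list per letter, using the numbers from the question.
card = {
    "B": [11, 13, 14, 2, 1],
    "I": [23, 28, 26, 27, 22],
    "N": [42, 45, 40, 33, 44],
    "G": [57, 48, 59, 56, 55],
    "O": [66, 62, 75, 63, 67],
}

random_draw_list = random.sample(range(1, 76), 75)
number_drawn = random_draw_list.pop()
cross_out(card, number_drawn)  # marks the drawn number if it is on the card

found_11 = cross_out(card, 11)  # 11 is on the card, so this returns True
for letter, row in card.items():
    print(letter, *row)
```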

How to parse shell values into mongoexport

I am working on a shell script that will execute mongoexport and upload it to a S3 bucket.
The goal is to export data that is 45 days old in a readable JSON format. The script will run every day as a cron job.
So basically the purpose is to archive data older than 45 days.
Normal queries work as intended, but when I try to use variables it results in an error.
The regular form of the code is as follows:
firstdate="$(date -v-46d +%Y-%m-%d)"
afterdate="$(date -v-45d +%Y-%m-%d)"
backup_name=gamebook
colname=test1
mongoexport --uri mongodb+srv://<user>:<pass>@gamebookserver.tvdmx.mongodb.net/$dbname \
--collection $colname --query '{"gameDate": {"$gte": {"$date": "2020-09-04T00:00:00:000Z"}, "$lte": {"$date": "2020-09-05T00:00:00.000Z"}}}' --out $backup_name;
The previous code works but I want to make it more dynamic in the dates so I tried the code as shown below:
firstdate="$(date -v-46d +%Y-%m-%d)"
afterdate="$(date -v-45d +%Y-%m-%d)"
backup_name=gamebook
colname=test1
mongoexport --uri mongodb+srv://<user>:<pass>@gamebookserver.tvdmx.mongodb.net/$dbname \
--collection $colname --query '{"gameDate": {"$gte": {"$date": "$firstdateT00:00:00:000Z"}, "$lte": {"$date": "$afterdateT00:00:00.000Z"}}}' --out $backup_name;
This results in the error:
2020-10-20T15:36:13.881+0700 query '[123 34 103 97 109 101 68 97 116 101 34 58 32 123 34 36 103 116 101 34 58 32 123 34 36 100 97 116 101 34 58 32 36 102 105 114 115 116 100 97 116 101 84 48 48 58 48 48 58 48 48 58 48 48 48 90 125 44 32 34 36 108 116 101 34 58 32 123 34 36 100 97 116 101 34 58 32 34 36 97 102 116 101 114 100 97 116 101 84 48 48 58 48 48 58 48 48 46 48 48 48 90 34 125 125 125]' is not valid JSON: invalid character '$' looking for beginning of value
2020-10-20T15:36:13.881+0700 try 'mongoexport --help' for more information
I've read in the documentation and it says:
You must enclose the query document in single quotes ('{ ... }') to ensure that it does not interact with your shell environment.
So my overall question is: is there a way to use values from the shell environment and pass them into the query section?
Or is there a better way that might get me the same result?
I'm still new to mongodb in general, so any advice would be great.
You can always put together a string by combining interpolating and non-interpolating parts:
For instance,
--query '{"gameDate": {"$gte": {"'"$date"'": "'"$firstdate"'T00:00:00:000Z"}, "$lte": {"$date": "$afterdateT00:00:00.000Z"}}}'
would interpolate the first occurrence of date and the shell variable firstdate, but would pass the rest literally to mongoexport (I've picked two variables for demonstration because I can't tell from your question which ones you want to expand and which ones you don't). Basically, a
'$AAAA'"$BBBB"'$CCCCC'
is in effect a single string, but the $BBBB part would undergo parameter expansion. Hence, if
BBBB=foo
you would get the literal string $AAAAfoo$CCCCC out of this.
Since this becomes tedious to work with, an alternative approach is to enclose everything in double quotes, which means all parameters are expanded, and to manually escape the parts you don't want expanded. You could also write the last example as
"\$AAAA$BBBB\$CCCCC"

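As a different route to the same result, the query can be built outside the shell so that no quoting is involved at all. The sketch below is an assumption-laden illustration in Python: the helper name build_export_args, the placeholder URI, and the fixed date are hypothetical. Only the --uri, --collection, --query and --out options from the question are used, and the command is built but not executed.

```python
import json
from datetime import date, timedelta


def build_export_args(uri, collection, out_path, today=None):
    """Build a mongoexport argument list for the 46-to-45-days-ago window."""
    today = today or date.today()
    first = today - timedelta(days=46)
    after = today - timedelta(days=45)
    query = {
        "gameDate": {
            "$gte": {"$date": f"{first.isoformat()}T00:00:00.000Z"},
            "$lte": {"$date": f"{after.isoformat()}T00:00:00.000Z"},
        }
    }
    # Passing a list to subprocess avoids the shell, so "$gte" is never expanded.
    return ["mongoexport", "--uri", uri, "--collection", collection,
            "--query", json.dumps(query), "--out", out_path]


# Placeholder URI and a fixed date so the example is reproducible.
args = build_export_args("mongodb+srv://USER:PASS@example.invalid/dbname",
                         "test1", "gamebook", today=date(2020, 10, 20))
print(args[6])
# To run it for real: subprocess.run(args, check=True)
```

Because json.dumps produces the JSON text, the dollar signs in the operator names never pass through shell expansion.
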
modifying 2 lists and giving them scores

I want to make an output file which is simply the input file with the value of each byte incremented by one.
here is the expected output:
04 fb 56 13 21 67 68 51 e9 ac
which also will be in hexadecimal notation. I am trying to do that in python3 using the following command:

awk split adds whole string to array position 1 (reason unknown)

So I have a .txt file that looks like this:
mona 70 77 85 77
john 85 92 78 80
andreja 89 90 85 94
jasper 84 64 81 66
george 54 77 82 73
ellis 90 93 89 88
I have created a grades.awk script that contains the following code:
{
    FS=" "
    names=$1
    vi1=$2
    vi2=$3
    vi3=$4
    rv=$5
    #printf("%s ",names);
    split(names,nameArray," ");
    printf("%s\t",nameArray[1]); # prints the whole array of names for some reason, instead of just the name at position 1 in array ("john")
}
So my question is, how do I split this correctly? Am I doing something wrong?
How do you read line by line, word by word, correctly? I need to add each column into its own array. I've been searching for the answer for quite some time now and can't fix my problem.
Here is a template to calculate average grades per student:
$ awk '{sum=0; for(i=2;i<=NF;i++) sum+=$i;
printf "%s\t%5.2f\n", $1, sum/(NF-1)}' file
mona 77.25
john 83.75
andreja 89.50
jasper 73.75
george 71.50
ellis 90.00
printf("%s\t",nameArray[1])
is doing exactly what you asked it to do. It is called once per input line and outputs one word each time, but because you never print a newline between those words, you get just one line of output. Change it to:
printf("%s\n",nameArray[1])
There are a few other issues with your code of course (e.g. you're setting FS in the wrong place and unnecessarily, names only ever contains 1 word so splitting it into an array doesn't make sense, etc.) but I think that's what you were asking about specifically.
If that's not all you want then edit your question to clarify what you're trying to do and add concise, testable sample input and expected output.
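
For comparison, the "each column into its own array" idea can be sketched outside awk as well. The Python below is only an illustration under assumptions: the helper name columns_from_lines is hypothetical, and the sample data is copied from the question.

```python
SAMPLE = """\
mona 70 77 85 77
john 85 92 78 80
andreja 89 90 85 94
jasper 84 64 81 66
george 54 77 82 73
ellis 90 93 89 88
"""


def columns_from_lines(lines):
    """Split whitespace-separated rows and transpose them into columns."""
    rows = [line.split() for line in lines if line.strip()]
    names, *score_cols = zip(*rows)  # first column is names, the rest are scores
    return list(names), [[int(v) for v in col] for col in score_cols]


names, scores = columns_from_lines(SAMPLE.splitlines())
# Average per student across all score columns.
averages = {name: sum(col[i] for col in scores) / len(scores)
            for i, name in enumerate(names)}
for name, avg in averages.items():
    print(f"{name}\t{avg:5.2f}")
```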

Understanding the zlib header; CMF (CM, CINFO), FLG, (FDICT/DICTID, FLEVEL); RFC1950 § 2.2. Data format

I am curious about the zlib data format and trying to understand the zlib header as described in RFC1950 (https://www.rfc-editor.org/rfc/rfc1950). I am however new to this kind of low level interpretation and seem to have run afoul with some of my conclusions.
I have the following compressed data (from a PDF stream object):
b'h\xdebbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01\x02\x0c\x00!\xa4\x03\xc4'
In python, I have successfully decompressed and re-compressed the data:
b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4'
As I have understood the discussion/answer in Deflate and inflate for PDF, using zlib C++
The difference in result of the compressed data should not matter as it is an effect of different applied methods to compress the data.
Assuming the last four bytes !\xa4\x03\xc4 are the ADLER32 (Adler-32 checksum) my questions pertain to the first 2 bytes.
  0   1     0   1   2   3                             0   1   2   3
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
|CMF|FLG| |   [DICTID]    | |...compressed data...| |    ADLER32    |
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
CMF
The first byte represents the CMF, which in my two instances would be
chr h = dec 104 = hex 68 = 01101000
and chr x = dec 120 = hex 78 = 01111000
This byte is divided into a 4-bit compression method and a 4-bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
+----|----+       +----|----+     +----|----+
|0000|0000|  i.e. |0110|1000| and |0111|1000|
+----|----+       +----|----+     +----|----+
 CM  |CINFO        CM  |CINFO      CM  |CINFO
Where
[CM] identifies the compression method used in the file.
CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG (see
CM = 15 is reserved.
and
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8.
As I understand it,
the only valid CM is 8
CINFO can be 0-7
Cf https://stackoverflow.com/a/34926305/7742349
You should NOT assume that it's always 8. Instead, you should check it and, if it's not 8, throw a "not supported" error.
Cf https://groups.google.com/forum/#!msg/comp.compression/_y2Wwn_Vq_E/EymIVcQ52cEJ
An exhaustive list of all 64 current possibilities for zlib headers:
COMMON
78 01
78 5e
78 9c
78 da
RARE
08 1d 18 19 28 15 38 11 48 0d 58 09 68 05
08 5b 18 57 28 53 38 4f 48 4b 58 47 68 43
08 99 18 95 28 91 38 8d 48 89 58 85 68 81
08 d7 18 d3 28 cf 38 cb 48 c7 58 c3 68 de
VERY RARE
08 3c 18 38 28 34 38 30 48 2c 58 28 68 24 78 3f
08 7a 18 76 28 72 38 6e 48 6a 58 66 68 62 78 7d
08 b8 18 b4 28 b0 38 ac 48 a8 58 a4 68 bf 78 bb
08 f6 18 f2 28 ee 38 ea 48 e6 58 e2 68 fd 78 f9
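
The count of 64 can be checked with a short sketch: 8 CINFO values times 4 FLEVEL values times 2 FDICT values, each completed with the check bits. The function name is hypothetical, and choosing FCHECK in the range 1..31 is an assumption made to reproduce the quoted list; RFC 1950 itself only requires CMF*256 + FLG to be a multiple of 31.

```python
def zlib_header_candidates():
    """Enumerate two-byte zlib headers for CM = 8 (deflate)."""
    headers = []
    for cinfo in range(8):          # window sizes from 256 bytes to 32 KiB
        cmf = (cinfo << 4) | 8      # CM = 8 sits in the low nibble
        for flevel in range(4):
            for fdict in (0, 1):
                flg = (flevel << 6) | (fdict << 5)
                # Assumption: pick FCHECK in 1..31 so the 16-bit value is a multiple of 31.
                flg += 31 - ((cmf << 8) | flg) % 31
                headers.append(bytes([cmf, flg]))
    return headers


headers = zlib_header_candidates()
print(len(headers))
print(" ".join(h.hex() for h in headers if h[0] == 0x78))
```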
Q1 My first question is simply
Why is the CINFO before the CM?, i.e.,
why is it not 87, 80, 81, 82, 83, ...
As far as I know, byte order is not an issue here. I suspect it may be related to the least significant bit (RFC1950 § 2.1. Overall conventions), but I cannot quite understand how it would result in, e.g., 78 instead of 87...
Q2 My second question
If CINFO 7 represents "a window size up to 32K", then what does 1-6 correspond to? (assuming 0 means window size 0, as in, no compression applied).
FLG
The second byte represents the FLG
\xde -> 11011110
\xda -> 11011010
[FLG] [...] is divided as follows:
bits 0 to 4 FCHECK (check bits for CMF and FLG)
bit 5 FDICT (preset dictionary)
bits 6 to 7 FLEVEL (compression level)
+-----|-|--+       +-----|-|--+     +-----|-|--+
|00000|0|00|  i.e. |11011|1|10| and |11011|0|10|
+-----|-|--+       +-----|-|--+     +-----|-|--+
   C  |D| L           C  |D| L         C  |D| L
Bit 0-4 as far as I can tell is some form of "checksum" or integrity control?
Bit 5 indicates whether a dictionary is present.
FDICT (Preset dictionary)
If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor.
Q3 My third question
Assuming that "1" indicates "is set"
\xde -> 11011_1_10
\xda -> 11011_0_10
According to the specification DICTID consist of 4 bytes. The four following bytes in the compressed streams I have are
bbd\x10
cbd\x10
Why are the compressed data from the PDF stream object (with the FDICT 1) and the compressed data with python zlib (with the FDICT 0) almost identical?
Granted that I do not understand the function of the DICTID, but is it not supposed to exist only if FDICT is set?
Q4 My fourth question
Bit 6-7 sets the FLEVEL (Compression level)
These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile.
I would have thought that the flags would be:
0 (00)
1 (01)
2 (10)
3 (11)
However from the What does a zlib header look like?
01 (00000001) - No Compression/low
[5e (01011110) - Default Compression?]
9c (10011100) - Default Compression
da (11011010) - Best Compression
I note however that the two left-most bits seem to correspond to what I expected; I feel I am obviously failing to comprehend something fundamental in how to interpret bits...
The RFC says:
CMF (Compression Method and flags)
This byte is divided into a 4-bit compression method and a 4-bit
information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
The least significant bit of a byte is bit 0. The most significant bit is bit 7. So the diagram you made for mapping CM and CINFO to bits is backwards. 0x78 and 0x68 both have a CM of 8. Their CINFO's are 7 and 6 respectively.
CINFO is what the RFC says it is:
CINFO (Compression info)
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
size, minus eight (CINFO=7 indicates a 32K window size).
So, a CINFO of 7 means a 32 KiB window and 6 means a 16 KiB window. CINFO == 0 does not mean no compression. It means a window size of 256 bytes.
For the flag byte, you got it backwards again: bit 0 is the right-most bit, so FCHECK is the low five bits, FDICT is bit 5, and FLEVEL is the top two bits. FDICT is not set in either example. For both of your examples, FLEVEL is binary 11 (3), maximum compression.
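
The bit layout described above can be sketched as a small parser. The function name and the returned dictionary keys are assumptions for illustration; the two inputs are the header bytes from the question.

```python
def parse_zlib_header(data):
    """Decode the two-byte zlib header (RFC 1950); bit 0 is the least significant bit."""
    cmf, flg = data[0], data[1]
    cm = cmf & 0x0F        # bits 0 to 3 of CMF
    cinfo = cmf >> 4       # bits 4 to 7 of CMF
    return {
        "cm": cm,
        "cinfo": cinfo,
        # For CM = 8, the window size is 2 ** (CINFO + 8) bytes.
        "window_size": 1 << (cinfo + 8) if cm == 8 else None,
        "fcheck": flg & 0x1F,        # bits 0 to 4 of FLG
        "fdict": bool(flg & 0x20),   # bit 5 of FLG
        "flevel": flg >> 6,          # bits 6 to 7 of FLG
        "valid": (cmf * 256 + flg) % 31 == 0,
    }


pdf_header = parse_zlib_header(b"h\xde")  # header of the PDF stream
py_header = parse_zlib_header(b"x\xda")   # header of the recompressed data
print(pdf_header)
print(py_header)
```

Under this reading, the two streams differ only in CINFO (16 KiB versus 32 KiB window), which is consistent with the rest of the data being almost identical.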
