How to parse shell values into mongoexport - linux

I am working on a shell script that will execute mongoexport and upload the result to an S3 bucket.
The goal is to extract data in a readable JSON format for records that are 45 days old. The script will run every day as a crontab job.
So basically the purpose is to archive data older than 45 days.
Queries with hard-coded values work as intended, but when I try to use variables the command results in an error.
The working version of the code is as follows:
firstdate="$(date -v-46d +%Y-%m-%d)"
afterdate="$(date -v-45d +%Y-%m-%d)"
backup_name=gamebook
colname=test1
mongoexport --uri mongodb+srv://<user>:<pass>#gamebookserver.tvdmx.mongodb.net/$dbname \
  --collection $colname --query '{"gameDate": {"$gte": {"$date": "2020-09-04T00:00:00:000Z"}, "$lte": {"$date": "2020-09-05T00:00:00.000Z"}}}' --out $backup_name;
The previous code works, but I want to make the dates dynamic, so I tried the version shown below:
firstdate="$(date -v-46d +%Y-%m-%d)"
afterdate="$(date -v-45d +%Y-%m-%d)"
backup_name=gamebook
colname=test1
mongoexport --uri mongodb+srv://<user>:<pass>#gamebookserver.tvdmx.mongodb.net/$dbname \
  --collection $colname --query '{"gameDate": {"$gte": {"$date": "$firstdateT00:00:00:000Z"}, "$lte": {"$date": "$afterdateT00:00:00.000Z"}}}' --out $backup_name;
This results in the error:
2020-10-20T15:36:13.881+0700 query '[123 34 103 97 109 101 68 97 116 101 34 58 32 123 34 36 103 116 101 34 58 32 123 34 36 100 97 116 101 34 58 32 36 102 105 114 115 116 100 97 116 101 84 48 48 58 48 48 58 48 48 58 48 48 48 90 125 44 32 34 36 108 116 101 34 58 32 123 34 36 100 97 116 101 34 58 32 34 36 97 102 116 101 114 100 97 116 101 84 48 48 58 48 48 58 48 48 46 48 48 48 90 34 125 125 125]' is not valid JSON: invalid character '$' looking for beginning of value
2020-10-20T15:36:13.881+0700 try 'mongoexport --help' for more information
I've read the documentation, and it says:
You must enclose the query document in single quotes ('{ ... }') to ensure that it does not interact with your shell environment.
So my overall question is: is there a way to use values from the shell environment and pass them into the query section?
Or is there a better way to get the same result?
I'm still new to MongoDB in general, so any advice would be great.

You can always put together a string by combining interpolating and non-interpolating parts.
For instance,
--query '{"gameDate": {"$gte": {"'"$date"'": "'"$firstdate"'T00:00:00:000Z"}, "$lte": {"$date": "$afterdateT00:00:00.000Z"}}}'
would interpolate the first occurrence of date and the shell variable firstdate, but would pass the rest literally to mongoexport (I've picked two variables for demonstration, because I can't tell from your question which ones you want to expand and which you don't). Basically, a
'$AAAA'"$BBBB"'$CCCCC'
is in effect a single string, but the $BBBB part undergoes parameter expansion. Hence, if
BBBB=foo
you would get the literal string $AAAAfoo$CCCCC out of this.
Since this becomes tedious to work with, an alternative approach is to enclose everything in double quotes, which means all parameters are expanded, and manually escape those parts which you don't want expanded. You could also write the last example as
"\$AAAA$BBBB\$CCCCC"

Related

datetime syntax seems valid, but causes a syntax error

I'm trying to query my custom logs table (e.g. CustomData_CL) by giving a time range. The result of this query should be the data filtered to that time range; I want to find out the data size of the resulting output.
The query I have used to fetch the time-ranged output:
CustomData_CL
| where TimeGenerated between (datetime(2022–09–14 04:00:00) .. datetime(2020–09–14 05:00:00))
But it is giving a syntax error.
Can anyone please advise?
Note the characters with code point 8211.
These are not standard hyphens (-) 🙂.
let p_str = "(datetime(2022–09–14 04:00:00) .. datetime(2020–09–14 05:00:00))";
print str = p_str
| mv-expand str = extract_all("(.)", str) to typeof(string)
| extend dec = to_utf8(str)[0]
str      dec
---      ----
(        40
d        100
a        97
t        116
e        101
t        116
i        105
m        109
e        101
(        40
2        50
0        48
2        50
2        50
–        8211
0        48
9        57
–        8211
1        49
4        52
(space)  32
0        48
4        52
:        58
0        48
0        48
:        58
0        48
0        48
)        41
(space)  32
.        46
.        46
(space)  32
d        100
a        97
t        116
e        101
t        116
i        105
m        109
e        101
(        40
2        50
0        48
2        50
0        48
–        8211
0        48
9        57
–        8211
1        49
4        52
(space)  32
0        48
5        53
:        58
0        48
0        48
:        58
0        48
0        48
)        41
)        41
Fiddle
Update, per OP request:
Please note that in addition to the wrong character that caused the syntax error, the year in your 2nd datetime was wrong (2020 instead of 2022).
// Generation of mock table. Not part of the solution
let CustomData_CL = datatable(TimeGenerated:datetime)[datetime(2022-09-14 04:30:00)];
// Solution starts here
CustomData_CL
| where TimeGenerated between (datetime(2022-09-14 04:00:00) .. datetime(2022-09-14 05:00:00))
TimeGenerated
2022-09-14T04:30:00Z
Fiddle
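As an aside, if text containing these characters ever shows up in the data itself rather than in the query editor, it could be normalized with KQL's replace_string() function; a small sketch (the input string here is just an example):

// replace every en-dash (code point 8211) with a plain hyphen
print fixed = replace_string("2022–09–14 04:00:00", "–", "-")
// -> 2022-09-14 04:00:00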
I'm preparing the material for a KQL course, and I thought about creating a challenge, based on your question.
Check out what happened when I posted your code into Kusto Web Explorer... 🙂
How cool is that?!

Read Data from Specific Line in binary file

I realize this is similar to other questions that have been asked, but after poring over Stack Overflow and the Node.js documentation for several days, I decided it was time to ask for help.
Basically what I need to do is read data from a binary file and extract information to then convert into plain text.
I have been trying to use buffers but I am not seeing any progress.
If someone could get me to the point where I can read the entire file and decode it into plain text, I believe I can take it from there.
const fs = require('fs');
let buf = fs.readFileSync(__dirname + '/sd_bl.bin');
let buffer = new Buffer.alloc(Buffer.byteLength(buf));
buffer.write(buf.toString('base64'));
console.log(buffer);
This is what I currently have, and the output is:
<Buffer 41 41 45 43 41 77 51 46 42 67 63 41 41 51 49 44 42 41 55 47 42 77 41 42 41 67 4d 45 42 51 59 48 41 41 45 43 41 77 51 46 42 67 63 41 41 51 49 44 42 41 ... 7950 more bytes>
So it seems to be allocating and writing to the buffer correctly, but I don't know what the next steps should be as far as getting plain text out of it.
I am relatively new here, but here's a shot! I believe your console.log statement may be misleading you.
Console-logging buffer.toString() instead should be sufficient to get you the human-readable contents of the file.
Here is an example:
var buffer = fs.readFileSync('yourfile.txt')
console.log(buffer.toString())
Currently, your code creates a new buffer object (buffer) after reading the file, writes the base64-encoded contents of the old buffer into it, and then prints the new Buffer object itself instead of its string representation.
Edit: Screenshot of this in Terminal
For more on actually reading the file line by line, I would check out this other SO post: Read a file one line at a time in node.js?
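A minimal sketch of that line-by-line approach, assuming the file really contains UTF-8 text and using only built-in modules:

const fs = require('fs');
const readline = require('readline');

// Stream the file and emit one event per line.
const rl = readline.createInterface({
  input: fs.createReadStream(__dirname + '/sd_bl.bin'),
  crlfDelay: Infinity, // treat \r\n as a single line break
});

rl.on('line', (line) => {
  console.log(line); // process each decoded line here
});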

Where to place the return statement when defining a function to read in a file using with open(...) as ...?

I have a text file consisting of data that is separated by tab-delimited columns. There are many ways to read the data from the file into Python, but I am specifically trying to use a method similar to the one outlined below. When using a context manager like with open(...) as ..., I've seen that the general approach is to have all of the subsequent code indented within the with statement. Yet when defining a function, the return statement is usually placed at the same indentation as the first line of code within the function (excluding cases with awkward if-else constructs). In this case, both approaches work. Is one method considered correct or generally preferred over the other?
import numpy as np

def read_in(fpath, contents=[], row_limit=np.inf):
    """
    fpath is file location + file name + '.txt'
    contents is the initial data that the file data will be appended to
    row_limit is the maximum number of rows to be read (in case one would like to not read in every row)
    """
    nrows = 0
    with open(fpath, 'r') as f:
        for row in f:
            if nrows < row_limit:
                contents.append(row.split())
                nrows += 1
            else:
                break
        # return contents
    return contents
Below is a snippet of the text-file I am using for this example.
1996 02 08 05 17 49 263 70 184 247 126 0 -6.0 1.6e+14 2.7e+28 249
1996 02 12 05 47 26 91 53 160 100 211 236 2.0 1.3e+15 1.6e+29 92
1996 02 17 02 06 31 279 73 317 257 378 532 9.9 3.3e+14 1.6e+29 274
1996 02 17 05 18 59 86 36 171 64 279 819 27.9 NaN NaN 88
1996 02 19 05 15 48 98 30 266 129 403 946 36.7 NaN NaN 94
1996 03 02 04 11 53 88 36 108 95 120 177 1.0 1.5e+14 8.7e+27 86
1996 03 03 04 12 30 99 26 186 141 232 215 2.3 1.6e+14 2.8e+28 99
And below is a sample call.
fpath = "/Users/.../sample_data.txt"
data_in = read_in(fpath)
for i in range(len(data_in)):
    print(data_in[i])
(I realize that it's better to use chunks of pre-defined sizes to read in data, but the number of characters per row of data varies. So I'm instead trying to give user control over the number of rows read in; one could read in a subset of the rows at a time and append them into contents, continually passing them into read_in - possibly in a loop - if the file size is large enough. That said, I'd love to know if I'm wrong about this approach as well, though this isn't my main question.)
If your function needs to do other things after it is done with the file, you usually do them outside the with block, so in that case you would return outside the with block too.
However, if the purpose of your function is just to read in a file, you can return within the with block or outside it. I believe neither placement is preferred in this case.
I don't really understand your second question.
You can also put the return inside the with context.
On exiting the context, the cleanup is done. This is the power of with: you do not need to check all possible exit paths, and the exit handler is also called when an exception is raised inside the with block.
But if the file is empty (as an example), you should still return something. Keeping a single return outside the block makes the code clear and follows the principle of one exit path. If, however, you need to handle reaching the end of the file without finding something important, I would put the normal return inside the with context and handle the special case after it.
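A minimal sketch of that last pattern (the function name, file, and keyword are hypothetical, just to illustrate the placement):

def find_first_match(fpath, keyword):
    with open(fpath, 'r') as f:
        for row in f:
            if keyword in row:
                return row  # returning here still closes the file
    # reached end of file without a match: handle the special case here
    return None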

How to search, replace specific hex code in automated way

I have a 100M-row file that has some encoding problems -- it was "originally" EBCDIC, saved as US-ASCII, and is now UTF-8. I don't know much more about its heritage, sorry -- I've just been asked to analyze the content.
The "cents" character from EBCDIC is "hidden" in this file in random places, causing all sorts of errors. Here is more on this bugger: cents character in hex
Converting this file using iconv -f foo -t UTF-8 -c is not working -- the cents character remains.
When I use a hex editor, I can find occurrences of 0xC2 0xA2 (c2a2). But in a BIG file, this isn't practical. sed doesn't work at the hex level, so... I'm not sure about tr -- I've only really used it for carriage-return / newline work.
What Linux utility / command can I use to find and delete this character reasonably quickly in very big files?
There are 2 parts to this (see the sketch after this list):
1 -- a utility / command to find / count the number of these occurrences (octal \242)
2 -- a command to replace them (this works: tr '\242' ' ' < source > output)
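A minimal counting-and-deleting sketch, assuming bash (for the $'...' byte quoting) and GNU grep; it targets the single 0xA2 byte (octal 242) that the dumps below show:

# count the lines containing the stray byte, then the total number of occurrences
LC_ALL=C grep -c $'\xa2' source
LC_ALL=C grep -o $'\xa2' source | wc -l

# delete the byte outright instead of replacing it with a space
tr -d '\242' < source > output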
How the text appears on my ubuntu terminal:
1019EQ?IT DEPT GENERATED
With xxd, how it looks at hex level (ascii to the side looks the same as above):
0000000: 3130 3139 4551 a249 5420 4445 5054 2047 454e 4552 4154 4544 0d0a
With xxd, how it looks with "show ebcdic" -- here, just showing the ebcdic from side:
......s.....&....+........
So hex "a2" is the culprit. I'm now trying xxd -E foo | grep a2 to count the instances up.
Adding output from od -ctx1, rather than xxd, for those interested:
0000000 1 0 1 9 E Q 242 I T D E P T G
31 30 31 39 45 51 a2 49 54 20 44 45 50 54 20 47
0000020 E N E R A T E D \r \n
45 4e 45 52 41 54 45 44 0d 0a
When you say the file was converted, what do you mean? Was the binary file simply dumped from an IBM 360 to another ASCII-based computer, or was the file itself converted to ASCII when it was transferred?
The question is whether the file is actually in a well-encoded state or not. The other question is how you want the file encoded.
On my Mac (which uses UTF-8 by default, just like Linux systems), I have no problem using sed to get rid of the ¢ character:
Here's my file:
$ cat test.txt
This is a test --¢-- TEST TEST
$ od -ctx1 test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - ¢ ** - - T E S T T E S T \n
2d c2 a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000040
You can see that cat has no problems printing out that ¢ character. And, you can see in the od dump the c2a2 encoding of the ¢ character.
$ sed 's/¢/$/g' test.txt > new_test.txt
$ cat new_test.txt
This is a test --$-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - $ - - T E S T T E S T \n
2d 24 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here, sed has no problem changing that ¢ into a $ sign. The dump now shows that this test file is equivalent to a strictly ASCII encoded file: the two-byte encoding of the ¢ is now a nice clean single-byte $.
It looks like sed can handle your issue.
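If typing the literal ¢ is awkward, GNU sed also understands \xHH escapes in the pattern (a GNU extension, so this assumes GNU sed):

# delete the UTF-8 encoded cent sign (0xC2 0xA2) everywhere
sed 's/\xc2\xa2//g' test.txt > new_test.txt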
If you want to use this file on a Windows system, you can convert the file to the standard Windows Code Page 1252:
$ iconv -f utf8 -t cp1252 test.txt > new_test.txt
$ cat new_test.txt
This is a test --?-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - 242 - - T E S T T E S T \n
2d a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here's the file now in Code Page 1252, just the way Windows likes it! Note that the ¢ is now a single byte, which od prints as octal 242 (hex a2).
So, what exactly is the issue? Do you need the file restricted to the 127 characters pure ASCII defines? Do you need the file encoded so Windows machines can work on it? Are you having problems entering the ¢ character?
Let me know. I'm not from the government, and yet I'm here to help you.

Shell script printing contents of variable containing output of a command removes newline characters [duplicate]

This question already has answers here:
Capturing multiple line output into a Bash variable
(7 answers)
Closed 7 years ago.
I'm writing a shell script which will store the output of a command in a variable, process the output, and later echo the results. Here's what I've got:
stuff=$(diff -u pens tape)
# process the output
echo $stuff
The problem is, the output I get from running the script is this:
--- pens 2009-09-27 10:29:06.000000000 -0400 +++ tape 2009-09-18 16:45:08.000000000 -0400 ## -1,4 +1,2 ## -highlighter -marker -pencil -POSIX +masking +duct
Whereas I was expecting this:
--- pens 2009-09-27 10:29:06.000000000 -0400
+++ tape 2009-09-18 16:45:08.000000000 -0400
## -1,4 +1,2 ##
-highlighter
-marker
-pencil
-POSIX
+masking
+duct
It looks like the newline characters are being removed somehow. How do I get them to stay in?
If you want to preserve the newlines, enclose the variable in double quotes:
echo "$stuff"
When you write it without the double quotes, the shell expands $stuff into a space-separated list of words. Word splitting happens at the characters in $IFS, which by default contains blank, tab, and newline; upon experimentation, it seems that form feeds, carriage returns, and backspaces are not counted as separators.
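A quick way to see this without any helper programs, assuming bash for the $'...' quoting:

stuff=$'line one\nline two'
echo $stuff     # unquoted: the newline becomes a word separator -> line one line two
echo "$stuff"   # quoted: the newline is preserved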
The fuller demonstration below covers the interpretation of control characters as white space. ASCII 8 is backspace, 9 is tab, 10 is newline (LF), 11 is vertical tab, 12 is form feed, and 13 is carriage return. The first command generates a sequence of characters separated by the various control characters. The second command echoes the result with the original characters preserved - see the hex dump. The third command echoes the result with the shell splitting the words; you can see that the tab and newline were replaced by a blank (0x20).
$ x=$(./ascii 64 65 8 66 67 9 68 69 10 70 71 11 72 73 12 74 75 13 76 77)
$ echo "$x" | odx
0x0000: 40 41 08 42 43 09 44 45 0A 46 47 0B 48 49 0C 4A #A.BC.DE.FG.HI.J
0x0010: 4B 0D 4C 4D 0A K.LM.
0x0015:
$ echo $x | odx
0x0000: 40 41 08 42 43 20 44 45 20 46 47 0B 48 49 0C 4A #A.BC DE FG.HI.J
0x0010: 4B 0D 4C 4D 0A K.LM.
0x0015:
$
