Bash Script Efficient For Loop (Different Source File) - linux

First of all, I'm a beginner in bash scripting. I usually code in Java, but this particular task requires me to create some bash scripts in Linux. FYI, I've already made a working script, but I think it's not efficient enough given the large files I'm dealing with.
The problem is simple: I have two logs that I need to compare, making some corrections to one of them... I'll call them log A and log B. The two logs have different formats; here is an example:
01/04/2015 06:48:59|9691427842139609|601113090635|PR|30.00|8.3|1.7|   <- log A
17978712 2015-04-01 06:48:44 601113090635 SUCCESS DealerERecharge.    <- log B
17978714 2015-04-01 06:48:49 601113090635 SUCCESS DealerERecharge.    <- log B
As you can see, there is a gap between the timestamps. The log B line that actually matches log A is the one with ID 17978714, because its time is the closest. The largest time gap I've seen is one minute. I can't use range logic, because if more than one line in log B falls within the one-minute range, then all of those lines will show up in my regenerated log.
The script I made contains a for loop which iterates over the timestamps of log A until it hits something in log B (the first one it hits is the closest).
Inside the for loop I have this line of code, which makes the loop slow:
LINEOUTPUT=$(less $File2 | grep "Validation 1" | grep "Validation 2" | grep "Timestamp From Log A")
I've read some samples using sed, but the problem is that I have two more validations to consider before matching on the timestamp.
The validations work as filters to narrow down the exact match between logs A and B.
Additional info: I tried benchmarking the script by timing the loop. One thing I've noticed is that even though I only use one pipeline in that script, each loop iteration is still slow.
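One way to speed this up is to avoid re-scanning log B for every line of log A: load log B into memory once, then find the closest timestamp per line of log A in a single awk run. A minimal sketch, assuming GNU awk (for mktime) and the two field layouts shown above, matching on the subscriber number; the two extra validations would slot in as additional conditions:

#!/bin/bash
# Sketch only: one pass over log B, then one pass over log A.
awk '
NR == FNR {                          # first file: log B
    ts = $2 " " $3
    gsub(/[-:]/, " ", ts)            # "2015-04-01 06:48:44" -> "2015 04 01 06 48 44"
    n[$4]++                          # index log B entries by subscriber number
    t[$4, n[$4]]  = mktime(ts)
    id[$4, n[$4]] = $1
    next
}
{                                    # second file: log A
    split($0, f, "|")
    split(f[1], d, /[\/ :]/)         # "01/04/2015 06:48:59" is dd/mm/yyyy
    epoch = mktime(d[3] " " d[2] " " d[1] " " d[4] " " d[5] " " d[6])
    best = ""; bestd = 61            # anything over one minute counts as no match
    for (i = 1; i <= n[f[3]]; i++) {
        delta = epoch - t[f[3], i]
        if (delta < 0) delta = -delta
        if (delta < bestd) { bestd = delta; best = id[f[3], i] }
    }
    print $0, (best != "" ? best : "NO-MATCH")
}' logB logA

Since log B is read only once, the in-memory array lookups replace the per-iteration less | grep pipeline entirely.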

Related

How do I modify/subset a wget script to specify a date range to only download certain years into different scripts?

I am trying to download a lot of data for some research from the CMIP6 website (https://esgf-node.llnl.gov/search/cmip6/) that provides wget scripts for each model.
The scripts are for every 6 hours or month from 1850 to 2014. The date format looks like this (1st script): 185001010600-185101010000 or (for 2nd script) 195001010600-195002010000, 195002010600-195003010000
My goal is to turn one giant script into several smaller ones, with five years each, for 1980 to 2015.
As an example, I would want to subset the main script into different scripts with 5 year intervals ("19800101-19841231" then "19850101-19901231", etc.) with each named wget-1980_1985.sh, wget-1985_1990.sh, respectively
For an example date range for the 2nd script, I would need:
197912010600 through 198601010000, then every 5 years after that
I'm a beginner so please help if you can!
Part of the wget script format for each file looks like this (it won't let me copy and paste the whole thing since there are too many links [see below to find the file yourself]):
1.) #These are the embedded files to be downloaded
download_files="$(cat <<EOF
'hus_6hrPlevPt_MIROC6_historical_r1i1p1f1_gn_185001010600-185101010000.nc' 'http://esgf-data2.diasjp.net/thredds/fileServer/esg_dataroot/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrPlevPt/hus/gn/v20191204/hus_6hrPlevPt_MIROC6_historical_r1i1p1f1_gn_185001010600-185101010000.nc' 'SHA256'
'fa9ac4149cc700876cb10c4e681173bcc0040ea03b9a439d1c66ef47b0253c5a'
'hus_6hrPlevPt_MIROC6_historical_r1i1p1f1_gn_185101010600-185201010000.nc' 'http://esgf-data2.diasjp.net/thredds/fileServer/esg_dataroot/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrPlevPt/hus/gn/v20191204/hus_6hrPlevPt_MIROC6_historical_r1i1p1f1_gn_185101010600-185201010000.nc' 'SHA256'
'4ef4f99aa34aae6dfdafaa4aab206344125abe7808df675d688890825db53047'
2.) For the second script, the dates look like this: 'ps_6hrLev_MIROC6_historical_r1i1p1f1_gn_195001010600-195002010000.nc'
To run it, you just download the script from the website (see below), or downloading from these links should work:
1.) https://esgf-node.llnl.gov/esg-search/wget/?distrib=false&dataset_id=CMIP6.CMIP.MIROC.MIROC6.historical.r1i1p1f1.6hrPlevPt.hus.gn.v20191204|esgf-data2.diasjp.net
2.) A similar script can be seen here (the dates are different but I need this one too):
https://esgf-node.llnl.gov/esg-search/wget/?distrib=false&dataset_id=CMIP6.CMIP.MIROC.MIROC6.historical.r1i1p1f1.6hrLev.ps.gn.v20191114|esgf-data2.diasjp.net
To run the script in the terminal, this is the command I use:
bash wget* -H
and it will download each file.
I can vi the script and delete each file entry I don't need (using "dd"), but this would be extremely time-consuming.
To find this data and get the wget script from the website, go to: https://esgf-node.llnl.gov/search/cmip6/
and select the variables on the left side of the page as follows:
Source ID: MIROC6,
Experiment ID: Historical,
Variant Label: r1i1p1f1,
Table ID: 6hrPlevPt,
and Variable: hus
*If these files are too big, you can also select Frequency: monthly instead for a much smaller file. I just want you to see the date format, since monthly is just the month and year.
Then hit search and it will give you one model to download. On the bottom, with the links, it will say "wget script." Click that and it will download.
You can
vi wget*
to view and/or edit it or
bash wget* -H
to run/download each file.
It might ask you to log in, but I've found that typing nonsense into the username and password fields still starts the download.
Please help! This will be the next 6 months of my life and I really don't want to "dd" every file I don't need for all of these!
A bash for loop can generate the relevant date ranges and output filenames.
A simple sed script can delete relevant lines if they appear in order.
For example:
#!/bin/bash
in=esgf_script
for y in $(seq 1979 5 2014); do
    out="wget_${y}-$((y+4)).sh"
    sed '/_gn_/{ # if some kind of url:
        /_gn_'$((y+5))'/,$ d; # delete if year >= y+5
        /_gn_2015/,$ d; # delete if year >= 2015
        /_gn_'$y'/,$ !d; # delete if year < y
    }' <"$in" >"$out"
done
The seq command generates every fifth year starting from 1979 up to 2014.
The sed script:
looks for lines containing urls: /_gn_/
deletes from the first URL whose year is y+5 or greater (or 2015 or greater) through to the end
otherwise, deletes every line before the first URL whose year is at least y
This code assumes that:
no lines except urls contain the first regex (/_gn_/)
the urls appear in ascending year order (e.g. urls containing 1994 cannot appear before ones containing 1993)
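For example, assuming the big script was saved as esgf_script and the loop above has been run, you could sanity-check which years ended up in a chunk before downloading it (the grep pattern just pulls the start year out of each _gn_ filename):

grep -o '_gn_[0-9]\{4\}' wget_1984-1988.sh | sort -u   # expect _gn_1984 .. _gn_1988
bash wget_1984-1988.sh -H                              # then download that chunk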

Linux Date not showing the date value sometimes

I have defined a variable inside one of my shell scripts to create a file name with the date value in it.
I used the "date +%Y%m%d" command to insert the current date, stored in the date_val variable.
And I have defined the filename variable as "${path}/sample_${date_val}.txt".
For few days it was creating the file name properly as /programfiles/sample_20180308.txt
But today the filename was created without date as /programfiles/sample_.txt
When I execute the command "date +%Y%m%d" in Linux, it returns the correct value: 20180309.
Any idea why the filename was created without the date value? I did not modify anything in my script either, so I'm wondering what might have gone wrong.
Sample excerpt of my script is given below for easy understanding :
EDITED
path=/programfiles
date_val=$(date +%Y%m%d )
file_name=${path}/sample_${date_val}.txt
Although incredibly unlikely, it's certainly possible for date to fail, based on the source code. Under the covers, it calls either clock_gettime() or gettimeofday(), both of which can fail.
The date program will also refuse to write anything to standard output if the time it gets back from either of those two functions is out of range (which is possible if they fail).
It's also possible that the date program could "disappear" for various reasons, such as being hidden, having its permissions changed, or a shortage of resources like file handles when attempting to open the executable.
As mentioned, all these possibilities are a stretch, unlikely to happen in the real world.
If you want to handle the case where you get inadequate output from date, you can simply retry until you get a valid value, something like this (possibly with a limit added, to detect the case where it never succeeds):
todaysDate="$(date +%Y%m%d)"
while [[ ! $todaysDate =~ ^[0-9]{8}$ ]] ; do
    sleep 1
    todaysDate="$(date +%Y%m%d)"
done
# todaysDate now guaranteed to be eight digits.
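As a sketch of the limit mentioned above (the cap of five attempts is arbitrary):

tries=0
todaysDate="$(date +%Y%m%d)"
while [[ ! $todaysDate =~ ^[0-9]{8}$ ]] && (( ++tries < 5 )); do
    sleep 1
    todaysDate="$(date +%Y%m%d)"
done
if [[ ! $todaysDate =~ ^[0-9]{8}$ ]]; then
    echo "date keeps producing bad output, giving up" >&2
    exit 1
fi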

Using nohup to help run a loop of python code while disconnecting from ssh

I'm looking for help running a python script that takes some time to run.
It is a long-running process that takes about 2 hours per test observation. For example, these observations could be the 50 states of the USA.
I don't want to babysit this process all day; I'd like to kick it off and then drive home from work, or have it run while I'm sleeping.
Since this is a loop, I would need one python script that loops through my code, going over each of the 50 states, and a second that runs my actual code that does things.
I've heard of nohup, but I have very limited knowledge. I saw nohup ipython mypython.py, but when I google I get a lot of other people chiming in with other methods, so I don't know what the ideal approach is for someone like me. Additionally, since I am essentially looking to run this as a loop, I don't know how that complicates things.
Please give me something simple and easy to understand. I don't know Linux all that well or I wouldn't be asking; this seems like a common sort of command/activity...
Basic example of my code:
Two files: code_file.py and loop_file.py
code_file.py does all the work; loop_file.py just passes in the list of things to run it for.
code_file.py
import sys

each_state = sys.argv[1]  # the state passed in on the command line
output = each_state + ' needs some help!'
print(output)
loop_file.py
states = ['AL','CO','CA','NY','VA','TX']
for each_state in states:
    code_file.py  # pseudocode: run code_file.py for this state (see answer below)
Regarding the loop: I have also heard that I can't pass in parameters (or something) via nohup? I can fix that part within my python code... for example, reading from a CSV in my code, deleting the current record from that CSV file, and then rewriting it out. That way I can always select the top record in the CSV file for my loop (the list of states).
Maybe you could modify your loop_file.py like this:
import os

states = ['AL','CO','CA','NY','VA','TX']
for each_state in states:
    os.system("python /dir_of_your_code/code_file.py " + each_state)
Then in a shell, you could run the loop_file.py with:
nohup python loop_file.py &  # & runs it in the background; nohup keeps it running after you log out and redirects its output to a file named nohup.out instead of printing it on screen.
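Once it's launched you can log out; to check progress later, follow the log file nohup creates:

tail -f nohup.out   # streams the script's output as each state finishes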

In Bash, How would you only read lines in a log past a certain timestamp?

So right now I'm trying to do the following:
grep "string" logfile.txt
which is going fine, but there's a lot of "string" in logfile.txt; I really only want to see the last hour's worth. In pseudo-code I want to do...
grep "string" logfile.txt OnlyShowThoseInTheLastHour
Is there any way to easily accomplish this in bash? In the logfile the lines look like this:
13:27:50 string morestuff morestuff morestuff
Edit: sorry, I forgot to mention it, but seeing logs from similar hours on past days is not an issue, as these logs are refreshed/archived daily.
This should do it:
awk 'BEGIN { tm = strftime("%H:%M:%S", systime()-3600) } /string/ && $1 >= tm' logfile.txt
Replace string by the pattern you're interested in.
It works by first building a string holding the time from one hour ago in HH:MM:SS format, and then selecting only those lines that match string and whose first field (the timestamp) is lexicographically greater than or equal to the timestamp string just built. (Note that strftime and systime are GNU awk extensions.)
Note that it has its limitations, for example, if you do this at 00:30, log entries from 23:30 through 23:59 will not match. In general, running this command anytime between 00:00 and 00:59 will possibly omit log entries from 23:00 through 23:59. However, this shouldn't be an issue for you, since you mentioned (in the comments) that logs archive and start fresh every midnight.
Also, leap seconds are not dealt with, but this is probably not a problem unless you need 100% precise results. And again - since logs start afresh at midnight, in your specific case this is not a problem at all.
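For illustration, supposing it is currently 14:00:12: the BEGIN block computes the boundary one hour earlier, and any line whose first field compares greater than or equal to it, such as the 13:27:50 sample line above, is printed:

$ awk 'BEGIN { print strftime("%H:%M:%S", systime()-3600) }'
13:00:12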

Handle "race-condition" between 2 cron tasks. What is the best approach?

I have a cron task that runs periodically. This task depends on a condition being valid in order to complete its processing. In case it matters, the condition is just a SELECT for specific records in the database. If the condition is not satisfied (i.e. the SELECT does not return the expected result set), the script exits immediately.
This is bad, as the condition will become valid soon enough (I don't know how soon, but it will, due to the run of another script).
So I would like to somehow make the script more robust. I thought of two solutions:
1. Put in a while loop and sleep constantly until the condition is valid. This should work, but it has the downside that once the script is in the loop, it is out of control. So I thought that, additionally, after waking up it could check whether a specific file exists; if it does, it "understands" that the user wants to force-stop it. (A sketch of this idea appears below.)
2. Once the script figures out that the condition is not valid yet, it appends a script to the crontab and stops. That second script continually polls for the condition and, if the condition is valid, restarts the first script to resume its processing. This seems like it would work, but I am not sure it is a good solution. E.g., perhaps programmatically modifying the crontab is a bad idea?
Anyway, I thought that perhaps this problem is common and has a standard solution, much better than the two I came up with. Does anyone have a better proposal? Which of my ideas would be best? I am not very experienced with cron tasks, so there could be things/problems I am overlooking.
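A minimal sketch of option 1, with hypothetical names for the condition check (condition_is_met) and the stop file (/tmp/myjob.stop):

#!/bin/bash
# Poll until the condition holds, honouring a user-created stop file.
until condition_is_met; do
    if [ -e /tmp/myjob.stop ]; then
        echo "stop file found, exiting" >&2   # user asked us to force-stop
        exit 1
    fi
    sleep 60                                  # wait before re-checking
done
# ... normal processing goes here ...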
Instead of programmatically appending to the crontab, you might want to consider using at to schedule the job to run again at some time in the future. If the script determines that it cannot do its job now, it can simply schedule itself to run again a few minutes (or a few hours, as the case may be) later by way of an at command.
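For example, a sketch of that idea (the script path, delay, and condition check are placeholders):

# at the top of the cron script: bail out and reschedule if not ready yet
if ! condition_is_met; then
    echo "/path/to/this_script.sh" | at now + 5 minutes
    exit 0
fi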
Following up from our conversation in comments, you can take advantage of conditional execution in a cron entry. Supposing you want to branch based on time of day, you might use the output from date.
For example: this would always invoke the first command, then invoke the second command only if the clock hour is currently 11:
echo 'ScriptA running' ; [ $(date +%H) == 11 ] && echo 'ScriptB running'
More examples!
To check the return value from the first command:
echo 'ScriptA' ; [ $? == 0 ] && echo 'ScriptB'
To instead check the STDOUT, you can use a colon as a no-op and branch by capturing output with the same $() construct we used with date:
: ; [ $(echo 'ScriptA') == 'ScriptA' ] && echo 'ScriptB'
One downside to the last example: STDOUT from the first command won't be printed to the console. You could capture it into a variable which you then echo out, or write it to a file with tee, if that's important.
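For instance, tee keeps the first command's output in a file while still allowing the comparison (/tmp/scriptA.log is an arbitrary path):

: ; [ "$(echo 'ScriptA' | tee -a /tmp/scriptA.log)" == 'ScriptA' ] && echo 'ScriptB'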
