I have the output from crontab -l containing about 1500 lines.
I need to sort them based on the frequency of execution.
I tried a Perl module named Crontab::Interval, but it didn't seem to accept jobs written with a range, such as * 4-20 * * *.
The goal is to identify all jobs that run more than once every hour.
I suggest filtering with awk pattern-matching logic.
Print only the lines where: field #1 does not match * AND field #2 does not match *
crontab -l | awk '$1 !~ /\*/ && $2 !~ /\*/' | sort
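If the goal is strictly "jobs that run more than once every hour", a slightly broader filter may help. The following is only a sketch based on my own reading of the schedule fields (it skips comments and blank lines, but does not handle variable assignments or @reboot-style entries): a job fires several times an hour whenever its minute field is anything other than a single number, e.g. *, */15, 0,30 or 5-10.
crontab -l | awk '!/^[[:space:]]*(#|$)/ && $1 !~ /^[0-9]+$/' | sort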
I'm gathering tons of data in a stream on an Ubuntu machine; the data is stored in daily packages (each day_file is somewhere between 1 and 5 GB). I'm not an experienced linux/bash/awk user, but the data looks something like this (all lines start with a date):
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
Now to the problem: the stream is cut around midnight local time (for a few reasons it can't be cut at exactly 00:00:00 GMT). This means that rows from two dates are stored in the same file, and I want to separate them into the correct date files. I wrote the following script to separate the rows; it works, but it takes several hours to run, and I think there must be a faster way of doing this.
#!/bin/bash
dateDiff () {
    line_str="$1"
    dte1="2020-09-01"
    dte2=${line_str:0:10}
    if [[ "$dte1" == "$dte2" ]]; then
        echo "$line_str" >> correct_date.txt
    else
        echo "$line_str" >> wrong_date.txt
    fi
}

IFS=$'\n'
for line in $(cat massive_file.txt)
do
    dateDiff "$line"
done
unset IFS
Using this awk script I'm able to process a 10 GB file in approximately 1 minute on my machine.
awk '{ if ($0 ~ /^2020-08-31/) { print $0 > "correct.txt" } else { print $0 > "wrong.txt" } }' input_file_name.txt
Each line is checked against a regular expression containing your date, and the whole line is then printed to one of the two files depending on whether it matches.
Using awk with T as your field separator, the first field, $1, will be the date. Then you can output each record to a file named for the date.
$ cat file
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
$ awk -FT '{ print > ($1 ".txt") }' file
$ ls 20*.txt
2020-08-31.txt 2020-09-01.txt
$ cat 2020-09-01.txt
2020-09-01T00:00:00Z !In a unreadable way
$ cat 2020-08-31.txt
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
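One caveat worth adding (my note, not part of the answer above): inside awk, > truncates each output file the first time it is written during a run, so if per-date files already exist from earlier runs and you want to keep their contents, use the append form instead:
$ awk -FT '{ print >> ($1 ".txt") }' file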
Some notes:
Using a bash loop to read logs would be very slow.
Using awk, sed, grep or similar is very good, but you will still have to read and write the whole file line by line, and this has a performance ceiling.
For your specific case, you could instead identify only the split points, of which there can be 3, not just 2 (previous, current and next day logs can co-exist in a file), with something like grep -nm1 "^$day", and then split the log file with a combination of head and tail, as in the sketch below. Then append or prepend the pieces to the existing files. This would be a very fast solution because you would write the files in bulk, not line by line.
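A rough sketch of that idea, assuming for simplicity that massive_file.txt contains only two days and that $day holds the later one (the file names here are hypothetical, and there is no error handling for a missing $day):
# line number of the first record belonging to $day
split=$(grep -nm1 "^$day" massive_file.txt | cut -d: -f1)
# everything before that line belongs to the previous day
head -n "$((split - 1))" massive_file.txt >> previous_day.txt
# from that line to the end belongs to $day
tail -n "+$split" massive_file.txt >> "$day.txt"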
Here is a simple solution with grep, as you only need to test the first 10 characters of each log line, and for this job grep is faster than awk.
Assuming that you store logs in a destination directory, every incoming file should pass through something like this script (a sample invocation follows it). The order of processing is important: you have to follow the date order of the files, which is why I append to the existing files. This is just a demo solution for guidance.
#!/bin/bash
[[ -f "$1" ]] && f="$1" || { echo "Nothing to do"; exit 1; }
dest_dir=archive/
suffix="_file.log"
curr=${f:0:10}
prev=$( date -d "$curr -1 day" "+%Y-%m-%d" )
next=$( date -d "$curr +1 day" "+%Y-%m-%d" )
for d in $prev $curr $next; do
    grep "^$d" "$f" >> "${dest_dir}${d}${suffix}"
done
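A hypothetical invocation (the script name and file names are my own), assuming each incoming file name starts with its date, which is what curr=${f:0:10} relies on:
# process the day files in date order so the appends land in the right sequence
for f in 2020-08-31_stream.log 2020-09-01_stream.log; do
    ./split_incoming.sh "$f"
done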
I bought a new laptop, but for some reason the built-in clock is losing 15 minutes per day. The company I bought it from replaced the CMOS battery, but that didn't make a difference. I ended up putting the following in my Ubuntu crontab:
* * * * * date -s "$(wget -S "http://www.google.com/" 2>&1 | grep -E '^[[:space:]]*[dD]ate:' | sed 's/^[[:space:]]*[dD]ate:[[:space:]]*//' | head -1l | awk '{print $1, $3, $2, $5 ,"GMT", $4 }' | sed 's/,//')"
It works great when I have an Internet connection (at least while I'm in a GMT timezone, as Google returns GMT time), but when I'm offline it sets the time to the current date at 00:00. That is, date -s "" changes the time to 00:00.
Is there some flag I can pass to date to tell it not to change anything on empty input? Or should I modify the other parts of the cron job?
If you need conditionality in a crontab line, remember that cron runs everything with the Bourne shell, so you can use the shell conditionals && and ||:
* * * * * foo=$(wget...) ; test -z "$foo" || date -s "$foo"
Assign the output of your wget pipeline to foo, test whether foo is empty, and only if it is not empty use it to set the date.
I want to view the entries in Linux /var/log/syslog, but I only want to see the entries since the last time I looked (preferably via a bash script). The solution I thought of was to take a copy of syslog and diff it against the previous copy, but this seems unclean because syslog can be big and diff adds artifacts to its output. I'm thinking of maybe using tail directly on syslog, but I don't know how to do this when I don't know how many lines have been added since last time. Any better thoughts? I would like to be able to redirect the result to a file so I can later interactively grep for specific parts of interest.
Linux has a wc command which can count the number of lines in a file, for example
wc -l /var/log/syslog. The bash script below stores the output of that wc -l command in a file called ./prevlinecount. Whenever you want just the new lines of a file, it reads the value in ./prevlinecount, takes a fresh count with wc -l (newlinecount), and then tails the last (newlinecount - prevlinecount) lines.
#!/bin/bash
# $1 is the file to track, e.g. /var/log/syslog
prevlinecount=$(cat ./prevlinecount 2>/dev/null)
if [ -z "$prevlinecount" ]; then
    # first run: remember the current line count and show the whole file
    wc -l "$1" | awk '{ print $1 }' > ./prevlinecount
    tail -n +1 "$1"
else
    newlinecount=$(wc -l "$1" | awk '{ print $1 }')
    # show only the lines added since the previous run
    tail -n "$(expr "$newlinecount" - "$prevlinecount")" "$1"
    echo "$newlinecount" > ./prevlinecount
fi
Beware: this is a very rudimentary script which can only keep track of one file. If you would like to extend it to multiple files, look into associative arrays: you could track several files by using the file name as the key and the previous line count as the value (see the sketch below).
Beware too that syslog files are archived once they reach a predetermined size (maybe 10 MB), and this script does not account for the archival process.
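To illustrate the associative-array idea, here is a minimal sketch (the script, file and path names are my own invention); it keeps one "<file> <count>" pair per tracked log in ./prevlinecounts:
#!/bin/bash
# usage: ./newlines.sh /var/log/syslog /var/log/auth.log ...
declare -A prev
state=./prevlinecounts

# load the previously recorded line counts, if any
[ -f "$state" ] && while read -r file count; do prev[$file]=$count; done < "$state"

: > "$state.new"
for f in "$@"; do
    new=$(wc -l < "$f")
    old=${prev[$f]:-0}
    tail -n "$((new - old))" "$f"        # only the lines added since the last run
    echo "$f $new" >> "$state.new"
done
mv "$state.new" "$state"                 # note: files not listed this run lose their saved count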
I have multiple files located in multiple directories. In them I search for the keyword 'ENERGY' with grep. In each file I get multiple matches. I want to take the last matching line from each file and save the results in the output.txt file. I wrote the following code:
labl=SubDir
ENERGY=`grep 'ENERGY' MyDir*${labl}*/*.txt`
cat > output.txt << EOF
${ENERGY}
EOF
This code saves all matches from each file. But as mentioned, I need only the last match from each file. For that I modified the grep command to:
ENERGY=`grep 'ENERGY' MyDir*${labl}*/*.txt | tail -1`
Unfortunately this doesn't do the job either. Instead, it saves the match cases from the last file only.
How to solve it?
Please don't run multiple processes/pipes to achieve this.
gawk '/ENERGY/{last=$0} ENDFILE{if(last!="") print last; last=""}' MyDir*"$labl"*/*.txt
/ENERGY/{last=$0}: On lines which match the regex ENERGY, set variable last to the contents of the entire line $0
ENDFILE{...} Run this {action} at the end of every input file supplied by the glob (ENDFILE is a GNU awk extension; see the note after this list).
if(last!="") print last: print last if it's not null
last="": reset this variable to null, avoiding duplication
MyDir*"${labl}"*/*.txt: Quoted variable in glob will match directory names that include spaces
Use a for loop:
for f in MyDir*"$lab1"*/*.txt; do
grep ENERGY "$f" | tail -1 >> output.txt
done
Yet another possible approach is to use parallel, like this. You can probably achieve the same with xargs (a sketch follows), but I personally prefer parallel as it is simpler and makes it easy to scale the process.
ls -1 file* | parallel -j1 "grep ENERGY {} | tail -n 1" > output.txt
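For completeness, roughly the same thing with xargs (my own equivalent, shown only because the answer mentions xargs); one grep | tail pipeline is started per file, and file names are assumed not to contain newlines:
printf '%s\n' MyDir*"$labl"*/*.txt |
    xargs -I{} sh -c 'grep ENERGY "$1" | tail -n 1' _ {} > output.txt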
Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt the lines whose first 2 fields also appear in setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But comm compares whole lines, so this won't work. How can I do this?
$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
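As a side note (my own addition, not part of the answer above): the complementary operation, keeping only the lines of setA.txt whose first two fields do appear in setB.txt, just drops the negation:
awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} seen[$1,$2]' setB.txt setA.txt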
This should work:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p' setB.txt |sed -f- setA.txt
How this works:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p'
generates an output:
/^c|d/d
/^a|b/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3
(IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
Explanation: grep -q means only test whether grep can find the regexp, without printing anything; -P means use Perl regex syntax, so that the | is matched literally thanks to the \Q...\E construct.
IFS=$'|' makes bash use | instead of whitespace (space, tab, etc.) as the token separator.
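A quick illustration of what \Q...\E buys you (the sample strings are made up): without it, the | in the pattern is treated as Perl-regex alternation.
printf 'a|b|100\naXb|100\n' | grep -P '\Qa|b\E'    # literal match: prints only a|b|100
printf 'a|b|100\naXb|100\n' | grep -P 'a|b'        # alternation: prints both lines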