How to view syslog entries since last time I looked - linux

I want to view the entries in Linux /var/log/syslog, but I only want to see the entries added since the last time I looked (preferably with a bash script). The solution I thought of was to take a copy of syslog and diff it against the copy from last time, but this seems unclean because syslog can be big and diff adds artifacts to its output. I'm thinking of maybe using tail directly on syslog, but I don't know how to do that when I don't know how many lines have been added since last time. Any better thoughts? I would like to be able to redirect the result to a file so I can later interactively grep for specific parts of interest.

Linux has a wc command which counts the number of lines in a file, for example
wc -l /var/log/syslog. The bash script below stores that line count in a file called ./prevlinecount. Whenever you want just the new lines, it reads the value in ./prevlinecount, takes a fresh count with wc -l (newlinecount), and tails the last (newlinecount - prevlinecount) lines of the file given as the first argument.
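#!/bin/bash
# Pass the log file as the first argument, e.g. /var/log/syslog.
# Read the previous line count; it is empty on the first run.
prevlinecount=$(cat ./prevlinecount 2>/dev/null)
newlinecount=$(wc -l < "$1")
if [ -z "$prevlinecount" ]; then
    # First run: print the whole file.
    tail -n +1 "$1"
else
    # Later runs: print only the lines added since the last run.
    tail -n "$((newlinecount - prevlinecount))" "$1"
fi
echo "$newlinecount" > ./prevlinecount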
Beware:
This is a very rudimentary script which can only keep track of one file. If you would like to extend it to multiple files, look into associative arrays. With associative arrays you could keep track of multiple files by using the filename as the key and the previous line count as the value.
Beware too that syslog files are archived (rotated) once the file reaches a predetermined size (maybe 10MB), and this script does not account for the archival process.
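As a minimal sketch of how the rotation caveat could be handled (not part of the script above): if the current line count is smaller than the stored one, assume the log was rotated and start again from the first line. The function name show_new_lines and the separate state file argument are assumptions made for illustration.

```shell
# show_new_lines LOGFILE STATEFILE
# Prints the lines added to LOGFILE since the previous call and stores the
# new line count in STATEFILE (a hypothetical per-log state file).
show_new_lines() {
    local log state prev now
    log=$1
    state=$2
    prev=0
    # A missing state file means this is the first run: start from line 0.
    if [ -f "$state" ]; then
        prev=$(cat "$state")
    fi
    now=$(( $(wc -l < "$log") ))
    # Fewer lines than last time suggests the log was rotated: start over.
    if [ "$now" -lt "$prev" ]; then
        prev=0
    fi
    # head caps the output at the counted lines, so anything appended after
    # the wc call is left for the next run instead of being printed twice.
    tail -n "+$((prev + 1))" "$log" | head -n "$((now - prev))"
    echo "$now" > "$state"
}
```

It could be called as show_new_lines /var/log/syslog ~/.syslog.linecount > new_entries.txt, which leaves a file you can grep later. Lines written to the old log between the last run and the rotation are still missed by this sketch.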


Find and copy specific files by date

I've been trying to get a script working to backup some files from one machine to another but have been running into an issue.
Basically what I want to do is copy two files, one .log and one (or more) .dmp. Their format is always as follows:
something_2022_01_24.log
something_2022_01_24.dmp
I want to do three things with these files:
find the second to last .log file (i.e. if something_2022_01_24.log is the latest, I want to find the one before that, say something_2022_01_22.log)
get a substring with just the date (2022_01_22)
copy every .dmp that matches the date (i.e. something_2022_01_24.dmp, something01_2022_01_24.dmp)
For the first one, from what I could find, the best way is to do: ls -t *.log | head -2 as it displays the second to last file created.
As for the second one I'm more at a loss because I'm not sure how to parse the output of the first command.
The third one I think I could manage with something of the sort:
[ -f "/var/www/my_folder/*$capturedate.dmp" ] && cp "/var/www/my_folder/*$capturedate.dmp" /tmp/
What do you think, is there any way to do this? How can I compare the substring?
Thanks!
Would you please try the following:
#!/bin/bash
dir="/var/www/my_folder"
second=$(ls -t "$dir/"*.log | head -n 2 | tail -n 1)
if [[ $second =~ .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log ]]; then
capturedate=${BASH_REMATCH[1]}
cp -p "$dir/"*"$capturedate".dmp /tmp
fi
second=$(ls -t "$dir/"*.log | head -n 2 | tail -n 1) picks the second to last log file. Please note that it assumes the file's timestamp has not been modified since the file was created and that the filename does not contain special characters such as a newline. This is an easy solution, and it may need improvement for robustness.
The regex .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log matches the log filename. It captures the date substring (enclosed in the parentheses), which bash stores in ${BASH_REMATCH[1]}.
The cp command then does the job. Please be careful not to include the wildcard * within the double quotes, so that the wildcard is properly expanded.
FYI here are some alternatives to extract the date string.
With sed:
capturedate=$(sed -E 's/.*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log/\1/' <<< "$second")
With parameter expansion of bash (if something does not include underscores; the directory is stripped first because my_folder itself contains an underscore):
capturedate=${second##*/}
capturedate=${capturedate%.log}
capturedate=${capturedate#*_}
With cut command (under the same assumption):
capturedate=$(basename "$second" .log | cut -d_ -f2,3,4)
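On the robustness point, here is a hedged sketch of an alternative that does not depend on file timestamps at all: because the date in the name is written as YYYY_MM_DD, the dates themselves sort correctly as plain text. The function name second_latest_date is made up for this example, and it still assumes filenames without newlines.

```shell
# second_latest_date DIR
# Prints the second most recent date (YYYY_MM_DD) found in the names of
# DIR/*.log, or nothing if fewer than two distinct dates exist.
second_latest_date() {
    local f
    for f in "$1"/*.log; do
        [ -e "$f" ] || continue        # the glob matched nothing
        basename "$f" .log
    done |
        # Keep only the trailing date of each name, e.g. 2022_01_22.
        sed -n 's/.*_\([0-9]\{4\}_[0-9]\{2\}_[0-9]\{2\}\)$/\1/p' |
        # Newest first, duplicates removed, then take the second line.
        sort -ru | sed -n '2p'
}
```

Usage would be capturedate=$(second_latest_date /var/www/my_folder), followed by the same cp command as above.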

Separating lines of a huge file into two files depending on the date

I'm gathering tons of data in a stream on an Ubuntu machine. The data is stored in daily packages (where each day file contains somewhere between 1 and 5 GB). I'm not an experienced linux/bash/awk user, but the data looks something like this (all lines start with a date):
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
Now to the problem: the stream is cut around midnight local time (for a few reasons it can't be cut at exactly 00:00:00 GMT). This means that rows from two dates are stored in the same file, and I want to separate them into the correct date files. I wrote the following script to separate the rows. It works, but it takes several hours to run, and I think there must be a faster way of doing this operation.
#!/bin/bash
dateDiff (){
line_str="$1"
dte1="2020-09-01"
dte2=${line_str:0:10}
if [[ "$dte1" == "$dte2" ]]; then
echo $line_str >> correct_date.txt;
else
echo $line_str >> wrong_date.txt;
fi
}
IFS=$'\n'
for line in $(cat massive_file.txt)
do
dateDiff "$line"
done
unset IFS
Using this awk script, I'm able to process a 10GB file in approximately 1 minute on my machine.
awk '{ if ($0 ~ /^2020-08-31/) { print $0 > "correct.txt" } else { print $0 > "wrong.txt" } }' input_file_name.txt
Each line is checked against a regular expression containing your date, and the whole line is then printed to one file or the other depending on whether the regexp matches.
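If the date changes from run to run, the same idea can be wrapped so the date is passed in rather than hardcoded. This is only a sketch: the function name split_by_day and its argument order are assumptions for illustration, and it compares the first 10 characters of each line as a plain string instead of using a regexp.

```shell
# split_by_day INPUT DAY HITFILE MISSFILE
# Lines starting with DAY (YYYY-MM-DD) go to HITFILE, all others to MISSFILE.
split_by_day() {
    awk -v day="$2" -v hit="$3" -v miss="$4" '
        substr($0, 1, 10) == day { print > hit; next }
        { print > miss }
    ' "$1"
}
```

For example: split_by_day massive_file.txt 2020-09-01 correct_date.txt wrong_date.txt. An output file is only created if at least one line goes into it.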
Using awk with T as your field separator, the first field, $1, will be the date. Then you can output each record to a file named for the date.
$ cat file
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
$ awk -FT '{ print > ($1 ".txt") }' file
$ ls 20*.txt
2020-08-31.txt 2020-09-01.txt
$ cat 2020-09-01.txt
2020-09-01T00:00:00Z !In a unreadable way
$ cat 2020-08-31.txt
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
Some notes:
Using a bash loop to read logs is very slow.
Using awk, sed, grep or similar is much better, but you still have to read and write whole files line by line, and this has a performance ceiling.
For your specific case, you could identify only the split points, of which there can be 3, not only 2 (previous, current and next day logs can coexist in a file), with something like grep -nm1 "^$day", and then split the log file with a combination of head and tail. Then append or prepend the pieces to the existing files. This would be a very fast solution because you would write the files in bulk, not line by line.
Here is a simple solution with grep: you only need to test the first 10 characters of each log line, and for this job grep is faster than awk.
Assuming that you store logs in a destination directory, every incoming file should pass through something like this script. The order of processing is important: you have to process the files in date order, because the script appends to existing files. This is just a demo solution for guidance.
#!/bin/bash
[[ -f "$1" ]] && f="$1" || { echo "Nothing to do"; exit 1; }
dest_dir=archive/
suffix="_file.log"
curr=${f:0:10}
prev=$( date -d "$curr -1 day" "+%Y-%m-%d" )
next=$( date -d "$curr +1 day" "+%Y-%m-%d" )
for d in $prev $curr $next; do
grep "^$d" "$f" >> "${dest_dir}${d}${suffix}"
done
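The split-point idea from the notes (grep -nm1 plus head and tail) could look roughly like the sketch below. It assumes the lines in the file are in chronological order and that your grep supports -m (GNU and BSD grep do). The function name split_at_day and the output file arguments are placeholders.

```shell
# split_at_day INPUT DAY BEFORE_FILE FROM_FILE
# Finds the first line starting with DAY and cuts INPUT there:
# earlier lines go to BEFORE_FILE, that line and everything after to FROM_FILE.
split_at_day() {
    local first
    first=$(grep -n -m 1 "^$2" "$1" | cut -d: -f1)
    [ -n "$first" ] || return 1          # the day never appears in the file
    head -n "$((first - 1))" "$1" > "$3"
    tail -n "+$first" "$1" > "$4"
}
```

For a file that also contains lines from the following day, the same call can be repeated on the FROM_FILE with the next date.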

How to create a dynamic command in bash?

I want to have a command in a variable that runs a program and specifies the output filename for it depending on the number of files that exist (to work on a new file each time).
Here is what I have:
export MY_COMMAND="myprogram -o ./dir/outfile-0.txt"
However, I would like the outfile number to increase each time MY_COMMAND is executed. You may assume myprogram creates the file soon enough before the next call, so the number can be derived from the number of files that exist in the directory ./dir/. I cannot change myprogram itself or the way MY_COMMAND is used.
Thanks in advance.
Given that you can't change myprogram (its -o option will always write to the file given on the command line), and assuming that something else out of your control is running MY_COMMAND so you can't change the way that MY_COMMAND gets called, you still have control of what MY_COMMAND contains.
For the rest of this answer I'm going to change the name MY_COMMAND to callprog mostly because it's easier to type.
You can define callprog as a variable as in your example export callprog="myprogram -o ./dir/outfile-0.txt", but you could instead write a shell script and name that callprog, and a shell script can do pretty much anything you want.
So, you have a directory full of outfile-<num>.txt files and you want to output to the next non-colliding outfile-<num+1>.txt.
Your shell script can get the numbers by listing the files, cutting out only the numbers, sorting them, then take the highest number.
If we have these files in dir:
outfile-0.txt
outfile-1.txt
outfile-5.txt
outfile-10.txt
ls -1 ./dir/outfile*.txt produces the list
./dir/outfile-0.txt
./dir/outfile-1.txt
./dir/outfile-10.txt
./dir/outfile-5.txt
(using outfile and .txt means this will work even if there are other files not named outfile)
Scrape out the number by piping it through the stream editor sed … capture the number and keep only that part:
ls -1 ./dir/outfile*.txt | sed -e 's:^.*dir/outfile-\([0-9][0-9]*\)\.txt$:\1:'
(I'm using colon : instead of the standard slash / so I don't have to escape the directory separator in dir/outfile)
Now you just need to pick the highest number. Sort the numbers and take the top:
| sort -rn | head -1
Sorting with -n is numeric, not lexicographic, and -r reverses the order so the highest number comes first, not last.
Putting it all together, this will list the files, edit the names keeping only the numeric part, sort, and get just the first entry. You want to assign that to a variable to work with it, so it is:
high=$(ls -1 ./dir/outfile*.txt | sed -e 's:^.*dir/outfile-\([0-9][0-9]*\)\.txt$:\1:' | sort -rn | head -1)
In the shell you can do math on that with arithmetic expansion, $((high + 1)), so if high is 10, the expression produces 11. (The older $[high + 1] form is deprecated in bash and is not understood by a plain /bin/sh such as dash.)
You would use that as the numeric part of your filename.
The whole shell script then just needs to use that number in the filename. Here it is, with lines broken for better readability:
#!/bin/sh
high=$(ls -1 ./dir/outfile*.txt \
| sed -e 's:^.*dir/outfile-\([0-9][0-9]*\)\.txt$:\1:' \
| sort -rn | head -1)
echo "myprogram -o ./dir/outfile-$((high + 1)).txt"
Of course you wouldn't echo myprogram, you'd just run it.
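For comparison, here is a sketch that finds the highest number with a glob and parameter expansion instead of parsing ls output. It is an illustration rather than a drop-in replacement: the function name next_outfile is invented, and it assumes the numbers are written without leading zeros, as in the listing above.

```shell
# next_outfile DIR
# Prints the path of the next unused DIR/outfile-<num>.txt.
next_outfile() {
    local f n high
    high=-1
    for f in "$1"/outfile-*.txt; do
        n=${f##*/outfile-}               # drop the directory and the prefix
        n=${n%.txt}                      # drop the extension
        case $n in
            ''|*[!0-9]*) continue ;;     # not a plain number (or no match)
        esac
        if [ "$n" -gt "$high" ]; then
            high=$n
        fi
    done
    echo "$1/outfile-$((high + 1)).txt"
}
```

The script would then run myprogram -o "$(next_outfile ./dir)". With an empty directory it yields outfile-0.txt.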
You could do this in a bash function in your .bashrc by using wc to count the files in the directory and then adding 1 to the result:
yourfunction () {
    dir=/path/to/dir
    # ls prints one entry per line when piped, so count lines rather than words.
    filenum=$(( $(ls "$dir" | wc -l) + 1 ))
    myprogram -o "$dir/outfile-${filenum}.txt"
}
This gets the number of files in $dir and adds 1 to that number to get the number you need for the filename. If you place it in your .bashrc or .bash_aliases and source .bashrc, it should work like any other shell command.
You can try exporting a function for MY_COMMAND to run.
next_outfile () {
myprogram -o ./dir/outfile-${_next_number}.txt
((_next_number ++ ))
}
export -f next_outfile
export MY_COMMAND="next_outfile" _next_number=0
This relies on a "private" global variable _next_number being initialized to 0 and not otherwise modified.

Trying to scrub 700 000 data against 15 million data

I am trying to scrub 700,000 records from a single file against 15 million records spread across multiple files.
Example: one file of 700,000 records, call it A, and a pool of multiple files holding 15 million records, call it B.
I want the files in pool B to contain no data from file A.
Below is the shell script I am using. It works, but the scrubbing takes more than 8 hours.
IFS=$'\r\n' suppressionArray=($(cat abhinav.csv1))
suppressionCount=${#suppressionArray[@]}
cd /home/abhinav/01-01-2015/
for (( j=0; j<$suppressionCount; j++));
do
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
arrayOffileNameInWhichSuppressionFoundCount=${#arrayOffileNameInWhichSuppressionFound[@]}
if [ $arrayOffileNameInWhichSuppressionFoundCount -gt 0 ];
then
echo -e "${suppressionArray[$j]}" >> /home/abhinav/emailid_Deleted.txt
for (( k=0; k<$arrayOffileNameInWhichSuppressionFoundCount; k++));
do
sed "/^${suppressionArray[$j]}/d" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]} > /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" && mv -f /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}
done
fi
done
Another solution that came to mind is to break the 700k records down into smaller files of 50K each and distribute them across 5 available servers, with POOL A also available on each server.
Each server would process 2 of the smaller files.
These two lines are peculiar:
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
The first assigns an empty string to the mile-long variable name because the standard output is directed to the file. The second then reads that file into the array. ('Tis curious that the name is not arrayOfFileNameInWhichSuppressionFound, but the lower-case f for file is consistent, so I guess it doesn't matter beyond making it harder to read the variable name.)
That could be reduced to:
ArrFileNames=( $(grep -l "${suppressionArray[$j]}," *.csv) )
You shouldn't need to keep futzing with carriage returns in IFS; either set it permanently, or make sure there are no carriage returns before you start.
You're running these loops 7,00,000 times (using the Indian notation). That's a lot. No wonder it is taking hours. You need to group things together.
You should probably simply take the lines from abhinav.csv1 and arrange to convert them into appropriate sed commands, and then split them up and apply them. Along the lines of:
sed 's%.*%/&,/d%' abhinav.csv1 > names.tmp
split -l 500 names.tmp sed-script.
for script in sed-script.*
do
sed -f "$script" -i.bak *.csv
done
This uses the -i.bak option to edit the files in place, keeping a .bak backup of each. If your sed does not support the -i option, do the redirection explicitly inside the loop over the scripts:
for script in sed-script.*
do
    for file in *.csv
    do
        sed -f "$script" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
    done
done
You should experiment to see how big the scripts can be. I chose 500 in the split command as a moderate compromise. Unless you're on antique HP-UX, that should be safe, but you may be able to increase the size of the script further, which reduces the number of times you have to edit each file and so speeds up the processing. If you can use 5,000 or 50,000, you should do so. Experiment to see what the upper limit is. I'm not sure that doing all 700,000 lines at once is feasible, but it should be fastest if you can do it that way.
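A different technique from the sed scripts above, offered only as a sketch: load the whole suppression list into an awk array once and filter each CSV in a single pass, so each file is read once regardless of the list size. This assumes the value to match is exactly the first comma-separated field of each line and that the list fits in memory. The function name scrub_file is hypothetical.

```shell
# scrub_file LIST CSVFILE
# Prints the lines of CSVFILE whose first comma-separated field is not
# listed in LIST (one value per line, carriage returns tolerated).
scrub_file() {
    awk -F, -v list="$1" '
        BEGIN {
            # Read the suppression list into an array used as a lookup set.
            while ((getline line < list) > 0) {
                sub(/\r$/, "", line)
                drop[line] = 1
            }
            close(list)
        }
        !($1 in drop)
    ' "$2"
}
```

It could be applied with the same temporary-file pattern as before: for f in *.csv; do scrub_file abhinav.csv1 "$f" > "$f.tmp" && mv "$f.tmp" "$f"; done.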

Pull fields/attributes from lsof (Linux command line)

With the recent move to Flash 10 (or maybe it was a distro choice), I and many others are no longer able to copy Flash videos from /tmp. I have, however, found a workaround in the following:
First, execute:
lsof | grep Flash
which should return output like this:
plugin-co 8935 richard 16w REG 8,1 4139180 8220 /tmp/FlashXXq4KyOZ (deleted)
Note: You can see the problem here....the /tmp file has the file pointer released.
You are, however, able to grab the file by using the cp command thusly:
cp /proc/#/fd/# video.flv
where the first # is the process ID (8935) and the second is the file descriptor number (16, from 16w).
Currently, this works, but it requires a few manual steps. To automate this, I figure I could pull the PID and the fd number and insert them dynamically into the cp command.
My question is how do I pull the appropriate fields into variables? I know you can use $1, etc. for grabbing input arguments, but how do you retrieve outputs?
Note: I could use pidof plugin-container to find the PID, but I still need the other number (since it tells which specific flash video to save).
The following command will return PIDs and FDs for all the files in /tmp that have filenames that begin with "Flash"
lsof -F pfn /tmp/Flash*
and the output will look something like this:
p16471
f16
n/tmp/FlashXXq4KyOZ
f17
n/tmp/FlashXXq4KyOZ
p26588
f16
n/tmp/FlashYYh3JwIW
f17
Where the field identifiers are p: PID, f: FD, n: NAME. The -F option is designed to make the output of lsof easy to parse.
Iterating over these and removing the field identifiers is trivial.
#!/bin/bash
c=-1
while read -r line
do
case $line in
f*)
fds[pids[c]]+=${line:1}" "
;;
n*)
names[pids[c]]+=${line:1}" "
;;
p*)
pids[++c]=${line:1}
;;
esac
done < <(lsof -F pfn -- /tmp/Flash*)
for ((i=0; i<=c; i++))
do
for name in ${names[pids[i]]}
do
for fd in ${fds[pids[i]]}
do
echo "File: $name, Process ID: ${pids[i]}, File Descriptor: $fd"
done
done
done
Lines like this:
fds[pids[c]]+=${line:1}" "
accumulate file descriptors in a string stored in an array indexed by the PID. Doing this for file names will fail for filenames which contain spaces. That could be worked around if necessary.
The line is stripped of the leading field descriptor character by using a substring operator: ${line:1} starts at position one and includes the rest of the string so it drops character zero.
The second loop is just a demo to show iterating over the arrays.
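If filenames with spaces matter, one possible workaround is to skip the arrays and print one record per file as soon as its name line arrives. The sketch below reads lsof -F pfn output on standard input and assumes, as in the sample output above, that each f line precedes its n line. The function name parse_lsof_fields is made up.

```shell
# parse_lsof_fields
# Reads "lsof -F pfn" output on stdin and prints PID, FD and NAME
# separated by tabs, one line per open file.
parse_lsof_fields() {
    local line pid fd
    pid=
    fd=
    while IFS= read -r line; do
        case $line in
            p*) pid=${line#?} ;;         # remember the current process
            f*) fd=${line#?} ;;          # remember the current descriptor
            n*) printf '%s\t%s\t%s\n' "$pid" "$fd" "${line#?}" ;;
        esac
    done
}
```

Usage: lsof -F pfn -- /tmp/Flash* | parse_lsof_fields. Here ${line#?} removes the first character, the same job ${line:1} does in the script above.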
var=$(lsof | awk '/Flash/{gsub(/[^0-9]/,"",$4); print $2 FS $4; exit}')
set -- $var
pid=$1
number=$2
Completed Script:
#!/bin/sh
if [ -n "$1" ]; then
    # lsof | grep Flash | awk '{print $2}' also works for PID
    pid=$(pidof plugin-container)
    # Field 4 is the FD column (e.g. 16w); keep only its leading digits.
    file_num=$(lsof -p "$pid" | awk '/\/tmp\/Flash/ { sub(/[^0-9].*/, "", $4); print $4; exit }')
    cp "/proc/$pid/fd/$file_num" ~/Downloads/"$1".flv
else
    echo "Please enter video name as argument."
fi
Avoid using lsof because it takes too long (>30 seconds) to return the path. The below .bashrc line will work with vlc, mplayer, or whatever you put in and return the path to the deleted temp file in milliseconds.
flashplay () {
vlc $(stat -c %N /proc/*/fd/* 2>&1|awk -F[\`\'] '/lash/{print$2}')
}
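A more verbose variant of the same /proc idea, as a sketch: loop over one process's descriptors with readlink and print those that point at a deleted file whose path contains a given string. It assumes a Linux /proc filesystem and the readlink utility; the function name deleted_fds is invented for the example.

```shell
# deleted_fds PID PATTERN
# Prints "FD TARGET" for every descriptor of PID whose target path
# contains PATTERN and is marked "(deleted)".
deleted_fds() {
    local fd target
    for fd in /proc/"$1"/fd/*; do
        target=$(readlink "$fd") || continue
        case $target in
            *"$2"*" (deleted)") echo "${fd##*/} $target" ;;
        esac
    done
}
```

For example, deleted_fds "$(pidof plugin-container)" /tmp/Flash lists the descriptor numbers that can then be used in cp /proc/PID/fd/FD video.flv.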
