Trying to scrub 700 000 data against 15 million data - linux

I am trying to scrub 700,000 records from a single file against 15 million records spread across multiple files.
Example: one file of 700,000 records, call it A, and a pool of multiple files holding 15 million records, call it B.
I want the pool B to contain none of the records in file A.
Below is the shell script I am using. It works, but it takes more than 8 hours to finish the scrubbing.
IFS=$'\r\n' suppressionArray=($(cat abhinav.csv1))
suppressionCount=${#suppressionArray[@]}
cd /home/abhinav/01-01-2015/
for (( j=0; j<$suppressionCount; j++));
do
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
arrayOffileNameInWhichSuppressionFoundCount=${#arrayOffileNameInWhichSuppressionFound[@]}
if [ $arrayOffileNameInWhichSuppressionFoundCount -gt 0 ];
then
echo -e "${suppressionArray[$j]}" >> /home/abhinav/emailid_Deleted.txt
for (( k=0; k<$arrayOffileNameInWhichSuppressionFoundCount; k++));
do
sed "/^${suppressionArray[$j]}/d" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]} > /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]}".tmp" && mv -f /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]}".tmp" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]}
done
fi
done
Another solution that came to mind is to break the 700k records down into smaller files of 50K each and spread them across 5 available servers; POOL A will also be available on each server.
Each server would handle 2 of the smaller files.

These two lines are peculiar:
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
The first assigns an empty string to the mile-long variable name because the standard output is directed to the file. The second then reads that file into the array. ('Tis curious that the name is not arrayOfFileNameInWhichSuppressionFound, but the lower-case f for file is consistent, so I guess it doesn't matter beyond making it harder to read the variable name.)
That could be reduced to:
ArrFileNames=( $(grep -l "${suppressionArray[$j]}," *.csv) )
You shouldn't need to keep futzing with carriage returns in IFS; either set it permanently, or make sure there are no carriage returns before you start.
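Making sure there are no carriage returns before you start could look like the sketch below. The function name strip_cr and the output file name in the usage line are made up for illustration.

```shell
# strip_cr IN OUT: copy IN to OUT with every carriage return removed.
# (Hypothetical helper name; tr -d itself is standard.)
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}
```

For example, strip_cr abhinav.csv1 abhinav.clean, and then read abhinav.clean with a plain newline IFS.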
You're running these loops 7,00,000 times (using the Indian notation). That's a lot. No wonder it is taking hours. You need to group things together.
You should probably simply take the lines from abhinav.csv1 and arrange to convert them into appropriate sed commands, and then split them up and apply them. Along the lines of:
sed 's%.*%/&,/d%' abhinav.csv1 > names.tmp
split -l 500 names.tmp sed-script.
for script in sed-script.*
do
sed -f "$script" -i.bak *.csv
done
This uses the -i option to edit the files in place, keeping a backup of each with a .bak suffix. It may be necessary to do the redirection explicitly if your sed does not support the -i option:
for file in *.csv
do
sed -f "$script" "$file" > "$file.tmp" &&
mv "$file.tmp" "$file"
done
You should experiment to see how big the scripts can be. I chose 500 in the split command as a moderate compromise. Unless you're on antique HP-UX, that should be safe, but you may be able to increase the size of the script, which reduces the number of times you have to edit each file and so speeds up the processing. If you can use 5,000 or 50,000, you should do so. Experiment to see what the upper limit is. I'm not sure that doing all 700,000 lines at once is feasible, but it should be fastest if you can do it that way.
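If sed does not cope with all 700,000 patterns at once, a hash lookup in awk is one way to handle the whole suppression list in a single pass per file. This is a different technique from the sed scripts above, and only a sketch: it assumes the value to match is the first comma-separated field of each pool line (which is what the grep and sed patterns in the question suggest), and scrub_pool is a made-up name.

```shell
# scrub_pool SUPPRESSION_FILE POOL_FILE...
# Hypothetical helper: rewrites each pool file without the lines whose first
# comma-separated field appears in the suppression file.
# Assumption: the suppression file is not empty (the NR == FNR idiom needs
# at least one line in the first input to tell the two inputs apart).
scrub_pool() {
    local suppression=$1
    local file
    shift
    for file in "$@"; do
        # First input: remember every key, minus any trailing carriage return.
        # Second input: print only the lines whose first field is not a key.
        awk -F',' '
            NR == FNR { sub(/\r$/, ""); drop[$0] = 1; next }
            !($1 in drop)
        ' "$suppression" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    done
}
```

Called as scrub_pool abhinav.csv1 *.csv from inside the pool directory. Each lookup is a hash probe rather than a regex match, so the cost per pool line does not grow with the size of the suppression list. Work on a copy of the pool first.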


Separating lines of a huge file into two files depending on the date

I'm gathering tons of data in a stream on an Ubuntu machine. The data is stored in daily packages (each day_file contains somewhere between 1 and 5 GB). I'm not an experienced linux/bash/awk user, but the data looks something like this (all lines start with a date):
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
Now to the problem: the stream is cut around midnight local time (for a few reasons it can't be cut at exactly 00:00:00 GMT). This means that rows from two dates are stored in the same file, and I want to separate them into the correct date files. I wrote the following script to separate the rows. It works, but it takes several hours to run, and I think there must be a faster way of doing this operation.
#!/bin/bash
dateDiff (){
line_str="$1"
dte1="2020-09-01"
dte2=${line_str:0:10}
if [[ "$dte1" == "$dte2" ]]; then
echo $line_str >> correct_date.txt;
else
echo $line_str >> wrong_date.txt;
fi
}
IFS=$'\n'
for line in $(cat massive_file.txt)
do
dateDiff "$line"
done
unset IFS
Using this awk script I'm able to process a 10 GB file in approximately 1 minute on my machine.
awk '{ if ($0 ~ /^2020-08-31/) { print $0 > "correct.txt" } else { print $0 > "wrong.txt" } }' input_file_name.txt
Each line is checked against a regular expression containing your date, and the whole line is then printed to one file or the other depending on whether the regexp matches.
Using awk with T as your field separator, the first field, $1, will be the date. Then you can output each record to a file named for the date.
$ cat file
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
$ awk -FT '{ print > ($1 ".txt") }' file
$ ls 20*.txt
2020-08-31.txt 2020-09-01.txt
$ cat 2020-09-01.txt
2020-09-01T00:00:00Z !In a unreadable way
$ cat 2020-08-31.txt
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
Some notes:
Using a bash loop to read logs would be very slow.
Using awk, sed, grep or similar is very good, but you will still have to read and write whole files line by line, and this has a performance ceiling.
For your specific case, you only need to identify the split points. A file can hold logs for up to three days, not just two (previous, current and next), so find the first line of each day with something like grep -nm1 "^$day", split the log file with a combination of head and tail, and append or prepend the pieces to the existing files. This would be very fast because the files are written in bulk, not line by line.
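A rough sketch of the split-point idea, assuming the lines are in chronological order so that everything before the first match belongs to earlier days. The function name and the .before/.from output names are invented for the example.

```shell
# split_at_date FILE DAY
# Hypothetical helper: writes the lines before the first line starting with
# DAY to FILE.before, and that line plus everything after it to FILE.from.
split_at_date() {
    local file=$1 day=$2 first
    # -n prefixes each match with its line number, -m1 stops at the first match.
    first=$(grep -n -m1 "^$day" "$file" | cut -d: -f1)
    # Give up if the day does not occur in the file at all.
    [ -n "$first" ] || return 1
    head -n "$((first - 1))" "$file" > "$file.before"
    tail -n "+$first" "$file" > "$file.from"
}
```

grep stops reading at the first match, and head and tail copy in large blocks, so no shell loop ever touches individual lines.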
Here is a simple solution with grep, as you need to test only the 10 first characters of the log lines, and for this job grep is faster than awk.
Assuming that you store logs in a destination directory, every incoming file should pass through something like this script. The order of processing is important: you have to follow the date order of the files, because, as you can see, the script appends to an existing file. This is just a demo solution for guidance.
#!/bin/bash
[[ -f "$1" ]] && f="$1" || { echo "Nothing to do"; exit 1; }
dest_dir=archive/
suffix="_file.log"
curr=${f:0:10}
prev=$( date -d "$curr -1 day" "+%Y-%m-%d" )
next=$( date -d "$curr +1 day" "+%Y-%m-%d" )
for d in $prev $curr $next; do
grep "^$d" "$f" >> "${dest_dir}${d}${suffix}"
done

How to view syslog entries since last time I looked

I want to view the entries in Linux /var/log/syslog, but I only want to see the entries since the last time I looked (preferably by creating a bash script to do this). The solution I thought of was to take a copy of syslog and diff it against the copy I took last time, but this seems unclean because syslog can be big and diff adds artifacts to its output. I'm thinking maybe I could use tail directly on syslog, but I don't know how to do this when I don't know how many lines have been added since the last time I tried. Any better thoughts? I would like to be able to redirect the result to a file so I can later interactively grep for specific parts of interest.
Linux has a wc command which can count the number of lines in a file, for example wc -l /var/log/syslog. The bash script below stores the output of wc -l in a file called ./prevlinecount. Whenever you want just the new lines in a file, it reads the value in ./prevlinecount and subtracts it from a fresh wc -l count called newlinecount. Then it tails the last (newlinecount - prevlinecount) lines.
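#!/bin/bash
# Pass the log file as the first argument.
# Read the line count saved by the previous run, if there was one.
prevlinecount=$(cat ./prevlinecount 2>/dev/null)
newlinecount=$(wc -l < "$1")
if [ -z "$prevlinecount" ]; then
    # First run: print the whole file.
    cat "$1"
else
    # Later runs: print only the lines added since the saved count.
    tail -n "$((newlinecount - prevlinecount))" "$1"
fi
echo "$newlinecount" > ./prevlinecount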
Beware: this is a very rudimentary script which can only keep track of one file. If you would like to extend this script to multiple files, look into associative arrays. With associative arrays you could keep track of multiple files by using the filename as the key and the previous line count as the value.
Beware too that syslog files can be archived once they reach a predetermined size (maybe 10MB), and this script does not account for the archival process.
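One way to deal with the archival caveat, as a sketch: if the file has fewer lines than the saved count, assume it was rotated and start again from the top. The function name and the state-file argument are assumptions, not part of the script above.

```shell
# show_new_lines LOG STATE
# Hypothetical helper: prints the lines added to LOG since the previous call
# and records the current line count in STATE.
show_new_lines() {
    local log=$1 state=$2 prev=0 now
    [ -f "$state" ] && prev=$(cat "$state")
    now=$(( $(wc -l < "$log") ))
    # Fewer lines than last time: assume the log was rotated and start over.
    [ "$now" -lt "$prev" ] && prev=0
    # tail -n +K starts printing at line K.
    tail -n "+$((prev + 1))" "$log"
    echo "$now" > "$state"
}
```

Using one state file per log (for example show_new_lines /var/log/syslog ~/.syslog.count) also sidesteps the one-file limitation. A log that is rotated and then grows past the old count between two calls would still fool it.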

Finding a line that shows in a file only once

Assuming that I have files with 100 lines. There are a lot of lines that repeat themselves in the file, and only one line that does not.
I want to find the line that shows only once. Is there a command for that or do I have to build some complicated loop as below?
My code so far:
#!/bin/bash
filename="repeat_lines.txt"
var="$(wc -l <$filename )"
echo "length:" $var
#cp ex4.txt ex4_copy.txt
for((index=0; index < var; index++));
do
one="$(head -n $index $filename | tail -1)"
counter=0
for((index2=0; index2 < var; index2++));
do
two="$(head -n $index2 $filename | tail -1)"
if [ "$one" == "$two" ]; then
counter=$((counter+1))
fi
done
echo $one"is "$counter" times in the text: "
done
If I understood your question correctly, then
sort repeat_lines.txt | uniq -u should do the trick.
e.g. for file containing:
a
b
a
c
b
it will output c.
For further reference, see sort manpage, uniq manpage.
You've got a reasonable answer that uses standard shell tools sort and uniq. That's probably the solution you want to use, if you want something that is portable and doesn't require bash.
But an alternative would be to use functionality built into your bash shell. One method might be to use an associative array, which is a feature of bash 4 and above.
$ cat file.txt
a
b
c
a
b
$ declare -A lines
$ while read -r x; do ((lines[$x]++)); done < file.txt
$ for x in "${!lines[@]}"; do [[ ${lines["$x"]} -gt 1 ]] && unset lines["$x"]; done
$ declare -p lines
declare -A lines='([c]="1" )'
What we're doing here is:
declare -A creates the associative array. This is the bash 4 feature I mentioned.
The while loop reads each line of the file, and increments a counter that uses the content of a line of the file as the key in the associative array.
The for loop steps through the array, deleting any element whose counter is greater than 1.
declare -p prints the details of an array in a predictable, re-usable format. You could alternately use another for loop to step through the remaining array elements (of which there might be only one) in order to do something with them.
Note that this solution, while fine for small files (say, up to a few thousand lines), may not scale well for very large files of, say, millions of lines. Bash isn't the fastest at reading input this way, and one must be cognizant of memory limits when using arrays.
The sort alternative has the benefit of memory optimization using files on disk for extremely large files, at the expense of speed.
If you're dealing with files of only a few hundred lines, then it's hard to predict which solution will be faster. In the end, the form of output may dictate your choice of solution. The sort | uniq pipe generates a list to standard output. The bash solution above generates the same list as keys in an array. Otherwise, they are functionally equivalent.
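For completeness, a third option that neither answer mentions: awk can count the lines in one pass over the file and print the singletons in a second pass, which keeps the original line order and avoids both the sort and the slow bash read loop. A minimal sketch (unique_lines is a made-up name; the counts are held in memory, so the same caveat about very large files applies):

```shell
# unique_lines FILE
# Hypothetical helper: prints the lines that occur exactly once in FILE,
# in their original order. FILE is read twice.
unique_lines() {
    # First pass (NR == FNR): count every distinct line.
    # Second pass: print the lines whose count is exactly 1.
    awk 'NR == FNR { count[$0]++; next } count[$0] == 1' "$1" "$1"
}
```

For the five-line example file above, unique_lines file.txt prints c.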

How to find files with similar filename and how many of them there are with awk

I was tasked to delete old backup files from our Linux database (all except for the newest 3). Since we have multiple kinds of backups, I have to leave at least 3 backup files for each backup type.
My script should group all files with similar (matched) names together and delete all except for the last 3 files (I assume, that the OS will sort those files for me, so the newest backups will also be the last ones)
The files are in the format project_name.000000-000000.svndmp.bz2 where 0 can be any arbitrary digit and project_name can be any arbitrary name. The first 6 digits are part of the name, while the last 6 digits describe the backup's version.
So far, my code looks like this:
for i in *.svndmp.bz2 # only check backup files
do
nOfOccurences = # Need to find out, how many files have the same name
currentFile = 0
for f in awk -F"[.-]" '{print $1,$2}' $i # This doesn't work
do
if [nOfOccurences - $currentFile -gt 3]
then
break
else
rm $f
currentFile++
fi
done
done
I'm aware, that my script may try to remove old versions of a backup 4 times before moving on to the next backup. I'm not looking for performance or efficiency (we don't have that many backups).
My code is a result of 4 hours of searching the net and I'm running out of good Google queries (and my boss is starting to wonder why I'm still not back to my usual tasks)
Can anybody give me inputs, as to how I can solve my problems?
Find nOfOccurences
Make awk find files that fit the pattern "$1.$2-*"
Try this one, and see if it does what you want.
for project in `ls -1 | awk -F'-' '{ print $1}' | uniq`; do
files=`ls -1 ${project}* | sort`
n_occur=`echo "$files" | wc -l`
for f in $files; do
if ((n_occur <= 3)); then
break
fi
echo "rm" $f;
((--n_occur))
done
done
If the output seems to be OK just replace the echo line.
Ah, and don't beat me if anything goes wrong. Use at your own risk only.
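If you would rather have awk do the grouping, as the question title asks, something along these lines might work. It is a sketch under the naming scheme from the question (project_name.000000-000000.svndmp.bz2); old_backups is a hypothetical name, and it only prints the candidates for deletion rather than removing anything.

```shell
# old_backups DIR [KEEP]
# Hypothetical helper: prints the names of the backups in DIR that are older
# than the newest KEEP (default 3) of their group. Nothing is deleted.
old_backups() {
    local dir=$1 keep=${2:-3} f
    for f in "$dir"/*.svndmp.bz2; do
        [ -e "$f" ] && printf '%s\n' "${f##*/}"
    done | sort -r | awk -v keep="$keep" '
        # The group is the name with the trailing -VERSION.svndmp.bz2 removed.
        { group = $0; sub(/-[0-9]+\.svndmp\.bz2$/, "", group) }
        # Input is newest first within a group, so skip the first KEEP names.
        ++seen[group] > keep
    '
}
```

The reverse sort relies on the version being a fixed-width number, so that text order matches version order. Review the printed list before piping it to anything that removes files.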

How to delete the last n lines of a file? [duplicate]

I was wondering if someone could help me out.
I'm writing a bash script and I want to delete the last 12 lines of a specific file.
I have had a look around and somehow come up with the following;
head -n -12 /var/lib/pgsql/9.6/data/pg_hba.conf | tee /var/lib/pgsql/9.6/data/pg_hba.conf >/dev/null
But this wipes the file completely.
All I want to do is permanently delete the last 12 lines of that file so I can overwrite them with my own rules.
Any help on where I'm going wrong?
There are a number of methods, depending on your exact situation. For small, well-formed files (say, less than 1M, with regular sized lines), you might use Vim in ex mode:
ex -snc '$-11,$d|x' smallish_file.txt
-s -> silent; this is batch processing, so no UI necessary (faster)
-n -> No need for an undo buffer here
-c -> the command list
'$-11,$d' -> Select the 11 lines from the end to the end (for a total of 12 lines) and delete them. Note the single quote so that the shell does not interpolate $d as a variable.
x -> "write and quit"
For a similar, perhaps more authentic throw-back to '69, the ed line-editor could do this for you:
ed -s smallish_file.txt <<< $'-11,$d\nwq'
Note the $ outside of the single quote, which is different from the ex command above.
If Vim/ex and Ed are scary, you could use sed with some shell help:
sed -i "$(($(wc -l < smallish_file.txt) - 11)),\$d" smallish_file.txt
-i -> inplace: write the change to the file
The line count less 11 gives the first of the 12 lines to delete. Note the escaped dollar symbol (\$) so the shell does not interpolate it.
But using the above methods will not be performant for larger files (say, more than a couple of megs). For larger files, use the intermediate/temporary file method, as the other answers have described. A sed approach:
tac some_file.txt | sed '1,12d' | tac > tmp && mv tmp some_file.txt
tac to reverse the line order
sed to remove the last (now first) 12 lines
tac to reverse back to the original order
More efficient than sed is a head approach:
head -n -12 larger_file.txt > tmp_file && mv tmp_file larger_file.txt
-n -NUM -> print all but the last NUM lines. (A positive NUM would print only the first NUM lines.)
But for real efficiency -- perhaps for really large files or for where a temporary file would be unwarranted -- truncate the file in place. Unlike the other methods, which involve variations of overwriting the entire old file with the entire new content, this one will be near instantaneous no matter the size of the file.
# In readable form:
BYTES=$(tail -12 really_large.txt | wc -c)
truncate -s -$BYTES really_large.txt
# Inline, perhaps as part of a script
truncate -s -$(tail -12 really_large.txt | wc -c) really_large.txt
The truncate command makes files exactly the specified size in bytes. If the file is too short, it will make it larger, and if the file is too large, it will chop off the excess really efficiently. It does this with filesystem semantics, so it involves writing usually no more than a couple of bytes. The magic here is in calculating where to chop:
-s -NUM -> Note the dash/negative; says to reduce the file by NUM bytes
$(tail -12 really_large.txt | wc -c) -> returns the number of bytes to be removed
So, you pays your moneys and takes your choices. Choose wisely!
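The truncate approach wraps up into a small function. A sketch, assuming GNU truncate is available; drop_last_lines is a made-up name.

```shell
# drop_last_lines FILE N
# Hypothetical helper: removes the last N lines of FILE in place by cutting
# off exactly as many bytes as those lines occupy.
drop_last_lines() {
    local file=$1 n=$2 bytes
    # Size in bytes of the last N lines (the whole file if it is shorter).
    bytes=$(tail -n "$n" "$file" | wc -c)
    # A leading minus tells truncate to shrink the file by that many bytes.
    truncate -s "-$((bytes))" "$file"
}
```

For the case in the question that would be drop_last_lines /var/lib/pgsql/9.6/data/pg_hba.conf 12, after taking a backup.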
Like this:
head -n -12 test.txt > tmp.txt && cp tmp.txt test.txt
You can use a temporary file to store the intermediate result of head -n
I think the code below should work:
head -n -12 /var/lib/pgsql/9.6/data/pg_hba.conf > /tmp/tmp.pg.hba.$$ && mv /tmp/tmp.pg.hba.$$ /var/lib/pgsql/9.6/data/pg_hba.conf
If you are putting it in a script, more readable and easier-to-maintain code would be:
SRC_FILE=/var/lib/pgsql/9.6/data/pg_hba.conf
TMP_FILE=/tmp/tmp.pg.hba.$$
head -n -12 $SRC_FILE > $TMP_FILE && mv $TMP_FILE $SRC_FILE
I would suggest backing up /var/lib/pgsql/9.6/data/pg_hba.conf before running any script.
Simple and clear script
declare -i start
declare -i cnt
cat dummy
1
2
3
4
5
6
7
8
9
10
11
12
13
cnt=$(wc -l < dummy)
start=$((cnt - 12 + 1))
sed "${start},\$d" dummy
Output (only the first line remains):
1
