Searching a particular string pattern out of 10000 files in parallel - linux

Problem Statement:
I need to search for a particular string pattern in around 10000 files and find the records in the files which contain that pattern. I can use grep here, but it is taking a lot of time.
Below is the command I am using to search a particular string pattern after unzipping the dat.gz files:
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
If I simply count how many files are there after unzipping the above dat.gz file
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l
I get around 10000 files. I need to search the above string pattern in all these 10000 files and find out the records which contain it. My command above works fine, but it is very, very slow.
What is the best approach here? Should we take 100 files at a time and search for the pattern in those 100 files in parallel?
Note:
I am running SunOS
bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc

Do NOT run this in parallel!!!! That will bounce the disk head all over the place, and it will be much slower.
Since you are reading an archive file, there's one way to get a substantial performance boost--don't write the results of the decompression out. The ideal answer would be to decompress to a stream in memory; if that's not viable, then decompress to a ramdisk.
In any case you do want some parallelism here--one thread should be obtaining the data and handing it off to another that does the search. That way you will either be waiting on the disk or on the core doing the decompressing, and you won't waste any of that time doing the search.
(Note that in case of the ramdisk you will want to aggressively read the files it wrote and then kill them so the ramdisk doesn't fill up.)
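As a rough sketch of the ramdisk variant, assuming /tmp is memory-backed (it is swap-backed tmpfs by default on Solaris 10): decompress one archive at a time into /tmp, search it, and delete it immediately so the ramdisk never fills up.
for f in /data/newfolder/real-time-newdata/*_20120809_0_*.gz; do
    # decompress into the memory-backed /tmp, search, then remove right away
    tmp="/tmp/$(basename "$f" .gz)"
    gzcat "$f" > "$tmp"
    grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$tmp"
    rm -f "$tmp"
done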

For starters, you will need to uncompress the file to disk.
This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the uncompressed directory:
for i in `find . -type f`; do ((grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$i" )&); done
So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):
while [ `top -b -n1 | grep -c grep` -gt 10 ]; do echo true; done
I have run this, and it works.... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve upon this, adding one to a count when a new process is started and decrementing by one when a process ends?
for i in `find . -type f`; do ((grep -l 'blah' "$i")&); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done
Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
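One possible refinement (a sketch, not tested on SunOS): instead of parsing top, count the shell's own background jobs with jobs, which costs almost nothing per check:
for i in `find . -type f`; do
    grep -l 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$i" &
    # throttle: wait while 10 or more background greps are still running
    while [ `jobs -pr | wc -l` -ge 10 ]; do sleep 1; done
done
wait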

If you are not using regular expressions you can use the -F option of grep or use fgrep. This may provide you with additional performance.
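For example, since the pattern in the question is a fixed string, the same pipeline can be run with fgrep (or grep -F):
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | fgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'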

Your gzcat .... | wc -l does not indicate 10000 files, it indicates 10000 lines total for however many files there are.
This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this:
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
That will run one gzgrep command with batches of as many individual files as it can fit on a command line (there are options to xargs to limit how many, or for a number of other things). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs in this way will however cut down some on the overall number of new processes that need to be spawned.
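If you want to cap how many files go into each invocation, xargs can do that too; a sketch (the batch size of 200 is arbitrary):
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs -n 200 gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'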

Related

What is the most efficient method to handle reading multiple patterns through multiple parts of a file?

I have a bash script that reads a log file and searches for various patterns in various parts of the file. I am trying to think of the best way to optimize this flow, and I currently have it set such that when searching through the file, it will only search through the areas needed (last 15 minutes of logging, first 15 minutes of logging, last hour, last 30 minutes, etc.).
As you can imagine, as the file size increases, so does the time the searches take.
Most of the greps are fgreps with -m / -c to optimize them, and I use sed to portion out the file, along with special redirection and tac in some cases to read the file from the bottom up rather than top to bottom, to maximize efficiency. The main thing that I have not implemented is setting LC_ALL=C before the greps, but regardless of that, I am still looking for best practices and the best performance gains here.
FILENAME="filename" # Can be a large file, 10s of GB
search1=$(<"$FILENAME" tac | sed "/thirty_mins_ago/q" | fgrep -c -m 1 "FATAL")
search2=$(<"$FILENAME" tac | sed "/fifteen_mins_ago/q" | fgrep -c -m 1 "Exception")
search3=$(<"$FILENAME" cat | sed "/start_of_day_fifteen/q" | fgrep -c -m 1 "Maintenance complete")
if [ "$search1" -ge 1 ]; then
echo "FATAL ERRORS DETECTED"
fi
etc.
Would storing the entire file in a variable result in a quicker read time for grep, or would splitting the file into multiple variables corresponding to blocks of time and then grepping those provide significant performance improvements?
Any suggestions on the best ways to optimize and improve the efficiency of this are much appreciated.
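Regarding the LC_ALL=C point mentioned above: it can be applied per command without changing the rest of the pipeline. A sketch using the first search from the snippet (the marker strings are the same placeholders as above):
search1=$(<"$FILENAME" tac | sed "/thirty_mins_ago/q" | LC_ALL=C fgrep -c -m 1 "FATAL")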

Quickly list random set of files in directory in Linux

Question:
I am looking for a performant, concise way to list N randomly selected files in a Linux directory using only Bash. The files must be randomly selected from different subdirectories.
Why I'm asking:
In Linux, I often want to test a random selection of files in a directory for some property. The directories contain 1000's of files, so I only want to test a small number of them, but I want to take them from different subdirectories in the directory of interest.
The following returns the paths of 50 "randomly"-selected files:
find /dir/of/interest/ -type f | sort -R | head -n 50
The directory contains many files and resides on a mounted file system with slow read times (accessed through ssh), so the command can take many minutes. I believe the issue is that find must first list every file (slow), and only then is the random selection printed.
If you are using locate and updatedb updates regularly (daily is probably the default), you could:
$ locate /home/james/test | sort -R | head -5
/home/james/test/10kfiles/out_708.txt
/home/james/test/10kfiles/out_9637.txt
/home/james/test/compr/bar
/home/james/test/10kfiles/out_3788.txt
/home/james/test/test
How often do you need it? Do the work periodically in advance to have it quickly available when you need it.
Create a refreshList script.
#!/usr/bin/env bash
find /dir/of/interest/ -type f | sort -R | head -n 50 >/tmp/rand.list
mv -f /tmp/rand.list ~
Put it in your crontab.
0 7-20 * * 1-5 nice -25 ~/refreshList
Then you will always have a ~/rand.list that's under an hour old.
If you don't want to use cron and aren't too picky about how old it is, just write a function that refreshes the file after you use it every time.
randFiles() {
cat ~/rand.list
{ find /dir/of/interest/ -type f |
sort -R | head -n 50 >/tmp/rand.list
mv -f /tmp/rand.list ~
} &
}
If you can't run locate and the find command is too slow, is there any reason this has to be done in real time?
Would it be possible to use cron to dump the output of the find command into a file and then do the random pick out of there?
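A sketch of that cron idea (the cache file name ~/file.list and the hourly schedule are just examples): dump the full find output on a schedule, then do the random pick against the cached list, which is fast because it never touches the slow mount.
# crontab entry (refresh the cached listing hourly):
# 0 * * * * find /dir/of/interest/ -type f > /tmp/file.list.tmp && mv /tmp/file.list.tmp ~/file.list
# at read time, pick 50 at random from the cache:
sort -R ~/file.list | head -n 50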

Fastest way to search for multiple values on a linux machine

I want to search for multiple values (say v1, v2, v3....) in a directory with around 6-10 huge files (~300 MB each). I have tried grep and fgrep, with regular expression search like ('v1 | v2 | v3'). The command seems to be running really slow. I am running something like
grep -e 'v1|v2|v3' .
Is there a way I can make my search faster? Please note that the machine is single core, so parallelization may not be of much help.
I'd suggest
LANG=C egrep 'v1|v2|v3'
It won't get much faster than that.
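If v1, v2, v3 are fixed strings rather than regular expressions, another option worth trying is a pattern file with fixed-string matching (the file name /tmp/patterns.txt is just an example); run the grep in the directory containing the files:
printf '%s\n' 'v1' 'v2' 'v3' > /tmp/patterns.txt
LANG=C grep -F -f /tmp/patterns.txt *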
You can search with the -R option:
-R - Read all files under each directory, recursively.
grep -i 'test' -R .

Maintaining variables in function - Global variables

I'm trying to run a script in a function and then call it:
filedetails ()
{
# read TOTAL_DU < "/tmp/sizes.out";
disksize=`du -s "$1" | awk '{print $1}'`;
let TOTAL_DU+=$disksize;
echo "$TOTAL_DU";
# echo $TOTAL_DU > "/tmp/sizes.out"
}
I'm using the variable TOTAL_DU as a counter to keep a running total of the du of all the files.
I'm running it using parallel or xargs:
find . -type f | parallel -j 8 filedetails
But the variable TOTAL_DU resets every time and the count is not maintained, which is expected, since a new shell is used for each invocation.
I've also tried using a file to export and then read the counter, but because of parallel some invocations complete faster than others, so it's not sequential (as expected) and this is no good...
The question is: is there a way to keep the count whilst using parallel or xargs?
Aside from learning purposes, this is not likely to be a good use of parallel, because:
Calling du like that will quite possibly be slower than just invoking du in the normal way. First, the information about file sizes can be extracted from the directory, and so an entire directory can be computed in a single access. Effectively, directories are stored as a special kind of file object, whose data is a vector of directory entries ("dirents"), each of which contains the name of a file and a reference to its metadata. What you are doing is using find to print these dirents, then getting du to parse each one (every file, not every directory); almost all of this second scan is redundant work.
Insisting that du examine every file prevents it from avoiding double-counting multiple hard-links to the same file. So you can easily end up inflating the disk usage this way. On the other hand, directories also take up diskspace, and normally du will include this space in its reports. But you're never calling it on any directory, so you will end up understating the total disk usage.
You're invoking a shell and an instance of du for every file. Normally, you would only create a single process for a single du. Process creation is a lot slower than reading a filesize from a directory. At a minimum, you should use parallel -X and rewrite your shell function to invoke du on all the arguments, rather than just $1.
There is no way to share environment variables between sibling shells. So you would have to accumulate the results in a persistent store, such as a temporary file or database table. That's also an expensive operation, but if you adopted the above suggestion, you would only need to do it once for each invocation of du, rather than for every file.
So, ignoring the first two issues, and just looking at the last two, solely for didactic purposes, you could do something like the following:
# Create a temporary file to store results
tmpfile=$(mktemp)
# Function which invokes du and safely appends its summary line
# to the temporary file
collectsizes() {
# Get the name of the temporary file, and remove it from the args
tmpfile=$1
shift
# Call du on all the parameters, and get the last (grand total) line
size=$(du -c -s "$@" | tail -n1)
# lock the temporary file and append the dataline under lock
flock "$tmpfile" bash -c 'echo "$1" >> "$2"' _ "$size" "$tmpfile"
}
export -f collectsizes
# Find all regular files, and feed them to parallel taking care
# to avoid problems if files have whitespace in their names
find . -type f -print0 | parallel -0 -X -j8 collectsizes "$tmpfile"
# When all that's done, sum up the values in the temporary file
awk '{s+=$1}END{print s}' "$tmpfile"
# And delete it.
rm "$tmpfile"
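For comparison, and following the answer's first two points, the straightforward non-parallel baseline is a single du invocation over the whole tree, which avoids both the per-file process overhead and the hard-link double counting:
du -sk .    # one process, one traversal; prints the total in KB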

How to parallelize my bash script for use with `find` without facing race conditions?

I am trying to execute a command like this:
find ./ -name "*.gz" -print -exec ./extract.sh {} \;
The gz files themselves are small. Currently my extract.sh contains the following:
# Start delimiter
echo "#####" $1 >> Info
zcat $1 > temp
# Series of greps to extract some useful information
grep -o -P "..." temp >> Info
grep -o -P "..." temp >> Info
rm temp
echo "####" >> Info
Obviously, this is not parallelizable because if I run multiple extract.sh instances, they all write to the same file. What is a smart way of doing this?
I have 80K gz files on a machine with massive horse power of 32 cores.
Assume (just for simplicity and clarity) that all your files start with a-z.
So you could use 26 cores in parallel by launching a find sequence like the one above for each letter. Each find needs to generate its own aggregate file:
find ./ -name "a*.gz" -print -exec ./extract.sh a {} \; &
find ./ -name "b*.gz" -print -exec ./extract.sh b {} \; &
..
find ./ -name "z*.gz" -print -exec ./extract.sh z {} \;
(extract.sh needs to use its first parameter to select a separate "Info" destination file.)
When you want one big aggregate file, just join all the aggregates.
However, I am not convinced this approach will gain much performance. In the end all the file content still has to come off the same disk.
Hard disk head movement will probably be the limitation, not the unzip (CPU) performance.
But it's worth a try.
A quick check through the findutils source reveals that find starts a child process for each exec. I believe it then moves on, though I may be misreading the source. Because of this you are already parallel, since the OS will handle sharing these out across your cores. And through the magic of virtual memory, the same executables will mostly share the same memory space.
The problem you are going to run into is file locking/data mixing. As each individual child runs, it pipes info into your info file. These are individual script commands, so they will mix their output together like spaghetti. This does not guarantee that the files will be in order! Just that all of an individual file's contents will stay together.
To solve this problem, all you need to do is take advantage of the shell's ability to create a temporary file (using tempfile), have each script dump to the temp file, then have each script cat the temp file into the info file. Don't forget to delete your temp file after use.
If the tempfiles are in RAM (see tmpfs), then you will avoid being I/O bound except when writing to your final file and when running the find search.
Tmpfs is a special file system that uses your RAM as "disk space". It will take up to the amount of RAM you allow, not use more than it needs from that amount, and swap to disk as needed if it does fill up.
To use:
Create a mount point (I like /mnt/ramdisk or /media/ramdisk)
Edit /etc/fstab as root
Add tmpfs /mnt/ramdisk tmpfs size=1G 0 0
Run mount /mnt/ramdisk as root to mount your new ramdisk. It will also be mounted at boot.
See the wikipedia entry on fstab for all the options available.
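A sketch of those steps as commands (run as root; the 1G size and /mnt/ramdisk mount point are just examples):
mkdir -p /mnt/ramdisk
echo 'tmpfs /mnt/ramdisk tmpfs size=1G 0 0' >> /etc/fstab
mount /mnt/ramdisk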
You can use xargs to run your search in parallel. --max-procs limits the number of processes run at once (the default is 1):
find ./ -name "*.gz" -print | xargs --max-args 1 --max-procs 32 ./extract.sh
In the ./extract.sh you can use mktemp to write data from each .gz to a temporary file, all of which may be later combined:
# Start delimiter
tmp=`mktemp -t Info.XXXXXX`
src=$1
echo "#####" $1 >> $tmp
zcat $1 > $tmp.unzip
src=$tmp.unzip
# Series of greps to extract some useful information
grep -o -P "..." $src >> $tmp
grep -o -P "..." $src >> $tmp
rm $src
echo "####" >> $tmp
If you have massive horse power you can use zgrep directly, without unzipping first. But it may be faster to zcat first if you have many greps later.
Anyway, later combine everything into a single file:
cat /tmp/Info.* > Info
rm /tmp/Info.*
If you care about order of .gz files apply second argument to ./extract.sh:
find files/ -name "*.gz" | nl -n rz | sed -e 's/\t/\n/' | xargs --max-args 2 ...
And in ./extract.sh:
tmp=`mktemp -t Info.$1.XXXXXX`
src=$2
I would create a temporary directory, then create an output file for each grep (based on the name of the file it processed). Files created under /tmp are typically located on a RAM-backed tmpfs and so will not thrash your hard drive with lots of writes.
You can then either cat it all together at the end, or get each grep to signal another process when it has finished and that process can begin catting files immediately (and removing them when done).
Example:
working_dir="`pwd`"
temp_dir="`mktemp -d`"
cd "$temp_dir"
find "$working_dir" -name "*.gz" | xargs -P 32 -n 1 "$working_dir/extract.sh"
cat *.output > "$working_dir/Info"
rm -rf "$temp_dir"
extract.sh
filename=$(basename "$1")
output="$filename.output"
extracted="$filename.extracted"
zcat "$1" > "$extracted"
echo "#####" $filename > "$output"
# Series of greps to extract some useful information
grep -o -P "..." "$extracted" >> "$output"
grep -o -P "..." "$extracted" >> "$output"
rm "$extracted"
echo "####" >> "$output"
The multiple grep invocations in extract.sh are probably the main bottleneck here. An obvious optimization is to read each file only once, then print a summary in the order you want. As an added benefit, we can speculate that the report can get written as a single block, but it might not prevent interleaved output completely. Still, here's my attempt.
#!/bin/sh
for f; do
zcat "$f" |
perl -ne '
/(pattern1)/ && push @pat1, $1;
/(pattern2)/ && push @pat2, $1;
# ...
END { print "##### '"$f"'\n";
print join ("\n", @pat1), "\n";
print join ("\n", @pat2), "\n";
# ...
print "#### '"$f"'\n"; }'
done
Doing this in awk instead of Perl might be slightly more efficient, but since you are using grep -P I figure it's useful to be able to keep the same regex syntax.
The script accepts multiple .gz files as input, so you can use find -exec extract.sh {} \+ or xargs to launch a number of parallel processes. With xargs you can try to find a balance between sequential jobs and parallel jobs by feeding each new process, say, 100 to 500 files in one batch. You save on the number of new processes, but lose in parallelization. Some experimentation should reveal what the balance should be, but this is the point where I would just pull a number out of my hat and see if it's good enough already.
Granted, if your input files are small enough, the multiple grep invocations will be served from the disk cache, and may turn out to be faster than the overhead of starting up Perl.
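As a sketch of the batching suggestion above (the batch size of 200 and the process count of 32 are just starting points to experiment with; the interleaving caveat mentioned earlier still applies to the combined output):
find ./ -name "*.gz" -print0 | xargs -0 -n 200 -P 32 ./extract.sh > Info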
