Fast string search in a very large file - linux

What is the fastest method for finding the lines in a file that contain any of a set of strings? I have a file containing the strings to search for. This small file (smallF) contains about 50,000 lines and looks like:
stringToSearch1
stringToSearch2
stringToSearch3
I have to search for all of these strings in a larger file (about 100 million lines). If any line in this larger file contains one of the search strings, that line should be printed.
The best method I have come up with so far is
grep -F -f smallF largeF
But this is not very fast. With just 100 search strings in smallF it takes about 4 minutes. For over 50,000 search strings it will take a lot of time.
Is there a more efficient method?

I once noticed that using -E or multiple -e parameters is faster than using -f. Note that this might not be applicable to your problem, as you are searching for 50,000 strings in a larger file. However, I wanted to show you what can be done and what might be worth testing:
Here is what I noticed in detail:
I have a 1.2 GB file filled with random strings:
>ls -has | grep string
1,2G strings.txt
>head strings.txt
Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
etrulbGONKT3pact1SHg2ipcCr7TZ9jc
.....
Now I want to search for strings "ab", "cd" and "ef" using different grep approaches:
Using grep without flags, searching for one string at a time:
grep "ab" strings.txt > m1.out
2,76s user 0,42s system 96% cpu 3,313 total
grep "cd" strings.txt >> m1.out
2,82s user 0,36s system 95% cpu 3,322 total
grep "ef" strings.txt >> m1.out
2,78s user 0,36s system 94% cpu 3,360 total
So in total the search takes nearly 10 seconds.
Using grep with the -f flag, with the search strings in search.txt:
>cat search.txt
ab
cd
ef
>grep -F -f search.txt strings.txt > m2.out
31,55s user 0,60s system 99% cpu 32,343 total
For some reason this takes nearly 32 seconds.
Now using multiple search patterns with -E or -e:
grep -E "ab|cd|ef" strings.txt > m3.out
3,80s user 0,36s system 98% cpu 4,220 total
or
grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null
3,86s user 0,38s system 98% cpu 4,323 total
The third method, using -E, took only 4.22 seconds to search through the file.
Now let's check whether the results are the same:
cat m1.out | sort | uniq > m1.sort
cat m3.out | sort | uniq > m3.sort
diff m1.sort m3.sort
#
The diff produces no output, which means the found results are the same.
You may want to give it a try; otherwise I would advise you to look at the thread "Fastest possible grep" (see the comment from Cyrus).
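If you do try the -E route on the original problem, the pattern file can be turned into a single alternation on the fly. This is only a sketch: it assumes the search strings contain no regex metacharacters, and a 50,000-branch pattern may run into argument-length or regex-engine limits, so test it on a subset first:
# Build one big "s1|s2|s3|..." pattern from smallF and hand it to grep -E.
grep -E "$(paste -sd'|' smallF)" largeF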

You may want to try sift or ag. Sift in particular lists some pretty impressive benchmarks versus grep.

Note: I realise the following is not a bash-based solution, but given your large search space, a parallel solution is warranted.
If your machine has more than one core/processor, you could use Pythran to compile and parallelize the following function:
#!/usr/bin/env python
#pythran export search_in_file(str, str)
def search_in_file(long_file_path, short_file_path):
    # Read the large file once and collect the search strings.
    _long = open(long_file_path, "r").read()
    _strings = [line.strip() for line in open(short_file_path, "r")]
    #omp parallel for schedule(guided)
    for _string in _strings:
        if _string in _long:
            print(_string)

if __name__ == "__main__":
    search_in_file("long_file_path", "short_file_path")
Note: Behind the scenes, Pythran takes Python code and attempts to aggressively compile it into very fast C++.
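If you go this route, the usual workflow (as I understand the Pythran documentation; the file name search_in_file.py is an assumption) is to compile the module once and then call it from regular Python:
# Compile to a native module; -fopenmp enables the OpenMP pragma used above
# (adjust if your Pythran/compiler setup handles OpenMP differently).
pythran -fopenmp search_in_file.py
# Then call the compiled function from ordinary Python:
python -c "import search_in_file; search_in_file.search_in_file('largeF', 'smallF')"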

Related

What is the most efficient method to handle reading multiple patterns through multiple parts of a file?

I have a bash script that reads a log file and searches for various patterns in various parts of the file. I am trying to think of the best way to optimize this flow, and I currently have it set such that when searching through the file, it will only search through the areas needed (last 15 minutes of logging, first 15 minutes of logging, last hour, last 30 minutes, etc.).
As you can imagine, as the file size increases, so do the search times.
Most of the greps are fgreps with -m / -c to optimize them, and I use sed to portion out the file, along with special redirection and tac in some cases to read the file from the bottom up rather than top to bottom, to maximize efficiency. The main thing I have not implemented is setting LC_ALL=C before the greps, but regardless of that, I am still looking for best practices and the biggest performance gains here.
FILENAME="filename" #Can be large large file 10s of GB
search1 = $(<$FILENAME tac | sed "/thirty_mins_ago/q" | fgrep -c -m 1 "FATAL")
search2 = $(<$FILENAME tac | sed "/fifteen_mins_ago/q" | fgrep -c -m 1 "Exception")
search3 = $(<$FILENAME cat | sed "/start_of_day_fifteen/q" | fgrep -c -m "Maintenance complete")
if [ $search1 -ge 1 ];
echo "FATAL ERRORS DETECTED"
fi
etc
Would storing the entire file in a variable result in a quicker read time for grep, or would splitting the file into multiple variables corresponding to blocks of time and then grepping those provide significant performance improvements?
Any suggestions on the best ways to optimize this and improve its efficiency are much appreciated.
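On the LC_ALL=C point raised in the question, a minimal sketch of how it could be applied: setting it once at the top of the script makes every tac, sed and fgrep in the pipelines use plain byte comparison instead of locale-aware matching (the gain depends on your current locale, so benchmark it; variable and pattern names are taken from the script above):
export LC_ALL=C    # inherited by tac, sed and fgrep below
search1=$(<"$FILENAME" tac | sed "/thirty_mins_ago/q" | fgrep -c -m 1 "FATAL")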

Trying to scrub 700,000 records against 15 million records

I am trying to scrub 700,000 records obtained from a single file against 15 million records spread across multiple files.
Example: one file of 700,000 records, call it A; a pool of multiple files holding 15 million records, call it B.
I want the pool B of files to end up containing no data from file A.
Below is the shell script I am trying to use. It works correctly, but it is taking a massive amount of time, more than 8 hours, to do the scrubbing.
IFS=$'\r\n' suppressionArray=($(cat abhinav.csv1))
suppressionCount=${#suppressionArray[@]}
cd /home/abhinav/01-01-2015/
for (( j=0; j<$suppressionCount; j++));
do
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
arrayOffileNameInWhichSuppressionFoundCount=${#arrayOffileNameInWhichSuppressionFound[@]}
if [ $arrayOffileNameInWhichSuppressionFoundCount -gt 0 ];
then
echo -e "${suppressionArray[$j]}" >> /home/abhinav/emailid_Deleted.txt
for (( k=0; k<$arrayOffileNameInWhichSuppressionFoundCount; k++));
do
sed "/^${suppressionArray[$j]}/d" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]} > /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" && mv -f /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}
done
fi
done
Another solution that occurred to me is to break the 700k records down into smaller files of 50k each and send them across the 5 available servers; pool A would also be available on each server.
Each server would then handle two of the smaller files.
These two lines are peculiar:
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
The first assigns an empty string to the mile-long variable name because the standard output is directed to the file. The second then reads that file into the array. ('Tis curious that the name is not arrayOfFileNameInWhichSuppressionFound, but the lower-case f for file is consistent, so I guess it doesn't matter beyond making it harder to read the variable name.)
That could be reduced to:
ArrFileNames=( $(grep -l "${suppressionArray[$j]}," *.csv) )
You shouldn't need to keep futzing with carriage returns in IFS; either set it permanently, or make sure there are no carriage returns before you start.
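For example (a sketch using the file name from the question), stripping the carriage returns once up front means you never need to touch IFS for them again:
# Remove CR characters once and work with the clean copy from then on.
tr -d '\r' < abhinav.csv1 > abhinav.csv1.clean && mv abhinav.csv1.clean abhinav.csv1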
You're running these loops 7,00,000 times (using the Indian notation). That's a lot. No wonder it is taking hours. You need to group things together.
You should probably simply take the lines from abhinav.csv1 and arrange to convert them into appropriate sed commands, and then split them up and apply them. Along the lines of:
sed 's%.*%/&,/d%' abhinav.csv1 > names.tmp
split -l 500 names.tmp sed-script.
for script in sed-script.*
do
sed -f "$script" -i.bak *.csv
done
This uses the -i option to back up the files. It may be necessary to do the redirection explicitly if your sed does not support the -i option:
for file in *.csv
do
sed -f "$script" "$file" > "$file.tmp" &&
mv "$file.tmp" "$file"
done
You should experiment to see how big the scripts can be. I chose 500 in the split command as a moderate compromise. Unless you're on antique HP-UX, that should be safe, but you may be able to increase the size of the scripts, which will reduce the number of times you have to edit each file and so speed up the processing. If you can use 5,000 or 50,000, you should do so. Experiment to see what the upper limit is. I'm not sure you'd find doing all 700,000 lines at once feasible, but it should be fastest if you can do it that way.
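A rough way to run that experiment, sketched against a single sample file (sample.csv here is only a stand-in for one of your real CSV files):
# Time one pass of the first generated script for a few chunk sizes.
for n in 500 5000 50000
do
    split -l "$n" names.tmp "sed-script-$n."
    time sed -f "sed-script-$n.aa" sample.csv > /dev/null
done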

optimize extracting text from large data set

I have a recurring problem where I need to search log files for all threads that match a pattern, e.g. the following:
(thread a) blah blah
(thread b) blha
(thread a) match blah
(thread c) blah blah
(thread d) blah match
(thread a) xxx
will produce all lines from threads a & d
There are multiple log files (compressed), numbering in the hundreds or thousands. Each file is up to ~20 MB uncompressed.
The way I do this now is to first grep "match" in all the files, cut the (thread x) portion into a sorted, de-duplicated file, and then use fgrep with that file of matching threads on the original log set.
I'm already parallelizing the initial grep and the final grep. However this is still slow.
Is there a way to improve performance of this workload?
(I thought about Hadoop, but it requires too many resources to set up/implement.)
This is the script for the overall process.
#!/bin/bash
FILE=/tmp/thg.$$
PAT=$1
shift
trap "rm -f $FILE" 3 13
pargrep "$PAT" $# | awk '{ print $6 }' | sed 's/(\(.*\)):/\1/' | sort | uniq >$FILE
pargrep --fgrep $FILE $#
rm -f $FILE
The parallel grep is a much longer script that manages a queue of up to N grep processes that work on M files. Each grep process produces an intermediate file (in /dev/shm or /tmp - some memory file system), that are then concatenated once the queue drains from tasks.
I had to reboot my workstation today after it ran on a set of ~3000 files for over 24 hours. I guess a dual Xeon with 8 GB of RAM and 16 GB of swap is not up to such a workload :-)
Updated
In order to unzip and grep your files in parallel, try using GNU Parallel something like this:
parallel -j 8 -k --eta 'gzcat {} | grep something ' ::: *.gz
Original Answer
Mmmmmm... I see you are not getting any answers, so I'll stick my neck out and have a try. I have not tested this as I don't have many spare GB of your data lying around...
Firstly, I would benchmark the scripts and see what is eating the time. In concrete terms, is it the initial grep phase, or the awk|sed|sort|uniq phase?
I would remove the sort|uniq because I suspect that sorting multiple GB will eat your time up. If that is the case, maybe try replacing the sort|uniq with awk like this:
pargrep ... | ... | awk '!seen[$0]++'
which will only print each line the first time it occurs without the need to sort the entire file.
Failing that, I think you need to benchmark the times for the various phases and report back. Also, you should consider posting your pargrep code as that may be the bottleneck. Also, have you considered using GNU Parallel for that part?

Count lines in large files

I commonly work with text files of ~20 GB in size and I find myself counting the number of lines in a given file very often.
The way I do it now is just cat fname | wc -l, and it takes a very long time. Is there any solution that'd be much faster?
I work on a high-performance cluster with Hadoop installed. I was wondering if a map-reduce approach could help.
I'd like the solution to be as simple as a one-line run, like the wc -l solution, but I'm not sure how feasible that is.
Any ideas?
Try: sed -n '$=' filename
Also, cat is unnecessary: wc -l filename is enough for your present approach.
Your limiting factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help, because the execution speed difference between those programs is likely to be dwarfed by the much slower disk/storage/whatever you have.
But if you have the same file copied across disks/devices, or the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know specifically about Hadoop, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results up:
$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &
Notice the & at the end of each command line, so all of them run in parallel; dd works like cat here, but allows us to specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script to do this for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done here for my lack of shell-scripting ability (see the sketch below).
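A possible way to collect and sum those four counts (only a sketch, reusing the paths and block counts from the example above):
# Each partial count goes to its own file; wait for all jobs, then add them up.
dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file 2>/dev/null | wc -l > c1 &
dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file 2>/dev/null | wc -l > c2 &
dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file 2>/dev/null | wc -l > c3 &
dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file 2>/dev/null | wc -l > c4 &
wait
cat c1 c2 c3 c4 | paste -sd+ - | bc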
If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split, running many parallel jobs but using the same file path, and you may still see some speed gain.
EDIT:
Another idea that occurred to me: if the lines inside the file all have the same size, you can get the exact number of lines by dividing the size of the file by the size of a line, both in bytes. You can do it almost instantaneously in a single job. If you have the mean line size and don't care about the exact line count but want an estimate, you can do the same operation and get a satisfactory result much faster than the exact operation.
As per my test, I can verify that spark-shell (based on Scala) is way faster than the other tools (grep, sed, awk, perl, wc). Here is the result of the test that I ran on a file that had 23,782,409 lines:
time grep -c $ my_file.txt;
real 0m44.96s
user 0m41.59s
sys 0m3.09s
time wc -l my_file.txt;
real 0m37.57s
user 0m33.48s
sys 0m3.97s
time sed -n '$=' my_file.txt;
real 0m38.22s
user 0m28.05s
sys 0m10.14s
time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;
real 0m23.38s
user 0m20.19s
sys 0m3.11s
time awk 'END { print NR }' my_file.txt;
real 0m19.90s
user 0m16.76s
sys 0m3.12s
spark-shell
import org.joda.time._
val t_start = DateTime.now()
sc.textFile("file://my_file.txt").count()
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()
res1: org.joda.time.Seconds = PT15S
On a multi-core server, use GNU parallel to count file lines in parallel. After each file's line count is printed, bc sums all the line counts.
find . -name '*.txt' | parallel 'wc -l {}' 2>/dev/null | paste -sd+ - | bc
To save space, you can even keep all files compressed. The following line uncompresses each file and counts its lines in parallel, then sums all counts.
find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc
If your data resides on HDFS, perhaps the fastest approach is to use Hadoop streaming. Apache Pig's COUNT UDF operates on a bag, and therefore uses a single reducer to compute the number of rows. Instead, you can manually set the number of reducers in a simple Hadoop streaming script as follows:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l"
Note that I manually set the number of reducers to 100, but you can tune this parameter. Once the map-reduce job is done, the result from each reducer is stored in a separate file. The final count of rows is the sum of the numbers returned by all the reducers. You can get the final count of rows as follows:
$HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc
I know the question is a few years old now, but expanding on Ivella's last idea, this bash script estimates the line count of a big file within seconds or less by measuring the size of one line and extrapolating from it:
#!/bin/bash
head -2 $1 | tail -1 > $1_oneline
filesize=$(du -b $1 | cut -f -1)
linesize=$(du -b $1_oneline | cut -f -1)
rm $1_oneline
echo $(expr $filesize / $linesize)
If you name this script lines.sh, you can call lines.sh bigfile.txt to get the estimated number of lines. In my case (about 6 GB, export from database), the deviation from the true line count was only 3%, but ran about 1000 times faster. By the way, I used the second, not first, line as the basis, because the first line had column names and the actual data started in the second line.
If your bottleneck is the disk, it matters how you read from it. dd if=filename bs=128M | wc -l is a lot faster than wc -l filename or cat filename | wc -l for my machine that has an HDD and fast CPU and RAM. You can play around with the block size and see what dd reports as the throughput. I cranked it up to 1GiB.
Note: There is some debate about whether cat or dd is faster. All I claim is that dd can be faster, depending on the system, and that it is for me. Try it for yourself.
Hadoop is essentially providing a mechanism to perform something similar to what @Ivella is suggesting.
Hadoop's HDFS (distributed file system) is going to take your 20 GB file and save it across the cluster in blocks of a fixed size. Let's say you configure the block size to be 128 MB; the file would then be split into 20x8 = 160 blocks of 128 MB.
You would then run a map reduce program over this data, essentially counting the lines for each block (in the map stage) and then reducing these block line counts into a final line count for the entire file.
As for performance, in general the bigger your cluster, the better the performance (more wc's running in parallel, over more independent disks), but there is some overhead in job orchestration, which means that running the job on smaller files will not actually yield quicker throughput than running a local wc.
I'm not sure that python is quicker:
[root@myserver scripts]# time python -c "print len(open('mybigfile.txt').read().split('\n'))"
644306
real 0m0.310s
user 0m0.176s
sys 0m0.132s
[root@myserver scripts]# time cat mybigfile.txt | wc -l
644305
real 0m0.048s
user 0m0.017s
sys 0m0.074s
If your computer has python, you can try this from the shell:
python -c "print len(open('test.txt').read().split('\n'))"
This uses python -c to pass in a command, which basically reads the whole file and splits it on the newline character to get a count of the lines. Note that splitting on '\n' yields one more element than wc -l reports when the file ends with a newline.
@BlueMoon's:
bash-3.2$ sed -n '$=' test.txt
519
Using the above:
bash-3.2$ python -c "print len(open('test.txt').read().split('\n'))"
519
find -type f -name "filepattern_2015_07_*.txt" -exec ls -1 {} \; | cat | awk '//{ print $0 , system("cat " $0 "|" "wc -l")}'
I have a 645GB text file, and none of the earlier exact solutions (e.g. wc -l) returned an answer within 5 minutes.
Instead, here is a Python script that computes the approximate number of lines in a huge file. (My text file apparently has about 5.5 billion lines.) The Python script does the following:
A. Counts the number of bytes in the file.
B. Reads the first N lines in the file (as a sample) and computes the average line length.
C. Computes A/B as the approximate number of lines.
It follows along the line of Nico's answer, but instead of taking the length of one line, it computes the average length of the first N lines.
Note: I'm assuming an ASCII text file, so I expect the Python len() function to return the number of chars as the number of bytes.
Put this code into a file line_length.py:
#!/usr/bin/env python
# Usage:
# python line_length.py <filename> <N>

import os
import sys
import numpy as np

if __name__ == '__main__':
    file_name = sys.argv[1]
    N = int(sys.argv[2])  # Number of first lines to use as sample.

    file_length_in_bytes = os.path.getsize(file_name)

    lengths = []  # Accumulate line lengths.
    num_lines = 0
    with open(file_name) as f:
        for line in f:
            num_lines += 1
            if num_lines > N:
                break
            lengths.append(len(line))

    arr = np.array(lengths)
    lines_count = len(arr)
    line_length_mean = np.mean(arr)
    line_length_std = np.std(arr)

    line_count_mean = file_length_in_bytes / line_length_mean

    print('File has %d bytes.' % (file_length_in_bytes))
    print('%.2f mean bytes per line (%.2f std)' % (line_length_mean, line_length_std))
    print('Approximately %d lines' % (line_count_mean))
Invoke it like this with N=5000.
% python line_length.py big_file.txt 5000
File has 645620992933 bytes.
116.34 mean bytes per line (42.11 std)
Approximately 5549547119 lines
So there are about 5.5 billion lines in the file.
Let us assume:
Your file system is distributed
Your file system can easily fill the network connection to a single node
You access your files like normal files
then you really want to chop the file into parts, count the parts in parallel on multiple nodes and sum up the results from there (this is basically @Chris White's idea).
Here is how you do that with GNU Parallel (version > 20161222). You need to list the nodes in ~/.parallel/my_cluster_hosts and you must have ssh access to all of them:
parwc() {
    # Usage:
    #   parwc -l file
    # Give one chunk per host
    chunks=$(cat ~/.parallel/my_cluster_hosts | wc -l)
    # Build commands that take a chunk each and do 'wc' on that
    # ("map")
    parallel -j $chunks --block -1 --pipepart -a "$2" -vv --dryrun wc "$1" |
        # For each command:
        #   log into a cluster host
        #   cd to the current working dir
        #   execute the command
        parallel -j0 --slf my_cluster_hosts --wd . |
        # Sum up the number of lines
        # ("reduce")
        perl -ne '$sum += $_; END { print $sum,"\n" }'
}
Use as:
parwc -l myfile
parwc -w myfile
parwc -c myfile
With slower I/O, falling back to dd if={file} bs=128M | wc -l helps tremendously by gathering data for wc to churn through.
I've also stumbled upon
https://github.com/crioux/turbo-linecount
which is great.
You could use the following, and it is pretty fast:
wc -l filename # assume the file has 50 lines; output -> 50 filename
In addition, if you just want to get the number without displaying the file name, you can use this trick. It will give you only the number of lines in the file, without its name:
wc -l filename | cut -f1 -d ' ' # space is the delimiter, hence output -> 50
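A slightly simpler variant that avoids the cut step: when wc reads from stdin it has no file name to print, so you get the bare number:
wc -l < filename # output -> 50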

Searching a particular string pattern out of 10000 files in parallel

Problem Statement:-
I need to search for a particular string pattern in around 10,000 files and find the records in those files that contain it. I can use grep here, but it is taking a lot of time.
Below is the command I am using to search for a particular string pattern after unzipping the dat.gz files:
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
If I simply count how many files there are after unzipping the above dat.gz files:
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l
I get around 10,000 files. And I need to search for the above string pattern in all these 10,000 files and find the records that contain it. My command above works fine, but it is very, very slow.
What is the best approach for this? Should we take 100 files at a time and search for the particular string pattern in those 100 files in parallel?
Note:
I am running SunOS
bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc
Do NOT run this in parallel!!!! That's going to bounce the disk head all over the place; it will be much slower.
Since you are reading an archive file, there's one way to get a substantial performance boost: don't write the results of the decompression out to disk. The ideal answer would be to decompress to a stream in memory; if that's not viable, then decompress to a ramdisk.
In any case you do want some parallelism here: one thread should be obtaining the data and then handing it off to another that does the search. That way the search overlaps with the time spent waiting on the disk or on the core doing the decompression, so none of that time is wasted.
(Note that in case of the ramdisk you will want to aggressively read the files it wrote and then kill them so the ramdisk doesn't fill up.)
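Note that a plain pipeline already gives you that two-process split: the decompressor and the search run concurrently, connected by a pipe, with nothing written to disk in between. A sketch, reusing the command from the question (with fgrep, since the pattern is a fixed string):
# gzcat decompresses while fgrep searches the stream; no intermediate files.
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | fgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'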
For starters, you will need to uncompress the file to disk.
This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the uncompressed directory:
for i in `find . -type f`; do ((grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' $i )&); done
So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):
while [ `top -b -n1 | grep -c grep` -gt 10 ]; do echo true; done
I have run this, and it works.... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve upon this, adding one to a count when a new process is started and decrementing by one when a process ends?
for i in `find . -type f`; do ((grep -l 'blah' $i)&); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done
Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
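One way to get that throttling without polling top is xargs with -P, which keeps a fixed number of workers busy. This is only a sketch, and -P is not available in every xargs (the stock SunOS one in particular may lack it), so it assumes GNU findutils or similar:
# At most 10 greps at a time, each handed a batch of up to 100 files.
find . -type f -print | xargs -n 100 -P 10 grep -l 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'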
If you are not using regular expressions you can use the -F option of grep or use fgrep. This may provide you with additional performance.
Your gzcat .... | wc -l does not indicate 10000 files, it indicates 10000 lines total for however many files there are.
This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this:
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
That will run one gzgrep command per batch of as many individual files as it can fit on a command line (there are options to xargs to limit how many, or for a number of other things; see the example below). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs in this way will, however, cut down somewhat on the overall number of new processes that need to be spawned.
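For example, to cap each invocation at 100 files per batch (the pattern is the one from the question):
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs -n 100 gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'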
