How to grep through many files of the same file type - Linux

I wish to grep through many (20,000) text files, each with about 1,000,000 lines, so the faster the better.
I have tried the code below and it just doesn't seem to want to do anything; it doesn't find any matches even after an hour (it should have found some by now).
for i in $(find . -name "*.txt"); do grep -Ff firstpart.txt $1; done

Ofir's answer is good. Another option:
find . -name "*.txt" -exec grep -nFH -f firstpart.txt {} \;
I like to add the -n for line numbers and -H to get the filename. -H is particularly useful in this case as you could have a lot of matches.

Instead of iterating through the files in a loop, you can just give the file names to grep using xargs and let grep go over all the files.
find . -name "*.txt" | xargs grep $1
I'm not quite sure whether it will actually increase the performance, but it's probably worth a try.
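If the file names might contain spaces or other odd characters, a null-delimited variant is a safer bet (just a sketch assuming GNU find and xargs, reusing the question's firstpart.txt pattern file):
# -print0 / -0 keep odd file names intact; -F treats the patterns as fixed strings
find . -name "*.txt" -print0 | xargs -0 grep -FHn -f firstpart.txt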

ripgrep is the most amazing tool. You should get that and use it.
To search *.txt files in all directories recursively, do this:
rg -t txt -f patterns.txt
Ripgrep uses one of the fastest regular expression engines out there. It uses multiple threads. It searches directories and files, and filters them to the interesting ones in the fastest way.
It is simply great.
For anyone stuck using grep for whatever reason:
find -name '*.txt' -type f -print0 | xargs -0 -P 8 -n 8 grep -Ff patterns.txt
That tells xargs to use 8 arguments per command (-n 8) and to run 8 copies in parallel (-P 8). The downside is that the output might become interleaved and garbled.
Instead of xargs you could use parallel which does a fancier job and keeps output in order:
$ find -name '*.txt' -type f -print0 | parallel -0 grep --with-filename -Ff patterns.txt

Related

Script /bin/bash argument list too long and grep section not working in find

I'm trying to write a simple script that will find if a field in a file is blank, e.g.
x_Field=""
find /mnt/sdb1/*/*/ -name 'files.txt' -type f -follow -print0 {} \; | xargs -0 grep -o -P '(?<=x_Field).*(?=y_Field)' | cut -c 3 | awk '{sub(/..$/,"")}1'
I think the command works without the find, but not with it:
grep -o -P '(?<=x_Field).*(?=y_Field)' | cut -c 3 | awk '{sub(/..$/,"")}1'
Also, when I get this half working, it seems I have too many files to scan, so it gives an "argument list too long" :-(
Sorry, also to add that I need to go through hundreds of subfolders, which is why I used the wildcards.
Your find command has a number of errors.
Apparently, /mnt/sdb1/*/* expands to a list which is too long for your shell. You can replace that with /mnt/sdb1 -mindepth 2 (assuming you want to avoid finding anything in the directories immediately below sdb1).
The {} \; would be useful if you had an -exec option, but you don't.
Also, the grep | cut | awk can probably be refactored into a single Awk script, but without properly understanding what it's supposed to accomplish, it's hard to write a replacement.
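Putting those fixes together, the find part might look something like this (just a sketch; the grep | cut | awk stage is kept exactly as in the question, since its precise intent isn't clear):
# -mindepth 2 replaces the huge /mnt/sdb1/*/*/ expansion, the stray {} \; is dropped,
# and -print0 | xargs -0 keeps odd file names intact
find /mnt/sdb1 -mindepth 2 -follow -name 'files.txt' -type f -print0 |
    xargs -0 grep -o -P '(?<=x_Field).*(?=y_Field)' | cut -c 3 | awk '{sub(/..$/,"")}1'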

Is it possible to pipe the results of FIND to a COPY command CP?

Like this:
find . -iname "*.SomeExt" | cp Destination Directory
Searching around, I always find this kind of formula, such as from this post:
find . -name "*.pdf" -type f -exec cp {} ./pdfsfolder \;
This raises some questions:
Why can't you just use the | pipe? Isn't that what it's for?
Why does everyone recommend -exec?
How do I know when to use -exec over the pipe |?
There's a little-used option for cp: -t destination -- see the man page:
find . -iname "*.SomeExt" | xargs cp -t Directory
Good question!
Why can't you just use the | pipe? Isn't that what it's for?
You can pipe, of course; xargs is made for these cases:
find . -iname "*.SomeExt" | xargs cp Destination_Directory/
Why does everyone recommend -exec?
The -exec is good because it provides more control of exactly what you are executing. Whenever you pipe there may be problems with corner cases: file names containing spaces or new lines, etc.
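For instance, the -exec form of this copy needs no extra care for such names (a sketch; Destination_Directory/ is just the placeholder name used above):
# One cp per file; {} is handed to cp untouched, so spaces and newlines are safe
find . -iname "*.SomeExt" -exec cp {} Destination_Directory/ \;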
How do I know when to use -exec over the pipe |?
It is really up to you; there can be many cases. I would use -exec whenever the action to perform is simple. I am not a big fan of xargs; I tend to prefer an approach in which the find output is provided to a while loop, such as:
while IFS= read -r result
do
    # do things with "$result"
done < <(find ...)
You can use | like below:
find . -iname "*.SomeExt" | while read -r line
do
    cp "$line" DestDir/
done
Answering your questions:
| can be used to solve this issue, but as seen above it involves a lot of code. Moreover, | creates two processes - one for find and another for cp.
Using -exec inside find instead handles the whole job from a single find command.
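For reference, a sketch of that -exec form (using + to batch file names; -t is a GNU cp option, so treat that part as an assumption on other systems):
# cp -t takes the destination first, so the batched file names can come last
find . -iname "*.SomeExt" -exec cp -t Destination_Directory/ {} +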
Try this:
find . -iname "*.SomeExt" -print0 | xargs -0 cp -t Directory
# ........................^^^^^^^..........^^
In case there is whitespace in filenames.
I like the spirit of the response from #fedorqui-so-stop-harming, but it needed a tweak to work in my bash terminal.
In this version...
find . -iname "*.SomeExt" | xargs cp Destination_Directory/
The cp command incorrectly takes Destination_Directory/ as the first argument. I needed to add a replacement string in order to get xargs to insert the argument in the right position for cp. I used a percent symbol for the replacement string, but you can use anything that doesn't conflict with the input from the pipe. This version works for me.
find . -iname "*.SomeExt" | xargs -I % cp % Destination_Directory/
This SOLVED my problem.
find . -type f | grep '\.pdf' | while read -r line
do
    cp "$line" REPLACE_WITH_TARGET_DIRECTORY
done
If there are spaces in the filenames, try:
find . -iname '*.ext' > list.txt
cat list.txt | awk 'BEGIN {a="'"'"'"}{print "cp "a$0a" Directory"}' > script.sh
sh script.sh
You can inspect list.txt and script.sh before sh script.sh. Remember to delete the list.txt and script.sh afterwards.
I had some files with parenthesis and wanted a progress bar, so replaced the cat line with:
cat list.txt | awk -v X='"' '{print "rsync -Pa "X$0X" /Volumes/Untitled/"}' > script.sh

Optimal string replacement in files on AIX

I need to remove about 40 email addresses from several files in a distribution list.
One address might appear in different files and needs to be removed from all of them.
I am working in a directory with several .sh files which also have several lines.
I have done something like this in a couple of test files:
find . -type f -exec grep -li ADDRESS_TO_FIND {} 2>/dev/null \; | xargs sed -i 's/ADDRESS_TO_REMOVE/ /g' *
It works fine, but once I try it on the real files it takes a long time and just sits there. I need to run this on different servers, which is the main reason I want to optimize it.
I have tried to run something like this:
find . -type f -name '*sh' 2>/dev/null | xargs grep ADDRESS_TO_FIND
but that will return:
./FileContainingAddress.sh:ADDRESS_TO_FIND
How do I add something like this:
awk '{print substr($0,1,10)}'
but have it return everything before the ":"?
I can do the rest from there, but I haven't found how to trim that part.
You can use -exec as a predicate in find, as long as you don't use the multiple-file + version, which means that you can provide several -exec clauses, each of which will depend on the success of the previous one. This style avoids building lists of file names, which makes it much more robust in the face of files with odd characters in their names.
For example:
find . -type f -name '*sh' \
    -exec grep -qi ADDRESS_TO_FIND {} \; \
    -exec sed -i 's/ADDRESS_TO_FIND/ /g' {} \;
You probably want to provide the address as a parameter rather than having to type it twice, unless you really meant for the two instances to be different (ADDRESS_TO_FIND vs. ADDRESS_TO_REMOVE):
clean() {
    find . -type f -name '*sh' \
        -exec grep -qi "$1" {} \; \
        -exec sed -i "s/$1/ /g" {} \;
}
(Watch out for / in the argument to clean. I'll leave making the sed more robust as an exercise.)
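For what it's worth, here is a minimal sketch of that hardening; the escaping idiom is a common one and not part of the original answer:
clean() {
    # Escape [, \, ., *, ^, $ and the / delimiter so sed matches the address literally
    pat=$(printf '%s\n' "$1" | sed 's|[][\.*^$/]|\\&|g')
    find . -type f -name '*sh' \
        -exec grep -qiF "$1" {} \; \
        -exec sed -i "s/$pat/ /g" {} \;
}
The grep also gets -F here so the address is filtered as a fixed string rather than a regex.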
After looking back at your question, I noticed something that's potentially quite important:
find -type f -exec grep -li ADDRESS {} \; | xargs sed -i 's/ADDRESS/ /g' *
# here! -----------------------------------------------------------------^
The asterisk is being expanded, so the sed line is operating on every file in the directory.
Assuming that this wasn't a typo in your question, I believe that this is the source of your poor performance. You should remove it!
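That is, the corrected version of the original pipeline would be (a sketch with only the trailing asterisk dropped; the find/grep/sed logic is otherwise unchanged):
# Without the *, sed only touches the files grep actually listed
find . -type f -exec grep -li ADDRESS_TO_FIND {} \; | xargs sed -i 's/ADDRESS_TO_REMOVE/ /g'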

Piping find results into grep for fast directory exclusion

I am successfully using find to create a list of all files in the current subdirectory, excluding those in the subdirectory "cache." Here's my first bit of code:
find . -wholename './cach*' -prune -o -print
I now wish to pipe this into a grep command. It seems like that should be simple:
find . -wholename './cach*' -prune -o -print | xargs grep -r -R -i "samson"
... but this is returning results that are mostly from the cache directory. I've tried removing the xargs reference, but that does what you'd expect, running the grep on text of the file names, rather than on the files themselves. My goal is to find "samson" in any files that aren't cached content.
I'll probably get around this issue by just using doubled greps in this instance, but I'm very curious about why this one-liner behaves this way. I'd love to hear thoughts on a way to modify it while still using these two commands (as there are speed advantages to doing it this way).
(This is in CentOS 5, btw.)
The wholename match may be the reason why it's still including "cache" files. If you're executing the find command in the directory that contains the "cache" folder, it should work. If not, try changing it to -name '*cache*' instead.
Also, you do not need the -r or -R for your grep; that tells it to recurse through directories, but you're testing individual files.
You can update your command using the piped version, or a single-command:
find . -name '*cache*' -prune -o -print0 | xargs -0 grep -il "samson"
or
find . -name '*cache*' -prune -o -exec grep -iq "samson" {} \; -print
Note, the -l in the first command tells grep to "list the file" and not the line(s) that match. The -q in the second does the same; it tells grep to respond quietly so find will then just print the filename.
You've told grep itself to recurse (twice! -r and -R are synonyms). Since one of the arguments you're passing is . (the top directory), grep is searching in every file (some of them twice, or even more if they're in subdirectories).
If you're going to use find and grep, do this:
find . -path './cach*' -prune -o -print0 | xargs -0 grep -i "samson"
Using -print0 and -0 makes your script work even with file names that contain spaces or punctuation characters.
However, you probably don't need to bother with find here, since GNU grep is capable of excluding directories:
grep -R --exclude-dir='cach*' -i "samson" .
(This also excludes ./deeply/nested/directory/cache. If you only want to exclude cache directories at the toplevel, use find as you did.)
Use the -exec option on find instead of piping the results to another command. From there you can use grep "samson" {} \; to look for samson in each file listed.
For example:
find . -wholename './cach*' -prune -o -exec grep "samson" "{}" +

find string inside a gzipped file in a folder

My current problem is that I have around 10 folders, which contain gzipped files (around 5 in each, on average). This makes it 50 files to open and look at.
Is there a simpler method to find out if a gzipped file inside a folder has a particular pattern or not?
zcat ABC/myzippedfile1.txt.gz | grep "pattern match"
zcat ABC/myzippedfile2.txt.gz | grep "pattern match"
Instead of writing a script, can I do the same in a single line, for all the folders and sub folders?
for f in `ls *.gz`; do echo $f; zcat $f | grep <pattern>; done;
zgrep will look in gzipped files, has a -R recursive option, and a -H show me the filename option:
zgrep -R --include=*.gz -H "pattern match" .
OS-specific commands, as not all arguments work across the board:
Mac 10.5+: zgrep -R --include=\*.gz -H "pattern match" .
Ubuntu 16+: zgrep -i -H "pattern match" *.gz
You don't need zcat here because there is zgrep and zegrep.
If you want to run a command over a directory hierarchy, you use find:
find . -name "*.gz" -exec zgrep ⟨pattern⟩ \{\} \;
Also, “ls *.gz” is useless in a for loop; you should just use “*.gz” in the future.
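For example, the loop from the question could be written like this (a sketch; <pattern> is still a placeholder to fill in):
# Let the shell expand *.gz directly and quote "$f" so odd file names survive
for f in *.gz; do echo "$f"; zcat "$f" | grep '<pattern>'; done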
Note that zgrep doesn't support -R on every system.
I think the solution from "Nietzche-jou" could be a better answer, but I would add the option -H to show the file name, something like this:
find . -name "*.gz" -exec zgrep -H 'PATTERN' \{\} \;
Use the find command:
find . -name "*.gz" -exec zcat "{}" + | grep "test"
or try using the recursive option (-r) of zcat
Coming in a bit late on this; I had a similar problem and was able to resolve it using:
zcat -r /some/dir/here | grep "blah"
As detailed here:
http://manpages.ubuntu.com/manpages/quantal/man1/gzip.1.html
However, this does not show the original file that the result matched from, instead showing "(standard input)" as it's coming in from a pipe. zcat does not seem to support outputting a name either.
In terms of performance, this is what we got:
$ alias dropcache="sync && echo 3 > /proc/sys/vm/drop_caches"
$ find 09/01 | wc -l
4208
$ du -chs 09/01
24M
$ dropcache; time zcat -r 09/01 > /dev/null
real 0m3.561s
$ dropcache; time find 09/01 -iname '*.txt.gz' -exec zcat '{}' \; > /dev/null
real 0m38.041s
As you can see, using the find|zcat method is significantly slower than using zcat -r when dealing with even a small volume of files. I was also unable to make zcat output the file name (using -v will apparently output the filename, but not on every single line). It would appear that there isn't currently a tool that will provide both speed and name consistency with grep (i.e. the -H option).
If you need to identify the name of the file that the result belongs to, then you'll need to either write your own tool (could be done in 50 lines of Python code) or use the slower method. If you do not need to identify the name, then use zcat -r.
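For completeness, if you do need the name and can live with the slower per-file route, GNU grep's --label option will attach a real file name to piped input (a sketch, not from the original answer; --label is GNU-grep-specific):
# One zcat per file, but grep prints the actual file name instead of "(standard input)"
find 09/01 -iname '*.txt.gz' -exec sh -c 'zcat "$1" | grep -H --label="$1" "blah"' sh {} \;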
Hope this helps
find . -name "*.gz"|xargs zcat | grep "pattern" should do.
zgrep "string" ./*/*
You can use the above command to search for a string in the .gz files under the dir directory, where dir has the following sub-directory structure:
/dir
    /childDir1
        /file1.gz
        /file2.gz
    /childDir2
        /file3.gz
        /file4.gz
    /childDir3
        /file5.gz
        /file6.gz
You can use this command -
zgrep "foo" $(find . -name "*.gz")
