Fastest way to search for multiple values on a Linux machine

I want to search for multiple values (say v1, v2, v3....) in a directory with around 6-10 huge files (~300 MB each). I have tried grep and fgrep, with regular expression search like ('v1 | v2 | v3'). The command seems to be running really slow. I am running something like
grep -e 'v1|v2|v3' .
Is there a way I can make my search faster? Please note that the machine is single core, so parallelization may not be of much help.

I'd suggest
LANG=C egrep 'v1|v2|v3'
It won't get much faster than that.
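If the values are plain strings rather than regular expressions, a fixed-string search can be cheaper still. A minimal sketch, assuming the values are literal and collected one per line in a file called patterns.txt (a hypothetical name), with the data files sitting directly in the directory:
# patterns.txt: one literal value per line (hypothetical file name)
LC_ALL=C fgrep -f patterns.txt /path/to/dir/*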

You can search with -R option
-R - Read all files under each directory, recursively.
grep -i 'test' -R .
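Note that plain grep -e 'v1|v2|v3' . treats the | literally and does not descend into the directory; combining the two answers above, a recursive, extended-regexp search would look something like this (sketch):
LANG=C grep -R -E 'v1|v2|v3' .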

Related

Quickly list random set of files in directory in Linux

Question:
I am looking for a performant, concise way to list N randomly selected files in a Linux directory using only Bash. The files must be randomly selected from different subdirectories.
Why I'm asking:
In Linux, I often want to test a random selection of files in a directory for some property. The directories contain 1000's of files, so I only want to test a small number of them, but I want to take them from different subdirectories in the directory of interest.
The following returns the paths of 50 "randomly"-selected files:
find /dir/of/interest/ -type f | sort -R | head -n 50
The directory contains many files, and resides on a mounted file system with slow read times (accessed through ssh), so the command can take many minutes. I believe the issue is that the first find command finds every file (slow), and only then prints a random selection.
If you are using locate and updatedb updates regularly (daily is probably the default), you could:
$ locate /home/james/test | sort -R | head -5
/home/james/test/10kfiles/out_708.txt
/home/james/test/10kfiles/out_9637.txt
/home/james/test/compr/bar
/home/james/test/10kfiles/out_3788.txt
/home/james/test/test
How often do you need it? Do the work periodically in advance to have it quickly available when you need it.
Create a refreshList script.
#!/usr/bin/env bash
find /dir/of/interest/ -type f | sort -R | head -n 50 >/tmp/rand.list
mv -f /tmp/rand.list ~
Put it in your crontab.
0 7-20 * * 1-5 nice -n 19 ~/refreshList
Then you will always have a ~/rand.list that's under an hour old.
If you don't want to use cron and aren't too picky about how old it is, just write a function that refreshes the file after you use it every time.
randFiles() {
    cat ~/rand.list
    {
        find /dir/of/interest/ -type f |
            sort -R | head -n 50 >/tmp/rand.list
        mv -f /tmp/rand.list ~
    } &
}
If you can't run locate and the find command is too slow, is there any reason this has to be done in real time?
Would it be possible to use cron to dump the output of the find command into a file and then do the random pick out of there?
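A minimal sketch of that idea, with placeholder paths and schedule: a nightly cron job caches the full listing, and the interactive pick then only has to shuffle a text file:
# crontab entry: rebuild the full file list once a night (placeholder schedule)
0 3 * * * find /dir/of/interest/ -type f > /tmp/allfiles.list
# interactive pick: shuffle the cached list instead of walking the tree
sort -R /tmp/allfiles.list | head -n 50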

Using grep to find function

I need to find the usage of functions like system("rm filename") & system("rm -r filename").
I tried grep -r --include=*.{cc,h} "system" . and grep -r --include=*.{cc,h} "rm" ., but they give too many results.
How do I search for all the instances of system("rm x"), where 'x' can be anything? I'm kind of new to grep.
Try:
grep -E "system\(\"rm [a-zA-Z0-9 ]*\"\)" file.txt
The character class [a-zA-Z0-9 ] tells grep which characters to accept in place of x in system("rm x"). Unfortunately, grep cannot be told to match just "anything sensible" there, so you need to spell out the allowed characters yourself.
A possible way might be to work inside the GCC compiler. You could use the MELT domain-specific language for that. It provides easy matching on the GIMPLE internal representation of GCC.
It is more complex than textual solutions, but it would also find e.g. calls to system inside functions after inlining and other optimizations.
So customizing the GCC compiler is probably not worth the effort for your case, unless you have a really large code base (e.g. millions of lines of source code).
In a simpler textual based approach, you might pipe two greps, e.g.
grep -rwn system * | grep -w rm
or perhaps just
grep -rn 'system.*rm' *
BTW, in any big enough piece of software, you will probably have a lot of code like e.g.
char cmdbuf[128];
snprintf (cmdbuf, sizeof(cmdbuf), "rm %s", somefilepath);
system (cmdbuf);
and in that case a simple textual grep based approach is not enough (unless you inspect visually surrounding code).
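For that visual inspection, grep's context options can help; a sketch that prints a couple of lines around every call to system in the C++ sources:
# -B2/-A2 show two lines of context before and after each match
grep -rn -B2 -A2 --include='*.cc' --include='*.h' 'system *(' .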
Install ack (http://beyondgrep.com) and your call is:
ack --cc '\bsystem\(.+\brm\b'

Searching a particular string pattern out of 10000 files in parallel

Problem Statement:-
I need to search for a particular string pattern in around 10000 files and find the records in the files which contain that particular pattern. I can use grep here, but it is taking a lot of time.
Below is the command I am using to search a particular string pattern after unzipping the dat.gz file
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
If I simply count how many files are there after unzipping the above dat.gz file
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l
I get around 10000 files. I need to search for the above string pattern in all these 10000 files and find the records which contain it. My command above works fine, but it is very, very slow.
What is the best approach here? Should I take 100 files at a time and search for the pattern in those 100 files in parallel?
Note:
I am running SunOS
bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc
Do NOT run this in parallel!!!! That's going to bounce the disk head all over the place, and it will be much slower.
Since you are reading an archive file there's one way to get a substantial performance boost--don't write the results of the decompression out. The ideal answer would be to decompress to a stream in memory, if that's not viable then decompress to a ramdisk.
In any case you do want some parallelism here--one thread should be obtaining the data and then handing it off to another that does the search. That way you will either be waiting on the disk or on the core doing the decompressing, you won't waste any of that time doing the search.
(Note that in case of the ramdisk you will want to aggressively read the files it wrote and then kill them so the ramdisk doesn't fill up.)
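As a minimal sketch of the ramdisk variant, assuming /tmp is swap-backed tmpfs (the Solaris default) and has room for one decompressed file at a time; the pattern here is shortened to a placeholder:
for f in /data/newfolder/real-time-newdata/*_20120809_0_*.gz; do
    gzcat "$f" > /tmp/current.dat    # decompress into the ramdisk
    grep 'b295ed051380a47a2f65fb75ff0d7aa7' /tmp/current.dat
    rm -f /tmp/current.dat           # free the ramdisk space right away
done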
For starters, you will need to uncompress the file to disk.
This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the uncompressed directory:
for i in `find . -type f`; do ((grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' $i )&); done
So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):
while [ `top -b -n1 | grep -c grep` -gt 10 ]; do echo true; done
I have run this, and it works.... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve upon this, adding one to a count when a new process is started and decrementing by one when a process ends?
for i in `find . -type f`; do ((grep -l 'blah' $i)&); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done
Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
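One way to do that counting without top is to ask the shell itself how many background jobs are still running; a hedged sketch (bash's jobs -r lists only running jobs):
MAXJOBS=10
for i in $(find . -type f); do
    grep -l 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$i" &
    # throttle: pause while MAXJOBS or more greps are still running
    while [ "$(jobs -r | wc -l)" -ge "$MAXJOBS" ]; do
        sleep 1
    done
done
wait    # let the last batch finish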
If you are not using regular expressions you can use the -F option of grep or use fgrep. This may provide you with additional performance.
Your gzcat .... | wc -l does not indicate 10000 files, it indicates 10000 lines total for however many files there are.
This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this:
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
That will run one gzgrep command with batches of as many individual files as it can fit on a command line (there are options to xargs to limit how many, or for a number of other things). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs in this way will however cut down some on the overall number of new processes that need to be spawned.
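For example, to cap each batch at 100 files per gzgrep invocation (xargs -n is the relevant knob):
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs -n 100 gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'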

XARGS, GREP and GNU parallel

Being a linux newbie I am having trouble figuring out some of the elementary aspects of text searching.
What I want to accomplish is as follows:
I have a file containing a list of absolute paths to the files of interest.
I want to go through this list of files and grep for a particular pattern
If the pattern is found in that file, I would like to redirect it to a different output file.
Since these files are spread out on the NFS, I would like to speed up the lookup using GNU parallel.
So..what I did was as follows:
cat filepaths|xargs -iSomePath echo grep -Pl '\d+,\d+,\d+,\d+' \"SomePath\"> FoundPatternsInFile.out| parallel -v -j 30
When I run this command, I am getting the following error repeatedly:
grep: "/path/to/file/name": No such file or directory
The file and the path exists. Can somebody point out what I might be doing wrong with xargs and grep?
Thanks
cat filepaths | parallel -j 30 grep -Pl '\d+,\d+,\d+,\d+' {} > FoundPatternsInFile.out
In this case you can even leave out {}.
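That is, since parallel appends each input line as the final argument when no {} appears in the command, the same thing can be written as:
cat filepaths | parallel -j 30 grep -Pl '\d+,\d+,\d+,\d+' > FoundPatternsInFile.out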

Looking for tool to search text in files on command line

Hello
I'm looking for a script or program that searches files (e.g. .php, .html, etc.) for keywords or patterns and shows where each matching file is.
I use the command cat /home/* | grep "keyword"
but I have too many folders and files, and this command takes a very long time :/
I need this script to find fake (phishing) websites (paypal, ebay, etc.)
find /home -exec grep -s "keyword" {} \; -print
You don't really say what OS (and shell) you are using. You might want to retag your question to help us out.
Because you mention cat | ..., I am assuming you are using a Unix/Linux variant, so here are some pointers for looking at files. (bmargulies' solution is good too.)
I'm looking for a script or program that searches files for keywords or patterns
grep is the basic program for searching files for text strings. Its usage is
grep [-options] 'search target' file1 file2 .... filen
(Note that 'search target' contains a space; if you don't surround a search target that contains spaces with double or single quotes, you will have a minor error to debug.)
(Also note that 'search target' can use a wide range of regular-expression metacharacters, like ., ?, +, *, and many more, but that is beyond the scope of your question.) ... anyway ...
As I guess you have discovered, you can only cram so many files at a time onto the command line, even when using wild-card filename expansion. Unix/Linux almost always has a utility that can help with that,
startDir=/home
find ${startDir} -print | xargs grep -l 'Search Target'
This, as one person will be happy to remind you, will require further enhancements if your filenames contain whitespace characters or newlines.
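One common refinement for such filenames, as a sketch (GNU find/xargs support NUL-separated names):
# NUL-separated names survive spaces and newlines in paths
find ${startDir} -print0 | xargs -0 grep -l 'Search Target'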
The options available for grep can vary wildly based on which OS you are using. If you're lucky, you type the following to get the man page for your local grep.
man grep
If you don't have your page buffer set up for a large size, you might need to do
man grep | page
so you can see the top of the 'document'. Press any key to advance to the next page and when you are at the end of the document, the last key press returns you to the command prompt.
Some options that most greps have that might be useful to you are
-i (ignore case)
-l (list filenames only, where the text is found)
There is also fgrep, which actually stands for fixed-string grep, but it is handy here
because you can give it a file of search targets to scan for, and it is used like
fgrep [-other_options] -f srchTargetsFile file1 file2 ... filen
I need this script to find fake websites (paypal, ebay, etc)
Final solution
You can make a srchFile like
paypal.fake.com
ebay.fake.com
etc.fake.com
and then, combined with the above, run the following
startDir=/home
find ${startDir} -print | xargs fgrep -il -f srchFile
Some greps require that the -f and srchFile be written together, as -fsrchFile.
Now you are finding all files starting from /home and searching with fgrep for paypal, ebay, etc. in all of them. The -l says it will ONLY print the filename where a match is found. You can remove the -l and then you will see the output of what is found, prepended with the filename.
IHTH.
