How to use grep with a large number (millions) of files to search for a string and get the result in a few minutes - linux

This question is related to
How to use grep efficiently?
I am trying to search for a "string" in a folder which has 8-10 million small (~2-3 kB) plain text files. I need to know all the files which contain "string".
At first I used this
grep "string"
That was super slow.
Then I tried
grep * "string" {} \; -print
Based on linked question, I used this
find . | xargs -0 -n1 -P8 grep -H "string"
I get this error:
xargs: argument line too long
Does anyone know a way to accomplish this task relatively quickly?
I run this search on a server machine which has more than 50GB of available RAM and a 14-core CPU. I wish I could somehow use all that processing power to run this search faster.

You should remove the -0 argument to xargs and increase the -n parameter instead:
... | xargs -n16 ...
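For instance, a sketch of a corrected pipeline; instead of dropping -0 it keeps it but pairs it with find -print0 so unusual filenames survive, and the batch size of 1000 and -P8 are arbitrary tuning choices, not from the answer:
# NUL-delimited paths match xargs -0; large -n batches amortize grep start-up cost,
# -P8 runs eight greps in parallel, and -l prints only the names of matching files.
find . -type f -print0 | xargs -0 -n1000 -P8 grep -l "string"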

It's not that big a stack of files (kudos to 10⁷ files, a messy dream), but I created 100k files (400 MB overall) with
for i in {1..100000}; do head -c 10 /dev/urandom > dummy_$i; done
and ran some tests out of pure curiosity (the keyword 10 I was searching for was chosen randomly):
> time find . | xargs -n1 -P8 grep -H "10"
real 0m22.626s
user 0m0.572s
sys 0m5.800s
> time find . | xargs -n8 -P8 grep -H "10"
real 0m3.195s
user 0m0.180s
sys 0m0.748s
> time grep "10" *
real 0m0.879s
user 0m0.512s
sys 0m0.328s
> time awk '/10/' *
real 0m1.123s
user 0m0.760s
sys 0m0.348s
> time sed -n '/10/p' *
real 0m1.531s
user 0m0.896s
sys 0m0.616s
> time perl -ne 'print if /10/' *
real 0m1.428s
user 0m1.004s
sys 0m0.408s
Btw. there isn't a big difference in running time if I suppress the output by piping STDOUT to /dev/null. I am using Ubuntu 12.04 on a not so powerful laptop ;)
My CPU is Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz.
More curiosity:
> time find . | xargs -n1 -P8 grep -H "10" 1>/dev/null
real 0m22.590s
user 0m0.616s
sys 0m5.876s
> time find . | xargs -n4 -P8 grep -H "10" 1>/dev/null
real 0m5.604s
user 0m0.196s
sys 0m1.488s
> time find . | xargs -n8 -P8 grep -H "10" 1>/dev/null
real 0m2.939s
user 0m0.140s
sys 0m0.784s
> time find . | xargs -n16 -P8 grep -H "10" 1>/dev/null
real 0m1.574s
user 0m0.108s
sys 0m0.428s
> time find . | xargs -n32 -P8 grep -H "10" 1>/dev/null
real 0m0.907s
user 0m0.084s
sys 0m0.264s
> time find . | xargs -n1024 -P8 grep -H "10" 1>/dev/null
real 0m0.245s
user 0m0.136s
sys 0m0.404s
> time find . | xargs -n100000 -P8 grep -H "10" 1>/dev/null
real 0m0.224s
user 0m0.100s
sys 0m0.520s

8 million files is a lot in one directory! However, 8 million times 2kb is 16GB and you have 50GB of RAM. I am thinking of a RAMdisk...
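A minimal sketch of that idea, assuming a tmpfs mount (the mount point, the 20G size, and the paths are illustrative, not from the comment):
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=20G tmpfs /mnt/ramdisk   # RAM-backed filesystem
cp -a /path/to/files /mnt/ramdisk/                   # copy the corpus into RAM once
grep -rl "string" /mnt/ramdisk/files                 # subsequent searches avoid disk I/O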

If you've got that much RAM, why not read it all into memory and use a regular expression library to search? It's a simple C program:
#include <fcntl.h>
#include <regex.h>
...

Related

Limit number of parallel jobs in bash [duplicate]

I want to read links from a file, which is passed as an argument, and download the content from each.
How can I do it in parallel with 20 processes?
I understand how to do it with an unlimited number of processes:
#!/bin/bash
filename="$1"
mkdir -p saved
while read -r line; do
    url="$line"
    name_download_file_sha="$(echo $url | sha256sum | awk '{print $1}').jpeg"
    curl -L $url > saved/$name_download_file_sha &
done < "$filename"
wait
You can add this test:
until [ "$( jobs -lr 2>&1 | wc -l)" -lt 20 ]; do
    sleep 1
done
This will maintain a maximum of 21 instances of curl in parallel, and wait until the count drops to 19 or lower before starting another one.
If you are using GNU sleep, you can do sleep 0.5 to reduce the wait time.
So your code would be
#!/bin/bash
filename="$1"
mkdir -p saved
while read -r line; do
    until [ "$( jobs -lr 2>&1 | wc -l)" -lt 20 ]; do
        sleep 1
    done
    url="$line"
    name_download_file_sha="$(echo $url | sha256sum | awk '{print $1}').jpeg"
    curl -L $url > saved/$name_download_file_sha &
done < "$filename"
wait
xargs -P is the simple solution. It gets somewhat more complicated when you want to save to separate files, but you can use sh -c to add this bit.
: ${processes:=20}
< $filename xargs -P $processes -I% sh -c '
    line="$1"
    url_file="$line"
    name_download_file_sha="$(echo $url_file | sha256sum | awk "{print \$1}").jpeg"
    curl -L "$url_file" > "saved/$name_download_file_sha"
' -- %
Based on triplee's suggestions, I've lower-cased the environment variable and changed its name to 'processes' to be more correct.
I've also made the suggested corrections to the awk script to avoid quoting issues.
You may still find it easier to replace the awk script with cut -f1, but you'll need to specify the cut delimiter if it's spaces (not tabs).
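A sketch of that cut variant (using the url_file variable from the snippet above; -d' ' is needed because sha256sum separates the hash from the file name with spaces, not tabs):
name_download_file_sha="$(echo "$url_file" | sha256sum | cut -d' ' -f1).jpeg"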

Running multiple executables in parallel with a limit of 5

I am running a large number (100) of calculations. Each calculation needs 5 hours to finish and can run only on one core. So in order to make the whole process more efficient I need to run 5 of them (or more, depending on the number of CPUs) at the same time.
I have 100 folders and in each there is one executable. How can I run all 100 executables so that only 5 are running at the same time?
I have tried xargs, as seen in Parallelize Bash script with maximum number of processes:
cpus=5
find . -name ./*/*.exe | xargs --max-args=1 --max-procs=$cpus echo
I can't find a way to actually run them; echo only prints the paths on the screen.
You can execute each input line as the command by specifying a substitution pattern and using it as the command to run:
$ printf '%s\n' uname pwd | xargs -n 1 -I{} {}
Linux
/data/data/com.termux/files/home
or more easily but less directly by using env as the command, since that will execute its first argument:
printf '%s\n' uname pwd | xargs -n 1 env
In your case, your full command might be:
find . -name '*.exe' | xargs -n 1 -P 5 env
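If each executable needs to be started from inside its own folder (an assumption; the answer above does not require it), here is a sketch of a variant that changes directory first:
# For each .exe found, cd into its directory and run it there;
# -P 5 keeps at most five of them running at once.
find . -name '*.exe' -print0 |
    xargs -0 -n 1 -P 5 sh -c 'cd "$(dirname "$1")" && "./$(basename "$1")"' --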

How fast is your GREP in Cygwin?

I'm using Cygwin on a very fast PC but I find it is ridiculously slow when I want to use grep. It is also slow when I want to process a large file (say 25 MB). Here I'm using an example to prove my case.
> time for i in $(seq 1000); do grep "$i" .; done
real 75.865 user 5.442 sys 14.542 pcpu 26.34
I want to know:
Show me your score. Have you had a similar problem with the slowness of Cygwin or GNU grep?
How can you improve the performance?
What are your tips for using Cygwin?
uname -rvs
CYGWIN_NT-6.1-WOW64 1.7.9(0.237/5/3) 2011-03-29 10:10
which grep
grep is /usr/bin/grep
grep is /bin/grep
grep is /usr/bin/grep
$ time for i in $(seq 1000); do grep "$i" .; done
real 0m13.741s
user 0m3.520s
sys 0m8.577s
$ uname -rvs
CYGWIN_NT-6.1-WOW64 1.7.15(0.260/5/3) 2012-05-09 10:25
$ which grep
/usr/bin/grep

Check free disk space for current partition in bash

I am writing an installer in bash. The user will go to the target directory and run the install script, so the first action should be to check that there is enough space. I know that df will report on all file systems, but I was wondering if there was a way to get the free space just for the partition that the target directory is on.
Edit - the answer I came up with
df $PWD | awk '/[0-9]%/{print $(NF-2)}'
This is slightly odd because df seems to format its output to fit the terminal, so with a long mount point name the numbers are shifted down a line.
Yes:
df -k .
for the current directory.
df -k /some/dir
if you want to check a specific directory.
You might also want to check out the stat(1) command if your system has it. You can specify output formats to make it easier for your script to parse. Here's a little example:
$ echo $(($(stat -f --format="%a*%S" .)))
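A minimal sketch of using that in an installer check (the 10 GiB threshold and the variable name are illustrative assumptions, not from the answer):
free_bytes=$(($(stat -f --format="%a*%S" .)))   # free blocks * block size = free bytes
if [ "$free_bytes" -lt 10737418240 ]; then      # 10 GiB, illustrative threshold
    echo "Not enough free space in $(pwd)" >&2
    exit 1
fi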
df command: Report file system disk space usage
du command: Estimate file space usage
Type df -h or df -k to list free disk space:
$ df -h
OR
$ df -k
du shows how much space one or more files or directories are using:
$ du -sh
The -s option summarizes the space a directory is using, and the -h option provides human-readable output.
I think this should be a comment or an edit to ThinkingMedia's answer on this very question (Check free disk space for current partition in bash), but I am not allowed to comment (not enough rep) and my edit has been rejected (reason: "this should be a comment or an answer").
So please, powers of the SO universe, don't damn me for repeating and fixing someone else's "answer". But someone on the internet was wrong!™ and they wouldn't let me fix it.
The code
df --output=avail -h "$PWD" | sed '1d;s/[^0-9]//g'
has a substantial flaw:
Yes, it will output 50G free as 50 -- but it will also output 5.0M free as 50 or 3.4G free as 34 or 15K free as 15.
To create a script that checks for a certain amount of free disk space, you have to know the unit you're checking against. Remove it (as sed does in the example above) and the numbers don't make sense anymore.
If you actually want it to work, you will have to do something like:
FREE=$(df -k --output=avail "$PWD" | tail -n1)   # df -k not df -h
if [[ $FREE -lt 10485760 ]]; then                # 10G = 10*1024*1024k
    echo "Less than 10GBs free!"                 # an empty if body would be a bash syntax error
fi
Also, for an installer, df -k "$INSTALL_TARGET_DIRECTORY" might make more sense than df -k "$PWD".
Finally, please note that the --output flag is not available in every version of df / linux.
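Where --output isn't available, a portable fallback is to parse POSIX df output; a sketch (the variable name matches the snippet above):
FREE=$(df -Pk "$PWD" | awk 'NR==2 {print $4}')   # column 4 of POSIX df -P output is "Available" (in KB with -k)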
df --output=avail -B 1 "$PWD" | tail -n 1
This way you get the size in bytes.
A complete example for someone who may want to use this to monitor a mount point on a server. This example will check if /var/spool is under 5G and email the person:
#!/bin/bash
# -----------------------------------------------------------------------------------------
# SUMMARY: Check if MOUNT is under certain quota, mail us if this is the case
# DETAILS: If under 5G we have it alert us via email. blah blah
# -----------------------------------------------------------------------------------------
# CRON: 0 0,4,8,12,16 * * * /var/www/httpd-config/server_scripts/clear_root_spool_log.bash
MOUNTP=/var/spool                                   # mount drive to check
LIMITSIZE=5485760                                   # limit size in KB (~5G floor quota)
FREE=$(df -k --output=avail "$MOUNTP" | tail -n1)   # df -k not df -h
LOG=/tmp/log-$(basename ${0}).log
MAILCMD=mail
EMAILIDS="dude@wheres.mycar"
MAILMESSAGE=/tmp/tmp-$(basename ${0})
# -----------------------------------------------------------------------------------------
function email_on_failure(){
    sMess="$1"
    echo "" >$MAILMESSAGE
    echo "Hostname: $(hostname)" >>$MAILMESSAGE
    echo "Date & Time: $(date)" >>$MAILMESSAGE
    # Email letter formation here:
    echo -e "\n[ $(date +%Y%m%d_%H%M%S%Z) ] Current Status:\n\n" >>$MAILMESSAGE
    cat $sMess >>$MAILMESSAGE
    echo "" >>$MAILMESSAGE
    echo "*** This email was generated by the $(basename $0) shell script ***" >>$MAILMESSAGE
    echo "*** Please don't reply to this email, this is just a notification email ***" >>$MAILMESSAGE
    # sending email (need to have an email client set up or sendmail)
    $MAILCMD -s "Urgent MAIL Alert For $(hostname) AWS Server" "$EMAILIDS" < $MAILMESSAGE
    [[ -f $MAILMESSAGE ]] && rm -f $MAILMESSAGE
}
# -----------------------------------------------------------------------------------------
if [[ $FREE -lt $LIMITSIZE ]]; then
    echo "Writing to $LOG"
    echo "MAIL ERROR: Less than $(($FREE/1000)) MB free (QUOTA) on $MOUNTP!" | tee ${LOG}
    echo -e "\nPotential Files To Delete:" | tee -a ${LOG}
    find $MOUNTP -xdev -type f -size +500M -exec du -sh {} ';' | sort -rh | head -n20 | tee -a ${LOG}
    email_on_failure ${LOG}
else
    echo "Currently $((($FREE-$LIMITSIZE)/1000)) MB of QUOTA available on $MOUNTP."
fi
This is one of those questions where everyone has their favorite approach, but since I have returned to this page a few times over the years I will share one of my solutions (inspired by others here).
DISK_SIZE_TOTAL=$(df -kh . | tail -n1 | awk '{print $2}')
DISK_SIZE_FREE=$(df -kh . | tail -n1 | awk '{print $4}')
DISK_PERCENT_USED=$(df -kh . | tail -n1 | awk '{print $5}')
Since it's just using df and pulling row/columns via awk it should be fairly portable.
Then you can use these in a script, for example:
echo "${DISK_SIZE_FREE} available out of ${DISK_SIZE_TOTAL} total (${DISK_PERCENT_USED} used)."
Example: https://github.com/littlebizzy/slickstack/blob/master/bash/ss-install.txt
The final result looks like this:
10GB available out of 20GB total (50% used).
To see the disk space for the filesystem containing a specific directory in human-readable units (GBs or TBs) on Linux, the command is
df -h /dir/inner_dir/
or, for the disk usage of the directory itself,
du -sh /dir/inner_dir/
and to get the figures in 1K blocks instead, the command is
df -k /dir/inner_dir/
Type in the command shell:
df -h
or
df -m
or
df -k
It will show the free disk space for each mount point.
You can also show just a single column.
Type:
df -m | awk '{print $3}'
Note: Here 3 is the column number. You can choose which column you need.

Finding process count in Linux via command line

I was looking for the best way to find the number of running processes with the same name via the command line in Linux. For example if I wanted to find the number of bash processes running and get "5". Currently I have a script that does a 'pidof ' and then does a count on the tokenized string. This works fine but I was wondering if there was a better way that can be done entirely via the command line. Thanks in advance for your help.
On systems that have pgrep available, the -c option returns a count of the number of processes that match the given name
pgrep -c command_name
Note that this is a grep-style match, not an exact match, so e.g. pgrep sh will also match bash processes. If you want an exact match, also use the -x option.
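A quick illustration of the difference (a hypothetical example; the counts depend on what is running):
pgrep -c sh     # grep-style match: also counts bash, ssh-agent, anything whose name contains "sh"
pgrep -cx sh    # exact match: counts only processes named exactly "sh"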
If pgrep is not available, you can use ps and wc.
ps -C command_name --no-headers | wc -l
The -C option to ps takes command_name as an argument, and the program prints a table of information about processes whose executable name matches the given command name. This is an exact match, not grep-style. The --no-headers option suppresses the headers of the table, which are normally printed as the first line. With --no-headers, you get one line per process matched. Then wc -l counts and prints the number of lines in its input.
result=`ps -Al | grep command-name | wc -l`
echo $result
ps -Al | grep -c bash
You can try:
ps -ef | grep -cw [p]rocess_name
OR
ps aux | grep -cw [p]rocess_name
For example:
ps -ef | grep -cw [i]nit
Some of the above didn't work for me, but they helped me on my way to this.
ps aux | grep [j]ava -c
For newbies to Linux:
ps aux prints all the currently running processes, grep searches for all processes that match the word java, the [] brackets stop grep from matching its own command line (so the grep process itself isn't counted as a match), and finally the -c option stands for count.
List all process names, sort and count
ps --no-headers -A -o comm | sort | uniq -c
You can also list processes attached to a tty
ps --no-headers a -o comm | sort | uniq -c
You may filter with:
ps --no-headers -A -o comm | awk '{ list[$1] ++ } END { for (i in list) { if (list[i] > 10) printf ("%20s: %s\n", i, list[i]) } }'
The following bash script can be run as a cron job, and you can get an email if any process forks itself too many times.
for i in `ps -A -o comm= --sort=+comm | uniq`;
do
    if (( `ps -C $i --no-headers | wc -l` > 10 )); then
        echo `hostname` $i `ps -C $i --no-headers | wc -l`
    fi
done
Replace 10 with your number of concern.
TODO: "10" could be passed as command line parameter as well. Also, few system processes can be put into exception list.
You can use ps (which shows a snapshot of processes) with wc (which counts words; the wc -l option counts lines, i.e. newline characters).
It is very easy and simple to remember.
ps -e | grep processName | wc -l
This simple command will print the number of matching processes running on the current server.
If you want to find the number of processes running on the current server for a specific user, then use the -U option of ps.
ps -U root | grep processName | wc -l
Replace root with the username.
But as mentioned in a lot of other answers, you can also use ps -e | grep -c process_name, which is the more elegant way.
ps aux | wc -l
This command shows the number of processes running on the system for all users.
For a specific user you can use the following command:
ps -u <username> | wc -l
Replace <username> with the actual username before running :)
ps -awef | grep CAP | wc -l
Here "CAP" is the word which is in the my Process_Names.
This command output = Number of Processes + 1
This is why When we are running this command , our system read thats "ps -awef | grep CAP | wc -l " is also a process.
So yes our real answer is (Number of Processes) = Command Output - 1
Note : These processes are only those processes who include the name of "CAP"
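A small sketch of that off-by-one and of the bracket trick (shown in earlier answers) that avoids it:
ps -awef | grep CAP | wc -l      # the "grep CAP" process matches itself: prints N + 1
ps -awef | grep [C]AP | wc -l    # "grep [C]AP" doesn't match the regex [C]AP: prints N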
