linux: find a word in files under a directory, but quickly

I have the following command
find /var -type f -exec grep "param1" {} \; -print
With this command I can find the param1 string in any file under /var,
but it takes a very long time.
I need another way to find a string in files that is much faster than my example.
THX
yael

grep -r "string"
The find is not necessary.
Also, I think this belongs on superuser.com.

Take a look at the -l option to the grep command for a speed boost. To speed up the find command use:
find ... -exec sh -c '...' arg0 '{}' +
# grep ... -l: print files with matches, but stop scanning the file on the first match
grep -lsr "param1" /var
find /var -type f -exec sh -c 'grep -ls "param1" "$#"' arg0 '{}' +
find /var -type f -exec sh -c 'grep -ls "$0" "$#"' "param1" '{}' +
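With GNU grep you can often skip find entirely; a minimal sketch (assuming GNU grep, where -r recurses, -I skips binary files, and -l prints only matching filenames):
grep -rIl "param1" /var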

find /var -type f | xargs grep "param1"
would be slightly faster (no process spawning for each file)
grep -r "param1" /var
would be slightly faster still, I think.
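If you do go the find | xargs route, a NUL-delimited pipeline is safer with odd filenames; a sketch assuming GNU find and xargs:
find /var -type f -print0 | xargs -0 grep -l "param1"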

Also try ack, which is "better than grep" in most cases. Among its features are the ability to ignore typical garbage files by default (such as .svn or .git directories, core dumps, backup files), a large set of predefined file classes, and nice output formatting.

You can use locate's index (if you don't depend on recently added or removed files):
grep "param1" $(locate -r '^/var')

Some of these command optimizations are helpful, but the biggest jump in speed I got when grepping 2 million files came from using an SSD. The same queries took 1/5 of the time.

Related

Moving files with a specific modification date; "find | xargs ls | grep | -exec" fails w/ "-exec: command not found"

I am using CentOS 7.
If I want to find files that have a specific name and a specific date and then move these files to another folder, I issue the command
find -name 'fsimage*' | xargs ls -ali | grep 'Oct 20' | -exec mv {} /hdd/fordelete/ \;
and get the following error:
-bash: -exec: command not found xargs: ls: terminated by signal 13
As another answer already explains, -exec is an action for find; you can't use it as a shell command. Conversely, xargs and grep are commands, and you can't use them as find actions, just like you can't use the pipe | inside find.
But more importantly, even though you could use ls and grep on find's result just to move files older than some amount of time, you shouldn't. Such a pipeline is fragile and fails in many corner cases, like symlinks, files with newlines in the name, etc.
Instead, use find. You'll find it quite powerful.
For example, to mv files modified more than 7 days ago, use the -mtime test:
find -name 'fsimage*' -mtime +7 -exec mv '{}' /some/dir/ \;
To mv files modified on a specific/reference date, e.g. 2017-10-20, you can use the -newerXY test:
find -name 'fsimage*' -newermt 2017-10-20 ! -newermt 2017-10-21 -exec mv '{}' /some/dir/ \;
Also, if your mv supports the -t option (to give the target dir first, multiple files after), you can use the {} + placeholder in find for multiple files, reducing the total number of mv command invocations (thanks @CharlesDuffy):
find -name 'fsimage*' -mtime +7 -exec mv -t /some/dir/ '{}' +
The -exec as you wrote it is meaningless; moreover, it seems you are mixing find syntax with shell syntax (-exec as you wrote it should be passed to find).
There are probably more concise ways of doing it, but this should do what you expect:
find -name 'fsimage*' -type f | xargs ls -ali | grep 'Oct 20' | awk '{ print $NF }' | while read file; do mv "$file" /hdd/fordelete/ ; done
Nevertheless, take care not to just copy/paste things from the web that you do not really understand; you may wreck your system...

Fastest way to grep through thousands of gz files?

I have thousands of .gz files all in one directory. I need to grep through them for the string Mouse::Handler. Is the following the fastest (and most accurate) way to do this?
find . -name "*.gz" -exec zgrep -H 'Mouse::Handler' {} \;
Ideally I would also like to print out the line that I find this string on.
I'm running on a RHEL linux box.
You can search in parallel using
find . -name "*.gz" | xargs -n 1 -P NUM zgrep -H 'Mouse::Handler'
where NUM is around the number of cores you have.
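If your coreutils provides nproc, you can plug the core count in directly; a sketch that also switches to NUL delimiters so odd filenames survive:
find . -name "*.gz" -print0 | xargs -0 -n 1 -P "$(nproc)" zgrep -H 'Mouse::Handler'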

Argument list too long error for rm, cp, mv commands

I have several hundred PDFs under a directory in UNIX. The names of the PDFs are really long (approx. 60 chars).
When I try to delete all PDFs together using the following command:
rm -f *.pdf
I get the following error:
/bin/rm: cannot execute [Argument list too long]
What is the solution to this error?
Does this error occur for mv and cp commands as well? If yes, how to solve for these commands?
The reason this occurs is because bash actually expands the asterisk to every matching file, producing a very long command line.
Try this:
find . -name "*.pdf" -print0 | xargs -0 rm
Warning: this is a recursive search and will find (and delete) files in subdirectories as well. Tack on -f to the rm command only if you are sure you don't want confirmation.
You can do the following to make the command non-recursive:
find . -maxdepth 1 -name "*.pdf" -print0 | xargs -0 rm
Another option is to use find's -delete flag:
find . -name "*.pdf" -delete
tl;dr
It's a kernel limitation on the size of the command line argument. Use a for loop instead.
Origin of problem
This is a system issue, related to execve and ARG_MAX constant. There is plenty of documentation about that (see man execve, debian's wiki, ARG_MAX details).
Basically, the expansion produces a command (with its parameters) that exceeds the ARG_MAX limit.
On kernel 2.6.23, the limit was set at 128 kB. This constant has been increased and you can get its value by executing:
getconf ARG_MAX
# 2097152 # on 3.5.0-40-generic
Solution: Using for Loop
Use a for loop, as recommended in BashFAQ/095; there is no limit except RAM/memory space:
Dry run to ascertain it will delete what you expect:
for f in *.pdf; do echo rm "$f"; done
And execute it:
for f in *.pdf; do rm "$f"; done
Also, this is a portable approach, as globs have strong and consistent behavior among shells (they are part of the POSIX spec).
Note: As noted in several comments, this is indeed slower but more maintainable, as it can adapt to more complex scenarios, e.g. where one wants to do more than just one action.
Solution: Using find
If you insist, you can use find but really don't use xargs as it "is dangerous (broken, exploitable, etc.) when reading non-NUL-delimited input":
find . -maxdepth 1 -name '*.pdf' -delete
Using -maxdepth 1 ... -delete instead of -exec rm {} + allows find to simply execute the required system calls itself without using an external process, hence faster (thanks to @chepner's comment).
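A cautious habit before trusting any -delete line: run the same expression with -print first to review what would go (nothing assumed beyond the GNU find options already used above):
find . -maxdepth 1 -name '*.pdf' -print   # dry run: list what would be removed
find . -maxdepth 1 -name '*.pdf' -delete  # then delete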
References
I'm getting "Argument list too long". How can I process a large list in chunks? # wooledge
execve(2) - Linux man page (search for ARG_MAX) ;
Error: Argument list too long # Debian's wiki ;
Why do I get “/bin/sh: Argument list too long” when passing quoted arguments? # SuperUser
find has a -delete action:
find . -maxdepth 1 -name '*.pdf' -delete
Another answer is to force xargs to process the commands in batches. For instance to delete the files 100 at a time, cd into the directory and run this:
echo *.pdf | xargs -n 100 rm
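Note that echo-based splitting breaks on filenames containing spaces; a safer sketch of the same batching idea, assuming GNU xargs:
printf '%s\0' *.pdf | xargs -0 -n 100 rm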
If you’re trying to delete a very large number of files at one time (I deleted a directory with 485,000+ today), you will probably run into this error:
/bin/rm: Argument list too long.
The problem is that when you type something like rm -rf *, the * is replaced with a list of every matching file, like “rm -rf file1 file2 file3 file4” and so on. There is a relatively small buffer of memory allocated to storing this list of arguments and if it is filled up, the shell will not execute the program.
To get around this problem, a lot of people will use the find command to find every file and pass them one-by-one to the “rm” command like this:
find . -type f -exec rm -v {} \;
My problem is that I needed to delete 500,000 files and it was taking way too long.
I stumbled upon a much faster way of deleting files – the “find” command has a “-delete” flag built right in! Here’s what I ended up using:
find . -type f -delete
Using this method, I was deleting files at a rate of about 2000 files/second – much faster!
You can also show the filenames as you’re deleting them:
find . -type f -print -delete
…or even show how many files will be deleted, then time how long it takes to delete them:
root@devel# ls -1 | wc -l && time find . -type f -delete
100000
real 0m3.660s
user 0m0.036s
sys 0m0.552s
Or you can try:
find . -name '*.pdf' -exec rm -f {} \;
you can try this:
for f in *.pdf
do
rm "$f"
done
EDIT:
ThiefMaster's comment suggested that I not disclose such a dangerous practice to young shell jedis, so I'll add a "safer" version (to preserve things when someone has a "-rf . ..pdf" file):
echo "# Whooooo" > /tmp/dummy.sh
for f in *.pdf
do
echo "rm -i \"$f\""
done >> /tmp/dummy.sh
After running the above, just open the /tmp/dummy.sh file in your favorite editor and check every single line for dangerous filenames, commenting them out if found.
Then copy the dummy.sh script in your working dir and run it.
All this for security reasons.
For someone who doesn't have time:
Run the following command in a terminal.
ulimit -S -s unlimited
Then perform cp/mv/rm operation.
I'm surprised there are no ulimit answers here. I understand this solution has limitations, but ulimit -s 65536 often seems to do the trick for me.
You could use a bash array:
files=(*.pdf)
for ((I=0; I<${#files[@]}; I+=1000)); do
    rm -f "${files[@]:I:1000}"
done
This way it will erase in batches of 1000 files per step.
You can use this command:
find -name "*.pdf" -delete
The rm command has a limit on the number of files you can remove at once.
One possibility is to run the rm command multiple times, based on your file name patterns, like:
rm -f A*.pdf
rm -f B*.pdf
rm -f C*.pdf
...
rm -f *.pdf
You can also remove them through the find command:
find . -name "*.pdf" -exec rm {} \;
If the filenames contain spaces or special characters, use:
find -name "*.pdf" -delete
For files in current directory only:
find -maxdepth 1 -name '*.pdf' -delete
This command searches all files in the current directory (-maxdepth 1) with the pdf extension (-name '*.pdf'), and then deletes them.
I was facing the same problem while copying from a source directory to a destination.
The source directory had ~3 lakh (300,000) files.
I used cp with the -r option and it worked for me:
cp -r abc/ def/
It will copy all files from abc to def without giving the "Argument list too long" warning.
Also try this: if you want to delete files/folders older than 30/90 days (+) or newer than 30/90 days (-), you can use the example commands below.
Ex: to delete files/folders older than 90 days (i.e. 91, 92, ... 100 days old):
find <path> -type f -mtime +90 -exec rm -rf {} \;
Ex: to delete only files from the latest 30 days, use (-):
find <path> -type f -mtime -30 -exec rm -rf {} \;
If you want to gzip files older than 2 days:
find <path> -type f -mtime +2 -exec gzip {} \;
If you want to see only the files/folders from the past month:
Ex:
find <path> -type f -mtime -30 -exec ls -lrt {} \;
To list only files/folders older than 30 days:
Ex:
find <path> -type f -mtime +30 -exec ls -lrt {} \;
find /opt/app/logs -type f -mtime +30 -exec ls -lrt {} \;
And another one:
cd /path/to/pdf
printf "%s\0" *.[Pp][Dd][Ff] | xargs -0 rm
printf is a shell builtin, and as far as I know it always has been. Since printf is not an external command (but a builtin), it's not subject to the "argument list too long ..." fatal error.
So we can safely use it with shell globbing patterns such as *.[Pp][Dd][Ff], then pipe its output to the remove (rm) command through xargs, which makes sure each rm invocation gets only as many file names as fit on the command line, so rm (an external command) does not fail.
The \0 in printf serves as a null separator for the file names, which are then processed by xargs, using it (-0) as the separator, so rm does not fail when there is whitespace or other special characters in the file names.
Argument list too long
The question title covers cp, mv, and rm, but the answers are mostly about rm.
Un*x commands
Read the command's man page carefully!
For cp and mv, there is a -t switch, for target:
find . -type f -name '*.pdf' -exec cp -ait "/path to target" {} +
and
find . -type f -name '*.pdf' -exec mv -t "/path to target" {} +
Script way
There is a general workaround used in bash scripts:
#!/bin/bash
folder=( "/path to folder" "/path to another folder" )
if [ "$1" != "--run" ]; then
    exec find "${folder[@]}" -type f -name '*.pdf' -exec "$0" --run {} +
    exit 0
fi
shift
for file; do
    printf "Doing something with '%s'.\n" "$file"
done
What about a shorter and more reliable one?
for i in **/*.pdf; do rm "$i"; done
I had the same problem with a folder full of temporary images that was growing day by day and this command helped me to clear the folder
find . -name "*.png" -mtime +50 -exec rm {} \;
The difference from the other commands is the -mtime parameter, which selects only files older than X days (50 days in the example).
Running it multiple times, decreasing the day range on each execution, I was able to remove all the unnecessary files.
You can create a temp folder, move all the files and sub-folders you want to keep into it, then delete the old folder and rename the temp folder to the old name. Try this example until you are confident enough to do it live:
mkdir testit
cd testit
mkdir big_folder tmp_folder
touch big_folder/file1.pdf
touch big_folder/file2.pdf
mv big_folder/file1.pdf tmp_folder/
rm -r big_folder
mv tmp_folder big_folder
The rm -r big_folder will remove all files in big_folder no matter how many. You just have to be super careful that you first moved everything you want to keep; in this case it was file1.pdf.
To delete all *.pdf files in a directory /path/to/dir_with_pdf_files/:
mkdir empty_dir # Create temp empty dir
rsync -avh --delete --include '*.pdf' empty_dir/ /path/to/dir_with_pdf_files/
Deleting specific files via rsync using a wildcard is probably the fastest solution if you have millions of files, and it will take care of the error you're getting.
(Optional step) DRY RUN: check what will be deleted without deleting anything:
rsync -avhn --delete --include '*.pdf' empty_dir/ /path/to/dir_with_pdf_files/
I found that for extremely large lists of files (>1e6), these answers were too slow. Here is a solution using parallel processing in python. I know, I know, this isn't linux... but nothing else here worked.
(This saved me hours)
# delete files in parallel
import os
import glob
import multiprocessing as mp

directory = r'your/directory'
os.chdir(directory)
files_names = [i for i in glob.glob('*.{}'.format('pdf'))]

# report errors from the pool
def callback_error(result):
    print('error', result)

# delete a file using a system command
def delete_files(file_name):
    os.system('rm -rf ' + file_name)

if __name__ == '__main__':
    pool = mp.Pool(12)  # or use pool = mp.Pool(mp.cpu_count())
    for file_name in files_names:
        print(file_name)
        pool.apply_async(delete_files, [file_name], error_callback=callback_error)
    pool.close()
    pool.join()
If you want to remove both files and directories, you can use something like:
echo /path/* | xargs rm -rf
I only know a way around this.
The idea is to export that list of pdf files you have into a file. Then split that file into several parts. Then remove pdf files listed in each part.
ls | grep .pdf > list.txt
wc -l list.txt
wc -l counts how many lines list.txt contains. Once you have an idea of how long it is, you can decide to split it in half, into quarters, or so on, using the split -l command.
For example, split it in 600 lines each.
split -l 600 list.txt
This will create a few files named xaa, xab, xac and so on, depending on how you split it.
Now, to "import" each list in those files into the rm command, use this:
rm $(<xaa)
rm $(<xab)
rm $(<xac)
Sorry for my bad english.
I ran into this problem a few times. Many of the solutions will run the rm command for each individual file that needs to be deleted. This is very inefficient:
find . -name "*.pdf" -print0 | xargs -0 rm -rf
I ended up writing a python script to delete the files based on the first 4 characters in the file-name:
import os

filedir = '/tmp/'  # the directory you wish to run rm on
filelist = os.listdir(filedir)  # get a listing of all files in the specified dir
newlist = []  # make a blank list named newlist

for i in filelist:
    if str(i[:4]) not in newlist:  # this makes sure the elements in newlist are unique
        newlist.append(i[:4])  # take only the first 4 characters of the folder/filename and append them to newlist

for i in newlist:
    if 'tmp' in i:  # if statement to look for 'tmp' in the filename/dirname
        # print the command to be run and a total file count
        print('Running command rm -rf ' + str(filedir) + str(i) + '* : File Count: ' + str(len(os.listdir(filedir))))
        os.system('rm -rf ' + str(filedir) + str(i) + '*')  # actual shell command

print('DONE')
This worked very well for me. I was able to clear out over 2 million temp files in a folder in about 15 minutes. I commented the tar out of the little bit of code so anyone with minimal to no python knowledge can manipulate this code.
I faced a similar problem when an application created millions of useless log files which filled up all the inodes. I resorted to locate, got all the files listed into a text file, and then removed them one by one. It took a while but did the job!
I solved it with for.
I am on macOS with zsh.
I moved thousands of jpg files with mv in a one-line command.
Be sure there are no spaces or special characters in the names of the files you are trying to move.
for i in $(find ~/old -type f -name "*.jpg"); do mv $i ~/new; done
A bit safer version than using xargs, also not recursive:
ls -p | grep -v '/$' | grep '\.pdf$' | while read file; do rm "$file"; done
Filtering out directories here is a bit unnecessary, as rm won't delete them anyway, and it could be removed for simplicity, but why run something that will definitely return an error?
Using GNU parallel (sudo apt install parallel) is super easy.
It runs the command in parallel, where '{}' is the argument passed.
E.g.
ls /tmp/myfiles* | parallel 'rm {}'
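Note that the ls glob is still expanded by the shell, so it can hit the same argument-list limit; a sketch that instead feeds parallel NUL-delimited names from find (GNU parallel accepts -0):
find /tmp -maxdepth 1 -name 'myfiles*' -print0 | parallel -0 rm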
To remove the first 100 files:
rm -rf $(ls | head -100)

find string inside a gzipped file in a folder

My current problem is that I have around 10 folders, which contain gzipped files (around 5 each on average). This makes it 50 files to open and look at.
Is there a simpler method to find out if a gzipped file inside a folder has a particular pattern or not?
zcat ABC/myzippedfile1.txt.gz | grep "pattern match"
zcat ABC/myzippedfile2.txt.gz | grep "pattern match"
Instead of writing a script, can I do the same in a single line, for all the folders and sub folders?
for f in `ls *.gz`; do echo $f; zcat $f | grep <pattern>; done;
zgrep will look in gzipped files, has a -R recursive option, and a -H show me the filename option:
zgrep -R --include=*.gz -H "pattern match" .
OS-specific commands, as not all arguments work across the board:
Mac 10.5+: zgrep -R --include=\*.gz -H "pattern match" .
Ubuntu 16+: zgrep -i -H "pattern match" *.gz
You don't need zcat here because there is zgrep and zegrep.
If you want to run a command over a directory hierarchy, you use find:
find . -name "*.gz" -exec zgrep ⟨pattern⟩ \{\} \;
Also, "ls *.gz" is useless in for; you should just use "*.gz" in the future.
Note that some versions of zgrep don't support -R.
I think the solution from "Nietzche-jou" may be a better answer, but I would add the -H option to show the file name, something like this:
find . -name "*.gz" -exec zgrep -H 'PATTERN' \{\} \;
use the find command
find . -name "*.gz" -exec zcat "{}" + |grep "test"
or try using the recursive option (-r) of zcat
Coming in a bit late on this; I had a similar problem and was able to resolve it using:
zcat -r /some/dir/here | grep "blah"
As detailed here:
http://manpages.ubuntu.com/manpages/quantal/man1/gzip.1.html
However, this does not show the original file that the result matched from, instead showing "(standard input)" as it's coming in from a pipe. zcat does not seem to support outputting a name either.
In terms of performance, this is what we got:
$ alias dropcache="sync && echo 3 > /proc/sys/vm/drop_caches"
$ find 09/01 | wc -l
4208
$ du -chs 09/01
24M
$ dropcache; time zcat -r 09/01 > /dev/null
real 0m3.561s
$ dropcache; time find 09/01 -iname '*.txt.gz' -exec zcat '{}' \; > /dev/null
real 0m38.041s
As you can see, using the find|zcat method is significantly slower than using zcat -r when dealing with even a small volume of files. I was also unable to make zcat output the file name (using -v will apparently output the filename, but not on every single line). It would appear that there isn't currently a tool that will provide both speed and name consistency with grep (i.e. the -H option).
If you need to identify the name of the file that the result belongs to, then you'll need to either write your own tool (could be done in 50 lines of Python code) or use the slower method. If you do not need to identify the name, then use zcat -r.
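One possible workaround for the missing filenames, sketched on the assumption that your grep is GNU grep (whose --label option names input read from a pipe):
find /some/dir/here -name '*.gz' -exec sh -c 'zcat "$1" | grep -H --label="$1" "blah"' _ {} \;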
Hope this helps
find . -name "*.gz"|xargs zcat | grep "pattern" should do.
zgrep "string" ./*/*
You can use the above command to search for a string in the .gz files of directory dir, where dir has the following sub-directory structure:
/dir
/childDir1
/file1.gz
/file2.gz
/childDir2
/file3.gz
/file4.gz
/childDir3
/file5.gz
/file6.gz
You can use this command -
zgrep "foo" $(find . -name "*.gz")

How to play .mp3 songs randomly by searching for them recursively in a directory and its sub directories?

Once I am in the directory containing .mp3 files, I can play songs randomly using
mpg123 -Z *.mp3
But to recursively search a directory and its subfolders for .mp3 files and play them randomly, I tried the following command, which does not work.
mpg123 -Z <(find /media -name *.mp3)
(find /media -name *.mp3), when executed, lists all .mp3 files present in /media and its subdirectories.
How can I get this to work?
mpg123 -Z $(find -name "*.mp3")
The $(...) means execute the command and paste the output here.
Also, to bypass the command-line length limit laalto mentioned, try:
mpg123 -Z $(find -name "*.mp3" | sort --random-sort | head -n 100)
EDIT: Sorry, try:
find -name "*.mp3" | sort --random-sort| head -n 100|xargs -d '\n' mpg123
That should cope with the spaces correctly, presuming you don't have filenames with embedded newlines.
It will semi-randomly permute your list of MP3s, then pick the first 100 of the random list, then pass those to mpg123.
In both zsh and bash 4.0,
mpg123 -Z **/*.mp3
(Bash users will probably need to shopt -s globstar first.)
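For example, a minimal bash 4+ session:
shopt -s globstar   # enable recursive ** matching
mpg123 -Z **/*.mp3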
Backticks.
mpg123 -Z `find /media -name \*.mp3`
Though if you have a lot of files, you may encounter command line length limitations.
Would something like this work?
find /media -name "*.mp3" -print0 | xargs -0 mpg123 -Z
The following works fine.
find /media -name "*.mp3" | xargs -d '\n' -n10 mpg123 -Z.
By '-n' option we can provide no. of arguments for a single invocation of command.
Even after I close the terminal where i wrote this command, the songs continue to play as the process mpg123 becomes an orphan and continues to run.
devikasingh@Interest:~$ ps -e | grep mpg123
7239 ? 00:00:01 mpg123
ps -f 7239
UID PID PPID C STIME TTY STAT TIME CMD
1000 7239 1 0 15:21 ? S 0:01 mpg123 -Z /media/MUSIC & PIC/audio_for_you/For You.mp3 /media/MUSIC & PIC/audio_for_you/In My Place.mp3 /
Thanks for the suggestions. By using them I was able to create the following script:
#!/bin/bash
song=$(zenity --width=360 --height=320 --title "Select Folder" --file-selection --directory $HOME)
find "$song" -name "*.mp3" | sort --random-sort | head -n 100 | xargs -d '\n' mpg123
Probably it's better to use xargs, but I use a while loop in bash on Red Hat.
find . -iname "*.mp3" -print | sort -R --random-source=/dev/urandom | while IFS= read -r filename; do play "$filename"; done
The only problem with it is that it is annoying to kill. To kill it, you must hold down Ctrl-C until the while loop is killed.
while...do...done loops through each field in the sort output.
IFS describes the field separators.
IFS= makes each line a single field.
read copies the current field into the filename variable.
The -r option removes backslash processing, which doesn't seem to be necessary on Linux.
play is a simple way of using sox for playback.
I found this one and, IMHO, it's much cleaner than the other solutions. I don't take credit; it goes to the site owner.
find $HOME/mp3s -iname '*.mp3' | mpg123 -Z -@ -
Found on https://dannyman.toldme.com/2004/12/28/howto-mpg123-random-mp3s/
I just changed -name to -iname, as files can sometimes have the extension in caps...
I tried almost all of them, and when mpg123 is run through a pipe it returns the error "Can't get terminal attributes" and I cannot use the terminal control keys.
The only way I found to play a list of files found with the command find and be able to use terminal control keys is this (I have directories and files with spaces):
find /media -type f -iname "*.mp3" > /tmp/mp3list
mpg123 -CZvv -@ /tmp/mp3list
It looks like mpg123 uses the space as a separator if you use $(find /media -type f -iname "*.mp3"), and in my case that doesn't work because I have spaces in all the directory names and in almost all the file names.
This is a script (playmp3.sh) to only execute find when the file doesn't exist:
#!/bin/sh
if ! [ -f /tmp/mp3list ]; then
    find /media -type f -iname "*.mp3" > /tmp/mp3list
fi
mpg123 -CZvv -@ /tmp/mp3list
I have my library on a separate partition, and in my root dir I have this small script that plays songs randomly (it can also go back to the previous song); I have about 40 GB of music, so they almost never repeat.
#!/bin/sh
cd "/media/$USER/7789f483-c7bf-46bc-9293-e8e05dd62199/musik/"
mpg123 -Z */*/*.mp3;
