Multithreaded Bash in while loop - linux

I have the following Bash one-liner, which should iterate through all the *.xml files in the folder, check whether they contain the string below, and if not, rename them by appending .empty:
find -name '*.xml' | xargs -I{} grep -LZ "state=\"open\"" {} | while IFS= read -rd '' x; do mv "$x" "$x".empty ; done
This process is very slow; when running this script in folders with over 100k files, it takes well over 15 minutes to complete.
I couldn't find a way to make this process run in parallel (multithreaded).
Note that with a for loop I'm hitting the "too many arguments" error, due to the large number of files.
Can anyone think of a solution?
Thanks!
Roy

The biggest bottleneck in your code is that you are running a separate mv process (which is just a wrapper around a system call) to rename each file. Let's say you have 100,000 files, and 20,000 of them need to be renamed. Your original code will need 120,000 processes, one grep per file and one mv per rename. (Ignoring the 2 calls to find and xargs.)
A better approach would be to use a language that can access the system call directly. Here is a simple Perl example:
find -name '*.xml' | xargs -I{} grep -LZ "state=\"open\"" {} |
perl -n0e 'chomp; rename("$_", "$_.empty")'
This replaces 20,000 calls to mv with a single call to perl.
The other bottleneck is running a single grep process for each file. Instead, you'd like to pass as many files as possible to grep each time. There is no need for xargs here; use the -exec primary to find instead.
find -name '*.xml' -exec grep -LZ "state=\"open\"" {} + |
perl -n0e 'chomp; rename("$_", "$_.empty")'
The too many arguments error you were receiving is based on total argument length. Suppose the limit is 4096, and your XML files have an average name length of 20 characters. This means you should be able to pass 200+ files to each call to grep. The -exec ... + primary takes care of passing as many files as possible to each call to grep, so this code at most will require 100,000 / 200 = 500 calls to grep, a vast improvement.
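The 4096 figure above is only for illustration; if you want to check the real limit on your own system, these commands report it (the second one assumes GNU xargs):
getconf ARG_MAX                    # kernel limit on combined argument + environment size
xargs --show-limits </dev/null     # how GNU xargs interprets that limit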
Depending on the size of the files, it might be faster to read each file in the Perl process to check for the string to match. However, grep is very well optimized, and the code to do so, while not terribly complicated, is still more than you can comfortably write in a one-liner. This should be a good balance between speed and simplicity.
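If you still want the parallelism the original question asks for, one further option (a sketch, not part of the answer above, assuming GNU find and GNU xargs) is to let xargs run several grep batches at once with -P; the -n 200 and -P 4 values are arbitrary, and in principle the NUL-terminated records written by concurrent greps can interleave:
# Batch the .xml files into groups of 200 and run up to 4 greps at a time;
# grep -LZ prints (NUL-terminated) the names of files that do NOT match.
find . -name '*.xml' -print0 |
    xargs -0 -n 200 -P 4 grep -LZ "state=\"open\"" |
    perl -n0e 'chomp; rename("$_", "$_.empty")'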

Related

Find command - How to cut short after finding first file

I am using the find command to check if a certain pattern of file exists within a directory tree (note: anywhere down the tree). Once the first file is found the checking can stop, because the answer is "yes". How can I stop find from continuing the unnecessary search for other files? Limiting -maxdepth does not work, for the obvious reason that I am checking anywhere down the tree.
I tried -exec exit ; and -exec quit ;, hoping there was a Linux command to call via -exec that would stop processing. Should I write a script (to call via -exec above) that kills the find process but lets the rest of my script keep running?
Additional detail: I am calling find from a Perl script. I don't necessarily have to use find if there are other tools. I may have to resort to walking the dir-path via a longer Perl script that lets me control when to stop. I also looked into the -prune option, but it seems to apply only up front (globally); I can't change it in the middle of processing.
This is one instance of my find command that worked and returned all occurrences of the file pattern:
find /media/pk/mm2020A1/00.staging /media/pk/mm2020A1/x.folders -name hevc -o -name 'HEVC' -o -name '265'
It sounds like you want something along the lines of
find . -name '*.csv' | wc -l
and then to ask whether that count is -gt 0, with the detail that we'd like to exit early if possible, to conserve compute resources.
Well, here's a start:
find . -name '*.csv' | head -1
It doesn't exactly bail after finding the first match, since there's a race condition, but it keeps you from spending two minutes recursing down a deep directory tree. In particular, after receiving the first result head will close() stdin, so find won't be able to write to stdout, and it will soon exit.
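If your find is GNU find, the -quit primary (an answer further down uses it too) stops the traversal at the first match, so you can turn the check into a plain test; a sketch:
# -print -quit makes find print the first match and stop immediately.
if [ -n "$(find . -name '*.csv' -print -quit)" ]; then
    echo "at least one .csv exists somewhere below ."
fi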
I don't know your business use case, but you may find it convenient and performant to record find . -ls | sort > files.txt every now and again, and have your script consult that file. It typically takes less time to access those stored results than to re-run find, that is, to once again recurse through the directory trees. Why? It's a random I/O versus sequential access story.
You can exit earlier if you adopt this technique:
use Path::Class;
dir('.')->recurse( ...

How to efficiently find if a Linux directory, including subdirectories, has at least 1 file

In my project, various jobs are created as files in directories inside subdirectories, but usually I find that the jobs are mostly in some dirs and not in most of the others.
Currently I use
find $DIR -type f | head -n 1
to know if the directory has at least 1 file, but this seems wasteful.
How can I efficiently find whether a directory, including its subdirectories, has at least 1 file?
Your code is already efficient, but perhaps the reason is not obvious. When you pipe the output of find to head -n 1 you probably assume that find lists all the files and then head discards everything after the first one. But that's not quite what head does.
When find lists the first file, head will print it and exit, since it only needs one line; this closes the pipe between them. When find then writes the second file name, it receives SIGPIPE, and since the default signal handler for SIGPIPE terminates the program which receives it, find stops running.
So the cost of your pipelined commands is only the cost of finding two files, not the cost of finding all files. For most obvious use cases this should be good enough.
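If you would rather have find itself stop after the first hit instead of relying on SIGPIPE, GNU find's -print -quit (used in the answer below) works well; here is a sketch that checks each top-level job directory in turn, assuming the jobs live in immediate subdirectories of $DIR:
# For each immediate subdirectory (assumed to be a job directory),
# report whether anything exists below it; -quit stops find at the first hit.
for d in "$DIR"/*/; do
    if [ -n "$(find "$d" -type f -print -quit)" ]; then
        printf '%s contains at least one file\n' "$d"
    fi
done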
Try this
find -type f -printf '%h\n' | uniq
The find part finds all files but prints only each file's directory (the %h format). The uniq part collapses consecutive duplicate directories; pipe through sort -u instead if you need every directory listed exactly once.
Pitfall: like your example, it doesn't work for paths containing a newline.
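A NUL-delimited variant sidesteps that pitfall, assuming GNU find and GNU sort; sort -zu also removes the non-adjacent repeats that a bare uniq can miss, and the output can be fed straight into other NUL-aware tools such as xargs -0:
find . -type f -printf '%h\0' | sort -zu    # NUL-delimited list of directories that directly contain at least one file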
This command finds the first subdirectory containing at least one file and then stops:
find . -mindepth 1 -type d -exec bash -c 'c=$(find {} -maxdepth 1 -type f -print -quit);test "x$c" != x' \; -print -quit
The outer find iterates through the subdirectories, and the inner find looks for the first file and then stops.

Find and sort files by date modified

I know that there are many answers to this question online. However, I would like to know if this alternate solution would work:
ls -lt `find . -name "*.jpg" -print | head -10`
I'm aware of course that this will only give me the first 10 results. The reason I'm asking is because I'm not sure whether the ls is executing separately for each result of find or not. Thanks
In your solution:
the ls will be executed after the find is evaluated
it is likely that find will yield too many results for ls to process, in which case you might want to look at the xargs command
This should work better:
find . -type f -print0 | xargs -0 stat -f"%m %Sm %N" | sort -rn
The three parts of the command do this:
find all files and print their path
use xargs to process the (long) list of files and print out the modification unixtime, human readable time, and filename for each file
sort the resulting list in reverse numerical order
The main trick is to add the numerical unixtime when the files were last modified to the beginning of the lines, and then sort them.
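Note that stat -f is the BSD/macOS spelling of the format option; GNU stat on Linux uses -c instead. Alternatively, GNU find can print the same fields itself, so a roughly equivalent sketch for GNU tools (same idea, not part of the original answer) is:
find . -type f -printf '%T@ %Tc %p\n' | sort -rn
Here %T@ is the modification time in seconds since the epoch, %Tc a human-readable form, and %p the path; as with the original, file names containing newlines will break the one-line-per-file assumption.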

Iterating over filenames from a pipeline in bash

Consider me frustrated... I've spent the past 2 hours trying to figure out how to have a command that has pipes in it pump that output to a for loop. Quick story on what I'm attempting followed by my code.
I have been using xbmc for years. However, shortly after I started, I had exported my library, which turns out to be more of a hassle than it's worth (especially with me now going through with a set naming scheme of folders and files contained in them). I am wanting to remove all of the files that xbmc added, so I figured I'd write a script that would remove all the necessary files. However, that's where I ran into a problem.
I am trying to use the locate command (because of its speed), followed by a grep (to get rid of all the filesystem .tbn) and an egrep (to remove the .actors folder xbmc creates from the results), followed by a sort (although, the sort isn't necessary, I added it during debugging so the output while testing was nicer). The problem is only the first file is processed and then nothing. I read a lot online and figured out that bash creates a new subshell for every pipe and by the time it has finished the loop once, the variable is now dead. So I did more digging on how to get around that, and everything seemed to show how I can work around it for while loops, but nothing for for loops.
While I like to think I'm competent at scripting, I always have things like this come up that proves that I'm still just learning the basics. Any help from people smarter than me would be greatly appreciated.
#!/bin/bash
for i in "$(locate tbn | grep Movies | egrep -v .actors | sort -t/ +4)"
do
DIR=$(echo $i | awk -F'/' '{print "/" $2 "/" $3 "/" $4 "/" $5 "/"}')
rm -r "$DIR*.tbn" "$DIR*.nfo" "$DIR*.jpg" "$DIR*.txt" "$DIR.actors"
done
After reading through the response below, I'm thinking the better route to accomplish what I want is as follows. I'd love any advice on the new script. Rather than just copying and pasting @Charles Duffy's script, I want to find the right/best way to do this as a learning experience, since there is always a better and best way to code something.
#!/bin/bash
for i in "*.tbn" "*.nfo" "*.jpg" "*.txt" "*.rar" #(any other desired extensions)
do
find /share/movies -name "$i" -not -path "/share/movies/.actors" -delete
done
I have the -not -path portion in there first to remove the .actors folder that xbmc puts at the root of the source directory (in this case, /share/movies) from the output so no thumbnails (.tbn files) get removed from there, but I want them removed from any other directories contained within /share/movies (and I would like to remove the thumbnails from within the .actors folder if it is contained inside a specific movie folder). The -delete option is because it was suggested in a gnu.org page that -delete is better than calling /bin/rm due to not needing to fork for the rm process, which keeps things more efficient and prevents overhead.
I'm pretty sure I want the items in the for line to be quoted so it is a literal *.tbn that is used within the find command. To give you an idea of the directory structure, it's pretty simple. I want to remove any of the *.tbn *.jpg and *.nfo files within those directories.
/share/movies/movie 1/movie 1.mkv
/share/movies/movie 1/movie 1.tbn
/share/movies/movie 1/movie 1.jpg
/share/movies/movie 1/movie 1.nfo
/share/movies/movie 2/movie 2.mp4
/share/movies/movie 2/movie 2.srt
/share/movies/movie 2/movie 2 (subs).rar
/share/movies/movie 3/movie 3.avi
/share/movies/movie 3/movie 3.tbn
/share/movies/movie 3/movie 3.jpg
/share/movies/movie 3/movie 3.nfo
/share/movies/movie 3/.actors/actor 1.tbn
/share/movies/movie 3/.actors/actor 2.tbn
/share/movies/movie 3/.actors/actor 3.tbn
This is just a quoting problem. "$(locate tbn | ...)" is a single word because the quotes prevent word splitting. If you leave out the quotes, it becomes multiple words, but then spaces in the filepaths will become problems.
Personally, I'd use find with an -exec clause; it might be slower than locate (locate uses a periodically updated database, so it trades off accuracy for speed), but it will avoid this sort of quoting problem.
Reading filenames from locate in a script is bad news in general unless your locate command has an option to NUL-delimit names (since every character other than NUL or / is valid in a filename, newlines are actually valid within filenames, making locate's output ambiguous). That said:
#!/bin/bash
# ^^ -- not /bin/sh, since we're using bash-only features here!
while read -u 3 -r i; do
  dir=${i%/*}
  rm -r "$dir/"*".tbn" "$dir/"*".nfo" "$dir/"*".jpg" "$dir/"*".txt" "$dir/.actors"
done 3< <(locate tbn | grep Movies | egrep -v .actors)
Notice how the *s cannot be inside of the double-quotes if you want them to be expanded, even though the directory names must be inside of double quotes to work if they have whitespace &c. in their names.
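If your locate can NUL-delimit its output (GNU locate and mlocate have -0, and GNU grep has -z for NUL-separated data), the same loop can be made safe against newlines in paths; a sketch:
# locate -0 / grep -z keep the list NUL-delimited end to end,
# so paths containing newlines survive intact.
while IFS= read -u 3 -rd '' i; do
    dir=${i%/*}
    rm -r "$dir/"*.tbn "$dir/"*.nfo "$dir/"*.jpg "$dir/"*.txt "$dir/.actors"
done 3< <(locate -0 tbn | grep -z Movies | grep -zv '\.actors')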
In general, I agree with @rici -- using find is by far the more robust approach, especially used with the GNU extension -execdir to prevent race conditions from being used to cause your command to behave in undesirable ways. (Think of a malicious user replacing a directory with a symlink to somewhere else while your script is running).
Your second script, edited into the question, is an improvement. However, there's still room to do better:
#!/bin/bash
exts=( tbn nfo jpg txt rar )
find_args=( )
for ext in "${exts[@]}"; do
  find_args+=( -name "*.$ext" -o )
done
# Drop the trailing -o before handing the expression to find.
find /share/movies -name .actors -prune -o \
  '(' "${find_args[@]:0:${#find_args[@]} - 1}" ')' -delete
This will build a command like:
find /share/movies -name .actors -prune -o \
'(' -name '*.tbn' -o -name '*.nfo' -o -name '*.jpg' \
-o -name '*.txt' -o -name '*.rar' ')' -delete
...and thus process all the extensions in a single pass.

Find top 500 oldest files

How can I find top 500 oldest files?
What I've tried:
find /storage -name "*.mp4" -o -name "*.flv" -type f | sort | head -n500
Find 500 oldest files using GNU find and GNU sort:
#!/bin/bash
typeset -a files
export LC_{TIME,NUMERIC}=C
n=0
while ((n++ < 500)) && IFS=' ' read -rd '' _ x; do
  files+=("$x")
done < <(find /storage -type f \( -name '*.mp4' -o -name '*.flv' \) -printf '%T@ %p\0' | sort -zn)
printf '%q\n' "${files[@]}"
Update - some explanation:
As mentioned by Jonathan in the comments, the proper way to handle this involves a lot of non-standard features which allows producing and consuming null-delimited lists so that arbitrary filenames can be handled safely.
GNU find's -printf produces the mtime (using the %T@ format, seconds since the epoch; whether or not the fractional part works may depend on your C library) followed by a space, followed by the filename with a terminating \0. Two additional non-standard features process the output: GNU sort's -z option, and the read builtin's -d option, which with an empty option argument delimits input on nulls. The overall effect is to have sort order the elements by the mtime produced by find's -printf string, then read the first 500 results into an array, using IFS to split read's input on the space and discard the first field into the _ variable, leaving only the filename.
Finally, we print out the array using the %q format just to display the results unambiguously with a guarantee of one file per line.
The process substitution (<(...) syntax) isn't completely necessary but avoids the subshell induced by the pipe in versions that lack the lastpipe option. That can be an advantage should you decide to make the script more complicated than merely printing out the results.
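For reference, here is a sketch of the lastpipe variant mentioned above (it needs bash 4.2 or newer, and the option only takes effect when job control is off, which is the default in scripts):
#!/bin/bash
# lastpipe runs the last element of a pipeline (the while loop below) in the
# current shell, so the files array survives after the pipeline ends.
shopt -s lastpipe
typeset -a files
export LC_{TIME,NUMERIC}=C
n=0
find /storage -type f \( -name '*.mp4' -o -name '*.flv' \) -printf '%T@ %p\0' | sort -zn |
    while ((n++ < 500)) && IFS=' ' read -rd '' _ x; do
        files+=("$x")
    done
printf '%q\n' "${files[@]}"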
None of these features are unique to GNU. All of this can be done using e.g. AST find(1), openbsd sort(1), and either Bash, mksh, zsh, or ksh93 (v or greater). Unfortunately the find format strings are incompatible.
The following finds the oldest 500 files with the oldest file at the top of the list:
find . -regex '.*\.\(mp4\|flv\)' -type f -print0 | xargs -0 ls -drt --quoting-style=shell-always 2>/dev/null | head -n500
The above is a pipeline. The first step is to find the file names, which is done by find; any of find's options can be used to select the files of interest to you. The second step does the sorting: xargs passes the file names to ls, which sorts on time in reverse order so that the oldest files are at the top. The last step is head -n500, which takes just the first 500 file names. The first of those names will be the oldest file.
If there are more than 500 files, then head terminates before ls. When this happens, xargs reports that ls was terminated by signal 13 (SIGPIPE). I redirected stderr from the xargs command to eliminate this harmless message.
The above solution assumes that all the file names fit on one command line; if xargs has to split them across more than one ls invocation, the overall ordering is no longer guaranteed.
