Iterating over filenames from a pipeline in bash - linux

Consider me frustrated... I've spent the past 2 hours trying to figure out how to have a command that has pipes in it pump its output into a for loop. Quick story on what I'm attempting, followed by my code.
I have been using xbmc for years. However, shortly after I started, I had exported my library, which turns out to be more of a hassle than it's worth (especially with me now going through with a set naming scheme of folders and files contained in them). I am wanting to remove all of the files that xbmc added, so I figured I'd write a script that would remove all the necessary files. However, that's where I ran into a problem.
I am trying to use the locate command (because of its speed), followed by a grep (to drop all the .tbn hits from elsewhere on the filesystem) and an egrep -v (to remove the .actors folders xbmc creates from the results), followed by a sort (the sort isn't necessary; I added it during debugging so the output while testing was nicer). The problem is that only the first file is processed and then nothing. I read a lot online and figured out that bash creates a new subshell for every pipe, and by the time it has finished the loop once, the variable is dead. So I did more digging on how to get around that, and everything seemed to show how to work around it for while loops, but nothing for for loops.
While I like to think I'm competent at scripting, things like this always come up and prove that I'm still just learning the basics. Any help from people smarter than me would be greatly appreciated.
#!/bin/bash
for i in "$(locate tbn | grep Movies | egrep -v .actors | sort -t/ +4)"
do
DIR=$(echo $i | awk -F'/' '{print "/" $2 "/" $3 "/" $4 "/" $5 "/"}')
rm -r "$DIR*.tbn" "$DIR*.nfo" "$DIR*.jpg" "$DIR*.txt" "$DIR.actors"
done
After reading through the responses below, I'm thinking the better route to accomplish what I want is as follows. I'd love any advice on the new script. Rather than just copying and pasting @Charles Duffy's script, I want to find the right way to do this as a learning experience, since there is always a better way to code something.
#!/bin/bash
for i in "*.tbn" "*.nfo" "*.jpg" "*.txt" "*.rar" #(any other desired extensions)
do
find /share/movies -name "$i" -not -path "/share/movies/.actors" -delete
done
I have the -not -path portion in there to remove the .actors folder that xbmc puts at the root of the source directory (in this case, /share/movies) from the results, so no thumbnails (.tbn files) get removed from there; but I do want them removed from any other directories contained within /share/movies (including the .actors folder when it sits inside a specific movie folder). The -delete option is there because a gnu.org page suggests that -delete is better than calling /bin/rm, since find doesn't have to fork a separate rm process for every file.
I'm pretty sure I want the items in the for line quoted so that a literal *.tbn is what gets passed to the find command. To give you an idea of the directory structure, it's pretty simple; I want to remove any of the *.tbn, *.jpg, and *.nfo files within these directories:
/share/movies/movie 1/movie 1.mkv
/share/movies/movie 1/movie 1.tbn
/share/movies/movie 1/movie 1.jpg
/share/movies/movie 1/movie 1.nfo
/share/movies/movie 2/movie 2.mp4
/share/movies/movie 2/movie 2.srt
/share/movies/movie 2/movie 2 (subs).rar
/share/movies/movie 3/movie 3.avi
/share/movies/movie 3/movie 3.tbn
/share/movies/movie 3/movie 3.jpg
/share/movies/movie 3/movie 3.nfo
/share/movies/movie 3/.actors/actor 1.tbn
/share/movies/movie 3/.actors/actor 2.tbn
/share/movies/movie 3/.actors/actor 3.tbn
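To make the goal concrete against the layout above, here's the rough single-find sketch I have in mind (untested; note the exclusion pattern needs a trailing /* to skip the files inside the top-level .actors folder rather than just the folder entry itself, and -delete is a GNU extension, so I start with -print):
#!/bin/bash
# Sketch only: skip everything under the top-level .actors, but still reach
# the per-movie .actors folders; swap -print for -delete (GNU find) once the
# listing looks right.
find /share/movies ! -path '/share/movies/.actors/*' \
    '(' -name '*.tbn' -o -name '*.nfo' -o -name '*.jpg' \
        -o -name '*.txt' -o -name '*.rar' ')' -print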

This is just a quoting problem. "$(locate tbn | ...)" is a single word because the quotes prevent word splitting. If you leave out the quotes, it becomes multiple words, but then spaces in the filepaths will become problems.
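A quick illustration of the difference, with made-up paths:
# Two newline-separated paths containing spaces
files=$'/share/movies/movie 1/movie 1.tbn\n/share/movies/movie 2/movie 2.tbn'
for i in "$files"; do echo "got: $i"; done   # one iteration: both paths as a single word
for i in $files; do echo "got: $i"; done     # six iterations: split on spaces and newlines alike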
Personally, I'd use find with an -exec clause; it might be slower than locate (locate uses a periodically updated database, so it trades off accuracy for speed), but it will avoid this sort of quoting problem.

Reading filenames from locate in a script is bad news in general unless your locate command has an option to NUL-delimit names (since every character other than NUL or / is valid in a filename, newlines are actually valid within filenames, making locate's output ambiguous). That said:
#!/bin/bash
# ^^ -- not /bin/sh, since we're using bash-only features here!
while read -u 3 -r i; do
    dir=${i%/*}
    rm -r "$dir/"*".tbn" "$dir/"*".nfo" "$dir/"*".jpg" "$dir/"*".txt" "$dir/.actors"
done 3< <(locate tbn | grep Movies | egrep -v .actors)
Notice how the *s cannot be inside of the double-quotes if you want them to be expanded, even though the directory names must be inside of double quotes to work if they have whitespace &c. in their names.
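For example (illustrative path; assumes at least one .tbn file exists there):
dir="/share/movies/movie 1"
ls "$dir/"*.tbn    # the glob expands outside the quotes; the space in $dir stays protected
ls "$dir/*.tbn"    # no expansion: the * inside quotes is a literal asterisk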
In general, I agree with @rici -- using find is by far the more robust approach, especially used with the GNU extension -execdir to prevent race conditions from being used to cause your command to behave in undesirable ways. (Think of a malicious user replacing a directory with a symlink to somewhere else while your script is running.)
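A rough sketch of that shape, in case it helps (GNU find assumed; the path is from the question):
# -execdir runs rm from within each matched file's own directory, so a
# parent directory swapped for a symlink mid-run can't redirect the delete.
find /share/movies -name .actors -prune -o -name '*.tbn' -execdir rm -- {} +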

Your second script, edited into the question, is an improvement. However, there's still room to do better:
#!/bin/bash
exts=( tbn nfo jpg txt rar )
find_args=( )
for ext in "${exts[@]}"; do
    find_args+=( -name "*.$ext" -o )
done
find /share/movies -name .actors -prune -o \
    '(' "${find_args[@]:0:${#find_args[@]} - 1}" ')' -delete
This will build a command like:
find /share/movies -name .actors -prune -o \
'(' -name '*.tbn' -o -name '*.nfo' -o -name '*.jpg' \
-o -name '*.txt' -o -name '*.rar' ')' -delete
...and thus process all the extensions in a single pass.
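One caution worth adding (my note, not part of the answer above): GNU find's -delete implies -depth, and -prune has no effect once -depth is in force, so the .actors pruning may not actually hold during the delete pass. A safer pattern is to dry-run with -print first and, if the prune must hold, delete via -exec rm instead of -delete:
# Dry run: -print does not imply -depth, so -prune behaves as intended here
find /share/movies -name .actors -prune -o \
    '(' "${find_args[@]:0:${#find_args[@]} - 1}" ')' -print

# Deleting while keeping the prune effective (assumes the find_args array built above)
find /share/movies -name .actors -prune -o \
    '(' "${find_args[@]:0:${#find_args[@]} - 1}" ')' -exec rm -- {} +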

Related

Customized deleting files from a folder

I have a folder where different files can be located. I would like to check whether it contains any files other than .gitkeep and delete them, keeping .gitkeep. How can I do this? (I'm a newbie when it comes to bash.)
As always, there are multiple ways to do this; I am just sharing what little I know of Linux:
1) find <path-to-the-folder> -maxdepth 1 -type f ! -iname '\.gitkeep' -delete
A -maxdepth of 1 tells find to search only the directory you give it. If you remove -maxdepth, it will recursively find all files other than '.gitkeep' in every directory under your path; you can increase -maxdepth to control how deep find descends from your path.
-type f specifies that we are only looking for regular files. If you want to match directories as well (or links and other types), omit this option.
-iname '\.gitkeep' specifies a case-insensitive match for '.gitkeep'. The '\' escapes the '.'; it is harmless but not strictly necessary, because -iname takes a shell glob pattern (not a regular expression), and '.' has no special meaning there.
Use -name instead of -iname for a case-sensitive match.
The '!' before -iname does an inverse match, i.e. it finds all files whose name is not '.gitkeep'; if you remove the '!', you will get only the files that do match '.gitkeep'.
Finally, -delete deletes the files that match this specification.
If you want to see which files would be deleted before running -delete, remove that flag and find will list them:
find <path-to-the-folder> -maxdepth 1 -type f ! -iname '\.gitkeep'
(you can also add -print at the end, but it is redundant, since printing is find's default action)
2) for i in `ls -a | grep -v '\.gitkeep'` ; do rm -rf $i ; done
Not really recommended to do it this way, since rm -rf is always a bad idea (IMO). You can change that to rm -f (to ensure it only works on files and not directories).
To be on the safe side, it is recommended to do an echo of the file list first to see if you are ready to delete all the files shown :
for i in `ls -a | grep -v '\.gitkeep'` ; do echo $i ; done
This loop iterates through all the files that don't match '.gitkeep' and deletes them one by one... not the best way to delete files, I suppose.
3) rm -rf $(ls -a | grep -v '\.gitkeep')
Again, be careful with rm -rf; instead of rm -rf above, you can again do an echo first to see which files would get deleted.
I am sure there are more ways, but just a glimpse of the array of possibilities :)
Good Luck,
Ash
================================================================
EDIT :
=> Man pages are your friend when you are trying to learn something new; if you don't understand how a command works or what options it takes, always look it up in the man pages for details.
ex: man find
=> I understand that you are trying to learn something outside your comfort zone, which is always commendable, but Stack Overflow doesn't like people asking questions without doing research first.
If you did research, you are expected to mention it in your question, letting people know what you have done to find answers on your own.
A simple Google search or a deep dive into Stack Overflow questions would have provided you with a similar or even better answer. So be careful :)
Forewarned is forearmed :)
You can use find:
find /path/to/folder -maxdepth 1 ! -name .gitkeep -delete

Multithreaded Bash in while loop

I have the following Bash one-liner which should iterate through all the *.xml files in the folder, check whether they contain the string below, and if not, rename them by appending .empty to the filename:
find -name '*.xml' | xargs -I{} grep -LZ "state=\"open\"" {} | while IFS= read -rd '' x; do mv "$x" "$x".empty ; done
This process is very slow, and when running this script in folders with over 100k files, it takes well over 15 minutes to complete.
I couldn't find a way to make this process run in parallel.
Note that with a for loop I'm hitting the "too many arguments" error, due to the large number of files.
Can anyone think of a solution ?
Thanks !
Roy
The biggest bottleneck in your code is that you are running a separate mv process (which is just a wrapper around a system call) to rename each file. Let's say you have 100,000 files, and 20,000 of them need to be renamed. Your original code will need 120,000 processes, one grep per file and one mv per rename. (Ignoring the 2 calls to find and xargs.)
A better approach would be to use a language that can access the system call directly. Here is a simple Perl example:
find -name '*.xml' | xargs -I{} grep -LZ "state=\"open\"" {} |
perl -n0e 'chomp; rename("$_", "$_.empty")'
This replaces 20,000 calls to mv with a single call to perl.
The other bottleneck is running a separate grep process for each file. Instead, you'd like to pass as many files as possible to each grep invocation. There is no need for xargs here; use find's -exec primary instead.
find -name '*.xml' -exec grep -LZ "state=\"open\"" {} + |
perl -n0e 'chomp; rename("$_", "$_.empty")'
The "too many arguments" error you were receiving is based on the total argument length. Suppose the limit is 4096 bytes, and your XML files have an average name length of 20 characters. This means you should be able to pass 200+ files to each call to grep. The -exec ... + primary takes care of passing as many files as possible to each call to grep, so this code will require at most 100,000 / 200 = 500 calls to grep, a vast improvement.
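If you're curious what the actual limit is on a given machine, you can ask (the second command assumes GNU xargs):
getconf ARG_MAX                  # kernel limit on argv + environment size, in bytes
xargs --show-limits </dev/null   # GNU xargs' report of how much it will pass per command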
Depending on the size of the files, it might be faster to read each file in the Perl process to check for the string to match. However, grep is very well optimized, and the code to do so, while not terribly complicated, is still more than you can comfortably write in a one-liner. This should be a good balance between speed and simplicity.
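On the multithreading part of the question: this is my own hedged aside rather than part of the answer above, but GNU xargs can run several grep batches in parallel with -P, which may help when the files sit on fast storage:
# Sketch: up to 4 concurrent grep batches of at most 100 files each (GNU find/xargs assumed)
find . -name '*.xml' -print0 |
    xargs -0 -P4 -n100 grep -LZ 'state="open"' |
    perl -n0e 'chomp; rename($_, "$_.empty")'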

Linux: how to replace all instances of a string with another in all files of a single type

I want to replace for example all instances of "123" with "321" contained within all .txt files in a folder (recursively).
I thought of doing this
sed -i 's/123/321/g' | find . -name \*.txt
but before possibly screwing up all my files, I would like to ask if it will work.
You have the sed and the find back to front. With GNU sed and the -i option, you could use:
find . -name '*.txt' -type f -exec sed -i s/123/321/g {} +
The find finds files with extension .txt and runs the sed -i command on groups of them (that's the + at the end; it's standard in POSIX 2008, but not all versions of find necessarily support it). In this example substitution, there's no danger of misinterpretation of the s/123/321/g command so I've not enclosed it in quotes. However, for simplicity and general safety, it is probably better to enclose the sed script in single quotes whenever possible.
You could also use xargs (and again using GNU extensions -print0 to find and -0 and -r to xargs):
find . -name '*.txt' -type f -print0 | xargs -0 -r sed -i 's/123/321/g'
The -r means 'do not run if there are no arguments' (so the find doesn't find anything). The -print0 and -0 work in tandem, generating file names ending with the C null byte '\0' instead of a newline, and avoiding misinterpretation of file names containing newlines, blanks and so on.
Note that before running the script on the real data, you can and should test it. Make a dummy directory (I usually call it junk), copy some sample files into the junk directory, change directory into the junk directory, and test your script on those files. Since they're copies, there's no harm done if something goes wrong. And you can simply remove everything in the directory afterwards: rm -fr junk should never cause you anguish.
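For instance, a throwaway test run might look like this (directory and file names are only illustrative):
mkdir junk
cp path/to/samples/*.txt junk/      # work on copies, not the originals
cd junk
find . -name '*.txt' -type f -exec sed -i 's/123/321/g' {} +
grep -rl 321 .                      # spot-check that the substitution took effect
cd .. && rm -fr junk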

Find in Linux combined with a search to return a particular line

I'm trying to return a particular line from files found from this search:
find . -name "database.php"
Each of these files contains a database name, next to a PHP variable like $dbname=
I've been trying to use -exec to run a grep search on these files, with no success:
-exec "grep {\}\ dbname"
Can anyone provide me with some understanding of how to accomplish this task?
I'm running CentOS 5, and there are about 100 database.php files stored in subdirectories on my server.
Thanks
Jason
You have the arguments to grep inverted, and you need them as separate arguments:
find . -name "database.php" -exec grep '$dbname' /dev/null {} +
The presence of /dev/null ensures that the file name(s) that match are listed as well as the lines that match.
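As an aside (my addition, and it assumes GNU grep or another grep that supports -H), the -H option forces the filename prefix without the /dev/null trick:
find . -name "database.php" -exec grep -H '$dbname' {} +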
I think this will do it. Not sure if you need to make any adjustments for CentOS.
find . -name "database.php" -exec grep dbname {} \;
I worked it out using xargs
find . -name "database.php" -print | xargs grep \'database\'\=\> > list_of_databases
Feel free to post a better way if you find one (or want some rep for a good answer)
I tend to habitually avoid find because I've never learned how to use it properly, so the way I'd accomplish your task would be:
grep dbname **/database.php
Edit: This command won't be viable in all cases because it can potentially generate a very long argument list, whereas find executes its command on found files one by one like xargs. And, as I noted in my comment, it's possibly not very portable. But it's damn short ;)
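One assumption worth spelling out (my note): in bash, ** only recurses when the globstar option is enabled (bash 4+); otherwise it behaves like a single *:
shopt -s globstar            # bash 4+; zsh has ** recursion on by default
grep dbname **/database.php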

How do I write a bash script to replace words in files and then rename files?

I have a folder structure with a directory named generic in it.
I need to create a bash script that does 4 things:
1) It searches all the files in the generic directory, finds the string "generic", and changes it to "something"
2) As above, but changes "GENERIC" to "SOMETHING"
3) As above, but changes "Generic" to "Something"
4) Renames any filename that has "generic" in it to use "something" instead
Right now I am doing this process manually by using the search and replace in NetBeans. I don't know much about bash scripting, but I'm sure this can be done. I'm thinking of something that I would run and it would take "Something" as the input.
Where would I start? What functions should I use? Overall guidance would be great. Thanks.
I am using Ubuntu 10.5 desktop edition.
Editing
The substitution part is a sed script - call it mapname:
#!/bin/sh
sed -i.bak \
    -e 's/generic/something/g' \
    -e 's/GENERIC/SOMETHING/g' \
    -e 's/Generic/Something/g' "$@"
Note that this will change words in comments and strings too, and it will change 'generic' as part of a word rather than just the whole word. If you want just the word, then you use end-word markers around the terms: 's/\<generic\>/something/g'. The -i.bak creates backups.
You apply that with:
find . -type f -exec mapname {} +
That creates a command with a list of files and executes it. Clearly, you can, if you prefer, avoid the intermediate mapname shell/sed script (by writing the sed script in place of the word mapname in the find command). Personally, I prefer to debug things separately.
Renaming
The renaming of the files is best done with the rename command - of which there are two variants, so you'll need to read your manual. Use one of these two:
find . -name '*generic*' -depth -exec rename generic something {} +
find . -name '*generic*' -depth -exec rename s/generic/something/g {} +
(Thanks to Stephen P for pointing out that I was using a more powerful Perl-based variant of rename with full Perl regexp capacity, and to Zack and Jefromi for pointing out that the Perl one is found in the real world* too.)
Notes:
This renames directories.
It is probably worth keeping the -depth in there so that the contents of the directories are renamed before the directories; you could otherwise get messages because you rename the directory and then can't locate the files in it (because find gave you the old name to work with).
The more basic rename will move ./generic/do_generic.java to ./something/do_generic.java only. You'd need to run the command more than once to get every component of every file name changed.
* The version of rename that I use is adapted from code in the 1st Edition of the Camel book.
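If your rename turns out to be the Perl-based variant, it usually accepts -n for a dry run (my suggestion; check your man page, since the util-linux rename has a different interface):
# Show what would be renamed without touching anything (Perl rename assumed)
find . -depth -name '*generic*' -exec rename -n 's/generic/something/g' {} +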
Steps 1-3 can be done like this:
find .../path/to/generic -type f -print0 |
    xargs -0 perl -pi~ -e \
        's/\bgeneric\b/something/g;
         s/\bGENERIC\b/SOMETHING/g;
         s/\bGeneric\b/Something/g;'
I don't understand exactly what you want to happen in step 4 so I can't help with that part.
