Remove all files that do not have the following extensions in Linux

I have a list of extensions:
avi,mkv,wmv,mp4,mp5,flv,M4V,mpeg,mov,m1v,m2v,3gp,avchd
I want to remove all files in a directory that do not have one of these extensions, as well as files with no extension at all.
How can I do this using the rm linux command?

You will first have to find the files that do not have those extensions. You can do this very easily with the find command. You can build on the following command:
find /path/to/files ! -name "*.avi" -type f -exec rm -i {} \;
You can also use -regex instead of -name to feed in a more complex search pattern. ! negates the search, so it will effectively list the files that do not have those extensions.
It is good to use rm -i because it prompts for confirmation before deleting each file. This may become tedious if the list is long, so decide for yourself whether to include it.
Deleting lots of files this way can be dangerous. Once deleted, you cannot get them back. So make sure you run the find command without the rm first and inspect the list thoroughly before deleting anything.
Update:
As stated in the comments by aculich, you can also do the following -
find /path/to/files ! -name "*.avi" -type f -delete
-type f ensures that it will only find and delete regular files and will not touch directories, symlinks etc.
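To build that single-extension command out to several extensions, the -name tests can be grouped and negated as a whole. This is only a sketch: the scratch directory and file names are made up, and -iname (case-insensitive -name) is assumed to be available, which it is in GNU and BSD find but not in strict POSIX.

```shell
# Scratch directory so nothing real is touched (all file names are made up).
dir=$(mktemp -d)
touch "$dir/movie.avi" "$dir/clip.MKV" "$dir/notes.txt" "$dir/readme"

# Preview: regular files whose names match none of the grouped patterns.
to_delete=$(find "$dir" -type f ! \( -iname '*.avi' -o -iname '*.mkv' \) | sort)
printf '%s\n' "$to_delete"

# Once the preview looks right, run the same expression with -delete.
find "$dir" -type f ! \( -iname '*.avi' -o -iname '*.mkv' \) -delete

# What is left afterwards.
kept=$(find "$dir" -type f | sort)
printf '%s\n' "$kept"
```

Add one -o -iname '*.ext' per extension inside the parentheses. The escaped parentheses matter: without them the ! would apply only to the first test.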

You can use a quick and dirty rm command to accomplish what you want, but keep in mind it is error-prone, non-portable, dangerous, and has severe limitations.
As others have suggested, you can use the find command. I would recommend using find rather than rm in almost all cases.
Since you mention that you are on a Linux system I will use the GNU implementation that is part of the findutils package in my examples because it is the default on most Linux systems and is what I generally recommend learning since it has a much richer and more advanced set of features than many other implementations.
Though it can be daunting and seemingly over-complicated it is worth spending time to master the find command because it gives you a kind of precise expressiveness and safety that you won't find with most other methods without essentially (poorly) re-inventing this command!
Find Example
People often suggest using the find command in inefficient, error-prone and dangerous ways, so below I outline a safe and effective way to accomplish exactly what you asked for in your example.
Before deleting files I recommend previewing the file list first (or at least part of the list if it is very long). Since you want to delete everything that does not have one of the listed extensions, the regex match is negated with !:
find path/to/files -type f -regextype posix-extended ! -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$'
The above command will show you the list of files that you will be deleting. To actually delete the files you can simply add the -delete action like so:
find path/to/files -type f -regextype posix-extended ! -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$' -delete
If you would like to see what will remain you can invert the match in the preview by dropping the ! from the preview command (without the -delete) like so:
find path/to/files -type f -regextype posix-extended -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$'
The output of this inverse match should be the same as the output you will see when listing the files after performing the delete unless errors occurred due to permissions problems or unwritable filesystems:
find path/to/files -type f
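If you would rather not type the alternation by hand, here is a minimal sketch, assuming GNU find, of building the regex from the comma-separated list in the question and previewing the files that do not have one of those extensions (which is what the question asks to remove). The variable names, scratch directory and file names are made up for illustration.

```shell
# The comma-separated list from the question.
exts='avi,mkv,wmv,mp4,mp5,flv,M4V,mpeg,mov,m1v,m2v,3gp,avchd'

# Turn "avi,mkv,..." into "avi|mkv|..." and wrap it in the regex.
alternation=$(printf '%s' "$exts" | tr ',' '|')
regex=".*\.($alternation)\$"

# Scratch tree with a mix of matching and non-matching names.
dir=$(mktemp -d)
mkdir "$dir/sub"
touch "$dir/a.AVI" "$dir/sub/b.mp4" "$dir/notes.txt" "$dir/sub/noext"

# Preview the regular files that do NOT end in one of the extensions.
# -regextype and -iregex are GNU find features.
doomed=$(find "$dir" -type f -regextype posix-extended ! -iregex "$regex" | sort)
printf '%s\n' "$doomed"
```

Only after checking that list would I append -delete to the same find expression.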
Explanation
Here I will explain in some depth the options I chose and why:
I added -type f to restrict the matches to files-only; without that it will match non-files such as directories which you probably don't want. Also note that I put it at the beginning rather than the end because order of predicates can matter for speed; with -type f first it will execute the regex check against files-only instead of against everything... in practice it may not matter much unless you have lots of directories or non-files. Still, it's worth keeping order of predicates in mind since it can have a significant impact in some cases.
I use the case-insensitive -iregex option as opposed to the case-sensitive -regex option because I assumed that you wanted to use case-insensitive matching so it will include both .wmv and .WMV files.
You'll probably want to use extended POSIX regular expressions for simplicity and brevity. Unfortunately there is not yet a short-hand for -regextype posix-extended, but even so I would recommend using it because you can avoid the problem of having to add lots of \ backslashes to escape things in longer, more complex regexes and it has more advanced (modern) features. The GNU implementation defaults to emacs-style regexes which can be confusing if you're not used to them.
The -delete option should make obvious sense. Sometimes people suggest the slower and more complex -exec rm {} \; instead, usually because they are not aware of the safer, faster, and easier -delete option (in rare cases you may encounter old systems with an ancient version of find that lacks it). It is useful to know that -exec exists, but use -delete where you can for deleting files. Also, do not pipe | the output of find to another program unless you use and understand the -print0 option; otherwise you're in for a world of hurt when you encounter file names containing spaces or newlines.
The path/to/files argument I included explicitly. If you leave it out it will implicitly use . as the path, however it is safer (especially with a -delete) to state the path explicitly.
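To make the -print0 point concrete, here is a small sketch comparing newline-delimited and NUL-delimited output when a file name contains a space. The scratch directory and file names are made up, and xargs -0 is assumed to be available (it is in GNU and BSD xargs).

```shell
dir=$(mktemp -d)
touch "$dir/two words.avi" "$dir/plain.avi"

# Unsafe: xargs splits its input on whitespace, so "two words.avi"
# arrives as two separate arguments.
unsafe_count=$(find "$dir" -type f -name '*.avi' | xargs sh -c 'echo "$#"' sh)

# Safe: names are separated by NUL bytes, which cannot occur in a file name.
safe_count=$(find "$dir" -type f -name '*.avi' -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "newline-delimited: $unsafe_count arguments, NUL-delimited: $safe_count arguments"
```

The inner sh -c 'echo "$#"' simply reports how many arguments it received, so the two counts show how the same two files are split differently.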
Alternate find Implementations
Even though you said you're on a Linux system I will also mention the differences that you'll encounter with the BSD implementations, which include Mac OS X! For other systems (like older Solaris boxes), good luck! Upgrade to one of the more modern find variants!
The main difference in this example is regarding regular expressions. The BSD variants use basic POSIX regular expressions (BRE) by default. To avoid the burdensome extra escaping that basic regular expressions require, you can take advantage of the more modern features of extended regular expressions (ERE) by specifying the -E option with the BSD variant to achieve the same behavior as the GNU variant that uses -regextype posix-extended.
find -E path/to/files -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$' -type f
Note in this case that the -E option comes before the path/to/files whereas the -regextype posix-extended option for GNU comes after the path.
It is too bad that GNU does not provide a -E option (yet!); since I think it would be a useful option to have parity with the BSD variants I will submit a patch to findutils to add this option and if it is accepted I will update this answer accordingly.
rm - Not Recommended
Though I strongly recommend against using rm, I will give examples of how to accomplish more or less what your question specifically asked (with some caveats). Note that, as written, these commands remove the files that have the listed extensions.
Assuming you use a shell with Bourne syntax (which is usually what you find on Linux systems, which default to the Bash shell) you can use this command:
for ext in avi mkv wmv mp4 mp5 flv M4V mpeg mov m1v m2v 3gp avchd; do rm -f path/to/files/*.$ext; done
If you use Bash and have extended globbing turned on with shopt -s extglob then you can use Pattern Matching with Filename Expansion:
rm -f path/to/files/*.+(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)
The +(pattern-list) extended globbing syntax will match one or more occurrences of the given patterns.
However, I strongly recommend against using rm because:
It is error-prone and dangerous because it is easy to accidentally put a space after a * (for example * .avi), which means you will delete everything; you cannot preview the result of the command ahead of time; it is fire-and-forget, so good luck with the aftermath.
It is non-portable because even if it happens to work in your particular shell, the same command line may not work in other shells (including other Bourne-shell variants if you are prone to using Bash-isms).
It has severe limitations because if you have files that are nested in subdirectories or even just lots of files in a single directory, then you will quickly hit the limits on command line length when using file globbing.
I wish the rm command would just rm itself into oblivion because I can think of few places where I'd rather use rm instead of (even ancient implementations of) find.

With Bash, you could first enable extglob option:
$ shopt -s extglob
And do the following:
$ rm -i !(*.avi|*.mkv|*.wmv|*.mp4)
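A cautious way to try the negated glob is to print its expansion before handing it to rm. The sketch below uses a made-up scratch directory and starts bash with -O extglob so the option is already on when the pattern is parsed. As far as I know, spaces inside the parentheses become part of the patterns, so the list is written without any.

```shell
dir=$(mktemp -d)
touch "$dir/a.avi" "$dir/b.mkv" "$dir/notes.txt" "$dir/readme"

# -O extglob enables extended globbing before bash parses the command string.
# printf only previews the expansion; swap it for rm -i once the list looks right.
preview=$(bash -O extglob -c 'cd "$1" && printf "%s\n" !(*.avi|*.mkv)' bash "$dir")
printf '%s\n' "$preview"
```

Like any plain glob, this only looks at one directory and skips hidden files.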

Related

the best way to get "find" style output from "ls -fR"

My goal is to find the fastest way to list all available files in a directory (call it the master directory). The master directory contains about 5 million files, organized using subdirectories but it's unclear how subdirectories are arranged. After some research I realized that the fastest way to do so is using ls -fR (-f disables sorting)
The default output from ls -fR is something like this:
$ ls -fR dir1
dir1:
. subdir1 ..
dir1/subdir1:
. file1 ..
My desired output is the one produced by find (find takes twice as long though):
$ find dir1/ -type f
dir1/subdir1/file1
Although I can potentially parse the ls -fR result, I was wondering if there is a simple way to make ls -fR output in "find" style. I hope there is a very easy toggle and I'm just being blind to it
find takes twice as long
Interesting. Are you really sure though?
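ls -fR and find may not be doing the same work. Note that -f implies -a in GNU and POSIX ls, so hidden files and directories are listed (along with the . and .. entries visible in your output). Compare the results with ls -fRA too.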
If you run find; ls -fR the latter will have a huge advantage due to caching. Try swapping the order or clear the cache (sync; echo 3 | sudo tee /proc/sys/vm/drop_caches) before each command.
I hope there is a very easy toggle and I'm just being blind to it
Not that I would know. Posix ls certainly has no such thing. As far as I can tell from man ls, even GNU ls 8.32 has no such option.
You could adapt the output of ls to match that of find using
ls -fRpA | awk '/:$/ {sub(/:$/,"/"); p=$0; next} length() && !/\// {print p $0}'
However, that would break on files/directories with linebreaks in their names and on files whose names end with a :. It will also slow things down a bit; the longer the paths, the slower it gets, I'd assume. This could also partially explain why find is slower than ls: the former prints a lot more text because it has to repeat the names of the top-level directories over and over again.
I strongly advise against using the above script. It is fragile and unreadable, likely just for the sake of premature optimization: most certainly you want to do something with the printed list, and that something will probably take more time than generating the list. Also, with different implementations running on different systems find may be faster than ls, you never know.
Also, don't parse the output of ls/find, instead use find -exec to do the actual task. If you really must, find -print0 would be the safe option (can be replaced by find -exec printf %s\\0 {} + if not available on your system).
Depending on the task, locate might be a fast alternative to find. If not, try parallelizing find using something like printf %s\\0 ./* | xargs -0 -I_ -P0 find _ -type f or a tool like fd that has built-in parallelization.
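As an illustration of letting find run the task instead of parsing its output, here is a sketch that counts the lines in every regular file under a made-up scratch tree. The line-counting task is just a stand-in for whatever you actually want to do with the files.

```shell
dir=$(mktemp -d)
mkdir "$dir/sub dir"
printf 'x\n' > "$dir/sub dir/file one"
printf 'y\n' > "$dir/file2"

# find passes the names straight to the command as arguments, so spaces
# (or newlines) in file names need no special handling.
total=$(find "$dir" -type f -exec sh -c 'cat -- "$@" | wc -l' sh {} + | tr -d ' ')
echo "$total"
```

With {} + the command is started once per batch of files rather than once per file, so for very large trees the output may contain one count per batch.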

Customized deleting files from a folder

I have a folder that can contain different files. I would like to check whether it contains any files other than .gitkeep and delete them, while keeping .gitkeep. How can I do this? (I'm a newbie when it comes to bash)
As always, there are multiple ways to do this, I am just sharing what little I know of linux :
1) find <path-to-the-folder> -maxdepth 1 -type f ! -iname '\.gitkeep' -delete
maxdepth of 1 specifies to search only the current directory. If you remove maxdepth, it will recursively find all files other than '.gitkeep' in all directories under your path. You can increase maxdepth to however deep you want find to go into directories from your path.
'-type f' specifies that we are just looking for files . If you want to find directories as well (or links, other types ) then you can omit this option.
-iname '\.gitkeep' specifies a case-insensitive match for '.gitkeep'. Note that -name and -iname take shell glob patterns, not regular expressions, so '.' is already literal; the '\' is harmless but not needed.
You can leave it to be -name instead of -iname for case sensitive match.
The '!' before -iname, is to do an inverse match, i.e to find all files that don't have the name '.gitkeep', if you remove the '!', then you will get all files that match '.gitkeep'.
finally, '-delete' will delete the files that match this specification.
If you want to see what all files will be deleted before executing -delete, you can remove that flag and it will show you all the files :
find <path-to-the-folder> -maxdepth 1 -type f ! -iname '\.gitkeep'
(you can also use -print at the end, which is just redundant)
2) for i in `ls -a | grep -v '\.gitkeep'` ; do rm -rf $i ; done
Not really recommended to do it this way, since rm -rf is always a bad idea (IMO). You can change that to rm -f (to ensure it just works on file and not directories).
To be on the safe side, it is recommended to do an echo of the file list first to see if you are ready to delete all the files shown :
for i in `ls -a | grep -v '\.gitkeep'` ; do echo $i ; done
The rm loop iterates through all the files that don't match '.gitkeep' and deletes them one by one ... not the best way to delete files, I suppose
3) rm -rf $(ls -a | grep -v '\.gitkeep')
Again, careful with rm -rf, instead of rm -rf above, you can again do an echo to find out the files that will get deleted
I am sure there are more ways, but just a glimpse of the array of possibilities :)
Good Luck,
Ash
================================================================
EDIT :
=> manpages are your friend when you are trying to learn something new, if you don't understand how a command works or what options it can take and do, always lookup man for details.
ex : man find
=> I understand that you are trying to learn something out of your comfort zone, which is always commendable, but stack overflow doesn't like people asking questions without researching.
If you did research, you are expected to mention it in your question, letting people know what you have done to find answers on your own.
A simple google search or a deep dive into stack overflow questions would have provided you with a similar or even a better answer to your question. So be careful :)
Forewarned is forearmed :)
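You can use find (-type f restricts the match to regular files, so the folder itself is never a candidate for deletion):
find /path/to/folder -maxdepth 1 -type f ! -name .gitkeep -delete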
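If you prefer a plain shell loop, globs avoid the word-splitting problems of looping over ls output. A minimal sketch follows, run against a made-up scratch directory. It assumes you only want to remove regular files directly inside the folder, and the .[!.]* glob will not catch names that start with two dots.

```shell
dir=$(mktemp -d)
mkdir "$dir/keepdir"
touch "$dir/.gitkeep" "$dir/a.txt" "$dir/with space.log" "$dir/.hidden"

# "$dir"/* covers ordinary names, "$dir"/.[!.]* covers hidden ones.
for f in "$dir"/* "$dir"/.[!.]*; do
  [ -f "$f" ] || continue                     # skip directories and unmatched globs
  [ "${f##*/}" = ".gitkeep" ] && continue     # keep .gitkeep
  rm -- "$f"
done

remaining=$(ls -A "$dir")
printf '%s\n' "$remaining"
```

Replacing rm -- "$f" with echo "$f" turns the loop into a dry run.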

How to find all directories which include sources?

I have a project with a few directories (not all of them known in advance). I want to issue a command to find all directories which include sources. Something like find . -name "*.cpp" will give me a list of sources, while I want just a list of the directories which include them. The project structure is not known in advance; some sources may exist in directory X and others in a subdirectory X/Y. What command will print the list of all directories which include sources?
find . -name "*.cpp" -exec dirname {} \; | sort -u
If (a) you have GNU find or a recent version of BSD find and (b) you have a recent version of dirname (such as GNU coreutils 8.21 or FreeBSD 10 but not OSX 10.10), then, for greater efficiency, use (Hat tip: Jochen and mklement0):
find . -name "*.cpp" -exec dirname {} + | sort -u
John1024's answer is elegant and fast, IF your version of dirname supports multiple arguments and you can invoke it with -exec dirname {} +.
Otherwise, with -exec dirname {} \;, a child process is forked for each and every input filename, which is quite slow.
If:
your dirname doesn't support multiple arguments
and performance matters
and you're using bash 4 or higher
consider the following solution:
shopt -s globstar; printf '%s\n' ./**/*.cpp | sed 's|/[^/]*$||' | sort -u
shopt -s globstar activates support for cross-directory pathname expansion (globbing)
./**/*.cpp then matches .cpp files anywhere in the current directory's subtree
Note that the glob intentionally starts with ./, so that the sed command below also properly reports the top-level directory itself, should it contain matching files.
sed 's|/[^/]*$||' effectively performs the same operation as dirname, but on all input lines with a single invocation of sed.
sort -u sorts the result and outputs only unique directory names.
find . -name "*.cpp" | while IFS= read -r f; do dirname "$f" ; done | sort -u
should do what you need
find . -name '*.cpp' | sed -e 's/\/[^/]*$//' | sort | uniq
To simply find non-empty directories:
$ find . \! -empty -type d
For directories with only specific filetypes in it, I would use something like this:
find . -name \*.cpp | while IFS= read -r line; do dirname "${line}" ; done | sort -u
This finds all *.cpp files and calls dirname on each filename. The result is then sorted and made unique. There are definitely faster ways to do this using shell builtins that don't require spawning a new process for each *.cpp file. But that probably shouldn't matter for most projects.
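One way to avoid dirname altogether, assuming GNU find, is to let find print the directory part itself with -printf '%h\n'. A sketch against a made-up scratch tree:

```shell
root=$(mktemp -d)
mkdir -p "$root/X/Y" "$root/Z"
touch "$root/X/a.cpp" "$root/X/Y/b.cpp" "$root/X/Y/c.cpp" "$root/Z/readme.txt"

# %h is the leading directory part of each match; sort -u removes duplicates.
dirs=$(cd "$root" && find . -name '*.cpp' -printf '%h\n' | sort -u)
printf '%s\n' "$dirs"
```

No extra process is started per file, but -printf is a GNU extension, so this will not work with BSD find.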
You should define what is a source file.
Notice that some C or C++ files are generated (e.g. by parser-generators like bison or yacc, by ad-hoc awk or python or shell scripts, by generators particular to the project, etc...), and that some included C or C++ files are not named .h or .cc (read about X-macros). Within GCC a significant amount of files are generated (e.g. from *.md machine description files, which are the authentic source files)
Most large software projects (e.g. of many millions lines of C++ or C code) have or are using some C or C++ code generators somewhere.
In the free software world, a source code is simply the preferred form of the code on which the developer is working.
Notice that source code might not even sit in a file; it could sit in a database, in some heap image, e.g. if the developer is interacting with a specific program to work. (Remember Smalltalk machines of the 1980s, or Mentor structured editor at INRIA in 1980). As another example, J.Pitrat's CAIA system has its C code entirely self generated. Look also inside Scheme48
Perhaps (as an approximate heuristic only) you should consider as a C++ source file any file named .h or  .cc or .cpp or .cxx or perhaps .def or .inc or .tcc which does not contain the GENERATED FILE words (usually inside some comments).
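That heuristic could be sketched roughly as follows. The scratch files are made up, only some of the suffixes are listed, and grep -L (list the files without a match) is assumed to be available, which it is in GNU and BSD grep.

```shell
root=$(mktemp -d)
printf 'int main() { return 0; }\n' > "$root/main.cpp"
printf '/* GENERATED FILE, do not edit */\nint x;\n' > "$root/parser.cc"
printf 'not C++\n' > "$root/notes.txt"

# Candidate names by suffix, then keep only those lacking the marker words.
sources=$(find "$root" -type f \
  \( -name '*.h' -o -name '*.cc' -o -name '*.cpp' -o -name '*.cxx' \) \
  -exec grep -L 'GENERATED FILE' {} + | sort)
printf '%s\n' "$sources"
```

It remains an approximation: a generated file that lacks the marker will still be reported as source.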
To understand what are the generated files you should dive into the build procedure (described by Makefile, CMake*, Makefile.am with autoconf etc etc...). There is no fool-proof way of detecting or guessing generated C++ files; so you won't be able to reliably automate their detection.
Finally, bootstrapped languages often have a (version control) repository which contains some generated files. Ocaml has a boot/ subdirectory, and MELT has a melt/generated/ directory (containing C++ files needed to regenerate MELT in C++ form from *.melt source code files).
I would suggest using the project's version control repository and getting the non-empty directories from there. Details depend upon the version control tool (e.g. git, or svn, or hg, etc...). You should use some version control (or revision control) tool. I recommend git.

renaming with find

I managed to find several files with the find command.
the files are of the type file_sakfksanf.txt, file_afsjnanfs.pdf, file_afsnjnjans.cpp,
now I want to rename them with the rename and -exec command to
mywish_sakfksanf.txt, mywish_afsjnanfs.pdf, mywish_afsnjnjans.cpp
that only the first prefix is changed. I am trying for some time, so don't blame me for being stupid.
If you read through the -exec section of the man pages for find you will come across the {} string that allows you to use the matches as arguments within -exec. This will allow you to use rename on your find matches in the following way:
find . -name 'file_*' -exec rename 's/file_/mywish_/' {} \;
From the manual:
-exec command ;
Execute command; true if 0 status is returned. All following
arguments to find are taken to be arguments to the command until an
argument consisting of `;' is encountered. The string `{}' is replaced
by the current file name being processed everywhere it occurs in the
arguments to the command, not just in arguments where it is alone, as
in some versions of find. Both of these constructions might need to
be escaped (with a `\') or quoted to protect them from expansion by
the shell. See the EXAMPLES section for examples of the use of the
-exec option. The specified command is run once for each matched file.
The command is executed in the starting directory. There are
unavoidable security problems surrounding use of the -exec action;
you should use the -execdir option instead.
Although you asked for a find/exec solution, as Mark Reed suggested, you might want to consider piping your results to xargs. If you do, make sure to use the -print0 option with find and either the -0 or --null option with xargs to avoid unexpected behaviour resulting from whitespace or shell metacharacters appearing in your file names. Also, consider using the + version of -exec (also in the manual) as this is the POSIX spec for find and should therefore be more portable if you are wanting to run your command elsewhere (not always true); it also builds its command line in a way similar to xargs which should result in fewer invocations of rename.
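I don't think there's a way to do this with find alone; you'll need to create a script:
#!/bin/bash
# Replace the first "file_" in the given path with "mywish_" and rename the file.
NEW=$(printf '%s\n' "$1" | sed -e 's/file_/mywish_/')
mv -- "$1" "$NEW"
Then you can run (with my_script executable and on your PATH):
find ./ -name 'file_*' -exec my_script {} \;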
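If you would rather not depend on rename or on a separate script file, a small inline sh -c can do the same job. This is only a sketch, run against a made-up scratch tree. It rewrites just the leading file_ of the base name, so a directory that happens to contain file_ in its name is left alone.

```shell
root=$(mktemp -d)
mkdir "$root/docs"
touch "$root/file_sakfksanf.txt" "$root/docs/file_afsjnanfs.pdf" "$root/other.cpp"

# For each match, split the path into directory and base name, then swap
# only the leading "file_" of the base name for "mywish_".
find "$root" -type f -name 'file_*' -exec sh -c '
  for path do
    dir=${path%/*}
    base=${path##*/}
    mv -- "$path" "$dir/mywish_${base#file_}"
  done
' sh {} +

renamed=$(cd "$root" && find . -type f | sort)
printf '%s\n' "$renamed"
```

Changing mv to echo mv gives a dry run that prints the planned renames first.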

How do I write a bash script to replace words in files and then rename files?

I have a folder structure, as shown below:
I need to create a bash script that does 4 things:
It searches all the files in the generic directory and finds the string 'generic' and makes it into 'something'
As above, but changes "GENERIC" to "SOMETHING"
As above, but changes "Generic" to "Something"
Renames any filename that has "generic" in it with "something"
Right now I am doing this process manually using search and replace in NetBeans. I don't know much about bash scripting, but I'm sure this can be done. I'm thinking of something that I would run and it would take "Something" as the input.
Where would I start? what functions should I use? overall guidance would be great. thanks.
I am using Ubuntu 10.5 desktop edition.
Editing
The substitution part is a sed script - call it mapname:
sed -i.bak \
-e 's/generic/something/g' \
-e 's/GENERIC/SOMETHING/g' \
-e 's/Generic/Something/g' "$@"
Note that this will change words in comments and strings too, and it will change 'generic' as part of a word rather than just the whole word. If you want just the word, then you use end-word markers around the terms: 's/\<generic\>/something/g'. The -i.bak creates backups.
You apply that with:
find . -type f -exec mapname {} +
That creates a command with a list of files and executes it. Clearly, you can, if you prefer, avoid the intermediate mapname shell/sed script (by writing the sed script in place of the word mapname in the find command). Personally, I prefer to debug things separately.
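To see what the three substitutions do before pointing them at a real tree, they can be tried on a throwaway file. The scratch file and its contents below are made up for illustration.

```shell
dir=$(mktemp -d)
printf 'generic GENERIC Generic\ngenerically\n' > "$dir/sample.txt"

# The same three case-specific substitutions; -i.bak edits the file in place
# and leaves the original behind as sample.txt.bak.
sed -i.bak \
  -e 's/generic/something/g' \
  -e 's/GENERIC/SOMETHING/g' \
  -e 's/Generic/Something/g' "$dir/sample.txt"

result=$(cat "$dir/sample.txt")
printf '%s\n' "$result"
```

The second line of the sample shows the partial-word effect: generically becomes somethingally.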
Renaming
The renaming of the files is best done with the rename command - of which there are two variants, so you'll need to read your manual. Use one of these two:
find . -name '*generic*' -depth -exec rename generic something {} +
find . -name '*generic*' -depth -exec rename s/generic/something/g {} +
(Thanks to Stephen P for pointing out that I was using a more powerful Perl-based variant of rename with full Perl regexp capacity, and to Zack and Jefromi for pointing out that the Perl one is found in the real world* too.)
Notes:
This renames directories.
It is probably worth keeping the -depth in there so that the contents of the directories are renamed before the directories; you could otherwise get messages because you rename the directory and then can't locate the files in it (because find gave you the old name to work with).
The more basic rename will move ./generic/do_generic.java to ./something/do_generic.java only. You'd need to run the command more than once to get every component of every file name changed.
* The version of rename that I use is adapted from code in the 1st Edition of the Camel book.
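If neither rename variant is at hand, the same depth-first idea can be sketched with find, sed and mv alone. The scratch tree is made up, and only the last path component is rewritten on each step, which is why -depth matters here.

```shell
root=$(mktemp -d)
mkdir "$root/generic"
touch "$root/generic/do_generic.java"

# -depth lists directory contents before the directory itself, so every
# path is still valid under its old name when mv is called on it.
find "$root" -depth -name '*generic*' -exec sh -c '
  for path do
    dir=${path%/*}
    base=${path##*/}
    new=$(printf "%s\n" "$base" | sed "s/generic/something/g")
    mv -- "$path" "$dir/$new"
  done
' sh {} +

result=$(cd "$root" && find . -type f)
printf '%s\n' "$result"
```

Because files are renamed before their parent directories, a single pass changes every component.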
Steps 1-3 can be done like this:
find .../path/to/generic -type f -print0 |
xargs -0 perl -pi~ -e \
's/\bgeneric\b/something/g;
s/\bGENERIC\b/SOMETHING/g;
s/\bGeneric\b/Something/g;'
I don't understand exactly what you want to happen in step 4 so I can't help with that part.

Resources