the best way to get "find" style output from "ls -fR" - linux

My goal is to find the fastest way to list all available files in a directory (call it the master directory). The master directory contains about 5 million files, organized in subdirectories, but it's unclear how the subdirectories are arranged. After some research I concluded that the fastest way to do this is ls -fR (-f disables sorting).
The default output from ls -fR is something like this:
$ ls -fR dir1
dir1:
. subdir1 ..
dir1/subdir1:
. file1 ..
My desired output is the one produced by find (find takes twice as long though):
$ find dir1/ -type f
dir1/subdir1/file1
Although I can potentially parse the ls -fR result, I was wondering if there is a simple way to make ls -fR output in "find" style. I hope there is a very easy toggle and I'm just being blind to it

find takes twice as long
Interesting. Are you really sure though?
ls -fR ignores hidden files and directories. Maybe ls just skips some of the work. Try ls -fRA too.
If you run find; ls -fR the latter will have a huge advantage due to caching. Try swapping the order or clear the cache (sync; echo 3 | sudo tee /proc/sys/vm/drop_caches) before each command.
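For a rough but fairer benchmark, flush the cache before each candidate and time it (a sketch using the dir1 example from above; redirecting to /dev/null keeps terminal rendering out of the measurement):
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
time ls -fRA dir1 > /dev/null
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
time find dir1/ -type f > /dev/null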
I hope there is a very easy toggle and I'm just being blind to it
Not that I know of. POSIX ls certainly has no such thing. As far as I can tell from man ls, even GNU ls 8.32 has no such option.
You could adapt the output of ls to match that of find using
ls -fRpA | awk '/:$/ {sub(/:$/,"/"); p=$0; next} length() && !/\// {print p $0}'
Even so, this breaks on file/directory names containing linebreaks and on file names ending with a colon. It also slows things down a bit, and the longer the paths, the slower it gets, I'd assume. This could partially explain why find is slower than ls: the former simply prints a lot more text, because it has to repeat the names of the top-level directories over and over again.
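On the example directory from the question, the pipeline would look like this (output reconstructed from the dir1/subdir1/file1 layout above):
$ ls -fRpA dir1 | awk '/:$/ {sub(/:$/,"/"); p=$0; next} length() && !/\// {print p $0}'
dir1/subdir1/file1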
I strongly advise against using above script. It is fragile and unreadable, likely just for the sake of premature optimization: Most certainly you want to do something with the printed list. That something will probably take more time than generating the list. Also, with different implementations running on different systems find may be faster than ls – you never know.
Also, don't parse the output of ls/find, instead use find -exec to do the actual task. If you really must, find -print0 would be the safe option (can be replaced by find -exec printf %s\\0 {} + if not available on your system).
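A minimal sketch of both suggestions, with checksumming standing in for whatever the real task is:
# do the work directly from find, without parsing any listing
find dir1/ -type f -exec sha256sum {} +
# or, if another tool must consume the list, hand it over NUL-delimited
find dir1/ -type f -print0 | xargs -0 sha256sum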
Depending on the task, locate might be a fast alternative to find. If not, try parallelizing find using something like printf %s\\0 ./* | xargs -0 -I_ -P0 find _ -type f or a tool like fd that has built-in parallelization.

Related

linux diff on folder and file structure [duplicate]

I have two directories with the same list of files. I need to compare all the files present in both the directories using the diff command. Is there a simple command line option to do it, or do I have to write a shell script to get the file listing and then iterate through them?
You can use the diff command for that:
diff -bur folder1/ folder2/
This will output a recursive diff that ignores whitespace, with a unified context:
b flag means ignore changes in the amount of whitespace
u flag means a unified context (3 lines before and after)
r flag means recursive
If you are only interested in seeing which files differ, you may use:
diff -qr dir_one dir_two | sort
Option "q" will only show which files differ, not the content that differs, and "sort" will arrange the output alphabetically.
Diff has an option -r which is meant to do just that.
diff -r dir1 dir2
diff can not only compare two files, it can, by using the -r option, walk entire directory trees, recursively checking differences between subdirectories and files that occur at comparable points in each tree.
$ man diff
...
-r --recursive
Recursively compare any subdirectories found.
...
Another nice option is the über-diff-tool diffoscope:
$ diffoscope a b
It can also emit diffs as JSON, HTML, Markdown, ...
If you specifically don't want to compare the contents of files and only want to check which ones are not present in both directories, you can compare lists of files generated by another command.
diff <(find DIR1 -printf '%P\n' | sort) <(find DIR2 -printf '%P\n' | sort) | grep '^[<>]'
-printf '%P\n' tells find to not prefix output paths with the root directory.
I've also added sort to make sure the order of files will be the same in both calls of find.
The grep at the end removes information about identical input lines.
If it's GNU diff then you should just be able to point it at the two directories and use the -r option.
Otherwise, try using
for i in $(\ls -d ./dir1/*); do diff ${i} dir2; done
N.B. As pointed out by Dennis in the comments section, you don't actually need to do the command substitution on the ls. I've been doing this for so long that I'm pretty much doing this on autopilot and substituting the command I need to get my list of files for comparison.
Also I forgot to add that I do '\ls' to temporarily disable my alias of ls to GNU ls so that I lose the colour formatting info from the listing returned by GNU ls.
When working with git/svn, or with multiple git/svn instances on disk, this has been one of the most useful things for me over the past 5-10 years; somebody else might find it useful too:
diff -burN /path/to/directory1 /path/to/directory2 | grep +++
or:
git diff /path/to/directory1 | grep +++
It gives you a snapshot of the different files that were touched without having to "less" or "more" the output. Then you just diff on the individual files.
In practice, the question often arises together with some constraints. In that case, the following solution template may come in handy.
cd dir1
find . \( -name '*.txt' -o -iname '*.md' \) | xargs -i diff -u '{}' 'dir2/{}'
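xargs -i (a deprecated spelling of -I '{}') also mishandles names containing spaces or quotes; a hedged, whitespace-safe variant of the same template, assuming GNU find and bash, would be:
cd dir1
find . \( -name '*.txt' -o -iname '*.md' \) -type f -print0 |
while IFS= read -r -d '' f; do
    # compare each matching file against its counterpart under dir2
    diff -u "$f" "dir2/$f"
done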
Here is a script to show differences between files in two folders. It works recursively. Change dir1 and dir2.
(
  search() {
    for i in $1/*; do
      # regular file: diff it against the file with the same name in the second tree
      [ -f "$i" ] && (diff "$1/${i##*/}" "$2/${i##*/}" || echo "files: $1/${i##*/} $2/${i##*/}")
      # directory: recurse into the corresponding subdirectories
      [ -d "$i" ] && search "$1/${i##*/}" "$2/${i##*/}"
    done
  }
  search "dir1" "dir2"
)
Try this:
diff -rq /path/to/folder1 /path/to/folder2

Customized deleting files from a folder

I have a folder where different files can be located. I would like to check whether it contains any files other than .gitkeep and delete them, while keeping .gitkeep itself. How can I do this? (I'm a newbie when it comes to bash)
As always, there are multiple ways to do this; I am just sharing what little I know of Linux:
1) find <path-to-the-folder> -maxdepth 1 -type f ! -iname '\.gitkeep' -delete
maxdepth of 1 specifies to search only the current directory. If you remove maxdepth, it will recursively find all files other than '.gitkeep' in all directories under your path. You can increase maxdepth to however deep you want find to go into directories from your path.
'-type f' specifies that we are just looking for files. If you want to find directories as well (or links, or other types), you can omit this option.
-iname '.gitkeep' specifies a case-insensitive match for '.gitkeep'. The '\' in the command escapes the '.', but it isn't actually required: -iname takes a shell glob pattern, not a regular expression, and '.' has no special meaning in globs, so the backslash is harmless but unnecessary.
You can use -name instead of -iname for a case-sensitive match.
The '!' before -iname does an inverse match, i.e. it finds all files that don't have the name '.gitkeep'; if you remove the '!', you will get all files that match '.gitkeep'.
Finally, '-delete' will delete the files that match this specification.
If you want to see what all files will be deleted before executing -delete, you can remove that flag and it will show you all the files :
find <path-to-the-folder> -maxdepth 1 -type f ! -iname '\.gitkeep'
(you can also use -print at the end, which is just redundant)
2) for i in `ls -a | grep -v '\.gitkeep'` ; do rm -rf $i ; done
Not really recommended to do it this way, since rm -rf is always a bad idea (IMO). You can change it to rm -f to ensure it only acts on files and not directories.
To be on the safe side, it is recommended to do an echo of the file list first to see if you are ready to delete all the files shown :
for i in `ls -a | grep -v '\.gitkeep'` ; do echo $i ; done
This will iterate through all the files that don't match '.gitkeep' and delete them one by one... not the best way to delete files, I suppose.
3)rm -rf $(ls -a | grep -v '\.gitkeep')
Again, careful with rm -rf, instead of rm -rf above, you can again do an echo to find out the files that will get deleted
I am sure there are more ways, but just a glimpse of the array of possibilities :)
Good Luck,
Ash
================================================================
EDIT :
=> manpages are your friend when you are trying to learn something new, if you don't understand how a command works or what options it can take and do, always lookup man for details.
ex : man find
=> I understand that you are trying to learn something out of your comfort zone, which is always commendable, but stack overflow doesn't like people asking questions without researching.
If you did research, you are expected to mention it in your question, letting people know what you have done to find answers on your own.
A simple google search or a deep dive into stack overflow questions would have provided you with a similar or even a better answer to your question. So be careful :)
Forewarned is forearmed :)
You can use find:
find /path/to/folder -maxdepth 1 ! -name .gitkeep -delete
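As with the first answer, a hedged preview-first variant (adding -type f so only regular files are considered) could be:
# preview what would be removed
find /path/to/folder -maxdepth 1 -type f ! -name .gitkeep
# then append -delete once the list looks right
find /path/to/folder -maxdepth 1 -type f ! -name .gitkeep -delete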

Remove all files that does not have the following extensions in Linux

I have a list of extensions:
avi,mkv,wmv,mp4,mp5,flv,M4V,mpeg,mov,m1v,m2v,3gp,avchd
I want to remove all files without the following extensions, as well as files without any extension, in a directory in Linux.
How can I do this using the rm linux command?
You will first have to find the files that do not have those extensions. You can do this very easily with the find command. You can build on the following command -
find /path/to/files ! -name "*.avi" -type f -exec rm -i {} \;
You can also use -regex instead of -name to feed in complex search pattern. ! is to negate the search. So it will effectively list out those files that do not contain those extensions.
It is good to do rm -i as it will list out all the files before deleting. It may become tedious if your list is comprehensive so you can decide yourself to include it or not.
Deleting tons of files using this can be dangerous. Once deleted, you can never get them back. So make sure you run the find command without the rm first to inspect the list thoroughly before deleting anything.
Update:
As stated in the comments by aculich, you can also do the following -
find /path/to/files ! -name "*.avi" -type f -delete
-type f will ensure that it will only find and delete regular files and will not touch any directories, sym links etc.
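To cover the whole extension list from the question without writing a regex, one hedged sketch (same preview-before-delete caveat as above; drop -delete first to see what would go) is to group case-insensitive -iname tests and negate the group:
find /path/to/files -type f ! \( -iname '*.avi' -o -iname '*.mkv' -o -iname '*.wmv' \
  -o -iname '*.mp4' -o -iname '*.mp5' -o -iname '*.flv' -o -iname '*.m4v' \
  -o -iname '*.mpeg' -o -iname '*.mov' -o -iname '*.m1v' -o -iname '*.m2v' \
  -o -iname '*.3gp' -o -iname '*.avchd' \) -delete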
You can use a quick and dirty rm command to accomplish what you want, but keep in mind it is error-prone, non-portable, dangerous, and has severe limitations.
As others have suggested, you can use the find command. I would recommend using find rather than rm in almost all cases.
Since you mention that you are on a Linux system I will use the GNU implementation that is part of the findutils package in my examples because it is the default on most Linux systems and is what I generally recommend learning since it has a much richer and more advanced set of features than many other implementations.
Though it can be daunting and seemingly over-complicated it is worth spending time to master the find command because it gives you a kind of precise expressiveness and safety that you won't find with most other methods without essentially (poorly) re-inventing this command!
Find Example
People often suggest using the find command in inefficient, error-prone and dangerous ways, so below I outline a safe and effective way to accomplish exactly what you asked for in your example.
Before deleting files I recommend previewing the file list first (or at least part of the list if it is very long):
find path/to/files -type f -regextype posix-extended -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$'
The above command will show you the list of files that you will be deleting. To actually delete the files you can simply add the -delete action like so:
find path/to/files -type f -regextype posix-extended -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$' -delete
If you would like to see what will remain you can invert the matches in the preview by adding ! to the preview command (without the -delete) like so:
find path/to/files -type f -regextype posix-extended ! -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$'
The output of this inverse match should be the same as the output you will see when listing the files after performing the delete unless errors occurred due to permissions problems or unwritable filesystems:
find path/to/files -type f
Explanation
Here I will explain in some depth the options I chose and why:
I added -type f to restrict the matches to files-only; without that it will match non-files such as directories which you probably don't want. Also note that I put it at the beginning rather than the end because order of predicates can matter for speed; with -type f first it will execute the regex check against files-only instead of against everything... in practice it may not matter much unless you have lots of directories or non-files. Still, it's worth keeping order of predicates in mind since it can have a significant impact in some cases.
I use the case-insensitive -iregex option as opposed to the case-sensitive -regex option because I assumed that you wanted to use case-insensitive matching so it will include both .wmv and .WMV files.
You'll probably want to use extended POSIX regular expressions for simplicity and brevity. Unfortunately there is not yet a shorthand for -regextype posix-extended, but even so I would recommend using it because you can avoid the problem of having to add lots of \ backslashes to escape things in longer, more complex regexes, and it has more advanced (modern) features. The GNU implementation defaults to emacs-style regexes, which can be confusing if you're not used to them.
The -delete option should make obvious sense, however sometimes people suggest using the slower and more complex -exec rm {} \; option, but usually that is because they are not aware of the safer, faster, and easier -delete option (and in rare cases you may encounter old systems with an ancient version of find that does not have this option). It is useful to know that -exec exists, but use -delete where you can for deleting files. Also, do not pipe | the output of find to another program unless you use and understand the -print0 option, otherwise you're in for a world of hurt when you encounter files with spaces.
I included the path/to/files argument explicitly. If you leave it out, it implicitly uses . as the path; however, it is safer (especially with -delete) to state the path explicitly.
Alternate find Implementations
Even though you said you're on a Linux system I will also mention the differences that you'll encounter with the BSD implementations which includes Mac OS X! For other systems (like older Solaris boxes), good luck! Upgrade to one of the more modern find variants!
The main difference in this example concerns regular expressions. The BSD variants use basic POSIX regular expressions by default. To avoid the burdensome extra escaping that basic REs require, you can take advantage of the more modern features of extended REs by specifying the -E option with the BSD variant, achieving the same behavior as the GNU variant with -regextype posix-extended.
find -E path/to/files -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$' -type f
Note in this case that the -E option comes before the path/to/files whereas the -regextype posix-extended option for GNU comes after the path.
It is too bad that GNU does not provide a -E option yet; since I think it would be useful to have parity with the BSD variants, I will submit a patch to findutils to add this option, and if it is accepted I will update this answer accordingly.
rm - Not Recommended
Though I strongly recommend against using rm, I will give examples of how to accomplish more or less what your question specifically asked (with some caveats).
Assuming you use a shell with Bourne syntax (which is usually what you find on Linux systems, where Bash is the default shell), you can use this command:
for ext in avi mkv wmv mp4 mp5 flv M4V mpeg mov m1v m2v 3gp avchd; do rm -f path/to/files/*.$ext; done
If you use Bash and have extended globbing turned on with shopt -s extglob then you can use Pattern Matching with Filename Expansion:
rm -f path/to/files/*.+(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)
The +(pattern-list) extended globbing syntax will match one or more occurrences of the given patterns.
However, I strongly recommend against using rm because:
It is error-prone and dangerous because it is easy to accidentally put a space after the * (turning path/to/files/*.$ext into path/to/files/* .$ext), which means you will delete everything; you cannot preview the result of the command ahead of time; and it is fire-and-forget, so good luck with the aftermath.
It is non-portable because even if it happens to work in your particular shell, the same command line may not work in other shells (including other Bourne-shell variants if you are prone to using Bash-isms).
It has severe limitations because if you have files that are nested in subdirectories or even just lots of files in a single directory, then you will quickly hit the limits on command line length when using file globbing.
I wish the rm command would just rm itself into oblivion because I can think of few places where I'd rather use rm instead of (even ancient implementations of) find.
With Bash, you could first enable extglob option:
$ shopt -s extglob
And do the following:
$ rm -i !(*.avi|*.mkv|*.wmv|*.mp4)
(Note: no spaces inside the pattern list; with extglob, spaces become part of the patterns, so nothing would actually be protected.)

Exclude .svn directories from grep [duplicate]

This question already has answers here:
How can I exclude directories from grep -R?
When I grep my Subversion working copy directory, the results include a lot of files from the .svn directories. Is it possible to recursively grep a directory, but exclude all results from .svn directories?
If you have GNU Grep, it should work like this:
grep --exclude-dir=".svn"
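For example, a full recursive invocation might look like this (the pattern and path are just placeholders):
grep -r --exclude-dir=.svn "some pattern" .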
If you happen to be on a Unix system without GNU Grep, try the following:
grep -R "whatever you like" * | grep -v "\.svn/*"
For grep >=2.5.1a
You can put this into your environment (e.g. .bashrc)
export GREP_OPTIONS='--exclude-dir=".svn"'
PS: thanks to Adrinan, here is a version without the extra quotes:
export GREP_OPTIONS='--exclude-dir=.svn'
PPS: This env option is marked for deprecation: https://www.gnu.org/software/grep/manual/html_node/Environment-Variables.html "As this causes problems when writing portable scripts, this feature will be removed in a future release of grep, and grep warns if it is used. Please use an alias or script instead."
If you use ack (a 'better grep') it will handle this automatically (and do a lot of other clever things too!). It's well worth checking out.
psychoschlumpf is correct, but it only works if you have the latest version of grep. Earlier versions do not have the --exclude-dir option. However, if you have a very large codebase, double-grep-ing can take forever. Drop this in your .bashrc for a portable .svn-less grep:
alias sgrep='find . -path "*/.svn" -prune -o -print0 | xargs -0 grep'
Now you can do this:
sgrep some_var
... and get expected results.
Of course, if you're an insane person like me who just has to use the same .bashrc everywhere, you could spend 4 hours writing an overcomplicated bash function to put there instead. Or, you could just wait for an insane person like me to post it online:
http://gist.github.com/573928
grep --exclude-dir=".svn"
works because the name ".svn" is rather unique. But this might fail for a more generic name.
grep --exclude-dir="work"
is not bulletproof: if you have "/home/user/work" and "/home/user/stuff/work", it will skip both. It is not possible to specify "/*/work/*" to restrict the exclusion to only the former folder name.
As far as I could experiment, in GNU grep the simple --exclude won't exclude directories.
On my GNU grep 2.5, --exclude-dirs is not a valid option. As an alternative, this worked well for me:
grep --exclude="*.svn-base"
This should be a better solution than excluding all lines which contain .svn/ since it wouldn't accidentally filter out such lines in a real file.
Two greps will do the trick:
The first grep will get everything.
The second grep will use the output of the first grep as input (via piping). By using the -v flag, grep will select the lines which DON'T match the search terms. Voila. You are left with all the output from the first grep which does not contain .svn in the filepath.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
grep the_text_you_want_to_search_for * | grep -v '\.svn'
I tried double grep'ing on my huge code base and it took forever, so I got this solution with the help of my co-worker.
Pruning is much faster, as it stops find from descending into those directories at all, whereas 'grep -v' processes everything and only excludes the results from being displayed.
find . -name .svn -prune -o -type f -print0 | xargs -0 egrep 'YOUR STRING'
You can also alias this command in your .bashrc as
alias sgrep='find . \( -name .svn -o -name build \) -prune -o -type f -print0 | xargs -0 egrep '
Now simply use
sgrep 'whatever'
Another option, albeit one that may not be perceived as an acceptable answer is to clone the repo into git and use git grep.
Rarely, I run into svn repositories that are so massive, it's just impractical to clone via git-svn. In these rare cases, I use a double grep solution, svngrep, but as many answers here indicate, this could be slow on large repositories, and exclude '.svn' occurrences that aren't directories. I would argue that these would be extremely seldom though...
Also regarding slow performance of multiple greps, once you've used something like git, pretty much everything seems slow in svn!
One last thing: my variation of svngrep passes through colorization; beware, the implementation is ugly! Roughly: grep -rn "$what" $where | egrep -v "$ignore" | grep --color "$what"
For grep version 2.5.1 you can add multiple --exclude items to filter out the .svn files.
$ grep -V | grep grep
grep (GNU grep) 2.5.1
GREP_OPTIONS="--exclude=*.svn-base --exclude=entries --exclude=all-wcprops" grep -l -R whatever ./
I think the --exclude option for recursive searches is what you are looking for.

how do I check that two folders are the same in linux

I have moved a web site from one server to another and I copied the files using SCP
I now wish to check that all the files have been copied OK.
How do I compare the sites?
Count files for a folder?
Get the total files size for folder tree?
or is there a better way to compare the sites?
Paul
Use diff with the recursive -r and quick -q options. It is the best and by far the fastest way to do this.
diff -r -q /path/to/dir1 /path/to/dir2
It won't tell you what the differences are (remove the -q option to see that), but it will very quickly tell you if all the files are the same.
If it shows no output, all the files are the same, otherwise it will list the files that are different.
If you were using scp, you could probably have used rsync.
rsync won't transfer files that are already up to date, so you can use it to verify a copy is current by simply running rsync again.
If you were doing something like this on the old host:
scp -r from/my/dir newhost:/to/new/dir
Then you could do something like
rsync -a --progress from/my/dir newhost:/to/new/dir
The '-a' is short for 'archive' which does a recursive copy and preserves permissions, ownerships etc. Check the man page for more info, as it can do a lot of clever things.
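To use rsync purely as a verifier, a hedged sketch (all standard rsync flags; nothing is transferred, differing files are only listed) would be:
rsync -a --checksum --dry-run --itemize-changes from/my/dir newhost:/to/new/dir
--checksum forces a full content comparison instead of the default size-and-mtime check, so it is slower but thorough.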
cd website
find . -type f -print | sort | xargs sha1sum
will produce a list of checksums for the files. You can then diff those to see if there are any missing/added/different files.
maybe you can use something similar to this:
find <original root dir> | xargs md5sum > original
find <new root dir> | xargs md5sum > new
diff original new
To add to the reply from Sidney:
It is not strictly necessary to filter with -type f or to produce checksums if you only want to compare the lists of names.
In reply to zidarsk8: note that, unlike ls, find does not sort its output (it lists entries in directory traversal order), so keeping the sort is the safer choice for a stable comparison. The listing approach works for empty directories as well.
To summarize, top 3 best answers would be:
(P.S. Nice to do a dry run with rsync)
diff -r -q /path/to/dir1 /path/to/dir2
diff <(cd dir1 && find | sort) <(cd dir2 && find | sort)
rsync --dry-run -avh from/my/dir newhost:/to/new/dir
Make checksums for all files, for example using md5sum. If they're all the same for all the files and no file is missing, everything's OK.
If you used scp, you probably can also use rsync over ssh.
rsync -avH --delete-after 1.example.com:/path/to/your/dir 2.example.com:/path/to/your/
rsync does the checksums for you.
Be sure to use the -n option to perform a dry-run. Check the manual page.
I prefer rsync over scp or even local cp, every time I can use it.
If rsync is not an option, md5sum can generate md5 digests and md5sum --check will check them.
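A small sketch of that manifest workflow (the directory name website and the manifest name checksums.md5 are just examples):
# on the source host, from the parent of the site directory
find website -type f -exec md5sum {} + > checksums.md5
# copy checksums.md5 to the destination host, then, from the same relative location:
md5sum --check checksums.md5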
Try diffing your directory recursively. You'll get a nice summary if something is different in one of the directories.
I have moved a web site from one server to another and I copied the files using SCP
You could do this with rsync, it is great if you just want to mirror something.
/Johan
Update: Seems like #rjack beat me to the rsync answer by 6 seconds :-)
I would add this to Douglas Leeder's or Eineki's answer, but sadly I don't have enough reputation to comment. Anyway, their answers are both great, except that they don't work for file names with spaces. To make that work, do
find [dir1] -type f -print0 | xargs -0 [preferred hash function] > [file1]
find [dir2] -type f -print0 | xargs -0 [preferred hash function] > [file2]
diff -y [file1] [file2]
Just from experimenting, I also like to use the -W ### argument on diff and output it to a file, which is easier to parse and understand in the terminal.
...when comparing two folders across a network drive or on separate computers
If comparing two folders on the same computer, diff is fine, as explained by the main answer.
However, if trying to compare two folders on different computers, or across a network, don't do that! If across a network, it will take forever since it has to actually transmit every byte of every file in the folder across the network. So, if you are comparing a 3 GB dir, all 3 GB have to be transferred across the network just to see if the remote dir and local dir are the same.
Instead, use a SHA256 hash: hash the directory locally on each computer. Here is how:
(From my answer here: How to hash all files in an entire directory, including the filenames as well as their contents):
# 1. First, cd to the dir in which the dir of interest is found. This is
# important! If you don't do this, then the paths output by find will differ
# between the two computers since the absolute paths to `mydir` differ. We are
# going to hash the paths too, not just the file contents, so this matters.
cd /home/gabriel # example on computer 1
cd /home/gabriel/dev/repos # example on computer 2
# 2. hash all files inside `mydir`, then hash the list of all hashes and their
# respective file paths. This obtains one single final hash. Sorting is
# necessary by piping to `sort` to ensure we get a consistent file order in
# order to ensure a consistent final hash result. Piping to awk extracts
# just the hash.
find mydir -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
Example run and output:
$ find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4
Do this on each computer, then ensure the hashes are the same to know if the directories are the same.
Note that the above commands ignore empty directories, file permissions, timestamps of when files were last edited, etc. For most cases though that's ok.
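If you also want empty directories (though still not permissions or timestamps) to influence the result, one hedged variant is to fold the directory paths into the hashed stream as well:
{ find mydir -type f -exec sha256sum {} +; find mydir -type d; } | sort | sha256sum | awk '{print $1}'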
You can also use rsync to basically do this same thing for you, even when copying or comparing across a network.

Resources