How to find all directories which include sources? - linux

I have a project with a few directories (not all of them known in advance). I want to issue a command to find all directories which include sources. Something like find . -name "*.cpp" will give me a list of source files, while I want just a list of the directories which contain them. The project structure is not known in advance; some sources may exist in directory X and others in a subdirectory X/Y. What will be the command which prints the list of all directories which include sources?

find . -name "*.cpp" -exec dirname {} \; | sort -u
If (a) you have GNU find or a recent version of BSD find and (b) you have a recent version of dirname (such as GNU coreutils 8.21 or FreeBSD 10 but not OSX 10.10), then, for greater efficiency, use (Hat tip: Jochen and mklement0):
find . -name "*.cpp" -exec dirname {} + | sort -u

John1024's answer is elegant and fast, IF your version of dirname supports multiple arguments and you can invoke it with -exec dirname {} +.
Otherwise, with -exec dirname {} \;, a child process is forked for each and every input filename, which is quite slow.
If:
your dirname doesn't support multiple arguments
and performance matters
and you're using bash 4 or higher
consider the following solution:
shopt -s globstar; printf '%s\n' ./**/*.cpp | sed 's|/[^/]*$||' | sort -u
shopt -s globstar activates support for cross-directory pathname expansion (globbing)
./**/*.cpp then matches .cpp files anywhere in the current directory's subtree
Note that the glob intentionally starts with ./, so that the sed command below also properly reports the top-level directory itself, should it contain matching files.
sed 's|/[^/]*$||' effectively performs the same operation as dirname, but on all input lines with a single invocation of sed.
sort -u sorts the result and outputs only unique directory names.

find . -name "*.cpp" | while read f; do dirname "$f" ; done | sort -u
should do what you need

find . -name '*.cpp' | sed -e 's/\/[^/]*$//' | sort | uniq

To simply find non-empty directories:
$ find . \! -empty -type d
For directories with only specific filetypes in it, I would use something like this:
find . -name \*.cpp | while read line; do dirname "${line}" ; done | sort -u
This finds all *.cpp files and calls dirname on each filename. The result is then sorted and made unique. There are definitely faster ways to do this using shell builtins that don't require spawning a new process for each *.cpp file, but that probably shouldn't matter for most projects.
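For example, a minimal sketch of such a built-in approach (assuming bash and a find that supports -print0), where ${f%/*} strips the file name and so does the job of dirname without a new process per file:
find . -name '*.cpp' -print0 | while IFS= read -r -d '' f; do
    printf '%s\n' "${f%/*}"   # parameter expansion instead of dirname
done | sort -u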

You should define what a source file is.
Notice that some C or C++ files are generated (e.g. by parser generators like bison or yacc, by ad-hoc awk or python or shell scripts, by generators particular to the project, etc.), and that some included C or C++ files are not named .h or .cc (read about X-macros). Within GCC a significant number of files are generated (e.g. from the *.md machine description files, which are the authentic source files).
Most large software projects (e.g. of many millions of lines of C++ or C code) use some C or C++ code generators somewhere.
In the free software world, source code is simply the preferred form of the code on which the developer is working.
Notice that source code might not even sit in a file; it could sit in a database or in some heap image, e.g. if the developer interacts with a specific program to work on it. (Remember the Smalltalk machines of the 1980s, or the Mentor structured editor at INRIA in the 1980s.) As another example, J. Pitrat's CAIA system has its C code entirely self-generated. Look also inside Scheme48.
Perhaps (as an approximate heuristic only) you should consider as a C++ source file any file suffixed .h, .cc, .cpp, .cxx, or perhaps .def, .inc or .tcc, which does not contain the words GENERATED FILE (usually inside some comment).
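As a hedged sketch of that heuristic (grep -L prints the names of files that do not contain the pattern; the suffix list and the GENERATED FILE marker are only the approximations described above):
find . -type f \( -name '*.h' -o -name '*.cc' -o -name '*.cpp' -o -name '*.cxx' \
    -o -name '*.def' -o -name '*.inc' -o -name '*.tcc' \) \
    -exec grep -L 'GENERATED FILE' {} +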
To understand which files are generated you should dive into the build procedure (described by Makefile, CMake*, Makefile.am with autoconf, etc.). There is no fool-proof way of detecting or guessing generated C++ files, so you won't be able to reliably automate their detection.
Lastly, bootstrapped languages often have a (version control) repository which contains some generated files. Ocaml has a boot/ subdirectory, and MELT has a melt/generated/ directory (containing the C++ files needed to regenerate MELT in C++ form from the *.melt source code files).
I would suggest using the project's version control repository and getting the non-empty directories there. The details depend upon the version control tool (e.g. git, svn, hg, etc.). You should use some version control (or revision control) tool; I recommend git.
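For instance, with git, a sketch along these lines would list the directories containing tracked .cpp files (it assumes GNU xargs and a dirname that accepts several arguments, as discussed in the first answer):
git ls-files -- '*.cpp' | xargs -d '\n' dirname | sort -u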

Related

linux include all directories

how would I type a file path in ubuntu terminal to include all files in all sub-directories?
If I had a main directory called "books" but had a ton of subdirectories with all sorts of different names containing files, how would I type a path to include all files in all subdirectories?
/books/???
From within the books top directory, you can use the command:
find . -type f
Then, if you wanted to, say run each file through cat, you could use the xargs command:
find . -type f | xargs cat
For more info, use commands:
man find
man xargs
It is unclear what you actually want ... You will probably get a better solution to your problem if you ask directly about it, rather than about another problem you've come across while trying to circumvent the original one.
Do you mean something like the following?
file */*
where the first * expands to all subdirectories and the second * to all contained files?
I have chosen the file command arbitrarily. You can choose whatever command you want to run on the files you get shell-expanded.
Note that directories will also be included (if not excluded by the name pattern, e.g. *.png or *.txt).
The wildcard * is not exactly the file path to include all files in all subdirectories but it expands to all files (or directories) matching the wildcard expression as a list, e.g. file1 file2 file3 file4. See also this tutorial on shell expansion.
Note that there may be easy solutions to related problems, like copying all files in all subdirectories (cp -a, for example; see man cp).
I also like find very much. It's quite easy to generate more flexible search patterns in combination with grep. To provide a random example:
du `find . | grep some_pattern_to_occur | grep -v some_pattern_to_not_occur`
./books/*
For example, assuming I'm in the parent directory of 'books':
ls ./books/*
EDIT:
Actually, to list the whole tree recursively you should use:
ls -R ./books/*

Remove all files that do not have the following extensions in Linux

I have a list of extensions:
avi,mkv,wmv,mp4,mp5,flv,M4V,mpeg,mov,m1v,m2v,3gp,avchd
I want to remove all files without the following extensions, as well as files without any extension, in a directory in Linux.
How can I do this using the Linux rm command?
You will first have to find the files that do not have those extensions. You can do this very easily with the find command. You can build on the following command -
find /path/to/files ! -name "*.avi" -type f -exec rm -i {} \;
You can also use -regex instead of -name to feed in a complex search pattern. The ! negates the match, so the command effectively lists those files that do not have the given extension.
It is good to use rm -i as it will prompt for confirmation before each deletion. That may become tedious if your list is long, so decide for yourself whether to include it.
Deleting tons of files this way can be dangerous; once deleted, you can never get them back. So make sure you run the find command without the rm first and inspect the list thoroughly before deleting anything.
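To preview files with none of several extensions before deleting, a hedged sketch building on the same idea (case-insensitive via -iname; add further -o -iname clauses for the remaining extensions):
find /path/to/files -type f ! \( -iname '*.avi' -o -iname '*.mkv' -o -iname '*.wmv' -o -iname '*.mp4' \)
Once the list looks right, append -delete (or -exec rm -i {} \;) to the same command.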
Update:
As stated in the comments by aculich, you can also do the following -
find /path/to/files ! -name "*.avi" -type f -delete
-type f ensures that it only finds and deletes regular files and does not touch directories, symlinks, etc.
You can use a quick and dirty rm command to accomplish what you want, but keep in mind it is error-prone, non-portable, dangerous, and has severe limitations.
As others have suggested, you can use the find command. I would recommend using find rather than rm in almost all cases.
Since you mention that you are on a Linux system I will use the GNU implementation that is part of the findutils package in my examples because it is the default on most Linux systems and is what I generally recommend learning since it has a much richer and more advanced set of features than many other implementations.
Though it can be daunting and seemingly over-complicated it is worth spending time to master the find command because it gives you a kind of precise expressiveness and safety that you won't find with most other methods without essentially (poorly) re-inventing this command!
Find Example
People often suggest using the find command in inefficient, error-prone and dangerous ways, so below I outline a safe and effective way to accomplish exactly what you asked for in your example.
Before deleting files I recommend previewing the file list first (or at least part of the list if it is very long):
find path/to/files -type f -regextype posix-extended -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$'
The above command will show you the list of files that you will be deleting. To actually delete the files you can simply add the -delete action like so:
find path/to/files -type f -regextype posix-extended -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$' -delete
If you would like to see what will remain you can invert the matches in the preview by adding ! to the preview command (without the -delete) like so:
find path/to/files -type f -regextype posix-extended ! -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$'
The output of this inverse match should be the same as the output you will see when listing the files after performing the delete unless errors occurred due to permissions problems or unwritable filesystems:
find path/to/files -type f
Explanation
Here I will explain in some depth the options I chose and why:
I added -type f to restrict the matches to files-only; without that it will match non-files such as directories which you probably don't want. Also note that I put it at the beginning rather than the end because order of predicates can matter for speed; with -type f first it will execute the regex check against files-only instead of against everything... in practice it may not matter much unless you have lots of directories or non-files. Still, it's worth keeping order of predicates in mind since it can have a significant impact in some cases.
I use the case-insensitive -iregex option as opposed to the case-sensitive -regex option because I assumed that you wanted to use case-insensitive matching so it will include both .wmv and .WMV files.
You'll probably want to use extended POSIX regular expressions for simplicity and brevity. Unfortunately there is not yet a shorthand for -regextype posix-extended, but I would still recommend using it because you avoid having to add lots of backslashes to escape things in longer, more complex regexes, and it has more advanced (modern) features. The GNU implementation defaults to emacs-style regexes, which can be confusing if you're not used to them.
The -delete option should make obvious sense; however, people sometimes suggest the slower and more complex -exec rm {} \; option, usually because they are not aware of the safer, faster, and easier -delete option (and in rare cases you may encounter old systems with an ancient version of find that does not have it). It is useful to know that -exec exists, but use -delete where you can for deleting files. Also, do not pipe | the output of find to another program unless you use and understand the -print0 option; otherwise you're in for a world of hurt when you encounter files with spaces (see the sketch after this list).
The path/to/files argument I included explicitly. If you leave it out it will implicitly use . as the path, however it is safer (especially with a -delete) to state the path explicitly.
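As a hedged sketch of the -print0 point above (not needed when -delete is available; the rm at the end is just an example consumer):
# NUL-separated output survives spaces and newlines in file names.
find path/to/files -type f -regextype posix-extended \
    -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$' \
    -print0 | xargs -0 rm --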
Alternate find Implementations
Even though you said you're on a Linux system, I will also mention the differences that you'll encounter with the BSD implementations, which include Mac OS X! For other systems (like older Solaris boxes), good luck! Upgrade to one of the more modern find variants!
The main difference in this example concerns regular expressions. The BSD variants use basic POSIX regular expressions by default. To avoid the burdensome extra escaping that basic REs require, you can take advantage of the more modern features of extended REs by specifying the -E option with the BSD variant, which achieves the same behavior as the GNU variant's -regextype posix-extended.
find -E path/to/files -iregex '.*\.(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)$' -type f
Note in this case that the -E option comes before the path/to/files whereas the -regextype posix-extended option for GNU comes after the path.
It is too bad that GNU does not provide a -E option (yet!); since I think it would be a useful option for parity with the BSD variants, I will submit a patch to findutils to add it, and if it is accepted I will update this answer accordingly.
rm - Not Recommended
Though I strongly recommend against using rm, I will give examples of how to accomplish more or less what your question specifically asked (with some caveats).
Assuming you use a shell with Bourne syntax (which is usually what you find on Linux systems, which default to the Bash shell), you can use this command:
for ext in avi mkv wmv mp4 mp5 flv M4V mpeg mov m1v m2v 3gp avchd; do rm -f path/to/files/*.$ext; done
If you use Bash and have extended globbing turned on with shopt -s extglob then you can use Pattern Matching with Filename Expansion:
rm -f path/to/files/*.+(avi|mkv|wmv|mp4|mp5|flv|M4V|mpeg|mov|m1v|m2v|3gp|avchd)
The +(pattern-list) extended globbing syntax will match one or more occurrences of the given patterns.
However, I strongly recommend against using rm because:
It is error-prone and dangerous because it is easy to accidentally put a space right after a *, which means you will delete everything; you cannot preview the result of the command ahead of time; it is fire-and-forget, so good luck with the aftermath.
It is non-portable because even if it happens to work in your particular shell, the same command line may not work in other shells (including other Bourne-shell variants if you are prone to using Bash-isms).
It has severe limitations because if you have files that are nested in subdirectories or even just lots of files in a single directory, then you will quickly hit the limits on command line length when using file globbing.
I wish the rm command would just rm itself into oblivion because I can think of few places where I'd rather use rm instead of (even ancient implementations of) find.
With Bash, you could first enable extglob option:
$ shopt -s extglob
And do the following:
$ rm -i !(*.avi|*.mkv|*.wmv|*.mp4)

Unix: traverse a directory

I need to traverse a directory, starting in one directory and going deeper into different subdirectories. However, I also need access to each individual file in order to modify it. Is there already a command to do this, or will I have to write a script? Could someone provide some code to help me with this task? Thanks.
The find command is just the tool for that. Its -exec flag or -print0 in combination with xargs -0 allows fine-grained control over what to do with each file.
Example: Replace all foo's by bar's in all files in /tmp and subdirectories.
find /tmp -type f -exec sed -i -e 's/foo/bar/' '{}' ';'
for i in `find` ; do
    if [ -d "$i" ] ; then : ; fi   # do something with a directory
    if [ -f "$i" ] ; then : ; fi   # do something with a file, etc.
done
This will return the whole tree (recursively) in the current directory in a list that the loop will go through.
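Note that the backtick form above word-splits on spaces in file names; a hedged sketch of a safer variant (assuming bash and a find that supports -print0):
find . -print0 | while IFS= read -r -d '' i; do
    if [ -d "$i" ] ; then : ; fi   # do something with a directory
    if [ -f "$i" ] ; then : ; fi   # do something with a file
done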
This can be easily achieved by combining find, xargs, and sed (or another file-modification command).
For example:
$ find /path/to/base/dir -type f -name '*.properties' | xargs sed -i -e '/^#/d'
This will select all files with the .properties file extension.
The xargs command feeds the file paths generated by the find command into the sed command.
The sed command deletes all lines starting with # in those files (as fed in by xargs).
Combining commands in this way is very flexible.
For example, the find command has different parameters, so you can filter by user name, file size, file path (e.g. under a /test/ subfolder), or file modification time.
Another dimension of flexibility is how and what to change in your files. For example, the sed command lets you apply substitutions to a file (specified via regular expressions). Similarly, you can use gzip to compress the files. And so on...
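For instance, a hedged sketch combining a few of those filters with a modification step (the path, age, and size thresholds here are made-up values):
# .properties files under any test/ subfolder, modified within the last 7 days
# and larger than 1 KiB, compressed in place with gzip:
find /path/to/base/dir -type f -path '*/test/*' -name '*.properties' \
    -mtime -7 -size +1k -print0 | xargs -0 gzip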
You would usually use the find command. On Linux, you have the GNU version, of course. It has many extra (and useful) options. Both the standard and the GNU find will allow you to execute a command (e.g. a shell script) on the files as they are found.
The exact details of how to make changes to the file depend on the change you want to make to the file. That is probably best scripted, with find running the script:
POSIX or GNU:
find . -type f -exec your_script '{}' +
This will run your script once for a group of files with those names provided as arguments. If you want to do it one file at a time, replace the + with ';' (or \;).
I am assuming SearchMe is the example directory name you need to traverse completely.
I am also assuming, since it was not specified, that the files you want to modify are all text files. Is this correct?
In such scenario I would suggest using the command:
find SearchMe -type f -exec vi {} \;
If you are not familiar with vi editor, just use another one (nano, emacs, kate, kwrite, gedit, etc.) and it should work as well.
Bash 4+
shopt -s globstar
for file in **
do
    if [ -f "$file" ]; then
        :   # do some processing of your file here
            # (things the find command can't do conveniently)
    fi
done

How do I write a bash script to replace words in files and then rename files?

I have a folder structure, as shown below:
I need to create a bash script that does 4 things:
It searches all the files in the generic directory, finds the string 'generic', and changes it to 'something'
As above, but changes "GENERIC" to "SOMETHING"
As above, but changes "Generic" to "Something"
Renames any file that has "generic" in its name, replacing it with "something"
Right now I am doing this process manually by using search and replace in NetBeans. I don't know much about bash scripting, but I'm sure this can be done. I'm thinking of something that I would run and that would take "Something" as the input.
Where would I start? What functions should I use? Overall guidance would be great. Thanks.
I am using Ubuntu 10.5 desktop edition.
Editing
The substitution part is a sed script - call it mapname:
sed -i.bak \
    -e 's/generic/something/g' \
    -e 's/GENERIC/SOMETHING/g' \
    -e 's/Generic/Something/g' "$@"
Note that this will change words in comments and strings too, and it will change 'generic' as part of a word rather than just the whole word. If you want just the word, then you use end-word markers around the terms: 's/\<generic\>/something/g'. The -i.bak creates backups.
You apply that with:
find . -type f -exec mapname {} +
That creates a command with a list of files and executes it. Clearly, you can, if you prefer, avoid the intermediate mapname shell/sed script (by writing the sed script in place of the word mapname in the find command). Personally, I prefer to debug things separately.
Renaming
The renaming of the files is best done with the rename command - of which there are two variants, so you'll need to read your manual. Use one of these two:
find . -name '*generic*' -depth -exec rename generic something {} +
find . -name '*generic*' -depth -exec rename s/generic/something/g {} +
(Thanks to Stephen P for pointing out that I was using a more powerful Perl-based variant of rename with full Perl regexp capacity, and to Zack and Jefromi for pointing out that the Perl one is found in the real world* too.)
Notes:
This renames directories.
It is probably worth keeping the -depth in there so that the contents of the directories are renamed before the directories; you could otherwise get messages because you rename the directory and then can't locate the files in it (because find gave you the old name to work with).
The more basic rename will move ./generic/do_generic.java to ./something/do_generic.java only. You'd need to run the command more than once to get every component of every file name changed.
* The version of rename that I use is adapted from code in the 1st Edition of the Camel book.
Steps 1-3 can be done like this:
find .../path/to/generic -type f -print0 |
xargs -0 perl -pi~ -e \
's/\bgeneric\b/something/g;
s/\bGENERIC\b/SOMETHING/g;
s/\bGeneric\b/Something/g;'
I don't understand exactly what you want to happen in step 4 so I can't help with that part.

How do I check that two folders are the same in Linux

I have moved a web site from one server to another and I copied the files using SCP
I now wish to check that all the files have been copied OK.
How do I compare the sites?
Count files for a folder?
Get the total files size for folder tree?
or is there a better way to compare the sites?
Paul
Use diff with the recursive -r and quick -q options. It is the best and by far the fastest way to do this.
diff -r -q /path/to/dir1 /path/to/dir2
It won't tell you what the differences are (remove the -q option to see that), but it will very quickly tell you if all the files are the same.
If it shows no output, all the files are the same, otherwise it will list the files that are different.
If you were using scp, you could probably have used rsync.
rsync won't transfer files that are already up to date, so you can use it to verify a copy is current by simply running rsync again.
If you were doing something like this on the old host:
scp -r from/my/dir newhost:/to/new/dir
Then you could do something like
rsync -a --progress from/my/dir newhost:/to/new/dir
The '-a' is short for 'archive' which does a recursive copy and preserves permissions, ownerships etc. Check the man page for more info, as it can do a lot of clever things.
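As a hedged sketch, the same rsync invocation can be turned into a pure verification pass with a checksum-based dry run (-n is --dry-run, -c is --checksum, -i is --itemize-changes; nothing is transferred):
rsync -a -n -c -i from/my/dir newhost:/to/new/dir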
cd website
find . -type f -print | sort | xargs sha1sum
will produce a list of checksums for the files. You can then diff those to see if there are any missing/added/different files.
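A minimal sketch of that comparison (the /tmp/*.sha1 file names are made up; run the first command on the old server, the second on the new one, then bring the two lists together, e.g. with scp):
cd website && find . -type f -print | sort | xargs sha1sum > /tmp/old.sha1   # old server
cd website && find . -type f -print | sort | xargs sha1sum > /tmp/new.sha1   # new server
diff /tmp/old.sha1 /tmp/new.sha1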
maybe you can use something similar to this:
find <original root dir> | xargs md5sum > original
find <new root dir> | xargs md5sum > new
diff original new
To add to the reply from Sidney:
It is not really necessary to filter with -type f or to produce hash codes.
In reply to zidarsk8, you don't need to sort, since find, same as ls, sorts the filenames alphabetically by default. It works for empty directories as well.
To summarize, the top 3 answers would be:
(P.S. Nice to do a dry run with rsync)
diff -r -q /path/to/dir1 /path/to/dir2
diff <(cd dir1 && find) <(cd dir2 && find)
rsync --dry-run -avh from/my/dir newhost:/to/new/dir
Make checksums for all files, for example using md5sum. If they're all the same for all the files and no file is missing, everything's OK.
If you used scp, you probably can also use rsync over ssh.
rsync -avH --delete-after 1.example.com:/path/to/your/dir 2.example.com:/path/to/your/
rsync does the checksums for you.
Be sure to use the -n option to perform a dry-run. Check the manual page.
I prefer rsync over scp or even local cp, every time I can use it.
If rsync is not an option, md5sum can generate md5 digests and md5sum --check will verify them.
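A minimal sketch of that md5sum approach (the digest file name is arbitrary; generate the digests on the source host, copy the digest file to the destination, cd into the copied directory there, and verify):
find . -type f -exec md5sum {} + > /tmp/site.md5   # on the source host, inside the site directory
md5sum --check /tmp/site.md5                       # on the destination host, inside the copied directory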
Try diffing your directory recursively. You'll get a nice summary if something is different in one of the directories.
I have moved a web site from one server to another and I copied the files using SCP
You could do this with rsync, it is great if you just want to mirror something.
/Johan
Update: Seems like #rjack beat me to the rsync answer by 6 seconds :-)
I would add this as a comment to Douglas Leeder or Eineki, but sadly I don't have enough reputation to comment. Anyway, their answers are both great, except that they don't work for file names with spaces. To make that work, do
find [dir1] -type f -print0 | xargs -0 [preferred hash function] > [file1]
find [dir2] -type f -print0 | xargs -0 [preferred hash function] > [file2]
diff -y [file1] [file2]
Just from experimenting, I also like to use the -W ### argument to diff and output the result to a file; it's easier to parse and understand in the terminal.
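For instance (the width of 200 columns is arbitrary):
diff -y -W 200 [file1] [file2] > hashes.diff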
...when comparing two folders across a network drive or on separate computers
If comparing two folders on the same computer, diff is fine, as explained by the main answer.
However, if trying to compare two folders on different computers, or across a network, don't do that! If across a network, it will take forever since it has to actually transmit every byte of every file in the folder across the network. So, if you are comparing a 3 GB dir, all 3 GB have to be transferred across the network just to see if the remote dir and local dir are the same.
Instead, use a SHA256 hash. Hash the dir on each computer locally, on that machine itself. Here is how:
(From my answer here: How to hash all files in an entire directory, including the filenames as well as their contents):
# 1. First, cd to the dir in which the dir of interest is found. This is
# important! If you don't do this, then the paths output by find will differ
# between the two computers since the absolute paths to `mydir` differ. We are
# going to hash the paths too, not just the file contents, so this matters.
cd /home/gabriel # example on computer 1
cd /home/gabriel/dev/repos # example on computer 2
# 2. hash all files inside `mydir`, then hash the list of all hashes and their
# respective file paths. This obtains one single final hash. Sorting is
# necessary by piping to `sort` to ensure we get a consistent file order in
# order to ensure a consistent final hash result. Piping to awk extracts
# just the hash.
find mydir -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
Example run and output:
$ find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4
Do this on each computer, then ensure the hashes are the same to know if the directories are the same.
Note that the above commands ignore empty directories, file permissions, timestamps of when files were last edited, etc. For most cases though that's ok.
You can also use rsync to basically do this same thing for you, even when copying or comparing across a network.
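A hedged sketch of that rsync approach (the host name is made up; -n makes it a dry run so nothing is copied, -c compares by checksum, and --delete also reports files that exist only on the far side):
rsync -rvnc --delete mydir/ otherhost:/home/gabriel/dev/repos/mydir/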
