How to recursively remove different files in two directories - linux

I have two directory trees: one contains 200 .txt files and the other contains 210 .txt files. I need a script to find the file names that differ between the two and remove those files.

There are probably better ways, but here is one approach:
find directory1 directory2 -name \*.txt -printf '%f\n' |
sort | uniq -u |
xargs -I{} find directory1 directory2 -name {} -delete
find directory1 directory2 -name \*.txt -printf '%f\n':
print the basename of each file matching the glob *.txt
sort | uniq -u:
only print unique lines (if you wanted to delete the duplicates instead, it would have been uniq -d)
xargs -I{} find directory1 directory2 -name {} -delete:
remove them (re-specify the path to narrow the search and avoid deleting files outside the initial search path)
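Before running the destructive version, you can do a dry run by swapping -delete for -print and reviewing the list (a sketch assuming the same directory1/directory2 layout):
find directory1 directory2 -name \*.txt -printf '%f\n' |
sort | uniq -u |
xargs -I{} find directory1 directory2 -name {} -print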
Notes
Thanks to @KlausPrinoth for all the suggestions.
Obviously I'm assuming a GNU userland, I suppose people running with the tools providing bare minimum POSIX compatibility will be able to adapt it.

Yet another way is to use diff, which is more than capable of finding differences between files in directories. For instance, if you have d1 and d2 containing your 200 and 210 files respectively (with the first 200 files being the same), you can use diff and process substitution to provide the names to remove to a while loop:
( while read -r line; do printf "rm %s\n" "${line##*: }"; done < <(diff -q d1 d2) )
Output (of d1 with 10 files, d2 with 12 files)
rm file11.txt
rm file12.txt
diff will not fit all circumstances, but it does a great job of finding directory differences and is quite flexible.
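If you want to act on those names rather than just print rm commands, here is a minimal sketch; it assumes GNU diff's English "Only in DIR: FILE" messages and filenames that contain neither newlines nor ": ", and it keeps an echo in front of rm until the output looks right:
diff -q d1 d2 | awk -F': ' '/^Only in / { sub(/^Only in /, "", $1); print $1 "/" $2 }' |
while IFS= read -r path; do
    echo rm -- "$path"    # drop the echo to actually delete
done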

Related

list base files in a folder with numerous date stamped versions of a file

I've got a folder with numerous versions of files (thousands of them), each with a unique date/time stamp as the file extension. For example:
./one.20190422
./one.20190421
./one.20190420
./folder/two.txt.20190420
./folder/two.txt.20190421
./folder/folder/three.mkv.20190301
./folder/folder/three.mkv.20190201
./folder/folder/three.mkv.20190101
./folder/four.doc.20190401
./folder/four.doc.20190329
./folder/four.doc.20190301
I need to get a unique list of the base files. For example, for the above example, this would be the expected output:
./one
./folder/two.txt
./folder/folder/three.mkv
./folder/four.doc
I've come up with the below code, but am wondering if there is a better, more efficient way.
# find all directories
find ./ -type d | while read folder ; do
# go into that directory
# then find all the files in that directory, excluding sub-directories
# remove the extension (date/time stamp)
# sort and remove duplicates
# then loop through each base file
cd "$folder" && find . -maxdepth 1 -type f -exec bash -c 'printf "%s\n" "${#%.*}"' _ {} + | sort -u | while read file ; do
# and find all the versions of that file
ls "$file".* | customFunctionToProcessFiles
done
done
If it matters, the end goal is to find all the versions of a specific file, grouped by base file, and process them. So my plan was to get the base files, then loop through the list and find all the version files. So, using the above example again, I'd process all the one.* files first, then the two.* files, etc...
Is there a better, faster, and/or more efficient way to accomplish this?
Some notes:
There are potentially thousands of files. I know I could just search for all files from the root folder, remove the date/time extension, sort and get unique, but since there may be thousands of files I thought it might be more efficient to loop through the directories.
The date/time stamp extension of the file is not in my control and it may not always be just numbers. The only thing I can guarantee is it is on the end after a period. And, whatever format the date/time is in, all the files will share it -- there won't be some files with one format and other files with another format.
You can use find ./ -type f -regex to look for files directly
find ./ -type f -regex '.*\.[0-9]+'
./some_dir/asd.mvk.20190422
./two.txt.20190420
Also, you can pipe the result to your function through xargs without needing while loops:
re='(.*)(\.[0-9]{8})'
find ./ -type f -regextype posix-egrep -regex "$re" | \
sed -re "s/$re/\1/" | \
sort -u | \
xargs -r customFunctionToProcessFiles
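If the filenames might contain spaces or other awkward characters, a null-delimited variant of the same idea (a sketch assuming GNU find, sed, sort and xargs, the same eight-digit stamp, and customFunctionToProcessFiles as your own placeholder) would be:
re='(.*)(\.[0-9]{8})'
find ./ -type f -regextype posix-egrep -regex "$re" -print0 | \
sed -zre "s/$re/\1/" | \
sort -zu | \
xargs -r0 customFunctionToProcessFiles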

How to find/list the directories where a particular sub-directory is not present

I am writing a shell script that checks whether a bin directory is present under each user's directory under /home. The bin directory can be present directly under the user's directory or under a child directory of the user's directory.
For example, say I have a user amit under /home. The bin directory could then be present directly as /home/amit/bin, or as /home/amit/jash/bin.
Now my requirement is to get a list of user directories where the bin directory is not present, either directly under the user directory or under a child directory of it. I tried the command:
find /home -type d ! -exec test -e '{}/bin' \; -print
but it is not working. However, when I replace the bin directory with some file, the command works fine. It looks like this command is meant for files in particular. Is there a similar command for directories? Any help on this will be greatly appreciated.
You're on the right track. The catch is that your test of "does the following directory NOT exist in this target" can't be expressed within find's conditions in such a way as to return only the top-level directory. So you need to nest, one way or another.
One strategy would be to use a for loop in bash:
$ cd /home && mkdir foo bar baz one two
$ mkdir bar/bin baz/bin
$ for d in /home/*/; do find "$d" -type d -name bin | grep -q . || echo "$d"; done
/home/foo/
/home/one/
/home/two/
This uses pathname expansion (globbing) to generate the list of directories to test, and then checks for the existence of "bin". If that check fails (i.e. find outputs nothing), the directory is printed. Note the trailing slash on /home/*/, which ensures that you will only be searching within directories, rather than files that might accidentally exist in /home/.
Another possibility might be to use nested finds, if you don't want to depend on bash:
$ find /home/ -mindepth 1 -maxdepth 1 -type d -not -exec sh -c "find {}/ -type d -name bin -print | grep -q . " \; -print
/home/foo
/home/one
/home/two
This roughly duplicates the effect of the bash for loop above, but by nesting find within find -exec. It uses grep -q . to convert the output of find into an exit status that can be used as a condition for the outer find.
Note that since you're looking for a bin directory, we want to use test -d rather than test -e (the latter would also match a plain file named bin, which is probably not what you want).
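Applied to the original attempt, that might look like the sketch below; note it only catches a bin directly under each user directory, so a deeper one such as /home/amit/jash/bin still needs one of the nested approaches above:
find /home -mindepth 1 -maxdepth 1 -type d ! -exec test -d '{}/bin' \; -print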
Another option is to use bash process redirection. On multiple lines for easier reading:
cd /home/
comm -3 \
<(printf '%s\n' */ | sed 's|/.*||' | sort) \
<(find */ -type d -name bin | cut -d/ -f1 | uniq)
This unfortunately requires you to change to the /home directory before running, because of the way it strips off subdirectories. You can of course collapse this into a big long one-liner if you feel so inclined.
This comm solution also has the risk of failing on directories with special characters in their names, like newlines.
One last option is bash-only but more than a one-liner. It involves subtracting the directories containing "bin" from the full list. It uses an associative array and globstar, so it depends on bash version 4.
#!/usr/bin/env bash
shopt -s globstar
# Go to our root
cd /home
# Declare an associative array
declare -A dirs=()
# Populate the array with our "full" list of home directories
for d in */; do dirs[${d%/}]=""; done
# Remove directories that contain a "bin" somewhere inside 'em
for d in **/bin; do unset dirs[${d%%/*}]; done
# Print the result in reproducible form
declare -p dirs
# Or print the result just as a list of words.
printf '%s\n' "${!dirs[@]}"
Note that we're storing directories in the array index, which (1) makes it easy for us to find and delete items, and (2) ensures unique entries, even if one user has multiple "bin" directories under their home.
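If you then want to act on the remaining directories rather than just dump the array, you can loop over its keys, for example (just a sketch that reports them):
for d in "${!dirs[@]}"; do
    echo "no bin directory found under /home/$d"
done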
cd /home
find . -maxdepth 1 -type d ! -name . | sort > a
find . -type d -name bin | cut -d/ -f1,2 | sort > b
comm -23 a b
Here, I'm making two sorted lists. The first contains all the home directories, and the second contains the top parent of any bin subdirectory. Finally I output any items from the first list not present in the second.

Counting Amount of Files in Directory Including Hidden Files with BASH

I want to count the number of files in the directory I am currently in (including hidden files). So far I have this:
ls -1a | wc -l
but I believe this returns 2 more than what I want, because it also counts "." (the current directory) and ".." (the parent directory) as files. How would I go about getting the correct count?
To count all files / directories / hidden files, you can also use a bash array like this:
shopt -s nullglob dotglob
cd /whatever/path
arr=( * )
count="${#arr[@]}"
This also works with filenames that contain space or newlines.
Edit:
ls piped to wc is not the right tool for that job, because filenames on UNIX can contain newlines; such names would be counted multiple times.
Following @gniourf_gniourf's comment (thanks!), the following command handles newlines in file names correctly and should be used instead:
find -mindepth 1 -maxdepth 1 -printf x | wc -c
The find command lists the entries of the current directory - including hidden files - and excludes . and .. because of -mindepth 1. It works non-recursively because of -maxdepth 1.
The -printf x action simply prints an x for each file in the directory which leads to an output like this:
xxxxxxxx
Piped to wc -c (-c means counting characters) you get your final result.
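If you want to convince yourself of the difference, a throwaway demonstration in a scratch directory (a sketch assuming GNU find; /tmp/count-test is just a hypothetical location) might look like this:
mkdir /tmp/count-test && cd /tmp/count-test
touch "$(printf 'one\ntwo')"                      # a single file whose name contains a newline
ls -1a | wc -l                                    # 4: ., .., and the file counted twice
find -mindepth 1 -maxdepth 1 -printf x | wc -c    # 1, the correct count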
Former Answer:
Use the following command:
ls -1A | wc -l
-a will include all files or directories starting with a dot, but -A will exclude the current folder . and the parent folder ..
I suggest consulting man ls.
You almost got it right:
ls -1A | wc -l
If your filenames contain newlines or other funny characters, do:
find -maxdepth 1 -type f -ls | wc -l

Create a bash script to delete folders which do not contain a certain filetype

I have recently run into a problem.
I used a utility to move all my music files into directories based on tags. This left a LOT of almost empty folders. The folders, in general, contain a thumbs.db file or some sort of image for album art. The mp3s have the correct album art in their new directories, so the old ones are okay to delete.
Basically, I need to find any directories within D:/Music/ that:
-Do not have any subdirectories
-Do not contain any mp3 files
And then delete them.
I figured this would be easier to do with a shell/bash script in the Linux/Unix world than in Windows 8.1 (HAHA).
Any suggestions? I'm not very experienced writing scripts like this.
This should get you started
find /music -mindepth 1 -type d |
while read -r dt
do
find "$dt" -mindepth 1 -type d | read && continue
find "$dt" -iname '*.mp3' -type f | read && continue
echo DELETE "$dt"
done
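Once the DELETE lines look right, you could swap the echo for an actual removal, e.g. (destructive, so keep the echo until you trust the output):
rm -r -- "$dt"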
Here's the short story...
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
find . -type d -print | sort | uniq > all-dirs.tmp
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
less dirs-to-be-deleted.tmp
cat dirs-to-be-deleted.tmp | xargs rm -rf
Note that you might have to run all the commands a few times (depending on your repository's directory depth) before you're done deleting all recursive empty directories...
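If you would rather not rerun the sequence by hand, you can wrap it in a loop that stops once nothing is left to delete (a sketch, meant to be run from inside the music root; rm -rf is destructive, so review a dry run of the list first):
while true; do
    find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort -u > non-empty-dirs.tmp
    find . -type d -print | sort -u > all-dirs.tmp
    # never list the music root itself as a deletion candidate
    comm -23 all-dirs.tmp non-empty-dirs.tmp | grep -v '^\.$' > dirs-to-be-deleted.tmp
    [ -s dirs-to-be-deleted.tmp ] || break    # stop when the list is empty
    xargs rm -rf < dirs-to-be-deleted.tmp
done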
And the long story goes...
You can approach this problem from two basic perspectives. Either you find all directories, then iterate over each of them, check whether it contains any mp3 file or any subdirectory, and if not, mark that directory for deletion. This will work, but on very large repositories you can expect a significant run time.
Another approach, which is in my sense much more interesting, is to build a list of directories NOT to be deleted, and subtract that list from the list of all directories. Let's work the second strategy, one step at a time...
First of all, to find the path of all directories that contains mp3 files, you can simply do:
find . -name '*.mp3' -printf '%h\n' | sort | uniq
This means "find any file ending with .mp3, then print the path to it's parent directory".
Now, I could certainly name at least ten different approaches to find directories that contain at least one subdirectory, but keeping the same strategy as above, we can easily get...
find . -type d -printf '%h\n' | sort | uniq
What this means is: "Find any directory, then print the path to its parent."
Both of these queries can be combined in a single invocation, producing a single list containing the paths of all directories NOT to be deleted. Let's redirect that list to a temporary file.
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
Let's similarly produce a file containing the paths of all directories, no matter if they are empty or not.
find . -type d -print | sort | uniq > all-dirs.tmp
So there, we have, on one side, the complete list of all directories, and on the other, the list of directories not to be deleted. What now? There are tons of strategies, but here's a very simple one:
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
Once you have that, well, review it, and if you are satisfied, then pipe it through xargs to rm to actually delete the directories.
cat dirs-to-be-deleted.tmp | xargs rm -rf
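Since music directories frequently have spaces in their names, a slightly safer variant of that last step (assuming GNU xargs, which supports -d) is:
xargs -d '\n' rm -rf < dirs-to-be-deleted.tmp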

BASH: Checking if files are duplicates within a directory?

I am writing a house-keeping script and have files within a directory that I want to clean up.
I want to move files from a source directory to another; there are many sub-directories, so there could be files that are the same. What I want to do is either use the cmp command or md5sum on each file: if they are not duplicates, then move them; if they are the same, only move one.
So far, I have the move part working correctly as follows:
find /path/to/source -name "IMAGE_*.JPG" -exec mv '{}' /path/to/destination \;
I am assuming that I will have to loop through my directory, so I am thinking:
for files in /path/to/source
do
if -name "IMAGE_*.JPG"
then
md5sum (or cmp) $files
...stuck here (I am worried about how this method will be able to compare all the files against eachother and how I would filter them out)...
then just do the mv to finish.
Thanks in advance.
find . -type f -exec md5sum {} + | sort | uniq -w32 -d
That'll spit out all the md5 hashes that have duplicates (-w32 makes uniq compare only the 32-character hash at the start of each line, not the filename after it). Then it's just a matter of figuring out which file(s) produced those duplicate hashes.
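One way to finish that thought for the move-only-one-copy goal is sketched below; it assumes GNU md5sum's "HASH  PATH" output format and paths without newlines, keeps the first file seen for each hash, and moves just those (put an echo in front of mv to dry-run it first):
find /path/to/source -name "IMAGE_*.JPG" -exec md5sum {} + |
sort |
awk '!seen[$1]++ { sub(/^[^ ]+  /, ""); print }' |
while IFS= read -r f; do
    mv -- "$f" /path/to/destination/
done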
There's a tool designed for this purpose: fdupes.
fdupes -r dir/
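fdupes can also do the deleting for you; options vary a bit between versions, so check your man page, but something like the following is typical:
fdupes -rd /path/to/source     # prompt for which copy to keep in each duplicate set
fdupes -rdN /path/to/source    # keep the first copy in each set, delete the rest, no prompt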
dupmerge is another such tool...
