How do I classify files in Linux server by their names? - linux

How can use the ls command and options to list the repetitious filenames that are in different directories?

You can't use a single, basic ls command to do this. You'd have to use a combination of other POSIX/Unix/GNU utilities. For example, to find the duplicate filenames first:
find . -type f -exec basename "\{}" \; | sort | uniq -d > dupes
This means find all the files (-type f) through the entire directory hierarchy in the current directory (.), and execute (-exec) the command basename (which strips the directory portion) on the found file (\{}), end of command (\;). These files then sort and print out duplicate lines (uniq -d). The result goes in the file dupes. Now you have the filenames that are duplicated, but you don't know what directory they are in. Use find again to find them. Using bash as your shell:
while read filename; do find . -name "$filename" -print; done < dupes
This means loop through (while) all contents of file dupes and read into the variable filename each line. For each line, execute find again and search for the specific -name of the $filename and print it out (-print, but it's implicit so this is redundant).
Truth be told you can combine these without using an intermediate file:
find . -type f -exec basename "\{}" \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
If you're not familiar with it, the | operator means, execute the following command using the output of the previous command as the input to the following command. Example:
eje#EEWANCO-PC:~$ mkdir test
eje#EEWANCO-PC:~$ cd test
eje#EEWANCO-PC:~/test$ mkdir 1 2 3 4 5
eje#EEWANCO-PC:~/test$ mkdir 1/2 2/3
eje#EEWANCO-PC:~/test$ touch 1/0000 2/1111 3/2222 4/2222 5/0000 1/2/1111 2/3/4444
eje#EEWANCO-PC:~/test$ find . -type f -exec basename "\{}" \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
./1/0000
./5/0000
./1/2/1111
./2/1111
./3/2222
./4/2222
Disclaimer: The requirement stated that the filenames were all numbers. While I have tried to design the code to handle filenames with spaces (and in tests on my system, it works), the code may break when it encounters special characters, newlines, nuls, or other unusual situations. Please note that the -exec parameter has special security considerations and should not be used by root over arbitrary user files. The simplified example provided is intended for illustrative and didactic purposes only. Please consult your man pages and relevant CERT advisories for full security implications.

I have a function in my bash profile (bash 4.4) for duplicate files.
It is true that find is the correct tool.
I use find combined with -print0 options which separates the find results with null char instead of new lines (default find action). Now i can catch all files under current directory and subdirectories.
This will ensure that results will be correct no matter if filenames contain special chars like spaces or new lines (in some very rare cases). Instead of double running find against find, you can built an array and just locate the duplicate files in this array. Then you grep the whole array using the "duplicates" as pattern.
So something like this works ok for my function:
$ IFS= readarray -t -d '' fn< <(find . -name 'file*' -print0)
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[#]##*/}") |uniq -d)
$ grep -e "$dupes" <(printf '%s\n' "${fn[#]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
This is a test:
$ IFS= readarray -t -d '' fn< <(find . -name 'file*' -print0)
# find all files and load them in an array using null delimiter
$ printf '%s\n' "${fn[#]}" #print the array
./tmp/file7
./tmp/file14
./tmp/file11
./tmp/file8
./tmp/file9
./tmp/tmp2/file09 99
./tmp/tmp2/file14.txt
./tmp/tmp2/file15.txt
./tmp/tmp2/file$100
./tmp/tmp2/file14.txt.bak
./tmp/tmp2/file15.txt.bak
./tmp/file1
./tmp/file4
./file09 99
./file14
./file$100
./file1
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[#]##*/}") |uniq -d)
#Locate duplicate files
$ echo "$dupes"
\<file$100\>$ #Mind this one with special char $ in filename
\<file09 99\>$ #Mind also this one with spaces
\<file14\>$
\<file1\>$
#I have on purpose enclose the results between \<...\> to force grep later to capture full words and avoid file1 to match file1.txt or file11
$ grep -e "$dupes" <(printf '%s\n' "${fn[#]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
file$100 ==> ./file$100 #File with special char correctly captured
file$100 ==> ./tmp/tmp2/file$100
file09 99 ==> ./file09 99 #File with spaces in name also correctly captured
file09 99 ==> ./tmp/tmp2/file09 99
file1 ==> ./file1
file1 ==> ./tmp/file1
file14 ==> ./file14 #other files named file14 like file14.txt and file14.txt.bak not captured since they are not duplicates.
file14 ==> ./tmp/file14
Tips:
This one <(printf '\<%s\>$\n' "${fn[#]##*/}") uses process substitution on the basename of the find results using bash built in parameter expansion techniques.
LC_ALL=C is required on sorting in order filenames to be sorted correctly.
In bash versions before 4.4 , the readarray does not accept -d option (delimiter). In this case you can transform find results to an array with
while IFS= read -r -d '' res;do fn+=( "$res" );done < <(find.... -print0)

Related

Searching through every file in a directory (and in any sub-directories) one by one

I'm trying to loop through every file in a directory (including files in its subdirectories) and perform some action if the file meets an if-condition.
Part of my code is as follows:
for f in $direc/*
do
if grep -q 'search_term' $f; then
#action on this file
fi
done
However, this fails in the case of subdirectories. I would be very grateful if someone could help me out.
Thank you!
The -R option to grep will read all files in the directory tree including subdirectories. Combined with the -l option to print only the matching file names, you can use that to perform an action on each file that matches.
egrep -Rl pattern directory | while read path; do echo $path && mv $path /tmp; done
For example, that would print the file name and move the file to a different directory.
Find | xargs is the usual pattern I use, and has the advantage of not getting hung up on special characters in file names (spaces etc.) if you use the -print0 option of find.
find . -type f -print0 | xargs -0 -I{} sh -c "if grep -q 'search string' '{}'; then cmd-to-run '{}'; fi"
Yes because with this syntax, grep expect to process file(s) not directories. Minimal change to your script would be to test if $f is a file or not:
...
if [ -f "$f" ] && grep -q 'search_term' $f; then
...
In reality you would probably want to get list of files with patter match and act on those:
while read f; do
: #action on file file $f
done < <(grep -rl 'search_term' $direc/)
I've opted for getting the get the list of files through <(list) because piping it into while would cause the inside of your loop to run in another process (which could be a problem in particular if you expect any variable (changes) to be accessible from outside. And unlike simple for with `` it's not as as sensitive to what filenames you encounter (namely I have spaces in mind, this would still get confused by newlines though). Speaking of which:
while read -d "" f; do
: #action on file file $f
done < <(grep -rZl 'search_term' $direc/)
Nothing should be able to confuse that, as entries are nul character delimited and that one just must not appear in a file name.
Assuming no newlines in your file names:
find "$direc" -type f -exec grep -q 'search_term' {} \; -print |
while IFS= read -r f; do
#action on this file
done

How to grep files that has different letters?

I have thousands of files in a directory that are called: abc.txt srr.txt eek.txt abb.txt and etc. I want to grep only those files that has different last two letters. Example:
Good output: abc.txt eek.txt
Bad output: ekk.txt dee.txt.
Here is what I am trying to do:
#!/bin/bash
ls -l directory |grep .txt
It greps every file that has .txt in it.
How do I grep files that has two different last letters?
I'd go with find to list the *.txt files, and grep to filter out the ones that have the last two letters the same (using a backreference):
find . -type f -name '*.txt' | grep -v '\(.\)\1\.txt$'
It essentially picks up a character then immediately tries to back-reference it before .txt, and -v provides a reverse match leaving only files that do not have the same last two characters.
UPDATE: To move the found files you can chain mv to the command:
find . -type f -name '*.txt' | grep -v '\(.\)\1\.txt$' | xargs -i -t mv {} DESTINATION
It's not a good idea to parse the result of ls (read this doc to understand why). Here is what you could do in pure Bash, without using any external commands:
#!/bin/bash
shop -s nullglob # make sure glob yields nothing if there are no matches
for file in *.txt; do # grab all .txt files
[[ -f $file ]] || continue # skip if not a regular file
last6="${file: -6}" # get the last 6 characters of file name
[[ "${last6:1:1}" != "${last6:2:1}" ]] && printf '%s\n' "$file" # pick the files that match the criteria
# change printf to mv "$file" "$target_dir" above if you want to move the files
done
I've seem to accomplish what I wanted by using this:
ls -l |awk '{print $9}' | grep -vE "(.).?\1.?\."
awk '{print $9}' prints only the .txt files
grep -vE '(.).?\1.?\.' filters any names where the three characters before the period are not unique: aaa.txt, aab.txt, aba.txt and baa.txt are all filtered.

Linux piping find and md5sum not sending output

Trying to loop every file, do some cutting, extract the first 4 characters of the MD5.
Here's what I got so far:
find . -name *.jpg | cut -f4 -d/ | cut -f1 -d. | md5sum | head -c 4
Problem is, I don't see any more output at this point. How can I send output to md5sum and continue sending the result?
md5sum reads everything from stdin till end of file (eof) and outputs md5 sum of full file. You should separate input into lines and run md5sum per line, for example with while read var loop:
find . -name *.jpg | cut -f4 -d/ | cut -f1 -d. |
while read -r a;
do echo -n $a| md5sum | head -c 4;
done
read builtin bash command will read one line from input into shell variable $a; while loop will run loop body (commands between do and done) for every return from read, and $a will be the current line. -r option of read is to not convert backslash; -n option of echo command will not add newline (if you want newline, remove -n option of echo).
This will be slow for thousands of files and more, as there are several forks/execs for every file inside loop. Faster will be some scripting with perl or python or nodejs or any other scripting language with builtin md5 hash computing (or with some library).
You can do what you are attempting to do with a short "helper" script that you call from find. For example, you could create a short script to find the basename of each file passed as an argument, remove the '.jpg' extension, and then provide the remaining name w/o extension as input to md5sum on stdin to get the md5sum of the name itself. Call the script anything you like, say namemd5.sh. Example:
#!/bin/bash
[ -z "$1" ] && exit 1 ## validate single argument
fname=$(basename "$1") ## get the filename alone
fname="${fname%.jpg}" ## remove .jpg extension
fnsum=$(md5sum - <<<"$fname") ## get md5sum of name w/o .jpg
fnsum=${fnsum%% *} ## remove trailing ' -'
echo "$fnsum - $fname" ## output md5sum - name
## (remove ' - $fname' for md5sum alone)
(note: the name is provided as part of the output for example purposes, remove if you want the md5sum alone as shown in the comment above)
Example Files
$ find /home/david/img/wp/ -type f -name "*.jpg"
/home/david/img/wp/hacker_manifesto_1200x900.jpg
/home/david/img/wp/hacker_manifesto_by_otalicus.jpg
/home/david/img/wp/reflections-triple-1920x1200.jpg
/home/david/img/wp/hacker_wallpaper_1600x900.jpg
/home/david/img/wp/Zen.jpg
/home/david/img/wp/hacker_wallpaper_by_vanilla23-dot254.jpg
/home/david/img/wp/hacker_manifesto_1600x900.jpg
Example Use/Output
$ find /home/david/img/wp/ -type f -name "*.jpg" -exec ./namemd5.sh '{}' \;
0f7d2aac158eb9f7842215e14ff6573c - hacker_manifesto_1200x900
604bc695a0bb70b8db0352267caf226f - hacker_manifesto_by_otalicus
5decea0e306f185bf988ac9934ec0e2c - reflections-triple-1920x1200
82bd8e1ad3df588eb0e0848c5f764812 - hacker_wallpaper_1600x900
0f4daba431a22c03f28977f087e4c695 - Zen
0c55cd3ebd2a847e10c20d86e80e6ceb - hacker_wallpaper_by_vanilla23-dot254
e5c1da0c2db3827d2bf81c306633cc56 - hacker_manifesto_1600x900
You can also call the script with the -execdir version within find as well, e.g.
$ find /home/david/img/wp/ -type f -name "*.jpg" -execdir \
/full/path/to/namemd5.sh '{}' \;
(note: the use of the /full/path to your helper script above)
How to find all .jpg file then execute md5sum then cut first 4 caracters:
find . -name '*.jpg' -exec md5sum {} \; | cut -b 1-4

Compare 2 Folders and Find Files with Differing Byte Counts

Using Gnome in Linux Mint 12, I copied a Folder of about 9.7 GB (containing a complex tree of subfolders) from one NTFS Flash Drive to another NTFS Flash Drive. According to Gnome the file counts match, but according to du (and other programs) the byte counts don't match. (I've had the same problem copying folders in other Linux distros and Windows XP.)
I only want to know which files don't have matching byte counts. (I don't want to compare the contents of each file, because that would take way too long.) What's the best, easiest and fastest way to find the byte-count-mismatched files?
I would adapt the answer by #user1464130 as it has trouble handling spaces in file names.
cd dir1
find . -type f -printf "%p %s\n" | sort > ~/dir1.txt
cd dir2
find . -type f -printf "%p %s\n" | sort > ~/dir2.txt
diff ~/dir1.txt ~/dir2.txt
If you want to launch a command on each file and use the result in the report, you can use the while Bash construct. This example uses md5sum to compute a checksum for each file.
find . -maxdepth 1 -type f -printf "%p %s\n" | while read path size; do echo "$path - $(md5sum $path | tr -s " " | cut -f 1 -d " ") - $size" ; done
Each $() is executed separately and allows us to compute the checksum for each file. The use of tr squeezes every consecutive spaces into a single space and cut extracts the word in the n-th position, here in the first position. If we don't do that, we get the name of the file two times because md5sum give it back on stdout.
Here is an example without using the comparison (no diff). Note that I've used a dash - to emphasize the three datas we output about each file but it could be a problem if you want to feed it to another program.
$ find . -maxdepth 1 -name "*.c" -type f -printf "%p %s\n" | while read path size; do echo "$path - $(md5sum $path | tr -s " " | cut -f 1 -d " ") - $size" ; done
./thread.c - 5f2b7b12c7cd12fcb9e9796078e5d15b - 584
./utils.c - d61bc1dbc72768e622a04f03e3b8f7a2 - 3413
EDIT : And to handle spaces in filenames and still get the checksum and the size, you can use the following code.
$ find . -maxdepth 1 -name "*.c" -type f -print0 | xargs -0 -n 1 md5sum | while read checksum path; do echo $path $(stat --printf="%s" "$path") $checksum ; done
./ini tia li za tion.c 84 31626123e9056bac2e96b472bd62f309
Did you check if both partitions have the same attributes? (block size, size, reserved space for deletions or bad blocks, etc.)
For your specific case, I would recommend rsync with option -n (or --dry-run). It will tell you which files are different. That is:
$ rsync -I -n /source/ /target/
The option -I is to ignore times. You can use the same command to make both directories equivalent (timestamp, permissions, etc.).
Check the manual of rsync or try the option --help to get more options and examples on how to use it. It is very powerful.
Assuming you need to compare dir1 and dir 2, here are the console commands:
cd dir1
find . -type f|sort|xargs ls -l| awk '{print $5,$8}' > ~/dir1.txt
cd dir2
find . -type f|sort|xargs ls -l| awk '{print $5,$8}' > ~/dir2.txt
diff ~/dir1.txt ~/dir2.txt
You may need to edit awk parameters to make it print file length and path properly.

Shell: find files in a list under a directory

I have a list containing about 1000 file names to search under a directory and its subdirectories. There are hundreds of subdirs with more than 1,000,000 files. The following command will run find for 1000 times:
cat filelist.txt | while read f; do find /dir -name $f; done
Is there a much faster way to do it?
If filelist.txt has a single filename per line:
find /dir | grep -f <(sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)
(The -f option means that grep searches for all the patterns in the given file.)
Explanation of <(sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):
The <( ... ) is called a process subsitution, and is a little similar to $( ... ). The situation is equivalent to (but using the process substitution is neater and possibly a little faster):
sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
find /dir | grep -f processed_filelist.txt
The call to sed runs the commands s#^#/#, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints them out. These commands convert the filenames into a format that will work better with grep.
s#^#/# means put a / at the before each filename. (The ^ means "start of line" in a regex)
s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line").
The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.
s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of . [ ] or *. Grep uses regexes and those characters are considered special, but we want them to be plain so we need to escape them (if we didn't escape them, then a file name like a.txt would match files like abtxt).
As an example:
$ cat filelist.txt
file1.txt
file2.txt
blah[2012].txt
blah[2011].txt
lastfile
$ sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
/file1\.txt$
/file2\.txt$
/blah\[2012\]\.txt$
/blah\[2011\]\.txt$
/lastfile$
Grep then uses each line of that output as a pattern when it is searching the output of find.
If filelist.txt is a plain list:
$ find /dir | grep -F -f filelist.txt
If filelist.txt is a pattern list:
$ find /dir | grep -f filelist.txt
Use xargs(1) for the while loop can be a bit faster than in bash.
Like this
xargs -a filelist.txt -I filename find /dir -name filename
Be careful if the file names in filelist.txt contains whitespaces, read the second paragraph in the DESCRIPTION section of xargs(1) manpage about this problem.
An improvement based on some assumptions. For example, a.txt is in filelist.txt, and you can make sure there is only one a.txt in /dir. Then you can tell find(1) to exit early when it finds the instance.
xargs -a filelist.txt -I filename find /dir -name filename -print -quit
Another solution. You can pre-process the filelist.txt, make it into a find(1) arguments list like this. This will reduce find(1) invocations:
find /dir -name 'a.txt' -or -name 'b.txt' -or -name 'c.txt'
I'm not entirely sure of the question here, but I came to this page after trying to find a way to discover which 4 of 13000 files had failed to copy.
Neither of the answers did it for me so I did this:
cp file-list file-list2
find dir/ >> file-list2
sort file-list2 | uniq -u
Which resulted with a list of the 4 files I needed.
The idea is to combine the two file lists to determine the unique entries.
sort is used to make duplicate entries adjacent to each other which is the only way uniq will filter them out.

Resources