friends,
I have a problem with my script. I am trying to find the full path of /bin/openssl, save the results to a file, then read the paths back from that file and check their permissions and owners.
But the result file /tmp/list.txt contains multiple paths. How can I make the script read and check every path from the result file?
Any ideas?
#!/bin/sh
module_id="AV.1.8.2.1"
echo " === $module_id module === "
#MODULE BODY
find / -wholename "*bin/openssl" -print > /tmp/list.txt
file="/tmp/list.txt"
path=$(cat /tmp/list.txt)
os=$(uname)
permission_other_good=0
if [ -e $path ]
then
    if [ $os = "Linux" ]
    then
        gid_min=$(grep ^GID_MIN /etc/login.defs)
        gid_min_value=$(echo $gid_min | cut -d " " -f2)
        uid_min=$(grep ^UID_MIN /etc/login.defs)
        uid_min_value=$(echo $uid_min | cut -d " " -f2)
        sys_gid_max=$(grep ^SYS_GID_MAX /etc/login.defs)
        sys_gid_max_value=$(echo $sys_gid_max | cut -d " " -f2)
        sys_uid_max=$(grep ^SYS_UID_MAX /etc/login.defs)
        sys_uid_max_value=$(echo $sys_uid_max | cut -d " " -f2)
        user_uid=$(stat -c %u $path)
        user=$(stat -c %U $path)
        group_gid=$(stat -c %g $path)
        group=$(stat -c %G $path)
        permission_other=$(stat -c %A $path | cut -b 8-10)
        if [ -z $gid_min_value ]
        then
            gid_min_value=1000
        fi
        if [ -z $uid_min_value ]
        then
            uid_min_value=1000
        fi
        if [ -z $sys_gid_max_value ]
        then
            sys_gid_max_value=$((gid_min_value-1))
        fi
        if [ -z $sys_uid_max_value ]
        then
            sys_uid_max_value=$((uid_min_value-1))
        fi
    fi
    if [ $os = "AIX" ]
    then
        gid_min_value=$(cat /etc/security/.ids | cut -d " " -f4)
        sys_gid_max_value=$((gid_min_value-1))
        uid_min_value=$(cat /etc/security/.ids | cut -d " " -f2)
        sys_uid_max_value=$((uid_min_value-1))
        user_uid=$(istat $path | awk -F " " 'NR==3{print $2}' | cut -d "(" -f1)
        user=$(istat $path | awk -F " " 'NR==3{print $2}' | cut -d "(" -f2 | cut -d ")" -f1)
        group_gid=$(istat $path | awk -F " " 'NR==3{print $4}' | cut -d "(" -f1)
        group=$(istat $path | awk -F " " 'NR==3{print $4}' | cut -d "(" -f2 | cut -d ")" -f1)
        permission_other=$(istat $path | awk -F " " 'NR==2{print $2}' | cut -b 7-9)
    fi
    if [ $permission_other = "r-x" ] || [ $permission_other = "r--" ] || [ $permission_other = "--x" ] || [ $permission_other = "---" ]
    then
        permission_other_good=1
    fi
    if [ $user_uid -le $sys_uid_max_value ] && [ $group_gid -le $sys_gid_max_value ] && [ $permission_other_good -eq 1 ]
    then
        compliant="Yes"
        actual_value="user = $user group = $group permission = $permission_other"
    else
        compliant="No"
        actual_value="user = $user group = $group permission = $permission_other"
    fi
else
    compliant="N/A"
    actual_value="File $file does not exist"
fi
# SCRIPT RESULT
echo :::$module_id:::$compliant:::$actual_value:::
echo " === End of $module_id module === "
Since you are using bash, I would advise against using temporary files, because they have a lot of caveats related to the filesystem (what happens if your disk is full? what happens if the file already exists? what happens if you don't have write permission on the base directory? etc.) and to the limitations of storing filenames in line-based text files (though this should not be an issue for finding openssl).
If you want to parse a list of files, here is a common and safe way to do that:
find / -wholename "*bin/openssl" -print0 | while IFS= read -r -d '' path
do
# Write your code here using the 'path' variable
done
But because you are assigning variables inside the loop, their values would be limited to the scope of the loop, since the pipe between find and while runs the loop in a subshell. You can circumvent this problem by using process substitution instead:
while IFS= read -r -d '' path
do
# Write your code here using the 'path' variable
done < <(find / -wholename "*bin/openssl" -print0)
This behaves essentially the same as the previous code, except that the variable scope is not limited to the loop.
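For example, here is a quick demonstration of the difference (a minimal sketch; redirecting find's errors is just to keep the output clean):
count=0
find / -wholename "*bin/openssl" -print0 2>/dev/null | while IFS= read -r -d '' path
do
    count=$((count+1))
done
echo "$count"    # still 0: the piped loop ran in a subshell

count=0
while IFS= read -r -d '' path
do
    count=$((count+1))
done < <(find / -wholename "*bin/openssl" -print0 2>/dev/null)
echo "$count"    # now reflects the number of matches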
PS. You will have to deal with the assignment of your variables over several openssl paths. If you just copy/paste your code inside the loop, compliant and actual_value will only retain the information for the last path in the loop, which is probably not what you want.
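One way to handle that (a minimal sketch, assuming your existing checks set compliant and actual_value for each path as in your script) is to print one result line per path inside the loop:
while IFS= read -r -d '' path
do
    # ... your permission/owner checks on "$path" go here ...
    echo ":::$module_id:::$compliant:::$actual_value ($path):::"
done < <(find / -wholename "*bin/openssl" -print0)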
This is how you read a file line by line in bash:
#!/bin/bash
inputfile="/tmp/list.txt"
while IFS= read -r line
do
echo "do your stuff with $line"
done < "$inputfile"
Okay, the best solution is to put the results in an array and then loop over them with for:
declare -a my_paths
my_paths=($(find / -wholename "*bin/openssl" -print 2>/dev/null))
for my_path in "${my_paths[@]}"; do
    # do your checks on "$my_path" here
done
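If the paths may contain spaces, a safer variant (a sketch assuming bash 4.4+ for mapfile -d and GNU find for -print0) reads NUL-delimited output into the array:
declare -a my_paths
mapfile -d '' my_paths < <(find / -wholename "*bin/openssl" -print0 2>/dev/null)
for my_path in "${my_paths[@]}"; do
    # do your checks on "$my_path" here
done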
I have these files. Imagine that each "test" represents the name of one server:
test10.txt
test11.txt
test12.txt
test13.txt
test14.txt
test15.txt
test16.txt
test17.txt
test18.txt
test19.txt
test1.txt
test20.txt
test21.txt
test22.txt
test23.txt
test24.txt
test25.txt
test26.txt
test27.txt
test28.txt
test29.txt
test2.txt
test30.txt
test31.txt
test32.txt
test33.txt
test34.txt
test35.txt
test36.txt
test37.txt
test38.txt
test39.txt
test3.txt
test40.txt
test4.txt
test5.txt
test6.txt
test7.txt
test8.txt
test9.txt
In each txt file, I have this type of data:
2019-10-14-00-00;/dev/hd1;1024.00;136.37;/
2019-10-14-00-00;/dev/hd2;5248.00;4230.53;/usr
2019-10-14-00-00;/dev/hd3;2560.00;481.66;/var
2019-10-14-00-00;/dev/hd4;3584.00;67.65;/tmp
2019-10-14-00-00;/dev/hd5;256.00;26.13;/home
2019-10-14-00-00;/dev/hd1;1024.00;476.04;/opt
2019-10-14-00-00;/dev/hd5;384.00;0.38;/usr/xxx
2019-10-14-00-00;/dev/hd4;256.00;21.39;/xxx
2019-10-14-00-00;/dev/hd2;512.00;216.84;/opt
2019-10-14-00-00;/dev/hd3;128.00;21.46;/var/
2019-10-14-00-00;/dev/hd8;256.00;75.21;/usr/
2019-10-14-00-00;/dev/hd7;384.00;186.87;/var/
2019-10-14-00-00;/dev/hd6;256.00;0.63;/var/
2019-10-14-00-00;/dev/hd1;128.00;0.37;/admin
2019-10-14-00-00;/dev/hd4;256.00;179.14;/opt/
2019-10-14-00-00;/dev/hd3;2176.00;492.93;/opt/
2019-10-14-00-00;/dev/hd1;256.00;114.83;/opt/
2019-10-14-00-00;/dev/hd9;256.00;41.73;/var/
2019-10-14-00-00;/dev/hd1;3200.00;954.28;/var/
2019-10-14-00-00;/dev/hd10;256.00;0.93;/var/
2019-10-14-00-00;/dev/hd10;64.00;1.33;/
2019-10-14-00-00;/dev/hd2;1664.00;501.64;/opt/
2019-10-14-00-00;/dev/hd4;256.00;112.32;/opt/
2019-10-14-00-00;/dev/hd9;2176.00;1223.1;/opt/
2019-10-14-00-00;/dev/hd11;22784.00;12325.8;/opt/
2019-10-14-00-00;/dev/hd12;256.00;2.36;/
2019-10-14-06-00;/dev/hd12;1024.00;137.18;/
2019-10-14-06-00;/dev/hd1;256.00;2.36;/
2019-10-14-00-00;/dev/hd1;1024.00;136.37;/
2019-10-14-00-00;/dev/hd2;5248.00;4230.53;/usr
2019-10-14-00-00;/dev/hd3;2560.00;481.66;/var
2019-10-14-00-00;/dev/hd4;3584.00;67.65;/tmp
2019-10-14-00-00;/dev/hd5;256.00;26.13;/home
2019-10-14-00-00;/dev/hd1;1024.00;476.04;/opt
2019-10-14-00-00;/dev/hd5;384.00;0.38;/usr/xxx
2019-10-14-00-00;/dev/hd4;256.00;21.39;/xxx
2019-10-14-00-00;/dev/hd2;512.00;216.84;/opt
2019-10-14-00-00;/dev/hd3;128.00;21.46;/var/
2019-10-14-00-00;/dev/hd8;256.00;75.21;/usr/
2019-10-14-00-00;/dev/hd7;384.00;186.87;/var/
2019-10-14-00-00;/dev/hd6;256.00;0.63;/var/
2019-10-14-00-00;/dev/hd1;128.00;0.37;/admin
2019-10-14-00-00;/dev/hd4;256.00;179.14;/opt/
2019-10-14-00-00;/dev/hd3;2176.00;492.93;/opt/
2019-10-14-00-00;/dev/hd1;256.00;114.83;/opt/
2019-10-14-00-00;/dev/hd9;256.00;41.73;/var/
2019-10-14-00-00;/dev/hd1;3200.00;954.28;/var/
2019-10-14-00-00;/dev/hd10;256.00;0.93;/var/
2019-10-14-00-00;/dev/hd10;64.00;1.33;/
2019-10-14-00-00;/dev/hd2;1664.00;501.64;/opt/
2019-10-14-00-00;/dev/hd4;256.00;112.32;/opt/
I would like to create a directory for each server, create in each directory a txt file for each FS, and put in these txt files each line that corresponds to that FS.
For that, I've tried this loop:
#!/bin/bash
directory=(ls *.txt | cut -d'.' -f1)
for d in $directory
do
if [ ! -d $d ]
then
mkdir $d
fi
done
for i in $(cat *.txt)
do
file=$(echo $i | awk -F';' '{print $2}' | sort | uniq | cut -d'/' -f3 )
data=$(echo $i | awk -F';' '{print $2}' )
echo $i | grep -w $data >> /xx/xx/xx/xx/xx/${directory/${file}.txt
done
But this loop doesn't work properly. The directories are created, but not the files inside each directory.
I would like something like:
test1/hd1.txt (with each line for the hd1 FS in hd1.txt)
And the same thing for each server.
Can you show me how to do that?
#!/bin/bash
for src in *.txt; do
    # start a subshell so we don't need to cd back afterwards
    # make "$src" be stdin before cd, so we don't need full path
    # be careful that in subshell only awk reads from stdin
    (
        # extract server name to use as directory
        dir=/xx/xx/xx/xx/xx/"${src%.txt}"
        # chain with "&&" so failures don't cause bad files
        mkdir -p "$dir" &&
        cd "$dir" &&
        awk -F \; '{ split($2, dev, "/"); print > (dev[3] ".txt") }'
    ) < "$src"
done
The awk script reads fields delimited by semicolons.
It splits the second field on slashes to extract the device name (the assumption is that the devices always have the form /dev/name).
Finally, the > sends output to the relevant file.
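If you prefer to avoid the subshell and the cd, a roughly equivalent sketch (keeping the /xx/... placeholder prefix from above) passes the directory to awk instead:
#!/bin/bash
for src in *.txt; do
    dir=/xx/xx/xx/xx/xx/"${src%.txt}"
    mkdir -p "$dir" &&
    awk -F ';' -v dir="$dir" '{ split($2, dev, "/"); print > (dir "/" dev[3] ".txt") }' "$src"
done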
For reference, you can make your script work by doing directory=$(...); adding the prefix to mkdir (assuming the prefix directories already exist); closing the reference ${directory}; and quoting all variable references for safety:
#!/bin/bash
directory=$(ls *.txt | cut -d'.' -f1)
for d in "$directory"
do
    if [ ! -d "$d" ]
    then
        mkdir /xx/xx/xx/xx/xx/"$d"
    fi
done
for i in $(cat *.txt)
do
    file=$(echo "$i" | awk -F';' '{print $2}' | sort | uniq | cut -d'/' -f3 )
    data=$(echo "$i" | awk -F';' '{print $2}' )
    echo "$i" | grep -w "$data" >> /xx/xx/xx/xx/xx/"${directory}"/"${file}".txt
done
for file in `ls *.txt`
do
    echo ${file}
    directory=`echo ${file} | cut -d'.' -f1`
    #echo ${directory}
    if [ ! -d ${directory} ]
    then
        mkdir ${directory}
    fi
    FS=`cat ${file} | awk -F';' '{print $2}' | sort | uniq | cut -d'/' -f3`
    #echo $FS
    for f in $FS
    do
        cat ${file} | grep -w -e $f > ${directory}/${f}.txt
    done
done
Explanation:
For each file in the current directory, the outer for loop will run.
In the loop, a corresponding directory is created first for the selected file.
Next, using the FS variable, we collect all the distinct file systems from that file.
Finally, an inner loop runs over those FS names to grep the matching lines and create a separate file per file system inside the directory.
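With the sample data shown above, you would expect a result roughly like:
test1/hd1.txt
test1/hd2.txt
test1/hd3.txt
...
where each hdN.txt contains the lines of test1.txt whose device field is /dev/hdN, and likewise for every other server.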
I have a few .csv files like below.
xyz0900#1#-1637746436.csv
xxx0900#1#-1637746436.csv
zzz0900#2#-1637746439.csv
yyy0900#1#-1637746436.csv
sss0900#2#-1637746439.csv
I have written a script to perform the tasks below:
Get the largest file based on the pattern which we have passed as an argument to the script.
Merge all other files having the same pattern and create a new file.
Remove the duplicate header from the new file.
Move the new file to the destination based on the parameter passed as an argument.
Example: I am passing "1637746436#home/dest1,1637746436#home/dest2" as the second argument to the script. The script below will fetch the pattern (1637746436), get the bigger file, and merge all other files (having the same pattern) with it. A new file will be created and moved to the destination (home/dest1).
The script below performs the pattern matching and execution sequentially.
How can I make the for-loop iterations run in parallel? I mean, the pattern matching of "1637746436#home/dest1,1637746436#home/dest2" should be performed simultaneously (not one after another).
Please help with this.
$ merge.sh /home/dummy/17 "1637746436#home/dest1,1637746439#home/dest2"
#!/bin/bash
current=`pwd`
source=$1
destination=$2
echo "$destination" | tr "," "\n" > $current/out.txt
cat out.txt | cut -d "#" -f1 > $current/pattern.txt
for var in `cat pattern.txt`
do
    getBiggerfile=$(ls -Sl $source/*$var.csv | head -1)
    cd $source
    getFileName=$(echo $getBiggerfile | cut -d " " -f9-)
    newFileName=$(echo $getFileName | cut -d "#" -f1)
    cat *$var.csv >> $getFileName
    header=$(head -n 1 $getFileName)
    (printf "%s\n" "$header";
     grep -vFxe "$header" $getFileName
    ) > $newFileName.csv
    rm -rf *$var.csv
    cd $current
    for var1 in `cat out.txt`
    do
        target=`echo $var1 | cut -d "#" -f2`
        id=$(echo $var1 | cut -c-10)
        if [ $id = $var ]
        then
            mv $newFileName.csv $target
        fi
    done
done
The cleanest would be to make the internals of the loop a function, and call the function inside the loop, putting it in the background (child processes), then wait for the background (child) processes to finish:
function do_the_thing(){
    source="$1"
    current="$2"
    var="$3"
    getBiggerfile=$(ls -Sl $source/*$var.csv | head -1)
    cd $source
    getFileName=$(echo $getBiggerfile | cut -d " " -f9-)
    newFileName=$(echo $getFileName | cut -d "#" -f1)
    cat *$var.csv >> $getFileName
    header=$(head -n 1 $getFileName)
    (printf "%s\n" "$header";
     grep -vFxe "$header" $getFileName
    ) > $newFileName.csv
    rm -rf *$var.csv
    cd $current
    for var1 in `cat out.txt`
    do
        target=`echo $var1 | cut -d "#" -f2`
        id=$(echo $var1 | cut -c-10)
        if [ $id = $var ]
        then
            mv $newFileName.csv $target
        fi
    done
}

for var in `cat pattern.txt`
do
    do_the_thing "$source" "$current" "$var" &
done
wait
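If there can be many patterns and you want to cap how many jobs run at once, one possible refinement (a sketch assuming bash 4.3+ for wait -n) is to throttle the loop:
max_jobs=4
for var in `cat pattern.txt`
do
    # block while the maximum number of background jobs is already running
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]
    do
        wait -n    # bash 4.3+: returns as soon as any one background job finishes
    done
    do_the_thing "$source" "$current" "$var" &
done
wait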
Hi, I have a bunch of files which can be read in a text format, and I am grepping for key features within the text files. I have multiple folders with multiple files; each folder is its own set.
a.file
b.file
c.file
d.file
etc...
Within the files there are similar features that I use to group them: I grep for the shared trait and use it as the name of a new text file, then I store the unique traits inside that text file.
similarA.txt includes uniqueA and uniqueB
similarB.txt includes uniqueC and uniqueD
So the problem I have with my script right now is that it only takes the first file and that file's unique ID and then moves on to the next folder, so it does not grep every file in the folder.
ls $parent_folder | while read p; do
    dir=$new_folder/$p
    ls -1 $parent_folder/$p/individual_folder | while read s; do
        find $parent_folder/$p/individual_folder/$s -type f -iname "*.dcm" | while read f; do
            if [[ $f == *.gz ]]; then
                cp $f $new_folder
                target_file="`ls -1 $new_folder | grep ".gz" | head -1`"
                gunzip $new_folder/$target_file
                target_file="`ls -1 $new_folder | grep ".dcm" | head -1`"
                folder_name=`/usr/local/bin/file2txt_command $new_folder/$target_file | grep -m1 "0020,000d" | cut -d "[" -f2 | cut -d "]" -f1`
                folder_namedir=$dir/$folder_name
                common_trait=`/usr/local/bin/file2txt_command $new_folder/$target_file | grep -m1 "0020,000e" | cut -d "[" -f2 | cut -d "]" -f1`
                if [ ! -a "$folder_namedir/$common_trait.txt" ]; then
                    touch $folder_namedir/$common_trait.txt
                fi
                unqiue_trait=`/usr/local/bin/file2txt_command $new_folder/$target_file | grep -m1 "0008,0018" | cut -d "[" -f2 | cut -d "]" -f1`
                grep -qsF $unqiue_trait $folder_namedir/$common_trait.txt || echo $unqiue_trait >> $folder_namedir/$common_trait.txt
                rm $new_folder/$target_file
            else
                folder_name=`/usr/local/bin/dcm4che-3.3.3/bin/file2txt_command $f | grep -m1 "0020,000d" | cut -d "[" -f2 | cut -d "]" -f1`
                folder_namedir=$dir/$folder_name
                common_trait=`/usr/local/bin/file2txt_command $f | grep -m1 "0020,000e" | cut -d "[" -f2 | cut -d "]" -f1`
                if [ ! -a "$folder_namedir/$common_trait.txt" ]; then
                    touch $folder_namedir/$common_trait.txt
                fi
                unqiue_trait=`/usr/local/bin/file2txt_command $f | grep -m1 "0008,0018" | cut -d "[" -f2 | cut -d "]" -f1`
                grep -qsF $unqiue_trait $folder_namedir/$common_trait.txt || echo $unqiue_trait >> $folder_namedir/$common_trait.txt
            fi
            exit
        done
    done
done
How can I return a list of files that are named duplicates i.e. have same name but in different case that exist in the same directory?
I don't care about the contents of the files. I just need to know the location and name of any files that have a duplicate of the same name.
Example duplicates:
/www/images/taxi.jpg
/www/images/Taxi.jpg
Ideally I need to search all files recursively from a base directory. In the above example it was /www/.
The other answer is great, but instead of the "rather monstrous" perl script I suggest
perl -pe 's!([^/]+)$!lc $1!e'
Which will lowercase just the filename part of the path.
Edit 1: In fact the entire problem can be solved with:
find . | perl -ne 's!([^/]+)$!lc $1!e; print if 1 == $seen{$_}++'
Edit 3: I found a solution using sed, sort and uniq that also will print out the duplicates, but it only works if there are no whitespaces in filenames:
find . |sed 's,\(.*\)/\(.*\)$,\1/\2\t\1/\L\2,'|sort|uniq -D -f 1|cut -f 1
Edit 2: And here is a longer script that will print out the names; it takes a list of paths on stdin, as given by find. Not so elegant, but still:
#!/usr/bin/perl -w
use strict;
use warnings;

my %dup_series_per_dir;
while (<>) {
    my ($dir, $file) = m!(.*/)?([^/]+?)$!;
    push @{$dup_series_per_dir{$dir||'./'}{lc $file}}, $file;
}

for my $dir (sort keys %dup_series_per_dir) {
    my @all_dup_series_in_dir = grep { @{$_} > 1 } values %{$dup_series_per_dir{$dir}};
    for my $one_dup_series (@all_dup_series_in_dir) {
        print "$dir\{" . join(',', sort @{$one_dup_series}) . "}\n";
    }
}
Try:
ls -1 | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 "
Simple, really :-) Aren't pipelines wonderful beasts?
The ls -1 gives you the files one per line, the tr '[A-Z]' '[a-z]' converts all uppercase to lowercase, the sort sorts them (surprisingly enough), uniq -c removes subsequent occurrences of duplicate lines whilst giving you a count as well and, finally, the grep -v " 1 " strips out those lines where the count was one.
When I run this in a directory with one "duplicate" (I copied qq to qQ), I get:
2 qq
For the "this directory and every subdirectory" version, just replace ls -1 with find . or find DIRNAME if you want a specific directory starting point (DIRNAME is the directory name you want to use).
This returns (for me):
2 ./.gconf/system/gstreamer/0.10/audio/profiles/mp3
2 ./.gconf/system/gstreamer/0.10/audio/profiles/mp3/%gconf.xml
2 ./.gnome2/accels/blackjack
2 ./qq
which are caused by:
pax> ls -1d .gnome2/accels/[bB]* .gconf/system/gstreamer/0.10/audio/profiles/[mM]* [qQ]?
.gconf/system/gstreamer/0.10/audio/profiles/mp3
.gconf/system/gstreamer/0.10/audio/profiles/MP3
.gnome2/accels/blackjack
.gnome2/accels/Blackjack
qq
qQ
Update:
Actually, on further reflection, the tr will lowercase all components of the path so that both of
/a/b/c
/a/B/c
will be considered duplicates even though they're in different directories.
If you only want duplicates within a single directory to show as a match, you can use the (rather monstrous):
perl -ne '
    chomp;
    @flds = split (/\//);
    $lstf = $flds[-1];
    $lstf =~ tr/A-Z/a-z/;
    for ($i = 0; $i ne $#flds; $i++) {
        print "$flds[$i]/";
    };
    print "$lstf\n";'
in place of:
tr '[A-Z]' '[a-z]'
What it does is to only lowercase the final portion of the pathname rather than the whole thing. In addition, if you only want regular files (no directories, FIFOs and so forth), use find -type f to restrict what's returned.
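Putting those pieces together, the recursive per-directory version of the pipeline would look something like this (a sketch; as above, it prints a count followed by the path with only the filename lowercased):
find . -type f | perl -ne '
    chomp;
    @flds = split (/\//);
    $lstf = $flds[-1];
    $lstf =~ tr/A-Z/a-z/;
    for ($i = 0; $i ne $#flds; $i++) {
        print "$flds[$i]/";
    };
    print "$lstf\n";' | sort | uniq -c | grep -v " 1 "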
I believe
ls | sort -f | uniq -i -d
is simpler, faster, and will give the same result
Following up on the response of mpez0, to detect recursively just replace "ls" by "find .".
The only problem I see with this is that if a whole directory is duplicated, then you get one entry for each file in that directory. Some human brain is required to interpret the output.
But anyway, you're not automatically deleting these files, are you?
find . | sort -f | uniq -i -d
There is a nice little command line app called findsn that you get if you compile fslint, which the deb package does not include.
It will find any files with the same name, and it's lightning fast and can handle different case.
/findsn --help
find (files) with duplicate or conflicting names.
Usage: findsn [-A -c -C] [[-r] [-f] paths(s) ...]
If no arguments are supplied the $PATH is searched for any redundant
or conflicting files.
-A reports all aliases (soft and hard links) to files.
If no path(s) specified then the $PATH is searched.
If only path(s) specified then they are checked for duplicate named
files. You can qualify this with -C to ignore case in this search.
Qualifying with -c is more restrictive as only files (or directories)
in the same directory whose names differ only in case are reported.
I.E. -c will flag files & directories that will conflict if transfered
to a case insensitive file system. Note if -c or -C specified and
no path(s) specified the current directory is assumed.
Here is an example how to find all duplicate jar files:
find . -type f -name "*.jar" -printf "%f\n" | sort -f | uniq -i -d
Replace *.jar with whatever duplicate file type you are looking for.
Here's a script that worked for me (I am not the author). The original and discussion can be found here:
http://www.daemonforums.org/showthread.php?t=4661
#! /bin/sh
# find duplicated files in directory tree
# comparing by file NAME, SIZE or MD5 checksum
# --------------------------------------------
# LICENSE(s): BSD / CDDL
# --------------------------------------------
# vermaden [AT] interia [DOT] pl
# http://strony.toya.net.pl/~vermaden/links.htm
__usage() {
echo "usage: $( basename ${0} ) OPTION DIRECTORY"
echo " OPTIONS: -n check by name (fast)"
echo " -s check by size (medium)"
echo " -m check by md5 (slow)"
echo " -N same as '-n' but with delete instructions printed"
echo " -S same as '-s' but with delete instructions printed"
echo " -M same as '-m' but with delete instructions printed"
echo " EXAMPLE: $( basename ${0} ) -s /mnt"
exit 1
}
__prefix() {
case $( id -u ) in
(0) PREFIX="rm -rf" ;;
(*) case $( uname ) in
(SunOS) PREFIX="pfexec rm -rf" ;;
(*) PREFIX="sudo rm -rf" ;;
esac
;;
esac
}
__crossplatform() {
case $( uname ) in
(FreeBSD)
MD5="md5 -r"
STAT="stat -f %z"
;;
(Linux)
MD5="md5sum"
STAT="stat -c %s"
;;
(SunOS)
echo "INFO: supported systems: FreeBSD Linux"
echo
echo "Porting to Solaris/OpenSolaris"
echo " -- provide values for MD5/STAT in '$( basename ${0} ):__crossplatform()'"
echo " -- use digest(1) instead for md5 sum calculation"
echo " $ digest -a md5 file"
echo " -- pfexec(1) is already used in '$( basename ${0} ):__prefix()'"
echo
exit 1
;;
(*)
echo "INFO: supported systems: FreeBSD Linux"
exit 1
;;
esac
}
__md5() {
__crossplatform
:> ${DUPLICATES_FILE}
DATA=$( find "${1}" -type f -exec ${MD5} {} ';' | sort -n )
echo "${DATA}" \
| awk '{print $1}' \
| uniq -c \
| while read LINE
do
COUNT=$( echo ${LINE} | awk '{print $1}' )
[ ${COUNT} -eq 1 ] && continue
SUM=$( echo ${LINE} | awk '{print $2}' )
echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
done
echo "${DATA}" \
| awk '{print $1}' \
| sort -n \
| uniq -c \
| while read LINE
do
COUNT=$( echo ${LINE} | awk '{print $1}' )
[ ${COUNT} -eq 1 ] && continue
SUM=$( echo ${LINE} | awk '{print $2}' )
echo "count: ${COUNT} | md5: ${SUM}"
grep ${SUM} ${DUPLICATES_FILE} \
| cut -d ' ' -f 2-10000 2> /dev/null \
| while read LINE
do
if [ -n "${PREFIX}" ]
then
echo " ${PREFIX} \"${LINE}\""
else
echo " ${LINE}"
fi
done
echo
done
rm -rf ${DUPLICATES_FILE}
}
__size() {
__crossplatform
find "${1}" -type f -exec ${STAT} {} ';' \
| sort -n \
| uniq -c \
| while read LINE
do
COUNT=$( echo ${LINE} | awk '{print $1}' )
[ ${COUNT} -eq 1 ] && continue
SIZE=$( echo ${LINE} | awk '{print $2}' )
SIZE_KB=$( echo ${SIZE} / 1024 | bc )
echo "count: ${COUNT} | size: ${SIZE_KB}KB (${SIZE} bytes)"
if [ -n "${PREFIX}" ]
then
find ${1} -type f -size ${SIZE}c -exec echo " ${PREFIX} \"{}\"" ';'
else
# find ${1} -type f -size ${SIZE}c -exec echo " {} " ';' -exec du -h " {}" ';'
find ${1} -type f -size ${SIZE}c -exec echo " {} " ';'
fi
echo
done
}
__file() {
__crossplatform
find "${1}" -type f \
| xargs -n 1 basename 2> /dev/null \
| tr '[A-Z]' '[a-z]' \
| sort -n \
| uniq -c \
| sort -n -r \
| while read LINE
do
COUNT=$( echo ${LINE} | awk '{print $1}' )
[ ${COUNT} -eq 1 ] && break
FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
echo "count: ${COUNT} | file: ${FILE}"
FILE=$( echo ${FILE} | sed -e s/'\['/'\\\['/g -e s/'\]'/'\\\]'/g )
if [ -n "${PREFIX}" ]
then
find ${1} -iname "${FILE}" -exec echo " ${PREFIX} \"{}\"" ';'
else
find ${1} -iname "${FILE}" -exec echo " {}" ';'
fi
echo
done
}
# main()
[ ${#} -ne 2 ] && __usage
[ ! -d "${2}" ] && __usage
DUPLICATES_FILE="/tmp/$( basename ${0} )_DUPLICATES_FILE.tmp"
case ${1} in
(-n) __file "${2}" ;;
(-m) __md5 "${2}" ;;
(-s) __size "${2}" ;;
(-N) __prefix; __file "${2}" ;;
(-M) __prefix; __md5 "${2}" ;;
(-S) __prefix; __size "${2}" ;;
(*) __usage ;;
esac
If the find command is not working for you, you may have to change it. For example
OLD : find "${1}" -type f | xargs -n 1 basename
NEW : find "${1}" -type f -printf "%f\n"
You can use:
find -type f -exec readlink -m {} \; | gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' | sort | uniq -c
Where:
find -type f
recursively prints every file found.
-exec readlink -m {} \;
resolves each file to its absolute path.
gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}'
replaces the filename part of each path with its lowercase form.
sort | uniq -c
sorts the paths (uniq needs sorted input) and counts how many times each one occurs; a count greater than 1 indicates a duplicate.
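To list only the duplicated names, you could filter the counts, e.g. (a sketch building on the pipeline above):
find -type f -exec readlink -m {} \; | gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' | sort | uniq -c | awk '$1 > 1'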
Little bit late to this one, but here's the version I went with:
find . -type f | awk -F/ '{print $NF}' | sort -f | uniq -i -d
Here we are using:
find - find all files under the current dir
awk - remove the file path part of the filename
sort - sort case insensitively
uniq - find the dupes from what makes it through the pipe
(Inspired by @mpez0's answer, and @SimonDowdles' comment on @paxdiablo's answer.)
You can check duplicates in a given directory with GNU awk:
gawk 'BEGINFILE {if ((seen[tolower(FILENAME)]++)) print FILENAME; nextfile}' *
This uses BEGINFILE to perform some action before going on and reading a file. In this case, it keeps track of the names that have appeared in an array seen[] whose indexes are the names of the files in lowercase.
If a name has already appeared, no matter its case, it prints it. Otherwise, it just jumps to the next file.
See an example:
$ tree
.
├── bye.txt
├── hello.txt
├── helLo.txt
├── yeah.txt
└── YEAH.txt
0 directories, 5 files
$ gawk 'BEGINFILE {if ((seen[tolower(FILENAME)]++)) print FILENAME; nextfile}' *
helLo.txt
YEAH.txt
I just used fdupes on CentOS to clean up a whole buncha duplicate files...
yum install fdupes
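For example (note that fdupes matches files with identical contents rather than identical names, so it answers a slightly different question than the one asked here):
fdupes -r /www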