Given a directory containing files with alphanumeric names:
file45369985.xml
file45793220.xml
file0005461x.xml
Also given a CSV table with a list of files:
file45369985.xml,file,45369985,.xml,https://www.tib.eu/de/suchen/id/FILE:45369985/Understanding-terrorism-challenges-perspectives?cHash=16d713678274dd2aa205fc07b2fc5b86
file0005461X.xml,file,0005461X,.xml,https://www.tib.eu/de/suchen/id/FILE:0005461X/The-reality-of-social-construction?cHash=5d8152fbbfae77357c1ec6f443f8c8a4
I would like to match all files in the CSV table against the directory's contents and move them somewhere else. However, I cannot switch off the case sensitivity in this command:
while read p; do
data_set=$(echo "$p" | cut -f1 -d",")
# do something else
done
How can the "X-Files" be correctly matched as well?
Given the format of the CSV file (no quotes around the first field), here is an answer for filenames without newlines.
List all files in the current directory
find . -maxdepth 1 -type f -printf "%f\n"
Look for one filename in that list (ignoring case)
grep -Fix file0005461X.xml <(find . -maxdepth 1 -type f -printf "%f\n")
Show first field only from file
cut -d"," -f1 csvfile
Pretend that the output is a file
<(cut -d"," -f1 csvfile)
Tell grep, via option -f, to use that "file" as the list of strings to look for
grep -Fixf <(cut -d"," -f1 csvfile) <(find . -maxdepth 1 -type f -printf "%f\n")
Move to /tmp
grep -Fixf <(cut -d"," -f1 csvfile) <(find . -maxdepth 1 -type f -printf "%f\n") |
xargs -I{} mv {} /tmp
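If any matched name contains quote characters or backslashes (which GNU xargs treats specially unless told otherwise), a read loop is a safer way to consume grep's output; a minimal sketch, assuming the same csvfile and /tmp destination:
grep -Fixf <(cut -d"," -f1 csvfile) <(find . -maxdepth 1 -type f -printf "%f\n") |
while IFS= read -r name; do mv -- "$name" /tmp; done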
You can use join to perform an inner join between the CSV and the file list:
join -i -t, \
<(sort -t, -k1 list.csv) \
<(find given_dir -maxdepth 1 -mindepth 1 -type f -printf "%f\n" | sort) \
-o "2.1"
Explanation:
-i: perform a case insensitive comparison for the join
-t,: use the comma as a field separator
<(sort -t, -k1 list.csv): sort the CSV file on the first field, using the comma as a field separator, and use process substitution to present the sorted output as a file argument (see the Bash manual page)
<(find given_dir -maxdepth 1 -mindepth 1 -type f -printf "%f\n" | sort): list all the files stored directly in given_dir (not in its subdirectories), sort them, and use process substitution as above
-o "2.1": list the first column of the second input (the find output) of the join result
Note: this solution relies on GNU find because of the -printf action.
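To then move the matched files, the join output can be fed into a loop; a sketch, assuming /tmp as the destination:
join -i -t, -o "2.1" \
    <(sort -t, -k1 list.csv) \
    <(find given_dir -maxdepth 1 -mindepth 1 -type f -printf "%f\n" | sort) |
while IFS= read -r name; do mv -- "given_dir/$name" /tmp; done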
awk -F'[,.]' '{ print substr($1,1,length($1)-1) toupper(substr($1,length($1))) "." $2; print substr($1,1,length($1)-1) tolower(substr($1,length($1))) "." $2 }' csvfile | while read -r line
do
find /path -name "$line" -exec mv '{}' /newpath \;
done
Use awk with the field delimiter set to both . and ,. For each line, generate both an uppercase and a lowercase variant of the file name's last character.
Loop through this output and look for each name under a given path; if the file exists, execute the move command to a given path.
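For example, the file0005461X.xml row of the sample CSV makes the awk step emit both case variants:
file0005461X.xml
file0005461x.xml
(Rows whose name ends in a digit simply produce the same name twice, which is harmless here.)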
You can use grep -i to make case insensitive matches:
while IFS= read -r p; do
  data_set=$(echo "$p" | cut -f1 -d",")
  match=$(ls "$your_dir" | grep -i "^$data_set\$")
  if [ -n "$match" ]; then
    mv "$your_dir/$match" "$another_dir"
  fi
done < csvfile
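An alternative sketch that avoids parsing ls output uses bash's nocasematch option; it assumes the same $your_dir and $another_dir and that the CSV is in csvfile:
shopt -s nocasematch
while IFS=, read -r data_set _; do
    # compare each file's basename to the CSV field, ignoring case
    for f in "$your_dir"/*; do
        [[ ${f##*/} == "$data_set" ]] && mv -- "$f" "$another_dir"
    done
done < csvfile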
Related
How to get file names sorted by the modification timestamp descending?
I should add that file names may potentially contain any special character except \0.
Here is what I've got so far: a loop that gets each file name and its mtime, but the output is unsorted:
while IFS= read -r -d '' fname; do
read -r -d '' mtime
done < <(find . -maxdepth 3 -printf '%p\0%T#\0')
If you reorder your find printf, it becomes easy to sort:
find . -maxdepth 3 -printf '%T# :: %p\0'|\
sort -zrn |\
sed -z 's/^[0-9.]* :: //' |\
xargs -0 -n1 echo
The sed and xargs lines are just examples of stripping out the mtime and then doing something with the filenames.
For files within the same folder, this will do:
$ ls -t
If you want to traverse a tree, one of these will do, depending on your platform (the stat command has different syntaxes on GNU and BSD systems):
$ find . -type f -exec stat -c '%Y %n' {} \; | sort -nr | cut -d' ' -f2-
Or:
$ find . -type f -exec stat -f '%m %N' {} \; | sort -nr | cut -d' ' -f2-
I hope this helps.
Assuming that you want a list of files and timestamps, ordered by timestamp:
while IFS=: read mtime fname ; do
echo "mtime = [$mtime] / fname = [$fname]"
done < <(find . -printf '%T#:%f\n' | sort -t: -k1,1 -n)
I've chosen : as the delimiter since it is quite rare as a character in filenames; it is even prohibited on DOS/NTFS.
With requirements this stringent (filenames that may contain : or \n), you can try:
while IFS= read -r -d '' mtime; do
read -r -d '' fname;
echo "[$mtime][$fname]";
done < <(find . -maxdepth 3 -printf '%T#\0%p\0' ) | sort -nr
Trying to solve the newlines embedded in the filenames:
while IFS= read -r -d '' mtime; do
read -r -d '' fname;
printf "[%s][%s]\0" "$mtime" "$fname";
done < <(find . -maxdepth 3 -printf '%T#\0%p\0' ) \
| sort -nrz | tr \\0 \\n
All you need is:
find . -maxdepth 3 -printf '%T#\t%p\0' | sort -zn
and if you want just the filenames, newline-terminated, pipe it to awk to remove the timestamp and tab and to replace each NUL with a newline:
find . -maxdepth 3 -printf '%T#\t%p\0' | sort -zn | awk -v RS='\0' '{sub(/^[^\t]+\t/,"")}1'
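If you then want to consume the NUL-delimited records in bash (descending, as the question asks), one sketch splits on the tab between timestamp and path:
while IFS=$'\t' read -r -d '' mtime fname; do
    printf '%s\n' "$fname"    # or do something else with "$fname"
done < <(find . -maxdepth 3 -printf '%T#\t%p\0' | sort -znr)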
Hi,
I'm trying to delete some duplicate files in a folder (approx. 50,000 files) that have the same name; the only thing that differs is a sequence number at the end:
aaaaaaaaaa.ext.84837384
aaaaaaaaaa.ext.44549388
aaaaaaaaaa.ext.22134455
bbbbbbbbbb.ext.11244355
bbbbbbbbbb.ext.88392456
I want to delete the duplicate files, keeping the one with the minimum sequence number (.22134455 to be kept for aaaaaaaaaa.ext and .11244355 to be kept for bbbbbbbbbb.ext).
As mentioned, I have a lot of files in the folder (~50,000), and sorting and filtering based on size and md5 would take forever.
I tried find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate but it takes forever.
Thank you very much
Use this
find . -name '*.ext.*' -print0 | sort -z | awk -v RS='\0' -F. '{fn=$0; num=$NF; $NF=""; if(a[$0]){printf "%s\0", fn};a[$0]++;}' | xargs -n 100 -0 rm -f
Explanation:
find . -name '*.ext.*' -print0: Print filenames delimited by a null character.
sort -z: Sort zero delimited entries.
awk: separate records by the null character and fields by a dot. Strip off the last field (the sequence number) and remember the remaining base name; for every entry after the first with the same base name, print the full filename, null-terminated.
xargs -0: receive null char separated filenames on stdin & rm -f them.
Assumption: All the files are in the current directory.
Add -maxdepth 1 option to find command, if there are sub-directories & you want to skip iterating through them.
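Before deleting anything, you can preview what the pipeline would remove by swapping rm for echo:
find . -name '*.ext.*' -print0 | sort -z | awk -v RS='\0' -F. '{fn=$0; num=$NF; $NF=""; if(a[$0]){printf "%s\0", fn};a[$0]++;}' | xargs -0 -n1 echo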
This script will remove all duplicated files in the directory it is run in.
It iterates over the files in sorted order, so the sequence number orders the duplicates; it then removes each file whose base name was already 'visited', and otherwise saves the filename minus the sequence number in a temporary variable.
#!/bin/bash
tmp_filename=
for full_filename in *; do
filename=$(basename "$full_filename")
extension="${filename##*.}"
filename="${filename%.*}"
if [[ "$tmp_filename" == "$filename" ]]; then
rm "$full_filename"
else
tmp_filename="$filename"
fi
done
I need to search all subfolders of the current folder recursively, list all files of a certain type, and count the number of duplicates.
E.g. if the current folder is home and there are two subfolders, dir1 and dir2,
then I need it to search dir1 and dir2 and list the file names and number of duplicates.
this is what i have so far:
I am using
find . -name "*.h"
to get a list of all the files of a certain type.
I need to now count duplicates and create a new list like
file1.h 2
file2.h 1
where file1.h is the file name and 2 is the overall number of duplicates.
Use uniq --count
You can use a set of core utilities to do this quickly. For example, given the following setup:
mkdir -p foo/{bar,baz}
touch foo/bar/file{1,2}.h
touch foo/baz/file{2,3}.h
you can then find (and count) the files with a pipeline like this:
find foo -name \*.h -print0 | xargs -0n1 basename | sort | uniq -c
This results in the following output:
1 file1.h
2 file2.h
1 file3.h
If you want other output formats, or to order the list in some other way than alphabetically by file, you can extend the pipeline with another sort (e.g. sort -nr) or reformat your columns with sed, awk, perl, ruby, or your text-munging language of choice.
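For example, to list the most-duplicated names first:
find foo -name \*.h -print0 | xargs -0n1 basename | sort | uniq -c | sort -nr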
find -name "*.h"|awk -F"/" '{a[$NF]++}END{for(i in a)if(a[i]>1)print i,a[i]}'
Note: this will print file names (with their counts) only when they occur more than once.
Using a shell script, the following code will print a filename if there are duplicates, then list all the duplicates below it.
The script is used as in the following example:
./find_duplicate.sh ./ Project
and will search the current directory tree for file names with 'Project' in them.
#! /bin/sh
find "${1}" -iname *"${2}"* -printf "%f\n" \
| tr '[A-Z]' '[a-z]' \
| sort -n \
| uniq -c \
| sort -n -r \
| while read LINE
do
COUNT=$( echo ${LINE} | awk '{print $1}' )
[ ${COUNT} -eq 1 ] && break
FILE=$( echo ${LINE} | cut -d ' ' -f 2- )
echo "count: ${COUNT} | file: ${FILE}"
FILE=$( echo ${FILE} | sed -e s/'\['/'\\\['/g -e s/'\]'/'\\\]'/g )
find "${1}" -iname "${FILE}" -exec echo " {}" ';'
echo
done
if you wish to search for all files (and not for a pattern in the name), replace the line:
find "${1}" -iname "*${2}*" -printf "%f\n" \
with
find "${1}" -type f -printf "%f\n" \
I need some help combining elements of scripts to form a readable output.
Basically I need to get the file name of a user from the folder structure listed below and count the number of lines in that user's folder for files of type *.ano.
This is shown in the extract below; note that the position of the username in the path is not always the same counting from the front.
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.txt
/home/user/Drive-backup/2011 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/3.ano
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.ano
awk -F/ '{print $(NF-2)}'
This will give me the username I need, but I also need to know how many non-blank lines there are in that user's folder for file type *.ano. I have the grep below that works, but I don't know how to put it all together so it outputs a file that makes sense.
grep -cv '^[[:space:]]*$' *.ano | awk -F: '{ s+=$2 } END { print s }'
Example output needed
UserA 500
UserB 2
UserC 20
find /home -name '*.ano' | awk -F/ '{print $(NF-2)}' | sort | uniq -c
That ought to give you the number of "*.ano" files per user, given that your awk is correct. I often use sort | uniq -c to count instances of a string (here, the username), as opposed to wc -l, which only counts input lines.
Enjoy.
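If you want the non-blank line counts rather than the file counts, here is a sketch assuming GNU grep (for -H) and the .../username/user.dir/file.ano layout shown in the question (it will misbehave if paths contain : or newlines):
find /home -name '*.ano' -exec grep -Hcv '^[[:space:]]*$' {} + |
    awk -F'[/:]' '{sum[$(NF-3)] += $NF} END {for (u in sum) print u, sum[u]}'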
Have a look at wc (word count).
To count the number of *.ano files in a directory you can use
find "$dir" -iname '*.ano' | wc -l
If you want to do that for all directories in some directory, you can just use a for loop:
for dir in * ; do
echo "user $dir"
find "$dir" -iname '*.ano' | wc -l
done
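Or, to get each user and count on one line, a sketch assuming the loop runs from the directory that contains the per-user folders:
for dir in */ ; do
    printf '%s %s\n' "${dir%/}" "$(find "$dir" -iname '*.ano' | wc -l)"
done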
Execute the bash script below from the folder
/home/user/Drive-backup/2010 Backup/2010 Account/Jan
and it will report the number of non-blank lines per user.
#!/bin/bash
#save where we start
base=$(pwd)
# get all top-level dirs, skip '.'
D=$(find . \( -type d ! -name . -prune \))
for d in $D; do
cd "$base"
cd "$d"
# search for all files named *.ano and count non-blank lines
sum=$(find . -type f -name '*.ano' -exec grep -cv '^[[:space:]]*$' {} \; | awk '{sum+=$0}END{print sum}')
echo "$d $sum"
done
This might be what you want (untested); it requires bash version 4 for associative arrays:
declare -A count
cd /home/user/Drive-backup
for userdir in */*/*/*; do
username=${userdir##*/}
lines=$(grep -cv '^[[:space:]]*$' "$userdir"/user.dir/*.ano | awk -F: '{sum += $NF} END {print sum}')
(( count[$username] += lines ))
done
for user in "${!count[@]}"; do
    echo "$user ${count[$user]}"
done
Here's yet another way of doing it (on Mac OS X 10.6):
find -x "$PWD" -type f -iname "*.ano" -exec bash -c '
ar=( "${#%/*}" ) # perform a "dirname" command on every array item
printf "%s\000" "${ar[#]%/*}" # do a second "dirname" and add a null byte to every array item
' arg0 '{}' + | sort -uz |
while IFS="" read -r -d '' userDir; do
# to-do: customize output to get example output needed
echo "$userDir"
basename "$userDir"
find -x "${userDir}" -type f -iname "*.ano" -print0 |
xargs -0 -n 500 grep -hcv '^[[:space:]]*$' | awk '{ s+=$0 } END { print s }'
#xargs -0 -n 500 grep -cv '^[[:space:]]*$' | awk -F: '{ s+=$NF } END { print s }'
printf '%s\n' '----------'
done
I am trying to list all directories and place its number of files next to it.
I can find the total number of files with ls -lR | grep '\.mp3' | wc -l. But how can I get an output like this:
dir1 34
dir2 15
dir3 2
...
I don't mind writing to a text file or CSV to get this information if it's not possible to get it on screen.
Thank you all for any help on this.
This seems to work assuming you are in a directory where some subdirectories may contain mp3 files. It omits the top level directory. It will list the directories in order by largest number of contained mp3 files.
find . -mindepth 2 -name \*.mp3 -print0| xargs -0 -n 1 dirname | sort | uniq -c | sort -r | awk '{print $2 "," $1}'
I updated this with print0 to handle filenames with spaces and other tricky characters and to print output suitable for CSV.
find . -type f -iname '*.mp3' -printf "%h\n" | uniq -c
Or, if the order (dir -> count instead of count -> dir) is really important to you:
find . -type f -iname '*.mp3' -printf "%h\n" | uniq -c | awk '{print $2" "$1}'
There are probably much better ways, but this seems to work.
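And if you want the largest counts first, a sketch that adds a sort before uniq -c (so repeated directories need not be adjacent in find's output) and then orders by count descending:
find . -type f -iname '*.mp3' -printf "%h\n" | sort | uniq -c | sort -nr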
Put this in a shell script:
#!/bin/sh
for f in *
do
if [ -d "$f" ]
then
cd "$f"
c=$(ls -l *.mp3 2>/dev/null | wc -l)
if test $c -gt 0
then
echo "$f $c"
fi
cd ..
fi
done
With Perl:
perl -MFile::Find -le'
find {
wanted => sub {
return unless /\.mp3$/i;
++$_{$File::Find::dir};
}
}, ".";
print "$_,$_{$_}" for
sort {
$_{$b} <=> $_{$a}
} keys %_;
'
Here's yet another way that even handles file names containing unusual (but legal) characters, such as newlines:
# count .mp3 files (using GNU find)
find . -xdev -type f -iname "*.mp3" -print0 | tr -dc '\0' | wc -c
# list directories with number of .mp3 files
find "$(pwd -P)" -xdev -depth -type d -exec bash -c '
for ((i=1; i<=$#; i++ )); do
d="${#:i:1}"
mp3s="$(find "${d}" -xdev -type f -iname "*.mp3" -print0 | tr -dc "${0}" | wc -c )"
[[ $mp3s -gt 0 ]] && printf "%s\n" "${d}, ${mp3s// /}"
done
' "'\\0'" '{}' +