Find and delete files that contain the same string in the filename, in the Linux terminal

I want to delete all files from a folder whose filenames contain a numerical string that is not unique, using the Linux terminal. E.g.:
werrt-110009.jpg => delete
asfff-110009.JPG => delete
asffa-123489.jpg => maintain
asffa-111122.JPG => maintain
Any suggestions?

I think I only now understand your question. You want to remove all files that contain a numeric value that is not unique (in a particular folder). If a filename contains a value that is also found in another filename, you want to remove both files, right?
This is how I would do that (it may not be the fastest way):
# put all files in your folder in a list
# (for array=(*) to work on an empty folder, enable nullglob: shopt -s nullglob)
array=(*)
delete=()
for elem in "${array[@]}"; do
    # for each elem in your list, extract the number before the extension dot
    num_regex='([0-9]+)\.'
    [[ "$elem" =~ $num_regex ]] || continue
    num="${BASH_REMATCH[1]}"
    # use the extracted number to check if it is unique: it must occur before
    # a dot twice in the space-joined list of names (bash's ERE engine has no
    # backreferences or lazy quantifiers, so the number is repeated literally)
    dup_regex="[^0-9]${num}\..*[^0-9]${num}\."
    # if it is not unique, put the file in the files-to-delete list
    if [[ " ${array[*]} " =~ $dup_regex ]]; then
        delete+=("$elem")
    fi
done
# delete all found duplicates
for elem in "${delete[@]}"; do
    rm "$elem"
done
In your example, array would be:
array=(werrt-110009.jpg asfff-110009.JPG asffa-123489.jpg asffa-111122.JPG)
And the result in delete would be:
delete=(werrt-110009.jpg asfff-110009.JPG)
Is this what you meant?

You can use the Linux find command along with the -regex and -delete parameters to do it in one command.
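find cannot work out by itself which numbers are duplicated, but once the duplicated number is known, something along these lines should work (a sketch, assuming GNU find; 110009 is the number from the example above):
find . -maxdepth 1 -regextype posix-extended -regex '.*-110009\.(jpg|JPG)' -delete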

Use the rm command to delete all files in a directory whose names match a string:
cd <path-to-directory>/ && rm *110009*
This deletes all files whose names contain the matching string, and it doesn't depend on the position of the string in the file name.
I mentioned the rm command above as another option for deleting files with a matching string.
Below is a complete script to achieve your requirement:
#!/bin/bash -eu
# provide the destination folder path
DEST_FOLDER_PATH="$1"
TEMP_BUILD_DIR="/tmp/$(date +%Y%m%d-%H%M%S)_cleanup_duplicate_files"
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
clean_up()
{
    if [ -d "$TEMP_BUILD_DIR" ]; then
        rm -rf "$TEMP_BUILD_DIR"
    fi
}
trap clean_up EXIT
[ ! -d "$TEMP_BUILD_DIR" ] && mkdir -p "$TEMP_BUILD_DIR"
TEMP_FILES_LIST_FILE="$TEMP_BUILD_DIR/folder_file_names.txt"
ls "$DEST_FOLDER_PATH" > "$TEMP_FILES_LIST_FILE"
num_regex='([0-9]+)\.'
while read -r filename
do
    # skip files already removed in an earlier iteration (this optimizes the
    # loop and speeds up the operation when the folder contains many files)
    [ -e "$DEST_FOLDER_PATH/$filename" ] || continue
    # check files with a number pattern (the regex must be stored unquoted in
    # a variable; a quoted regex after =~ is matched as a literal string)
    if [[ "$filename" =~ $num_regex ]]; then
        # fetch the number to find files with a similar number
        matching_string="${BASH_REMATCH[1]}"
        # use the extracted number to check if it is unique:
        # count the files containing matching_string
        if [ "$(ls -1 "$DEST_FOLDER_PATH"/*"$matching_string"* | wc -l)" -gt 1 ]; then
            rm "$DEST_FOLDER_PATH"/*"$matching_string"*
        fi
    fi
done < "$TEMP_FILES_LIST_FILE"
exit 0
How to execute this script:
Save the script into a file, e.g.
{path-to-script}/delete_duplicate_files.sh (you can rename it to
whatever you want)
Make the script executable:
chmod +x {path-to-script}/delete_duplicate_files.sh
Execute the script, providing the path of the directory where the duplicate
files (files with a matching number pattern) need to be deleted:
{path-to-script}/delete_duplicate_files.sh "{path-to-directory}"

Related

Delete files in one directory that do not exist in another directory or its child directories

I am still a newbie in shell scripting and am trying to come up with a simple script. Could anyone give me some direction here? Here is what I need.
Files in path 1: /tmp
100abcd
200efgh
300ijkl
Files in path2: /home/storage
backupfile_100abcd_str1
backupfile_100abcd_str2
backupfile_200efgh_str1
backupfile_200efgh_str2
backupfile_200efgh_str3
Now I need to delete the file 300ijkl in /tmp, as the corresponding backup file is not present in /home/storage. The /tmp directory contains more than 300 files. I need to delete the files in /tmp for which the corresponding backup files are not present, and the file names in /tmp will match file names in /home/storage or directories under /home/storage.
Appreciate your time and response.
You can also approach the deletion using grep. You can loop through the files in /tmp, checking with ls piped to grep, and deleting if there is no match:
#!/bin/bash
[ -z "$1" -o -z "$2" ] && { ## validate input
    printf "error: insufficient input. Usage: %s tmpfiles storage\n" "${0##*/}"
    exit 1
}
for i in "$1"/*; do
    fn=${i##*/} ## strip path, leaving filename only
    ## if a file in the backup matches the filename, skip rest of loop
    ls "${2}"* 2>/dev/null | grep -q "$fn" && continue
    printf "removing %s\n" "$i"
    # rm "$i" ## remove file
done
Note: the actual removal is commented out above; test and ensure there are no unintended consequences before performing the actual delete. Call it passing the path to tmp (without a trailing /) as the first argument and /home/storage as the second argument:
$ bash scriptname /path/to/tmp /home/storage
You can solve this by
making a list of the files in /home/storage
testing each filename in /tmp to see if it is in the list from /home/storage
Given the linux+shell tags, one might use bash:
make the list of files from /home/storage an associative array
make the subscript of the array the filename
Here is a sample script to illustrate ($1 and $2 are the parameters to pass to the script, i.e., /home/storage and /tmp):
#!/bin/bash
declare -A InTarget
while read -r path
do
    name=${path##*/}
    InTarget[$name]=$path
done < <(find "$1" -type f)
while read -r path
do
    name=${path##*/}
    [[ -z ${InTarget[$name]} ]] && rm -f "$path"
done < <(find "$2" -type f)
It uses two interesting shell features:
name=${path##*/} is a POSIX shell feature which lets the script perform the basename function without an extra process (per filename). That makes the script faster.
done < <(find $2 -type f) is a bash feature (process substitution) which lets the script read the list of filenames from find without running the array assignments in a subprocess. The reason for using it here is that if the array were updated in a subprocess, the update would have no effect on the array value that the script passes to the second loop.
For related discussion:
Extract File Basename Without Path and Extension in Bash
Bash Script: While-Loop Subshell Dilemma
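A quick illustration of that subshell dilemma (a minimal sketch; bash prints the counts shown in the comments):
declare -A InTarget
# piping into the loop runs it in a subshell: the array stays empty
printf 'a\nb\n' | while read -r name; do InTarget[$name]=1; done
echo "${#InTarget[@]}"    # prints 0
# process substitution keeps the loop in the current shell
while read -r name; do InTarget[$name]=1; done < <(printf 'a\nb\n')
echo "${#InTarget[@]}"    # prints 2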
I spent some really nice time on this today because I needed to delete files which have the same name but different extensions, so if anyone is looking for a quick implementation, here you go:
#!/bin/bash
# We need some reference to the files which we want to keep and not delete;
# let's assume you want to keep the files in the first folder as jpeg, so you
# need to map the second folder's names onto the desired file extension first.
FILES_TO_KEEP=$(ls -1 "$2" | sed 's/\.pdf$/.jpeg/g')
# iterate through files in the first argument's path
for file in "$1"/*; do
    # In my case, I did not want to do anything with directories, so let's continue the cycle when hitting one.
    if [[ -d $file ]]; then
        continue
    fi
    # omit the path from the iterated file with basename so we can compare it to the files we want to keep
    NAME_WITHOUT_PATH=$(basename "$file")
    # I use a Mac, which is equal to having poor-quality CLI tools
    # when it comes to operating with strings;
    # this should be a safe check to see if FILES_TO_KEEP contains NAME_WITHOUT_PATH
    if [[ $FILES_TO_KEEP == *"$NAME_WITHOUT_PATH"* ]]; then
        echo "Not deleting: $NAME_WITHOUT_PATH"
    else
        # If the other directory does not contain the file, remove it.
        echo "deleting: $NAME_WITHOUT_PATH"
        rm -rf "$file"
    fi
done
Usage: sh deleteDifferentFiles.sh path/from/where path/source/of/truth

Split multiple files

I have a directory with hundreds of files, and I have to divide all of them into files of 400 lines (or fewer).
I have tried ls and split, wc and split, and making some scripts.
Actually, I'm lost.
Please, can anybody help me?
EDIT:
Thanks to John Bollinger and his answer, this is the script we will use for our purpose:
#!/bin/bash
# $# -> all args passed to the script
# The arguments passed in order:
# $1 = num of lines (required)
# $2 = dir origin (optional)
# $3 = dir destination (optional)
if [ $# -gt 0 ]; then
    lin=$1
    if [ $# -gt 1 ]; then
        dirOrg=$2
        if [ $# -gt 2 ]; then
            dirDest=$3
            if [ ! -d "$dirDest" ]; then
                mkdir -p "$dirDest"
            fi
        else
            dirDest=$dirOrg
        fi
    else
        dirOrg=.
        dirDest=.
    fi
else
    echo "Missing parameters: NumLines [SourceDir] [DestDir]"
    exit 1
fi
# The shell glob expands to all the files in the target directory; a different
# glob pattern could be used if you want to restrict splitting to a subset,
# or if you want to include dotfiles.
for file in "$dirOrg"/*; do
    # Details of the split command are up to you. This one splits each file
    # into pieces named by appending a sequence number to the original file's
    # name. The original file is left in place.
    fileDest=${file##*/}
    split --lines="$lin" --numeric-suffixes "$file" "$dirDest"/"$fileDest"
done
exit 0
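An invocation, with made-up paths (the script name is whatever you saved it as):
bash split_by_lines.sh 400 /path/origin /path/dest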
Since you seem to know about split, and to want to use it for the job, I guess your issue revolves around using one script to wrap the whole task. The details are unclear, but something along these lines is probably what you want:
#!/bin/bash
# If an argument is given then it is the name of the directory containing the
# files to split. Otherwise, the files in the working directory are split.
if [ $# -gt 0 ]; then
    dir=$1
else
    dir=.
fi
# The shell glob expands to all the files in the target directory; a different
# glob pattern could be used if you want to restrict splitting to a subset,
# or if you want to include dotfiles.
for file in "$dir"/*; do
    # Details of the split command are up to you. This one splits each file
    # into pieces named by appending a sequence number to the original file's
    # name. The original file is left in place.
    split --lines=400 --numeric-suffixes "$file" "$file"
done
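For example, assuming the script is saved as splitall.sh and the directory holds a single 1000-line file data.txt, a run would leave something like:
$ bash splitall.sh /path/to/dir
$ ls /path/to/dir
data.txt  data.txt00  data.txt01  data.txt02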

Writing a function to replace duplicate files with hardlinks

I need to write a bash script that iterates through the files of a specified directory and replaces duplicates of files with hardlinks. Right now, my entire function looks like this:
#! /bin/bash
# sameln --- remove duplicate copies of files in specified directory
D=$1
cd $D #go to directory specified as default input
fileNum=0 #loop counter
DIR=".*|*"
for f in $DIR #for every file in the directory
do
files[$fileNum]=$f #save that file into the array
fileNum=$((fileNum+1)) #increment the counter
done
for((j=0; j<$fileNum; j++)) #for every file
do
if [ -f "$files[$j]" ] #access that file in the array
then
for((k=0; k<$fileNum; k++)) #for every other file
do
if [ -f "$files[$k]" ] #access other files in the array
then
test[cmp -s ${files[$j]} ${files[$k]}] #compare if the files are identical
[ln ${files[$j]} ${files[$k]}] #change second file to a hard link
fi
done
fi
done
Basically:
Loop through all files of depth 1 in specified directory
Put file contents into array
Compare each array item with every other array item and replace duplicates with hardlinks
The test directory has four files: a, b, c, d
a and b are different, but c and d are duplicates (they are both empty). After running the script, ls -l shows that all of the files still have a link count of 1, so the script appears to have done basically nothing.
Where am I going wrong?
DIR=".*|*"
for f in $DIR #for every file in the directory
do
echo $f
done
This code outputs
.*|*
You should not loop over files like this. Look into the find command. As you see, your code doesn't work because the first loop is already faulty.
BTW, don't name your variables all uppercase, those are reserved for system variables, I believe.
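For example, a null-delimited find loop is the usual safe way to collect the files of depth 1 (a sketch using find's -print0):
files=()
while IFS= read -r -d '' f; do
    files+=("$f")
done < <(find . -maxdepth 1 -type f -print0)
echo "found ${#files[@]} files"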
You may be making this process a bit harder on yourself than necessary. There is already a Linux command, fdupes, that scans a directory conducting byte-by-byte, md5sum, and date & time comparisons to determine whether files are duplicates of one another. It can easily find and return groups of files that are duplicates. You are left with only using the results.
Below is a quick example of using this tool for the job. NOTE: this quick example works only for filenames that do not contain spaces within them. You will have to modify it if you are dealing with filenames containing spaces. This is intended to show an approach to using a tool that already does what you want. Also note that the actual ln command is commented out below; the program just prints what it would do. After testing, you can remove the comment from the ln command once you are satisfied with the results.
#! /bin/bash
# sameln --- remove duplicate copies of files in specified directory using fdupes
[ -d "$1" ] || { # test valid directory supplied
    printf "error: invalid directory '%s'. usage: %s <dir>\n" "$1" "${0##*/}"
    exit 1
}
type fdupes &>/dev/null || { # verify fdupes is available in path
    printf "error: 'fdupes' required. Program not found within your path\n"
    exit 1
}
pushd "$1" &>/dev/null # go to directory specified as default input
declare -a files # declare files and dupes array
declare -a dupes
## read duplicate files into files array, one group per line
IFS=$'\n' read -d '' -a files < <(fdupes --sameline .)
## for each list of duplicates
for ((i = 0; i < ${#files[@]}; i++)); do
    printf "\n duplicate files %s\n\n" "${files[i]}"
    ## split into original files (no internal 'spaces' allowed in filenames)
    dupes=( ${files[i]} )
    ## for the 1st duplicate on
    for ((j = 1; j < ${#dupes[@]}; j++)); do
        ## create hardlink to original (actual command commented)
        printf " ln -f %s %s\n" "${dupes[0]}" "${dupes[j]}"
        # ln -f "${dupes[0]}" "${dupes[j]}"
    done
done
exit 0
Output/Example
$ bash rmdupes.sh dat
duplicate files ./output.dat ./tmptest ./env4.dat.out
ln -f ./output.dat ./tmptest
ln -f ./output.dat ./env4.dat.out
duplicate files ./vh.conf ./vhawk.conf
ln -f ./vh.conf ./vhawk.conf
duplicate files ./outfile.txt ./newfile.txt
ln -f ./outfile.txt ./newfile.txt
duplicate files ./z1 ./z1cpy
ln -f ./z1 ./z1cpy

Bash command to move only some files?

Let's say I have the following files in my current directory:
1.jpg
1original.jpg
2.jpg
2original.jpg
3.jpg
4.jpg
Is there a terminal/bash/linux command that can do something like
if the file [an integer]original.jpg exists,
then move [an integer].jpg and [an integer]original.jpg to another directory.
Executing such a command will cause 1.jpg, 1original.jpg, 2.jpg and 2original.jpg to be in their own directory.
NOTE
This doesn't have to be one command. It can be a combination of simple commands. Maybe something like copying the original files to a new directory, then doing some regular-expression filtering on the files in the new directory to get a list of file names from the old directory that still need to be copied over, etc.
Turning on extended glob support will allow you to write a regular-expression-like pattern. This can handle files with multi-digit integers, such as '87.jpg' and '87original.jpg'. Bash parameter expansion can then be used to strip "original" from the name of a found file to allow you to move the two related files together.
shopt -s extglob
for f in +([[:digit:]])original.jpg; do
    mv "$f" "${f/original/}" otherDirectory
done
In an extended pattern, +( x ) matches one or more of the things inside the parentheses, analogous to the regular expression x+. Here, x is any digit. Therefore, we match all files in the current directory whose name consists of 1 or more digits followed by "original.jpg".
${f/original/} is an example of bash's pattern substitution. It removes the first occurrence of the string "original" from the value of f. So if f is the string "1original.jpg", then ${f/original/} is the string "1.jpg".
well, not directly, but it's a one-liner (edit: not anymore):
for i in [0-9].jpg; do
    orig=${i%.*}original.jpg
    [ -f "$orig" ] && mv "$i" "$orig" another_dir/
done
edit: I should probably explain my solution:
for i in [0-9].jpg: execute the loop body for each jpg file whose name is a single digit; store the whole filename in $i
orig=${i%.*}original.jpg: save in $orig the possible filename of the "original file"
[ -f $orig ]: check via test(1) (the [ ... ] stuff) whether the original file for $i exists. If yes, move both files to another_dir. This is done via &&: the part after it will only be executed if the test was successful.
This should work for any strictly numeric prefix, e.g. 234.jpg:
for f in *original.jpg; do
    pre=${f%original.jpg}
    if [[ -e "$pre.jpg" && "$pre" -eq "$pre" ]] 2>/dev/null; then
        mv "$f" "$pre.jpg" targetDir
    fi
done
"$pre" -eq "$pre" gives an error if not integer
EDIT:
this fails if both original.jpg and .jpg exist:
$pre is then the null string, and "$pre" -eq "$pre" is true.
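A minimal guard for that corner case is to skip an empty prefix:
pre=${f%original.jpg}
[[ -n "$pre" ]] || continue   # a bare "original.jpg" leaves $pre empty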
The following would work and is easy to understand (replace out with the output directory, and {1..9} with the actual range of your numbers).
for x in {1..9}
do
    if [ -e ${x}original.jpg ]
    then
        mv $x.jpg out
        mv ${x}original.jpg out
    fi
done
You can obviously also enter it as a single line.
You can use regex statements to find "matches" in the file names that you are looking through, then perform your actions on the "matches" you find.
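For instance, grep -E can list the names that follow the [integer]original.jpg pattern (a sketch over ls output; it breaks on filenames containing newlines):
ls | grep -E '^[0-9]+original\.jpg$'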
integer=0; while [ $integer -le 9 ] ; do if [ -e ${integer}original.jpg ] ; then mv -vi ${integer}.jpg ${integer}original.jpg lol/ ; fi ; integer=$[ $integer + 1 ] ; done
Note that here, "lol" is the destination directory. You can change it to anything you like. Also, you can change the 9 in while [ $integer -le 9 ] to check integers larger than 9. Right now it starts at 0* and stops after checking 9*.
Edit: If you want to, you can replace the semicolons in my code with carriage returns and it may be easier to read. Also, you can paste the whole block into the terminal this way, even if that might not immediately be obvious.
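Reflowed that way, the loop reads:
integer=0
while [ $integer -le 9 ] ; do
    if [ -e ${integer}original.jpg ] ; then
        mv -vi ${integer}.jpg ${integer}original.jpg lol/
    fi
    integer=$[ $integer + 1 ]
done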

How can I delete directories based on their numeric name value with a shell script?

I have a directory that contains numerically named subdirectories ( eg. 1, 2, 3, 32000, 43546 ). I need to delete all directories over a certain number. For example, I need to delete all subdirectories that have a name that is numerically larger than 14234. Can this be done with a single command line action?
rm -r /directory/subdirectories_over_14234 ( how can I do this? )
In bash, I'd write
for dir in *; do [[ -d $dir ]] && (( dir > 14234 )) && echo rm -r $dir; done
Remove the echo at your discretion.
Well, you can use a bash for loop to iterate over the directory's entries and use the test command after extracting the target number from each name.
It should be something like this:
for file in /your/path/*
do
    # extract the number here with any text-processing command (sed?)
    if [ "$name" -gt your_value ]
    then
        rm -R "$file"
    fi
done
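A concrete version of that sketch, assuming the subdirectory names are purely numeric, might be:
for file in /your/path/*/; do
    name=${file%/}       # strip the trailing slash added by the glob
    name=${name##*/}     # keep only the directory name
    if [ "$name" -gt 14234 ] 2>/dev/null; then
        rm -R "$file"
    fi
done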
You don't mention which shell you're using. I'm using Zsh and it has a very cool feature: it can select files based on numbers just like you want! So you can do
$ rm -r /directory/<14235->(/)
to select all the subdirectories of /directory with a numeric value over 14234.
In general, you use
<a-b>
to select paths with a numeric value between a and b. You append a (/) to match only directories. Use (.) to match only files. The glob patterns in Zsh are very powerful and can mostly (if not always) replace the good old find command.
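Since rm -r is unforgiving, it is worth printing the matches before deleting (zsh):
print -rl -- /directory/<14235->(/)    # list what the glob selects
rm -r /directory/<14235->(/)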
