How to delete the smallest file if names are duplicated - Linux

I would like to clean up a folder with videos. I have a bunch of videos that were downloaded with different resolutions, so each file will start with the same name and then end with "_480p" or "_720p" etc.
I just want to keep the largest file of each such set.
So I am looking for a way to delete files based on:
check if the name before "_" is identical
if true, then delete all files except the largest one

Thinking of a flexible and fast way to approach the problem: gather a list of files whose names end in "[[:digit:]]+p", then provide those names on stdin to awk. awk can index an array with the file prefix (path plus the part of the name before '_'), which is unique per set of files, so the resolution number for each prefix can be obtained and stored at that index.
Then it's simply a matter of comparing the stored resolution number for the prefix against the current file's number and deleting the lesser of the two.
Your find command to locate all files in the directory below the current, recursively, could be:
find ./tmp -type f -regex "^.*[0-9]+p$"
What I would do is pipe the filename output to a short awk script where an array stores the last seen number for a given file prefix. If the current record's (line's) resolution number is bigger than the value stored in the array, a filename using the array number is created and that file is deleted with system() calling rm. If the current line's resolution number is less than what is already stored in the array for the prefix, you simply delete the current file.
You can do that as:
#!/usr/bin/awk -f
BEGIN { FS = "/" }
{
    num = $NF                               # last field holds name ending in number + 'p'
    prefix = $0                             # prefix is path + name up to "_[[:digit:]]+p"
    sub (/^.*_/, "", num)                   # isolate number
    sub (/p$/, "", num)                     # remove 'p' at end
    sub (/_[[:digit:]]+p$/, "", prefix)     # isolate path and name prefix
    if (prefix in a) {                      # prefix already seen in array a[] ?
        rmfile = $0                         # set file to remove to current
        if (num + 0 > a[prefix] + 0) {      # current number > stored number
            rmfile = prefix "_" a[prefix] "p"   # form remove filename from array value
            a[prefix] = num                 # update array with higher num
        }
        system ("rm " rmfile)               # delete the file
    }
    else
        a[prefix] = num                     # first num seen for prefix, store it
}
(note: the field-separator splits the fields using the directory separator so you have all file components to work with.)
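One caveat worth noting: system() hands rm the unquoted filename, so the script as shown assumes names without spaces or shell metacharacters. A hedged hardening is to quote the name when building the command (still not bulletproof for names that themselves contain a double quote):
system("rm \"" rmfile "\"")    # runs: rm "path/name_480p"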
Example Use/Output
With a representative set of files in a tmp/ directory below the current, e.g.
$ ls -1 tmp
a_480p
a_720p
b_1080p
b_480p
c_1080p
c_720p
Running the find command piped to the awk script named awkparse.sh would be as follows (don't forget to make the awk script executable, e.g. chmod +x awkparse.sh):
$ find ./tmp -type f -regex "^.*[0-9]+p$" | ./awkparse.sh
Looking at the directory after piping the results of find to the awk script, the tmp/ directory now only contains the highest resolution (largest) files for any given filename, e.g.
$ ls -1
a_720p
b_1080p
c_1080p
This is efficient: a single pass over the find output. It can also handle all files in a nested directory structure where multiple directory levels hold files you need to clean out. Look things over and let me know if you have questions.

This shell script might be what you want:
previous_prefix=
for file in *_[0-9]*[0-9]p*; do
    prefix=${file%_*}
    resolution=${file##*_}
    resolution=${resolution%%p*}
    if [ "$prefix" = "$previous_prefix" ]; then
        if [ "$resolution" -gt "$greater_resolution" ]; then
            file_to_be_removed=$greater_file
            greater_file=$file
            greater_resolution=$resolution
        else
            file_to_be_removed=$file
        fi
        echo rm -- "$file_to_be_removed"
    else
        greater_resolution=$resolution
        greater_file=$file
        previous_prefix=$prefix
    fi
done
Drop the echo if the output looks good.
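For reference, with the sample files from the first answer in the current directory, the dry run should print (since the glob expands the names in sorted order):
rm -- a_480p
rm -- b_480p
rm -- c_720p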

I would try to:
list all non-smallest files (non-480p): *_720p* and *_1080p*
for each of them replace *_720p*/*_1080p* in the name with all possible smaller resolutions
and try to delete those files with rm -f, whether they exist or not
#!/bin/bash -e
shopt -s nullglob
for file in *_1080p*; do
    rm -f -- "${file//_1080p/_720p}"
    rm -f -- "${file//_1080p/_480p}"
done
for file in *_720p*; do
    rm -f -- "${file//_720p/_480p}"
done
And here is a Bash script using nested loops to automate the above:
#!/bin/bash -e
shopt -s nullglob
res=(_1080p _720p _480p _240p)
for r in "${res[@]}"; do
    res=("${res[@]:1}")    # remove the first element of the res array
    for file in *"$r"*; do
        for r2 in "${res[@]}"; do
            rm -f -- "${file//$r/$r2}"
        done
    done
done
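As a sanity check: run inside the sample tmp/ directory shown in the first answer, this script should leave the same three survivors:
$ ls -1
a_720p
b_1080p
c_1080p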

Related

Why is the comparison of two md5sum files not working properly?

I have 2 lists of files with their md5sum checksums, and the lists have different paths for the same files.
Example of content in the first checksum file (server.list):
2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R2_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03 tmp/fastq1_L002_R1_001.fastq.gz
Example of content in the second checksum file (downloaded.list):
2c03ff18a643a1437ec0cf051b8b7b9d /home/projects/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R2_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03 /home/projects/fastq1_L002_R1_001.fastq.gz
When I run the following command, I get this output:
awk -F"/" 'FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' downloaded.list server.list
fastq1_L001_R1_001.fastq.gz has a different md5sum
fastq1_L001_R2_001.fastq.gz has a different md5sum
fastq1_L002_R1_001.fastq.gz has a different md5sum
Why am I getting these messages when the first column is the same in both files? Can someone enlighten me on this issue?
Edit:
If I remove the path and leave only the file name, it works just fine.
Edit 2:
As pointed out, there is another possible file path form, one that does not start with /. In this case, I cannot use / as the field separator.
Assumptions:
filename (sans path) and md5sum have to match
filenames may not be listed in the same order
filenames may not exist in both files
Sample data:
$ head downloaded.list server.list
==> downloaded.list <==
2c03ff18a643a1437ec0cf051b8b7b9d /home/projects/fastq1_L001_R1_001.fastq.gz # match
YYYYf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R5_911.fastq.gz # different md5sum
c430f587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R2_001.fastq.gz # match
MNOPf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R8_abc.fastq.gz # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R9_004.fastq.gz # different filename but matching md5sum (vs last line of other file)
==> server.list <==
2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz # match
c430f587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R2_001.fastq.gz # match
XXXXf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R5_911.fastq.gz # different md5sum
TUVWff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L999_R6_922.fastq.gz # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R7_933.fastq.gz # different filename but matching md5sum (vs last line of other file)
One awk idea that addresses the whitespace issue as well as verifying filename matches:
awk '                                  # stick with the default field delimiter of whitespace, but ...
{ md5sum = $1
  n = split($2,arr,"/")                # split 2nd field on "/" delimiter
  fname = arr[n]
  if (FNR==NR)
      filearray[fname] = md5sum
  else {
      if (fname in filearray && filearray[fname] == $1)
          next
      printf "%s has a different md5sum\n", fname
  }
}
' downloaded.list server.list
This generates:
fastq1_L001_R5_911.fastq.gz has a different md5sum
fastq1_L999_R6_922.fastq.gz has a different md5sum
fastq1_L001_R7_933.fastq.gz has a different md5sum
The whitespace on $1 used as an array key is causing problems. Removing it:
awk -F"/" '{gsub(/ /, "", $1)} FNR==NR{filearray[$1]=$NF; next} !($1 in filearray){printf "%s has a different md5sum\n", $NF}' downloaded.list server.list
(Note this still relies on every path starting with /; the filename-matching approach above also covers the path form from Edit 2.)

Renaming Sequentially Named Files with Date Embedded

The Situation:
I have hundreds of zip files with an arbitrary date/time mixed into their names (4-6-2021 12-34-09 AM.zip). I need to get all of these files renamed in date order (0.zip, 1.zip, 2.zip, etc.) on a Linux CLI system.
What I've tried:
I've tried ls -tr | while read i; do n=$((n+1)); mv -- "$i" "$(printf '%03d' "$n").zip"; done which almost does what I want, but the result still seems to be out of order (I think it's using the time the file was created rather than the date in the filename, which is what I need).
If I can get this done, my next step would be to rename the file (yes a single file) in each zip to the name of the zip file. I'm not sure how I'd go about this either.
tl;dr
I have these files named with a weird date system. I need the files put in date order and renamed sequentially like 0.zip 1.zip 2.zip etc. It's 3:00 AM and I don't know why I'm still up trying to solve this, and I have no idea how I'll rename the files in the zips to that sequential number (read above for more detail on this).
Thanks in advance!
GNU awk is an option here, feeding a listing of the files into awk via process substitution:
awk '{
    fil=$0                                 # Save the whole line (the filename)
    gsub("-"," ",$1)                       # Replace "-" with " " in the first space-delimited field (the date)
    split($1,map," ")                      # Split the date into the array map, using " " as the delimiter
    if (length(map[1])==1) {
        map[1]="0"map[1]                   # If the length of the day is 1, pad it out with "0"
    }
    if (length(map[2])==1) {
        map[2]="0"map[2]                   # Do the same for the month
    }
    $1=map[3]" "map[2]" "map[1]            # Rebuild the first field as "YYYY MM DD", the order mktime expects
    gsub("-"," ",$2)                       # Change "-" to " " in the time
    map1[mktime($1" "$2)]=fil              # Get the epoch timestamp of the date/time with mktime and use it as the index of array map1, with the original line (fil) as the value
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc"   # At the end of processing, set the array traversal order to numeric index, ascending
    cnt=0                                  # Initialise a cnt variable
    for (i in map1) {
        print "mv \""map1[i]"\" \""cnt".zip\""   # Loop through map1 in timestamp order, printing a move command built from the value and cnt
        cnt++
    }
}' <(for fil in *AM.zip; do echo "$fil"; done)
Once you are happy with the way the mv commands are printed, pipe the result into bash, like so:
awk '{ fil=$0;gsub("-"," ",$1);split($1,map," ");if (length(map[1])==1) { map[1]="0"map[1] };if (length(map[2])==1) { map[2]="0"map[2] };$1=map[3]" "map[2]" "map[1];gsub("-"," ",$2);map1[mktime($1" "$2)]=fil} END { PROCINFO["sorted_in"]="#ind_num_asc";cnt=0;for (i in map1) { print "mv \""map1[i]"\" \""cnt".zip\"";cnt++ } }' <(for fil in *AM.zip;do echo "$fil";done) | bash

Rename and increment on duplicate

I have a series of image files with filename formats as follows.
9407-C-9406-B-038.jpg
9407-C-9406-B-118.jpg
9422-AC-012.jpg
9422-AC-112.jpg
9422-BD-043.jpg
9422-BD-Still-001.jpg
9405_M.jpg
9792A.jpg
9792B.jpg
The relevant portion for my purpose is the first 4 characters, the rest is irrelevant.
I'd like to rename the files such that the leading 4-char string is retained, the extension is retained, but the rest is replaced by a counter that increments when a duplicate is encountered:
e.g.:
9422-AC-012.jpg becomes 942200.jpg
9422-AC-112.jpg becomes 942201.jpg
9422-BD-043.jpg becomes 942202.jpg
9422-BD-Still-001.jpg becomes 942203.jpg
9405_M.jpg becomes 940500.jpg
9792A.jpg becomes 979200.jpg
9792B.jpg becomes 979201.jpg
Using 'rename' I can strip the string after the first 4 chars, and I can increment across all files, which results in the incremental portion of the filename ending up in the thousands.
rename -n -N 0001 's/(?<=.{4}).*/$N.jpg/' *.jpg
Can anyone suggest a way to strip the filename after the first 4 chars, rename the file, and increment only when a duplicate is encountered?
awk to the rescue!
With the filenames listed in a file named file:
$ while read f t; do
      echo "mv $f $t"
  done < <(awk -F\. '{k=substr($0,1,4); printf "%s\t%s%04d.%s\n", $0,k,a[k]++,$NF}' file)
mv 9407-C-9406-B-038.jpg 94070000.jpg
mv 9407-C-9406-B-118.jpg 94070001.jpg
mv 9422-AC-012.jpg 94220000.jpg
mv 9422-AC-112.jpg 94220001.jpg
mv 9422-BD-043.jpg 94220002.jpg
mv 9422-BD-Still-001.jpg 94220003.jpg
mv 9405_M.jpg 94050000.jpg
mv 9792A.jpg 97920000.jpg
mv 9792B.jpg 97920001.jpg
Remove the echo for the actual renaming. This assumes no whitespace in the file names, and that each name is at least 4 chars long and has an extension.
Your example shows a two-digit counter; here the width is controlled by the printf format %04d, so change the 4 to 2 if you don't expect more than 100 duplicates per prefix. Regardless, the counter will always create unique filenames, but it may extend past the allocated number of digits (e.g. with two digits it will generate 00, 01, ... 99, and the next one will be 100).
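If whitespace in names is a concern, here is a minimal pure-Bash sketch of the same counting idea, assuming Bash 4+ for associative arrays and .jpg files in the current directory:
#!/bin/bash
declare -A count                     # per-prefix duplicate counter
for f in *.jpg; do                   # glob expands in sorted order
    k=${f:0:4}                       # leading 4-character key
    printf -v new '%s%02d.jpg' "$k" "${count[$k]:-0}"
    count[$k]=$(( ${count[$k]:-0} + 1 ))
    echo mv -- "$f" "$new"           # drop the echo to actually rename
done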

Remove a null character (Shell Script)

I've looked everywhere and I'm out of luck.
I am trying to count the files in my current directory and all subdirectories, so that when I run the shell script count_files.sh it will produce output similar to:
$
2 sh
4 html
1 css
2 noexts
(EDIT the above output should have each count and extension on a newline)
$
where noexts are either files without any period as an extension (ex: fileName ) or files with a period but no extension (ex: fileName. ).
this pipeline:
find * | awk -F. '{print $NF}'
gives me a comprehensive list of all the files, and I've figured out how to remove files without any period (ex: fileName ) using sed '/\//d'
My issue is that I cannot remove from the output of the above pipeline the files that have a period but nothing (a null string) after it (ex: fileName. ), as the name is split on the delimiter '.'
How can I use sed like above to remove a null character from a pipe input?
I understand this could be a quick fix, but I've been googling like a madman with no luck. Thanks in advance.
Chip
To filter filenames that end with ., since filenames are the whole input line in find's output, you could use
sed '/\.$/d'
Where \. matches a literal dot and $ matches the end of the line.
However, I think I'd do the whole thing in awk. Since sorting does not appear to be necessary:
EDIT: Found a nicer way to do it with awk and find's -printf action.
find . -type f -printf '%f\n' | awk -F. '!/\./ || $NF == "" { ++count["noext"]; next } { ++count[$NF] } END { for(k in count) { print count[k] " " k } }'
Here we pass -printf '%f\n' to find to make it print only the file name without the preceding directory, which makes it much easier to work with for our purposes -- this way there's no need to worry about periods in directory names (such as /etc/somethingorother.d). The field separator is '.', the awk code is
!/\./ || $NF == "" { # if the line (the filename) does not contain
# a period or there's nothing after the last .
++count["noext"] # increment the "noext" counter
# note that this will be collated with files that
# have ".noext" as filename extension. see below.
next # go to the next line
}
{ # in all other lines
++count[$NF] # increment the counter for the file extension
}
END { # in the very end:
for(k in count) { # print the counters.
print count[k] " " k
}
}
Note that this way, if there is a file "foo.noext", it will be counted among the files without a filename extension. If this is a worry, use a special counter for files without an extension -- either apart from the array or with a key that cannot be a filename extension (such as one that includes a . or the empty string).
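For instance, with the script above, a key that contains a dot can never collide, because $NF (after splitting on '.') cannot itself contain a dot; a minimal sketch of that one-line change:
++count["no.ext"]    # a key containing "." can never equal a real $NF extension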

Bash script to list files periodically

I have a huge set of files, 64,000, and I want to create a Bash script that lists the name of files using
ls -1 > file.txt
for every 4,000 files and store the resulting file.txt in a separate folder. So, every 4,000 files have their names listed in a text file that is stored in a folder. The result is
folder01 contains file.txt that lists files #0-#4000
folder02 contains file.txt that lists files #4001-#8000
folder03 contains file.txt that lists files #8001-#12000
.
.
.
folder16 contains file.txt that lists files #60001-#64000
Thank you very much in advance
You can try:
ls -1 | awk '
{
    if (! ((NR-1)%4000)) {
        if (j) close(fnn)
        fn = sprintf("folder%02d", ++j)
        system("mkdir " fn)
        fnn = fn "/file.txt"
    }
    print >> fnn
}'
Explanation:
NR is the current record number in awk, that is: the current line number.
NR starts at 1, on the first line, so we subtract 1 such that the if statement is true for the first line
system calls an operating system function from within awk
print in itself prints the current line to standard output, we can redirect (and append) the output to the file using >>
All uninitialized variables in awk will have a zero value, so we do not need to say j=0 in the beginning of the program
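As a quick sanity check of the layout this produces (expected result, given 64,000 input lines):
$ ls folder01
file.txt
$ wc -l folder01/file.txt
4000 folder01/file.txt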
This will get you pretty close:
ls -1 | split -l 4000 -d - folder
Run the result of ls through split, breaking every 4000 lines (-l 4000), using numeric suffixes (-d), from standard input (-) and start the naming of the files with folder.
Results in folder00, folder01, ...
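If you strictly need the folderNN/file.txt layout, a small follow-up loop (a sketch; it assumes no pre-existing file.txt in the current directory) can convert each output file of split into a directory holding file.txt:
for f in folder[0-9][0-9]; do
    mv -- "$f" file.txt    # move the list out of the way
    mkdir -- "$f"          # reuse the name for a directory
    mv -- file.txt "$f"/   # place the list inside it
done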
Here is an exact solution using awk:
ls -1 | awk '
(NR-1) % 4000 == 0 {
    dir = sprintf("folder%02d", ++nr)
    system("mkdir -p " dir)
}
{ print >> (dir "/file.txt") }'
There are already some good answers above, but I would also suggest you take a look at the watch command. This will re-run a command every n seconds, so you can, well, watch the output.
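For example, to watch the file count change every 10 seconds (a typical invocation; -n sets the interval in seconds):
$ watch -n 10 'ls -1 | wc -l'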
