How to find files with similar filenames and count how many of them there are with awk - linux

I was tasked to delete old backup files from our Linux database (all except for the newest 3). Since we have multiple kinds of backups, I have to leave at least 3 backup files for each backup type.
My script should group all files with similar (matched) names together and delete all except for the last 3 files (I assume that the OS will sort those files for me, so the newest backups will also be the last ones).
The files are in the format project_name.000000-000000.svndmp.bz2 where 0 can be any arbitrary digit and project_name can be any arbitrary name. The first 6 digits are part of the name, while the last 6 digits describe the backup's version.
So far, my code looks like this:
for i in *.svndmp.bz2     # only check backup files
do
    nOfOccurences =       # Need to find out, how many files have the same name
    currentFile = 0
    for f in awk -F"[.-]" '{print $1,$2}' $i     # This doesn't work
    do
        if [nOfOccurences - $currentFile -gt 3]
        then
            break
        else
            rm $f
            currentFile++
        fi
    done
done
I'm aware that my script may try to remove old versions of a backup 4 times before moving on to the next backup. I'm not looking for performance or efficiency (we don't have that many backups).
My code is the result of 4 hours of searching the net, and I'm running out of good Google queries (and my boss is starting to wonder why I'm still not back to my usual tasks).
Can anybody give me input on how I can solve these two problems?
Find nOfOccurences
Make awk find files that fit the pattern "$1.$2-*"

Try this one, and see if it does what you want.
for project in `ls -1 | awk -F'-' '{ print $1}' | uniq`; do
    files=`ls -1 ${project}* | sort`
    n_occur=`echo "$files" | wc -l`
    for f in $files; do
        if ((n_occur <= 3)); then
            break    # stop once only the newest three remain
        fi
        echo "rm" $f
        ((--n_occur))
    done
done
If the output seems to be OK just replace the echo line.
Ah, and don't beat me if anything goes wrong. Use at your own risk only.
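For what it's worth, here is a minimal alternative sketch of the same idea. It assumes GNU head (for the negative -n count), that the file names contain no spaces, and that the project name itself contains no "-", so everything before the dash identifies the backup group and a plain sort puts the newest version last:
for prefix in `ls -1 *.svndmp.bz2 | awk -F'-' '{print $1}' | sort -u`
do
    # list this group's files oldest-to-newest and print all but the
    # last three (head -n -3 prints nothing when there are 3 files or fewer)
    ls -1 "$prefix"-*.svndmp.bz2 | sort | head -n -3 | while read -r old
    do
        echo rm "$old"    # drop the echo once the output looks right
    done
done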

Related

List files using ls to find those that meet the condition

I am writing a batch program to delete all files in a directory based on a condition in the filename.
The directory holds a large number of text files (hundreds of thousands) with filenames fixed as "abc" + date:
abc_20180820.txt
abc_20180821.txt
abc_20180822.txt
abc_20180823.txt
abc_20180824.txt
The program greps all the files, compares each filename's date to a fixed date, and deletes the file if the filename's date < fixed date.
The problem is that it takes very long to handle that large number of files (~1 hour to delete 300k files).
My question: Is there a way to compare the date while running the ls command? Not getting all files into a list and then comparing and deleting, but listing only the files that already meet the condition and then deleting them. I think that would perform better.
My code is
TARGET_DATE = "5-12"
DEL_DATE = "20180823"
ls -t | grep "[0-9]\{8\}".txt\$ > ${LIST}
for EACH_FILE in `cat ${LIST}` ;
do
DATE=`echo ${EACH_FILE} | cut -c${TARGET_DATE }`
COMPARE=`expr "${DATE}" \< "${DEL_DATE}"`
if [ $COMPARE -eq 1 ] ;
then
rm -f ${EACH_FILE}
fi
done
I found some similar problems but I don't know how to get it done:
List file using ls with a condition and process/grep files that only whitespaces
Here is a refactoring which gets rid of the pesky ls. Looping over a large directory is still going to be somewhat slow.
# Use lowercase for private variables
# to avoid clobbering a reserved system variable
# You can't have spaces around the equals sign
del_date="20180823"
# No need for ls here
# No need for a temporary file
for filename in *[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].txt
do
    # Avoid external process; use the shell's parameter substitution
    date=${filename%.txt}
    # This could fail if the file name contains literal shell metacharacters!
    # Keep only the trailing eight date digits
    date=${date#${date%????????}}
    # Avoid expr
    if [ "$date" -lt "$del_date" ]; then
        # Just print the file name, null-terminated for xargs
        printf '%s\0' "$filename"
    fi
done |
# For efficiency, do batch delete
xargs -r0 rm
The wildcard expansion will still take a fair amount of time because the shell will sort the list of filenames. A better solution is probably to refactor this into a find command which avoids the sorting.
find . -maxdepth 1 -type f \( \
    -name '*1[89][0-9][0-9][0-9][0-9][0-9][0-9].txt' \
    -o -name '*201[0-7][0-9][0-9][0-9][0-9].txt' \
    -o -name '*20180[1-7][0-9][0-9].txt' \
    -o -name '*201808[01][0-9].txt' \
    -o -name '*2018082[0-2].txt' \
\) -delete
You could do something like:
rm 201[0-7]*.txt # remove all files from 2010-2017
rm 20180[1-4]*.txt # remove all files from Jan-Apr 2018
# And so on
...
to remove a large number of files. Then your code would run faster.
Yes, it takes a lot of time if you have so many files in one folder.
It is a bad idea to keep so many files in one folder. Even a simple ls or find will hammer the storage, and if you have scripts which iterate over your files, you are definitely hammering the storage.
So after you wait the one hour to clean it up, take the time to build a better folder structure. It is a good idea to sort files by year/month/day, possibly down to the hour,
e.g.
somefolder/2018/08/24/...files here
Then you can easily delete, move, or compress a whole month or year at a time.
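A rough sketch of that reorganisation, assuming bash (for the ${d:offset:length} substring expansion) and names like abc_20180824.txt; adjust the prefix and the somefolder path to your setup:
for f in abc_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].txt
do
    d=${f#abc_}       # strip the prefix:  20180824.txt
    d=${d%.txt}       # strip the suffix:  20180824
    mkdir -p "somefolder/${d:0:4}/${d:4:2}/${d:6:2}"
    mv -- "$f" "somefolder/${d:0:4}/${d:4:2}/${d:6:2}/"
done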
I found a solution in this thread.
https://unix.stackexchange.com/questions/199554/get-files-with-a-name-containing-a-date-value-less-than-or-equal-to-a-given-inpu
The awk command is so powerful; it only took me ~1 minute to deal with hundreds of thousands of files (a tenth of the time the loop took).
ls | awk -v date="$DEL_DATE" '$0 <= date' | xargs rm -vrf
I can even count, copy, or move files with that command; it's the fastest approach I've seen.
COUNT="$(ls | awk -v date="${DEL_DATE}" '$0 <= date' | xargs rm -vrf | wc -l)"
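One caveat: the $0 <= date test compares the whole file name with the date string, so it only picks the intended files when the name starts with the date digits. If the names carry a prefix such as abc_, a variation that extracts the eight digits first might look like this (a sketch, not tested against your data; -r stops xargs from running rm with no arguments):
ls | awk -v date="$DEL_DATE" '
    match($0, /[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/) {
        # compare only the extracted date portion of the name
        if (substr($0, RSTART, 8) < date) print
    }' | xargs -r rm -f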

Keep newest x number of files, delete the rest - bash

I have this bash script as a crontab entry running every hour. I want to keep the latest 1,000 images in a folder, deleting the oldest files. I don't want to delete by mtime, because if no new files are being uploaded I want to keep them; it's fine whether an image is 1 day or 50 days old. I just want that when image 1,001 (the newest) is uploaded, image_1 (the oldest) is deleted, cycling through the folder to keep a static count of 1,000 images.
This works. However, running it every hour means there could be 1,200 images by the time it executes, and running the crontab every minute seems like overkill. Can I make it so that once the folder hits 1,001 images it auto-executes? Basically I want the folder to be self-scanning, keeping the newest 1,000 images and deleting the oldest ones.
#!/bin/sh
cd /folder/to/execute; ls -t | sed -e '1,1000d' | xargs -d '\n' rm
keep=10 # set this to how many files you want to keep
discard=$(expr $keep - $(ls|wc -l))
if [ $discard -lt 0 ]; then
    ls -Bt|tail $discard|tr '\n' '\0'|xargs -0 printf "%b\0"|xargs -0 rm --
fi
This first calculates the number of files to delete, then safely passes them to rm. It uses negative numbers intentionally, since that conveniently works as the argument to tail.
The use of tr and xargs -0 is to ensure that this works even if file names contain spaces. The printf bit is to handle file names containing newlines.
EDIT: added -- to rm args to be safe if any of the files to be deleted start with a hyphen.
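Before trusting it, a quick dry run helps: keep the $discard computation above and put echo in front of rm, so the pipeline only prints what it would delete:
ls -Bt|tail $discard|tr '\n' '\0'|xargs -0 printf "%b\0"|xargs -0 echo rm --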
Try the following script. It first checks the count in the current directory and then, if the count is greater than 1000, it computes the difference and gets that many of the oldest files.
#!/bin/bash
count=`ls -1 | wc -l`
if [ $count -gt 1000 ]
then
    difference=$((count - 1000))
    dirnames=`ls -t * | tail -n $difference`
    arr=($dirnames)
    for i in "${arr[@]}"
    do
        echo $i
    done
fi

Trying to scrub 700,000 records against 15 million records

I am trying to scrub 700,000 records obtained from a single file against 15 million records present in multiple files.
Example: one file of 700,000 records, say A; a pool of multiple files which together hold 15 million records, call it B.
I want pool B to end up with no records from file A.
Below is the shell script I am trying to use. It works fine, but the scrubbing takes a massive amount of time, more than 8 hours.
IFS=$'\r\n' suppressionArray=($(cat abhinav.csv1))
suppressionCount=${#suppressionArray[@]}
cd /home/abhinav/01-01-2015/
for (( j=0; j<$suppressionCount; j++));
do
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
arrayOffileNameInWhichSuppressionFoundCount=${#arrayOffileNameInWhichSuppressionFound[@]}
if [ $arrayOffileNameInWhichSuppressionFoundCount -gt 0 ];
then
echo -e "${suppressionArray[$j]}" >> /home/abhinav/emailid_Deleted.txt
for (( k=0; k<$arrayOffileNameInWhichSuppressionFoundCount; k++));
do
sed "/^${suppressionArray[$j]}/d" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]} > /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" && mv -f /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}
done
fi
done
Another solution that came to mind is to break the 700k records down into smaller files of 50K each and send them across the 5 available servers, with pool A also available at each server.
Each server would then work on 2 of the smaller files.
These two lines are peculiar:
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
The first assigns an empty string to the mile-long variable name because the standard output is directed to the file. The second then reads that file into the array. ('Tis curious that the name is not arrayOfFileNameInWhichSuppressionFound, but the lower-case f for file is consistent, so I guess it doesn't matter beyond making it harder to read the variable name.)
That could be reduced to:
ArrFileNames=( $(grep -l "${suppressionArray[$j]}," *.csv) )
You shouldn't need to keep futzing with carriage returns in IFS; either set it permanently, or make sure there are no carriage returns before you start.
You're running these loops 7,00,000 times (using the Indian notation). That's a lot. No wonder it is taking hours. You need to group things together.
You should probably simply take the lines from abhinav.csv1 and arrange to convert them into appropriate sed commands, and then split them up and apply them. Along the lines of:
sed 's%.*%/&,/d%' abhinav.csv1 > names.tmp
split -l 500 names.tmp sed-script.
for script in sed-script.*
do
sed -f "$script" -i.bak *.csv
done
This uses the -i option to back up the files. It may be necessary to do the redirection explicitly if your sed does not support the -i option:
for file in *.csv
do
sed -f "$script" "$file" > "$file.tmp" &&
mv "$file.tmp" "$file"
done
You should experiment to see how big the scripts can be. I chose 500 in the split command as a moderate compromise. Unless you're on antique HP-UX, that should be safe, but you may be able to increase the size of the scripts, which will reduce the number of times you have to edit each file and speed up the processing. If you can use 5,000 or 50,000, you should do so. Experiment to see what the upper limit is. I'm not sure you'd find doing all 700,000 lines at once feasible, but it should be fastest if you can do it that way.
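If the sed scripts are still too slow, a different tool worth benchmarking (memory permitting) is grep with a fixed-string pattern file. This is only a sketch mirroring the /&,/d semantics above, and patterns.tmp is a temporary file name chosen here for illustration:
# one fixed-string pattern per line: the suppression value plus a trailing comma
sed 's/$/,/' abhinav.csv1 > patterns.tmp
for file in *.csv
do
    # -F fixed strings, -f read all patterns once, -v keep the non-matching lines
    grep -v -F -f patterns.tmp "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
    # note: grep exits non-zero when every line was suppressed, in which case
    # the (then empty) result is deliberately not moved back over the original
done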

Rm and Egrep -v combo

I want to remove all the logs except the current log and the log before that.
These log files are created every 20 minutes, so the file names are like
abc_23_19_10_3341.log
abc_23_19_30_3342.log
abc_23_19_50_3241.log
abc_23_20_10_3421.log
where 23 is today's date (it might include yesterday's date also), 19 is the hour (7 o'clock), and 10, 30, 50, 10 are the minutes.
In this case I want to keep abc_23_20_10_3421.log, which is the current log (the one currently being written), and abc_23_19_50_3241.log (the previous one), and remove the rest.
I got it to work by creating a folder, putting the first files in that folder, removing the files, and then deleting the folder. But that takes too long...
I also tried this
files_nodelete=`ls -t | head -n 2 | tr '\n' '|'`
rm *.txt | egrep -v "$files_nodelete"
but it didn't work. However, if I put ls instead of rm, it works.
I am an amateur in Linux, so please suggest a simple idea or a piece of logic. I tried xargs rm but it didn't work.
I also read about mtime, but it seems a bit complicated since I am new to Linux.
I am working on a Solaris system.
Try the logadm tool on Solaris; it might be the simplest way to rotate logs. If you just want to get things done, it will do it.
http://docs.oracle.com/cd/E23823_01/html/816-5166/logadm-1m.html
If you want a solution similar (but working) to yours, try this:
ls abc*.log | sort | head -n-2 | xargs rm
ls abc*.log: lists all files matching the pattern abc*.log
sort: sorts this list lexicographically (by name) from oldest to newest logfile
head -n-2: returns all but the last two entries in the list (you can give -n a negative count)
xargs rm: composes the rm command from the entries on stdin
If there are two or fewer files in the directory, this command will return an error like
rm: missing operand
and will not delete any files.
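To preview what would go before committing, the same pipeline can be run with echo placed in front of rm:
ls abc*.log | sort | head -n-2 | xargs echo rm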
It is usually not a good idea to use ls to point to files. Some files may cause havoc (files which have a newline or a weird character in their name are the usual examples).
Using shell globs, here is an interesting way: we count the files newer than the one we are about to remove!
pattern='abc*.log'
for i in $pattern ; do
    [ -f "$i" ] || break
    # determine if this is the most recent file, in the current directory
    # [I add -maxdepth 1 to limit the find to only that directory, no subdirs]
    if [ $(find . -maxdepth 1 -name "$pattern" -type f -newer "$i" -print0 | tr -cd '\000' | tr '\000' '+' | wc -c) -gt 1 ];
    then
        # there are 2 files more recent than $i that match the pattern
        # we can delete $i
        echo rm "$i" # remove the echo only when you are 100% sure that you want to delete all those files!
    else
        echo "$i is one of the 2 most recent files matching '${pattern}', I keep it"
    fi
done
I only use the globbing mechanism to feed filenames to "find", and just use the terminating "0" of the -print0 output to count the filenames (thus I have no problems with any special characters in those filenames; I just need to know how many files were output).
tr -cd '\000' will keep only the \000, i.e. the terminating NUL characters output by -print0. Then I translate each \000 into a single + character and count them with wc -c. If I see 0, "$i" was the most recent file. If I see 1, "$i" is the one just a bit older (so find sees only the most recent one). And if I see more than 1, it means the 2 files (matching the pattern) that we want to keep are newer than "$i", so we can delete "$i".
I'm sure someone will step in with a better one, but the idea could be reused, I guess...
Thanks guys for all the answers.
I found my answer:
files=`ls -t *.txt | head -n 2 | tr '\n' '|' | rev |cut -c 2- |rev`
rm `ls -t | egrep -v "$files"`
Thank you for the help

Find files not in numerical list

I have a giant list of files that are all currently numbered in sequential order with different file extensions.
3400.PDF
3401.xls
3402.doc
There are roughly 1400 of these files in a directory. What I would like to know is how to find numbers that do not exist in the sequence.
I've tried to write a bash script for this but my bash-fu is weak.
I can get a list of the files without their extensions by using
FILES=$(ls -1 | sed -e 's/\..*$//')
but a few places I've seen say to not use ls in this manner.
(15 days after asking, I couldn't relocate where I read this, if it existed at all...)
I can also get the first file via ls | head -n 1, but I'm pretty sure I'm making this a whole lot more complicated than I need to.
Sounds like you want to do something like this:
shopt -s nullglob
for i in {1..1400}; do
    files=($i.*)
    (( ${#files[@]} > 0 )) || echo "no files beginning with $i"
done
This uses a glob to make an array of all files 1.*, 2.* etc. It then compares the length of the array to 0. If there are no files matching the pattern, the message is printed.
Enabling nullglob is important as otherwise, when there are no files matching, the array will contain one element: the literal pattern '1.*'.
Based on a deleted answer that was largely correct:
for i in $(seq 1 1400); do ls $i.* > /dev/null 2>&1 || echo $i; done
ls [0-9]* \
| awk -F. ' !seen[$1]++ { ++N }
END { for (n=1; N ; ++n) if (!seen[n]) print n; else --N }
'
It will stop when it has filled the last gap; substitute N>0 || n < 3000 into the loop condition to make it go at least that far.
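Another route to the same list, assuming bash for the process substitution: diff the expected sequence against the numbers actually present with comm. The 3400 and 4800 bounds below are placeholders for your real range:
# lines only in the first input are numbers expected in the sequence but missing on disk
comm -23 <(seq 3400 4800 | sort) \
         <(for f in [0-9]*.*; do printf '%s\n' "${f%%.*}"; done | sort -u)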
