Shell script for deleting files, where files have sequence numbers - linux

I need a shell script for deleting files, where files have sequence numbers.
For example, there is a directory like /abc/def inside of which I have files like:
xyz_1000_1_pqr.arc
xyz_1001_1_pqr.arc
xyz_1002_1_pqr.arc
xyz_1003_1_pqr.arc
xyz_1004_1_pqr.arc
xyz_1005_1_pqr.arc
xyz_1006_1_pqr.arc
xyz_1007_1_pqr.arc
xyz_1008_1_pqr.arc
Here I need to delete all the .arc files that have a sequence number less than (<) 1004. That is, only the files:
xyz_1000_1_pqr.arc
xyz_1001_1_pqr.arc
xyz_1002_1_pqr.arc
xyz_1003_1_pqr.arc
should be deleted.
(P.S.: each file is 4-5 GB and critical.)
EDIT:
Sorry for not mentioning this earlier.
Requirements: delete the files that have already been backed up and prevent the file system from reaching 100%. The backup team provides the sequence number of the latest file that has been backed up.
It would be very convenient if I could get a shell script that takes an argument (the sequence number of the last backed-up file) and deletes all files with a sequence number lower than the one provided by the backup team.
There are more than 30 servers with the same scenario, and the starting sequence number (the sequence number of the oldest file) is different on each server and is not known without logging into each one and checking the directory manually.
Hence a for loop from a starting sequence number to an ending sequence number with rm is out of the question.
A generic script that can be deployed on all servers and works only with the sequence number of the most recently backed-up file is what I'm looking for, so that it can be called as an event reaction from a tool (OEM12c, Oracle related, which generates file system alerts).
As of now I log into each server manually and remove the files using regular expressions every time an alert is triggered because the file system crosses 70%, which is repetitive and hectic since I have other (DBA) concerns, hence an automated script would save me a lot of time.
Thanks

One way
rm xyz_100{0..3}_1_pqr.arc
If you have the start and end sequence numbers, it's just a matter of looping over them and deleting:
for (( i=$start_num ; i<=$end_num; i++ ))
do
    rm xyz_${i}_*arc
done

This script starts with the number that you specify on the command line and deletes all files with that number or lower until it gets down to a number where the file doesn't exist.
#!/bin/bash
# $1 is the sequence number of the last backed-up file
num=$1
for ((i=$num; i>=0; i--))
do
    name=$(printf 'xyz_%03i_1_pqr.arc' "$i")
    [ -f "$name" ] || break    # stop at the first missing file
    rm "$name"
done
The "%03i" format in the printf statement assures that the number, once formatted, will be three digits or longer. (That means that a number such as 99 is padded with zeros to become 099.) For printf, "%i" would mean format an integer, "%3i" would mean format an integer and give it at least three spaces, and "%03i" means format an integer into three spaces, left-padded with zeros as needed.
In an earlier version of this answer, I had the script check all numbers down to zero looking for files to delete. In the comments, you mention that sequence numbers may be up to 7 digits. That could make the exhaustive approach overly time-consuming. In this version, the script counts down until it reaches a sequence number whose file has already been deleted, and stops there.
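Since the starting sequence number is unknown on each server and the digit count varies, another possible approach is to glob the existing files and parse each sequence number out of the file name. This is a rough, untested sketch; the /abc/def directory and the xyz_<seq>_1_pqr.arc naming are taken from the question and may need adjusting:
#!/bin/bash
# $1 = sequence number of the last file the backup team has secured
cutoff=$1
cd /abc/def || exit 1
for f in xyz_*_1_pqr.arc; do
    [ -e "$f" ] || continue      # glob matched nothing
    seq=${f#xyz_}                # strip the leading "xyz_"
    seq=${seq%%_*}               # keep the digits up to the next underscore
    if [ "$seq" -lt "$cutoff" ]; then
        rm -- "$f"
    fi
done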

Related

Compare and sync two (huge) directories - consider only filenames

I want to do a one-way sync in Linux between two directories. One contains files and the other contains the processed files; it has the same directory structure and the same filenames, but some files might be missing.
Right now I am doing:
cd "$SOURCE"
find * -type f | while IFS= read -r fname; do
    if [ ! -e "$TARGET$fname" ]
    then
        # process the file and copy it to the target. Create directories if needed.
        :
    fi
done
which works, but is painfully slow.
Is there a better way to do this?
There are roughly 50,000,000 files, spread among directories and sub-directories. Each directory contains no more than 255 files/subdirectories.
I looked at
rsync: it seems to always do a size or timestamp comparison. This would result in every file being flagged as different, since the processing takes some time and changes the file contents.
diff -qr: I could not figure out how to make it ignore file sizes and content.
Edit
Valid assumptions:
comparisons are being made solely on directory/file names
we don't care about file metadata and/or attributes (eg, size, owner, permissions, date/time last modified, etc)
we don't care about files that may reside in the target directory but without a matching file in the source directory. This is only partially true, but deletions from the source are rare and happen in bulk, so I will do a special case for that.
Assumptions:
comparisons are being made solely on directory/file names
we don't care about file metadata and/or attributes (eg, size, owner, permissions, date/time last modified, etc)
we don't care about files that may reside in the target directory but without a matching file in the source directory
I don't see a way around comparing 2x lists of ~50 million entries but we can try to eliminate the entry-by-entry approach of a bash looping solution ...
One idea:
# obtain sorted list of all $SOURCE files
srcfiles=$(mktemp)
cd "${SOURCE}"
find * -type f | sort > "${srcfiles}"
# obtain sorted list of all $TARGET files
tgtfiles=$(mktemp)
cd "${TARGET}"
find * -type f | sort > "${tgtfiles}"
# 'comm -23' => extract list of items that only exist in the first file - ${srcfiles}
missingfiles=$(mktemp)
comm -23 "${srcfiles}" "${tgtfiles}" > "${missingfiles}"
# process list of ${SOURCE}-only files
while IFS= read -r missingfile
do
    process_and_copy "${missingfile}"
done < "${missingfiles}"
'rm' -rf "${srcfiles}" "${tgtfiles}" "${missingfiles}"
This solution is (still) serial in nature so if there are a 'lot' of missing files the overall time to process said missing files could be appreciable.
With enough system resources (cpu, memory, disk throughput) a 'faster' solution would look at methods of parallelizing the work, eg:
running parallel find/sort/comm/process threads on different $SOURCE/$TARGET subdirectories (may work well if the number of missing files is evenly distributed across the different subdirectories) or ...
stick with the serial find/sort/comm but split ${missingfiles} into chunks and then spawn separate OS processes to process_and_copy the different chunks
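A rough sketch of the second option, assuming process_and_copy is a shell function defined earlier in the same script and that chunks of roughly 100,000 lines are a sensible granularity:
split -l 100000 "${missingfiles}" /tmp/missing.chunk.
for chunk in /tmp/missing.chunk.*
do
    (
        while IFS= read -r missingfile
        do
            process_and_copy "${missingfile}"
        done < "${chunk}"
    ) &
done
wait    # block until all background chunks have finished
rm -f /tmp/missing.chunk.*
Tools such as xargs -P or GNU parallel could achieve a similar effect with less bookkeeping.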

What is the structure of the binary files produced by nfcapd (one of the nfdump tools)?

I want to split files produced by nfcapd (a netflow producing daemon) into multiple files, because the file initially produced by nfcapd might be too big.
My problem is that I have no idea what the structure of the produced files is. I suppose there is a header and then a list of netflow records, but I can't figure out at which byte the header ends, where each netflow record begins and ends, or whether there is a footer.
I tried to understand it by reading the C source code on GitHub, but as I am not really strong in C, it is quite hard for me to comprehend.
At first, I thought nfdump could solve my problem by reading a number of netflows at a time from the initial file, but there is no built-in way to do this: you can use nfdump to read the first N netflows, but you can't go from 1 to N and then from N+1 to 2N; you can only read from 1 to N.
If anyone knows a way to split those binary files into multiple files that can be used by nfdump, I would really like to know it.
You can set the time interval to less than 5 minutes (the default) using the -t parameter. That way smaller files are created in advance.
For example:
nfcapd -w 1 -l -p -t 60
Please note that -w should be set accordingly: if -t is 60 (seconds), -w should be 1 (minute).
There is more in the man page: https://manpages.debian.org/testing/nfdump/nfcapd.1.en.html
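If you already have a large capture file, one possible way to split it after the fact is to let nfdump read it back and rewrite slices of it. This is only a sketch to verify against your nfdump version; it assumes -r reads a capture file, -t restricts processing to a time window, and -w writes the matching flows back out in binary form, and nfcapd.201107080000 stands in for your capture file:
nfdump -r nfcapd.201107080000 -t '2011/07/08.00:00:00-2011/07/08.00:02:29' -w part1
nfdump -r nfcapd.201107080000 -t '2011/07/08.00:02:30-2011/07/08.00:04:59' -w part2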

Strange results using Linux find

I am trying to set up a backup shell script that will run once per week on my server and keep the weekly backups for ten weeks, and it all works well, except for one thing...
I have a folder that contains many rather large files, so the ten weekly backups of that folder take up quite a lot of disk space, and many of the larger files rarely change. I therefore thought I would split the backup of that folder in two: one part for the smaller files, which is included in the 'normal' weekly backup (and kept for ten weeks), and one file for the larger files, which is simply replaced every week without keeping the older weekly versions.
I have used the following command for the larger files:
/usr/bin/find /other/projects -size +100M -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_LARGE.tar
That works as expected. The tar -v option is there for debugging. However, when archiving the smaller files, I use a similar command:
/usr/bin/find /other/projects -size -100M -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_$FILE_END.tar
Where $FILE_END is the week number. The line above does not work. I ran the script the other day and it took hours and produced a file of 70 GB, though the expected output size is about 14 GB (there are a lot of files). It seems there is some duplication of files in the archive, though I have not been able to check fully. Yesterday I ran the command above for the smaller files from the command line, and I could see that files I know to be larger than 100 MB were included.
However, just now I ran find /other/projects -size -100M from the command line and that produced the expected list of files.
So, if anyone has any ideas about what I am doing wrong, I would really appreciate tips or pointers. The file names include spaces and all sorts of characters, e.g. single quotes, if that has something to do with it.
The only thing I can think of is that I am not using xargs properly, and admittedly I am not very familiar with it, but I still think the problem lies in my use of find, since it is find that provides the input to xargs.
First of all, I do not know if it is considered bad form or not to answer your own question, but I am doing it anyway since I realised my error and I wanted to close this and hopefully be able to help someone having the same problem as I had.
Now, once I realised what I did wrong I frankly am a bit embarrassed that I did not see it earlier, but this is it:
I did some experimental runs from the command line and after a while I realised that the output not only listed all the files, it also listed the directories themselves. Directories are of course files too, and they are smaller than 100M, so they have (most likely) been included, and when a directory is included, all files in it are included as well, regardless of their sizes. This would also explain why the output file was five times larger than expected.
So, in order to overcome this I added -type f, which includes only regular files, to the find command and lo and behold, it worked!
To recap, the adjusted command I use for the smaller files is now:
/usr/bin/find /other/projects -size -100M -type f -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_$FILE_END.tar
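As a side note, with GNU tar the xargs step can be dropped entirely by letting tar read the NUL-separated list itself; since tar then runs exactly once, -c can replace the append mode (this is a sketch to adapt, not the exact command from the answer):
/usr/bin/find /other/projects -size -100M -type f -print0 | /bin/tar --null -T - -cvPf "/backup/PRJ-files_$FILE_END.tar"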

Joining two files with regular expression in Unix (ideally with perl)

I have the following two files, disconnect.txt and answered.txt:
disconnect.txt
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032
answered.txt
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I would like to join these files on the from: and to: fields, and the output should be the matching lines from answered.txt. For example, with the above two files the output would be:
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I'm currently doing it by comparing each line in file 1 with each line in file 2, but want to know if an efficient way exists (these files will be in tens of gigabytes).
Thank you
Sounds like you have hundreds of millions of lines?
Unless the files are sorted in such a way that you can expect the order of the from: and to: to at least vaguely correlate, this is a job for a database.
If the files are large the quadratic algorithm will take a lifetime.
Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:
def key s
  s.split('from:')[1].split('to:').map(&:strip).join('.')
end

h = {}
open 'disconnect.txt', 'r' do |f|
  while s = f.gets
    h[key(s)] = true
  end
end

open 'answered.txt', 'r' do |f|
  while a = f.gets
    puts a if h[key(a)]
  end
end
Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big number [1], then you will probably not be able to fit all the keys in memory and you will need a database.
[1] The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory in your machine.
First, sort the files on the from/to timestamps if they are not already sorted that way. (Yes, I know the from/to appear to be stored as epoch seconds, but that's still a timestamp.)
Then take the sorted files and compare the first lines of each.
If the timestamps are the same, you have a match. Hooray! Advance a line in one or both files (depending on your rules for duplicate timestamps in each) and compare again.
If not, grab the next line in whichever file has the earlier timestamp and compare again.
This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range, but:
You can split the files into arbitrarily-sized chunks (ideally small enough for each chunk to fit into memory), sort each chunk independently, and then generalize the above algorithm from two files to as many as are necessary.
Even if you don't do that and you deal with the disk thrashing involved with sorting files larger than the available memory, sorting and then doing a single pass over each file will still be a lot faster than any solution involving a cartesian join.
Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.
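As a rough shell sketch of that sort-and-merge idea (mkkey is a hypothetical helper; the sketch assumes every line carries exactly one from:/to: pair and contains no tab characters):
export LC_ALL=C    # keep sort and join collation consistent

# prefix each line with its normalised "from:Xto:Y" key, tab-separated
mkkey() {
    awk '{
        if (match($0, /from: *[^ ]+ +to:[^ ]+/)) {
            k = substr($0, RSTART, RLENGTH)
            gsub(/ /, "", k)
            print k "\t" $0
        }
    }' "$1"
}

mkkey disconnect.txt | cut -f1 | sort -u       > disconnect.keys
mkkey answered.txt   | sort -t$'\t' -k1,1      > answered.keyed

# emit the answered.txt line (field 2.2) for every key present in both files
join -t$'\t' -o 2.2 disconnect.keys answered.keyed
The heavy lifting is then done by sort, which copes with files larger than memory by spilling to temporary files.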

Splitting long input into multiple text files

I have some code which will generate an infinite number of lines of output, so I can't store those values in a single output file.
Instead, I split the output into multiple files, according to index numbers. My doubt is that I don't know in advance how many lines the output will have. So is it possible to split the output into different files without giving an index? For example:
first 100,000 lines in m.txt
from 100,001 to next 200,000 in n.txt
If you don't need to be able to find a particular line based on the file name, you can split the output based on file size: write lines to m1.txt until the next line would make it larger than 1 MB, then move on to the next file, m2.txt.
split(1) appears to be exactly the tool for your job.
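For example, assuming GNU split and using your_command as a stand-in for whatever generates the output:
your_command | split -l 100000 -d - m_
# writes m_00, m_01, m_02, ... with 100,000 lines each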
Generate files with a running index. Start by opening e.g. m_000001.txt, write a fixed number of lines to that file, close it, then open the next file, e.g. m_000002.txt, and continue.
Making sure that you don't overflow the disk is a housekeeping task to be done separately. Here one can think of backups, compression, file rotation and so on.
You may want to use logrotate for this purpose. It has a lot of options: check out the man page.
Here's the introduction of the man page:
"logrotate is designed to ease administration of systems that generate
large numbers of log files. It allows automatic rotation, compression,
removal, and mailing of log files. Each log file may be handled daily,
weekly, monthly, or when it grows too large."
4 ways to split while writing:
A) Fixed number of characters (size)
B) Fixed number of lines
C) Fixed interval of time before writing
D) Fixed counter of a function before calling a write
Based on those splittings, you can name the output files.
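As a rough sketch of option B in awk, using your_command as a stand-in for the generating program and assuming numbered file names such as m_000001.txt are acceptable:
your_command | awk '
    NR % 100000 == 1 {                       # every 100,000 lines, switch files
        if (out) close(out)
        out = sprintf("m_%06d.txt", ++n)
    }
    { print > out }'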
