Linux/Perl: Returning a list of folders that have not been modified for over x minutes

I have a directory that has multiple folders. I want to get a list of the names of folders that have not been modified in the last 60 minutes.
The folders will have multiple files that will remain old, so I can't use -mmin +60.
I was thinking I could do something with the inverse, though: get a list of files that have been modified in the last 60 minutes with -mmin -60 and then output the inverse of that list.
Not sure how to go about doing that, or if there is a simpler way to do it?
Eventually I will take this list of folders into a Perl script and add them to a list or something.
This is what I have so far to get the list of folders
find /path/to/file -mmin -60 | sed 's/\/path\/to\/file\///' | cut -d "/" -f1 | uniq
Above will give me just the names of the folders that have been updated.

There is a neat trick to do set operations on text lines like this with sort and uniq. You already have the paths that have been updated; assume they are in a file called upd. A simple find -type d can give you all folders; let's assume we have them in a file called all. Then run
cat all upd | sort | uniq -c | grep '^ *1 '
All paths that appear in both files will be prefixed with a count of 2. All paths appearing only in file all will be prefixed with a 1 (uniq -c pads the count with leading spaces, hence the pattern). The lines prefixed with 1 are the set difference between all and upd, i.e. the paths that were not touched. (I take it you are able to strip the leading count yourself.)
Surely this can be done with Perl or any other scripting language, but this simple sort | uniq is just too nice. :-)
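Applied to the question above, the whole thing might look like this (a rough sketch only, assuming GNU find and top-level folder names without newlines):
cd /path/to/file
find . -mindepth 1 -maxdepth 1 -type d | sort > all
find . -mindepth 2 -mmin -60 | cut -d "/" -f1-2 | sort -u > upd
cat all upd | sort | uniq -c | grep '^ *1 ' | sed 's/^ *1 //'
The first find collects every top-level folder, the second collects the folders containing something modified in the last 60 minutes, and the last line prints the set difference, i.e. the untouched folders.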

The diff command is made for this.
Given two files, "all":
# cat all
/dir1/old2
/dir2/old4
/dir2/old5
/dir1/new1
/dir2/old2
/dir2/old3
/dir1/old1
/dir1/old3
/dir1/new4
/dir2/new1
/dir2/old1
/dir1/new2
/dir2/new2
and "updated":
# cat updated
/dir2/new1
/dir1/new4
/dir2/new2
/dir1/new2
/dir1/new1
We can sort the files and run diff. For this task, I prefer to sort inline with process substitution:
# diff <(sort all) <(sort updated)
4,6d3
< /dir1/old1
< /dir1/old2
< /dir1/old3
9,13d5
< /dir2/old1
< /dir2/old2
< /dir2/old3
< /dir2/old4
< /dir2/old5
If there are any files in "updated" that aren't in "all", they'll be prefixed with '>'.
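If you only want the untouched paths, without the diff markers, comm can produce the set difference directly (a sketch; comm needs both inputs sorted, which the process substitutions take care of):
comm -23 <(sort all) <(sort updated)
The -23 suppresses lines unique to "updated" and lines common to both, leaving only the lines that appear solely in "all".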

Related

pasting many files to a single large file

I have many text files in a directory, like 1.txt 2.txt 3.txt 4.txt ... 2000.txt, and I want to paste them to make one large file.
In this regard I did something like
paste *.txt > largefile.txt
but the above command reads the .txt files in an arbitrary order, so I need to read the files sequentially and paste them as 1.txt 2.txt 3.txt ... 2000.txt.
Please suggest a better solution for pasting many files.
Thanks and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised with just ulimit -n 99999.
Without raising the soft limit, go with a temporary file that accumulates the results of each "round" of ulimit -n files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
paste accumulator.txt "$@" > accumulator.txt.sav;
mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
Instead of using the wildcard * to enumerate all your files in the directory: if your file names are numbered sequentially, you can list the files in order manually and concatenate them into a large file. The order in which * is expanded can differ between environments and may not be what you expect.
Below is a simple example
$ for i in `seq 20`;do echo $i > $i.txt;done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt with number 1 to 20 in each file respectively
$ cat {1..20}.txt
# show the content of all files in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them to a large file named 1_20.txt
In bash, or any other shell, glob expansions are done in lexicographical order. When files are numbered, this sadly means that 10.txt and 11.txt sort before 2.txt, and, depending on the locale's collation rules, 11.txt may even sort before 1.txt. The comparison is done character by character, so the numeric value of the name is never considered.
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(printf "%05d.txt" "${i%.*}")"; done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that it is possible that a file is missing (e.g. 1234.txt). So you can do
shopt -s extglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one occurrence of pattern. So this will skip the missing files but keep the order.
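Another option, assuming GNU sort and xargs are available, is version sort, which orders embedded numbers numerically without renaming anything (a rough sketch):
printf "%s\n" *.txt | sort -V | xargs -d '\n' paste > combined.out
Here the output name combined.out is chosen deliberately so that it does not match the *.txt glob and end up pasted into itself.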

Quickly list random set of files in directory in Linux

Question:
I am looking for a performant, concise way to list N randomly selected files in a Linux directory using only Bash. The files must be randomly selected from different subdirectories.
Why I'm asking:
In Linux, I often want to test a random selection of files in a directory for some property. The directories contain 1000's of files, so I only want to test a small number of them, but I want to take them from different subdirectories in the directory of interest.
The following returns the paths of 50 "randomly"-selected files:
find /dir/of/interest/ -type f | sort -R | head -n 50
The directory contains many files and resides on a mounted file system with slow read times (accessed through ssh), so the command can take many minutes. I believe the issue is that find has to list every file (slow) before sort can print a random selection.
If you are using locate and updatedb updates regularly (daily is probably the default), you could:
$ locate /home/james/test | sort -R | head -5
/home/james/test/10kfiles/out_708.txt
/home/james/test/10kfiles/out_9637.txt
/home/james/test/compr/bar
/home/james/test/10kfiles/out_3788.txt
/home/james/test/test
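As a side note, if GNU shuf is available it can replace the sort -R | head pair and draw the sample directly instead of shuffling the whole list (a sketch using the same locate output):
locate /home/james/test | shuf -n 5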
How often do you need it? Do the work periodically in advance to have it quickly available when you need it.
Create a refreshList script.
#! /bin/env bash
find /dir/of/interest/ -type f | sort -R | head -n 50 >/tmp/rand.list
mv -f /tmp/rand.list ~
Put it in your crontab.
0 7-20 * * 1-5 nice -25 ~/refreshList
Then you will always have a ~/rand.list that's under an hour old.
If you don't want to use cron and aren't too picky about how old it is, just write a function that refreshes the file after you use it every time.
randFiles() {
    cat ~/rand.list                      # serve the current (possibly stale) list
    {                                    # ...then rebuild it in the background for next time
        find /dir/of/interest/ -type f |
            sort -R | head -n 50 >/tmp/rand.list
        mv -f /tmp/rand.list ~
    } &
}
If you can't run locate and the find command is too slow, is there any reason this has to be done in real time?
Would it be possible to use cron to dump the output of the find command into a file and then do the random pick out of there?
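That could look something like the following (a rough sketch; the list path and the schedule are made up):
# crontab entry: rebuild the full file list once an hour
0 * * * * find /dir/of/interest/ -type f > /tmp/interest_files.list
# picking 50 random paths from the cached list is then nearly instant
shuf -n 50 /tmp/interest_files.list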

Unable to cat ~9000 files using command line

I am trying to cat ~9000 FASTA-like files into one larger file. All of the files are in a single subfolder. I keep getting the "Argument list too long" error.
This is a sample name from one of the files
efetch.fcgi?db=nuccore&id=CL640905.1&rettype=fasta&retmode=text
They are considered a document type file by the computer.
You can't use cat * > concatfile as you have limits on command line size. So take them one at a time and append:
ls | while IFS= read -r REPLY; do cat "$REPLY" >> concatfile; done
(Make sure concatfile doesn't exist beforehand.)
EDIT: As user6292850 rightfully points out, I might be overthinking it. This suffices, if your files don't have too weird names:
ls | xargs cat > concatfile
(but files with spaces in them, for example, would blow it up)
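A null-delimited variant that is also safe for names containing spaces or quotes (a sketch; it assumes GNU find and xargs and is run from inside the subfolder holding the efetch.fcgi files):
find . -maxdepth 1 -type f -name 'efetch.fcgi*' -print0 | xargs -0 cat > concatfile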
There is a limit on how many arguments you can place on the commandline.
You could use a while loop fed by find to handle this:
while IFS= read -r file; do
    cat "${file}" >> path/to/output_file
done < <(find path/to/input_folder -maxdepth 1 -type f -print)
(Keep the output file outside the folder being read, so it isn't picked up by the find.)
This will bypass the problem of an expanded glob with too many arguments.
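The batching can also be left to find itself with -exec ... + (a sketch using the same placeholder paths):
find path/to/input_folder -maxdepth 1 -type f -exec cat {} + > path/to/output_file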

Clearing archive files with linux bash script

Here is my problem,
I have a folder where is stored multiple files with a specific format:
Name_of_file.TypeMM-DD-YYYY-HH:MM
where MM-DD-YYYY-HH:MM is the time of its creation. There could be multiple files with the same name but not the same time of course.
What I want is a script that can keep the 3 newest versions of each file.
So, I found one example there:
Deleting oldest files with shell
But I don't want to delete a fixed number of files; I want to keep a certain number of the newer ones. Is there a way to adapt that find command to parse out the Name_of_file and keep the 3 newest?
Here is the code I've tried yet, but it's not exactly what I need.
find /the/folder -type f -name 'Name_of_file.Type*' -mtime +3 -delete
Thanks for help!
So I decided to add my final solution in case anyone would like to have it. It's a combination of the 2 solutions given.
ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}" | awk 'NR > 3' | xargs rm
One line, super efficient. If anything changes in the pattern of the date or name, just change the grep -P pattern to match it. This way you are sure that only the files fitting this pattern will get deleted.
Can you be extra, extra sure that the timestamp on the file is the exact same timestamp on the file name? If they're off a bit, do you care?
The ls command can sort files by timestamp order. You could do something like this:
$ ls -t | awk 'NR > 3' | xargs rm
The ls -t lists the files by modification time, with the newest first.
The awk 'NR > 3' prints out the list of files except for the first three lines, which are the three newest.
The xargs rm will remove the files that are older than the first three.
Now, this isn't the exact solution. There are possible problems with xargs because file names might contain weird characters or whitespace. If you can guarantee that's not the case, this should be okay.
Also, you probably want to group the files by name, and keep the last three. Hmm...
ls | sed 's/MM-DD-YYYY-HH:MM*$//' | sort -u | while read file
do
ls -t $file* | awk 'NR > 3' | xargs rm
done
The ls will list all of the files in the directory. The sed 's/MM-DD-YYYY-HH:MM*$//' will remove the date-time stamp from the file names. The sort -u will make sure you only have the unique file names. Thus
file1.txt-01-12-1950
file2.txt-02-12-1978
file2.txt-03-12-1991
Will be reduced to just:
file1.txt
file2.txt
These are placed through the loop, and the ls -t $file* will list all of the files that start with that file name, newest first; that output is piped to awk, which strips out the newest three from the list, and then to xargs rm, which deletes all but the newest three.
Assuming we're using the date in the filename to date the archive file, and that it is possible to change the date format to YYYY-MM-DD-HH:MM (as established in comments above), here's a quick and dirty shell script to keep the newest 3 versions of each file within the present working directory:
#!/bin/bash
KEEP=3 # number of versions to keep
while read FNAME; do
    NODATE=${FNAME:0:-16} # get filename without the date (remove last 16 chars)
    if [ "$NODATE" != "$LASTSEEN" ]; then # new file found
        FOUND=1; LASTSEEN="$NODATE"
    else # same file, different date
        let FOUND="FOUND + 1"
        if [ $FOUND -gt $KEEP ]; then
            echo "- Deleting older file: $FNAME"
            rm "$FNAME"
        fi
    fi
done < <(\ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}")
Example run:
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2011-12-12-12:11
some_file.exe2012-01-11-23:11
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
[me#home]$ ./delete_old.sh
- Deleting older file: some_file.exe2012-01-11-23:11
- Deleting older file: some_file.exe2011-12-12-12:11
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
Essentially, by changing the date in the file name to the form YYYY-MM-DD-HH:MM, a normal string sort (such as that done by ls) will automatically group similar files together, sorted by date and time.
The ls -r on the last line simply lists all files within the current working directory and prints the results in reverse order so newer archive files appear first.
We pass the output through grep to extract only files that are in the correct format.
The output of that command combination is then looped through (see the while loop) and we can simply start deleting after 3 occurrences of the same filename (minus the date portion).
This pipeline will get you the 3 newest files (by modification time) in the current dir
stat -c $'%Y\t%n' file* | sort -n | tail -3 | cut -f 2-
To get all but the 3 newest:
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2-
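Combined with xargs, that second pipeline can do the actual cleanup (a sketch; it assumes GNU xargs and file names without newlines):
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2- | xargs -d '\n' rm --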

find files with same name in different directories and count duplicates

I hope you can help me with the following problem. I have 24 directories, each containing many (thousands of) files. I would like to find out which combination of directories contains the largest number of duplicate (by name only) files. For example, if we only consider 4 directories
dir1 dir2 dir3 dir4
with the following directory contents
dir1
1.fa 2.fa 3.fa 4.fa 5.fa
dir2
1.fa 10.fa 15.fa
dir3
1.fa 2.fa 3.fa
dir4
1.fa 2.fa 3.fa 5.fa 8.fa 10.fa
Therefore, the combination of directories dir1 and dir4 contain the most duplicate files (4).
The problem becomes quite large with 24 directories so I was thinking that I might use a brute force approach. Something along the lines of
1. count all duplicate files that occur in all 24 directories
2. drop a directory and count the number of duplicate files
3. replace the directory, drop another one, then count the number
4. repeat for all directories
5. get the subset of 23 directories with the max number of duplicate files
6. repeat the above 2-5 and keep the 22 directories with the most duplicate files
7. repeat until only 2 directories are left
8. choose the combination of directories with the max number of duplicate files
If anyone has a way of doing this I would be very grateful for some advice. I thought of using fdupes or diff but can't figure out how to parse the output and summarise it.
I tagged your question with algorithm as I am unaware of any existing bash / Linux tools that can directly solve this problem. The easiest way would be to construct an algorithm for this in a programming language such as Python, C++, or Java instead of using bash.
That being said, here's a high-level analysis of your problem: at first glance it looks like a minimum set cover problem, but it actually breaks down into 2 parts:
Part 1 - What is the set of files to cover?
You want to find the combination of directories that cover the largest number of duplicate files. But first you need to know what the maximum set of duplicate files within your 24 directories is.
Since the intersection of files between 2 directories is always greater than or equal to the intersection with a 3rd directory, you go through all pairs of directories and find what the maximum intersection set is:
(24 choose 2) = 276 comparisons
You take the largest intersection set found and use that as the set you are actually trying to cover.
Part 2 - The minimum set cover problem
This is a well-studied problem in computer science, so you are better served reading from the writings of people much smarter than I.
The only thing I will note is that it's an NP-complete problem, so it's not trivial.
This is the best I can do to address the original formulation of your question, but I have a feeling that it's overkill for what you actually need to accomplish. You should consider updating your question with the actual problem that you need to solve.
Count duplicate file names in shell:
#! /bin/sh
# directories to test for
dirs='dir1 dir2 dir3 dir4'
# directory pairs already seen
seen=''
for d1 in $dirs; do
    for d2 in $dirs; do
        if echo $seen | grep -q -e " $d1:$d2;" -e " $d2:$d1;"; then
            : # don't count twice
        elif test $d1 != $d2; then
            # remember pair of directories
            seen="$seen $d1:$d2;"
            # count duplicates
            ndups=`ls $d1 $d2 | sort | uniq -c | awk '$1 > 1' | wc -l`
            echo "$d1:$d2 $ndups"
        fi
    done
# sort decreasing and take the first
done | sort -k 2rn | head -1
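Run against the four example directories from the question, the last line of that pipeline should print something like:
dir1:dir4 4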
./count_dups.sh:
1 files are duplicated comparing dir1 to dir2.
3 files are duplicated comparing dir1 to dir3.
4 files are duplicated comparing dir1 to dir4.
1 files are duplicated comparing dir2 to dir3.
2 files are duplicated comparing dir2 to dir4.
3 files are duplicated comparing dir3 to dir4.
./count_dups.sh | sort -n | tail -1
4 files are duplicated comparing dir1 to dir4.
Using the script count_dups.sh:
#!/bin/bash
# This assumes (among other things) that the dirs don't have spaces in the names
cd testdirs
declare -a DIRS=(`ls`);
function count_dups {
    DUPS=`ls $1 $2 | sort | uniq -d | wc -l`
    echo "$DUPS files are duplicated comparing $1 to $2."
}
LEFT=0
while [ $LEFT -lt ${#DIRS[@]} ] ; do
    RIGHT=$(( $LEFT + 1 ))
    while [ $RIGHT -lt ${#DIRS[@]} ] ; do
        count_dups ${DIRS[$LEFT]} ${DIRS[$RIGHT]}
        RIGHT=$(( $RIGHT + 1 ))
    done
    LEFT=$(( $LEFT + 1 ))
done
Can we create a hash table for all of these 24 directories?
If the filename is just a number, the hash function will be very easy to design.
If we can use a hash table, it will be faster to search for and find duplicates.
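A rough sketch of that idea in bash (it assumes bash 4+ for associative arrays and file names without newlines), hashing the names from one directory and then probing with the names of the other:
# count_dups_hashed dirA dirB  ->  number of file names present in both
count_dups_hashed() {
    local -A seen=()
    local f name n=0
    for f in "$1"/*; do                  # hash every name in the first directory
        seen["${f##*/}"]=1
    done
    for f in "$2"/*; do                  # probe the hash with each name from the second
        name=${f##*/}
        [ -n "${seen[$name]+x}" ] && n=$((n+1))
    done
    echo "$n"
}
count_dups_hashed dir1 dir4              # with the example data this should print 4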
Just out of curiosity, I've done some simple tests: 24 directories with approximately 3900 files in each (a random number between 0 and 9999). Both bash scripts take around 10 seconds each. Here is a basic Python script doing the same in ~0.2 s:
#!/usr/bin/python
import sys, os

def get_max_duplicates(path):
    items = [(d, set(os.listdir(os.path.join(path, d)))) \
        for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
    if len(items) < 2:
        # need at least two directories
        return ("", "", 0)
    values = [(items[i][0], items[j][0], len(items[i][1].intersection(items[j][1]))) \
        for i in range(len(items)) for j in range(i+1, len(items))]
    return max(values, key=lambda a: a[2])

def main():
    path = sys.argv[1] if len(sys.argv) == 2 else os.getcwd()
    r = get_max_duplicates(path)
    print "%s and %s share %d files" % r

if __name__ == '__main__':
    main()
As mentioned by Richard, by using a hash table (or a set in Python) we can speed things up. The intersection of two sets is O(min(len(set_a), len(set_b))) and we have to do N(N-1)/2 = 276 comparisons.
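Saved as, say, maxdups.py (a made-up name), the script can be pointed at the parent directory that holds the 24 folders:
python maxdups.py /path/to/parent_dir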
