Linux: List file names if last modified within a date interval

I have two variables which contain dates like this: 2001.10.10
And I want to use ls with a filter that lists only the files whose last modification time falls between the first and second date.

The best solution I can think of involves creating temporary files with the boundary timestamps, and then using find:
touch -t YYYYMMDD0000 oldest_file
touch -t YYYYMMDD0000 newest_file
find . -maxdepth 1 -newer oldest_file -and -not -newer newest_file
rm oldest_file newest_file
You can use find's -printf '%P\n' action instead of the default output if you want to strip off the leading ./ from all the filenames.
If creating temporary files isn't an option, you might consider writing a script to calculate and print the age of a file, such as described here, and then using that as a predicate.
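Here is a minimal sketch tying that together, assuming the two dates live in shell variables $from and $to in the 2001.10.10 form from the question (the variable names are my own):
from=2001.10.10
to=2002.03.01
oldest=$(mktemp)
newest=$(mktemp)
# touch -t expects [[CC]YY]MMDDhhmm, so strip the dots and append a time;
# 2359 on the end date makes the interval cover the whole final day
touch -t "$(tr -d . <<<"$from")0000" "$oldest"
touch -t "$(tr -d . <<<"$to")2359" "$newest"
find . -maxdepth 1 -type f -newer "$oldest" ! -newer "$newest"
rm -f "$oldest" "$newest"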

Sorry, it is not the simplest. I developed it just now, only for you. :-)
ls -l --full-time|awk '{s=$6;gsub(/[-\.]/,"",s);if ((s>="'"$from_variable"'") && (s<="'"$to_variable"'")) {print $0}}';
The problem is that these simple command-line tools don't handle dates as a type. So first we convert the dates to integers by removing the separating "-" and "." characters (in your case the separator is ".", in mine it is "-", so I remove both), which you can see in
gsub(/[-\.]/,"",s)
After the removal, we can compare them as plain integers. In this example, we compare them with the integers $from_variable and $to_variable, so this lists the files modified between $from_variable and $to_variable.
Both from_variable and to_variable need to be shell variables in the form 20070707 (for 7 July 2007).
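If GNU find is available, a hedged alternative skips parsing ls output entirely: the -newerXY tests accept a date string directly. A minimal sketch, reusing the same variables:
from_variable=2007.07.07
to_variable=2007.08.01
# Turn the dots into dashes so find can parse the dates, then keep files
# modified after the start of $from_variable and not after the start of $to_variable
find . -maxdepth 1 -type f -newermt "${from_variable//./-}" ! -newermt "${to_variable//./-}"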

Related

How to find the last-updated file with a given prefix in bash?

How can I find the last-updated file with a specific prefix in bash?
For example, I have three files, and I just want the file whose name contains "abc" with the newest Last_UpdatedDateTime.
fileName  Last_UpdatedDateTime
abc123    7/8/2020 10:34am
abc456    7/6/2020 10:34am
def123    7/8/2020 10:34am
You can list files sorted in the order they were modified with ls -t:
-t sort by modification time, newest first
You can use globbing (abc*) to match all files starting with abc.
Since you will get more than one match and only want the newest (that is, the first):
head -1
Combined:
ls -t abc* | head -1
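With the sample files above (given the modification times listed), this prints:
abc123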
If there are a lot of these files scattered across a variety of directories, find might be better.
find -name 'abc*' -printf "%T@ %f\n" | sort -nr | sed 's/^.* //; q;'
Breaking that out -
find -name 'abc*' -printf "%T@ %f\n" |
find has a ton of options. This is the simplest case, assuming the current directory as the root of the search. You can add a lot of refinements, or just give / to search the whole system.
-name 'abc*' picks just the filenames you want. Quote it to protect any globs, but you can use normal globbing rules. -iname makes the search case-insensitive.
-printf defines the output. %f prints the filename, but you want the result ordered by date, so print the date first so that the filename itself doesn't change the sort order. %T accepts another character to define the date format: @ is the Unix epoch, seconds since 00:00:00 01/01/1970, so it is easy to sort numerically. On my Git Bash emulation it returns fractional seconds as well, so the granularity is excellent.
$: find -name abc\* -printf "%T@ %f\n"
1594219755.7741618000 abc123
1594219775.5162510000 abc321
1594219734.0162554000 abc456
find may not return them in the order you want, though, so -
sort -nr |
-n makes it a numeric sort. -r sorts in reverse order, so that the latest file will pop out first and you can ignore everything after that.
sed 's/^.* //; q;'
Since the first record is the one we want, sed can just use s/^.* //; to strip off everything up to the last space, which we know is the timestamp field since we controlled the output explicitly. That leaves only the filename. q explicitly quits after the s/// scrubs the record, so sed prints the filename and stops without reading the rest, which avoids the need for another process (head -1) in the pipeline.
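Run against the sample listing above, the whole pipeline therefore prints only the newest match:
$: find -name abc\* -printf "%T@ %f\n" | sort -nr | sed 's/^.* //; q;'
abc321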

Find files whose content matches a line from a text file

I have a text file - accessions.txt (below is a subset of this file):
KRO94967.1
KRO95967.1
KRO96427.1
KRO94221.1
KRO94121.1
KRO94145.1
WP_088442850.1
WP_088252850.1
WP_088643726.1
WP_088739685.1
WP_088283155.1
WP_088939404.1
And I have a directory with multiple files (*.align).
I want to find the filenames (*.align) which content matches any line within my accessions.txt text file.
I know that find . -exec grep -H 'STRING' {} + works to find specific strings (e.g. replacing STRING with WP_088939404.1 returns every filename where the string WP_088939404.1 is present).
Is there a way to replace STRING with "all strings inside my text file" ?
Or
Is there another (better) way to do this?
I was trying to avoid writing a loop that reads the content of all my files as there are too many of them.
Many thanks!
You're looking for grep's -f option.
find . -name '*.align' -exec grep -Fxqf accessions.txt {} \; -print
grep can take a list of patterns to match with -f.
grep -lFf accessions.txt directory/*.align
-F tells grep to interpret the lines as fixed strings, not regex patterns.
Sometimes, -w is also needed to prevent matching inside words, e.g.
abcd
might match not only abcd, but also xabcd or abcdy. Sometimes, preprocessing the input list is needed to prevent unwanted matching if the rules are more complex.
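Combining both suggestions into one hedged sketch (assumes GNU grep; the directory name is illustrative):
# -r recurse, -l print only names of matching files, -F fixed strings,
# -w whole-word matches, -f read the patterns from accessions.txt
grep -rlFwf accessions.txt --include='*.align' directory/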

How to sort by name then modification date in bash

Let's say I have a folder of .txt files whose names are a dd-MM-yyyy_HH-mm-ss timestamp followed by _name.txt. I want to be able to sort by name first, then by time. Example:
BEFORE
15-2-2010_10-01-55_greg.txt
10-2-1999_10-01-55_greg.txt
10-2-1999_10-01-55_jason.txt
AFTER
greg_1_10-2-1999_10-01-55
greg_2_15-2-2010_10-01-55
jason_1_10-2-1999_10-01-55
Edit: Apologies; as my "cp" line shows, I meant to copy them into another directory under a different name.
Something I tried is making a copy with a count appended, but it doesn't order files sharing the same name correctly by date:
cd data/unfilteredNames
for filename in *.txt; do
    n=${filename%.*}
    n=${filename##*_}
    filteredName=${n%.*}
    count=0
    find . -type f -name "*_$n" | while read name; do
        count=$(($count+1))
        cp -p "$name" ../filteredNames/"$filteredName"_"$count"
    done
done
It isn't clear that renaming the files is really one of your requirements; for merely sorting the filenames, you don't need to rename anything.
You can do this using the GNU sort command alone:
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13 <(printf "%s\n" *.txt)
-t sets the field separator to a dash -.
-k enables sorting based on fields. As explained in the man sort page, the syntax is -k<start>,<stop>, where <start> and <stop> are each composed of <field number>.<position>. Adding several -k options to the command sorts on multiple fields, the first on the command line taking precedence over the later ones.
For example, the first option, -k5.4, says to sort on the 5th field starting at an offset of 4 characters. There is no stop position because the key runs to the end of the filename.
The -k3.1,3.4 option sorts on the 3rd field, from character 1 through character 4.
The same principle applies to the other -k options.
In your example the month field only has 1 digit. If you have files whose month is coded with 2 digits, you will want to zero-pad the single-digit months in all filenames first. This can be done by piping the printf output through sed inside the process substitution, e.g. <(printf "%s\n" *.txt | sed 's/-\([0-9]\)-/-0\1-/'), and changing -k2.1,2.1 to -k2.1,2.2.
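As a quick sanity check, feeding the question's sample names through the command reproduces the desired order:
printf "%s\n" 15-2-2010_10-01-55_greg.txt 10-2-1999_10-01-55_greg.txt 10-2-1999_10-01-55_jason.txt |
    sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13
10-2-1999_10-01-55_greg.txt
15-2-2010_10-01-55_greg.txt
10-2-1999_10-01-55_jason.txt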

Copy the latest updated file based on substring from filename in bash

I have to archive some files (based on a date which is in the filename) from a folder, but there can be multiple files with the same name (substring). I have to copy only the latest one to a separate folder.
for eg.
20180730.abc.xyz2.jkl.20180729.164918.csv.gz
Here, 20180730 and 20180729 represent dates; I have to search by the first date, 20180730. This part is done.
The searching part which I wrote is:
for FILE in $SOURCE_DIR/$BUSINESS_DT*
do
    # Here I have to search if this FILENAME exists and, if yes, copy that latest file
    cp "${FILE}" $TARGET_DIR/
done
Now I have to check whether the same SOURCE_DIR contains a file with a name similar to 20180730.abc.xyz2.jkl., and if it exists then I have to copy it.
So basically, I have to extract the portion abc.xyz2.jkl. I can't use cut with fixed fields, as the filename could be either abc.xyz2.jkl or abc.xyz; the portion is variable and can also contain numbers, and the last two numbers are also variable and can change.
Some eg are:
20180730.abc.xyz2.jkl.20170729.890789.csv.gz
20180730.abc.xyz2.20180729.121212.csv.gz
20180730.ab.xy.20180729.11111.csv.gz
Can anybody please help me with this? I tried find and cut but didn't get the required results.
Many thanks!
Python might be a better choice for implementing something like this, but here is a bash example. You can use a sed capture group to extract the portion of the filename that you want, then use an associative array to store the name of the newest file containing each substring found.
Once that's done, you can go back and do the copy operations. Here is an example which extracts the string between the two 8-digit numbers and their surrounding periods. This sed expression may not work for your complete data set, but it works for the 3 examples you gave. Also, this won't handle cases where one unique identifier is a subset of another unique identifier.
declare -A LATEST
for FILE in $SOURCE_DIR/$BUSINESS_DT*
do
    # Extract the unique-identifier substring between the two 8-digit dates
    HASH=$(echo "${FILE}" | sed "s/[0-9]\{8\}\.\(.*\)\.[0-9]\{8\}.*$/\1/")
    # The first time we see this unique identifier,
    # record the newest matching file
    if [ -z "${LATEST[${HASH}]}" ]
    then
        LATEST[${HASH}]=$(find . -type f -name "*${HASH}*" -printf '%T@ %p\n' | sort -n | tail -1 | cut -f2- -d" ")
    fi
done
for FILE in "${LATEST[@]}"
do
    cp "${FILE}" "$TARGET_DIR"/
done

Using grep to identify a pattern

I have several documents hosted on a cloud instance. I want to extract all words conforming to a specific pattern into a .txt file. This is the pattern:
ABC123A
ABC123B
ABC765A
and so on. Essentially the words start with a specific character string, 'ABC', have a fixed number of numerals, and end with a letter. This is my code:
grep -oh ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
When I execute the query, it runs for several hours without generating any output. I have over 1100 documents. However, when I run this query:
grep -r ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
the list of files containing the strings is generated in a matter of seconds.
What do I need to correct in my query? Also, what is causing the delay?
UPDATE 1
Based on the answers, it's evident that the command is missing the name of the file on which it needs to be executed. I want to run the code on multiple document files (>1000).
The documents I want searched are in multiple sub-directories within a directory. What is a good way to search through them? Doing
grep -roh ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
only returns the file names.
UPDATE 2
If I use the updated code from the answer below:
find . -exec grep -oh "ABC[0-9].*[a-zA-Z]$" >> ~/abcLetterMatches.txt {} \;
I get a no file or directory error
UPDATE 3
The pattern can be anywhere in the line.
You can use this regexp:
grep -E "^ABC[0-9]{3}[A-Z]$" docs > filename
ABC123A
ABC123B
ABC765A
There is no delay; grep is just waiting for the input you didn't give it (by default it reads standard input). You can correct your command by supplying the filename as an argument:
grep -oh "ABC[0-9].*[a-zA-Z]$" file.txt > /home/user/abcLetterMatches.txt
Source (man grep):
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
To perform the same grepping on several files recursively, combine it with the find command:
find . -exec grep -oh "ABC[0-9].*[a-zA-Z]$" >> ~/abcLetterMatches.txt {} \;
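If that ordering looks odd, the redirection can equivalently go at the end; adding -type f also keeps find from handing directories to grep (a sketch of the same command, not a different technique):
find . -type f -exec grep -oh "ABC[0-9].*[a-zA-Z]$" {} \; >> ~/abcLetterMatches.txt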
This does what you ask for:
grep -hr '^ABC[0-9]\{3\}[A-Za-z]$'
-h to not get the filenames.
-r to search recursively. If no directory is given (as above) the current one is used. Otherwise just specify one as the last argument.
Quotes around the pattern to avoid accidental globbing, etc.
^ at the beginning of the pattern, together with $ at the end, makes only whole lines match. (Not sure if this was a requirement, but the sample data suggests it.)
\{3\} to specify that there should be three digits.
No .* as that would match a whole lot of other things.
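Putting it together for the layout described in the question (the documents path is illustrative):
grep -hr '^ABC[0-9]\{3\}[A-Za-z]$' /home/user/documents > /home/user/abcLetterMatches.txt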
