How to sort by name then modification date in BASH - linux

Let's say I have a folder of .txt files whose names consist of a dd-MM-yyyy_HH-mm-ss timestamp followed by _name.txt. I want to be able to sort by name first, then by time. Example:
BEFORE
15-2-2010_10-01-55_greg.txt
10-2-1999_10-01-55_greg.txt
10-2-1999_10-01-55_jason.txt
AFTER
greg_1_10-2-1999_10-01-55
greg_2_15-2-2010_10-01-55
jason_1_10-2-1999_10-01-55
Edit: Apologies; as my cp line shows, I meant to copy the files into another directory under a different name.
Something I tried is making a copy with a count appended, but it doesn't sort files that share a name correctly by date:
cd data/unfilteredNames
for filename in *.txt; do
    n=${filename##*_}        # keep everything after the last underscore, e.g. greg.txt
    filteredName=${n%.*}     # drop the extension, e.g. greg
    count=0
    find . -type f -name "*_$n" | while read -r name; do
        count=$(($count+1))
        cp -p "$name" ../filteredNames/"$filteredName"_"$count"
    done
done

I'm not sure that renaming the files is really part of your requirement. At least for merely sorting the filenames, you don't need to rename anything.
You can do this using only the GNU sort command:
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13 <(printf "%s\n" *.txt)
-t sets the field separator to a dash -.
-k enables sorting on fields. As explained in the man sort page, the syntax is -k<start>,<stop>, where <start> and <stop> are composed of <field number>.<position>. Adding several -k options to the command allows sorting on multiple fields; the earlier ones on the command line take precedence over the later ones.
For example, the first key, -k5.4, sorts on the 5th field starting at the 4th character. There is no stop position because this key runs to the end of the filename.
The -k3.1,3.4 option sorts on the 3rd field, from character 1 through character 4 (the year).
The same principle applies to other -k options.
In your example the month field has only 1 digit. If you also have files with a 2-digit month, you should zero-pad the 1-digit months in all filenames before sorting. This can be done by piping the printf output through sed, e.g. <(printf "%s\n" *.txt | sed 's/-\([0-9]\)-/-0\1-/') (anchoring on both dashes so months that already have 2 digits are left untouched), and changing -k2.1,2.1 to -k2.1,2.2.
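Putting the padding and the adjusted key together (a sketch assuming GNU sort and sed; note that the lines printed are the padded names, not the originals):
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.2 -k1.1,1.2 -k3.6,3.13 \
    <(printf "%s\n" *.txt | sed 's/-\([0-9]\)-/-0\1-/')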

Related

Find and copy specific files by date

I've been trying to get a script working to backup some files from one machine to another but have been running into an issue.
Basically what I want to do is copy two files, one .log and one (or more) .dmp. Their format is always as follows:
something_2022_01_24.log
something_2022_01_24.dmp
I want to do three things with these files:
find the second to last .log file (i.e. if something_2022_01_24.log is the latest, I want to find the one before that, say something_2022_01_22.log)
get a substring with just the date (2022_01_22)
copy every .dmp that matches the date (i.e. something_2022_01_24.dmp, something01_2022_01_24.dmp)
For the first one, from what I could find, the best way is ls -t *.log | head -2, which lists the two newest files, the second line being the second to last one created.
As for the second one I'm more at a loss because I'm not sure how to parse the output of the first command.
The third one I think I could manage with something of the sort:
[ -f "/var/www/my_folder/*$capturedate.dmp" ] && cp "/var/www/my_folder/*$capturedate.dmp" /tmp/
What do you guys think is there any way to do this? How can I compare the substring?
Thanks!
Would you please try the following:
#!/bin/bash
dir="/var/www/my_folder"
second=$(ls -t "$dir/"*.log | head -n 2 | tail -n 1)
if [[ $second =~ .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log ]]; then
    capturedate=${BASH_REMATCH[1]}
    cp -p "$dir/"*"$capturedate".dmp /tmp
fi
second=$(ls -t "$dir"/*.log | head -n 2 | tail -n 1) will pick the second to last log file. Please note it assumes the file's timestamp has not been modified since it was created and that the filename does not contain special characters such as a newline. This is a simple solution; more work would be needed to make it fully robust.
The regex .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log matches the log filename. It extracts the date substring (the part enclosed in parentheses) and assigns it to the bash variable ${BASH_REMATCH[1]}.
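For example (the path here is hypothetical), the capture can be tested in isolation:
second="/var/www/my_folder/something_2022_01_22.log"
if [[ $second =~ .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log ]]; then
    echo "${BASH_REMATCH[1]}"    # prints 2022_01_22
fi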
Then the next cp command will do the job. Please be careful not to include the wildcard * within the double quotes, so that the wildcard is properly expanded.
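For instance, the difference between the two quotings looks like this (paths hypothetical):
cp -p "$dir/"*"$capturedate".dmp /tmp    # * outside the quotes: the glob expands
cp -p "$dir/*$capturedate.dmp" /tmp      # * inside the quotes: a literal '*', which almost certainly matches nothing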
FYI here are some alternatives to extract the date string.
With sed:
capturedate=$(sed -E 's/.*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log/\1/' <<< "$second")
With parameter expansion of bash (if something does not include underscores):
capturedate=${second%.log}
capturedate=${capturedate#*_}
With cut command (if something does not include underscores):
capturedate=$(cut -d_ -f2,3,4 <<< "${second%.log}")
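As a quick sanity check of the parameter-expansion variant above (note these alternatives should be applied to the basename; on a full path, a directory name containing an underscore, such as my_folder, would break both the #*_ step and the cut fields):
second="something_2022_01_22.log"
capturedate=${second%.log}       # -> something_2022_01_22
capturedate=${capturedate#*_}    # -> 2022_01_22
echo "$capturedate"              # prints 2022_01_22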

How to sort and print array listing of specific file type in shell

I am trying to write a loop that extracts the text file names in all sub-directories and appends certain strings to them. Additionally, I want the file names sorted by the number after the ^.
For example, I have three sub directories mydir1, mydir2, mydir3. I have,
in mydir1,
file223^1.txt
file221^2.txt
file666^3.txt
in mydir2,
file111^1.txt
file4^2.txt
In mydir3,
file1^4.txt
file5^5.txt
The expected result final.csv:
STRINGmydir1file223^1
STRINGmydir1file221^2
STRINGmydir1file666^3
STRINGmydir2file111^1
STRINGmydir2file4^2
STRINGmydir3file1^4
STRINGmydir3file5^5
This is the code I tried:
for dir in my*/; do
    array=(${dir}/*.txt)
    IFS=$'\n' RGBASE=($(sort <<<"${array[@]}"))
    for RG in ${RGBASE[@]}; do
        RGTAG=$(basename ${RG/.txt//})
        echo "STRING${dir}${RGTAG}" >> final.csv
    done
done
Can someone please explain what is wrong with my code? Also, there could be other better ways to do this, but I want to use the for-loop.
The output with this code:
$ cat final.csv
STRINGdir1file666^3.txt
STRINGdir2file4^2.txt
STRINGdir3file5^5.txt
As a starting point which works for your special case, I got a two-liner for this:
mapfile -t array < <( find my* -name "*.txt" -printf "STRING^^%H^^%f\n" | cut -d"." -f1 | LANG=C sort -t"^" -k3,3 -k6 )
printf "%s\n" "${array[@]//^^/}"
To restrict the directory depth, you can add -maxdepth with the number of subdirectories to search. The find command can also use regex in the search, which is applied to the whole path; that can be used to work with a more complex directory tree.
The difficulty was sorting on two positions with a delimiter.
My idea was to add a delimiter which can easily be removed afterwards.
The sort command only handles a single-character delimiter, so I had to use the double hat ^^ as the delimiter, which can be removed without removing the single hat in the filename.
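The final ${array[@]//^^/} expansion strips every double hat while leaving the single hats intact, e.g.:
s="STRING^^mydir1^^file223^1"
echo "${s//^^/}"    # prints STRINGmydir1file223^1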
A solution using the decorate-sort-undecorate idiom could be:
printf "%s\n" my*/*.txt |
sed -E 's_(.*)/(.*)\^([0-9]+).*_\1\t\3\tSTRING\1\2^\3_' |
sort -t$'\t' -k1,1 -k2,2n |
cut -f3
assuming filenames don't contain tab or newline characters.
A basic explanation: The printf prints each pathname on a separate line. The sed converts the pathname dir/file^number.txt into dir\tnumber\tSTRINGdirfile^number (\t represents a tab character). The aim is to use the tab character as a field separator in the sort command. The sort sorts the lines by the first (lexicographically) and second fields (numerically). The cut discards the first and second fields; the remaining field is what we want.
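For example, the decorated line for mydir1/file223^1.txt can be inspected on its own (assuming GNU sed, where \t in the replacement produces a tab):
printf '%s\n' 'mydir1/file223^1.txt' |
    sed -E 's_(.*)/(.*)\^([0-9]+).*_\1\t\3\tSTRING\1\2^\3_'
# -> mydir1<TAB>1<TAB>STRINGmydir1file223^1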

How to find the last updated file with a given prefix in bash?

How can I find the last updated file with a specific prefix in bash?
For example, I have three files, and I just want to see the file whose name contains "ABC" with the latest Last_UpdatedDateTime, i.e. sorted descending by date.
fileName Last_UpdatedDateTime
abc123 7/8/2020 10:34am
abc456 7/6/2020 10:34am
def123 7/8/2020 10:34am
You can list files sorted in the order they were modified with ls -t:
-t sort by modification time, newest first
You can use globbing (abc*) to match all files starting with abc.
Since you will get more than one match and only want the newest (that is first):
head -1
Combined:
ls -t abc* | head -1
If there are a lot of these files scattered across a variety of directories, find might be better.
find -name abc\* -printf "%T@ %f\n" | sort -nr | sed 's/^.* //; q;'
Breaking that out -
find -name 'abc*' -printf "%T@ %f\n" |
find has a ton of options. This is the simplest case, assuming the current directory as the root of the search. You can add a lot of refinements, or just give / to search the whole system.
-name 'abc*' picks just the filenames you want. Quote it to protect any globs, but you can use normal globbing rules. -iname makes the search case-insensitive.
-printf defines the output. %f prints the filename, but you want the output ordered by date, so print the date first so the filename itself doesn't change the sort order. %T accepts another character to define the date format: @ is the unix epoch, seconds since 00:00:00 01/01/1970, so it is easy to sort numerically. On my git bash emulation it returns fractions as well, so the granularity is excellent.
$: find -name abc\* -printf "%T@ %f\n"
1594219755.7741618000 abc123
1594219775.5162510000 abc321
1594219734.0162554000 abc456
find may not return them in the order you want, though, so -
sort -nr |
-n makes it a numeric sort. -r sorts in reverse order, so that the latest file will pop out first and you can ignore everything after that.
sed 's/^.* //; q;'
Since the first record is the one we want, sed can just use s/^.* // to strip off everything up to the space, which we know will be the timestamp since we controlled the output explicitly. That leaves only the filename. q explicitly quits after the s/// scrubs the record, so sed spits out the filename and stops without reading the rest, which avoids another process (head -1) in the pipeline.
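Putting it together, you could capture the result in a variable (a sketch with the same caveats; s/^[^ ]* // is used here so a filename containing spaces survives intact):
newest=$(find -name 'abc*' -printf '%T@ %f\n' | sort -nr | sed 's/^[^ ]* //; q')
echo "$newest"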

Copy the latest updated file based on substring from filename in bash

I have to archive some files (based on a date which is in the filename) from a folder, but there can be multiple files with the same name (substring). I have to copy only the latest one to a separate folder.
for eg.
20180730.abc.xyz2.jkl.20180729.164918.csv.gz
Here, 20180730 and 20180729 represent dates; I have to search by the first date, 20180730. This part is done.
The searching part which I wrote is:
for FILE in $SOURCE_DIR/$BUSINESS_DT*
do
    # Here I have to search if this FILENAME exists and if yes, then copy that latest file
    cp "${FILE}" $TARGET_DIR/
done
Now I have to check whether the same SOURCE_DIR contains a file whose name starts with 20180730.abc.xyz2.jkl. and, if it exists, copy it.
So basically, I have to extract the portion abc.xyz2.jkl. I can't use cut with fields, as the filename could be either abc.xyz2.jkl or abc.xyz. The portion is variable and can also contain numbers; the last two numbers are also variable and can change.
Some eg are:
20180730.abc.xyz2.jkl.20170729.890789.csv.gz
20180730.abc.xyz2.20180729.121212.csv.gz
20180730.ab.xy.20180729.11111.csv.gz
Can anybody please help me with that? I tried find and cut but didn't get the required results.
Many thanks
Python might be a better choice for implementing something like this, but here is a bash example. You can use a sed capture group to extract the portion of the filename that you want, then use an associative array to store the filename of the newest file containing each substring found.
Once that's done, you can go back and do the copy operations. Here is an example which extracts the string between the two 8-digit numbers and their surrounding periods. This sed expression may not work for your complete data set, but it works for the 3 examples you gave. Also, this won't handle cases where one unique identifier is a subset of another unique identifier.
declare -A LATEST
for FILE in $SOURCE_DIR/$BUSINESS_DT*
do
    # Extract the substring unique identifier
    HASH=$(echo "${FILE}" | sed "s/[0-9]\{8\}\.\(.*\)\.[0-9]\{8\}.*$/\\1/g")
    # If this is the first time we have seen this unique identifier,
    # then find the latest matching file
    if [ "${LATEST[${HASH}]}abc" == abc ]
    then
        # Double quotes here so ${HASH} is expanded inside the -name pattern
        LATEST[${HASH}]=$(find . -type f -name "*${HASH}*" -printf '%T@ %p\n' | sort -n | tail -1 | cut -f2- -d" ")
    fi
done
# Iterate over the stored filenames (the array values, not the keys)
for FILE in "${LATEST[@]}"
do
    cp "${FILE}" "$TARGET_DIR"/
done
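As a quick check of the sed extraction on one of the sample names (using the bare filename; with a full path, the unmatched directory prefix would survive the substitution):
echo "20180730.abc.xyz2.jkl.20180729.164918.csv.gz" |
    sed "s/[0-9]\{8\}\.\(.*\)\.[0-9]\{8\}.*$/\1/"
# -> abc.xyz2.jkl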

Rename the most recent file in each group

I'm trying to create a script that detects the latest file of each group and adds a prefix to its original name.
ll $DIR
asset_10.0.0.1_2017.11.19 #latest
asset_10.0.0.1_2017.10.28
asset_10.0.0.2_2017.10.02 #latest
asset_10.0.0.2_2017.08.15
asset_10.1.0.1_2017.11.10 #latest
...
Two questions:
1) how do I find the latest file of each group?
2) how do I rename it by adding only a prefix?
I tried the following procedure, but it looks for the latest file in the entire directory, and doesn't keep the original name to add a prefix to it:
find $DIR -type f ! -name 'asset*' -print | sort -n | tail -n 1 | xargs -I '{}' cp -p '{}' $DIR...
What would be the best approach to achieve this? (keeping xargs if possible)
Selecting the latest entry in each group
You can use sort to select only the latest entry in each group:
find . -print0 | sort -r -z | sort -t_ -k2,2 -u -z | xargs ...
First, sort all files in reverse lexicographical order (so that the latest entry appears first for each group). Then, by sorting on the group name only (that's the second field, -k2,2, when split on underscores via -t_) and printing unique groups, we get only the first entry per group, which is also the latest.
Note that this works because sort uses a stable sorting algorithm, meaning the order of already-sorted items will not be altered by sorting them again. Also note we can't use uniq here, because we can't specify a custom field delimiter for uniq (it's always whitespace).
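A quick illustration of the idea on two sample names (newline-delimited here for readability):
printf '%s\n' asset_10.0.0.1_2017.10.28 asset_10.0.0.1_2017.11.19 |
    sort -r | sort -t_ -k2,2 -u
# -> asset_10.0.0.1_2017.11.19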
Copying with prefix
To add prefix to each filename found, we need to split each path find produces to a directory and a filename (basename), because we need to add prefix to filename only. The xargs part above could look like:
... | xargs -0 -I '{}' sh -c 'd="${1%/*}"; f="${1##*/}"; cp -p "$d/$f" "$d/prefix_$f"' _ '{}'
Path splitting is done with shell parameter expansion, namely prefix (${1##*/}) and suffix (${1%/*}) substring removal.
Note the use of NUL-terminated output (paths) in find (-print0 instead of -print), and the accompanying use of -z in sort and -0 in xargs. That way the complete pipeline will properly handle filenames (paths) with "special" characters like newlines and similar.
If you want to do this in bash alone, rather than using external tools like find and sort, you'll need to parse the "fields" in each filename.
Something like this might work:
declare -A o=()                           # declare an associative array (req bash 4)
for f in asset_*; do                      # step through the list of files,
    IFS=_ read -a a <<<"$f"               # assign filename elements to an array
    b="${a[0]}_${a[1]}"                   # define a "base" of the first two elements
    if [[ "${a[2]}" > "${o[$b]}" ]]; then # compare the date with the last value
        o[$b]="${a[2]}"                   # for this base and reassign if needed
    fi
done
for i in "${!o[@]}"; do                   # now that we're done, step through results
    printf "%s_%s\n" "$i" "${o[$i]}"      # and print them.
done
This doesn't exactly sort; it just steps through the list of files and keeps the highest-sorting value for each filename base.
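If you actually want to rename rather than print (question 2), the final loop could add the prefix instead; here "latest_" is a hypothetical prefix and the loop is assumed to run inside $DIR:
for i in "${!o[@]}"; do
    f="${i}_${o[$i]}"          # reconstruct the winning filename
    cp -p "$f" "latest_$f"     # or mv "$f" "latest_$f" to rename in place
done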
