Copy the latest updated file based on substring from filename in bash - linux

I have to archive some files (based on a date that is part of the filename) from a folder, but there can be multiple files with the same name (substring). I have to copy only the latest one to a separate folder.
For example:
20180730.abc.xyz2.jkl.20180729.164918.csv.gz
Here, 20180730 and 20180729 represent dates, and I have to search by the first date, 20180730. This part is done.
The searching part which I wrote is:
for FILE in $SOURCE_DIR/$BUSINESS_DT*
do
    # Here I have to check whether this FILENAME exists and, if yes, copy the latest file
    cp "${FILE}" $TARGET_DIR/
done
Now I have to check whether the same SOURCE_DIR contains a file with a name similar to 20180730.abc.xyz2.jkl., and if it exists, copy it.
So basically, I have to extract the portion abc.xyz2.jkl. I can't use cut with fields, as the filename could be either abc.xyz2.jkl or abc.xyz. This portion is variable and can also contain numbers; the last two numbers are also variable and can change.
Some examples are:
20180730.abc.xyz2.jkl.20170729.890789.csv.gz
20180730.abc.xyz2.20180729.121212.csv.gz
20180730.ab.xy.20180729.11111.csv.gz
Can anybody please help me with this? I tried find and cut but didn't get the required results.
Many Thanks

Python might be a better choice for implementing something like this, but here is a bash example. You can use a sed capture group (backreference) to extract the portion of the filename that you want, then use an associative array to store the filename of the newest file containing that substring.
Once that's done, you can go back and do the copy operations. Here is an example which extracts the string between the two 8-digit numbers and their surrounding periods. This sed expression may not work for your complete data set, but it works for the 3 examples you gave. Also, this won't handle cases where one unique identifier is a substring of another unique identifier.
declare -A LATEST
for FILE in "$SOURCE_DIR/$BUSINESS_DT"*
do
    # Extract the unique identifier between the two 8-digit dates
    # (basename keeps the directory part out of the match)
    HASH=$(basename "${FILE}" | sed 's/[0-9]\{8\}\.\(.*\)\.[0-9]\{8\}.*$/\1/')
    # If this is the first time we see this unique identifier,
    # then find the latest (most recently modified) matching file
    if [ -z "${LATEST[${HASH}]}" ]
    then
        LATEST[${HASH}]=$(find "$SOURCE_DIR" -type f -name "*${HASH}*" -printf '%T@ %p\n' | sort -n | tail -1 | cut -f2- -d" ")
    fi
done
for FILE in "${LATEST[@]}"
do
    cp "${FILE}" "$TARGET_DIR"/
done
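As a quick check, running the extraction step on its own against the first example name (illustration only, not part of the script) prints the unique identifier:
echo "20180730.abc.xyz2.jkl.20180729.164918.csv.gz" |
    sed 's/[0-9]\{8\}\.\(.*\)\.[0-9]\{8\}.*$/\1/'
# prints: abc.xyz2.jkl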

Related

Batch renaming files using variable extracted from file text

Apologies if this has been answered, but I've spent several hours experimenting and searching online for the solution.
I have a folder with several thousand text files named e.g. '1.dat', '2.dat', '3.dat', etc. I would like to rename all of the files by extracting an 8-digit numerical ID from within the file text (the ID is always on the last line, in columns 65-73), so that '1.dat' becomes '60741308.dat', etc.
I have made it as far as extracting the ID (using tail and cut) from the text file and assigning it to a variable, which I can then use to rename a single file, but I am unable to make it work as a batch process in a 'for' loop.
Here is what I have tried:
for i in *.dat
tmpname=$(tail -1 $i| cut -c 65-73)
mv $i $tmpname.dat
done
I get the following error: bash: syntax error near unexpected token `tmpname=$(tail -1 $i| cut -c 65-73)'
Any help much appreciated.
The syntax of a for loop in Bash is:
for i in {1..10}
do
echo $i
done
I can see that you are missing the do keyword in your example. So the correct version would be:
for i in *.dat
do
tmpname=$(tail -1 "$i" | cut -c 65-73)
mv "$i" "$tmpname.dat"
done
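If there is any chance that a file's last line is shorter than expected, or that two files carry the same ID, a slightly more defensive variant (the extra checks below are an optional addition, not something the original loop requires) could be:
for i in *.dat
do
    tmpname=$(tail -1 "$i" | cut -c 65-73)
    # skip the file if nothing was extracted, and never overwrite an existing target
    if [ -n "$tmpname" ] && [ ! -e "$tmpname.dat" ]; then
        mv "$i" "$tmpname.dat"
    fi
done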

Find and copy specific files by date

I've been trying to get a script working to backup some files from one machine to another but have been running into an issue.
Basically what I want to do is copy two files, one .log and one (or more) .dmp. Their format is always as follows:
something_2022_01_24.log
something_2022_01_24.dmp
I want to do three things with these files:
find the second to last .log file (i.e. if something_2022_01_24.log is the latest, I want to find the one before that, say something_2022_01_22.log)
get a substring with just the date (2022_01_22)
copy every .dmp that matches that date (i.e. something_2022_01_22.dmp, something01_2022_01_22.dmp)
For the first part, from what I could find, the best way is ls -t *.log | head -2, which lists the two most recently created .log files (the second line being the one I'm after).
As for the second one I'm more at a loss because I'm not sure how to parse the output of the first command.
The third one I think I could manage with something of the sort:
[ -f "/var/www/my_folder/*$capturedate.dmp" ] && cp "/var/www/my_folder/*$capturedate.dmp" /tmp/
What do you guys think? Is there any way to do this? How can I compare the substring?
Thanks!
Would you please try the following:
#!/bin/bash
dir="/var/www/my_folder"
second=$(ls -t "$dir/"*.log | head -n 2 | tail -n 1)
if [[ $second =~ .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log ]]; then
    capturedate=${BASH_REMATCH[1]}
    cp -p "$dir/"*"$capturedate".dmp /tmp
fi
second=$(ls -t "$dir/"*.log | head -n 2 | tail -n 1) will pick the
second to last log file. Please note it assumes that the timestamp
of the file has not been modified since it was created and that the filename
does not contain special characters such as a newline. This is a simple
solution and may need further improvement for robustness.
The regex .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log matches the log
filename. It extracts the date substring (the part enclosed in the parentheses) and makes it available in the bash variable
${BASH_REMATCH[1]}.
Then the next cp command will do the job. Please be careful
not to include the wildcard * within the double quotes, so that
the wildcard is properly expanded by the shell.
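If you want to check the regex capture in isolation, something like this (the sample path is purely illustrative) prints just the date part:
second="/var/www/my_folder/something_2022_01_22.log"
if [[ $second =~ .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log ]]; then
    echo "${BASH_REMATCH[1]}"   # prints 2022_01_22
fi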
FYI here are some alternatives to extract the date string.
With sed:
capturedate=$(sed -E 's/.*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log/\1/' <<< "$second")
With bash parameter expansion (if neither something nor the directory path contains underscores):
capturedate=${second%.log}
capturedate=${capturedate#*_}
With the cut command (if neither something nor the directory path contains underscores):
capturedate=$(cut -d_ -f2,3,4 <<< "${second%.log}")

How to sort by name then date modification in BASH

Let's say I have a folder of .txt files whose names consist of a dd-MM-yyyy_HH-mm-ss timestamp followed by _name.txt. I want to be able to sort by name first, then by time. Example:
BEFORE
15-2-2010_10-01-55_greg.txt
10-2-1999_10-01-55_greg.txt
10-2-1999_10-01-55_jason.txt
AFTER
greg_1_10-2-1999_10-01-55
greg_2_15-2-2010_10-01-55
jason_1_10-2-1999_10-01-55
Edit: Apologies, as my "cp" line suggests, I meant to copy them into another directory under a different name.
Something I tried is making a copy with a count appended, but it doesn't order files that share the same name correctly by date:
cd data/unfilteredNames
for filename in *.txt; do
n=${filename%.*}
n=${filename##*_}
filteredName=${n%.*}
count=0
find . -type f -name "*_$n" | while read name; do
count=$(($count+1))
cp -p $name ../filteredNames/"$filteredName"_"$count"
done
done
I'm not sure that renaming the files is really one of your requirements. At least for just sorting by file name, you don't need to rename anything.
You can do this using the GNU sort command alone:
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13 <(printf "%s\n" *.txt)
-t sets the field separator to a dash -.
-k makes sort use keys based on fields. As explained in the sort man page, the syntax is -k<start>,<stop>, where <start> and <stop> are each composed of <field number>.<character position>. Adding several -k options to the command sorts on multiple keys; the first one on the command line takes precedence over the later ones.
For example, the first key, -k5.4, sorts on the 5th field starting at its 4th character (the name part of the filename). There is no stop position because the key runs to the end of the filename.
The -k3.1,3.4 key sorts on the 3rd field, characters 1 through 4 (the year).
The same principle applies to the other -k options.
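For instance, running the command against just the three example names (piping them in here instead of using the process substitution, purely for illustration) prints them in the desired order:
printf "%s\n" 15-2-2010_10-01-55_greg.txt 10-2-1999_10-01-55_greg.txt 10-2-1999_10-01-55_jason.txt |
    sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13
# 10-2-1999_10-01-55_greg.txt
# 15-2-2010_10-01-55_greg.txt
# 10-2-1999_10-01-55_jason.txt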
In your example the month field has only 1 digit. If you also have files with the month coded as 2 digits, you might want to zero-pad the single-digit months before sorting. This can be done by filtering the printf output, for example <(printf "%s\n" *.txt | sed 's/-\([0-9]\)-/-0\1-/'), and changing -k2.1,2.1 to -k2.1,2.2.

Rename the most recent file in each group

I'm trying to create a script that detects the latest file of each group and adds a prefix to its original name.
ll $DIR
asset_10.0.0.1_2017.11.19 #latest
asset_10.0.0.1_2017.10.28
asset_10.0.0.2_2017.10.02 #latest
asset_10.0.0.2_2017.08.15
asset_10.1.0.1_2017.11.10 #latest
...
Two questions:
1) How do I find the latest file of each group?
2) How do I rename it, adding only a prefix?
I tried the following procedure, but it looks for the latest file in the entire directory, and doesn't keep the original name to add a prefix to it:
find $DIR -type f ! -name 'asset*' -print | sort -n | tail -n 1 | xargs -I '{}' cp -p '{}' $DIR...
What would be the best approach to achieve this? (keeping xargs if possible)
Selecting the latest entry in each group
You can use sort to select only the latest entry in each group:
find . -print0 | sort -r -z | sort -t_ -k2,2 -u -z | xargs ...
First, sort all files in reverse lexicographical order (so that the latest entry appears first within each group). Then, by sorting on the group name only (that's the second field, -k2,2, when split on underscores via -t_) and printing unique groups, we get only the first entry of each group, which is also the latest.
Note that this works because sort uses a stable sorting algorithm, meaning the order of already sorted items will not be altered by sorting them again. Also note we can't use uniq here because we can't specify a custom field delimiter for uniq (it's always whitespace).
Copying with prefix
To add a prefix to each filename found, we need to split each path find produces into a directory and a filename (basename), because the prefix goes on the filename only. The xargs part above could look like:
... | xargs -0 -I '{}' sh -c 'd="${1%/*}"; f="${1##*/}"; cp -p "$d/$f" "$d/prefix_$f"' _ '{}'
Path splitting is done with shell parameter expansion, namely prefix (${1##*/}) and suffix (${1%/*}) substring removal.
Note the use of NUL-terminated output (paths) in find (-print0 instead of -print), and the accompanying use of -z in sort and -0 in xargs. That way the complete pipeline will properly handle filenames (paths) with "special" characters like newlines and similar.
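Putting the two halves together (prefix_ is just a placeholder for whatever prefix you want, and -type f is added here only so directories are skipped), the complete pipeline might look like:
find . -type f -print0 | sort -r -z | sort -t_ -k2,2 -u -z |
    xargs -0 -I '{}' sh -c 'd="${1%/*}"; f="${1##*/}"; cp -p "$d/$f" "$d/prefix_$f"' _ '{}'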
If you want to do this in bash alone, rather than using external tools like find and sort, you'll need to parse the "fields" in each filename.
Something like this might work:
declare -A o=() # declare an associative array (req bash 4)
for f in asset_*; do # step through the list of files,
IFS=_ read -a a <<<"$f" # assign filename elements to an array
b="${a[0]}_${a[1]}" # define a "base" of the first two elements
if [[ "${a[2]}" > "${o[$b]}" ]]; then # compare the date with the last value
o[$b]="${a[2]}" # for this base and reassign if needed
fi
done
for i in "${!o[#]}"; do # now that we're done, step through results
printf "%s_%s\n" "$i" "${o[$i]}" # and print them.
done
This doesn't exactly sort; it just goes through the list of files and grabs the highest-sorting value for each filename base.
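Since the goal is to add a prefix to the newest file of each group rather than print it, the printf in that last loop could be swapped for a copy, for example (prefix_ again being just a placeholder):
for i in "${!o[@]}"; do
    # e.g. asset_10.0.0.1_2017.11.19 -> prefix_asset_10.0.0.1_2017.11.19
    cp -p "${i}_${o[$i]}" "prefix_${i}_${o[$i]}"
done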

Linux: List file names, if last modified between a date interval

I have 2 variables which contain dates like this: 2001.10.10
And I want to use ls with a filter that only lists files whose last modification date is between the first and the second date.
The best solution I can think of involves creating temporary files with the boundary timestamps, and then using find:
touch -t YYYYMMDD0000 oldest_file
touch -t YYYYMMDD0000 newest_file
find -maxdepth 1 -newer oldest_file -and -not -newer newest_file
rm oldest_file newest_file
You can use find's -printf option (for example -printf '%P\n') if you want to strip off the leading ./ from all the filenames.
If creating temporary files isn't an option, you might consider writing a script to calculate and print the age of a file, such as described here, and then using that as a predicate.
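For instance, if the two dates are held in variables in the 2001.10.10 form from the question (the variable names below are only illustrative), the dots can be stripped to build the touch timestamps:
from="2001.10.10"
to="2001.12.31"
touch -t "${from//./}0000" oldest_file   # 200110100000
touch -t "${to//./}0000" newest_file     # 200112310000
find -maxdepth 1 -newer oldest_file -and -not -newer newest_file
rm oldest_file newest_file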
Sorry, it is not the simplest. I just now developed it, only for you. :-)
ls -l --full-time | awk '{s=$6; gsub(/[-\.]/,"",s); if ((s >= "'"$from_variable"'") && (s <= "'"$to_variable"'")) print $0}'
The problem is that these simple command-line tools don't handle a date type. So first we convert the dates to plain numbers by removing the separating "-" and "." characters (in your case it is ".", in mine "-", so I remove both); you can see this in
gsub(/[-\.]/,"",s)
After the removal we can compare them as numbers. In this example we compare them against $from_variable and $to_variable, so this will list the files modified between $from_variable and $to_variable.
Both of "from_variable" and "to_variable" need to be environment variables in the form 20070707 (for 7. July, 2007).
