Sorting files based on a timestamp within the file - linux

I have a bunch of files which contain an ascii header with a time stamp WITHIN the file, followed by a large chunck of binary data. I would like to list the files sorted by this time stamp, at the command line (bash, etc).
The file headers look similar to the following:
encoding: raw
endian: big
dimension: 4
sizes: 128 128 1 4
spacings: 1.0 1.0 1.0 NaN
position: -3164,-13678
date_time: 06.02.12.18:59
user_name: Operator1
sample_name:
dwell_time: 4.000
count_time: 65.536
duration: 202.000
raster: 79912
pixel_width: 624.3125
pixel_height: 624.3125
....binary data....
I would like to sort based on the "date_time" time stamp, which uses the format dd.mm.yy.hh:mm
The sort --key option looks promising but all my attempts have failed. Any help is much appreciated. Thanks.

Assuming these files are images, so you can use a tool like exiftool to rename them based on their creation dates, then sort them by name.
If you can't rename them, just dump the names with the creation date to STDOUT and sort that, e.g.:
exiftool -p '$dateTimeOriginal $filename' -q -f DIRECTORY/WHERE/IMAGES/ARE | sort -n
If you only want the filenames in the output, append a | cut -f 2 -d " " to the end.
If it's file format which is not recognized by exiftool this might or might not work:
for f in YOURFILES* ; do
filedate=`grep --binary-file=text -i -o 'date_time: ...........:..' $f | head -1`
echo "$filedate $f"
done | sort -n
Note: this won't work when there are spaces in the filenames (and I'm leaving that to you to resolve). And if you want to output only the sorted filenames, append | awk '{print $NF}' after the sort -n.

Related

pasting many files to a single large file

i have many text files in a directory like 1.txt 2.txt 3.txt 4.txt .......2000.txt and i want to paste them to make a large file.
In this regard i did something like
paste *.txt > largefile.txt
but the above command reads the .txt file randomly, so i need to read the files sequentially and paste as 1.txt 2.txt 3.txt....2000.txt
please suggest a better solution for pasting many files.
Thanks and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised with just like ulimit -n 99999.
Without raising the soft limit, go with a temporary file that would accumulate results each "round" of ulimit -n count of files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
paste accumulator.txt "$#" > accumulator.txt.sav;
mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
Instead use the wildcard * to enumerate all your files in a directory, if your file names pattern are sequentially ordered, you can manually list all files in order and concatenate to a large file. The output order of * enumeration might look different in different environment, as it not works as you expect.
Below is a simple example
$ for i in `seq 20`;do echo $i > $i.txt;done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt with number 1 to 20 in each file respectively
$ cat {1..20}.txt
# show content of all file in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them to a large file named 1_20.txt
In bash or any other shell, glob expansions are done in lexicographical order. When having files numberd, this sadly means that 11.txt < 1.txt < 2.txt. This weird ordering comes from the fact that, lexicographically, 1 < . (<dot>-character (".")).
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(sprintf "%0.5d.txt" ${i%.*}"); done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that it is possible that a file is missing (eg. 1234.txt). So you can do
shopt -s extglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one glob-matches. So this will exclude the missing files but keeps the order.

How to sort by name then date modification in BASH

Lets say I have a folder of .txt files that have a dd-MM-yyyy_HH-mm-ss time followed by _name.txt. I want to be able to sort by name first then time after. Example:
BEFORE
15-2-2010_10-01-55_greg.txt
10-2-1999_10-01-55_greg.txt
10-2-1999_10-01-55_jason.txt
AFTER
greg_1_10-2-1999_10-01-55
greg_2_15-2-2010_10-01-55
jason_1_10-2-1999_10-01-55
Edit: Apologies, from my "cp" line I was meant to copy them into another directory with a different name to them.
Something I tried to do is make a copy with the count, but it doesn't sort the files with the same name properly in terms of dates:
cd data/unfilteredNames
for filename in *.txt; do
n=${filename%.*}
n=${filename##*_}
filteredName=${n%.*}
count=0
find . -type f -name "*_$n" | while read name; do
count=$(($count+1))
cp -p $name ../filteredNames/"$filteredName"_"$count"
done
done
Not sure that the renaming of files is one of your expectation. At least for only sorting file name, you don't need to.
You can do this by only using GNU sort command:
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13 <(printf "%s\n" *.txt)
-t sets the field separator to a dash -.
-k enables to sort based on fields. As explained in man sort page, the syntax is -k<start>,<stop> where <start> or is composed of <field number>.<position>. Adding several -k option to the command allows to sort on multiple fields; the first in he command line having more precedence than the other.
For example, the first -k5.4 tells to sort based on the 5th fields with an offset of 4 characters. There isn't a stop field because this is the end of the filename.
The -k3.1,3.4 option sorts based on the 3rd field starting from offset 1 to 4.
The same principle applies to other -k options.
In your example the month field only has 1 digit. If you have files with a month coded with 2 digits, you might want to pad with 0 all month filenames. This can be done by adding to the printf statement this <(... | sed 's/-0\?\([0-9]\)/-0\1/') and change the -k 2.1,2.1 by -k2.1,2.2.

How to stop creating files at a specific number?

I want to create 900 files with 3 letter names.
Example: xyz.txt, where x, y, and z are different letters from a to z.
I'm trying to do something like this:
for ((i=1; i < 900; i++))
do touch {a..z}{a..z}{a..z}$i.txt
done
I expect it to stop at 900.
But in the end I'm creating over 10K of files. Can someone please help me accomplish this?
To avoid any names with repeating characters before the period, as alluded to in comments, you could do this:
printf '%s.txt\n' {a..z}{a..z}{a..z} | grep -vE '(.).?\1.?\.' \
| head -n 900 | xargs touch
The printf statement prints the list from aaa.txt, aab.txt to zzz.txt.
grep -vE '(.).?\1.?\.' filters any names where the three characters before the period are not unique: aaa.txt, aab.txt, aba.txt and baa.txt are all filtered.
head -n 900 gets the first 900 names from the list
xargs touch calls touch as few times as possible but will make sure the command line never is too long.
If you want to randomize the file names (but still adhere to these criteria), you can shuffle them before selecting 900:
printf '%s.txt\n' {a..z}{a..z}{a..z} | grep -vE '(.).?\1.?\.' \
| shuf | head -n 900 | xargs touch
You can do this with 2 lines:
all=( {a..z}{a..z}{a..z}.txt )
touch "${all[#]:0:900}"
(assuming your OS allows a commandline with 900 arguments). If y ouwant something other than the first 900 such files, you'll need to do something more complicated.
The filenames remind the split default output naming. So you could split an empty file to 900 files.
split /dev/null -n900 -a3 --additional-suffix=".txt" ""
The following generates 910 names, and discards the last 10 before passing them to touch.
printf '%s.txt\n' {a..m}{p..v}{0..9} | head -n 900 | xargs touch
You might be able to find a combination which generates exactly 900 alternatives so you can just touch <pattern> and have the shell expand it to the required number of unique names.

Linux sort -Help Wanted

I'm stuck in a problem for few days. Here it is maybe u got bigger brains than me!
I got a bunch of CSV files and i want them concatenated into a single .csv file, numeric sorted. Ok, first encountered problem is with the ID (i want to sort unly by ID) name.
eg
sort -f *.csv > output.csv This would work if i had standard ids like id001, id002, id010, id100
but my ids are like id1, id2, id10, id100 and this make my sort job inaccurate.
Ok
sort -t, -V *.csv > output.csv - This works perfectly on my test machine (sort --version GNU coreutils 8.5.0) but my live machine from work got 5.3.0 sort version (and they didn't had implemented -V syntax on it) and i cannot update it!
I'm feel so noob and unlucky
If you have a better idea please bring it on.
my csv file looks like
cn41 AQ34070YTW CDEAQ34070YTW 9C:B6:54:08:A3:C6 9C:B6:54:08:A3:C4
cn42 AQ34070YTY CDEAQ34070YTY 9C:B6:54:08:A4:22 9C:B6:54:08:A4:20
cn43 AQ34070YV1 CDEAQ34070YV1 9C:B6:54:08:9F:0E 9C:B6:54:08:9F:0C
cn44 AQ34070YV3 CDEAQ34070YV3 9C:B6:54:08:A3:7A 9C:B6:54:08:A3:78
cn45 AQ34070YW7 CDEAQ34070YW7 9C:B6:54:08:25:22 9C:B6:54:08:25:20
This is actually copy / paste from a csv. So let's say, this is my first CSV. and the other one looks like
cn201 AQ34070YTW CDEAQ34070YTW 9C:B6:54:08:A3:C6 9C:B6:54:08:A3:C4
cn202 AQ34070YTY CDEAQ34070YTY 9C:B6:54:08:A4:22 9C:B6:54:08:A4:20
cn203 AQ34070YV1 CDEAQ34070YV1 9C:B6:54:08:9F:0E 9C:B6:54:08:9F:0C
cn204 AQ34070YV3 CDEAQ34070YV3 9C:B6:54:08:A3:7A 9C:B6:54:08:A3:78
cn205 AQ34070YW7 CDEAQ34070YW7 9C:B6:54:08:25:22 9C:B6:54:08:25:20
Looking forward reading you!
Regards
You can use the -kX.Y for column X starting on Y character, together with -n for numeric:
sort -t, -k2.3 -n *csv
Given your sample file, it produces:
$ sort -t, -k2.3 -n file
,id1,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id2,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id10,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id40,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id101,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id201,aaaaaaaaa,bbbbbbbbbb,ccccccccccc,ddddddd
Update
For your given input, I would do:
$ cat *csv | sort -k1.3 -n
cn41 AQ34070YTW CDEAQ34070YTW 9C:B6:54:08:A3:C6 9C:B6:54:08:A3:C4
cn42 AQ34070YTY CDEAQ34070YTY 9C:B6:54:08:A4:22 9C:B6:54:08:A4:20
cn43 AQ34070YV1 CDEAQ34070YV1 9C:B6:54:08:9F:0E 9C:B6:54:08:9F:0C
cn44 AQ34070YV3 CDEAQ34070YV3 9C:B6:54:08:A3:7A 9C:B6:54:08:A3:78
cn45 AQ34070YW7 CDEAQ34070YW7 9C:B6:54:08:25:22 9C:B6:54:08:25:20
cn201 AQ34070YTW CDEAQ34070YTW 9C:B6:54:08:A3:C6 9C:B6:54:08:A3:C4
cn202 AQ34070YTY CDEAQ34070YTY 9C:B6:54:08:A4:22 9C:B6:54:08:A4:20
cn203 AQ34070YV1 CDEAQ34070YV1 9C:B6:54:08:9F:0E 9C:B6:54:08:9F:0C
cn204 AQ34070YV3 CDEAQ34070YV3 9C:B6:54:08:A3:7A 9C:B6:54:08:A3:78
cn205 AQ34070YW7 CDEAQ34070YW7 9C:B6:54:08:25:22 9C:B6:54:08:25:20
If your CSV format is fixed, you can use the shell equivalent of the decorate-sort-undecorate pattern:
cat *.csv | sed 's/^,id//' | sort -n | sed 's/^/,id/' >output.csv
The -n option is present even in ancient version of sort.
UPDATE: the updated input contains a number with a different prefix, and at a different position in the line. Here is a version that handles both kinds of input, as well as other inputs that have a number somewhere in the line, sorting by the first number:
cat *.csv | sed 's/^\([^0-9]*\)\([0-9][0-9]*\)/\2 \1\2/' \
| sort -n \
| sed 's/^[^ ]* //' > output.csv
You could try the -g option:
sort -t, -k 2.3 -g fileName
-t seperator
-k key/column
-g general numeric sort

Clearing archive files with linux bash script

Here is my problem,
I have a folder where is stored multiple files with a specific format:
Name_of_file.TypeMM-DD-YYYY-HH:MM
where MM-DD-YYYY-HH:MM is the time of its creation. There could be multiple files with the same name but not the same time of course.
What i want is a script that can keep the 3 newest version of each file.
So, I found one example there:
Deleting oldest files with shell
But I don't want to delete a number of files but to keep a certain number of newer files. Is there a way to get that find command, parse in the Name_of_file and keep the 3 newest???
Here is the code I've tried yet, but it's not exactly what I need.
find /the/folder -type f -name 'Name_of_file.Type*' -mtime +3 -delete
Thanks for help!
So i decided to add my final solution in case anyone liked to get it. It's a combination of the 2 solutions given.
ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}" | awk 'NR > 3' | xargs rm
One line, super efficiant. If anything changes on the pattern of date or name just change the grep -P pattern to match it. This way you are sure that only the files fitting this pattern will get deleted.
Can you be extra, extra sure that the timestamp on the file is the exact same timestamp on the file name? If they're off a bit, do you care?
The ls command can sort files by timestamp order. You could do something like this:
$ ls -t | awk 'NR > 3' | xargs rm
THe ls -t lists the files by modification time where the newest are first.
The `awk 'NR > 3' prints out the list of files except for the first three lines which are the three newest.
The xargs rm will remove the files that are older than the first three.
Now, this isn't the exact solution. There are possible problems with xargs because file names might contain weird characters or whitespace. If you can guarantee that's not the case, this should be okay.
Also, you probably want to group the files by name, and keep the last three. Hmm...
ls | sed 's/MM-DD-YYYY-HH:MM*$//' | sort -u | while read file
do
ls -t $file* | awk 'NR > 3' | xargs rm
done
The ls will list all of the files in the directory. The sed 's/\MM-DD-YYYY-HH:MM//' will remove the date time stamp from the files. Thesort -u` will make sure you only have the unique file names. Thus
file1.txt-01-12-1950
file2.txt-02-12-1978
file2.txt-03-12-1991
Will be reduced to just:
file1.txt
file2.txt
These are placed through the loop, and the ls $file* will list all of the files that start with the file name and suffix, but will pipe that to awk which will strip out the newest three, and pipe that to xargs rm that will delete all but the newest three.
Assuming we're using the date in the filename to date the archive file, and that is possible to change the date format to YYYY-MM-DD-HH:MM (as established in comments above), here's a quick and dirty shell script to keep the newest 3 versions of each file within the present working directory:
#!/bin/bash
KEEP=3 # number of versions to keep
while read FNAME; do
NODATE=${FNAME:0:-16} # get filename without the date (remove last 16 chars)
if [ "$NODATE" != "$LASTSEEN" ]; then # new file found
FOUND=1; LASTSEEN="$NODATE"
else # same file, different date
let FOUND="FOUND + 1"
if [ $FOUND -gt $KEEP ]; then
echo "- Deleting older file: $FNAME"
rm "$FNAME"
fi
fi
done < <(\ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}")
Example run:
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2011-12-12-12:11
some_file.exe2012-01-11-23:11
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
[me#home]$ ./delete_old.sh
- Deleting older file: some_file.exe2012-01-11-23:11
- Deleting older file: some_file.exe2011-12-12-12:11
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
Essentially, but changing the file name to dates in the form to YYYY-MM-DD-HH:MM, a normal string sort (such as that done by ls) will automatically group similar files together sorted by date-time.
The ls -r on the last line simply lists all files within the current working directly print the results in reverse order so newer archive files appear first.
We pass the output through grep to extract only files that are in the correct format.
The output of that command combination is then looped through (see the while loop) and we can simply start deleting after 3 occurrences of the same filename (minus the date portion).
This pipeline will get you the 3 newest files (by modification time) in the current dir
stat -c $'%Y\t%n' file* | sort -n | tail -3 | cut -f 2-
To get all but the 3 newest:
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2-

Resources