How to chain 'mimetype -b' and the 'find' command to get file names and file types in the same CSV? - linux

I would like to get filenames, creation dates, modification dates and file MIME types from a directory structure. I've made a script which reads as follows:
#!/bin/bash
output="file_list.csv"
## columns
echo '"File name";"Creation date";"Modification date";"Mime type"' > $output
## content
find "$1" -type f -printf '"%f";"%Tc";"%Cc";"no idea!"\n' >> "$output"
which gives me encouraging results:
"File name";"Creation date";"Modification date";"Mime type"
"Exercice 4 Cluster.xlsx";"ven. 27 mars 2020 10:35:46 CET";"mar. 17 mars 2020 19:14:18 CET";"no idea!"
"Exercice 5 Bayes.xlsx";"ven. 27 mars 2020 10:36:30 CET";"ven. 20 mars 2020 16:18:54 CET";"no idea!"
"Exercice 3 Régression.xlsx";"ven. 27 mars 2020 10:36:46 CET";"mer. 28 août 2019 17:21:10 CEST";"no idea!"
"Archers et Clustering.xlsx";"ven. 27 mars 2020 10:37:34 CET";"lun. 16 mars 2020 14:12:05 CET";"no idea!"
...
but I'm missing a crucial piece: how do I get the files' MIME types? It would be great if I could chain the 'mimetype -b' command on each file found by 'find', and write the result in the corresponding column.
Thanks in advance,
Cyril

You might try using the -exec option of the find command, in which the brackets {} represent the name of the current file.
You can then drop the \n from the -printf format: mimetype terminates its own output with a newline, so each record already ends on its own line.
Last, you want a closing quote after the mime type, so rather than the -b option use --output-format, which gives you more control over what is displayed.
Hence the third command of your script should look like this:
find $1 -type f -printf '"%f";"%Tc";"%Cc";"' -exec mimetype --output-format %m\" {} \; >> $output

This is what I came up with:
for entry in *; do stat --printf='"%n";"%z";"%y";"' -- "$entry"; file -00 --mime-type -- "$entry" | cut -d $'\0' -f2; echo '"'; done
Uses a shell "for loop", to perform a stat on the directory entries in the current directory. Then uses file to get the mime type, and pipes that to cut to get only the mime type (by excluding the file name which is also printed by file).
The format for stat is what I believe was requested -- the file name, the last change date, the last modification date (both in ISO format, but could easily be made to UNIX seconds-since-epoch by upper-casing Z and Y).
Availability:
file: usually its own small package on Linux, typically installed by default; preinstalled on macOS.
bash/zsh: easily accessible both on Linux and macOS.
stat and cut: part of GNU coreutils, so preinstalled on practically every Linux system; note that macOS ships BSD stat, whose options differ (no --printf), so the command above is Linux-oriented.
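To mirror the CSV layout from the original question, the same loop can write a header row and append each record to a file. A sketch assuming GNU stat and file (the output file name is illustrative):
output="file_list.csv"
echo '"File name";"Change date";"Modification date";"Mime type"' > "$output"
for entry in *; do
    [ -f "$entry" ] || continue                    # skip directories etc.
    stat --printf='"%n";"%z";"%y";"' -- "$entry"   # name, change time, mtime
    printf '%s"\n' "$(file -b --mime-type -- "$entry")"
done >> "$output"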

Related

Remove first 3 columns from `ls`?

If I do ls -o I get
-rw-rw-r-- 1 louise 347967 Aug 28 2017 Screenshot from 2017-08-28 09-33-01.png
-rw-rw-r-- 1 louise 377739 Aug 29 2017 Screenshot from 2017-08-29 10-39-49.png
-rw-rw-r-- 1 louise 340682 Aug 29 2017 Screenshot from 2017-08-29 10-40-02.png
I really want to remove the first 3 columns, so I get
347967 Aug 28 2017 Screenshot from 2017-08-28 09-33-01.png
377739 Aug 29 2017 Screenshot from 2017-08-29 10-39-49.png
340682 Aug 29 2017 Screenshot from 2017-08-29 10-40-02.png
ls can't do this, it seems. There are other questions here at SO about removing multiple columns, but not from the beginning.
ls is an interactive tool, whose output is not supposed to be parsed.
Consider using an alternative tool such as stat (GNU version recommended):
stat -c '%s %y %n' *
The output isn't quite the same but you have full control over the format. stat --help gives more information about the possible format sequences.
With GNU stat you can also use --printf to add escape characters such as newlines or tabs in the format string, to make parsing easier:
stat --printf '%s\t%Y\t%n\n' *
%Y (last modification, seconds since Epoch) is more readily suited to parsing than %y (human-readable).
This would still break in cases where the filename contained a newline, so depending on how you plan on using this information, you may want to use a \0 instead of a \n at the end of the format string and process records terminated with a null-byte instead of a newline.
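For instance, NUL-terminated records can be consumed safely in bash with read -d '' (a sketch, assuming GNU stat):
stat --printf '%s\t%Y\t%n\0' * |
while IFS=$'\t' read -r -d '' size modified name; do
    printf 'size=%s mtime=%s name=%s\n' "$size" "$modified" "$name"
done
Because the records are NUL-delimited, even file names containing newlines come through intact.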
Alternatively, you may find it easier to just loop through the files and call stat on them one by one, extracting whatever you need:
for file in *; do
read -r size modified name < <(stat -c '%s %Y %n' "$file")
# do whatever with $size, $modified and $name here
done
Assuming you go with the loop-based approach, you can convert the date to any format you want using date, for example:
date -d #"$modified" +'%b %d %H:%m'

Bash command to archive files daily based on date added

I have a suite of scripts that involve downloading files from a remote server and then parsing them. Each night, I would like to create an archive of the files downloaded that day.
Some constraints are:
Downloading from a Windows server to an Ubuntu server.
Inability to delete files on the remote server.
Require the date added to the local directory, not the date the file was created.
I have deduplication running at the downloading stage; however, (using ncftp), the check involves comparing the remote and local directories. A strategy is to create a new folder each day, download files into it and then tar it sometime after midnight. A problem arises in that the first scheduled download on the new day will grab ALL files on the remote server because the new local folder is empty.
Because of the constraints, I considered simply archiving files based on "date added" to a central folder. This works very well using a Mac because HFS+ stores extended metadata such as date created and date added. So I can combine a tar command with something like below:
mdls -name kMDItemFSName -name kMDItemDateAdded -raw *.xml | \
xargs -0 -I {} echo {} | \
sed 'N;s/\n/ /'
but there doesn't seem to be an analogue under Linux (at least not with ext4, as far as I am aware).
I am open to any form of solution to get around doubling up files into a subsequent day. The end result should be an archives directory full of tar.gz files looking something like:
files_$(date +"%Y-%m-%d").tar.gz
Depending on the method used to back up the files, the modified or changed date should reflect the time it was copied. For example, if you used cp -p to back them up, the modified date would not change, but the changed date would reflect the time of the copy.
You can get this information using the stat command:
stat <filename>
which will return the following (along with other file related info not shown):
Access: 2016-05-28 20:35:03.153214170 -0400
Modify: 2016-05-28 20:34:59.456122913 -0400
Change: 2016-05-29 01:39:52.070336376 -0400
This output is from a file that I copied using cp -p at the time shown as 'change'.
You can get just the change time by calling stat with a specified format:
stat -c '%z' <filename>
2016-05-29 01:39:56.037433640 -0400
or with capital Z for that time in seconds since epoch. You could combine that with the date command to pull out just the date (or use grep, etc)
date -d "`stat -c '%z' <filename>" -I
2016-05-29
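For example, to test whether a file was changed today (a sketch; the path is a placeholder and GNU stat/date are assumed):
f=/path/to/file    # hypothetical path
if [ "$(date -d "$(stat -c '%z' -- "$f")" -I)" = "$(date -I)" ]; then
    echo "changed today: $f"
fi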
The find command can be used to find files by time frame, in this case using the flags -cmin (changed minutes), -mmin (modified minutes), or, less likely, -amin (accessed minutes). The sequence of commands to get the minutes since midnight is a little ugly, but it works.
We have to pass find an argument of "minutes since a file was last changed" (or modified, if that criteria works). So first you have to calculate the minutes since midnight, then run find.
min_since_mid=$(echo $(( $(date +%s) - $(date -d "$(date -I) 0" +%s) )) / 60 | bc)
Unrolling that a bit:
$(date +%s) == seconds since epoch until 'now'
"(date -I) 0" == todays date in format "YYYY-MM-DD 0" with 0 indicating 0 seconds into the day
$(date -d "(date -I 0" +%s)) == seconds from epoch until today at midnight
Then we (effectively) echo ( $now - $midnight ) / 60 to bc to convert the results into minutes.
The find call is passed the minutes since midnight with a leading '-' indicating up to X minutes ago. A '+' would indicate X minutes or more ago.
find /path/to/base/folder -cmin -"$min_since_mid"
The actual answer
Finally to create a tgz archive of files in the given directory (and subdirectories) that have been changed since midnight today, use these two commands:
min_since_mid=$(echo $(( $(date +%s) - $(date -d "$(date -I) 0" +%s) )) / 60 | bc)
find /path/to/base/folder -cmin -"${min_since_mid:-0}" -print0 | tar --null -czvf /path/to/new/tarball.tgz -T -
The -print0 argument to find delimits the file names with a null byte, and tar's --null -T - reads that null-delimited list from standard input; this prevents issues with spaces (or even newlines) in names, and piping everything into a single tar invocation avoids the archive being overwritten, which could happen with -exec ... + if find splits the file list across several tar runs.
The only thing I'm not sure on is you should use the changed time (-cmin), the modified time (-mmin) or the accessed time (-amin). Take a look at your backup files and see which field accurately reflects the date/time of the backup - I would think changed time, but I'm not certain.
Update: changed -"$min_since_mid" to -"${min_since_mid:-0}" so that if min_since_mid isn't set you won't error out with invalid argument - you just won't get any results. You could also surround the find with an if statement to block the call if that variable isn't set properly.
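To run this nightly, both commands could live in a small script that cron invokes shortly before midnight. A sketch (all paths, the script name, and the schedule are illustrative):
#!/bin/bash
# archive_today.sh (hypothetical) -- tar up files changed since midnight
min_since_mid=$(( ( $(date +%s) - $(date -d "$(date -I) 0" +%s) ) / 60 ))
find /path/to/base/folder -cmin -"${min_since_mid:-0}" -print0 |
    tar --null -czf "/path/to/archives/files_$(date +%Y-%m-%d).tar.gz" -T -
with a crontab entry such as:
55 23 * * * /path/to/archive_today.sh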

Bash - Get files for last 12 hours / sophisticated name format

I have a set of logs which have the names as follows:
SystemOut_15.07.20_23.00.00.log
SystemOut_15.07.21_10.27.17.log
SystemOut_15.07.21_16.48.29.log
SystemOut_15.07.22_15.57.46.log
SystemOut_15.07.22_13.03.46.log
From that list I need to get only files for last 12 hours.
So as an output I will receive:
SystemOut_15.07.22_15.57.46.log SystemOut_15.07.22_13.03.46.log
I had similar issue with files having below names but was able to resolve that quickly as the date comes in an easy format:
servicemix.log.2015-07-21-11
servicemix.log.2015-07-22-12
servicemix.log.2015-07-22-13
So I created a variable called 'day':
day=$(date -d '-12 hour' +%F-%H)
And used below command to get the files for last 12 hours:
ls | awk -F. -v x=$day '$3 >= x'
Can you help me do the same with the SystemOut files? Their name syntax, with underscores, confuses me.
Assuming the date-time in the log file's name is in the format
YY.MM.DD_HH24.MI.SS:
day=$(date -d '-12 hour' +%Y.%m.%d_%H.%M.%S.log)
Prepend the century to the 2-digit year in the log file name and then compare:
ls | awk -F_ -v x="$day" '"20"$2"_"$3 >= x'
For SystemOut_15.07.22_15.57.46.log, for example, $2 is 15.07.22 and $3 is 15.57.46.log, so the string compared is 2015.07.22_15.57.46.log, which lines up with the format of $day.
Alternatively, as Ed Morton suggested, find can be used like so:
find . -type f -name '*.log' -cmin -720
This returns the log files created within last 720 minutes. To be precise, this means file status was last changed within the past 720 minutes. -mmin option can be used to search by modification time.
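With a recent GNU findutils, -newermt also accepts a relative date string, which avoids the minute arithmetic altogether (a sketch, assuming GNU find):
find . -type f -name '*.log' -newermt '12 hours ago'
This matches files whose modification time falls within the last 12 hours.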

How can I format the output of a stat expression in the Linux GNOME Terminal?

I am a real newbie in Linux (Fedora 20) and I am trying to learn the basics.
I have the following command
echo "`stat -c "The file "%n" was modified on ""%y" *Des*`"
This command returns me this output
The file Desktop was modified on 2014-11-01 18:23:29.410148517 +0000
I want to format it as this:
The file Desktop was modified on 2014-11-01 at 18:23
How can I do this?
You can't really do that with stat (unless you have a smart version of stat I'm not aware of).
With date
Very likely, your date is smart enough and handles the -r switch.
date -r Desktop +"The file Desktop was modified on %F at %R"
Because of your glob, you'll need a loop to handle all files that match *Des* (in Bash):
shopt -s nullglob
for file in *Des*; do
date -r "$file" +"The file ${file//%/%%} was modified on %F at %R"
done
With find
Very likely your find has a rich -printf option:
find . -maxdepth 1 -name '*Des*' -printf 'The file %f was modified on %TY-%Tm-%Td at %TH:%TM\n'
I want to use stat
(because your date doesn't handle the -r switch, you don't want to use find, or just because you like using as many tools as possible to impress your little sister). Well, in that case, the safest thing to do is:
date -d "#$(stat -c '%Y' Desktop)" +"The file Desktop was modified on %F at %R"
and with your glob requirement (in Bash):
shopt -s nullglob
for file in *Des*; do
date -d "#$(stat -c '%Y' -- "$file")" +"The file ${file//%/%%} was modified on %F at %R"
done
stat -c "The file "%n" was modified on ""%y" *Des* | awk 'BEGIN{OFS=" "}{for(i=1;i<=7;++i)printf("%s ",$i)}{print "at " substr($8,0,6)}'
I have use here awk modify your code. what i have done in this code, from field 1,7 i printed it using for loop, i need to modify field 8, so i used substr to extract 1st 5 character.

tar extracting most recent file

Using bash, I have a directory /home/user/logs/ containing:
Aug 2 15:34 backup.20120802.tar.gz
Aug 3 00:26 backup.20120803.tar.gz
Aug 4 00:25 backup.20120804.tar.gz
Aug 15 06:39 backup.20120816.tar.gz
This gets updated every few days, but if something goes wrong I want to automatically restore the most recent backup. How can I use bash to extract only the most recent one?
ls -t1 /home/user/logs/ | head -1
gives you the most recent modified file in /home/user/logs/.
So you could do:
cd /dir/to/extract
tar -xzf "$(ls -t1 /home/user/logs/ | head -1)"
NOTE:
this assumes that /home/user/logs/ is flat and contains nothing but "*.tar.gz" files
If the timestamps may not always be reliable, sort by the date embedded in the file name instead:
ls -1 /home/user/logs/backup.*.tar.gz | sort -t . -k2rn | head -1
Ideally, you should not parse the output from ls, but if there are only regularly named files matching the wildcard, it may be the easiest solution; sort expects line-oriented input, anyway, so the task becomes more involved in the general case of completely arbitrary file names. (This may make no sense to you, but it would be perfectly okay as far as Unix is concerned to have a file named backup.20120816.tar.gz(newline)backup.20380401.tar.gz.)
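Since the date stamps in these names sort lexicographically, another parse-free option is to let the shell's own sorted glob expansion pick the newest file. A sketch, assuming bash 4.3+ (for the negative array index) and an illustrative extraction directory:
shopt -s nullglob
backups=(/home/user/logs/backup.*.tar.gz)    # glob expands in sorted order
(( ${#backups[@]} )) || { echo 'no backups found' >&2; exit 1; }
latest=${backups[-1]}                        # newest date stamp sorts last
tar -xzf "$latest" -C /dir/to/extract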