Adding file size and other info to a find - linux

I am currently using the following linux command:
find /folder -size +1000k | grep txt
to find in the "folder" any files with "txt" in it that has a size greater than 1000k bytes.
This succesfully returns the list of files that I want. But I would also like to print out the file size, and if I can, to see last date modified within the returned list of files (much like what the command ll returns)
I tried using printf %s, but this simply returns a list of numbers, thus the grep doesn't work.

First -- there's absolutely no reason to use grep on output from find; you can just tell find to do the filtering itself.
Second -- the -printf action takes a format string which can have more than one specifier. For instance, %s %P\n, to print first the size, then the name, then a newline. (This ordering is deliberate: the size is always a single string of digits, whereas a name can contain spaces, so putting the size first keeps the output easy to parse.)
find /folder -size +1000k -name '*txt*' -printf '%s %P\n'
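As an aside (this is not part of the original answer): GNU find's -printf also has %T directives for the modification time, so something along these lines should cover the "last date modified" part of the question as well:
find /folder -size +1000k -name '*txt*' -printf '%s %TY-%Tm-%Td %TH:%TM %P\n'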
Mind you -- for output that can be parsed completely unambiguously, you'll want to use NUL delimiters rather than newlines, since newlines are valid inside filenames.
An example which reads filenames and sizes into a pair of bash arrays, after sorting by size:
files=( )
sizes=( )
while IFS= read -r -d' ' size && IFS= read -r -d '' filename; do
files+=( "$filename" )
sizes+=( "$size" )
done < <(find /folder -size +1000k -name '*txt*' -printf '%s %P\0' | sort -n -z)
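As a quick sanity check (a minimal sketch, assuming the arrays were filled as above), you can print each size/name pair back out:
for i in "${!files[@]}"; do
  printf '%12s  %s\n' "${sizes[i]}" "${files[i]}"
done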

Format xargs output to grep

I have a script that I'm trying to optimize with xargs. The current version uses find with -exec to call the command:
find -type f -iname "*.mp4" -print0 -printf '\n' -exec getfattr -d --absolute-names {} \;
after which I can pipe to grep with something like:
grep -z -P user\.md5\=\"$input_search_hash\"
to filter the results while keeping the whole output with -z.
I need the whole output returned from getfattr to be "preserved", per file, because I need the filename for which there is a matching extended attribute, which is then passed to sed to extract it. There are also cases where I have multiple grep commands in sequence if I need to search for files with multiple matches in the extended attributes. The problem is that the output of:
find -type f -iname "*.mp4" -print0 | xargs -0 getfattr -d --absolute-names
is not formatted in such a way that grep will filter in this way. This does work with the -exec method. Can I pass an additional option to xargs, or pipe in some additional command, that will format the output so grep properly replicates the behaviour of -exec? I'm guessing I need some sort of line break before feeding to grep, like what -printf '\n' does in the -exec method. I would just use getfattr to "search" the extended attributes instead of needing to grep the output at all, but it has no way to do this by supplying an xattr name and value.
Example
The input comes from the find command, which is a list of video files in an arbitrary directory structure. The output of each getfattr command, for each file, looks like this:
# file: /path/to/file/test.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="10"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="1645645"
If I attempt to grep the output of find using the + method, say for a value of "10" on the quality, I will get results like this:
# file: /path/to/file/test.mp4
user.md5="8cf97b888e6fdbed27b02233cd6779f5"
user.quality="12"
user.sha256="613d16b2a0270e2e5f81cfd58b1eacf710a65b82ce2dab49a1e415275440f429"
user.size="1645645"
# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"
# file: /path/to/file/test2.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="6"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="15645"
All files that find locates are returned, and the string being searched for with grep (in this example user.quality="10") is highlighted, but the other files test.mp4 and test2.mp4 still have their output printed post-grep. In other words, find may locate 1000 mp4 files of which maybe 20 have a user.quality="10" entry, but even applying grep to search for that string still returns 1000 filenames (after sed).
This does not happen when using \;. The only thing I would get out from grep would be:
# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"
This is the expected behaviour.
xargs vs find -exec
To me it seems like you want to use xargs instead of find -exec {} \; to speed things up.
Yes, xargs is faster than find -exec {} \;, not because it does the same work more efficiently, but because it does different work!
find -exec {} \; calls the command once for each file (getfattr file1, then getfattr file2, and so on).
xargs crams as many files into one call as possible (getfattr file1 file2 file3 ...).
The same behavior (and even more speedup) can be achieved with find -exec {} + -- no need to use xargs for that.
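Applied to the command from the question, that would be roughly:
find -type f -iname '*.mp4' -exec getfattr -d --absolute-names {} +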
With xargs and find -exec {} + you lose control over the output format. There is only one call of getfattr, so that program decides what to print between file1, file2 and so on. getfattr has no option to customize its output format.
No problem! You can ...
Parse getfattr's output
... pretty easily.
For starters, we assume that all path names are pretty normal. Spaces, *, and ? are ok though. For really unusual path names containing backslashes and linebreaks see the last section.
If you output only the relevant attribute using -n user.md5 instead of -d, then you know that the output (if any) for each file is always of the form
# file: path in a single line
user.md5=encoded value of the attribute
Files without the attribute user.md5 are not printed at all. They cause a warning on stderr which can be suppressed by 2> /dev/null.
Now, grep for matching attributes. Use grep -B1 to print the line above each match (i.e. the path) too. Then use sed -n or grep -o to extract the filenames.
find -type f -iname '*.mp4' -exec getfattr -n user.md5 --absolute-names {} + 2> /dev/null |
grep -B1 -Fx "user.md5=\"$input_search_hash\"" |
sed -n 's/^# file: //p'
The above command prints the paths of all mp4 files having the attribute user.md5 with value $input_search_hash.
Handling Unusual Filenames
At least my version (getfattr 2.4.48 by Andreas Gruenbacher) on Debian 10 always prints the file name in a single line. Linebreaks are encoded using \012 and backslashes are encoded using \134. Therefore, safe processing of those files is possible.
The above command works, but prints only the encoded file names. To get the actual filenames you have to extend the sed command or add another command to interpret octal escape sequences. For me, getfattr only escapes \n, \r and \\, thus sed 's:\\012:\n:g;s:\\015:\r:g;s:\\134:\\:g' should be sufficient for printing. For further processing, you may want to use tr \\n \\0 | sed -z ... instead, so that filenames are separated by null bytes.
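Putting those pieces together, a rough sketch (assuming getfattr escapes only \012, \015 and \134, as described above) could look like:
find -type f -iname '*.mp4' -exec getfattr -n user.md5 --absolute-names {} + 2> /dev/null |
grep -B1 -Fx "user.md5=\"$input_search_hash\"" |
sed -n 's/^# file: //p' |
sed 's:\\012:\n:g;s:\\015:\r:g;s:\\134:\\:g'
The last sed stage only matters for display; drop it if you feed the (still encoded) names to another tool that understands the escaping.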
To test which characters are escaped for you, create a filename containing all allowed bytes and let getfattr print its name:
f=$(printf $(printf '\\%o' $(seq 1 255)) | tr -d /)
touch "$f"
setfattr -n user.md5 -v 123 "$f"
getfattr -n user.md5 "$f"
rm "$f"

How to write a Linux shell script that removes files older than X days, but leaves the first file of the day by modification time?

As the title says, how could that shell script be implemented? I know it's easy to find and delete files older than, e.g., 29 days using:
find /some_folder/ -name "file_prefix*" -mtime +30 -exec rm {} \;
But how do I add an exception so that the first file of each day (by modification time) is not removed?
Not the most elegant - but it's a combination of a few answers - something like this will work:
d=2020-01-01
end_date=2020-02-03
while [ "$d" != $end_date ]; do
  d2=$(date -I -d "$d - 1 day")
  d=$(date -I -d "$d + 1 day")
  echo $d2
  echo $d
  find -type f -newerct "${d2}" ! -newerct "${d}" -printf "%T# %Tc %p\n" | sort -n | tail -n +2 | awk '{print $9}' | xargs rm
done
I'd suggest adding paths and commenting out the xargs rm bit at first (just to print and double-check what you're removing).
There's probably a more elegant way to do this than the print stuff, but it works.
For common filenames
find /some_folder/ -name "file_prefix*" -mtime +30 -printf '%TD %TT %p\n' |
sort |
awk '{if ($1==prevdate) print $3; prevdate=$1}' |
xargs rm
The find command will print %TD %TT %p, i.e. the last modification date followed by the last modification time and then the filepath (folder and filename).
The list is sorted by sort. Because of the date/time/filepath structure, this will sort by date then time so that the oldest files are printed first, which is important afterward.
awk parses each line and calls {if ($1==prevdate) print $3; prevdate=$1}. Because of the date/time/filepath structure, the date is $1, the time is $2 and the filepath is $3. This prints the filepath whenever the date is the same as the previously parsed date. So the first file of each day is not printed, because its date differs from that of the previous line, while all subsequent files of the same day are printed. Please note that prevdate is initially unassigned, which is roughly equivalent to the null string. You can write this instead if you find it more readable:
awk 'BEGIN{prevdate=""} {if ($1==prevdate) print $3; prevdate=$1}'
Finally, xargs rm will call rm for each line from the standard input which, at this moment, contains the list of files printed by awk.
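To illustrate (with hypothetical filenames; %TD prints mm/dd/yy and %TT the time), the sorted output and what awk does with it might look like this:
01/05/20 09:00:01 /some_folder/file_prefix_a.log    <- first file of 01/05, kept
01/05/20 13:30:00 /some_folder/file_prefix_b.log    <- printed by awk, removed
01/06/20 08:15:42 /some_folder/file_prefix_c.log    <- first file of 01/06, kept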
Handling spaces
If your filenames contain space characters, the solution can be slightly adjusted:
find /some_folder/ -name "file_prefix*" -mtime +30 -printf '%TD %TT %p\n' |
sort |
awk '{if ($1==prevdate) print; prevdate=$1}' |
cut -d ' ' -f3- |
xargs rm
awk prints the whole line instead of printing the filepath only, then the filename is extracted with cut -d ' ' -f3- before calling xargs rm.
Handling weird filenames
The above solutions do not work with filenames containing newlines and possibly won’t work with backslashes either.
I assume you won’t run into these issues because if you want to clean up the directory, chances are you already know what’s inside it, and these are probably log files or other types of files created automatically.
However, should you need to handle all types of filenames, the command below will do the trick:
unset prevdate currentdate filepath
find /some_folder/ -name "file_prefix*" -mtime +30 -printf '%TD %TT %p\0' |
sort -z |
while IFS= read -r -d '' line
do
  currentdate=${line%% *}
  if [ "$currentdate" = "$prevdate" ]
  then
    filepath=$(cut -d ' ' -f3- <<< "$line")
    rm -- "$filepath"
  fi
  prevdate=$currentdate
done
It behaves essentially like the initial solution but strings are separated by the null character (which is a forbidden character in a filename) instead of the traditional newline separation.
find outputs results with %TD %TT %p\0 instead of %TD %TT %p\n which means results are null-separated, then sort -z makes use of this null-separated result, and finally the while loop is a rewrite of the awk script, but makes use of null-separated strings (which is almost impossible to do with awk). There is no call to xargs rm because rm is called directly inside the while loop.
While the ability to handle all types of filenames is tempting, please note that this solution is significantly less efficient than the other solutions. The code I wrote is kept simple for educational purposes rather than optimized, but it would still be slower even if I optimized it.
Same date and time
If several "first file of the day" occur at the exact same time within a same day, it will only skip the first file with the "lowest" file path, i.e. sorted by its alphanumeric characters. If you want to keep all first files of the day with the exact same time, this is slightly more complicated but this is doable.

How do I classify files on a Linux server by their names?

How can I use the ls command and options to list the repetitious filenames that are in different directories?
You can't use a single, basic ls command to do this. You'd have to use a combination of other POSIX/Unix/GNU utilities. For example, to find the duplicate filenames first:
find . -type f -exec basename "\{}" \; | sort | uniq -d > dupes
This means find all the files (-type f) through the entire directory hierarchy in the current directory (.), and execute (-exec) the command basename (which strips the directory portion) on the found file (\{}), end of command (\;). These filenames are then sorted (sort) and the duplicated ones printed (uniq -d). The result goes in the file dupes. Now you have the filenames that are duplicated, but you don't know what directory they are in. Use find again to find them. Using bash as your shell:
while read filename; do find . -name "$filename" -print; done < dupes
This means loop through (while) all contents of the file dupes, reading each line into the variable filename. For each line, execute find again and search for the specific -name of the $filename and print it out (-print, but it's implicit so this is redundant).
Truth be told you can combine these without using an intermediate file:
find . -type f -exec basename "\{}" \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
If you're not familiar with it, the | operator means, execute the following command using the output of the previous command as the input to the following command. Example:
eje@EEWANCO-PC:~$ mkdir test
eje@EEWANCO-PC:~$ cd test
eje@EEWANCO-PC:~/test$ mkdir 1 2 3 4 5
eje@EEWANCO-PC:~/test$ mkdir 1/2 2/3
eje@EEWANCO-PC:~/test$ touch 1/0000 2/1111 3/2222 4/2222 5/0000 1/2/1111 2/3/4444
eje@EEWANCO-PC:~/test$ find . -type f -exec basename "\{}" \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
./1/0000
./5/0000
./1/2/1111
./2/1111
./3/2222
./4/2222
Disclaimer: The requirement stated that the filenames were all numbers. While I have tried to design the code to handle filenames with spaces (and in tests on my system, it works), the code may break when it encounters special characters, newlines, nuls, or other unusual situations. Please note that the -exec parameter has special security considerations and should not be used by root over arbitrary user files. The simplified example provided is intended for illustrative and didactic purposes only. Please consult your man pages and relevant CERT advisories for full security implications.
I have a function in my bash profile (bash 4.4) for duplicate files.
It is true that find is the correct tool.
I use find combined with the -print0 option, which separates the find results with a null character instead of newlines (find's default action). Now I can catch all files under the current directory and subdirectories.
This ensures that the results will be correct no matter whether filenames contain special characters like spaces or newlines (in some very rare cases). Instead of running find against find twice, you can build an array and just locate the duplicate files in this array. Then you grep the whole array using the "duplicates" as the pattern.
So something like this works ok for my function:
$ IFS= readarray -t -d '' fn < <(find . -name 'file*' -print0)
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[@]##*/}") | uniq -d)
$ grep -e "$dupes" <(printf '%s\n' "${fn[@]}") | awk -F/ '{print $NF,"==>",$0}' | LC_ALL=C sort
This is a test:
$ IFS= readarray -t -d '' fn < <(find . -name 'file*' -print0)
# find all files and load them in an array using null delimiter
$ printf '%s\n' "${fn[@]}" # print the array
./tmp/file7
./tmp/file14
./tmp/file11
./tmp/file8
./tmp/file9
./tmp/tmp2/file09 99
./tmp/tmp2/file14.txt
./tmp/tmp2/file15.txt
./tmp/tmp2/file$100
./tmp/tmp2/file14.txt.bak
./tmp/tmp2/file15.txt.bak
./tmp/file1
./tmp/file4
./file09 99
./file14
./file$100
./file1
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[@]##*/}") | uniq -d)
#Locate duplicate files
$ echo "$dupes"
\<file$100\>$ #Mind this one with special char $ in filename
\<file09 99\>$ #Mind also this one with spaces
\<file14\>$
\<file1\>$
# I have on purpose enclosed the results between \<...\> to force grep later to capture full words and avoid file1 matching file1.txt or file11
$ grep -e "$dupes" <(printf '%s\n' "${fn[#]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
file$100 ==> ./file$100 #File with special char correctly captured
file$100 ==> ./tmp/tmp2/file$100
file09 99 ==> ./file09 99 #File with spaces in name also correctly captured
file09 99 ==> ./tmp/tmp2/file09 99
file1 ==> ./file1
file1 ==> ./tmp/file1
file14 ==> ./file14 #other files named file14 like file14.txt and file14.txt.bak not captured since they are not duplicates.
file14 ==> ./tmp/file14
Tips:
This one, <(printf '\<%s\>$\n' "${fn[@]##*/}"), uses process substitution on the basenames of the find results, using bash's built-in parameter expansion techniques.
LC_ALL=C is required when sorting in order for filenames to be sorted correctly.
In bash versions before 4.4, readarray does not accept the -d option (delimiter). In that case you can transform the find results to an array with
while IFS= read -r -d '' res; do fn+=( "$res" ); done < <(find ... -print0)

Find top 500 oldest files

How can I find the top 500 oldest files?
What I've tried:
find /storage -name "*.mp4" -o -name "*.flv" -type f | sort | head -n500
Find 500 oldest files using GNU find and GNU sort:
#!/bin/bash
typeset -a files
export LC_{TIME,NUMERIC}=C
n=0
while ((n++ < 500)) && IFS=' ' read -rd '' _ x; do
  files+=("$x")
done < <(find /storage -type f \( -name '*.mp4' -o -name '*.flv' \) -printf '%T# %p\0' | sort -zn)
printf '%q\n' "${files[@]}"
Update - some explanation:
As mentioned by Jonathan in the comments, the proper way to handle this involves a lot of non-standard features which allows producing and consuming null-delimited lists so that arbitrary filenames can be handled safely.
GNU find's -printf produces the mtime (using the undocumented %T# format. My guess would be that whether or not this works depends upon your C library) followed by a space, followed by the filename with a terminating \0. Two additional non-standard features process the output: GNU sort's -z option, and the read builtin's -d option, which with an empty option argument delimits input on nulls. The overall effect is to have sort order the elements by the mtime produced by find's -printf string, then read the first 500 results into an array, using IFS to split read's input on space and discard the first element into the _ variable, leaving only the filename.
Finally, we print out the array using the %q format just to display the results unambiguously with a guarantee of one file per line.
The process substitution (<(...) syntax) isn't completely necessary but avoids the subshell induced by the pipe in versions that lack the lastpipe option. That can be an advantage should you decide to make the script more complicated than merely printing out the results.
None of these features are unique to GNU. All of this can be done using e.g. AST find(1), openbsd sort(1), and either Bash, mksh, zsh, or ksh93 (v or greater). Unfortunately the find format strings are incompatible.
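As a hypothetical follow-up (the /archive destination is invented here), once the files array is populated you can act on it directly, for example to move the 500 oldest files out of the way:
mv -t /archive -- "${files[@]}"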
The following finds the oldest 500 files with the oldest file at the top of the list:
find . -regex '.*\.\(mp4\|flv\)' -type f -print0 | xargs -0 ls -drt --quoting-style=shell-always 2>/dev/null | head -n500
The above is a pipeline. The first step is to find the file names, which is done by find. Any of find's options can be used to select the files of interest to you. The second step does the sorting. This is accomplished by xargs passing the file names to ls, which sorts on time in reverse order (-rt) so that the oldest files are at the top. The last step is head -n500, which takes just the first 500 file names. The first of those names will be the oldest file.
If there are more than 500 files, then head exits after printing 500 lines while ls is still writing; ls is killed by SIGPIPE and xargs issues a message that ls was terminated by signal 13. I redirected stderr from the xargs command to eliminate this harmless message.
The above solution assumes that all the filenames can fit on one command line in your shell. If they do not, xargs will run ls more than once; each run is sorted independently, so the combined output is no longer guaranteed to be globally sorted by time.

bash script collecting filenames seems to get confused by spaces

I'm trying to build a script that lists all the zip files in a set of directories, with some filters, and get it to spit them out to a file, but when a filename has a space in it, parts of the name end up on separate lines.
This list will eventually be used as an input to tar to gzip all the zip files, script is below:
#!/bin/bash
rm -f set1.txt
rm -f set2.txt
for line in $(find /home -type d -name assets ;);
do
echo $line >> set1.txt
for line in $(find $line -type f -name \*.zip -mtime +2 ;);
do
echo \"$line\" >> set2.txt
done;
done
This works as expected until you get a space in a filename then set2.txt contains entries like this:
"/home/xxxxxx/oldwebroot/htdocs/upload/assets/jobbags/rbjbCost"
"in"
"use"
"sept"
"2010.zip"
Does anyone know how I can get it to keep these filenames with spaces on a single line, with the whole lot wrapped in one set of quotes?
Thanks!
The correct way to loop over a set of files located via find is with a while read construct, thus:
while IFS= read -r -d '' line ; do
  echo "$line" >> set1.txt
  while IFS= read -r -d '' file ; do
    printf '"%s"\n' "$file" >> set2.txt
  done < <(find "$line" -type f -name \*.zip -mtime +2 -print0)
done < <(find /home -type d -name assets -print0)
For clarity I have given the inner loop variable a different name.
If you didn't have bash you'd have to issue the find command separately and redirect the output to a file, then read the file with while read ; do .. done < filename.
Note that each expansion of each variable is double-quoted. This is necessary.
Note also, however, that for what you want you can simply use the -printf switch to find, if you have GNU find.
find /home -type f -path '*/assets/*.zip' -mtime +2 -printf '"%p"\n' > set2.txt
Although, as @sarnold notes, this is not safe.
You should probably be executing your tar(1) command through some other mechanism; the find(1) program supports a -print0 option to request ASCII NUL-separated filename output, and the xargs(1) program supports a -0 option to tell it that the input is separated by ASCII NUL characters. (Since NUL is the only character that is not allowed in filenames, this is the only way to get reliable filename handling.)
Simply using the -print0 and -0 options will help but this still leaves the script open to another problem -- xargs(1) might decide to execute the tar(1) command two, three, or more times, depending upon its input. The last execution is the one that will "win", and the data from earlier invocations will be lost for ever. (This is useless as a backup.)
So you should also look into adding the --concatenate command line option to tar(1), too, so that it will add to the archive. It might make sense to perform the compression after all the files have been added, via gzip(1) or bzip2(1). (This does mean you need to remove the archive before a "fresh run" of this script.)
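Strictly speaking, --concatenate (-A) joins existing tar archives; the option that appends individual files to an archive is -r (--append). A rough sketch of the overall idea (assuming GNU tar; the archive name backup.tar is invented):
tar -cf backup.tar -T /dev/null          # start with an empty archive
find /home -type f -path '*/assets/*.zip' -mtime +2 -print0 |
  xargs -0 tar -rf backup.tar            # each xargs invocation appends, so nothing is lost
gzip backup.tar                          # compress once, at the end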
