Count number of files (with arbitrary filenames) in a given directory - linux

I am trying to count the number of files in a given directory matching a specific name pattern. While this initially sounded like a no-brainer, the issue turned out to be more complicated than I ever thought because the filenames can contain spaces and other nasty characters.
So, starting from an initial find -name "${filePattern}" | wc -l I have by now reached this expression:
find . -maxdepth 1 -regextype posix-egrep -regex "${filePattern}" -print0 | wc -l --files0-from=-
The -maxdepth option restricts the search to the current directory only. The -print0 and --files0-from options of find and wc, respectively, emit and accept filenames which are null-byte terminated. This is supposed to take care of possible special characters contained in the filenames.
BUT: the --files0-from= option interprets the strings as filenames and wc thus counts the lines contained in those files. What I need, however, is simply the number of files themselves (i.e. the number of null-byte terminated strings emitted by find). For that wc would need a -l0 (or possibly a -w0) option, which it doesn't seem to have. Any idea how I can count just the number of those names/strings?
And - yes: I realized that the syntax for the filePattern has to be different in the two variants. The former one uses shell syntax while the latter one requires "real" regex-syntax. But that's OK and actually what I want: it allows me to search for multiple file patterns in one go. The question is really just to count null-byte terminated strings.

You could delete all the non-NUL characters with tr, then count the number of characters remaining.
find . -maxdepth 1 -regextype posix-egrep -regex "${filePattern}" -print0 | tr -cd '\0' | wc -c
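If you have GNU grep available, its --null-data mode offers a shorter equivalent (an assumption on my part that GNU grep is in play; -z makes grep treat its input as NUL-terminated records, and -c counts the matching records):
find . -maxdepth 1 -regextype posix-egrep -regex "${filePattern}" -print0 | grep -zc .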
If you're dealing with a small-to-medium number of files, an alternate solution would be to store the matches in an array and check the array size. (As you touch on in your question, this would use glob syntax rather than regexes.)
files=(*foo* *bar*)
echo "${#files[@]}"
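One caveat with the array approach: by default an unmatched glob expands to itself, so a pattern with no matches would still contribute one bogus array element. A minimal sketch of the usual fix:
shopt -s nullglob    # unmatched globs now expand to nothing instead of themselves
files=(*foo* *bar*)
echo "${#files[@]}"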

Related

Linux count files with a specific string at a specific position in filename

I have a directory which contains data for several years and several months.
Filenames have the format yyyymmdd, e.g.
20150415,
20170831,
20121205
How can I find all data with month = 3?
E.g.
20150302,
20160331,
20190315
Thanks for your help!
ls -ltra ????03??
A question mark is a shell wildcard which stands for exactly one character, so as your format seems to be YYYYmmDD, the glob pattern ????03?? should match all files having 03 as mm.
Edit
Apparently the files have format YYYYmmDDxxx, where xxx is the rest of the filename, of unknown length. This corresponds to the wildcard *, so instead of ????03?? you might use ????03??*.
As far as find is concerned: the same pattern holds here, but as you seem to be working inside a single directory (no subdirectories, at first sight), you might consider the -maxdepth switch:
find . -name "????03??*" | wc -l               # including subdirectories
find . -maxdepth 1 -name "????03??*" | wc -l   # only current directory
I would highly advise you to run the command without wc -l first, to check the results. (Oh, I just noticed the -type f switch; that one might still be useful too :-) )
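Putting those two suggestions together (a sketch; -type f restricts the matches to regular files):
find . -maxdepth 1 -type f -name "????03??*"          # inspect the matches first
find . -maxdepth 1 -type f -name "????03??*" | wc -l  # then count them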

'find' files containing an integer in a specified range (in bash)

You'd think I could find an answer to this already somewhere, but I am struggling to do so. I want to find some log files with names like
myfile_3.log
however I only want to find the ones with numbers in a certain range. I tried things like this:
find <path> -name myfile_{0..67}.log #error: find: paths must precede expression
find <path> -name myfile_[0-67].log #only return 0-7, not 67
find <path> -name myfile_[0,67].log #only returns 0,6,7
find <path> -name myfile_*([0,67]).log # returns only 0,6,7,60,66,67,70,76,77
Any other ideas?
If you want to match an integer range using a regular expression, use the -regex option in your find command.
For example, to match all files numbered from 0 to 67, use this:
find <path> -regextype egrep -regex '.*myfile_([0-9]|[0-5][0-9]|6[0-7])\.log'
There are 3 parts in the alternation:
[0-9] matches the single digits 0-9
[0-5][0-9] matches the range 00-59
6[0-7] matches the range 60-67
Note the option -regextype egrep to have extended regular expression.
Note also that the option -regex matches the whole filename, including the path; that's the reason for the .* at the beginning of the regex.
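A quick way to sanity-check the pattern (a sketch using a hypothetical scratch directory):
mkdir /tmp/rangetest && cd /tmp/rangetest
touch myfile_0.log myfile_59.log myfile_67.log myfile_68.log
find . -regextype egrep -regex '.*myfile_([0-9]|[0-5][0-9]|6[0-7])\.log'
# should list myfile_0.log, myfile_59.log and myfile_67.log, but not myfile_68.log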
You can do this simply and concisely, but admittedly not very efficiently, with GNU Parallel:
parallel find . -name "myfile_{}.log" ::: {0..67}
In case you are wondering why I say it is not that efficient, it is because it starts 68 parallel instances of find - each looking for a different number in the filename... but that may be ok.
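If you would rather traverse the directory tree only once, you could build the list of -name tests in the shell instead (a sketch; the :1 slice strips the leading -o):
args=()
for i in {0..67}; do args+=(-o -name "myfile_${i}.log"); done
find <path> \( "${args[@]:1}" \) -type f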
The following will find all files named myfile_X.log - whereby the X part is a digit ranging from 0-67.
find <path> -type f | grep -E "/myfile_([0-9]|[0-5][0-9]|6[0-7])\.log$"
Explanation:
-type f finds files whose type is file.
| pipes the filepath(s) to grep for filtering.
grep -E "/myfile_([0-9]|[0-5][0-9]|6[0-7])\.log$" performs an extended (-E) regexp to find the last part of the path (i.e. the filename) which:
begins with myfile_
followed with a digit(s) ranging from 0-67.
ends with .log
Edit:
Alternatively, as suggested by #ghoti in the comments, you can utilize the -regex option in the find command instead of piping to grep. For example:
find -E <path> -type f -regex ".*/myfile_([0-9]|[0-5][0-9]|6[0-7])\.log$"
Note: The regexp is very similar to the grep example shown previously. However, it begins with .*/ to match all parts of the filepath up to and including the final forward slash. The leading .*/ is not needed with grep because find's -regex must match the entire path, whereas grep searches for a match anywhere within its input line.
One possibility is to build up the range from several ranges that can be matched by glob patterns. For example:
find . -name 'myfile_[0-9].log' -o -name 'myfile_[1-5][0-9].log' -o -name 'myfile_6[0-7].log'
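One pitfall worth noting: -o binds more loosely than the implicit -a between find tests, so if you combine these alternatives with further tests such as -type f, parenthesize them:
find . \( -name 'myfile_[0-9].log' -o -name 'myfile_[1-5][0-9].log' -o -name 'myfile_6[0-7].log' \) -type f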
You cannot represent a general range with a regular expression, although you can craft a regex for a specific range. Better to use find to get the files with a number in their name and filter the output with another tool that performs the range checking, like sed or awk.
START=0
END=67
while IFS= read -r -d '' file
do
    # strip everything up to the last "_" and the trailing ".log" to get the number
    N=$(echo "$file" | sed 's/.*_\([0-9]\+\)\.log$/\1/')
    if [ "$N" -ge "$START" ] && [ "$N" -le "$END" ]
    then
        echo "$file"
    fi
done < <(find <path> -name "myfile_*.log" -print0)
In that script, you perform a find of all the files that have the desired pattern, then you loop through the found files and sed is used to capture the number in the filename. Finally, you compare that number with your range limits. If the comparisons succeed, the file is printed.
There are many other answers that give you a regex for the specific range in the example, but they are not general: none of them allows for easy modification of the range involved.
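For the record, the awk variant mentioned above could look like this (a sketch; it assumes filenames contain no newlines and end in _<number>.log, so the number is the second-to-last field when splitting on "_" and "."):
find <path> -name 'myfile_*.log' | awk -F'[_.]' -v start=0 -v end=67 '$(NF-1) >= start && $(NF-1) <= end'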

Find top 500 oldest files

How can I find top 500 oldest files?
What I've tried:
find /storage -name "*.mp4" -o -name "*.flv" -type f | sort | head -n500
Find 500 oldest files using GNU find and GNU sort:
#!/bin/bash
typeset -a files
export LC_{TIME,NUMERIC}=C
n=0
while ((n++ < 500)) && IFS=' ' read -rd '' _ x; do
    files+=("$x")
done < <(find /storage -type f \( -name '*.mp4' -o -name '*.flv' \) -printf '%T@ %p\0' | sort -zn)
printf '%q\n' "${files[@]}"
Update - some explanation:
As mentioned by Jonathan in the comments, the proper way to handle this involves a number of non-standard features which allow producing and consuming null-delimited lists, so that arbitrary filenames can be handled safely.
GNU find's -printf produces the mtime (using the undocumented %T@ format. My guess would be that whether or not this works depends upon your C library) followed by a space, followed by the filename with a terminating \0. Two additional non-standard features process the output: GNU sort's -z option, and the read builtin's -d option, which with an empty option argument delimits input on nulls. The overall effect is to have sort order the elements by the mtime produced by find's -printf string, then read the first 500 results into an array, using IFS to split read's input on the space and discard the first element into the _ variable, leaving only the filename.
Finally, we print out the array using the %q format just to display the results unambiguously with a guarantee of one file per line.
The process substitution (<(...) syntax) isn't completely necessary but avoids the subshell induced by the pipe in versions that lack the lastpipe option. That can be an advantage should you decide to make the script more complicated than merely printing out the results.
None of these features are unique to GNU. All of this can be done using e.g. AST find(1), openbsd sort(1), and either Bash, mksh, zsh, or ksh93 (v or greater). Unfortunately the find format strings are incompatible.
The following finds the oldest 500 files with the oldest file at the top of the list:
find . -regex '.*\.\(mp4\|flv\)' -type f -print0 | xargs -0 ls -drt --quoting-style=shell-always 2>/dev/null | head -n500
The above is a pipeline. The first step is to find the file names, which is done by find. Any of find's options can be used to select the files of interest to you. The second step does the sorting. This is accomplished with xargs passing the file names to ls, which sorts them by modification time in reverse order (-rt) so that the oldest files are at the top. The last step is head -n500, which takes just the first 500 file names. The first of those names will be the oldest file.
If there are more than 500 files, then head terminates before ls. If this happens, ls will issue a message: terminated by signal 13. I redirected stderr from the xargs command to eliminate this harmless message.
The above solution assumes that all the filenames fit into a single ls invocation; if the list is long enough that xargs splits it across several ls runs, each batch is sorted independently and the overall ordering is lost.
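If you can assume that filenames contain no newlines, a pipeline built on the -printf trick from the first answer avoids that batching problem entirely (a sketch):
find /storage -type f \( -name '*.mp4' -o -name '*.flv' \) -printf '%T@ %p\n' | sort -n | head -n500 | cut -d' ' -f2-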

find and delete files with non-ascii names

I have some old migrated files that contain non-printable characters. I would like to find all files with such names and delete them completely from the system.
Example:
ls -l
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 ??"??
ls -lb
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 \a\211"\206\351
I would like to find all such files.
I want to find these files with the non-printable characters and just delete them.
Non-ASCII characters
ASCII character codes range from 0x00 to 0x7F in hex. Therefore, any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII codes are essentially a subset of UTF-8). For example, the Japanese character
あ
is encoded in hex in UTF-8 as
E3 81 82
UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).
ASCII control characters
Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC (0x1B). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A, can be interpreted and displayed.
On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
Regular expressions for character codes
POSIX
POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):
[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
PCRE
Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax
\x00
For example, a PCRE regex for the Japanese character あ would be
\xE3\x81\x82
In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F].
GNU's version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.
Finding the files
GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution to avoid spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'
The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.
To delete the matching files, simply pass the -delete option to find, after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete
I highly recommend running the command without the -delete first, so you can see what will be deleted before it's too late.
If you also pass the -print option, you can see what is being deleted as the command runs:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete
To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:
find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete
With this command, if a directory name contains control characters, everything inside that directory will be deleted as well, even files whose own names contain none.
Update: Finding both non-ASCII and control characters
It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find to traverse our directory tree, but we'll pass the results to Perl for processing.
To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 argument to find (supported on both GNU and BSD versions); this separates records with a null character (0x00) instead of a newline, since the null character is the only character that can't be in a valid filename on Linux. We need to pass the corresponding flag -0 to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:
find . -print0 | perl -n0e 'print $_, "\n"'
Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for CWD) is optional in GNU find but is required in BSD find on Mac OS X, so I've included it for the sake of portability.
Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):
[[:^ascii:][:cntrl:]]
The following command will print all paths (directories or files) in the current directory that match this regex:
find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'
The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:
find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'
This will also print out what is being deleted as the command runs (although control characters are interpreted so the output will not quite match the output of ls).
Based on this answer, try:
LC_ALL=C find . -regex '.*[^ -~].*' -print # -delete
or:
LC_ALL=C find . -type f -regex '.*[^[:alnum:][:punct:]].*' -print # -delete
Note: once the right files are being printed, remove the # character to actually delete them.
See also: How do I grep for all non-ASCII characters.
By now, you have probably solved your problem, but the above didn't work well for my case, as I had files that were not being shown by find when I used the -regex switch. So I developed this workaround using ls. Hope it can be useful to someone.
Basically, what worked for me was this:
ls -1 -R -i | grep -a "[^A-Za-z0-9_.':# /-]" | while read f; do inode=$(echo "$f" | cut -d ' ' -f 1); find -inum "$inode" -delete; done
Breaking it in parts:
ls -1 -R -i
This will recursively (-R) list (ls) files under current directory, one file per line (-1), prefixing each file by its inode number (-i). Results will be piped to grep.
grep -a "[^A-Za-z0-9_.':# /-]"
Filter each entry, treating the input as text (-a) even when it is actually binary. grep will let a line pass if it contains a character different from those specified in the list. Results will be piped to while.
while read f
do
    inode=$(echo "$f" | cut -d ' ' -f 1)
    find -inum "$inode" -delete
done
This while will iterate through all entries, extracting the inode number and passing the inode to find, which will then delete the file.
It is possible to use PCRE with grep -P, just not with find (unfortunately). You can chain find with grep using exec. With PCRE (perl regex), we can use the ascii class and find any char that is non-ascii.
find . -type f -exec sh -c "echo \"{}\" | grep -qP '[^[:ascii:]]'" \; -exec rm {} \;
The second -exec will not run unless the first one returns a zero (success) exit status - in this case, that means the expression matched the filename. I used sh -c because -exec doesn't like pipes.
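Note that substituting {} directly inside the sh -c string breaks on filenames containing double quotes (and echo may mangle backslashes); a more robust sketch passes the name to the subshell as a positional parameter:
find . -type f -exec sh -c 'printf "%s" "$1" | grep -qP "[^[:ascii:]]"' sh {} \; -exec rm {} \;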
You could print only lines containing a backslash with grep:
ls -lb | grep \\\\

Remove special characters in linux files

I have a lot of files *.java, *.xml. But a guy wrote some comments and Strings with Spanish characters. I have been searching the web for how to remove them.
I tried find . -type f -exec sed 's/[áíéóúñ]//g' DefaultAuthoritiesPopulator.java just as an example. How can I remove these characters from many other files in subfolders?
If that's what you really want, you can use find, almost as you are using it.
find -type f \( -iname '*.java' -or -iname '*.xml' \) -execdir sed -i 's/[áíéóúñ]//g' '{}' ';'
The differences:
The path . is implicit if no path is supplied.
This command only operates on *.java and *.xml files.
-execdir is more secure than -exec (read the man page).
-i tells sed to modify the file argument in place. Read the man page to see how to use it to make a backup.
{} represents a path argument which find will substitute in.
The ; is part of the find syntax for exec/execdir.
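For example, to keep a backup of each modified file (a sketch; the .bak suffix is an arbitrary choice, appended directly to -i as GNU sed expects):
find -type f \( -iname '*.java' -or -iname '*.xml' \) -execdir sed -i.bak 's/[áíéóúñ]//g' '{}' ';'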
You're almost there :)
find . -type f -exec sed -i 's/[áíéóúñ]//g' {} \;
(note the added -i flag and the {} \; at the end)
From sed(1):
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if extension supplied)
From find(1):
-exec command ;
Execute command; true if 0 status is returned. All
following arguments to find are taken to be arguments to
the command until an argument consisting of `;' is
encountered. The string `{}' is replaced by the current
file name being processed everywhere it occurs in the
arguments to the command, not just in arguments where it
is alone, as in some versions of find. Both of these
constructions might need to be escaped (with a `\') or
quoted to protect them from expansion by the shell. See
the EXAMPLES section for examples of the use of the -exec
option. The specified command is run once for each
matched file. The command is executed in the starting
directory. There are unavoidable security problems
surrounding use of the -exec action; you should use the
-execdir option instead.
tr is the tool for the job:
NAME
tr - translate or delete characters
SYNOPSIS
tr [OPTION]... SET1 [SET2]
DESCRIPTION
Translate, squeeze, and/or delete characters from standard input, writing to standard output.
-c, -C, --complement
use the complement of SET1
-d, --delete
delete characters in SET1, do not translate
-s, --squeeze-repeats
replace each input sequence of a repeated character that is listed in SET1 with a
single occurrence of that character
piping your input through tr -d 'áíéóúñ' will probably do what you want (quote the set so the shell leaves it alone; note also that GNU tr operates on bytes rather than multibyte characters, so this is only safe if these are the only accented characters present).
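Since tr reads standard input and cannot edit files in place, applying it across the tree might look like this (a sketch; the .tmp suffix is an arbitrary name of mine):
find . -type f \( -iname '*.java' -o -iname '*.xml' \) -exec sh -c 'tr -d "áíéóúñ" < "$1" > "$1.tmp" && mv "$1.tmp" "$1"' sh {} \;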
Why are you trying to remove only characters with diacritic signs? It is probably worth removing all characters with codes outside the range 0-127, so the removal regexp would be s/[\x80-\xFF]//g (GNU sed syntax), if you're sure that your files should not contain any non-ASCII characters.
