find and delete files with non-ascii names - linux

I have some old migrated files whose names contain non-printable characters. I would like to find all files with such names and delete them completely from the system.
Example:
ls -l
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 ??"??
ls -lb
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 \a\211"\206\351
I would like to find all such files.
The ls output above shows an example of what I'm seeing in such folders.
I want to find these files with the non-printable characters and just delete them.

Non-ASCII characters
ASCII character codes range from 0x00 to 0x7F in hex. Therefore, any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII codes are essentially a subset of UTF-8). For example, the Japanese character
あ
is encoded in hex in UTF-8 as
E3 81 82
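You can check the encoding yourself; for example, assuming the xxd utility is installed:
printf 'あ' | xxd -p
e38182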
UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).
ASCII control characters
Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC (0x1B). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A, can be interpreted and displayed.
On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
Regular expressions for character codes
POSIX
POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):
[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
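These classes can be used with ordinary tools as a quick check. For instance, the following sketch (assuming GNU ls, which prints names literally when its output is piped, and bearing in mind that a filename containing a newline will span two lines) lists names in the current directory that contain control characters:
ls | grep '[[:cntrl:]]'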
PCRE
Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax
\x00
For example, a PCRE regex for the Japanese character あ would be
\xE3\x81\x82
In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F].
GNU's version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.
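As a quick illustration of PCRE support in GNU grep (the filename here is just a made-up example):
printf 'naïve.txt\n' | grep -P '[^[:ascii:]]'
naïve.txt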
Finding the files
GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution to avoid spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'
The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.
To delete the matching files, simply pass the -delete option to find, after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete
I highly recommend running the command without the -delete first, so you can see what will be deleted before it's too late.
If you also pass the -print option, you can see what is being deleted as the command runs:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete
To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:
find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete
With this command, if a directory name contains control characters, everything inside that directory will be deleted as well, even if none of the filenames inside it contain control characters.
Update: Finding both non-ASCII and control characters
It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find to traverse our directory tree, but we'll pass the results to Perl for processing.
To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 argument to find (supported on both GNU and BSD versions); this separates records with a null character (0x00) instead of a newline, since the null character is the only character that cannot appear anywhere in a path on Linux (the only other forbidden character, /, is the path separator itself). We need to pass the corresponding flag -0 to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:
find . -print0 | perl -n0e 'print $_, "\n"'
Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for CWD) is optional in GNU find but is required in BSD find on Mac OS X, so I've included it for the sake of portability.
Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):
[[:^ascii:][:cntrl:]]
The following command will print all paths (directories or files) in the current directory that match this regex:
find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'
The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:
find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'
This will also print out what is being deleted as the command runs (although control characters are interpreted so the output will not quite match the output of ls).

Based on this answer, try:
LC_ALL=C find . -regex '.*[^ -~].*' -print # -delete
or:
LC_ALL=C find . -type f -regex '.*[^[:alnum:][:punct:]].*' -print # -delete
Note: once you have verified that the right files are printed, remove the # character so that -delete takes effect.
See also: How do I grep for all non-ASCII characters.

By now you have probably solved your problem, but this didn't work well for my case, as I had files that were not being shown by find when I used the -regex switch. So I developed this workaround using ls. Hope it can be useful to someone.
Basically, what worked for me was this:
ls -1 -R -i | grep -a "[^A-Za-z0-9_.':# /-]" | while read f; do inode=$(echo "$f" | cut -d ' ' -f 1); find -inum "$inode" -delete; done
Breaking it in parts:
ls -1 -R -i
This will recursively (-R) list (ls) files under the current directory, one file per line (-1), prefixing each file with its inode number (-i). Results will be piped to grep.
grep -a "[^A-Za-z0-9_.':# /-]"
Filter each entry, treating the input as text (-a) even when it is actually binary. grep will let a line through if it contains any character other than those specified in the list. Results will be piped to while.
while read f
do
inode=$(echo "$f" | cut -d ' ' -f 1)
find -inum "$inode" -delete
done
This while loop will iterate through all entries, extracting the inode number and passing it to find, which will then delete the file.
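To preview what would be removed before committing, you can run the same pipeline with -print in place of -delete:
ls -1 -R -i | grep -a "[^A-Za-z0-9_.':# /-]" | while read f; do inode=$(echo "$f" | cut -d ' ' -f 1); find -inum "$inode" -print; done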

It is possible to use PCRE with grep -P, just not with find (unfortunately). You can chain find with grep using exec. With PCRE (perl regex), we can use the ascii class and find any char that is non-ascii.
find . -type f -exec sh -c "echo \"{}\" | grep -qP '[^[:ascii:]]'" \; -exec rm {} \;
The second -exec will not run unless the first one returns a success (zero) exit code; in this case, that means the expression matched the filename. I used sh -c because -exec doesn't like pipes.
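If you'd rather avoid spawning a shell per file (and the quoting problems of embedding {} in sh -c), here is a sketch of an alternative using GNU grep's null-data mode (GNU grep and xargs assumed):
find . -type f -print0 | LC_ALL=C grep -zP '[^[:ascii:]]' | xargs -0 -r rm --
With -z, grep both reads and writes null-delimited records, so unusual filenames (including ones containing newlines) are handled safely.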

You could print only lines containing a backslash with grep:
ls -lb | grep \\\\

Related

Count number of files (with arbitrary filenames) in a given directory

I am trying to count the number of files in a given directory matching a specific name pattern. While this initially sounded like a no-brainer, the issue turned out to be more complicated than I ever thought, because the filenames can contain spaces and other nasty characters.
So, starting from an initial find -name "${filePattern}" | wc -l I have by now reached this expression:
find . -maxdepth 1 -regextype posix-egrep -regex "${filePattern}" -print0 | wc -l --files0-from=-
The maxdepth option restricts to the current directory only. The -print0 and the -files0-from options of find and wc, respectively, emit and accept filenames which are null-byte terminated. This is supposed to take care of possible special characters contained in the filenames.
BUT: the --files0-from= option interprets the strings as filenames, and wc thus counts the lines contained in those files. But all I need is the number of files themselves (i.e. the number of null-byte terminated strings emitted by find). For that wc would need a -l0 (or possibly a -w0) option, which it doesn't seem to have. Any idea how I can count just the number of those names/strings?
And - yes: I realized that the syntax for the filePattern has to be different in the two variants. The former one uses shell syntax while the latter one requires "real" regex-syntax. But that's OK and actually what I want: it allows me to search for multiple file patterns in one go. The question is really just to count null-byte terminated strings.
You could delete all the non-NUL characters with tr, then count the number of characters remaining.
find . -maxdepth 1 -regextype posix-egrep -regex "${filePattern}" -print0 | tr -cd '\0' | wc -c
If you're dealing with a small-to-medium number of files, an alternate solution would be to store the matches in an array and check the array size. (As you touch on in your question, this would use glob syntax rather than regexes.)
files=(*foo* *bar*)
echo "${#files[#]}"

executing Linux sed command and version control complain

I have a folder which contains jsp files. I used find and sed to change part of the text in some files. This folder is under version control. The command successfully changed all the occurrences of the specified pattern, but when I'm synchronizing the folder with the remote repository I can see many files listed as modified in which nothing has actually changed. There is something wrong with the whitespace, I suppose. Could anyone shed some light on this matter?
I'm trying to replace ../../images/spacer to ${pageContext.request.contextPath}/static/images/spacer in all jsp files under current folder
The command I'm using is as below
find . -name '*.jsp' -exec sed -i 's/..\/..\/images\/spacer/${pageContext.request.contextPath}\/static\/images\/spacer/g' {} \;
On most systems, grep has an option to recursively search for files that contain a pattern, avoiding find.
So, the command would be:
grep -r -l -m1 "\.\./\.\./images/spacer" --include \*.jsp |
xargs -r sed -i 's!\.\./\.\./\(images/spacer\)!${pageContext.request.contextPath}/static/\1!g'
Explanation
Both grep and sed work with regular expression patterns, in which the dot character . represents any character, including the dot itself. To indicate a literal dot, it must be escaped with a \ before it. So to search for .. it is necessary to specify \.\., otherwise the pattern can also match text like ab/cd/
Now, about the grep options:
-m1 stops search when finds the first occurrence avoiding search the entire file.
-r search recursively in the directories
--include \*.jsp searches only in files whose names match the *.jsp pattern.
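If any of the matched paths could contain spaces or other unusual characters, a null-delimited variant of the same pipeline is a safer sketch (GNU grep, xargs, and sed assumed):
grep -rlZ -m1 "\.\./\.\./images/spacer" --include \*.jsp . |
xargs -0 -r sed -i 's!\.\./\.\./\(images/spacer\)!${pageContext.request.contextPath}/static/\1!g'
The -Z option makes grep terminate each filename with a null byte, which xargs -0 then consumes.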

problems with source command in shell

Good afternoon. I have the following command to run my 'code.sh' file, to which I pass a parameter '$1'. The problem is that I want to run 'code.sh' with 'source'. This is my command:
find . -name "*.txt" -type f -exec ./code.sh {} \;
And normally I run it like this:
source ./code.sh
This is tricky. When you source a script you need to do it in the current shell, not in a sub-shell or child process. Executing source from find won't work because find is a child process, and so changes to environment variables will be lost.
It's rather roundabout, but you can use a loop to parse find's output and run the source commands directly in the top-level shell (using process substitution).
while read -d $'\0' fileName; do
source code.sh "$fileName"
done < <(find . -name "*.txt" -type f -print0)
Now what's with -print0 and -d $'\0', you ask? Using these two flags together is a way of making the script extra safe.† File names in UNIX are allowed to contain lots of oddball characters including spaces, tabs, and even newlines. While newlines are rare, they are indeed legal.
-print0 tells find to use NUL characters (\0) to separate the file names rather than the default newlines (\n). Doing this means file names containing \n won't mess up the loop. Using \0 as a separator works well because \0 is not a legal character in file names.
-d $'\0'‡ does the same thing with read on the other side. It tells read that lines are delimited with \0 instead of \n.
† You may have seen this trick before. It's common to write find ... -print0 | xargs -0 ... to get the same sort of safety when pairing find with xargs.
‡ If you're wondering about $'...': that's Bash ANSI-C quoting syntax for writing string literals containing escape codes. Dollar sign plus single quotes. You can write $'\n' for a newline or $'\t' for a tab or $'\0' for a NUL.
You won't be able to use find in this way; it will always execute a command in a separate process, not the current shell. If you are using bash 4, there's a simple alternative to using find:
shopt -s globstar
for f in **/*.txt; do
[[ -f $f ]] && source code.sh "$f"
done

Find top 500 oldest files

How can I find the 500 oldest files?
What I've tried:
find /storage -name "*.mp4" -o -name "*.flv" -type f | sort | head -n500
Find 500 oldest files using GNU find and GNU sort:
#!/bin/bash
typeset -a files
export LC_{TIME,NUMERIC}=C
n=0
while ((n++ < 500)) && IFS=' ' read -rd '' _ x; do
files+=("$x")
done < <(find /storage -type f \( -name '*.mp4' -o -name '*.flv' \) -printf '%T# %p\0' | sort -zn)
printf '%q\n' "${files[@]}"
Update - some explanation:
As mentioned by Jonathan in the comments, the proper way to handle this involves a number of non-standard features which allow producing and consuming null-delimited lists so that arbitrary filenames can be handled safely.
GNU find's -printf produces the mtime (using the undocumented %T# format. My guess would be that whether or not this works depends upon your C library) followed by a space, followed by the filename with a terminating \0. Two additional non-standard features process the output: GNU sort's -z option, and the read builtin's -d option, which with an empty option argument delimits input on nulls. The overall effect is to have sort order the elements by the mtime produced by find's -printf string, then read the first 500 results into an array, using IFS to split read's input on space and discard the first element into the _ variable, leaving only the filename.
Finally, we print out the array using the %q format just to display the results unambiguously with a guarantee of one file per line.
The process substitution (<(...) syntax) isn't completely necessary but avoids the subshell induced by the pipe in versions that lack the lastpipe option. That can be an advantage should you decide to make the script more complicated than merely printing out the results.
None of these features are unique to GNU. All of this can be done using e.g. AST find(1), OpenBSD sort(1), and either Bash, mksh, zsh, or ksh93 (version v or greater). Unfortunately the find format strings are incompatible.
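If you can assume the filenames contain no newlines, a much shorter sketch along the same lines, using GNU find's documented %T@ format, would be:
find /storage -type f \( -name '*.mp4' -o -name '*.flv' \) -printf '%T@ %p\n' | sort -n | head -n 500
Each output line is prefixed with the file's mtime in epoch seconds; pipe through cut -d' ' -f2- if you want only the paths.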
The following finds the oldest 500 files with the oldest file at the top of the list:
find . -regex '.*\.\(mp4\|flv\)' -type f -print0 | xargs -0 ls -drt --quoting-style=shell-always 2>/dev/null | head -n500
The above is a pipeline. The first step is to find the file names, which is done by find. Any of find's options can be used to select the files of interest to you. The second step does the sorting: xargs passes the file names to ls, which sorts on modification time in reverse order so that the oldest files are at the top. The last step is head -n500, which takes just the first 500 file names. The first of those names will be the oldest file.
If there are more than 500 files, then head terminates before ls. If this happens, xargs will report that ls was terminated by signal 13 (SIGPIPE). I redirected stderr from the xargs command to eliminate this harmless message.
The above solution assumes that all the filenames can fit on one command line in your shell.

Remove special characters in linux files

I have a lot of files *.java, *.xml. But a guy wrote some comments and Strings with Spanish characters. I've been searching the web for how to remove them.
I tried find . -type f -exec sed 's/[áíéóúñ]//g' DefaultAuthoritiesPopulator.java just as an example. How can I remove these characters from many other files in subfolders?
If that's what you really want, you can use find, almost as you are using it.
find -type f \( -iname '*.java' -or -iname '*.xml' \) -execdir sed -i 's/[áíéóúñ]//g' '{}' ';'
The differences:
The path . is implicit if no path is supplied.
This command only operates on *.java and *.xml files.
execdir is more secure than exec (read the man page).
-i tells sed to modify the file argument in place. Read the man page to see how to use it to make a backup.
{} represents a path argument which find will substitute in.
The ; is part of the find syntax for exec/execdir.
You're almost there :)
find . -type f -exec sed -i 's/[áíéóúñ]//g' {} \;
(note the added -i option and the trailing {} \;)
From sed(1):
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if extension supplied)
From find(1):
-exec command ;
Execute command; true if 0 status is returned. All
following arguments to find are taken to be arguments to
the command until an argument consisting of `;' is
encountered. The string `{}' is replaced by the current
file name being processed everywhere it occurs in the
arguments to the command, not just in arguments where it
is alone, as in some versions of find. Both of these
constructions might need to be escaped (with a `\') or
quoted to protect them from expansion by the shell. See
the EXAMPLES section for examples of the use of the -exec
option. The specified command is run once for each
matched file. The command is executed in the starting
directory. There are unavoidable security problems
surrounding use of the -exec action; you should use the
-execdir option instead.
tr is the tool for the job:
NAME
tr - translate or delete characters
SYNOPSIS
tr [OPTION]... SET1 [SET2]
DESCRIPTION
Translate, squeeze, and/or delete characters from standard input, writing to standard output.
-c, -C, --complement
use the complement of SET1
-d, --delete
delete characters in SET1, do not translate
-s, --squeeze-repeats
replace each input sequence of a repeated character that is listed in SET1 with a
single occurrence of that character
piping your input through tr -d 'áíéóúñ' will probably do what you want.
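Since tr reads standard input and writes standard output, you would apply it to a file via a temporary copy; a rough sketch for a single (hypothetical) file:
tr -d 'áíéóúñ' < Foo.java > Foo.java.tmp && mv Foo.java.tmp Foo.java
Note that GNU tr works on bytes, so in a UTF-8 locale this deletes the individual bytes of those accented characters, which can also mangle other multibyte characters that share a lead byte.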
Why are you trying to remove only the characters with diacritics? It's probably worth removing all characters with codes outside the range 0-127, so the removal regexp would be s/[\x80-\xFF]//g, if you're sure your files shouldn't contain anything beyond plain ASCII.
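Applied across the tree with find, a sketch of that idea (GNU sed's \xHH escapes assumed; make backups first, e.g. by giving -i a suffix):
LC_ALL=C find . -type f \( -name '*.java' -o -name '*.xml' \) -exec sed -i 's/[\x80-\xff]//g' {} +
Running it under LC_ALL=C makes sed treat the input as plain bytes, so every byte above 0x7F is stripped regardless of encoding.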
