Remove special characters in linux files

I have a lot of *.java and *.xml files, but a guy wrote some comments and Strings with Spanish characters. I have been searching the web for how to remove them.
I tried find . -type f -exec sed 's/[áíéóúñ]//g' DefaultAuthoritiesPopulator.java just as an example. How can I remove these characters from many other files in subfolders?

If that's what you really want, you can use find, almost as you are using it.
find -type f \( -iname '*.java' -or -iname '*.xml' \) -execdir sed -i 's/[áíéóúñ]//g' '{}' ';'
The differences:
The path . is implicit if no path is supplied.
This command only operates on *.java and *.xml files.
-execdir is more secure than -exec (read the man page).
-i tells sed to modify the file argument in place. Read the man page to see how to use it to make a backup.
{} represents a path argument which find will substitute in.
The ; is part of the find syntax for exec/execdir.
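For example, a variant of the same command that keeps a backup of each modified file (a sketch assuming GNU sed, where a suffix attached to -i enables backups) could look like:
# every edited file is first copied to <name>.bak
find -type f \( -iname '*.java' -or -iname '*.xml' \) -execdir sed -i.bak 's/[áíéóúñ]//g' '{}' ';'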

You're almost there :)
find . -type f -exec sed -i 's/[áíéóúñ]//g' {} \;
(the additions relative to your attempt are the -i flag and the trailing {} \;)
From sed(1):
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if extension supplied)
From find(1):
-exec command ;
Execute command; true if 0 status is returned. All
following arguments to find are taken to be arguments to
the command until an argument consisting of `;' is
encountered. The string `{}' is replaced by the current
file name being processed everywhere it occurs in the
arguments to the command, not just in arguments where it
is alone, as in some versions of find. Both of these
constructions might need to be escaped (with a `\') or
quoted to protect them from expansion by the shell. See
the EXAMPLES section for examples of the use of the -exec
option. The specified command is run once for each
matched file. The command is executed in the starting
directory. There are unavoidable security problems
surrounding use of the -exec action; you should use the
-execdir option instead.

tr is the tool for the job:
NAME
tr - translate or delete characters
SYNOPSIS
tr [OPTION]... SET1 [SET2]
DESCRIPTION
Translate, squeeze, and/or delete characters from standard input, writing to standard output.
-c, -C, --complement
use the complement of SET1
-d, --delete
delete characters in SET1, do not translate
-s, --squeeze-repeats
replace each input sequence of a repeated character that is listed in SET1 with a
single occurrence of that character
Piping your input through tr -d 'áíéóúñ' will probably do what you want.
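Since tr only reads standard input, editing files in place needs a temporary file. A minimal sketch of how that might look over many files (the .tmp suffix is arbitrary; note that GNU tr is byte-oriented, so this is only safe if the listed accented letters are the only multibyte characters in the files):
find . -type f -name '*.java' -exec sh -c '
  for f do
    # strip the unwanted characters into a temp file, then swap it in
    tr -d "áíéóúñ" < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done
' sh {} +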

Why are you trying to remove only characters with diacritics? It is probably worth removing all characters with codes outside the range 0-127, so the removal regexp becomes s/[\x80-\xff]//g, if you're sure that your files should not contain any non-ASCII bytes.
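Concretely, that could be wired into find like this (a sketch assuming GNU sed, which understands \xHH escapes; LC_ALL=C keeps the byte range from misbehaving in UTF-8 locales):
# strip every byte above 0x7F from all .java and .xml files, in place
LC_ALL=C find . -type f \( -name '*.java' -o -name '*.xml' \) -exec sed -i 's/[\x80-\xff]//g' {} +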

Related

find and copy all images in directory using terminal linux mint, trying to understand syntax

OS: Linux Mint
Like the title says, I would ultimately like to find and copy all images in a directory.
I found:
find all jpg (or JPG) files in a directory and copy them into the folder /home/joachim/neu2:
find . -iname \*.jpg -print0 | xargs -I{} -0 cp -v {} /home/joachim/neu2
and
find all image files in a directory:
find . -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image'
My problem is, first of all, that I don't really understand the syntax. Could someone explain the code?
And secondly, can someone combine the two commands into one that does what I want ;)
Greetings and thanks in advance!
First, understand that the pipe "|" links commands, feeding the standard output of the first into the standard input of the second. Your two shell commands both pipe the output of the find command into other commands (grep and xargs). Let's look at those commands one after another:
First command: find
find is a program to "search for files in a directory hierarchy" (that is the explanation from find's man page). The syntax is (in this case)
find <search directory> <search pattern> <action>
In both cases the search directory is . (that is the current directory). Note that it does not just search the current directory but all its subdirectories as well (the directory hierarchy).
The search pattern accepts options -name (meaning it searches for files the name of which matches the pattern given as an argument to this option) or -iname (same as name but case insensitive) among others.
The action may be -print0 (print each matched file's exact name including its position in the given search directory, i.e. the relative or absolute path to the file, terminated by a null character instead of a newline) or -exec (execute the given command on the file(s); the command is to be ended with ";" and every instance of "{}" is replaced by the filename).
That is, the first shell code (first part, left of the pipe)
find . -iname \*.jpg -print0
searches all files with ending ".jpg" in the current directory hierarchy and prints their paths and names. The second one (first part)
find . -name '*' -exec file {} \;
finds all files in the current directory hierarchy and executes
file <filename>
on them. File is another command that determines and prints the file type (have a look at the man page for details, man file).
Second command: xargs
xargs is a command that "builds and executes command lines from standard input" (man xargs), i.e. from the find output that is piped into xargs. The command that it builds and executes is in this case
cp -v {} /home/joachim/neu2
Option -I{} defines the replacement string, i.e. every instance of {} in the command is replaced by the input it gets from find (that is, the filenames). Option -0 specifies that input items are terminated (separated) not by whitespace or newlines but only by a null character. This is the standard, safe way to deal with find output as xargs input.
The command that is built and executed is then of course the copy command with option -v (verbose) and it copies each of the filenames it gets from find to the directory.
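A self-contained way to see that mechanism without find (pic1.jpg and pic2.jpg are stand-ins for real files):
# two NUL-terminated names on stdin; xargs substitutes each one for {} in cp
printf '%s\0' pic1.jpg pic2.jpg | xargs -0 -I{} cp -v {} /home/joachim/neu2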
Third command: grep
grep filters its input, giving only those lines or strings that match a particular pattern. Option -o tells grep to print only the matching string, not the entire line (see man grep), -P tells it to interpret the following pattern as a perl regexp pattern. In perl regex, ^ is the start of the line, .+ is any arbitrary string; this arbitrary string should then be followed by a colon, a space, one or more word characters (denoted \w+ in perl regex), a space, and the string "image". Essentially this grep command filters the file output to only output the lines for image files. (Read about perl regexes for instance here: http://www.comp.leeds.ac.uk/Perl/matching.html )
The command you actually wanted
Now what you want to do is (1) take the output of the second shell command (which lists the image files), (2) bring it into the appropriate form and (3) pipe it into the xargs command from the first shell command line (which then builds and executes the copy command you wanted). So this time we have a three (actually four) stage shell command with two pipes. Not a problem. We already have stages (1) and (3) (though in stage (3) we need to leave out the -0 option because the input is not find output any more; we need it to treat newlines as item separators).
Stage (2) is still missing. I suggest using the cut command for this. cut changes strings by splitting them into different fields (separated by a delimiter character in the original string) that can then be rearranged. I will choose ":" as the delimiter character (this ends the filename in the grep output, option -d':') and tell it to give us just the first field (option -f1, essentially: print only the filename, not the part that comes after the ":"), i.e. stage (2) would then be
cut -d':' -f1
And the entire command you wanted will then be:
find . -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image' | cut -d':' -f1 | xargs -I{} cp -v {} /home/joachim/neu2
Note that you can find all the man pages for instance here: http://www.linuxmanpages.com
I figured out a command only using awk that does the job as well:
find . -name '*' -exec file {} \; |
awk '{
if ($3=="image"){
print substr($1, 1, length($1)-1);
system("cp " substr($1, 1, length($1)-1) " /home/joachim/neu2" )
}
}'
the substr($1, 1, length($1)-1) is needed because file appends a colon to the name it prints in the first column; note the start index of 1, since awk strings are 1-indexed (a start of 0 can silently drop an extra character in some awks).
The above answer is really good, but it can take a long time on a huge directory.
Here is a shorter version of it, if you already know your file extension:
find . -name '*.jpg' | xargs -I{} cp --parents -v {} ~/testimage/
Here's another one which works like a charm.
It adds the EPOCH time to prevent overwriting files with the same name.
cd /media/myhome/'Local station'/
find . -path ./jpg -prune -o -type f -iname '*.jpg' -exec sh -c '
for file do
newname="${file##*/}"
newname="${newname%.jpg}"
mv -T -- "$file" "/media/myhome/Local station/jpg/$newname-$(date +%s).jpg"
done
' find-sh {} +
cd ~/
It's been designed by Kamil in this post here.
Find a specific type file from a directory:
find /home/user/find/data/ -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image'
Copy specific type of file from one directory to another directory:
find /home/user/find/data/ -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image' | cut -d':' -f1 | xargs -I{} cp -v {} /home/user/copy/data/

problems with source command in shell

Good afternoon. I have the following command to run my 'code.sh' file, to which I pass a parameter '$1'. The problem is that I want to run 'code.sh' with 'source'. This is my command:
find . -name "*.txt" -type f -exec ./code.sh {} \;
And what I actually need to run is:
source ./code.sh
This is tricky. When you source a script you need to do it in the current shell, not in a sub-shell or child process. Executing source from find won't work because find runs every command in a child process, and so changes to environment variables will be lost.
It's rather roundabout, but you can use a loop to parse find's output and run the source commands directly in the top-level shell (using process substitution).
while read -d $'\0' fileName; do
source code.sh "$fileName"
done < <(find . -name "*.txt" -type f -print0)
Now what's with -print0 and -d $'\0', you ask? Using these two flags together is a way of making the script extra safe.† File names in UNIX are allowed to contain lots of oddball characters including spaces, tabs, and even newlines. While newlines are rare, they are indeed legal.
-print0 tells find to use NUL characters (\0) to separate the file names rather than the default newlines (\n). Doing this means file names containing \n won't mess up the loop. Using \0 as a separator works well because \0 is not a legal character in file names.
-d $'\0' ‡ does the same thing with read on the other side. It tells read that lines are delimited with \0 instead of \n.
† You may have seen this trick before. It's common to write find ... -print0 | xargs -0 ... to get the same sort of safety when pairing find with xargs.
‡ If you're wondering about $'...': that's Bash ANSI-C quoting syntax for writing string literals containing escape codes. Dollar sign plus single quotes. You can write $'\n' for a newline or $'\t' for a tab or $'\0' for a NUL.
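A quick demonstration of the quoting at a bash prompt:
# bash expands the escape before echo runs, so this prints two lines
echo $'first\nsecond'
# note: a bash string cannot actually hold a NUL byte, so $'\0' is the empty
# string; read -d '' behaves identically and is a common spelling of the same idiom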
You won't be able to use find in this way; it will always execute a command in a separate process, not the current shell. If you are using bash 4, there's a simple alternative to using find:
shopt -s globstar
for f in **/*.txt; do
[[ -f $f ]] && source code.sh "$f"
done

find and delete files with non-ascii names

I have some old migrated files that contain non-printable characters. I would like to find all files with such names and delete them completely from the system.
Example:
ls -l
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 ??"??
ls -lb
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 \a\211"\206\351
I would like to find all such files.
I want to find these files with the non-printable characters and just delete them.
Non-ASCII characters
ASCII character codes range from 0x00 to 0x7F in hex. Therefore, any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII codes are essentially a subset of UTF-8). For example, the Japanese character
あ
is encoded in hex in UTF-8 as
E3 81 82
UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).
ASCII control characters
Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC (0x1B). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A, can be interpreted and displayed.
On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
Regular expressions for character codes
POSIX
POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):
[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
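As a quick illustration, the same classes work in find's glob patterns, so you can preview offenders before deleting anything (a sketch):
# list paths whose names contain at least one control character
find . -name '*[[:cntrl:]]*'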
PCRE
Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax
\x00
For example, a PCRE regex for the Japanese character あ would be
\xE3\x81\x82
In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F].
GNU's version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.
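A quick sanity check of that class with GNU grep (the accented word is just an example; é is two bytes in UTF-8, both outside the ASCII range):
# prints the line, confirming the non-ASCII match
printf 'caf\xc3\xa9\n' | grep -P '[^[:ascii:]]'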
Finding the files
GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution to avoid spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'
The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.
To delete the matching files, simply pass the -delete option to find, after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete
I highly recommend running the command without the -delete first, so you can see what will be deleted before it's too late.
If you also pass the -print option, you can see what is being deleted as the command runs:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete
To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:
find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete
With this command, if a directory name contains control characters, even if none of the filenames inside the directory do, they will all be deleted.
Update: Finding both non-ASCII and control characters
It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find to traverse our directory tree, but we'll pass the results to Perl for processing.
To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 argument to find (supported on both GNU and BSD versions); this separates records with a null character (0x00) instead of a newline, since the null character is the only character that can't be in a valid filename on Linux. We need to pass the corresponding flag -0 to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:
find . -print0 | perl -n0e 'print $_, "\n"'
Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for CWD) is optional in GNU find but is required in BSD find on Mac OS X, so I've included it for the sake of portability.
Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):
[[:^ascii:][:cntrl:]]
The following command will print all paths (directories or files) in the current directory that match this regex:
find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'
The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:
find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'
This will also print out what is being deleted as the command runs (although control characters are interpreted so the output will not quite match the output of ls).
Based on this answer, try:
LC_ALL=C find . -regex '.*[^ -~].*' -print # -delete
or:
LC_ALL=C find . -type f -regex '.*[^[:alnum:][:punct:]].*' -print # -delete
Note: once the printed file list looks right, remove the # character to enable -delete.
See also: How do I grep for all non-ASCII characters.
By now, you have probably solved your question, but it didn't work well for my case, as I had files that were not being shown by find when I used the -regex switch. So I developed this workaround using ls. Hope it can be useful to someone.
Basically, what worked for me was this:
ls -1 -R -i | grep -a "[^A-Za-z0-9_.':# /-]" | while read f; do inode=$(echo "$f" | cut -d ' ' -f 1); find -inum "$inode" -delete; done
Breaking it in parts:
ls -1 -R -i
This will recursively (-R) list (ls) files under current directory, one file per line (-1), prefixing each file by its inode number (-i). Results will be piped to grep.
grep -a "[^A-Za-z0-9_.':# /-]"
Filter each entry, considering the input as text (-a), even when it is actually binary. grep will let a line pass if it contains a character different from those specified in the list. Results will be piped to while.
while read f
do
inode=$(echo "$f" | cut -d ' ' -f 1)
find -inum "$inode" -delete
done
This while will iterate through all entries, extracting the inode number and passing the inode to find, which will then delete the file.
It is possible to use PCRE with grep -P, just not with find (unfortunately). You can chain find with grep using exec. With PCRE (perl regex), we can use the ascii class and find any char that is non-ascii.
find . -type f -exec sh -c 'printf "%s\n" "$1" | grep -qP "[^[:ascii:]]"' sh {} \; -exec rm {} \;
The second -exec will not execute unless the first one returns a zero (success) status. In this case, that means the expression matched the filename. I used sh -c (with the filename passed as a positional argument rather than embedded in the command string) because -exec doesn't like pipes.
You could print only lines containing a backslash with grep:
ls -lb | grep \\\\

renaming with find

I managed to find several files with the find command.
The files are of the form file_sakfksanf.txt, file_afsjnanfs.pdf, file_afsnjnjans.cpp;
now I want to rename them, using rename and the -exec action, to
mywish_sakfksanf.txt, mywish_afsjnanfs.pdf, mywish_afsnjnjans.cpp
so that only the prefix is changed. I have been trying for some time, so don't blame me for being stupid.
If you read through the -exec section of the man pages for find you will come across the {} string that allows you to use the matches as arguments within -exec. This will allow you to use rename on your find matches in the following way:
find . -name 'file_*' -exec rename 's/file_/mywish_/' {} \;
From the manual:
-exec command ;
Execute command; true if 0 status is returned. All following
arguments to find are taken to be arguments to the command until an
argument consisting of ;' is encountered. The string{}' is replaced
by the current file name being processed everywhere it occurs in the
arguments to the command, not just in arguments where it is alone, as
in some versions of find. Both of these constructions might need to
be escaped (with a `\') or quoted to protect them from expansion by
the shell. See the EXAMPLES section for examples of the use of the
-exec option. The specified command is run once for each matched file. The command is executed in the starting directory. There are
unavoidable security problems surrounding use of the -exec action;
you should use the -execdir option instead.
Although you asked for a find/exec solution, as Mark Reed suggested, you might want to consider piping your results to xargs. If you do, make sure to use the -print0 option with find and either the -0 or --null option with xargs to avoid unexpected behaviour resulting from whitespace or shell metacharacters appearing in your file names. Also, consider using the + version of -exec (also in the manual): it is in the POSIX spec for find and should therefore be more portable if you want to run your command elsewhere (not always true); it also builds its command line in a way similar to xargs, which results in fewer invocations of rename.
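For reference, the two safer variants described here might look like this (assuming the same Perl rename used in the answer above):
# batch many files into one rename invocation with the + terminator
find . -name 'file_*' -exec rename 's/file_/mywish_/' {} +
# or the NUL-safe xargs equivalent
find . -name 'file_*' -print0 | xargs -0 rename 's/file_/mywish_/'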
I don't think there's a way you can do this with just find; you'll need to create a script:
#!/bin/bash
# derive the new name with sed; quoting protects names containing spaces
NEW=$(echo "$1" | sed -e 's/file_/mywish_/')
mv -- "$1" "$NEW"
Then you can:
find ./ -name 'file_*' -exec ./my_script {} \;
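As a side note, the sed round-trip can be avoided entirely with bash's own pattern substitution; a sketch of the same script (assuming the file_ prefix only occurs in the file name itself):
#!/bin/bash
# ${1/file_/mywish_} replaces the first occurrence of file_ in the argument
mv -- "$1" "${1/file_/mywish_}"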

Linux: how to replace all instances of a string with another in all files of a single type

I want to replace for example all instances of "123" with "321" contained within all .txt files in a folder (recursively).
I thought of doing this
sed -i 's/123/321/g' | find . -name \*.txt
but before possibly screwing all my files I would like to ask if it will work.
You have the sed and the find back to front. With GNU sed and the -i option, you could use:
find . -name '*.txt' -type f -exec sed -i s/123/321/g {} +
The find finds files with extension .txt and runs the sed -i command on groups of them (that's the + at the end; it's standard in POSIX 2008, but not all versions of find necessarily support it). In this example substitution, there's no danger of misinterpretation of the s/123/321/g command so I've not enclosed it in quotes. However, for simplicity and general safety, it is probably better to enclose the sed script in single quotes whenever possible.
You could also use xargs (and again using GNU extensions -print0 to find and -0 and -r to xargs):
find . -name '*.txt' -type f -print0 | xargs -0 -r sed -i 's/123/321/g'
The -r means 'do not run if there are no arguments' (i.e. in case the find doesn't find anything). The -print0 and -0 work in tandem, generating file names ending with the C null byte '\0' instead of a newline, and avoiding misinterpretation of file names containing newlines, blanks and so on.
Note that before running the script on the real data, you can and should test it. Make a dummy directory (I usually call it junk), copy some sample files into the junk directory, change directory into the junk directory, and test your script on those files. Since they're copies, there's no harm done if something goes wrong. And you can simply remove everything in the directory afterwards: rm -fr junk should never cause you anguish.
