sed doesn't remove characters from UTF range properly

sed doesn't remove characters from UTF range properly - linux

I want to clear my file from all characters except russian and arabic letters, "|" and space mark. Lets start with only arabic letters. So I have:
cat file.tzt | sed 's/[^\u0600-\u06FF]//g'
sed: -e expression #1, char 21: Invalid range end.
I have tried [\u0621-\u064A] - same.
I also tried to use {Arabic}, but it doesn't clean files properly at all.
Error looks kinda strange for me. Obviously, 064FF > 0621.
So, overall I want to have something like this:
cat file.tzt | sed 's/[^\u0600-\u06FFа-яА-Я |]//g'
And I am ok with awk or any other utility, but as I know sed is stable and reliable.

Perl understands UTF-8:
perl -CSD -pe 's/[^\N{U+0600}-\N{U+06FF}]//g' -- file.txt
-C turns of UTF-8 support, S means for stdin/stdout/stderr, D means for any i/o streams.
You can also use Unicode properties:
s/\P{Cyrillic}//g

Related

How to replace non printable characters in file like <97> on linux [duplicate]

I am trying to remove non-printable character (for e.g. ^#) from records in my file. Since the volume to records is too big in the file using cat is not an option as the loop is taking too much time.
I tried using
sed -i 's/[^#a-zA-Z 0-9`~!##$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME
but still the ^# characters are not removed.
Also I tried using
awk '{ sub("[^a-zA-Z0-9\"!##$%^&*|_\[](){}", ""); print } FILENAME > NEW FILE
but it also did not help.
Can anybody suggest some alternative way to remove non-printable characters?
Used tr -cd but it is removing accented characters. But they are required in the file.

Perhaps you could go with the complement of [:print:], which contains all printable characters:
tr -cd '[:print:]' < file > newfile
If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):
sed 's/[^[:print:]]//g' file

Remove all control characters first:
tr -dc '\007-\011\012-\015\040-\376' < file > newfile
Then try your string:
sed -i 's/[^#a-zA-Z 0-9`~!##$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile
I believe that what you see ^# is in fact a zero value \0.
The tr filter from above will remove those as well.

strings -1 file... > outputfile
seems to work. The strings program will take all printable characters, in this case of length 1 (the -1 argument) and print them. It effectively is removing all the non-printable characters.
"man strings" will provide the documentation.

Was searching for this for a while & found a rather simple solution:
The package ansifilter does exactly this. All you need to do is just pipe the output through it.
On Mac:
brew install ansifilter
Then:
cat file.txt | ansifilter

Find files with non-printing characters (null bytes)

I have got the log of my application with a field that contains strange characters.
I see these characters only when I use less command.
I tried to copy the result of my line of code in a text file and what I see is
CTP_OUT=^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
I'd like to know if there is a way to find these null characters. I have tried with a grep command but it didn't show anything

I hardly believe it, I might write an answer involving cat!
The characters you are observing are non-printable characters which are often written in Carret notation. The Caret notation of a character is a way to visualize non-printable characters. As mentioned in the OP, ^# is the representation of NULL.
If your file has non-printable characters, you can visualize them using cat -vET:
-E, --show-ends: display $ at end of each line
-T, --show-tabs: display TAB characters as ^I
-v, --show-nonprinting: use ^ and M- notation, except for LFD and TAB
source: man cat
I've added the -E and -T flag to it, to convert everything non-printable.
As grep will not output the non-printable characters itself in any form, you have to pipe its output to cat to see them. The following example shows all lines containing non-printable characters
Show all lines with non-printable characters:
$ grep -E '[^[:print:]]' --color=never file | cat -vET
Here, the ERE [^[:print:]] selects all non-printable characters.
Show all lines with NULL:
$ grep -Pa '\x00' --color=never file | cat -vET
Be aware that we need to make use of the Perl regular expressions here as they understand the hexadecimal and octal notation.
Various control characters can be written in C language style: \n matches a newline, \t a tab, \r a carriage return, \f a form feed, etc.
More generally, \nnn, where nnn is a string of three octal digits, matches the character whose native code point is nnn. You can easily run into trouble if you don't have exactly three digits. So always use three, or since Perl 5.14, you can use \o{...} to specify any number of octal digits.
Similarly, \xnn, where nn are hexadecimal digits, matches the character whose native ordinal is nn. Again, not using exactly two digits is a recipe for disaster, but you can use \x{...} to specify any number of hex digits.
source: Perl 5 version 26.1 documentation
An example:
$ printf 'foo\012\011\011bar\014\010\012foobar\012\011\000\013\000car\012\011\011\011\012' > test.txt
$ cat test.txt
foo
bar
foobar
car
If we now use grep alone, we get the following:
$ grep -Pa '\x00' --color=never test.txt
car
But piping it to cat allows us to visualize the control characters:
$ grep -Pa '\x00' --color=never test.txt | cat -vET
^I^#^K^#car$
Why --color=never: If your grep is tuned to have --color=auto or --color=always it will add extra control characters to be interpreted as color for the terminal. And this might confuse you by the content.
$ grep -Pa '\x00' --color=always test.txt | cat -vET
^I^[[01;31m^[[K^#^[[m^[[K^K^[[01;31m^[[K^#^[[m^[[Kcar$

sed can.
sed -n '/\x0/ { s/\x0/<NUL>/g; p}' file
-n skips printing any output unless explicitly requested.
/\x0/ selects for only lines with null bytes.
{...} encapsulates multiple commands, so that they can be collectively applied always and only when the /\x0/ has detected a null on the line.
s/\x0/<NUL>/g; substitutes in a new, visible value for the null bytes. You could make it whatever you want - I used <NUL> as something both reasonably obvious and yet unlikely to occur otherwise. You should probably grep the file for it first to be sure the pattern doesn't exist before using it.
p; causes lines that have been edited (because they had a null byte) to show.
This basically makes sed an effective grep for nulls.

Remove lines with japanese characters from a file

First question on here- I've searched around to put together an answer to this but have come up empty thus far.
I have a multi-line text file that I am cleaning up. Part of this is to remove lines that include Japanese characters. I have been using sed for my other operations but it is not working in this instance.
I was under the impression that using the -r switch and the \p{Han} regular expression would work (from looking at other questions of this kind), but it is not working in this case.
Here is my test string - running this returns the full string, and does not filter out the JP characters as I was expecting.
echo 80岁返老还童的处女: 第3话 | sed -r "s/\\p\{Han\}//g"
Am I missing something? Is there another command I should be using instead?

I think this might work for you:
echo "80岁返老还童的处女: 第3话" | tr -cd '[:print:]\n'
sed doesn't support unicode classes AFAIK, and nor support multibyte ranges.
-d deletes characters in SET1, and -c reverses it.
[:print:] matches all printable characters including space.
\n is a newline
The above will not only remove Japanese characters but all multibyte characters, including control characters.
Perl can also be used:
PERLIO=:utf8 perl -pe 's/\p{Han}//g' file
PERLIO=:utf8 tells Perl to tread input and output as UTF-8

How to replace a <85> to a new line in bash script

I’m running out of idea on how to replace this character “<85>” to a new line (please treat this as one character only – I think this is a non-printable character).
I tried this one in my script:
cat file | awk '{gsub(”<85>”,RS);print}' > /tmp/file.txt
but didn’t work.
I hope someone can help.
Thanks.

With sed: sed -e $'s/\302\205/\\n/' file > file.txt
Or awk: awk '{gsub("\302\205","\n")}7'
The magic here was in converting the <85> character to octal codepoints.
I used hexdump -b on a file I manually inserted that character into.

tr '\205' '\n' <file > file.txt
tr is the transliterate command; it translates one character to another (or deletes it, or …). The version of tr on Mac OS X doesn't recognize hexadecimal escapes, so you have to use octal, and octal 205 is hex 85.
I am assuming that the file contains a single byte '\x85', rather than some combination of bytes that is being presented as <85>. tr is not good for recognizing multibyte sequences that need to be transliterated.

sed help: matching and replacing a literal "\n" (not the newline)

i have a file which contains several instances of \n.
i would like to replace them with actual newlines, but sed doesn't recognize the \n.
i tried
sed -r -e 's/\n/\n/'
sed -r -e 's/\\n/\n/'
sed -r -e 's/[\n]/\n/'
and many other ways of escaping it.
is sed able to recognize a literal \n? if so, how?
is there another program that can read the file interpreting the \n's as real newlines?

Can you please try this
sed -i 's/\\n/\n/g' input_filename

What exactly works depends on your sed implementation. This is poorly specified in POSIX so you see all kinds of behaviors.
The -r option is also not part of the POSIX standard; but your script doesn't use any of the -r features, so let's just take it out. (For what it's worth, it changes the regex dialect supported in the match expression from POSIX "basic" to "extended" regular expressions; some sed variants have an -E option which does the same thing. In brief, things like capturing parentheses and repeating braces are "extended" features.)
On BSD platforms (including MacOS), you will generally want to backslash the literal newline, like this:
sed 's/\\n/\
/g' file
On some other systems, like Linux (also depending on the precise sed version installed -- some distros use GNU sed, others favor something more traditional, still others let you choose) you might be able to use a literal \n in the replacement string to represent an actual newline character; but again, this is nonstandard and thus not portable.
If you need a properly portable solution, probably go with Awk or (gasp) Perl.
perl -pe 's/\\n/\n/g' file
In case you don't have access to the manuals, the /g flag says to replace every occurrence on a line; the default behavior of the s/// command is to only replace the first match on every line.

awk seems to handle this fine:
echo "test \n more data" | awk '{sub(/\\n/,"**")}1'
test ** more data
Here you need to escape the \ using \\

$ echo "\n" | sed -e 's/[\\][n]/hello/'

sed works one line at a time, so no \n on 1 line only (it's removed by sed at read time into buffer). You should use N, n or H,h to fill the buffer with more than one line, and then \n appears inside. Be careful, ^ and $ are no more end of line but end of string/buffer because of the \n inside.
\n is recognized in the search pattern, not in the replace pattern. Two ways for using it (sample):
sed s/\(\n\)bla/\1blabla\1/
sed s/\nbla/\
blabla\
/
The first uses a \n already inside as back reference (shorter code in replace pattern);
the second use a real newline.
So basically
sed "N
$ s/\(\n\)/\1/g
"
works (but is a bit useless). I imagine that s/\(\n\)\n/\1/g is more like what you want.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

sed doesn't remove characters from UTF range properly - linux

Perl understands UTF-8: perl -CSD -pe 's/[^\N{U+0600}-\N{U+06FF}]//g' -- file.txt -C turns of UTF-8 support, S means for stdin/stdout/stderr, D means for any i/o streams. You can also use Unicode properties: s/\P{Cyrillic}//g

Related

How to replace non printable characters in file like <97> on linux [duplicate]

Find files with non-printing characters (null bytes)

Remove lines with japanese characters from a file

How to replace a <85> to a new line in bash script

sed help: matching and replacing a literal "\n" (not the newline)

Categories

Resources