How to tell sed "do not remove some characters"? - linux

I have a text file containing Arabic characters and some other characters (punctuation marks, numbers, English characters, ... ).
How can I tell sed to remove all the characters in the file, except Arabic ones? In short I can say that we typically tell sed to remove/replace some specific characters and print others, but now I am looking for a way to tell sed just print my desired characters, and remove all other characters.

With GNU sed, you should be able to specify characters by their hex code. You can use those in a character class:
sed 's/[\x00-\x7F]//g' # hex notation
sed 's/[\o000-\o177]//g' # octal notation
You should also be able to achieve the same effect with the tr command:
tr -d '\000-\177' # no brackets needed; note that this range also deletes newlines
Both methods assume your input file is UTF-8 encoded. Every byte of a multi-byte UTF-8 character has its highest bit set, so you can simply strip everything that's a standard ASCII (7-bit) character.
To remove everything except some well-defined characters, use a negated character class:
sed 's/[^characters you want to keep]//g'
Using a pattern like [^…]\+ might improve the performance of the regex.
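As a rough sketch of the ASCII-stripping idea above, assuming UTF-8 input: the helper name strip_ascii and the sample string are made up for illustration. Unlike the plain \000-\177 range, this set skips \012, so line breaks survive.

```shell
# Hypothetical helper: delete every ASCII byte except newline (\012),
# leaving the bytes of multi-byte UTF-8 characters untouched.
strip_ascii() {
    LC_ALL=C tr -d '\000-\011\013-\177'
}

# Sample input: ASCII text mixed with a four-letter Arabic word,
# written as octal escapes for its UTF-8 bytes.
printf 'abc 123 \330\263\331\204\330\247\331\205!\n' | strip_ascii
```

Only the Arabic letters and the trailing newline should come out the other end. This keeps every non-ASCII character, not just Arabic ones, so it only fits files where Arabic is the sole non-ASCII script.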

Related

How can I escape all non-alphanumeric characters in AWK?

I inherited a very large AWK script that matches against .csv files, and I've found it fails to match some non-alphanumeric characters, especially + ( ).
While I realize this would be easy in sed:
sed 's/\([^A-Za-z0-9]\)/\\\1/g'
I can't seem to find a way to call on the matched character the same way in AWK.
For instance a sample input is:
select.awk 'Patient data +(B/U)'
I would like to escape the non-alphanumeric characters, and turn the line into:
Patient\ data\ \+\(B\/U\)
I have seen some people pass very obscure non-alphanumeric characters as well, which I would like to escape.
gsub(/[^[:alnum:]]/, "\\\\&", arg)
GNU awk has more features, such as gensub, which returns the result instead of modifying the string in place:
awk '{n=gensub(/[^[:alnum:]]/,"\\\\&","g"); print n}' d.csv
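Here is one way the gsub call might be wrapped for a single command-line argument. The function name escape_nonalnum is hypothetical, and reading the string from ARGV is my own choice, not something from the original script.

```shell
# Hypothetical wrapper: backslash-escape every non-alphanumeric character.
# The string is read from ARGV rather than passed with -v, because -v
# would interpret backslash escapes in the value before the script sees it.
escape_nonalnum() {
    awk 'BEGIN {
        arg = ARGV[1]
        gsub(/[^[:alnum:]]/, "\\\\&", arg)   # "\\\\&" = literal backslash + matched text
        print arg
    }' "$1"
}

escape_nonalnum 'Patient data +(B/U)'
# prints: Patient\ data\ \+\(B\/U\)
```

This assumes an awk that supports POSIX character classes such as [:alnum:]; very old mawk builds do not.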

What does this sed command line do?

I saw these lines while studying.
$temp = 'echo $line | sed s/[a-z AZ 0-9 _]//g'
IF($temp != '')
echo "Line contains illegal characters"
I don't understand. Isn't sed like a substitution function? In the code, [a-z AZ 0-9 _] should be replaced with ''. I don't understand how this determines whether $line has illegal characters.
sed is a stream editor tool that applies regular expressions to transform the input. The command
sed s/regex/replace/g
reads from stdin and every time it finds something matching regex, it replaces it with the contents of replace. In your case, the command
sed s/[a-z A-Z 0-9 _]//g
has [a-z A-Z 0-9 _] as its regular expression and the empty string as its replacement. (Did you forget a dash between the A and the Z?) This means that anything matching the indicated regular expression gets deleted. This regular expression means "any character that's either between a and z, between A and Z, between 0 and 9, a space, or an underscore," so this command essentially deletes any alphanumeric characters, spaces, or underscores from the input and dumps what's left to stdout. Testing whether the output is empty then asks whether there were any characters in there that weren't alphanumeric, spaces, or underscores, which is how the code works.
I'd recommend adding sed to the list of tools you should get a basic familiarity with, since it's a fairly common one to see on the command-line.
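The same check could be written in POSIX shell roughly like this. The function name has_illegal_chars and the sample line are hypothetical, and LC_ALL=C is added on the assumption that only ASCII letters should count as legal.

```shell
# Hypothetical helper: succeeds (exit status 0) when the argument contains
# any character other than ASCII letters, digits, spaces, or underscores.
has_illegal_chars() {
    # Delete every allowed character; whatever remains is "illegal".
    leftover=$(printf '%s\n' "$1" | LC_ALL=C sed 's/[a-zA-Z0-9 _]//g')
    [ -n "$leftover" ]
}

line='hello_world 42; rm'
if has_illegal_chars "$line"; then
    echo "Line contains illegal characters"
fi
```

The sample line triggers the message because of the semicolon, which sed leaves behind.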

using tr to strip characters but keep line breaks

I am trying to format some text that was converted from UTF-16 to ASCII, the output looks like this:
C^#H^#M^#M^#2^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
T^#h^#e^#m^#e^# ^#M^#a^#n^#a^#g^#e^#r^# ^#f^#o^#r^# ^#3^#D^#S^#^#^#^#^#^#^#^#^#^#^#^#^#^#
The only text I want out of that is:
CHMM2
Theme Manager for 3DS
So there is a line break "\n" at the end of each line and when I use
tr -cs 'a-zA-Z0-9' 'newtext' < infile.txt > outfile.txt
It is stripping the new line as well so all the text ends up in one big string on one line.
Can anyone assist with figuring out how to strip out only the ^#'s and keeping spaces and new lines?
The ^#s are most certainly null characters, \0s, so:
tr -d '\0'
Will get rid of them.
But this is not really the correct solution. You should simply use the iconv command to convert from UTF-16 to UTF-8 (see its man page for more information). That is, of course, what you're really trying to accomplish here, and this will be the correct way to do it.
This is an XY problem. Your problem is not deleting the null characters. Your real problem is how to convert from UTF-16 to either UTF-8, or maybe US-ASCII (and I chose UTF-8, as the conservative answer).
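A minimal sketch of the iconv route, assuming the file is little-endian UTF-16 without a byte order mark (the sample output, where each character is followed by a null, suggests that). The function name utf16le_to_utf8 is made up for illustration.

```shell
# Hypothetical helper: convert UTF-16LE text on stdin to UTF-8 on stdout.
# UTF-16LE is an assumption; use UTF-16BE or UTF-16 if the file differs.
utf16le_to_utf8() {
    iconv -f UTF-16LE -t UTF-8
}

# Sample: "CHMM2" plus a newline, each character followed by a NUL byte.
printf 'C\000H\000M\000M\0002\000\n\000' | utf16le_to_utf8
```

On a real file this would be something like utf16le_to_utf8 < infile.txt > outfile.txt. If the source pads fields with U+0000 characters, those survive the conversion as single null bytes and can still be removed afterwards with tr -d '\0'.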

Print non-ascii/unicode characters in shell

I am using the following command to search and print non-ascii characters:
grep --color -R -C 2 -P -n "[\x80-\xFF]" .
The output that I get, prints the line which has non-ascii characters in it.
However it does not print the actual unicode character.
Is there a way to print the unicode character?
output
./test.yml-35-
./test.yml-36-- name: Flush Handlers
./test.yml:37:  meta: flush_handlers
./test.yml-38-
--
This was answered in Searching for non-ascii characters. The real issue as shown in Filtering invalid utf8 is that the regular expression you are using is for single bytes, while UTF-8 is a multibyte encoding (and the pattern must therefore cover multiple bytes).
The extensive answer by @Peter O in the latter Q/A appears to be the best one, using Perl. grep is the wrong tool.

What sed script can replace a range of hex characters with another

I need to replace some non text characters in some automatically generated files with spaces.
Although they are text files, after processing some characters are added and they can no longer be edited as text.
Is there a sed command to do that?
Depending on your platform and sed version, you may or may not be able to do something like s/[\000-\037]/ /g; but the portable and simple alternative is this:
tr '\000-\037' ' ' <input >output
(All character codes are "binary"; I have assumed you mean control characters, but if you mean 8-bit characters \200-\377 or something else altogether, it's obviously trivial to adjust the range.)
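One caveat with the full \000-\037 range is that it includes tab (\011) and newline (\012), so line breaks would turn into spaces too. A hedged variant that leaves those two alone might look like this; the function name blank_controls is made up.

```shell
# Hypothetical helper: replace control characters with spaces, but keep
# tab (\011) and newline (\012) so the line structure is preserved.
# Relies on tr padding the one-character second set to the length of the
# first, which GNU and BSD tr both do.
blank_controls() {
    tr '\000-\010\013-\037' ' '
}

# Sample input containing SOH (\001), a tab, and ESC (\033).
printf 'a\001b\tc\033d\n' | blank_controls
```

Used on files, this would be blank_controls <input >output, mirroring the command above.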

Resources