Inserting ',' into certain position of a text containing full-width characters


Inserting a "," in a particular position of a text
I got errors when I applied the answer to the question above, because my text contains full-width characters.
I work with Japanese text data on a RHEL server. The answer above was a perfect solution for UTF-8 text, but the command won't work for Japanese text in SJIS format.
As I understand it, the difference between the two is that UTF-8 counts every character as 1 byte, whereas SJIS counts letters and digits as 1 byte and Japanese characters such as あ as 2 bytes. So the sed command only works for UTF-8 when inserting ',' at given positions.
My input would be like
aaaああ123あ
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes so my desired outcome is
aaa,ああ,123,あ
It doesn't have to be sed, as long as it works on a UNIX system.
Is there any way to insert ',' after a given number of bytes, counting each full-width character as 2 bytes and every other character as 1 byte?

あ is 3 bytes in UTF-8
Whether GNU sed treats . as a character or as a byte depends on the locale: in a UTF-8 locale it matches a whole multibyte character. Set the locale to C for the sed command (LC_ALL=C) and . matches a single byte instead.
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes
Capture each group of bytes and refer back to it with a backreference:
LC_ALL=C sed 's/^\(...\)\(....\)\(...\)/\1,\2,\3,/'
or you could specify numbers:
LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/'
Or, more cleanly, with extended regular expressions (-E):
LC_ALL=C sed -E 's/^(.{3})(.{4})(.{3})/\1,\2,\3,/'
The following seems to work in my terminal:
$ <<<'aaaああ123あ' iconv -f UTF-8 -t SHIFT-JIS | LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/' | iconv -f SHIFT-JIS -t UTF-8
aaa,ああ,123,あ
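If the field widths vary from file to file, writing one capture group per field gets tedious. As a minimal sketch of the same byte-counting idea, the widths can be passed as arguments. The function name insert_commas is made up for illustration, and the sketch assumes an awk that works on bytes when LC_ALL=C is set (gawk and mawk do):

```shell
# insert_commas WIDTH... (hypothetical helper, not from the answer above)
# Reads lines on stdin and writes each line with a comma inserted after
# every field of the given byte widths. LC_ALL=C makes awk count bytes,
# so a 2-byte SJIS character counts as 2.
insert_commas() {
  LC_ALL=C awk -v widths="$*" '
    BEGIN { n = split(widths, w, " ") }
    {
      out = ""
      pos = 1
      for (i = 1; i <= n; i++) {
        out = out substr($0, pos, w[i]) ","
        pos += w[i]
      }
      # Whatever is left after the last width is printed unchanged.
      print out substr($0, pos)
    }'
}

# Usage with the sample line from the question (converted to SJIS and back):
# printf 'aaaああ123あ\n' | iconv -f UTF-8 -t SHIFT-JIS | insert_commas 3 4 3 | iconv -f SHIFT-JIS -t UTF-8
```

Note that a line shorter than the sum of the widths still receives one comma per width, so short lines may need separate handling.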

Related

How to insert UTF-16 character with sed in a file?

I have a file coded as UTF-16 and I want to split each line into fields (at fixed positions) separated by commas.
I have tried the following:
Option 1)
sed -i 's/./&,/400;s/./&,/360;*<and so on for several positions in the line>* FILE
This seems to work, but when viewing the file in vim it is obvious that something is wrong: each comma is displayed as a single byte, whereas every other symbol is displayed as a two-byte character.
BEFORE: 2^@A^@U^@W^@2^@0^@1^@9^@0^@1^@0^@1^@0^@0^@0^@1^@0^@0^@
AFTER: 2^@,A^@U^@,W^@,2^@0^@1^@9^@0^@1^@0^@1^@,0^@0^@0^@1^@0^@0^@,
Option 2)
Then I tried to use sed again but, instead of "," I typed the UTF-16 code of the comma, that is 002c:
sed -i "s/./&\U002c/400...
or even
s/./&$(echo -ne '\u002c')/400...
Neither of these worked; the results are exactly the same as in Option 1.

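The problem is not how the new character is inserted with sed; it is the file's encoding. sed inserts the comma as a single byte (0x2c), while every UTF-16 code unit is two bytes, so the text after each comma is misaligned.
I noticed that the file is UTF-16LE (I had assumed it was UTF-16BE). I converted it directly to ASCII with
iconv -f UTF-16LE -t ASCII sourcefile >destinationfile
and after that I could handle the file as usual with sed, grep, awk, etc.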
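Converting to ASCII only works when the file contains nothing but ASCII characters, and it leaves the result in a different encoding. As a hedged sketch, the data can instead be round-tripped through UTF-8, so that sed sees ordinary text and the output is UTF-16LE again. The helper name utf16le_sed is hypothetical, and the sketch assumes the input really is UTF-16LE and that iconv is installed:

```shell
# utf16le_sed SED_SCRIPT (hypothetical helper)
# Filters UTF-16LE text on stdin through sed by way of UTF-8 and writes
# UTF-16LE to stdout.
utf16le_sed() {
  iconv -f UTF-16LE -t UTF-8 | sed "$1" | iconv -f UTF-8 -t UTF-16LE
}

# Usage (file names are placeholders):
# utf16le_sed 's/./&,/3' < sourcefile > destinationfile
```

If the text contains non-ASCII characters, run it in a UTF-8 locale so that . counts characters rather than bytes. A byte order mark at the start of the file may also count as a character on the first line.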

Find files with non-printing characters (null bytes)

I have got the log of my application with a field that contains strange characters.
I see these characters only when I use less command.
I tried to copy the result of my line of code in a text file and what I see is
CTP_OUT=^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
I'd like to know if there is a way to find these null characters. I tried a grep command, but it didn't show anything.

I can hardly believe it: I get to write an answer involving cat!
The characters you are seeing are non-printable characters, which are often written in caret notation. Caret notation is a way to visualize non-printable characters. As mentioned in the question, ^@ is the representation of NUL.
If your file has non-printable characters, you can visualize them using cat -vET:
-E, --show-ends: display $ at end of each line
-T, --show-tabs: display TAB characters as ^I
-v, --show-nonprinting: use ^ and M- notation, except for LFD and TAB
source: man cat
I've added the -E and -T flags so that line ends and tabs are made visible too.
grep writes non-printable characters to its output as raw bytes, which the terminal does not display, so you have to pipe its output to cat -vET to see them.
Show all lines with non-printable characters:
$ grep -aE '[^[:print:]]' --color=never file | cat -vET
Here, the ERE [^[:print:]] matches any non-printable character. The -a option makes grep treat the file as text; without it, GNU grep reports a file containing NUL bytes only as a matching binary file.
Show all lines with NUL:
$ grep -Pa '\x00' --color=never file | cat -vET
Be aware that we need Perl-compatible regular expressions (-P, a GNU grep option) here, as they understand hexadecimal and octal escapes.
Various control characters can be written in C language style: \n matches a newline, \t a tab, \r a carriage return, \f a form feed, etc.
More generally, \nnn, where nnn is a string of three octal digits, matches the character whose native code point is nnn. You can easily run into trouble if you don't have exactly three digits. So always use three, or since Perl 5.14, you can use \o{...} to specify any number of octal digits.
Similarly, \xnn, where nn are hexadecimal digits, matches the character whose native ordinal is nn. Again, not using exactly two digits is a recipe for disaster, but you can use \x{...} to specify any number of hex digits.
source: Perl 5 version 26.1 documentation
An example:
$ printf 'foo\012\011\011bar\014\010\012foobar\012\011\000\013\000car\012\011\011\011\012' > test.txt
$ cat test.txt
foo
bar
foobar
car
If we now use grep alone, we get the following:
$ grep -Pa '\x00' --color=never test.txt
car
But piping it to cat allows us to visualize the control characters:
$ grep -Pa '\x00' --color=never test.txt | cat -vET
^I^@^K^@car$
Why --color=never: if your grep is set up with --color=always (for example through an alias), it adds escape sequences that the terminal interprets as color. These show up in the cat -v output and clutter the result:
$ grep -Pa '\x00' --color=always test.txt | cat -vET
^I^[[01;31m^[[K^@^[[m^[[K^K^[[01;31m^[[K^@^[[m^[[Kcar$
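If the goal is to find out which files contain NUL bytes at all, rather than to display the affected lines, counting the bytes avoids regular expressions entirely. A minimal sketch, assuming only POSIX tr and wc (the function name count_nuls is made up for this example):

```shell
# count_nuls FILE (hypothetical helper): print the number of NUL bytes in FILE.
count_nuls() {
  # tr -cd keeps only the listed byte (NUL, octal 000); wc -c counts what is left.
  # The final tr strips the padding some wc implementations add.
  tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Usage: report every regular file in the current directory that contains NUL.
# for f in ./*; do
#   [ -f "$f" ] && [ "$(count_nuls "$f")" -gt 0 ] && printf '%s\n' "$f"
# done
```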

sed can.
sed -n '/\x0/ { s/\x0/<NUL>/g; p}' file
-n skips printing any output unless explicitly requested.
/\x0/ selects only lines with null bytes (the \x escape is a GNU sed extension).
{...} encapsulates multiple commands, so that they can be collectively applied always and only when the /\x0/ has detected a null on the line.
s/\x0/<NUL>/g; substitutes in a new, visible value for the null bytes. You could make it whatever you want - I used <NUL> as something both reasonably obvious and yet unlikely to occur otherwise. You should probably grep the file for it first to be sure the pattern doesn't exist before using it.
p; causes lines that have been edited (because they had a null byte) to show.
This basically makes sed an effective grep for nulls.

Remove lines with japanese characters from a file

First question on here- I've searched around to put together an answer to this but have come up empty thus far.
I have a multi-line text file that I am cleaning up. Part of this is to remove lines that include Japanese characters. I have been using sed for my other operations but it is not working in this instance.
I was under the impression that using the -r switch and the \p{Han} regular expression would work (from looking at other questions of this kind), but it is not working in this case.
Here is my test string - running this returns the full string, and does not filter out the JP characters as I was expecting.
echo 80岁返老还童的处女: 第3话 | sed -r "s/\\p\{Han\}//g"
Am I missing something? Is there another command I should be using instead?

I think this might work for you:
echo "80岁返老还童的处女: 第3话" | tr -cd '[:print:]\n'
sed doesn't support Unicode character classes as far as I know, nor does it support multibyte ranges.
-d deletes the characters in SET1, and -c complements the set first, so everything not listed is deleted.
[:print:] matches all printable characters, including space.
\n is a newline.
Note that the above removes not only Japanese characters but every multibyte character, as well as control characters.
Perl can also be used:
PERLIO=:utf8 perl -pe 's/\p{Han}//g' file
PERLIO=:utf8 tells Perl to treat input and output as UTF-8.

Sed is not writing to file

I simply want to change the delimiter in my CSV.
The file comes from an outside server, and the delimiter looks like this: ^A.
name^Atype^Avalue^A
john^Ab^A500
mary^Ac^A400
jack^Ad^A200
I want to get this:
name,type,value
john,b,500
mary,c,400
jack,d,200
I need to change it to a comma (,) or a tab, but my sed command, despite printing the correct output, does not write to the file.
cat -v CSVFILE | sed -i "s/\^A/,/g"
When I use the line above, it correctly outputs the file delimited by commas instead of ^A, but it doesn't write to the file.
I also tried this:
sed -i "s/\^A/,/g" CSVFILE
That doesn't work either.
What am I doing wrong?

Literal ^A (two characters, ^ and A) is how cat -v visualizes control character 0x1 (ASCII code 1, named SOH (start of heading)). ^A is an example of caret notation to represent unprintable ASCII characters:
^A stands for the keyboard combination Control-A, which, when preceded by the generic escape sequence Control-V, is how you can create the actual control character in your terminal; in other words, Control-V Control-A will insert an actual 0x1 character.
Incidentally, the logic of caret notation (^<letter>) is: the letter corresponds to the ASCII value of the control character represented; e.g., A corresponds to 0x1, and D corresponds to 0x4 (^D, EOT).
To put it differently: you add 0x40 to the ASCII value of the control character to get the ASCII value of its letter representation in caret notation.
^@ to represent NUL (0x0) and ^? to represent DEL (0x7f) are consistent with this notation, because @ has ASCII value 0x40 (i.e., it comes just before A (0x41) in the ASCII table) and 0x40 + 0x7f constrained to 7 bits (bit-ANDed with the max. ASCII value 0x7f) yields 0x3f, which is the ASCII value of ?.
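The arithmetic above can be tried out in the shell. The following is only an illustrative sketch (the function name caret is made up); it flips bit 0x40 instead of adding it, which gives the same result for 0x00 to 0x1f and also covers DEL without the extra masking step:

```shell
# caret CODE (hypothetical helper): print the caret notation for a control
# character given by its decimal ASCII value (0-31 or 127).
caret() {
  # Flipping bit 0x40 (decimal 64) maps 1 -> 65 (A), 0 -> 64 (@), 127 -> 63 (?).
  # The inner printf turns that number into a three-digit octal escape,
  # which the outer printf then prints as a character.
  printf "^\\$(printf '%03o' $(( $1 ^ 64 )))"
}

# Usage: caret 1 prints ^A, caret 0 prints ^@, caret 127 prints ^?
```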
To inspect a given file for the ASCII values of exotic control characters, you can pipe it to od -c, which represents 0x1 as (octal) 001.
This implies that, when passing the file to sed directly, you cannot use caret notation and must instead use the actual control character in your s call.
Note that when you use Control-VControl-A to create an actual 0x1 character, it will also appear in caret notation - as ^A - but in that case it is just the terminal's visualization of the true control character; while it may look like the two printable characters ^ and A, it is not. Purely visually you cannot tell the difference - which is why using an escape sequence or ANSI C-quoted string to represent the control character is the better choice - see below.
Assuming your shell is bash, ksh, or zsh, the better alternative to using Control-VControl-A is to use an ANSI C-quoted string to generate the 0x1 character: $'\1'
However, as Lars Fischer points out in a comment on the question, GNU sed also recognizes escape sequence \x01 for 0x1.
Thus, your command should be:
sed -i 's/\x01/,/g' CSVFILE # \x01 only recognized by GNU sed
or, using an ANSI C-quoted string:
sed -i $'s/\1/,/g' CSVFILE
Note: While this form can in principle be used with BSD/OSX sed, the -i syntax is slightly different: you'd have to use sed -i '' $'s/\1/,/g' CSVFILE
The only reason to use sed for your task is to take advantage of in-place updating (-i); otherwise, tr is the better choice - see Ed Morton's answer.

This is the job tr was created to do:
tr '<control-A>' ',' < file > tmp && mv tmp file
Replace <control-A> with a literal control-A obviously.
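To avoid typing a literal control character, tr also accepts an octal escape. A small sketch of the same temp-file approach wrapped in a function (the name soh_to_commas is hypothetical, and mktemp is assumed to be available):

```shell
# soh_to_commas FILE (hypothetical helper): replace every Control-A byte in
# FILE with a comma, going through a temporary file.
soh_to_commas() {
  tmp=$(mktemp) || return 1
  # '\001' is the octal escape for Control-A, so no literal control
  # character has to be typed.
  tr '\001' ',' < "$1" > "$tmp" && mv "$tmp" "$1"
}

# Usage: soh_to_commas CSVFILE
```

Because the temporary file replaces the original, the result takes on the temporary file's permissions; that may or may not matter for the task at hand.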

If your sed supports the -i option, you could use it like this:
sed -i.bak -e "s/\^A/,/g" CSVFILE
(This assumes the delimiter in the source file consists of the two characters ^ and A; if ^A is supposed to refer to Control-A, then you will have to make adjustments accordingly, e.g. using 's/\x01/,/g'.)
Otherwise, assuming you want to keep a copy of the original file (e.g. in case the result is not what you expect -- see below), an incantation such as the following can be used:
mv CSVFILE CSVFILE.bak && sed "s/\^A/,/g" CSVFILE.bak > CSVFILE
As pointed out elsewhere, if the source-file separator is Control-A, you could also use tr '\001' , (or tr '\001' '\t' for a tab).
The caution is that the delimiter in the source file might well be used precisely because commas might appear in the "values" that the separator-character is separating. If that is a possibility, then a different approach will be needed. (See e.g. https://www.rfc-editor.org/rfc/rfc4180)

In case it's run on OS X:
Add an extension to -i to edit in place and keep a backup of the original:
sed -i.bak "s/^A/,/g" CSVFILE
Or, to edit in place without a backup:
sed -i '' "s/^A/,/g" CSVFILE
You can also redirect the output to a file, without -i on your sed command (and without cat -v, which would turn the control character into the two characters ^ and A):
sed "s/^A/,/g" CSVFILE > output
Make sure you type the ^A as an actual control character:
Ctrl+V, then Ctrl+A

How to replace a <85> to a new line in bash script

I’m running out of ideas on how to replace the character “<85>” with a new line (please treat this as one character only; I think it is a non-printable character).
I tried this in my script:
cat file | awk '{gsub("<85>",RS);print}' > /tmp/file.txt
but it didn’t work.
I hope someone can help.
Thanks.

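With GNU sed (in bash, ksh, or zsh): sed -e $'s/\302\205/\\n/g' file > file.txt
Or awk: awk '{gsub("\302\205","\n")}7' file > file.txt
The magic here is in writing the <85> character (U+0085, which UTF-8 encodes as the two bytes 0xC2 0x85) as octal byte values.
I found them by running hexdump -b on a file into which I had manually inserted that character.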

tr '\205' '\n' <file > file.txt
tr is the transliterate command; it translates one character to another (or deletes it, or …). The version of tr on Mac OS X doesn't recognize hexadecimal escapes, so you have to use octal, and octal 205 is hex 85.
I am assuming that the file contains a single byte '\x85', rather than some combination of bytes that is being presented as <85>. tr is not good for recognizing multibyte sequences that need to be transliterated.
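The answers above differ in what they assume about the bytes: the sed and awk commands expect the UTF-8 sequence 0xC2 0x85, while tr expects a lone 0x85 byte. As a hedged sketch of the UTF-8 case that does not depend on the locale of the session (the function name nel_to_newline is made up; the C locale is assumed to make awk compare plain bytes, as gawk and mawk do):

```shell
# nel_to_newline (hypothetical helper): replace the UTF-8 encoding of U+0085
# (bytes 0xC2 0x85, octal 302 205) with a newline, reading stdin.
nel_to_newline() {
  # LC_ALL=C makes awk compare bytes instead of decoding characters.
  LC_ALL=C awk '{ gsub("\302\205", "\n"); print }'
}

# To see which form a file uses, dump a few bytes first, e.g.:
# od -c file | head
# Usage: nel_to_newline < file > file.txt
```

Replacing a lone 0x85 in UTF-8 text is deliberately left out, because that byte also occurs inside other multibyte characters.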
