I am creating a report as a pipe-separated text file using an Oracle application framework on a Unix file server. The file is in ISO-8859-1 encoding, but I need to send it downstream in UTF-8 (which I cannot generate from the Oracle framework), so I am converting it using the command below:
iconv -f iso-8859-1 -t UTF-8//TRANSLIT "$i" -o "$i"
But there is also a requirement to replace the "|" separator with the inverted exclamation mark character "¡".
So how can I find the "|" character and replace it with "¡" on Unix?
The INVERTED EXCLAMATION MARK is Unicode U+00A1 and is a member of the ISO-8859-1 charset with code 0xa1 (0241 in octal). Since you know your input file is ISO-8859-1 encoded, you can convert the pipe with a mere tr command:
tr '|' '\241' < infile > outfile
You can then use iconv to convert outfile from ISO-8859-1 to UTF-8.
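For example (a sketch; outfile.utf8 is a placeholder name):
iconv -f ISO-8859-1 -t UTF-8 outfile > outfile.utf8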
Demo (on an ISO-8859-1 terminal):
$ echo 'a|b' | tr '|' '\241'
a¡b
$
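Both steps can also be combined into one pipeline (a sketch, reusing the $i variable and the //TRANSLIT option from your original command, and writing to a temporary file first):
tr '|' '\241' < "$i" | iconv -f ISO-8859-1 -t UTF-8//TRANSLIT > "$i.tmp" && mv "$i.tmp" "$i"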
Related: Inserting a "," in a particular position of a text
Following the question above, I got errors because my text contained some full-width characters.
I deal with Japanese text data on a RHEL server. The question above had a perfect solution for UTF-8 text, but the Unix command won't work for Japanese text in SJIS format.
The difference between the two, as I understand it, is that UTF-8 counts every character as 1 byte, while SJIS counts alphanumeric characters as 1 byte and other Japanese characters, such as あ, as 2 bytes. So the sed command only works for UTF-8 when inserting ',' at certain positions.
My input would be like
aaaああ123あ
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes so my desired outcome is
aaa,ああ,123,あ
It doesn't necessarily have to be sed, as long as it works on a Unix system.
Is there any way to insert ',' after a given number of bytes while counting full-width characters as 2 bytes and the others as 1 byte?
あ is 3 bytes in UTF-8
Depending on the locale, GNU sed supports Unicode, so reset the locale to C before running the sed commands and it will work on bytes.
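The difference is easy to see (a minimal demo, assuming a UTF-8 terminal and GNU sed; あ is 3 bytes in UTF-8):
$ echo 'あ' | sed 's/./X/g'            # UTF-8 locale: . matches one character
X
$ echo 'あ' | LC_ALL=C sed 's/./X/g'   # C locale: . matches one byte
XXX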
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes
Just use a backreference to remember the bytes.
LC_ALL=C sed 's/^\(...\)\(....\)\(...\)/\1,\2,\3,/'
or you could specify numbers:
LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/'
And cleaner with the extended-regex extension:
LC_ALL=C sed -E 's/^(.{3})(.{4})(.{3})/\1,\2,\3,/'
The following seems to work in my terminal:
$ <<<'aaaああ123あ' iconv -f UTF-8 -t SHIFT-JIS | LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/' | iconv -f SHIFT-JIS -t UTF-8
aaa,ああ,123,あ
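For the actual SJIS file from the question, no conversion is needed at all, since sed in the C locale simply counts bytes (a sketch; infile.sjis and outfile.sjis are placeholder names):
LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/' infile.sjis > outfile.sjis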
While editing a CSV file on Linux, special characters look like Â£stackoverflow, Â£unixbox,Â£query. My question is how to remove the Â from the CSV file.
Input: Â£stackoverflow, Â£unixbox,Â£query
Output: £stackoverflow, £unixbox,£query
Observations on the Linux box:
Currently the terminal window's translation setting is ISO-8859-1. When I change the window setting -> Translation -> UTF-8 and then open the same file in the vi editor, the Â char disappears. I have tried the iconv command as well, but it didn't work. It may be because I am converting the file from ISO-8859-1 to UTF-8 while the default setting of the Linux box is ISO-8859-1, so it shows me Â and does not remove this char. How can I remove it?
You can try the Perl solution below. It removes all characters whose ordinal values are outside the range 32 to 127 (which contains the ASCII text):
$ echo "£stackoverflow, £unixbox,£query Output: £stackoverflow, £unixbox,£query" | perl -pe ' s/[^\x20-\x7f]//g '
stackoverflow, unixbox,query Output: stackoverflow, unixbox,query
$
EDIT:
To remove just Â, use
$ echo "Â" | perl -pe ' s/./sprintf("%x |",ord($&))/eg ' # Find the underlying ordinal values for Â
c3 |82 |
$ echo "£stackoverflow, £unixbox,£query" | perl -pe ' s/\xc3\x82//g ' #removing it using s///
£stackoverflow, £unixbox,£query
$
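If you'd rather avoid Perl, GNU sed can delete the same byte pair (a hedged alternative, relying on GNU sed's \xHH escapes and the C locale for byte-wise matching):
$ echo "Â£stackoverflow, Â£unixbox,Â£query" | LC_ALL=C sed 's/\xc3\x82//g'
£stackoverflow, £unixbox,£query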
I have a problem: I'd like to decompress a string directly from a file. I have one bash script that creates another script.
#!/bin/bash
echo -n '#!/bin/bash
' > test.sh #generate header for interpreter
echo -n "echo '" >> test.sh #print echo to file
echo -n "My name is Daniel" | gzip -f >> test.sh #print encoded by gzip string into a file
echo -n "' | gunzip;" >> test.sh #print reverse commands for decode into a file
chmod a+x test.sh #make file executable
I want the generated script test.sh to be as short as possible, so I'm trying to compress the string "My name is Daniel" and write it directly into test.sh.
But if I run test.sh, I get: gzip: stdin has flags 0x81 -- not supported
Do you know why I get this problem?
gzip output is binary, so it can contain any byte value; because the script is generated with bash, those bytes pass through the shell's character encoding (echo $LANG).
The characters that cause problems between single quotes are NUL (0x00), ' (0x27), and the non-ASCII characters 128-255 (0x80-0xff).
A solution could be to use ANSI-C quotes ($'..') and to escape the NUL and non-ASCII characters.
EDIT: a bash string can't contain the NUL character:
gzip -c <<<"My name is Daniel" | od -c -tx1
Trying to create the ANSI-C quoted string:
echo -n $'\x1f\x8b\x08\x00\xf7i\xe2Y\x00\x03\xf3\xadT\xc8K\xccMU\xc8,VpI\xcc\xcbL\xcd\^C1\x00\xa5u\x87\xad\x11\x00\x00\x00' | od -c -tx1
shows that the string is truncated after the NUL character.
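A minimal illustration of the same truncation (assuming bash; od shows what actually survives the expansion):
$ echo -n $'a\x00b' | od -c
0000000   a
0000001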
The best compromise may be to use base64 encoding:
gzip <<<"My name is Daniel"| base64
base64 --decode <<__END__ | gzip -cd
H4sIAPts4lkAA/OtVMhLzE1VyCxWcEnMy0zN4QIAgdbGlBIAAAA=
__END__
or
base64 --decode <<<H4sIAPts4lkAA/OtVMhLzE1VyCxWcEnMy0zN4QIAgdbGlBIAAAA=|gzip -cd
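Putting it together, here is a sketch of a generator that writes a short self-decoding test.sh this way (assuming GNU coreutils base64; -w0 disables line wrapping):
#!/bin/bash
{
  echo '#!/bin/bash'
  printf 'base64 --decode <<< %s | gzip -cd\n' \
    "$(echo -n 'My name is Daniel' | gzip -c | base64 -w0)"
} > test.sh   # the generated script contains no raw binary
chmod a+x test.sh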
The problem was with storing the null character (\0) in a bash script.
The null character cannot be passed through echo or stored in a string variable; it can only be stored in files and pipes.
I wanted to avoid using base64, and I fixed it with
printf "...%b....%b" "\0" "\0"
I edited the script with the bless hex editor. It's working for me :)
I have a file which contains the letter ö. Except that it doesn't. When I open the file in gedit, I see:
\u00f6
I tried to convert the file, applying code that I found on other threads:
$ file blb.txt
blb.txt: ASCII text
$ iconv -f ISO-8859-15 -t UTF-8 blb.txt > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: ASCII text
What am I missing?
EDIT
I found this solution:
echo -e "$(cat blb.txt)" > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: UTF-8 Unicode text
The -e "enables interpretation of backslash escapes".
Still not sure why iconv didn't make it happen. I'm guessing it's something like "iconv only changes the encoding; it doesn't interpret escape sequences". Not sure yet what the difference is, though. Why did the Unicode people make this world such a mess? :D
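That guess can be confirmed: the file really contains the six ASCII characters \u00f6, so iconv has nothing to transcode (a minimal check; blb.txt is recreated here with a literal backslash):
$ printf '\\u00f6' > blb.txt
$ od -c blb.txt
0000000   \   u   0   0   f   6
0000006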
I'm trying to convert a UTF-16BE encoded file (byte order mark: 0xFE 0xFF) to UTF-8 using iconv like so:
iconv -f UTF-16BE -t UTF-8 myfile.txt
The resulting output, however, has the UTF-8 byte order mark (0xEF 0xBB 0xBF) and that is not what I need. Is there a way to tell iconv (or is there an equivalent encoding) to not put a BOM in the UTF-8 result?
Experiment shows that specifying UTF-16 rather than UTF-16BE does what you want: with a byte-order-specific encoding such as UTF-16BE, iconv treats the leading U+FEFF as an ordinary character and converts it, whereas with plain UTF-16 it consumes the BOM to detect the byte order and drops it from the output:
iconv -f UTF-16 -t UTF-8 myfile.txt
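A quick way to verify that the BOM is gone (a sketch; the sample file is just the BOM plus the letter a in UTF-16BE, built with bash's printf):
$ printf '\xfe\xff\x00a' > myfile.txt
$ iconv -f UTF-16 -t UTF-8 myfile.txt | od -An -tx1
 61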