Using iconv to convert from UTF-16BE to UTF-8 without BOM

I'm trying to convert a UTF-16BE encoded file (byte order mark: 0xFE 0xFF) to UTF-8 using iconv like so:
iconv -f UTF-16BE -t UTF-8 myfile.txt
The resulting output, however, has the UTF-8 byte order mark (0xEF 0xBB 0xBF) and that is not what I need. Is there a way to tell iconv (or is there an equivalent encoding) to not put a BOM in the UTF-8 result?

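Experiment shows that specifying UTF-16 rather than UTF-16BE does what you want. With plain UTF-16, iconv reads the leading BOM to determine the byte order and consumes it, whereas with UTF-16BE it decodes the BOM as an ordinary U+FEFF character and re-encodes it in the output: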
iconv -f UTF-16 -t UTF-8 myfile.txt
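A quick way to check the difference, sketched with a hand-built input rather than the asker's file. The sample bytes are an assumption: a big-endian BOM followed by the letter A.

```shell
# Assumed sample input: BOM FE FF, then "A" (00 41) in UTF-16BE.
# Octal escapes are used because POSIX printf does not guarantee \x.
sample() { printf '\376\377\000A'; }

# -f UTF-16BE: the BOM is decoded as U+FEFF and carried into the output.
with_be=$(sample | iconv -f UTF-16BE -t UTF-8 | od -An -tx1 | tr -d ' \n')

# -f UTF-16: the BOM only selects the byte order and is dropped.
with_generic=$(sample | iconv -f UTF-16 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "UTF-16BE: $with_be"
echo "UTF-16:   $with_generic"
```

If the first line starts with efbbbf and the second does not, your iconv behaves as described in the question.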

Related

Inserting ',' into certain position of a text containing full-width characters

Inserting a "," in a particular position of a text
Following the question above, I got errors because my text contained some full-width characters.
I deal with some Japanese text data on a RHEL server. The answer to the question above was a perfect solution for utf-8 text, but the command won't work for Japanese text in SJIS format.
The difference between these two is that utf-8 counts every character as 1 byte and SJIS counts letters and digits as 1 byte and other Japanese characters, such as あ, as 2 bytes. So the sed command only works for utf-8 when inserting ',' at given positions.
My input would be like
aaaああ123あ
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes so my desired outcome is
aaa,ああ,123,あ
It doesn't have to be sed, as long as it works on a UNIX system.
Is there any way to insert ',' after a given number of bytes while counting each full-width character as 2 bytes and every other character as 1 byte?
あ is 3 bytes in UTF-8
Depending on the locale, GNU sed is multibyte-aware, so a dot matches a whole character rather than a single byte. Reset the locale to C before running sed and it will work on bytes.
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes
Just use a backreference to remember the bytes.
LC_ALL=C sed 's/^\(...\)\(....\)\(...\)/\1,\2,\3,/'
or you could specify numbers:
LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/'
And cleaner with extended regex extension:
LC_ALL=C sed -E 's/^(.{3})(.{4})(.{3})/\1,\2,\3,/'
The following seems to work in my terminal:
$ <<<'aaaああ123あ' iconv -f UTF-8 -t SHIFT-JIS | LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/' | iconv -f SHIFT-JIS -t UTF-8
aaa,ああ,123,あ
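To see why the byte offsets 3, 4 and 3 only line up once the text is in SJIS, here is a small sketch that counts the bytes of あ in both encodings. The character is written as explicit bytes so the example does not depend on how this snippet itself is encoded.

```shell
# あ as explicit UTF-8 bytes: E3 81 82 (octal 343 201 202).
a_utf8() { printf '\343\201\202'; }

# wc -c counts bytes; tr strips the padding some wc implementations add.
utf8_len=$(a_utf8 | wc -c | tr -d ' ')
sjis_len=$(a_utf8 | iconv -f UTF-8 -t SHIFT-JIS | wc -c | tr -d ' ')

echo "UTF-8 bytes: $utf8_len"
echo "SJIS bytes:  $sjis_len"
```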

Print/Replace ALT code in Unix

I am creating a report as a pipe-separated text file using the Oracle application framework on a Unix file server. The file is encoded in iso-8859-1, but I need to send it downstream in UTF-8 (which I cannot generate from the Oracle framework), so I am converting it to UTF-8 using the command below:
iconv -f iso-8859-1 -t UTF-8//TRANSLIT $i -o $i
But there is also a requirement to replace the "|" separator with the inverted exclamation mark character "¡".
So how can I find the "|" character and replace it with "¡" in Unix?
The INVERTED EXCLAMATION MARK is Unicode U+00A1 and is a member of the ISO-8859-1 charset with code 0xA1, or 0241 in octal. Since you know that your input file is ISO-8859-1 encoded, you can replace the pipe with a simple tr command:
tr '|' '\241' < infile > outfile
You can then use iconv to convert outfile from ISO-8859-1 to utf8.
Demo (on an ISO-8859-1 terminal):
$ echo 'a|b' | tr '|' '\241'
a¡b
$
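Both steps can also be chained in one pipeline. This is a sketch on a made-up one-line input, with od added only to make the resulting bytes visible. LC_ALL=C is an extra precaution, not part of the answer above, so that tr treats \241 as the single byte 0xA1 whatever the terminal's locale is.

```shell
# Replace the pipe while the data is still ISO-8859-1, then convert to UTF-8.
result=$(printf 'a|b\n' \
    | LC_ALL=C tr '|' '\241' \
    | iconv -f ISO-8859-1 -t UTF-8 \
    | od -An -tx1 | tr -d ' \n')

# In UTF-8, U+00A1 is the two-byte sequence c2 a1.
echo "$result"
```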

Linux: iconv ASCII text to UTF-8

I have a file which contains the letter ö. Except that it doesn't. When I open the file in gedit, I see:
\u00f6
I tried to convert the file, applying code that I found on other threads:
$ file blb.txt
blb.txt: ASCII text
$ iconv -f ISO-8859-15 -t UTF-8 blb.txt > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: ASCII text
What am I missing?
EDIT
I found this solution:
echo -e "$(cat blb.txt)" > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: UTF-8 Unicode text
The -e "enables interpretation of backslash escapes".
Still not sure why iconv didn't make it happen. I'm guessing it's something like "iconv only changes the encoding, it doesn't interpret". Not sure yet, what the difference is though. Why did the Unicode people make this world such a mess? :D
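The difference can be seen in a small sketch: the file does not contain the character ö in any encoding, it contains six plain ASCII characters, so iconv has nothing to re-encode. The variable names are made up for illustration.

```shell
# The file holds six ASCII characters: backslash, u, 0, 0, f, 6.
literal='\u00f6'

# iconv maps bytes from one encoding to another; ASCII bytes are the
# same in ISO-8859-15 and UTF-8, so the text comes out unchanged.
converted=$(printf '%s' "$literal" | iconv -f ISO-8859-15 -t UTF-8)
byte_count=$(printf '%s' "$literal" | wc -c | tr -d ' ')

printf 'before: %s (%s bytes)\n' "$literal" "$byte_count"
printf 'after:  %s\n' "$converted"
```

Turning the escape sequence into a real character is a separate, interpreting step, which is what echo -e did in the solution above.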

Linux script to automatically convert file type to UTF8

I am in a tight spot and could use some help coming up with a Linux shell script to convert a directory full of pipe-delimited files from their original file encoding to UTF-8. The source files are encoded in either US-ASCII or ISO-8859-1. The closest thing that I could come up with is:
iconv -f ISO8859-1 -t utf-8 * > name_of_utf8_file
This condenses all of the files into a single file, which is not needed but OK for this application. The problem is that I need to specify both the source and destination file encoding, so for half of the files I don't know what it does. Is there a way to write a shell script using commands like file -i or the like?
Any advice here is much appreciated.
This is, (not properly tested, caveat emptor :)), one way of doing it:
Maybe try w/ a small subset first - this is more of a thought example than a turn-key solution.
for i in *
do
    # file -i prints the MIME type and charset, e.g. "charset=us-ascii"
    if file -i "$i" | grep -q us-ascii; then
        iconv -f us-ascii -t utf-8 "$i" > "${i}.utf8"
    elif file -i "$i" | grep -q iso-8859-1; then
        iconv -f iso8859-1 -t utf-8 "$i" > "${i}.utf8"
    fi
done
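A variation on the same idea, as a sketch only: it asks file for just the encoding name and branches on it with case. The helper name to_utf8 is hypothetical, and it assumes a file command that supports -b and --mime-encoding.

```shell
# Hypothetical helper: write a UTF-8 copy next to the original file.
# Assumes a `file` that supports -b and --mime-encoding.
to_utf8() {
    src=$1
    enc=$(file -b --mime-encoding "$src")
    case $enc in
        us-ascii|utf-8)
            cp "$src" "$src.utf8" ;;    # ASCII is already valid UTF-8
        iso-8859-1)
            iconv -f ISO-8859-1 -t UTF-8 "$src" > "$src.utf8" ;;
        *)
            echo "skipping $src: unhandled encoding $enc" >&2
            return 1 ;;
    esac
}
```

It could then be called once per file from a loop like the one above, for example to_utf8 "$i". Files with an unexpected encoding are reported and skipped rather than converted blindly.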

Convert UTF8 to UTF16 using iconv

When I use iconv to convert from UTF16 to UTF8 then all is fine but vice versa it does not work.
I have these files:
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The text looks OK in an editor. When I run this:
iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings
Then I get this result:
b-16.strings: data
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The file utility does not show the expected file format, and the text does not look right in an editor either. Could it be that iconv does not create a proper BOM? I am running it on the Mac command line.
Why isn't b-16.strings in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?
More detail is below.
$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings
$ file *s
a-16.strings: Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings: UTF-8 Unicode c program text, with very long lines
b-16be.strings: Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings: data
$ od -c a-16.strings | head
0000000 377 376 / \0 * \0 \0 \f 001 E \0 S \0 K \0
$ od -c a-8.strings | head
0000000 / * * * Č ** E S K Y ( J V O
$ od -c b-16be.strings | head
0000000 376 377 \0 / \0 * \0 * \0 * \0 001 \f \0 E
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
It is clear the BOM is missing whenever I run conversion to UTF-16LE.
Any help on this?
UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.
UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.
I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
Try running od -c on the files to see their actual contents.
UPDATE :
It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:
( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE
The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
(Can anyone suggest a more elegant solution?)
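A slightly more portable spelling of the same workaround, as a sketch: POSIX printf guarantees octal escapes but not \x, so the BOM bytes FF FE are written as \377\376. The function name is hypothetical.

```shell
# Hypothetical helper: read UTF-8 on stdin, write UTF-16LE with a BOM.
utf8_to_utf16le_bom() {
    printf '\377\376'            # BOM bytes FF FE, written in octal
    iconv -f UTF-8 -t UTF-16LE   # payload, which iconv emits without a BOM
}

# Check on a one-letter input: BOM first, then "A" as 41 00.
out=$(printf 'A' | utf8_to_utf16le_bom | od -An -tx1 | tr -d ' \n')
echo "$out"
```

With a real file, the call would look like utf8_to_utf16le_bom < UTF-8-FILE > UTF-16-FILE.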
Another workaround, if you know the endianness of the output produced by -t utf-16:
iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
I first convert to UTF-16, which will prepend a byte-order mark, if necessary as Keith Thompson mentions. Then since UTF-16 doesn't define endianness, we must use file to determine whether it's UTF-16BE or UTF-16LE. Finally, we can convert to UTF-16LE.
iconv -f utf-8 -t utf-16 UTF-8-FILE > UTF-16-UNKNOWN-ENDIANNESS-FILE
FILE_ENCODING="$( file --brief --mime-encoding UTF-16-UNKNOWN-ENDIANNESS-FILE )"
iconv -f "$FILE_ENCODING" -t UTF-16LE UTF-16-UNKNOWN-ENDIANNESS-FILE > UTF-16-FILE
This may not be an elegant solution but I found a manual way to ensure correct conversion for my problem which I believe is similar to the subject of this thread.
The Problem:
I got a text datafile from a user and I was going to process it on Linux (specifically, Ubuntu) using shell script (tokenization, splitting, etc.). Let's call the file myfile.txt. The first indication that I got that something was amiss was that the tokenization was not working. So I was not surprised when I ran the file command on myfile.txt and got the following
$ file myfile.txt
myfile.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
If the file had been compliant, this is what the output should have been:
$ file myfile.txt
myfile.txt: ASCII text, with very long lines
The Solution:
To make the datafile compliant, below are the 3 manual steps that I found to work after some trial and error with other steps.
1. First convert to big-endian in the same encoding via vi (or vim): run vi myfile.txt, do :set fileencoding=UTF-16BE, then write out the file. You may have to force it with :wq!.
2. Run vi myfile.txt again (the file should now be in UTF-16BE). In vi do :set fileencoding=ASCII, then write out the file. Again, you may have to force the write with :wq!.
3. Run the dos2unix converter: d2u myfile.txt. If you now run file myfile.txt you should see something more familiar and reassuring, like:
myfile.txt: ASCII text, with very long lines
That's it. That's what worked for me, and I was then able to run my processing bash shell script of myfile.txt. I found that I cannot skip Step 2. That is, in this case I cannot skip directly to Step 3. Hopefully you can find this info useful; hopefully someone can automate it perhaps via sed or the like. Cheers.
