When I use iconv to convert from UTF-16 to UTF-8, everything is fine, but the other way around does not work.
I have these files:
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The text looks OK in an editor. When I run this:
iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings
Then I get this result:
b-16.strings: data
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The file utility does not show the expected file format, and the text does not look good in an editor either. Could it be that iconv does not create a proper BOM? I run it on the Mac command line.
Why is b-16.strings not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?
More elaboration is below.
$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings
$ file *s
a-16.strings: Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings: UTF-8 Unicode c program text, with very long lines
b-16be.strings: Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings: data
$ od -c a-16.strings | head
0000000 377 376 / \0 * \0 \0 \f 001 E \0 S \0 K \0
$ od -c a-8.strings | head
0000000 / * * * Č ** E S K Y ( J V O
$ od -c b-16be.strings | head
0000000 376 377 \0 / \0 * \0 * \0 * \0 001 \f \0 E
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
It is clear that the BOM is missing whenever I run a conversion to UTF-16LE.
Any help on this?
UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.
UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.
I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
Try running od -c on the files to see their actual contents.
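For reference, a little-endian BOM is just the two bytes 0xFF 0xFE, which od -c prints in octal as 377 376. You can see the pattern in isolation like this (a quick illustration, not taken from your files):
$ printf '\xff\xfe' | od -c
0000000 377 376
0000002
That matches the first two bytes of a-16.strings above, and it is exactly what is missing from the b-16le files.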
UPDATE:
It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:
( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE
The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
(Can anyone suggest a more elegant solution?)
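As a quick sanity check (UTF-16-FILE is the same placeholder name as above, and the exact wording varies between versions of file), the result should no longer be reported as plain data:
$ file UTF-16-FILE
UTF-16-FILE: Little-endian UTF-16 Unicode text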
Another workaround, if you know the endianness of the output produced by -t utf-16:
iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
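To see what conv=swab does in isolation (just a demonstration of the byte swap, not part of the conversion), feeding it a big-endian BOM 0xFE 0xFF yields the little-endian BOM 0xFF 0xFE:
$ printf '\xfe\xff' | dd conv=swab 2>/dev/null | od -c
0000000 377 376
0000002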
I first convert to UTF-16, which will prepend a byte-order mark if necessary, as Keith Thompson mentions. Then, since UTF-16 doesn't define the endianness, we must use file to determine whether the result is UTF-16BE or UTF-16LE. Finally, we can convert to UTF-16LE:
iconv -f utf-8 -t utf-16 UTF-8-FILE > UTF-16-UNKNOWN-ENDIANNESS-FILE
FILE_ENCODING="$( file --brief --mime-encoding UTF-16-UNKNOWN-ENDIANNESS-FILE )"
iconv -f "$FILE_ENCODING" -t UTF-16LE UTF-16-UNKNOWN-ENDIANNESS-FILE > UTF-16-FILE
This may not be an elegant solution but I found a manual way to ensure correct conversion for my problem which I believe is similar to the subject of this thread.
The Problem:
I got a text datafile from a user and I was going to process it on Linux (specifically, Ubuntu) using a shell script (tokenization, splitting, etc.). Let's call the file myfile.txt. The first indication that something was amiss was that the tokenization was not working. So I was not surprised when I ran the file command on myfile.txt and got the following:
$ file myfile.txt
myfile.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
If the file had been compliant, here is what the output should have been:
$ file myfile.txt
myfile.txt: ASCII text, with very long lines
The Solution:
To make the datafile compliant, below are the 3 manual steps that I found to work after some trial and error with other approaches.
Step 1: Convert to big-endian in the same encoding via vi (or vim): vi myfile.txt. In vi, do :set fileencoding=UTF-16BE, then write out the file. You may have to force it with :wq!.
Step 2: vi myfile.txt (which should now be in UTF-16BE). In vi, do :set fileencoding=ASCII, then write out the file. Again, you may have to force the write with :wq!.
Step 3: Run the dos2unix converter: d2u myfile.txt. If you now run file myfile.txt, you should see output that is more familiar and reassuring, like:
myfile.txt: ASCII text, with very long lines
That's it. That's what worked for me, and I was then able to run my processing bash shell script on myfile.txt. I found that I could not skip Step 2; that is, I could not go directly to Step 3. Hopefully you find this info useful; hopefully someone can automate it, perhaps via sed or the like. Cheers.
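For what it's worth, a non-interactive sketch of the same idea might look like this (untested against the original data; //TRANSLIT approximates characters that have no ASCII equivalent, and you may need -f UTF-16 instead of -f UTF-16LE if the file starts with a BOM):
$ iconv -f UTF-16LE -t ASCII//TRANSLIT myfile.txt > myfile.ascii.txt
$ dos2unix myfile.ascii.txt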
Related
Inserting a "," in a particular position of a text
Following the question above, I got errors because the text contained some full-width characters.
I deal with some Japanese text data on a RHEL server. The question above was a perfect solution for UTF-8 text, but the UNIX command won't work for Japanese text in SJIS format.
The difference between the two is that UTF-8 counts every character as 1 byte, while SJIS counts letters and numbers as 1 byte and other Japanese characters, such as あ, as 2 bytes. So the sed command only works for UTF-8 when inserting ',' at certain positions.
My input would be like
aaaああ123あ
And I would like to insert ',' after 3 bytes, 4 bytes, and 3 bytes, so my desired outcome is:
aaa,ああ,123,あ
It does not necessarily have to be a sed command, as long as it works on a UNIX system.
Is there any way to insert ',' after some number of bytes while counting full-width characters as 2 bytes and others as 1 byte?
あ is 3 bytes in UTF-8
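You can confirm the byte counts from a shell, assuming a UTF-8 terminal and an iconv that knows Shift JIS:
$ printf 'あ' | wc -c
3
$ printf 'あ' | iconv -f UTF-8 -t SHIFT-JIS | wc -c
2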
Depending on the locale, GNU sed supports Unicode. So reset the locale before running the sed commands, and it will work on bytes.
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes
Just use a backreference to remember the bytes.
LC_ALL=C sed 's/^\(...\)\(....\)\(...\)/\1,\2,\3,/'
or you could specify numbers:
LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/'
And it's cleaner with the extended regex syntax:
LC_ALL=C sed -E 's/^(.{3})(.{4})(.{3})/\1,\2,\3,/'
The following seems to work in my terminal:
$ <<<'aaaああ123あ' iconv -f UTF-8 -t SHIFT-JIS | LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/' | iconv -f SHIFT-JIS -t UTF-8
aaa,ああ,123,あ
I have a file which contains the letter ö. Except that it doesn't. When I open the file in gedit, I see:
\u00f6
I tried to convert the file, applying code that I found on other threads:
$ file blb.txt
blb.txt: ASCII text
$ iconv -f ISO-8859-15 -t UTF-8 blb.txt > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: ASCII text
What am I missing?
EDIT
I found this solution:
echo -e "$(cat blb.txt)" > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: UTF-8 Unicode text
The -e "enables interpretation of backslash escapes".
Still not sure why iconv didn't make it happen. I'm guessing it's something like "iconv only changes the encoding; it doesn't interpret". Not sure yet what the difference is, though. Why did the Unicode people make this world such a mess? :D
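In other words, blb.txt never contained the character ö at all; it contains the six ASCII characters \ u 0 0 f 6, which is why file says ASCII and why iconv has nothing to re-encode, while echo -e actually interprets the escape sequence. You can see the raw bytes like this (just a reproduction of that byte sequence, not your actual file):
$ printf '\\u00f6' | od -c
0000000   \   u   0   0   f   6
0000006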
I piped a program's output on the command line into a file and opened it in vim. At the very end of the file is the character ^@. What does this mean?
CTRL-@ (shown by Vim as ^@) is a NUL character, code point zero in the ASCII table.
You can enter it into Vim while in insert mode with CTRL-v CTRL-@, or by using a tool capable of producing a NUL output:
$ printf "\0" >tempfile
and then check it with any hex dump program:
$ od -xcb tempfile
0000000    0000
          \0
         000
0000001
So, obviously, your program is outputting NUL at the end for some reason.
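If you just want to get rid of those NUL byte(s) in the captured file, stripping them with tr is enough (the file names here are only placeholders):
$ tr -d '\000' < output.txt > cleaned.txt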
I'm using linux and when I typed
od -c -N 16 <filename.png>
I got this output:
0000000 211 P N G \r \n 032 \n \0 \0 \0 \r I H D R
0000020
I thought this command would tell me the type of the file, but I'm curious what the numbers 0000000 and 211 mean. Can anybody please help?
od means "octal dump" (analogous to the hex dumper hd). It dumps the bytes of a file in octal notation.
211 octal is 2 × 8² + 1 × 8¹ + 1 × 8⁰ = 128 + 8 + 1 = 137, so you have a byte of value 137 there.
The 0000000 at the beginning of the line and the 0000020 at the beginning of the next are positions in the file, also in octal. If you remove -N 16 from the call, you'll see a column of monotonically ascending octal numbers on the left side of the octal dump; their purpose is to make it instantly visible which part of a dump you're currently reading.
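If you ever want to double-check such an octal value, the shell can convert it for you (bash arithmetic with an explicit base):
$ echo $((8#211))
137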
The parameter
-N 16
means to read only the first 16 bytes of filename.png, and
-c
is a format option that tells od
to print printable characters as characters themselves rather than the octal code, and
to print unprintable characters that have a backslash escape sequence (such as \r or \n) as that escape sequence rather than an octal number.
That is why not all of the bytes are dumped in octal.
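A tiny example of what -c does, assuming bash's printf and a typical Linux od (three bytes: a printable character, one with an escape sequence, and one with neither, which stays octal):
$ printf 'A\n\200' | od -c
0000000   A  \n 200
0000003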
If you want to know the file type of a file, use the file utility:
file filename.png
Side note: You may be interested in the man command, which shows the manual page of (among other things) command line tools. In this particular case,
man od
could have been enlightening.
How do I create an unmodified hex dump of a binary file in Linux using bash? The od and hexdump commands both insert spaces in the dump and this is not ideal.
Is there a way to simply write a long string with all the hex characters, minus spaces or newlines in the output?
xxd -p file
Or if you want it all on a single line:
xxd -p file | tr -d '\n'
Format strings can make hexdump behave exactly as you want it to (no whitespace at all, byte by byte):
hexdump -ve '1/1 "%.2x"'
1/1 means "each format is applied once and takes one byte", and "%.2x" is the actual format string, like in printf. In this case: 2-character hexadecimal number, leading zeros if shorter.
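For example (the output has no trailing newline, so your prompt may end up on the same line):
$ printf 'ABC' | hexdump -ve '1/1 "%.2x"'
414243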
It seems to depend on the details of the version of od. On OSX, use this:
od -t x1 -An file |tr -d '\n '
(That's print as type hex bytes, with no address. And whitespace deleted afterwards, of course.)
Perl one-liner:
perl -e 'local $/; print unpack "H*", <>' file
The other answers are preferable, but for a pure Bash solution, I've modified the script in my answer here to be able to output a continuous stream of hex characters representing the contents of a file. (Its normal mode is to emulate hexdump -C.)
I think this is the most widely supported version (requiring only POSIX defined tr and od behavior):
cat "$file" | od -v -t x1 -A n | tr -d ' \n'
This uses od to print each byte as hex, without addresses and without skipping repeated bytes, and tr to delete all spaces and linefeeds in the output. Note that not even the trailing linefeed is emitted here. (The cat is intentional, to allow multicore processing: cat can wait for the filesystem while od is still processing the previously read part. Single-core users may want to replace it with < "$file" od ... to save starting one additional process.)
tldr;
$ od -t x1 -A n -v <empty.zip | tr -dc '[:xdigit:]' && echo
504b0506000000000000000000000000000000000000
$
Explanation:
Use the od tool to print single hexadecimal bytes (-t x1), without address offsets (-A n) and without eliding repeated "groups" (-v), from empty.zip, which has been redirected to standard input. Pipe that to tr, which deletes (-d) the complement (-c) of the hexadecimal character set ('[:xdigit:]'). You can optionally print a trailing newline (echo), as I've done here, to separate the output from the next shell prompt.
References:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html
This code produces a "pure" hex dump string, and it runs faster than all the other examples given.
It has been tested on 1GB files filled with binary zeros, and all linefeeds.
It is not data content dependent and reads 1MB records instead of lines.
perl -pe 'BEGIN{$/=\1e6} $_=unpack "H*"'
Dozens of timing tests show that for 1GB files, these other methods below are slower.
All tests were run writing output to a file which was then verified by checksum.
Three 1GB input files were tested: all bytes, all binary zeros, and all LFs.
hexdump -ve '1/1 "%.2x"' # ~10x slower
od -v -t x1 -An | tr -d "\n " # ~15x slower
xxd -p | tr -d \\n # ~3x slower
perl -e 'local $/; print unpack "H*", <>'    # ~1.5x slower (also slurps the whole file into memory)
To reverse the process:
perl -pe 'BEGIN{$/=\1e6} $_=pack "H*",$_'
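A quick round trip to sanity-check both directions (no trailing newline is printed, so your prompt may end up on the same line):
$ printf 'ABC' | perl -pe 'BEGIN{$/=\1e6} $_=unpack "H*"'
414243
$ printf '414243' | perl -pe 'BEGIN{$/=\1e6} $_=pack "H*",$_'
ABC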
You can use Python for this purpose:
python -c "print(open('file.bin','rb').read().hex())"
...where file.bin is your filename.
Explanation:
Open file.bin in rb (read binary) mode.
Read contents (returned as bytes object).
Use the bytes method .hex(), which returns a hex dump without spaces or newlines.
Print output.
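For the reverse direction, a similar one-liner should work (out.bin and dump.txt are placeholder names; .strip() removes the trailing newline before bytes.fromhex parses the dump):
python -c "open('out.bin','wb').write(bytes.fromhex(open('dump.txt').read().strip()))"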