iconv does not completely convert to UTF-8 - Linux

When I converted my text on this site, it was converted correctly:
http://string-functions.com/encodedecode.aspx
I chose source 'Windows-1252' and target 'utf-8'.
See the screenshot below:
https://i.stack.imgur.com/2Pn4E.png
But when I convert with the following command, some letters are not converted and the text is garbled.
iconv -c -f UTF-8 -t WINDOWS-1252 < mytext.txt > fixed_mytext.txt
A phrase that should be converted:
آموزش Ùˆ نرم اÙزارهای تعمیر مانیتور
If the conversion were correct, the phrase would be:
آموزش و نرم افزارهای تعمیر مانیتور
Please help me. Thank you.
My original text:
http://www.todaymagazine.ir/forum.txt

The original text was in UTF-8. It got mistakenly interpreted as a text in Windows-1252 and converted from Windows-1252 to UTF-8. This should have never been done. To undo the damage we need to convert the file from UTF-8 to Windows-1252, and then just treat it as a UTF-8 file.
There's a problem, however. The letter ف is encoded in UTF-8 as 0xd9 0x81, and the byte 0x81 is not part of Windows-1252.
Luckily, when the first erroneous conversion was made, the character was not lost or replaced with a question mark: the byte 0x81 was carried over as the C1 control character U+0081, which UTF-8 encodes as 0xc2 0x81.
The byte 0xd9 is in Windows-1252; it's the letter Ù, which in UTF-8 is 0xc3 0x99. So the final byte sequence for ف in the converted file is 0xc3 0x99 0xc2 0x81.
We can replace that sequence with something ASCII-friendly using a sed script, perform the inverse conversion, and then replace it back with ف:
LANG=C sed $'s/\xc3\x99\xc2\x81/===FE===/g' forum.txt | \
iconv -f utf8 -t cp1252 | \
sed $'s/===FE===/\xd9\x81/g'
The result is the original file encoded in UTF-8.
(make sure that ===FE=== is not used in the text first!)
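The same undo-the-mojibake round trip can be sketched in Python (my own illustration, not part of the original answer). It works directly for characters such as و, whose UTF-8 bytes both exist in Windows-1252, and it shows why ف needs the sed workaround:

```python
# "و" (U+0648) is 0xd9 0x88 in UTF-8; both bytes exist in Windows-1252,
# so the erroneous conversion and its inverse round-trip cleanly.
s = "و"
mojibake = s.encode("utf-8").decode("cp1252")         # the damage: "Ùˆ"
restored = mojibake.encode("cp1252").decode("utf-8")  # the undo
assert mojibake == "Ùˆ"
assert restored == s

# "ف" (U+0641) is 0xd9 0x81, and 0x81 is undefined in Windows-1252,
# so a strict inverse conversion fails -- hence the sed workaround above.
try:
    "ف".encode("utf-8").decode("cp1252")
    raise AssertionError("expected a decode error")
except UnicodeDecodeError:
    pass  # expected: 0x81 maps to <undefined> in cp1252
```

(Note that Python's cp1252 codec is strict about the undefined byte 0x81, while the web converter above evidently mapped it to U+0081 instead of dropping it, which is what made the damage reversible.)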

Related

iconv command is not changing the encoding of a plain text file to another encoding

In Linux I created a plain text file. Using "file -i" I see the file encoding is "us-ascii". After trying the commands below, the output file encoding is still shown as "us-ascii". Could you please tell me how to change the encoding? Or is there any way to download some encoded file which I can't read?
iconv -f US-ASCII -t ISO88592//TRANSLIT -o o.txt ip.txt
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT -o op.txt ip.txt
I am expecting either iconv change the encoding or I can download some encoded file.
If your file contains only ASCII characters, then there's no difference between the ASCII, UTF-8 and the various ISO-8859-x encodings. So after conversion, you will end up with exactly the same file.
A text file does not store any information about what encoding was used. Therefore, the "file" command applies a few heuristics, but at the end of the day it's just a guess. And as the files are identical, the result will always be the same.
To see a difference, you must use characters that are encoded differently in the different encodings, or that are not available at all in one of them, e.g. ă, € or 😊.
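A quick way to convince yourself of this (my own illustration): ASCII-only text produces byte-identical output in all of these encodings, while a non-ASCII character such as € does not:

```python
s = "plain ascii text"
# All three encodings produce the same bytes for ASCII-only text,
# so converting between them is a no-op.
assert s.encode("ascii") == s.encode("utf-8") == s.encode("iso-8859-2")

# A character outside ASCII is encoded differently in each:
assert "€".encode("utf-8") == b"\xe2\x82\xac"   # three bytes in UTF-8
assert "€".encode("iso-8859-15") == b"\xa4"     # one byte in ISO-8859-15
```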

Convert files between UTF-8 and ISO-8859 on Linux

Every time I get confronted with Unicode, nothing works. I'm on Linux, and I got these files from Windows:
$file *
file1: UTF-8 Unicode text
file2: ISO-8859 text
file3: ISO-8859 text
Nothing was working until I found out that the files have different encodings. I want to make my life easy and have them all in the same format:
iconv -f UTF-8 -t ISO-8859 file1 > test
iconv: conversion to `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
I tried converting to ISO because that would be only one conversion, and when I open those ISO files in gedit, the German letter "ü" is displayed just fine. Okay, next try:
iconv -f ISO-8859 -t UTF-8 file2 > test
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
but obviously that didn't work.
The ISO-8859-x encodings (Latin-1 and its relatives) only contain a very limited set of characters; you should always prefer converting to UTF-8 to make life easier.
UTF-8 (Unicode) is a superset of every ISO-8859 variant, so it should not be surprising that you cannot convert arbitrary UTF-8 to ISO-8859.
The "file" command only gives very limited information about a file's encoding: "ISO-8859" without a number is not a real encoding name, which is why iconv rejects it.
You could try to guess the source encoding: either ISO-8859-1 or ISO-8859-15, or one of the others from 2 to 14, as suggested in the comment by hobbs.
You can get the list of encodings supported by iconv with iconv -l.
If you have no luck guessing the real file encoding, a silly brute-force script might help you out :D
As in other answers, you can list the supported encodings:
iconv -l | grep 8859
A grep will save you time finding which versions of your encoding are supported. You can grep for the number, as in my example, or for ISO or any string you expect in the encoding name.
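One possible shape for such a brute-force guessing script (my own sketch; the original script was not shown) is to run the file through every candidate ISO-8859 table and let a human eyeball which decoding looks right. Note that some tables, such as ISO-8859-1, map all 256 byte values, so iconv will never fail on them; you have to inspect the output, not just the exit status:

```shell
#!/bin/sh
# Sample input: "grün" with "ü" stored as the single byte 0xFC (ISO-8859-1).
printf 'gr\374n\n' > /tmp/guess_input.txt

# Try a handful of candidate tables and print each decoding for inspection.
for n in 1 2 5 7 15; do
  echo "== ISO-8859-$n =="
  iconv -f "ISO-8859-$n" -t UTF-8 /tmp/guess_input.txt 2>/dev/null || echo "(conversion failed)"
done
```

Here ISO-8859-1, -2 and -15 all render the byte 0xFC as "ü", while the Cyrillic and Greek tables show something clearly wrong, which is exactly the kind of judgement call the script leaves to you.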

How to convert a file from ASCII to UTF-8?

I'm trying to transcode a bunch of files from ASCII to UTF-8.
For that, I tried using iconv:
iconv -f US-ASCII -t UTF-8 infile > outfile
-f ENCODING the encoding of the input
-t ENCODING the encoding of the output
Still, the file didn't convert to UTF-8. It is a .dat file.
Before posting this, I searched Google and found information like:
ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them.
Force encode from US-ASCII to UTF-8 (iconv)
Best way to convert text files between character sets?
Still, the above links didn't help.
Even though the file is ASCII, it is valid UTF-8, since UTF-8 is a superset; however, the other party who is going to receive the files from me requires the file encoding to be UTF-8. They just need the file format to be UTF-8.
Any suggestions, please?
I'm a little confused by the question, because, as you indicated, ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded.
If you're sending files containing only ASCII characters to the other party, but the other party is complaining that they're not 'UTF-8 Encoded', then I would guess that they're referring to the fact that the ASCII file has no byte order mark explicitly indicating the contents are UTF-8.
If that is indeed the case, then you can add a byte order mark using the answer here:
iconv: Converting from Windows ANSI to UTF-8 with BOM
If the other party indicates that he does not need the 'BOM' (Byte Order Mark), but is still complaining that the files are not UTF-8, then another possibility is that your initial file is not actually ASCII, but rather contains characters that are encoded using ANSI or ISO-8859-1.
Edited to add the following experiment, after a comment from Ram about the other party checking the type using the 'file' command:
Tims-MacBook-Pro:~ tjohns$ echo 'Stuff' > deleteme
Tims-MacBook-Pro:~ tjohns$ cat deleteme
Stuff
Tims-MacBook-Pro:~ tjohns$ file -I deleteme
deleteme: text/plain; charset=us-ascii
Tims-MacBook-Pro:~ tjohns$ echo -ne '\xEF\xBB\xBF' > deleteme
Tims-MacBook-Pro:~ tjohns$ echo 'Stuff' >> deleteme
Tims-MacBook-Pro:~ tjohns$ cat deleteme
Stuff
Tims-MacBook-Pro:~ tjohns$ file -I deleteme
deleteme: text/plain; charset=utf-8
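The transcript above prepends the UTF-8 BOM by hand with echo -ne; the same thing can be sketched in Python (my own illustration, reusing the hypothetical filename deleteme):

```python
import codecs

data = "Stuff\n".encode("utf-8")

# Prepend the UTF-8 BOM (0xEF 0xBB 0xBF) so tools like `file -I`
# report charset=utf-8 instead of charset=us-ascii.
with open("deleteme", "wb") as f:
    f.write(codecs.BOM_UTF8 + data)

# Reading with the "utf-8-sig" codec strips the BOM transparently.
with open("deleteme", "r", encoding="utf-8-sig") as f:
    assert f.read() == "Stuff\n"
```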

How to avoid iconv error to halt my program

I have a UTF-8 document to convert to Big5 encoding using iconv with the command below:
iconv -f utf-8 -t big5 $inputFile -o $outputFile
However, some UTF-8 character sequences in the document are incomplete, because I imposed a byte-size limit on each line (e.g. 40 bytes per line), so some multi-byte UTF-8 characters get cut in half.
Because these incomplete UTF-8 sequences cannot be mapped to Big5, iconv reports an error and stops.
Is there any way to stop iconv from halting, skip the incomplete UTF-8 sequences, and continue converting the rest of the document to Big5?
I'm not sure that is what you are looking for, but, to quote man iconv:
DESCRIPTION
The iconv program converts the encoding of characters in inputfile, or
from the standard input if no filename is specified, from one coded
character set to another.
OPTIONS
-c Omit invalid characters from output.
[...]
The man page is not really clear, but when you use that option, characters in the source file that are invalid in the source encoding are discarded.
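The drop-the-broken-tail behaviour of -c can be simulated in Python (my own illustration of the idea, not the original answer's code), using a line cut in the middle of a multi-byte character:

```python
# "文" is 3 bytes in UTF-8 (0xe6 0x96 0x87). Cutting a line at a fixed
# byte limit can leave a trailing fragment of such a character.
line = "文A文".encode("utf-8")[:5]  # keeps 文, "A", and 1 of the 3 bytes of the last 文

# Strict decoding fails, just like iconv without -c:
try:
    line.decode("utf-8")
    raise AssertionError("expected a decode error")
except UnicodeDecodeError:
    pass

# errors="ignore" behaves like iconv -c: the broken fragment is dropped,
# and the remainder converts to Big5 without trouble.
cleaned = line.decode("utf-8", errors="ignore")
assert cleaned == "文A"
assert cleaned.encode("big5").decode("big5") == "文A"
```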

Character encoding problems

I attempted to convert a file I wrote in Vim to UTF-8. Vim defaulted the encoding to us-ascii. I ran this command: recode UTF-8 [filename]. It reported no errors, but when I run file -i [filename] it still says the encoding is us-ascii. Is this a known error or the expected result? Thanks in advance :-)
I have to say that if your file contains only ASCII characters, there is no difference in the final file between the ASCII encoding and the UTF-8 encoding, because for ASCII characters the UTF-8 encoding is exactly the same as the ASCII encoding.
But if your file contains some non-ASCII character, you will see the difference.
Your 'fileencodings' setting in Vim may list "ascii" before "utf8"; that is the list of encodings Vim tries when detecting a file's encoding. So if the file can be read as "ascii", utf8 will not be tried any more, although utf8 would also be correct.
