Confirming the encoding of a file - linux

I am outputting a file from SSIS in UTF-8 Encoding.
This file is passed to a third party for import into their system.
They are having a problem importing this file. Although they requested UTF-8 encoding, it seems they convert the encoding to ISO-8859-1. They use this command to convert the file's encoding:
iconv -f UTF-8 -t ISO-8859-1 dweyr.inp
They are receiving this error
illegal input sequence at position 11
The piece of text causing the issue is:
ark O’Dwy
I think it's the apostrophe, or whatever version of an apostrophe is used in this text.
The problem I face is that every text editor I try tells me the file is UTF-8 and renders it correctly.
The vendor is saying that this character is not UTF-8.
How can I confirm who is correct?

The error message from iconv is a bit misleading, but kind of correct.
It doesn't tell you that the input isn't valid UTF-8, but that it cannot be converted to ISO-8859-1 losslessly: ISO-8859-1 has no way to encode the ’ character.
Verify that by executing this command:
echo "ark O’Dwy" | iconv -f UTF-8 -t UTF-7
This produces output that looks like "ark O+IBk-Dwy".
Here I'm outputting to UTF-7 (a very rarely used encoding that is useful for demonstration here, but little else).
In other words: the encoding is only "illegal" in the sense that it cannot be converted to ISO-8859-1, but it's a perfectly valid UTF-8 sequence.
If the third party claims to support UTF-8, then they may do so only very superficially. They might support any text that can be encoded in ISO-8859-1 as long as it's encoded in UTF-8 (which is an extremely low level of "UTF-8 support").
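You can check the same thing from Python. A minimal sketch (the sample string mirrors the one in the question): U+2019, the curly apostrophe in "O’Dwy", round-trips through UTF-8 without issue, but has no ISO-8859-1 encoding at all.

```python
# U+2019 (right single quotation mark) is the curly apostrophe in "O’Dwy".
s = "ark O\u2019Dwy"

# It is perfectly valid UTF-8: encoding and decoding round-trips.
utf8_bytes = s.encode("utf-8")
assert utf8_bytes.decode("utf-8") == s

# But ISO-8859-1 simply has no code for this character.
try:
    s.encode("iso-8859-1")
except UnicodeEncodeError as e:
    print(e.reason)  # ordinal not in range(256)
```

This matches the iconv behaviour: the input is fine, the target encoding is the limitation.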

Related

Detect encoding in Python3 when BOM is present

For starters, I know that you can't always detect encoding; that's not the object of the question. Can Python detect the encoding when opening a file, when there's a BOM giving the encoding?
I have a collection of files. Some are in UTF-16 (so they have a BOM saying so) and some are in Latin-1. Let's ignore the edge case of Latin-1 files that for some reason start with the exact same bytes as the UTF-16 BOM. When opening a file, I want to look for a BOM. If there is one, open the file using the encoding associated with the BOM; if there isn't, open it with Latin-1.
Here is my workaround:
with open(filename, mode="rb") as f:
    text = f.readlines()
    text = b''.join(text)
if text[:2] == b'\xff\xfe':
    text = text.decode("utf-16")
else:
    text = text.decode("iso-8859-1")
Is there some kind of module that replaces "open" to do that, with all encodings that have a BOM ?
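There isn't a drop-in replacement for open in the standard library that checks every BOM, but the workaround above generalizes easily using the BOM constants in the codecs module. A sketch (the helper name detect_encoding is made up here; note the 4-byte UTF-32 BOMs must be checked before the 2-byte UTF-16 ones, because the UTF-32-LE BOM starts with the UTF-16-LE BOM):

```python
import codecs

# Longest BOMs first: BOM_UTF32_LE begins with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def detect_encoding(raw: bytes, default: str = "iso-8859-1") -> str:
    """Return the encoding implied by a leading BOM, or `default`."""
    for bom, enc in _BOMS:
        if raw.startswith(bom):
            return enc
    return default
```

The "utf-16" and "utf-32" codecs consume the BOM themselves during decoding, and "utf-8-sig" does the same for a UTF-8 BOM, so the decoded text comes out clean.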

A Utf8 encoded file produces UnicodeDecodeError during parsing

I'm trying to reformat a text file so I can upload it to a pipeline (QIIME2) - I tested the first few lines of my .txt file (but it is tab separated), and the conversion was successful. However, when I try to run the script on the whole file, I encounter an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte
I have identified that the file encoding is UTF-8, so I am not sure where the problem is arising from.
$ file filename.txt
filename.txt: UTF-8 Unicode text, with very long lines, with CRLF line terminators
I have also reviewed some of the lines that are associated with the error, and I am not able to visually identify any unorthodox characters.
I have tried to force-encode it using:
$ iconv -f UTF8 -t UTF8 filename.txt > new_file.txt
However, the error produced is:
iconv: illegal input sequence at position 152683
How I'm understanding this is that whatever character occurs at the position is not readable/translatable using utf-8 encoding, but I am not sure then why the file is said to be encoded in utf-8.
I am running this on Linux, and the data itself are sequence information from the BOLD database (if anyone else has run into similar problems when trying to convert this into a format appropriate for QIIME2).
file is wrong. The file command doesn't read the entire file. It bases its guess on some sample of the file. I don't have a source ref for this, but file is so fast on huge files that there's no other explanation.
I'm guessing that your file is actually UTF-8 in the beginning, because UTF-8 has characteristic byte sequences. It's quite unlikely that a piece of text only looks like UTF-8 but isn't actually.
But the part of the text containing the byte 0x96 cannot be UTF-8. It's likely that some text was encoded with an 8-bit encoding like CP1252, and then concatenated to the UTF-8 text. This is something that shouldn't happen, because now you have multiple encodings in a single file. Such a file is broken with respect to text encoding.
This is all just guessing, but in my experience, this is the most likely explanation for the scenario you described.
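You can check this guess quickly in Python (a sketch; the sample string is made up). In CP1252, byte 0x96 is the en dash, and in UTF-8 it is an invalid start byte, which is exactly the error you got:

```python
# In CP1252, 0x96 encodes the en dash "–"; in UTF-8 it is an invalid start byte.
data = "range 1\u201310".encode("cp1252")
assert b"\x96" in data

try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason, "at byte", e.start)  # invalid start byte at byte 7

# Decoding with the 8-bit encoding recovers the text.
assert data.decode("cp1252") == "range 1\u201310"
```

If your file really is a mix of encodings, no single decode call will work on the whole thing; you have to find and fix the foreign portion.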
For text with broken encoding, you can use the third-party Python library ftfy ("fixes text for you").
It will cut your text at every newline character and try to find (guess) the right encoding for each portion.
It doesn't magically do the right thing always, but it's pretty good.
To give you more detailed guidance, you'd have to show the code of the script you're calling (if it's your code and you want to fix it).

How do I get from Ã‰phÃ©mÃ¨re to Éphémère in Python3?

I've tried all kinds of combinations of encode/decode with options 'surrogatepass' and 'surrogateescape' to no avail. I'm not sure what format this is in (it might even be a bug in Autoit), but I know for a fact the information is in there because at least one online utf decoder got it right. On the online converter website, I specified the file as utf8 and the output as utf16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
Ã‰ at the place of É
Ã© at the place of é
Ã¨ at the place of è
The following table summarises what's going on:

    character   UTF-8 bytes (hex)   bytes decoded as Windows-1252
    É           C3 89               Ã‰
    é           C3 A9               Ã©
    è           C3 A8               Ã¨
All of É, é, and è are non-ASCII characters, and they are encoded with UTF-8 to 2-byte long codes.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character à (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings here.
So, that's why your decoding renders Ã‰ instead of É. Idem for the other non-ASCII characters é and è.
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')
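The whole round trip, garbling and repair, can be sketched in a few lines (assuming the mojibake was produced exactly as described, by decoding UTF-8 bytes as Windows-1252):

```python
original = "Éphémère"

# Garble: encode as UTF-8, then (wrongly) decode as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # Ã‰phÃ©mÃ¨re

# Repair: reverse the wrong step, then decode correctly.
repaired = garbled.encode("cp1252").decode("utf-8")
assert repaired == original
```

The repair only works when every garbled byte still maps to some Windows-1252 character; a few bytes (such as 0x81 or 0x8D) are unassigned in CP1252, and text containing them cannot be round-tripped this way.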

How to substitute cp1250 specific characters to utf-8 in Vim

I have some central european characters in cp1250 encoding in Vim. When I change encoding with set encoding=utf-8 they appear like <d0> and such. How can I substitute over the entire file those characters for what they should be, i.e. Đ, in this case?
As sidyll said, you should really use iconv for this purpose. Iconv knows stuff. It knows all the hairy encodings, obscure code points, katakana, denormalized and canonical forms, compositions, nonspacing characters and the rest.
:%!iconv --from-code cp1250 --to-code utf-8
or shorter
:%!iconv -f cp1250 -t utf-8
to filter the whole buffer. If you do
:he xxd
you'll see a sample of how to automatically convert a buffer on load/save, if you want that.
iconv -l will list you all (many: 1168 on my system) encodings it accepts/knows about.
Happy hacking!
The iconv() function may be useful:
iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.
The encoding names are whatever the iconv() library function
can accept, see ":!man 3 iconv".
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use:
echo iconv(utf8_str, "utf-8", &enc)
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the +multi_byte feature}
You can set 'encoding' to the value of your file's encoding and 'termencoding' to UTF-8. See the Vim mbyte documentation.

unzip utf8 to ascii

I took some files from Linux hosting to my Windows machine via FTP,
and when I check the file encodings they are UTF-8 without BOM.
Now I need to convert those files back to ASCII and send them to my other Linux server.
I zipped the files.
Can I do something like:
unzip, and if it's a text file in UTF-8 format, convert it to ASCII?
I want to make the conversion while I am unzipping the files.
Thanks!
The program you're looking for is iconv; it will convert between encodings. Use it like this:
iconv -f utf-8 -t ascii < infile > outfile
However: ASCII is a subset of UTF-8. That is, a file that's written in ASCII is also correct UTF-8 --- no conversion is needed. The only reason to convert the other way is if there are characters in your UTF-8 file that are outside the ASCII range. And in that case you can't convert it to ASCII, because ASCII doesn't have those characters!
Are you sure you mean ASCII? Pure ASCII is rare these days. ISO-8859-15 (Western European) or CP1252 (Windows) are much more common.
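The subset relationship is easy to verify in Python (a sketch):

```python
# Pure ASCII text: the ASCII and UTF-8 encodings are byte-identical.
text = "plain ascii"
assert text.encode("ascii") == text.encode("utf-8")

# Text outside the ASCII range cannot be encoded as ASCII at all...
try:
    "café".encode("ascii")
except UnicodeEncodeError as e:
    print(e)
# ...unless you explicitly drop or substitute the offending characters:
print("café".encode("ascii", errors="replace"))  # b'caf?'
```

The errors="replace" handler is the Python analogue of iconv's lossy options: it produces valid ASCII, but only by destroying the non-ASCII characters.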
