A Utf8 encoded file produces UnicodeDecodeError during parsing

A Utf8 encoded file produces UnicodeDecodeError during parsing - linux

I'm trying to reformat a text file so I can upload it to a pipeline (QIIME2) - I tested the first few lines of my .txt file (but it is tab separated), and the conversion was successful. However, when I try to run the script on the whole file, I encounter an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte
I have identified that the file encoding is Utf8, so I am not sure where the problem is arising from.
$ file filename.txt
filename: UTF-8 Unicode text, with very long lines, with CRLF line terminator
I have also reviewed some of the lines that are associated with the error, and I am not able to visually identify any unorthodox characters.
I have tried to force encode it using:
$iconv -f UTF8 -t UTF8 filename.txt > new_file.txt
However, the error produced is:
iconv: illegal input sequence at position 152683
How I'm understanding this is that whatever character occurs at the position is not readable/translatable using utf-8 encoding, but I am not sure then why the file is said to be encoded in utf-8.
I am running this on Linux, and the data itself are sequence information from the BOLD database (if anyone else has run into similar problems when trying to convert this into a format appropriate for QIIME2).

file is wrong. The file command doesn't read the entire file. It bases its guess on some sample of the file. I don't have a source ref for this, but file is so fast on huge files that there's no other explanation.
I'm guessing that your file is actually UTF-8 in the beginning, because UTF-8 has characteristic byte sequences. It's quite unlikely that a piece of text only looks like UTF-8 but isn't actually.
But the part of the text containing the byte 0x96 cannot be UTF-8. It's likely that some text was encoded with an 8-bit encoding like CP1252, and then concatenated to the UTF-8 text. This is something that shouldn't happen, because now you have multiple encodings in a single file. Such a file is broken with respect to text encoding.
This is all just guessing, but in my experience, this is the most likely explanation for the scenario you described.
For text with broken encoding, you can use the third-party Python library ftfy: fixes text for you.
It will cut your text at every newline character and try to find (guess) the right encoding for each portion.
It doesn't magically do the right thing always, but it's pretty good.
To give you more detailed guidance, you'd have to show the code of the script you're calling (if it's your code and you want to fix it).

Related

Fix wrong character encoding

I have a large excel file which I read with pandas.read_excel(). I guess it was compiled from various sources and contains some badly encoded characters.
For instance, a string foo which should be für is printed as fãâ¼r and has foo.__repr__()=='fã\x83â¼r'.
Some other string bar which should be française is printed as franãâ§aise with bar.__repr__()=='franã\x83â§aise'.
And another one baz which should also be française is printed as franãâaise with baz.__repr__()=='franã\x83â\x87aise'.
Same goes for ñ with a __repr__ of ã\x83â\x91, and so on.
What would be the best way to sanitize this input? Or is there a way to avoid the problem altogether?
Also, is there a way to simply search for such characters in my data? A .contains('\x') fails as there is nothing after the unicode escape (SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape)

Detect encoding in Python3 when BOM is present

For starters, I know that you can't always detect encoding, that's not the object of the question. Can Python detect the encoding when opening a file, when there's a BOM giving the encoding ?
I have a collection of files. Some are in UTF-16 (so they have a BOM saying so) and some are in latin 1. Let's ignore the edge case of latin-1 files that for some reason would start with the exact same characters as the UTF-16 BOM. I want, when opening the file, to look for a BOM. If there is one, automatically open the file using the encoding associated with the BOM. If there isn't a BOM, open with latin 1.
Here is my workaround :
with open(filename,mode="rb") as f:
text=f.readlines()
text=b''.join(text)
if text[:2]==(b'\xff\xfe'):
text=text.decode("utf-16")
else:
text=text.decode("iso-8859-1")
Is there some kind of module that replaces "open" to do that, with all encodings that have a BOM ?

Confirming the encoding of a file

I am outputting a file from SSIS in UTF-8 Encoding.
This file is passed to a third party for import into their system.
They are having a problem importing this file. Although they requested UTF-8 encoding, it seems they convert the encoding to ISO-8859-1. They use this command to convert the files encoding:
iconv -f UTF-8 -t ISO-8859-1 dweyr.inp
They are receiving this error
illegal input sequence at position 11
The piece of text causing the issue is:
ark O’Dwy
I think its the apostrophe, or whatever version of an apostrophe is used in this text.
The problem i face is that every text editor i try tells me the file is UTF-8 and renders it correctly.
The vendor is saying that this char is not UTF-8.
How can i confirm whom is correct?

The error message by iconv is a bit misleading, but kind-of correct.
It doesn't tell you that the input isn't valid UTF-8, but that it cannot be converted to ISO-8859-1 in a lossless way. ISO-8859-1 does not have a way to encode the ’ character.
Verify that by executing this command:
echo "ark O’Dwy" | iconv -f UTF-8 -t UTF-7
This produces the output that looks like "ark O+IBk-Dwy".
Here I'm outputting to UTF-7 (a very rarely used encoding that is useful for demonstration here, but little else).
In other words: the encoding is only "illegal" in the sense that it cannot be converted to ISO-8859-1, but it's a perfectly valid UTF-8 sequence.
If the third party claims to support UTF-8, then they may do so only very superficially. They might support any text that can be encoded in ISO-8859-1 as long as it's encoded in UTF-8 (which is an extremely low level of "UTF-8 support").

Convert Unicode Escape to Hebrew text

I have the following text in a json file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa
\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
no matter which encoding/decoding i use i don't seem to get it right with
Python 3.
if for example ill try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa
\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
i get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which in bytecode is almost the correct text, if i was able to remove only one backslash and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how i changed double slash to single slash) then
text.decode('utf-8')
would yield the correct text in Hebrew.
but i am struggling to do so and couldn't manage to create a piece of code which will do that for me (and not manually as i just showed...)
any help much appreciated...

This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what literally is saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput
.encode('latin-1')
.decode('raw_unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer would already interpret the escape sequences for you. That is, after the text element is loaded by JSON deserializer, it should be sufficient to just encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of unicode, whereas your strangely broken text encodes each byte of UTF-8 encoding as an ASCII-escape of an UTF-16 code unit.
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time, it might be that the latin-1 doesn't work properly any more. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII, it would only make everything worse.

Removing lines containing encoding errors in a text file

I must warn you I'm a beginner. I have a text file in which some lines contain encoding errors. By "error", this is what I get when parsing the file in my linux console (question marks instead of characters):
I want to remove every line showing those "question marks". I tried to grep -v the problematic character, but it doesn't work. The file itself is UTF8 and I guess some of the lines come from texts encoded in another format. I know I could find a way to reconvert them properly, but I just want them gone for now.
Do you have any ideas about how I could do this please?
PS: Some lines contain diacritics which are displayed fine. The "strings" command seems to remove too many "good" lines.

When dealing with mojibake on character encodings other than ANSI you must check 2 things:
Is the file really encoded in X? (X being UTF-8 WITHOUT BOM in your case. You could be trying to read UTF-8 WITH BOM, UTF-16, latin-1, etc. as UTF-8, and that would be the problem). Try reading in (not converting to) other encodings and see if any of them fits.
Is your locale or text editor set to read the file as UTF-8? If not, that may be the problem. Check for support and figure out how to change the setting. In linux try locale and setlocale commands to check and set it properly.
I like how notepad++ for windows (which also runs perfectly in linux using wine) lets you set any encoding you want to read the file without trying to convert it (of course if you set any other than the one the file is encoded in you will only see those weird characters), and also has a different option which allows you to convert it from one encoding to another. That has been pretty useful to me.
If you are a beginner you may be interested in this article. It explains briefly and clearly the whats, whys and hows of character encoding.
[EDIT] If the above fails, even windows-1252 and such ANSI encodings, I've just learned here how to remove non-ascii characters using tr unix command, turning it into ASCII (but be aware information on extra characters is lost in this output and there is no coming back, so keep the input file just in case you find a better fix):
tr -cd '\11\12\40-\176' < $INPUT_FILE > $OUTPUT_FILE
or, if you want to get rid of the whole line:
grep -v -P "[^\11\12\40-\176]" $INPUT_FILE > $OUTPUT_FILE
[EDIT 2] This answer here gives a pretty good guess of what could be happening if none of the encodings work on your file (Unfortunately the only straight forward solution seems to be removing those problematic characters).

You can use a micro-Perl script like:
perl -pe 's/[^[:ascii:]]+//g;' my_utf8_file.txt

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string