I have a large excel file which I read with pandas.read_excel(). I guess it was compiled from various sources and contains some badly encoded characters.
For instance, a string foo which should be für is printed as fÃÂ¼r and has foo.__repr__()=='fÃ\x83Â¼r'.
Some other string bar which should be française is printed as franÃÂ§aise with bar.__repr__()=='franÃ\x83Â§aise'.
And another one baz which should also be française is printed as franÃÂaise with baz.__repr__()=='franÃ\x83Â\x87aise'.
Same goes for Ñ with a __repr__ of 'Ã\x83Â\x91', and so on.
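This pattern is classic double mojibake: the text was encoded as UTF-8 and mistakenly decoded as Latin-1 (or Windows-1252), twice over. A minimal sketch that reproduces the damage and then reverses it, assuming exactly that chain:

# encode as UTF-8, misread as Latin-1, twice
s = 'für'.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
print(repr(s))   # 'fÃ\x83Â¼r', matching the repr above

# reverse it: re-encode as Latin-1 (recovering the raw bytes), decode as UTF-8, twice
fixed = s.encode('latin-1').decode('utf-8').encode('latin-1').decode('utf-8')
print(fixed)     # für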
What would be the best way to sanitize this input? Or is there a way to avoid the problem altogether?
Also, is there a way to simply search for such characters in my data? A .contains('\x') fails as there is nothing after the unicode escape (SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape)
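The SyntaxError comes from the string literal itself, not from pandas: a '\x' escape must be followed by exactly two hex digits. A complete escape such as '\x83' is valid and can be searched for literally; a rough sketch, where the file and column names are hypothetical:

import pandas as pd

df = pd.read_excel('data.xlsx')   # hypothetical file
# regex=False searches for the literal character U+0083
mask = df['col'].str.contains('\x83', regex=False, na=False)
print(df[mask])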
Related
I'm trying to reformat a text file so I can upload it to a pipeline (QIIME2) - I tested the first few lines of my .txt file (but it is tab separated), and the conversion was successful. However, when I try to run the script on the whole file, I encounter an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte
I have identified that the file encoding is UTF-8, so I am not sure where the problem is arising from.
$ file filename.txt
filename.txt: UTF-8 Unicode text, with very long lines, with CRLF line terminators
I have also reviewed some of the lines that are associated with the error, and I am not able to visually identify any unorthodox characters.
I have tried to force encode it using:
$ iconv -f UTF8 -t UTF8 filename.txt > new_file.txt
However, the error produced is:
iconv: illegal input sequence at position 152683
My understanding is that whatever character occurs at that position is not readable/translatable using the UTF-8 encoding, but then I am not sure why the file is reported as UTF-8 in the first place.
I am running this on Linux, and the data itself are sequence information from the BOLD database (if anyone else has run into similar problems when trying to convert this into a format appropriate for QIIME2).
file is wrong. The file command doesn't read the entire file; it bases its guess on a sample from the beginning. I don't have a source reference for this, but file is so fast on huge files that there's no other explanation.
I'm guessing that your file is actually UTF-8 in the beginning, because UTF-8 has characteristic byte sequences. It's quite unlikely that a piece of text only looks like UTF-8 but isn't actually.
But the part of the text containing the byte 0x96 cannot be UTF-8. It's likely that some text was encoded with an 8-bit encoding like CP1252, and then concatenated to the UTF-8 text. This is something that shouldn't happen, because now you have multiple encodings in a single file. Such a file is broken with respect to text encoding.
This is all just guessing, but in my experience, this is the most likely explanation for the scenario you described.
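One way to verify this diagnosis is to locate every offset at which UTF-8 decoding fails; a minimal sketch, assuming the file fits in memory:

with open('filename.txt', 'rb') as f:
    data = f.read()

pos = 0
while True:
    try:
        data[pos:].decode('utf-8')
        break
    except UnicodeDecodeError as e:
        bad = pos + e.start   # e.start is relative to the slice
        print('invalid byte 0x%02x at offset %d' % (data[bad], bad))
        pos = bad + 1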
For text with broken encoding, you can use the third-party Python library ftfy ("fixes text for you").
It will cut your text at every newline character and try to find (guess) the right encoding for each portion.
It doesn't magically do the right thing always, but it's pretty good.
To give you more detailed guidance, you'd have to show the code of the script you're calling (if it's your code and you want to fix it).
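As a rough sketch of how ftfy might be applied here, reading with Latin-1 first so that no byte can fail to decode (treat this as a starting point, not a guaranteed fix):

import ftfy

# latin-1 maps every byte to a code point, so reading cannot fail
with open('filename.txt', encoding='latin-1') as f:
    fixed = [ftfy.fix_text(line) for line in f]

with open('fixed.txt', 'w', encoding='utf-8') as f:
    f.writelines(fixed)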
I have the following text in a json file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa
\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
No matter which encoding/decoding I use, I don't seem to get it right with Python 3.
If, for example, I try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
I get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which as a byte string is almost the correct text. If I could remove just one backslash of each pair and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how the double backslashes became single backslashes), then
text.decode('utf-8')
would yield the correct text in Hebrew.
But I am struggling to do so, and I couldn't manage to write a piece of code that does it for me (rather than manually, as I just showed...).
Any help much appreciated...
This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what is literally saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput
    .encode('latin-1')
    .decode('raw_unicode_escape')
    .encode('latin-1')
    .decode('utf-8')
)
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer already interprets the escape sequences for you. That is, after the text element is loaded by the JSON deserializer, it should be sufficient to encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
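A minimal end-to-end sketch using the standard json module (the raw string below is the literal from the question):

import json

raw = r'"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"'
s = json.loads(raw)                         # the deserializer resolves the \uXXXX escapes
print(s.encode('latin-1').decode('utf-8'))  # אחוזת פולג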
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; the latin-1 trick might then no longer work properly. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII; it would only make everything worse.
I've tried all kinds of combinations of encode/decode with the options 'surrogatepass' and 'surrogateescape' to no avail. I'm not sure what format this is in (it might even be a bug in AutoIt), but I know for a fact the information is in there, because at least one online UTF decoder got it right. On the online converter website, I specified the file as UTF-8 and the output as UTF-16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
É at the place of É
é at the place of é
è at the place of è
The following table summarises what's going on:

Character | Code point | UTF-8 bytes | Bytes read as Windows-1252
É         | U+00C9     | C3 89       | Ã (C3) + ‰ (89) = É
é         | U+00E9     | C3 A9       | Ã (C3) + © (A9) = é
è         | U+00E8     | C3 A8       | Ã (C3) + ¨ (A8) = è
All of É, é, and è are non-ASCII characters, and they are encoded with UTF-8 to 2-byte long codes.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character à (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings in a Windows-1252 code page table.
So, that's why your decoding renders É instead of É. Idem for the other non-ASCII characters é and è.
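A short sketch reproducing and undoing this exact round trip in Python:

print('É'.encode('utf-8').decode('cp1252'))   # É  (the mojibake)
print('É'.encode('cp1252').decode('utf-8'))   # É  (the repair)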
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')
I need to get the first char of a text variable. I achieve this with one of the following simple methods:
string.sub(someText,1,1)
or
someText:sub(1,1)
If I do the following, I expect to get 'ñ' as the first letter. However, the result of either of the sub methods is 'Ã'
local someText = 'ñññññññ'
print('Test whole: '..someText)
print('first char: '..someText:sub(1,1))
print('first char with .sub: '..string.sub(someText,1,1))
Here are the results from the console:
2014-03-02 09:08:47.959 Corona Simulator[1701:507] Test whole: ñññññññ
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char: Ã
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char with .sub: Ã
It seems like the string.sub() function is encoding the returned value in UTF-8. Just for kicks I tried using the utf8_decode() function that's provided by Corona SDK. It was not successful. The simulator indicated that the function expected a number but got nil instead.
I also searched the web to see if anyone else had run into this issue. I found out that there is a fair amount of discussion on Lua, Corona, Unicode and UTF-8, but I did not come across anything that would address this specific problem.
Lua strings are 8-bit clean, which means a Lua string is a stream of bytes. The UTF-8 encoding of ñ occupies two bytes, but someText:sub(1,1) returns only the first of them.
In UTF-8, all characters in the ASCII range have the same representation as in ASCII, that is, a single byte smaller than 128. All other code points are encoded as sequences of bytes in which the first byte is in the range 194-244 and every continuation byte is in the range 128-191.
Because of this, you can use the pattern ".[\128-\191]*" to match a single UTF-8 code point (not grapheme):
-- parentheses are needed to call a method on a string literal
for c in ("ñññññññ"):gmatch(".[\128-\191]*") do  -- pretend the string is in NFC
    print(c)
end
Output:
ñ
ñ
ñ
ñ
ñ
ñ
ñ
Regarding the character set used:
Just know what requirements you bake into your own code, and make sure those are actually satisfied.
There are various typical requirements:
ASCII-compatible (aka any byte < 128 represents an ASCII character and all ASCII characters are represented as themselves)
Fixed-Size vs Variable-Width (maybe self-synchronizing) Character Set
No embedded 0-bytes
Write your code so that it relies on as few of those requirements as possible, and document the ones it does rely on.
Regarding "match a single UTF-8 character": be sure what you mean by "UTF-8 character". Is it a glyph or a code point? AFAIK you need full Unicode tables for glyph matching. Do you actually have to get to this level at all?
I'm writing scripts to clean up unicode text files (stored as UTF-8), and I chose to use Python 3.x (3.2) rather than the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I'm doing something wrong, but it seems that print(), at least, is still not defaulting to UTF-8. If I try to print a string (msg below is a string) that contains special characters, I still get a UnicodeEncodeError like this:
print(label, msg)
... in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position
38: character maps to <undefined>
If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:
print(label, msg.encode())
This also works for printing objects or lists containing unicode strings--something I often have to do when debugging--since str() seems to default to UTF-8. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj)? If so, I suppose I could try to wrap it with my own function, but I'm not confident about handling all the argument permutations that print() supports.
Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)
Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*
However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())
b'Applying regex 5 of 15: ^\\\\ge[0-9]*\\b([ ]+[0-9]+\\.)?[ ]*'
I'm not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.
print doesn't default to any encoding; it just uses whatever encoding the output device (such as a console) claims to support. Your console encoding appears to be non-Unicode, so print tries to encode your Unicode strings in that encoding, and fails. The easiest way to get around this is to tell the console to use UTF-8 (e.g. export LC_ALL=en_US.UTF-8 on Unix systems).
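You can check which encoding print() is actually targeting, and override it for the standard streams with the PYTHONIOENCODING environment variable; a quick sketch:

import sys
print(sys.stdout.encoding)   # the encoding print() will encode to

Then run with the encoding forced, e.g.: PYTHONIOENCODING=utf-8 python myscript.py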
The cleaner way to proceed is to use only Unicode inside your script, and encoded data only when you interact with the "outside" world, that is, when you have input to decode or output to encode.
To do so, every time you read something, use decode; every time you output something, use encode.
For your regexes, use the re.UNICODE flag (in Python 3 this is already the default for str patterns).
I know that this doesn't exactly answer your question point by point, but I think that applying such a methodology should keep you safe from encoding issues.
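A minimal sketch of that decode-at-the-boundaries pattern, with hypothetical file names, assuming the console encoding has been fixed as described above:

import re

# decode on the way in: everything inside the script is str (Unicode)
with open('regexes.txt', encoding='utf-8') as f:
    regexes = [re.compile(line.rstrip('\n')) for line in f]

with open('input.txt', encoding='utf-8') as f:
    text = f.read()

for i, regex in enumerate(regexes, 1):
    print('Applying regex {} of {}: {}'.format(i, len(regexes), regex.pattern))
    text = regex.sub('', text)

# encode on the way out
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)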