ASCII or UTF-8?

Long ago, before world scripts existed, text files were all ASCII. Nowadays, we have world scripts.
If I open a text file in a hex editor, is there a way to tell whether its encoding is ASCII or UTF-8?

UTF-8 is backwards compatible with ASCII: an ASCII text file is also a UTF-8 text file.
If a file contains any byte whose first hex digit is 8 through F (that is, any byte of 0x80 or above), it's not ASCII.
If a file is not ASCII, it may still be UTF-8: every byte starting with C or D must be followed by exactly one byte starting with 8, 9, A, or B; every byte starting with E by exactly two such bytes; and every byte starting with F by exactly three. If any of these bytes appears in any other context, it's not UTF-8.
There are a few more requirements for valid UTF-8, but they are harder to glean with a hex editor. See https://en.m.wikipedia.org/wiki/UTF-8
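
If you'd rather script the check than eyeball the hex dump, the same rules can be expressed in a few lines of Python (a minimal sketch; rather than re-checking the byte patterns by hand, it delegates the strict UTF-8 validation to Python's decoder):

def classify(path):
    # Read the raw bytes; encoding detection must not decode first.
    with open(path, "rb") as f:
        data = f.read()
    # ASCII uses only the byte values 0..127.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # Python's decoder enforces all the UTF-8 well-formedness rules.
    try:
        data.decode("utf-8")
        return "UTF-8 (with non-ASCII characters)"
    except UnicodeDecodeError:
        return "neither ASCII nor UTF-8"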

Related

A UTF-8 encoded file produces UnicodeDecodeError during parsing

I'm trying to reformat a text file so I can upload it to a pipeline (QIIME2). I tested the first few lines of my .txt file (it is tab-separated), and the conversion was successful. However, when I try to run the script on the whole file, I encounter an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte
I have identified that the file encoding is UTF-8, so I am not sure where the problem is coming from.
$ file filename.txt
filename.txt: UTF-8 Unicode text, with very long lines, with CRLF line terminators
I have also reviewed some of the lines that are associated with the error, and I am not able to visually identify any unorthodox characters.
I have tried to force encode it using:
$ iconv -f UTF8 -t UTF8 filename.txt > new_file.txt
However, the error produced is:
iconv: illegal input sequence at position 152683
The way I understand it, whatever character occurs at that position is not readable/translatable using the UTF-8 encoding, but then I am not sure why the file is reported to be encoded in UTF-8.
I am running this on Linux, and the data itself is sequence information from the BOLD database (in case anyone else has run into similar problems when trying to convert it into a format appropriate for QIIME2).
file is wrong. The file command doesn't read the entire file. It bases its guess on some sample of the file. I don't have a source ref for this, but file is so fast on huge files that there's no other explanation.
I'm guessing that your file is actually UTF-8 in the beginning, because UTF-8 has characteristic byte sequences. It's quite unlikely that a piece of text only looks like UTF-8 but isn't actually.
But the part of the text containing the byte 0x96 cannot be UTF-8. It's likely that some text was encoded with an 8-bit encoding like CP1252, and then concatenated to the UTF-8 text. This is something that shouldn't happen, because now you have multiple encodings in a single file. Such a file is broken with respect to text encoding.
This is all just guessing, but in my experience, this is the most likely explanation for the scenario you described.
For text with broken encoding, you can use the third-party Python library ftfy ("fixes text for you").
It will cut your text at every newline character and try to find (guess) the right encoding for each portion.
It doesn't magically do the right thing always, but it's pretty good.
To give you more detailed guidance, you'd have to show the code of the script you're calling (if it's your code and you want to fix it).
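
If you want to locate and inspect the offending bytes before deciding how to repair the file, a short sketch like the following can help (filename.txt stands in for the real file):

with open("filename.txt", "rb") as f:
    data = f.read()
try:
    data.decode("utf-8")
    print("valid UTF-8 throughout")
except UnicodeDecodeError as e:
    # e.start is the byte offset of the first invalid sequence
    print(f"bad byte {data[e.start]:#04x} at offset {e.start}")
    print("context:", data[max(0, e.start - 20):e.start + 20])

Incidentally, 0x96 is the CP1252 encoding of an en dash (–), which fits the theory that Windows-encoded text ended up inside the UTF-8 file.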

Does reading a binary file linewise in python cause problems for unicode data?

I'm reading a large (10Gb) bzipped file in python3, which is utf-8-encoded JSON. I only want a few of the lines though, that start with a certain set of bytes, so to save having to decode all the lines into unicode, I'm reading the file in 'rb' mode, like this:
import bz2

with bz2.open(filename, 'rb') as file:
    for line in file:
        if line.startswith(b'Hello'):
            # decode line here, then do stuff
            ...
But I suddenly thought, what if one of the unicode characters contains the same byte as a newline character? By doing for line in file will I risk getting truncated lines? Or does the linewise iterator over a binary file still work by magic?
Line-wise iteration will work for UTF-8 encoded data.
Not by magic, but by design:
UTF-8 was created to be backwards-compatible to ASCII.
ASCII only uses the byte values 0 through 127, leaving the upper half of possible values for extensions of any kind.
UTF-8 takes advantage of this, in that any Unicode codepoint outside ASCII is encoded using bytes in the range 128..255.
For example, the letter "Ċ" (capital letter C with dot above) has the Unicode codepoint value U+010A.
In UTF-8, this is encoded with the byte sequence C4 8A, thus without using the byte 0A, which is the ASCII newline.
In contrast, UTF-16 encodes the same character as 0A 01 or 01 0A (depending on the endianness).
So I guess UTF-16 is not safe to do line-wise iteration over.
It's not that common as file encoding though.
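
A quick way to convince yourself of this in Python (a small illustration, not part of the original question):

text = "one Ċ two\nthree\n"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# In UTF-8, the byte 0x0A only ever means a real newline.
print("Ċ".encode("utf-8"))      # b'\xc4\x8a' -- no 0x0A byte inside
print(utf8.count(b"\n"))        # 2 -- exactly the two real line breaks

# In UTF-16, the code unit for 'Ċ' (U+010A) contains the byte 0x0A.
print("Ċ".encode("utf-16-le"))  # b'\n\x01' -- a spurious "newline" byte
print(utf16.count(b"\n"))       # 3 -- two real newlines plus one from 'Ċ'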

How do I get from Ã‰phÃ©mÃ¨re to Éphémère in Python3?

I've tried all kinds of combinations of encode/decode with options 'surrogatepass' and 'surrogateescape' to no avail. I'm not sure what format this is in (it might even be a bug in Autoit), but I know for a fact the information is in there because at least one online utf decoder got it right. On the online converter website, I specified the file as utf8 and the output as utf16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
Ã‰ at the place of É
Ã© at the place of é
Ã¨ at the place of è
The following table summarises what's going on:

character    UTF-8 bytes    decoded as Windows-1252
É            C3 89          Ã‰
é            C3 A9          Ã©
è            C3 A8          Ã¨
All of É, é, and è are non-ASCII characters, and they are encoded with UTF-8 to 2-byte long codes.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character à (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings here.
So, that's why your decoding renders Ã‰ instead of É. The same goes for the other non-ASCII characters é and è.
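
You can reproduce and undo this mojibake in Python; a minimal sketch (the string literal is simply the garbled text from the question):

# Re-encode with the encoding that was wrongly used for decoding
# to recover the original bytes, then decode those bytes as UTF-8.
garbled = "Ã‰phÃ©mÃ¨re"
fixed = garbled.encode("windows-1252").decode("utf-8")
print(fixed)  # Éphémère

This round trip only works if every garbled character actually exists in Windows-1252; otherwise the encode step itself fails.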
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')

Vim's encoding options

Although Vim's help is a treasure cave of information, in some cases I find it mind-boggling. Its explanation of the different encoding-related options is one such case.
Can someone please explain to me, in simple terms, what the encoding, fileencoding and fileencodings settings do, and how can I
a) view the encoding of the current file?
b) change the encoding of the current file?
c) do something else which is used often, but slips my mind right now?
encoding is used by Vim to know what character sets it supports and how characters are stored internally.
You shouldn't really modify this setting; it should default to something Unicodeish. Otherwise you couldn't read and write files with an extended character set.
Put :set encoding=utf-8 at the start of your vimrc if you are not sure, and never play with that setting again except if you have to read huge files for one session with a 1-byte encoding.
fileencoding stores the encoding of the current buffer.
You might read and write to this variable and it will do what you want.
When you modify it, the file will be marked as modified, and when you save it (:w or :up) to disk, it will be written with the encoding that you specified.
fileencodings tells Vim how to detect the encoding of every file you read (in order to determine the value of fileencoding). It is a list of encodings that are tried in order, and the first encoding that is consistent with the binary contents of the file is assumed to be the encoding of the file you are reading.
Set it once and then forget it. You might need to change it if you know you are going to open plenty of files that all use the same encoding and you don't want to lose time trying other encodings first. The default, which is ucs-bom,utf8,latin1, is nice IMO if you are in Western Europe, because almost any file will be opened with the correct encoding. However, with this setting, when you open a plain ASCII file (i.e., one whose byte representation would be the same in UTF-8 and in any Latin-based code page encoding), the file will be assumed to be UTF-8, and saved as such.
Example: if you set fileencodings to latin1,utf8, every file that you open will be read as latin1 because trying to read a file with latin1 encoding never fails: there is a bijection between the 256 possible byte values and the individual characters in the character set.
Conversely, if you set fileencodings=ucs-bom,utf8,latin1, Vim will first check for a byte order mark and decode Unicode files that have a BOM; if that fails (no BOM), it will try to read the file as UTF-8; and if that fails too (because some byte sequences are invalid in UTF-8), it will open the file as latin1.
In order to reload a file with proper encoding (case when fileencodings did not work properly) you can do: :e! ++enc=<the_encoding>.
tl;dr:
view the encoding of the current file: :echo &fileencoding (shorter: :echo &fenc or :set fenc? or :verb set fenc?)
change the encoding of the current file: :set fenc=… and then call :w as many times as you want.
reload your file with proper encoding: :e! ++enc=…
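For example, to convert a latin1 file to UTF-8 using only the commands above (a hypothetical session):
:e! ++enc=latin1 (reload the buffer, forcing Vim to decode it as latin1)
:set fenc=utf-8 (the buffer is now marked as modified)
:w (the file is written to disk as UTF-8)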
encoding:
The internal representation. View or set with:
:set encoding
:set encoding=utf-8
fileencoding:
The representation that will be used when the file is written. View or set with:
:set fileencoding
:set fileencoding=utf-8
fileencodings:
The list of possible encodings that are tested when reading a file. View or set with:
:set fileencodings
:set fileencodings=utf-8,latin1,cp1251
Here is the list of possible encodings from the vim documentation (mbyte-encoding)
Supported 'encoding' values are: *encoding-values*
1 latin1 8-bit characters (ISO 8859-1, also used for cp1252)
1 iso-8859-n ISO_8859 variant (n = 2 to 15)
1 koi8-r Russian
1 koi8-u Ukrainian
1 macroman MacRoman (Macintosh encoding)
1 8bit-{name} any 8-bit encoding (Vim specific name)
1 cp437 similar to iso-8859-1
1 cp737 similar to iso-8859-7
1 cp775 Baltic
1 cp850 similar to iso-8859-4
1 cp852 similar to iso-8859-1
1 cp855 similar to iso-8859-2
1 cp857 similar to iso-8859-5
1 cp860 similar to iso-8859-9
1 cp861 similar to iso-8859-1
1 cp862 similar to iso-8859-1
1 cp863 similar to iso-8859-8
1 cp865 similar to iso-8859-1
1 cp866 similar to iso-8859-5
1 cp869 similar to iso-8859-7
1 cp874 Thai
1 cp1250 Czech, Polish, etc.
1 cp1251 Cyrillic
1 cp1253 Greek
1 cp1254 Turkish
1 cp1255 Hebrew
1 cp1256 Arabic
1 cp1257 Baltic
1 cp1258 Vietnamese
1 cp{number} MS-Windows: any installed single-byte codepage
2 cp932 Japanese (Windows only)
2 euc-jp Japanese (Unix only)
2 sjis Japanese (Unix only)
2 cp949 Korean (Unix and Windows)
2 euc-kr Korean (Unix only)
2 cp936 simplified Chinese (Windows only)
2 euc-cn simplified Chinese (Unix only)
2 cp950 traditional Chinese (on Unix alias for big5)
2 big5 traditional Chinese (on Windows alias for cp950)
2 euc-tw traditional Chinese (Unix only)
2 2byte-{name} Unix: any double-byte encoding (Vim specific name)
2 cp{number} MS-Windows: any installed double-byte codepage
u utf-8 32 bit UTF-8 encoded Unicode (ISO/IEC 10646-1)
u ucs-2 16 bit UCS-2 encoded Unicode (ISO/IEC 10646-1)
u ucs-2le like ucs-2, little endian
u utf-16 ucs-2 extended with double-words for more characters
u utf-16le like utf-16, little endian
u ucs-4 32 bit UCS-4 encoded Unicode (ISO/IEC 10646-1)
u ucs-4le like ucs-4, little endian
The {name} can be any encoding name that your system supports. It is passed
to iconv() to convert between the encoding of the file and the current locale.
For MS-Windows "cp{number}" means using codepage {number}.
Examples:
:set encoding=8bit-cp1252
:set encoding=2byte-cp932
The MS-Windows codepage 1252 is very similar to latin1. For practical reasons
the same encoding is used and it's called latin1. 'isprint' can be used to
display the characters 0x80 - 0xA0 or not.
Several aliases can be used, they are translated to one of the names above.
An incomplete list:
1 ansi same as latin1 (obsolete, for backward compatibility)
2 japan Japanese: on Unix "euc-jp", on MS-Windows cp932
2 korea Korean: on Unix "euc-kr", on MS-Windows cp949
2 prc simplified Chinese: on Unix "euc-cn", on MS-Windows cp936
2 chinese same as "prc"
2 taiwan traditional Chinese: on Unix "euc-tw", on MS-Windows cp950
u utf8 same as utf-8
u unicode same as ucs-2
u ucs2be same as ucs-2 (big endian)
u ucs-2be same as ucs-2 (big endian)
u ucs-4be same as ucs-4 (big endian)
u utf-32 same as ucs-4
u utf-32le same as ucs-4le
default stands for the default value of 'encoding', depends on the
environment
For the UCS codes the byte order matters. This is tricky, use UTF-8 whenever
you can. The default is to use big-endian (most significant byte comes
first):
name bytes char
ucs-2 11 22 1122
ucs-2le 22 11 1122
ucs-4 11 22 33 44 11223344
ucs-4le 44 33 22 11 11223344
On MS-Windows systems you often want to use "ucs-2le", because it uses little
endian UCS-2.
There are a few encodings which are similar, but not exactly the same. Vim
treats them as if they were different encodings, so that conversion will be
done when needed. You might want to use the similar name to avoid conversion
or when conversion is not possible:
cp932, shift-jis, sjis
cp936, euc-cn

unzip utf8 to ascii

I took some files from my Linux hosting to my Windows machine via FTP,
and when I check the file encodings, they are UTF-8 without BOM.
Now I need to convert those files back to ASCII and send them to my other Linux server.
I zipped the files.
Can I do something like:
unzip, and if it's a text file in UTF-8 format, convert it to ASCII?
I want to make the conversion while I am unzipping the files.
Thanks!
The program you're looking for is iconv; it will convert between encodings. Use it like this:
iconv -f utf-8 -t ascii < infile > outfile
However: ASCII is a subset of UTF-8. That is, a file that's written in ASCII is also correct UTF-8; no conversion is needed. The only reason for needing to convert the other way is if there are characters in your UTF-8 file that are outside the ASCII range. And if this is the case, you can't convert it to ASCII, because ASCII doesn't have those characters!
Are you sure you mean ASCII? Pure ASCII is rare these days. ISO-8859-15 (Western European) or CP1252 (Windows) are much more common.
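
If you still want to automate the unzip-and-check step, here is a minimal Python sketch (the archive name files.zip is a placeholder; in line with the above, files containing non-ASCII characters are only reported, since they cannot be converted to ASCII losslessly):

import zipfile

with zipfile.ZipFile("files.zip") as zf:
    for name in zf.namelist():
        data = zf.read(name)
        try:
            data.decode("ascii")
            print(name, "is pure ASCII already; nothing to convert")
        except UnicodeDecodeError:
            print(name, "contains non-ASCII bytes; no lossless conversion")

If a lossy conversion is acceptable, GNU iconv can transliterate instead: iconv -f utf-8 -t ascii//TRANSLIT replaces non-ASCII characters with ASCII approximations where possible.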
