Use iconv or python3 to recode utf-8 to Latin-1 (ISO-8859-1) preserving accented characters - python-3.x

By most accounts, one ought to be able to change the encoding of a UTF-8
file to a Latin-1 (ISO-8859-1) encoding by a trivial invocation of iconv such as:
iconv -c -f utf-8 -t ISO-8859-1//TRANSLIT
However, this fails to deal with accented characters properly. Consider
for example:
$ echo $LC_ALL
C
$ cat Gonzalez.txt
González, M.
$ file Gonzalez.txt
Gonzalez.txt: UTF-8 Unicode text
$ iconv -c -f utf-8 -t ISO-8859-1//TRANSLIT < Gonzalez.txt > out
$ file out
out: ASCII text
$ cat out
Gonzalez, M.
I've tried several variations of the above, but none handles
the accented "a" properly, even though Latin-1 does have an accented "a".
Indeed, uconv does handle the situation properly:
$ uconv -x Any-Accents -f utf-8 -t l1 < Gonzalez.txt > out
$ file out
out: ISO-8859 text
Opening the file in emacs or
Sublime shows the accented "a" properly. Same thing using -x nfc.
Unfortunately, my target environment does not permit a solution using "uconv",
so I am looking for a simple solution using either iconv or Python3.
python3 attempts
My attempts using python3 so far have not been successful.
For example, the following:
import sys
import fileinput  # allows a file to be specified or else reads from STDIN

for line in fileinput.input():
    l = line.encode("latin-1", "replace")
    sys.stdout.buffer.write(l)
produces:
Gonza?lez, M.
(That's a literal "?".)
I've tried various other Python3 possibilities, so far without success.
Please note that I've reviewed numerous SO questions on this topic, but the answers using iconv or Python3 do not handle Gonzalez.txt properly.

There are two ways to encode A WITH ACUTE ACCENT in Unicode.
One is to use a single precomposed character, as illustrated here with Python's built-in ascii function:
>>> ascii('á')
"'\\xe1'"
But you can also use a combining accent following an unaccented letter a:
>>> ascii('á')
"'a\\u0301'"
Depending on the application displaying them, the two variants may be visually indistinguishable (in my terminal, the latter looks a bit odd, with the accent drawn too large).
Now, Latin-1 has an accented letter a but no combining accents, which is why the combining acute becomes a question mark when encoding with errors="replace".
Fortunately, you can convert between the two variants automatically.
Without going into the details (there are many), Unicode defines normalization forms for this, two of which are the composed and decomposed forms, abbreviated NFC and NFD, respectively.
In Python, you can use the standard-library module unicodedata:
>>> import unicodedata as ud
>>> ascii(ud.normalize('NFD', 'á'))
"'a\\u0301'"
>>> ascii(ud.normalize('NFC', 'á'))
"'\\xe1'"
In your specific case, you can convert the input strings to NFC form, which increases the number of characters that Latin-1 can represent:
>>> n = 'Gonza\u0301lez, M.'
>>> print(n)
González, M.
>>> n.encode('latin1', errors='replace')
b'Gonza?lez, M.'
>>> ud.normalize('NFC', n).encode('latin1', errors='replace')
b'Gonz\xe1lez, M.'
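Putting it together, here is a corrected version of the script from the question. This is a sketch, assuming the input is decoded as UTF-8 (true under a UTF-8 locale or Python's UTF-8 mode); it normalizes each line to NFC before encoding:
import sys
import unicodedata
import fileinput  # allows a file to be specified or else reads from STDIN

for line in fileinput.input():
    # Compose decomposed sequences (e.g. a + combining acute -> \xe1)
    # so they map onto single Latin-1 code points.
    composed = unicodedata.normalize('NFC', line)
    sys.stdout.buffer.write(composed.encode('latin-1', errors='replace'))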

Related

How to convert text to UTF-8 encoding within a text file

When I use a text editor to see the actual content, I see
baliÄ<8d>ky 0 b a l i ch k i
and when I use cat to see it, I see
baličky 0 b a l i ch k i
How can I make it show baličky in the text editor as well?
I've tried numerous commands such as iconv -f UTF-8 -t ISO-8859-15, iconv -f ISO-8859-15 -t UTF-8, and recode utf8..l9.
None of them works; it's still baliÄ<8d>ky instead of baličky. This is a Czech word. A simple sed substitution (s/Ä<8d>/č/) works, but I have so many other characters like this that fixing them by hand is impractical at this point.
Any suggestions?
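No answer is quoted here, but a short sketch (my own, not from the thread) shows what is likely going on: the bytes are valid UTF-8, and the editor is decoding them as Latin-1, so the fix is to make the editor open the file as UTF-8 rather than to convert the file:
raw = 'baličky'.encode('utf-8')   # b'bali\xc4\x8dky' -- the bytes in the file
print(raw.decode('utf-8'))        # baličky       (what cat shows on a UTF-8 terminal)
print(raw.decode('latin-1'))      # baliÄ<8d>ky   (what the editor shows; \x8d is a control char)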

how can i make '\u' work in variable in python3

I got a string from the network like:
s = '\u0070\u0079\u0074\u0068\u006f\u006e' or similar.
When I print(s),
it outputs '\u0070\u0079\u0074\u0068\u006f\u006e'.
I want the '\u' escapes to be interpreted, so that
when I print(s)
it outputs 'python' (u'\u0070\u0079\u0074\u0068\u006f\u006e' = python).
What should I do?
I'm not sure I've understood exactly what you mean, but if you want to convert literal Unicode escape sequences into the characters they represent, you can use this solution.
For Python 2: print s.decode('unicode-escape')
For Python 3: print(bytes(s, 'ascii').decode('unicode-escape'))
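A quick demonstration of the Python 3 variant:
s = '\\u0070\\u0079\\u0074\\u0068\\u006f\\u006e'   # literal backslashes, as received
print(s)                                           # \u0070\u0079\u0074\u0068\u006f\u006e
print(bytes(s, 'ascii').decode('unicode-escape'))  # python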

Obfuscate a Python script in Unicode escape sequences

I want to obfuscate a Python script by using Unicode escape sequences.
For example,
print("Hello World")
in Unicode escape sequences is:
\x70\x72\x69\x6e\x74\x28\x22\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x22\x29
From my command line, I can achieve this with:
$ python3 -c \x70\x72\x69\x6e\x74\x28\x22\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x22\x29
Hello World
I've created a file and put the "Hello World" escape sequence in it as the source code.
But when I run it, I get:
$ python3 sample.py
SyntaxError: unexpected character after line continuation character
How can I use Unicode escape sequences in my source code?
You can use a PEP 263 header, which tells Python which encoding the source code is written in.
The format is:
# coding=<encoding name>
By using the unicode_escape codec (listed at https://docs.python.org/3/library/codecs.html), Python will unescape the whole source file before parsing it.
sample.py
# coding=unicode_escape
\x70\x72\x69\x6e\x74\x28\x22\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x22\x29
Result:
$ python3 sample.py
Hello World
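To generate such escaped source in the first place, a small helper (my addition, not part of the original answer) will do; it assumes ASCII-only source, since \xNN escapes only cover code points below 0x100:
src = 'print("Hello World")'
escaped = ''.join(f'\\x{ord(c):02x}' for c in src)  # hex-escape every character
print(escaped)  # \x70\x72\x69\x6e\x74\x28\x22\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x22\x29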

How to remove non UTF-8 characters from text file

I have a bunch of Arabic, English, Russian files which are encoded in utf-8. Trying to process these files using a Perl script, I get this error:
Malformed UTF-8 character (fatal)
Manually checking the content of these files, I found some strange characters in them.
Now I'm looking for a way to automatically remove these characters from the files.
Is there anyway to do it?
This command:
iconv -f utf-8 -t utf-8 -c file.txt
will clean up your UTF-8 file, skipping all the invalid characters.
-f is the source format
-t the target format
-c skips any invalid sequence
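If iconv is not available, a rough Python equivalent (a sketch, not part of the original answer; the output filename is hypothetical) is:
with open('file.txt', 'rb') as f:
    data = f.read()
# Decoding with errors='ignore' silently drops bytes that are not valid
# UTF-8, much as iconv's -c flag does.
clean = data.decode('utf-8', errors='ignore').encode('utf-8')
with open('file.clean.txt', 'wb') as f:
    f.write(clean)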
Any approach must read the file byte by byte and fully understand how bytes combine into characters. The simplest method is to use an editor that will read anything but write out only valid UTF-8 characters. TextPad is one choice.
iconv can do it:
iconv -f cp1252 foo.txt
None of the methods here or in other similar questions worked for me.
In the end, what worked was simply opening the file in Sublime Text 2, going to File > Reopen with Encoding > UTF-8, then copying the entire content into a new file and saving it.
This may not be the expected solution, but I'm putting it out here in case it helps anyone, since I struggled with this for hours.

encoding problem?

I work with txt files, and I recently found e.g. these characters in a few of them:
http://pastebin.com/raw.php?i=Bdj6J3f4
What could these characters be? A wrong character encoding? I just want to use normal UTF-8 TXT files, but when I use:
iconv -t UTF-8 input.txt > output.txt
it's still the same.
When I open the files in gedit and copy+paste them into other txt files, there are no characters like the ones in the pastebin. So gedit can solve this problem; it writes the TXT files correctly. But there are too many txt files.
Why are there chars like those at http://pastebin.com/raw.php?i=Bdj6J3f4 in the text files? Can they be converted to "normal" chars? I can't see e.g. the "Ì" char when I open the files with vim, only after I work with them (e.g. awk, etc.).
It would help if you posted the actual binary content of your file (perhaps by using the output of od -t x1). The pastebin returns this as HTML:
"Ì"
"Ã"
"é"
The first line corresponds to U+00C3 U+0152. THe last line corresponds to U+00C3 U+00A9, which is the string "\ux00e9" in UTF ("\xc3\xa9") with the UTF-8 bytes reinterpreted as Latin-1.
From man iconv:
The iconv program converts text from one encoding to another encoding. More precisely, it converts from the encoding given for the -f option to the encoding given for the -t option. Either of these encodings defaults to the encoding of the current locale.
Because you didn't specify the -f option, iconv assumes the file is encoded in your current locale's encoding (probably UTF-8), which apparently is not true. Your text editors (gedit, vim) do some encoding detection; you can check which encoding they detect (I don't know how, as I don't use either of them) and pass that as the -f option to iconv (or save the open file with your desired encoding using one of those editors).
You can also use some tool for encoding detection like Python chardet module:
$ python -c "import chardet as c; print c.detect(open('file.txt').read(4096))"
{'confidence': 0.7331842298102511, 'encoding': 'ISO-8859-2'}
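That one-liner is Python 2; a Python 3 equivalent (assuming the chardet package is installed) would be:
import chardet

with open('file.txt', 'rb') as f:
    # detect() takes raw bytes and guesses the encoding
    print(chardet.detect(f.read(4096)))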
..solved!
How: I just right-clicked on the folders containing the TXT files and pasted them into another folder.. :O and presto.. there are no more ugly chars..
