Convert unicode text to single byte ascii in Python - python-3.x

I have an input file whose data I need to process. The file is in UTF-16 even though every single character in it is just a standard ascii character.
I can NOT change the input file so that it doesn't use useless double byte characters to represent 100% English language single character data. I need to convert this in python, on Windows. (Please, no non-python solutions, thank you).
I want my python program to act on these strings and output a file which is NOT double-byte. I just want standard ascii strings (one byte per character)
I've googled a lot, see all sorts of related questions, but not mine. I'm frustrated with not being able to solve this seemingly very simple question and need.
EDIT: Here is the program I got to work. It is absurd. There must be an easier way. The chr(10) references in the code is because the input has lines and I couldn't find a nonabsurd way to do simple readline/writeline calls.
with open('Unicode.txt','r') as input:
with open('ASCII.txt','w') as output:
for line in input.readlines():
codelist=[code for code in line.encode('ascii','ignore') if code not in (0,10)]
if codelist:
output.write(''.join([chr(code) for code in codelist]+[chr(10)]))
Question solved after reading a hint from #Mark Ransom.

with open('unicode.txt','r',encoding='UTF-16') as input:
with open('ascii.txt','w',encoding='ascii') as output:
output.write(input.read())

Related

Beautiful Soup - meaning of letter 'u' in documentation [duplicate]

Like in:
u'Hello'
My guess is that it indicates "Unicode", is that correct?
If so, since when has it been available?
You're right, see 3.1.3. Unicode Strings.
It's been the syntax since Python 2.0.
Python 3 made them redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed them, but they were re-added in 3.3+ for compatibility with Python 2 to aide the 2 to 3 transition.
The u in u'Some String' means that your string is a Unicode string.
Q: I'm in a terrible, awful hurry and I landed here from Google Search. I'm trying to write this data to a file, I'm getting an error, and I need the dead simplest, probably flawed, solution this second.
A: You should really read Joel's Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) essay on character sets.
Q: sry no time code pls
A: Fine. try str('Some String') or 'Some String'.encode('ascii', 'ignore'). But you should really read some of the answers and discussion on Converting a Unicode string and this excellent, excellent, primer on character encoding.
My guess is that it indicates "Unicode", is it correct?
Yes.
If so, since when is it available?
Python 2.x.
In Python 3.x the strings use Unicode by default and there's no need for the u prefix. Note: in Python 3.0-3.2, the u is a syntax error. In Python 3.3+ it's legal again to make it easier to write 2/3 compatible apps.
I came here because I had funny-char-syndrome on my requests output. I thought response.text would give me a properly decoded string, but in the output I found funny double-chars where German umlauts should have been.
Turns out response.encoding was empty somehow and so response did not know how to properly decode the content and just treated it as ASCII (I guess).
My solution was to get the raw bytes with 'response.content' and manually apply decode('utf_8') to it. The result was schöne Umlaute.
The correctly decoded
für
vs. the improperly decoded
fĂźr
All strings meant for humans should use u"".
I found that the following mindset helps a lot when dealing with Python strings: All Python manifest strings should use the u"" syntax. The "" syntax is for byte arrays, only.
Before the bashing begins, let me explain. Most Python programs start out with using "" for strings. But then they need to support documentation off the Internet, so they start using "".decode and all of a sudden they are getting exceptions everywhere about decoding this and that - all because of the use of "" for strings. In this case, Unicode does act like a virus and will wreak havoc.
But, if you follow my rule, you won't have this infection (because you will already be infected).

How to convert "binary text" to "visible text"?

I have a text file full of non-ASCII characters.
I can not detect the encoding by either file or enca.
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Windows Notepad++
Edit: The expression above leads misunderstanding. Sorry for this.
In fact, I picked some parts of the original file and put them into new text file, then opened in notepad++.
The 2 parts shows as below. They are decoded in 2 different ways by notepad++.
Question:
How could I detect the files encoding under linux?
how do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get result by "grep 'сойдя' win.txt" even though the "сойдя" is encoded into <F1><EE><E9><E4><FF>?
The file content slice as follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf-8.txt now.
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.

Handling print() UnicodeEncodeError

I extract unicode text from a website and do some printing that can contain non-ascii characters. Printing those raises a UnicodeEncodeError. How should I handle this case? I need a solution that requires minimum changes to the existing code.
I tried encoding the strings (by wrapping sys.stdout) with UTF-8, but then I get a TypeError since the original sys.stdout.write() method expects str not bytes.
I don't want the characters to be lost. The user may want to pipe the output into a file that would in this case be UTF8 formatted.
Edit: Setting PYTHONIOENCODING=utf8 can fix the problem, but isn't there a way I can "imitate" this behaviour in Python? Since utf8 doesn't match the Windows console encoding, some characters look strange. But this is by far better than a crash of the program.
Anyway to format the output from Python so I don't need to modify environment variables?

Read a text file to a string using fortran77

Is it possible to read a text file to a string using fortran77.
I actually have a text file in the following format
Some comments
Some comments
n1 m1 comment_with_unknown_number_of_words
..m1 lines of data..
n2 m2 comment_with_unknown_number_of_words
..m2 lines of data..
and so on
whereas n1,n2.. are the orders of the objects. m1, m2,..are the number of lines which contains the data about these objects, respectively. I also want to store the comment of each object for further investigations.
How can I deal with this? Thank you so much in advance!
I can't believe nobody called me on this.. My apologies this in fact only grabs the first word of the comment...
------------original answer----
Not to recomend F77, but this isnt that tough a problem either. Just declare a char variable long enough to hold your longest comment and use a list directed read.
integer m1,n1
char*80 comment
...
read(unit,*)m1,n1,comment
If you want to write it back out without padding a bunch of extra spaces thats a bit of effort but hardly the end of the world.
What you can not do at all in f77 is discern whether your file has trailing blanks at the end of a line, unless you go to direct access reading.
------------improved answer
What you need to do is read the whole line as a string then read your integers from the string:
read(unit,'(a)')comment
read(comment,*)m1,n1
at this point comment contains the whole line including your two integers (perhaps that will do the job for you). If you want to pull off the actual string it requires a bit of coding (I have a ~40 line subroutine to split the string into words). I could post if interesed but I'm more inclined as others to encourage you to see if your code will work with a more modern compiler.

Troubles with text encoding

I'm having some troubles with text encoding. Parsing a website gives me a Data.Text string
"Project - Fran\195\167ois Dubois",
which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a Bytestring. The problem is that this yields garbled output:
"Project - François Dubois".
What am I missing here?
If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.
That's inconvenient because Data.Text[.Lazy] is supposed to be UTF-16 encoded text, and the two code units 195 and 167 are interpreted as the unicode code points 195 resp. 167 which are 'Ã' resp. '§'. If you UTF-8-encode the text, these are converted to the byte sequences c383 ([195,131]) resp c2a7 ([194,167]).
The most likely way for getting into this situation is that the data you got from the website was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin 1) encoded (or another 8-bit encoding; 8859-15 is widespread too).
The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].
If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
If avoiding the situation is not possible, the easiest ways of fixing it are
replacing the occurrences of the offending sequence with the desired one before encoding:
encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
assuming everything else is ASCII or inadvertent UTF-8 too, interpret the Text code units as bytes:
Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.
I am not completely sure if less can show UTF-8 encoded characters properly. GVim can. You can check this link on SO to find out how you can view UTF-8 data in gVim.
And regarding the other issue of being able to pass this to graphviz, I think you need to set the encoding on the command-line as explained in the Graph NonAscii FAQ.
From what you are explaining, I think there are no issues with how the data is being persisted. If you pass the encoding properly to graphviz, I think your problem will be resolved.
P.S: Creating an answer since it is easier to create descriptive links

Resources