Beautiful Soup - meaning of letter 'u' in documentation [duplicate] - python-3.x

Like in:
u'Hello'
My guess is that it indicates "Unicode", is that correct?
If so, since when has it been available?

You're right, see 3.1.3. Unicode Strings.
It's been the syntax since Python 2.0.
Python 3 made them redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed the prefix entirely, but it was re-added in 3.3+ for compatibility with Python 2, to aid the 2-to-3 transition.
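A quick sketch of what the prefix means in practice under Python 3.3+ (where, as noted above, it is accepted but has no effect):
# Python 3.3+: the u prefix is legal again but changes nothing.
s1 = u'Hello'
s2 = 'Hello'
print(type(s1), type(s2))  # <class 'str'> <class 'str'>
print(s1 == s2)            # True
# Byte strings are a separate type; in Python 3 it's the b prefix that matters.
b = b'Hello'
print(type(b))             # <class 'bytes'>
print(s1 == b)             # False: str and bytes never compare equal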

The u in u'Some String' means that your string is a Unicode string.
Q: I'm in a terrible, awful hurry and I landed here from Google Search. I'm trying to write this data to a file, I'm getting an error, and I need the dead simplest, probably flawed, solution this second.
A: You should really read Joel's Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) essay on character sets.
Q: sry no time code pls
A: Fine, try str('Some String') or 'Some String'.encode('ascii', 'ignore'). But you should really read some of the answers and discussion on Converting a Unicode string and this excellent, excellent primer on character encoding.
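To see why that quick fix is "probably flawed": the 'ignore' error handler silently drops anything outside ASCII, so data is lost. A minimal illustration:
s = u'Fran\xe7ois'                   # 'François'
print(s.encode('ascii', 'ignore'))   # b'Franois' (the ç is simply gone)
print(s.encode('ascii', 'replace'))  # b'Fran?ois' (at least leaves a marker)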

My guess is that it indicates "Unicode", is that correct?
Yes.
If so, since when has it been available?
Since Python 2.0.
In Python 3.x the strings use Unicode by default and there's no need for the u prefix. Note: in Python 3.0-3.2, the u is a syntax error. In Python 3.3+ it's legal again to make it easier to write 2/3 compatible apps.

I came here because I had funny-char-syndrome on my requests output. I thought response.text would give me a properly decoded string, but in the output I found funny double-chars where German umlauts should have been.
Turns out response.encoding was empty somehow and so response did not know how to properly decode the content and just treated it as ASCII (I guess).
My solution was to get the raw bytes with response.content and manually apply decode('utf-8') to them. The result was schöne Umlaute.
The correctly decoded
für
vs. the improperly decoded
fĂźr
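For anyone hitting the same thing, a minimal sketch of the fix (the URL is a placeholder; response.content, response.text and response.encoding are standard requests attributes):
import requests

response = requests.get('https://example.com/some-page')  # placeholder URL
print(response.encoding)  # may be wrong or None, which breaks response.text

# Option 1: decode the raw bytes yourself
text = response.content.decode('utf-8')

# Option 2: tell requests which encoding to use, then use .text as usual
response.encoding = 'utf-8'
text = response.text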

All strings meant for humans should use u"".
I found that the following mindset helps a lot when dealing with Python strings: all Python string literals meant for text should use the u"" syntax; the bare "" syntax is for byte data only. (This is really Python 2 advice; in Python 3, "" is already Unicode and b"" is the byte-string literal.)
Before the bashing begins, let me explain. Most Python programs start out using "" for strings. But then they need to handle text pulled off the Internet, so they start calling "".decode, and all of a sudden they are getting exceptions everywhere about decoding this and that, all because they used "" for text. In this case, Unicode really does act like a virus and will wreak havoc.
But, if you follow my rule, you won't have this infection (because you will already be infected).
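A small Python 3 sketch of that mindset (the names are illustrative): decode bytes once at the boundary, work with text everywhere else, and encode again only on the way out.
raw = b'Fran\xc3\xa7ois'            # bytes arriving from a file, socket, etc.
name = raw.decode('utf-8')          # decode once at the boundary: now real text, 'François'
greeting = u'Hello, ' + name        # work with text internally (the u prefix is optional in Python 3)
payload = greeting.encode('utf-8')  # encode only when the data leaves the program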

Related

Convert unicode text to single byte ascii in Python

I have an input file whose data I need to process. The file is in UTF-16 even though every single character in it is just a standard ascii character.
I can NOT change the input file so that it doesn't use useless double byte characters to represent 100% English language single character data. I need to convert this in python, on Windows. (Please, no non-python solutions, thank you).
I want my Python program to act on these strings and output a file which is NOT double-byte; I just want standard ASCII strings (one byte per character).
I've googled a lot and seen all sorts of related questions, but not mine. I'm frustrated at not being able to solve this seemingly very simple need.
EDIT: Here is the program I got to work. It is absurd. There must be an easier way. The chr(10) references in the code are there because the input has lines and I couldn't find a non-absurd way to do simple readline/writeline calls.
with open('Unicode.txt', 'r') as input:
    with open('ASCII.txt', 'w') as output:
        for line in input.readlines():
            codelist = [code for code in line.encode('ascii', 'ignore') if code not in (0, 10)]
            if codelist:
                output.write(''.join([chr(code) for code in codelist] + [chr(10)]))
Question solved after reading a hint from @Mark Ransom.
with open('unicode.txt', 'r', encoding='UTF-16') as input:
    with open('ascii.txt', 'w', encoding='ascii') as output:
        output.write(input.read())
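If the input might contain the odd non-ASCII character after all, the same approach still works with an error handler; a defensive variant (not part of the accepted solution above):
# errors='ignore' drops anything that cannot be represented in ASCII
# instead of raising UnicodeEncodeError.
with open('unicode.txt', 'r', encoding='utf-16') as src:
    with open('ascii.txt', 'w', encoding='ascii', errors='ignore') as dst:
        dst.write(src.read())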

Python won't write '♥' character to text document

I've got a bit of an annoying issue, I'm trying to write a series of json data to a text document, however, python raises a UnicodeEncodeError whenever it encounters these kinds of characters.
Since the big Python 3 update these characters print to the console just fine; the issue appears when we go
with open("filename.txt", "a") as file
file.write("I ♥ ice cream")
file.close()
As I'm still a newbie to python, I haven't the slightest clue how to solve this, any ideas?
Found out how to solve this one!
First off I'd like to thank @JJJ for hinting me down the right track; my only criticism is that the presented solution wasn't very straightforward, and for someone with no knowledge of the significance of bytes and strings it may present quite the challenge.
Basically the problem was down to the default encoding my computer uses (the OS being a standard Windows 10 install), which is cp1252.
Running a simple bit of code in Python to test this illustrates the issue more clearly, and from there we can find a more viable solution.
text = "I ♥ IceCream"
text = text.encode("cp1252")
open('People Jobs.txt','a').write(text)
Running this in IDLE, we get this:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2665' in position 2: character maps to <undefined>
Ah! Now we can see our issue! The codec can't encode the character! Knowing this we can encode the string using utf-8 before writing it to the file like so:
text = "I ♥ IceCream"
text = text.encode("utf-8")
open('People Jobs.txt','a').write(text)
Running this, the string encodes to:
b'I \xe2\x99\xa5 IceCream'
which can be written to the file (opened in binary mode) with no worries. We can turn this back into the original message using the decode method, but for my purposes we don't need to do that.
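For the record, there is a simpler route than encoding by hand (not the one taken above, but standard Python 3): pass encoding='utf-8' to open() and let it encode on write, regardless of the Windows default of cp1252.
with open('People Jobs.txt', 'a', encoding='utf-8') as f:
    f.write("I ♥ IceCream\n")  # open() encodes the str to UTF-8 bytes for us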
Once again, I'd like to extend my thanks to those who commented on my post, your extensive knowledge of the python language is quite the asset and I greatly appreciate it.
Hopefully my failure to spot these simple programming principles will benefit others when they come to laugh at this post.
But hey that's why I have the name!
So until next time,
Mr Incompetent
P.S. @Pratik K, thank you for the reminder of how to write this one in a more compact manner, I appreciate it :) (been doing C++ for a while so I've forgotten about Python)
I tried it out, seems to work just fine. Just a side note, instead of using with ..., you can just as well write this out as: open(filename, 'a' ).write(string). This will open up the file, write/ append to it, and close it, all within one line.
Just to be clear, the syntax for your solution would be:
with open("filename.txt", "a", encoding="utf-8") as file:
    file.write("I ♥ ice cream")

Debug Inspector showing strange characters?

I'm just getting into Go for the first time and finally got things running on my Win10 machine. Finally got breakpoints working inside of IntelliJ IDEA, and I'm seeing stuff like this in my debugger window. Those messes of unicode chars should actually be a 24-char HEX id that's coming from MongoDB.
My best guess is that this is a problem with mgo not properly unmarshalling ObjectId objects, but this doesn't seem to be a problem for any of the devs running linux or macOS, so maybe it's just a Windows thing?
Any input would be appreciated!
No error here. bson.ObjectId has an underlying type of string:
type ObjectId string
But it is used to store 12 "arbitrary" bytes ("arbitrary" meaning they are not meant to be interpreted as runes, and they are not a valid UTF-8 encoded sequence). For humans it is usually displayed using the hex representation of its bytes.
Debuggers don't extend that convenience. They see that it's a string, so they attempt to display it as a string (even though it's not meant to be read that way). This is not a Windows-only thing; the Atom editor with the Delve debugger does the same on Linux too. Nothing to worry about.
If you print an ObjectId, it's usually the fmt package's "thing" to use its String() method to acquire the string value to be displayed. Debuggers do not necessarily do that.
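The effect is easy to reproduce outside Go. A small Python analogy (purely illustrative, nothing to do with mgo itself) of why 12 arbitrary bytes look like garbage as text but like the familiar 24-character id as hex:
import os

raw = os.urandom(12)  # 12 arbitrary bytes, like an ObjectId's payload
print(raw)            # displayed "as a string": mostly escape gibberish
print(raw.hex())      # displayed as hex: a familiar-looking 24-character id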

How to convert "binary text" to "visible text"?

I have a text file full of non-ASCII characters.
I can not detect the encoding by either file or enca.
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Windows Notepad++
Edit: the statement above is misleading, sorry for that.
In fact, I picked some parts of the original file, put them into a new text file, and opened it in Notepad++.
The two parts are shown below; Notepad++ decodes them in two different ways.
Question:
How can I detect the file's encoding under Linux?
How do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get a result from "grep 'сойдя' win.txt" even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.
The file content slice as follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf8.txt now.
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
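For completeness, a small Python sketch of the chardet-then-transcode route (filenames follow the question; treat the detected encoding as a guess, not a verdict):
import chardet

with open('non_ascii.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)           # e.g. {'encoding': 'windows-1251', 'confidence': 0.9, ...}
print(guess)

text = raw.decode(guess['encoding'])  # decode with the guessed encoding
with open('utf8.txt', 'w', encoding='utf-8') as out:
    out.write(text)                   # re-encode as UTF-8, like the iconv command above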

Troubles with text encoding

I'm having some troubles with text encoding. Parsing a website gives me a Data.Text string
"Project - Fran\195\167ois Dubois",
which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a Bytestring. The problem is that this yields garbled output:
"Project - François Dubois".
What am I missing here?
If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.
That's inconvenient, because Data.Text[.Lazy] is supposed to hold Unicode text (internally UTF-16 encoded), so the two code units 195 and 167 are interpreted as the Unicode code points 195 and 167, which are 'Ã' and '§' respectively. If you UTF-8-encode the text, these are converted to the byte sequences c383 ([195,131]) and c2a7 ([194,167]).
The most likely way for getting into this situation is that the data you got from the website was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin 1) encoded (or another 8-bit encoding; 8859-15 is widespread too).
The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].
If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
If avoiding the situation is not possible, the easiest ways of fixing it are
replacing the occurrences of the offending sequence with the desired one before encoding:
encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
assuming everything else is ASCII or inadvertent UTF-8 too, interpret the Text code units as bytes:
Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.
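The same repair is a one-liner in Python, for readers landing here from the Python questions above (an analogy, not part of the Haskell answer): re-encode the mis-decoded text as Latin-1 to recover the original bytes, then decode those bytes as UTF-8.
s = 'Fran\xc3\xa7ois'   # 'FranÃ§ois': UTF-8 bytes that were mis-read as Latin-1
fixed = s.encode('latin-1').decode('utf-8')
print(fixed)            # 'François'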
I am not completely sure whether less can show UTF-8 encoded characters properly; gVim can. You can check this link on SO to find out how to view UTF-8 data in gVim.
And regarding the other issue of being able to pass this to graphviz, I think you need to set the encoding on the command-line as explained in the Graph NonAscii FAQ.
From what you are explaining, I think there are no issues with how the data is being persisted. If you pass the encoding properly to graphviz, I think your problem will be resolved.
P.S: Creating an answer since it is easier to create descriptive links

Resources