I am doing some scraping work with Python 3.6 and retrieved some URLs in strings following this format:
someURL = 'http:\u002F\u002Fsomewebsite.com\u002Fsomefile.jpg'
I have been trying to convert the Unicode escape sequences (\u002F, an escaped forward slash) in those strings in order to use the URLs (using regex methods, encode() on the strings, etc.), to no avail. The string still keeps the escape sequences and, if I pass it on to Requests' get(), for example, I am greeted by the following error message:
InvalidURL: Failed to parse: http:\u002F\u002Fsomewebsite.com\u002Fsomefile.jpg
I searched for solutions on this forum and others, but can't seem to put my finger on it. I am sure it is simple...
Use codecs.decode with the encoding named 'unicode-escape':
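import codecs
# Raw string, so the backslashes stay literal, as they do in the scraped data
someURL = r'http:\u002F\u002Fsomewebsite.com\u002Fsomefile.jpg'
print(codecs.decode(someURL, 'unicode-escape'))
# prints http://somewebsite.com/somefile.jpg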
I'm using Python + Camelot (OCR library) to read a PDF, clean up, and write to Excel or csv. There are some non-standard dashes that print out a weird character.
Using Camelot means I'm not calling "read_csv". It's coming from the PDF. A value that is supposed to be "1-4" prints out as 1–4.
I fixed this using a regular expression but a colleague mentioned I should standardize to UTF-8. I tried to do that for the header like this:
header = df.iloc[0, 1:].str.encode('utf-8')
but then that value becomes b'1\xe2\x80\x934'.
Any advice? The goal is to simply use standard text.
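A Python str is already Unicode text, so calling encode('utf-8') only turns it into bytes, which is why the b'...' value appears. If the goal is plain ASCII hyphens, one option is to map the dash-like characters to '-' directly. The sketch below is only an illustration: the DASHES table and the normalize_dashes helper are hypothetical names, and the set of characters to replace is an assumption about what the PDF contains.

```python
# Hypothetical mapping of dash-like code points to an ASCII hyphen.
DASHES = {
    0x2013: "-",  # en dash
    0x2014: "-",  # em dash
    0x2212: "-",  # minus sign
}


def normalize_dashes(text):
    """Return text with the dash-like characters above replaced by '-'."""
    return text.translate(DASHES)


value = "1\u20134"  # the string '1–4' with an en dash
cleaned = normalize_dashes(value)
print(cleaned)
```

With a pandas Series of strings, the same helper could presumably be applied with something like header.map(normalize_dashes).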
I am using Stackoverflow API to fetch some data but the response contains Unicode and some other encodings which I am unable to decode. Here's one sample text:
you can use numpy \u0026#39;\u0026#39;\u0026#39;
python
import numpy as np
np.random.random((pN,C,K))\u0026#39;\u0026#39;\u0026#39;
What's the best way to decode this response in Go?
The sample text is a Python answer I received as response from Stackoverflow API.
Several layers of encoding seem to be in place here:
first, the \u0026 could be a JSON-escaped & character
this gives &#39;, which seems to be an XML character reference for '
the resulting ''' seems to mark up the enclosed text to be taken literally including the line breaks. It is also possible that the python indicates the language for syntax highlighting of the enclosed text.
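The first two layers described above can be peeled off in order: JSON unescaping first, then character-reference unescaping. The question asks about Go, but the sketch below uses Python, like the other examples on this page, purely to illustrate the order of the steps. The raw_json sample is an assumption about how the text appears inside the API's JSON response.

```python
import html
import json

# Assumed shape of the value as it appears inside the JSON response body.
raw_json = r'"you can use numpy \u0026#39;\u0026#39;\u0026#39;"'

# Layer 1: JSON string escapes, so \u0026 becomes &
after_json = json.loads(raw_json)

# Layer 2: character references, so &#39; becomes '
plain = html.unescape(after_json)

print(after_json)
print(plain)
```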
Here is my code:
import re,urllib
from urllib import request, parse
def gh(url):
    html=urllib.request.urlopen(url).read().decode('utf-8')
    return html
def gi(x):
    r=r'src="(.+?\.jpg)"'
    imgre=re.findall(r, x)
    y=0
    for iu in imgre:
        urllib.request.urlretrieve(iu, '%s.jpg' %y)
        y=y+1
va=gh('http://tieba.baidu.com/p/3497570603')
print(gi(va))
When I run it, the following error occurs:
UnicodeEncodeError: 'ascii' codec can't encode character '\u65e5' in position 873: ordinal not in range(128)
I have decoded the content of the website with 'utf-8', which turns it into a string, so where does the 'ascii codec' problem come from?
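The problem is that the HTML content of http://tieba.baidu.com/p/3497570603 contains references to .png images. Because .+? can match quote characters, a match that starts at the src of a .png image runs on until the next .jpg, so the non-greedy regular expression matches long strings of text such as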
http://static.tieba.baidu.com/tb/editor/images/client/image_emoticon28.png" ><br><br><br><br>
...
title="蓝钻"><img src="http://imgsrc.baidu.com/forum/pic/item/bede9735e5dde711c981db20a0efce1b9f1661d5.jpg
Calling the urlretrieve() method with URLs consisting of long strings containing non-ASCII characters results in the UnicodeEncodeError being thrown while trying to convert the URL argument to ASCII.
A better regular expression to avoid matching too much text would be
r=r'src="([^"]+?\.jpg)"'
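The difference between the two patterns can be seen on a small sample. The HTML snippet and the example.com URLs below are made up for illustration and are not taken from the page in question.

```python
import re

# Made-up HTML: a .png image followed by a .jpg image.
sample = (
    '<img src="http://static.example.com/emoticon.png" ><br>'
    'title="蓝钻"><img src="http://img.example.com/photo.jpg">'
)

# Original pattern: .+? may cross quote characters.
loose = re.findall(r'src="(.+?\.jpg)"', sample)

# Suggested pattern: [^"]+? stops at the closing quote of the attribute.
strict = re.findall(r'src="([^"]+?\.jpg)"', sample)

print(loose)   # one long match that starts at the .png URL
print(strict)  # only the .jpg URL
```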
Debugging
In the spirit of teaching someone to fish rather than simply giving them a fish for one day, I’d recommend that you use print statements to debug problems such as this. I was able to diagnose this particular issue by replacing the urllib.request.urlretrieve(iu, '%s.jpg' %y) line with print(iu).
So I've got a string of:
YDNhZip1cDg1YWg4cCFoKg==
that needs to be decoded using Python's Base64 module.
I've written the code
import base64
test = 'YDNhZip1cDg1YWg4cCFoKg=='
print(test)
print(base64.b64decode(test))
which gives the answer
b'`3af*up85ah8p!h*'
when, according to the website decoders I've used, it's really
`3af*up85ah8p!h*
I'm guessing that it's decoding the additional quotes.
Is there some way that I can save this variable with a delimiter, as another type of variable, or run the b64encode on a section of the string, as slicing doesn't seem to work?
The b'...' wrapper is how Python displays a bytes object; it is not part of the data. See: What does the 'b' character do in front of a string literal?
i.e., it is decoding it correctly.
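If a str without the b'...' wrapper is wanted, the decoded bytes can be turned into text with bytes.decode. A minimal sketch, assuming the payload is ASCII text, as it is for this input:

```python
import base64

test = 'YDNhZip1cDg1YWg4cCFoKg=='

raw = base64.b64decode(test)  # a bytes object, shown by print as b'...'
text = raw.decode('ascii')    # a str with the same characters

print(raw)
print(text)
```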
I'm using a free web host but choosing not to work with any Python framework, and am stuck trying to print Chinese characters saved in the source file (using emacs to save file encoded in utf-8) to the resulting HTML page. I thought Unicode "just works" in Python 3.1 so I am baffled. I found three solutions that aren't working. I might just be missing a detail or two.
The host is Alwaysdata, and it has been straightforward to use, so I have little clue about details of how they put together the parts. All I do is upload or edit (with ssh) Python files to a www folder, change permissions, point a browser to the right URL, and it works.
My first attempt, which works in local IDLE (and also in the server's interactive Python shell, which makes it even more confusing that it fails when the output is passed to a browser):
#!/usr/bin/python3.1
mystr = "你好世界"
print("Content-Type: text/html\n\n")
print("""<!DOCTYPE html>
<html><head><meta charset="utf-8"></head>
<body>""")
print(mystr)
The error is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3:
ordinal not in range(128)
Then I tried
print(mystr.encode("utf-8"))
resulting in no error, but the following undesired output to the browser:
b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
Third, the following lines were added but got an error:
import sys
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
Finally, replacing print with f.write:
import codecs
f = codecs.open(sys.stdout, "w", "utf-8")
mystr = "你好世界"
...
f.write(mystr)
error:
TypeError: invalid file: <_io.TextIOWrapper name='<stdout>'
encoding='ANSI_X3.4-1968'>
How do I get the output to work? Do I need to use a framework for a quick fix?
It sounds like you are using CGI, which is a stupid API as it's using stdout, made for output to humans, to output to your browser. This is the basic source of your problems.
You need to encode it in UTF-8, and then write to sys.stdout.buffer instead of sys.stdout.
And after that, get yourself a webframework. Really, you'll be a lot happier.
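The suggestion to encode to UTF-8 and write to sys.stdout.buffer could look roughly like this. It is a sketch rather than a tested CGI deployment: write_page is a hypothetical helper, and the out parameter exists only so the function can be tried against an in-memory buffer instead of the real stdout.

```python
import io
import sys


def write_page(body_text, out=None):
    """Write a minimal CGI response as UTF-8 bytes to a binary stream."""
    if out is None:
        # In a real CGI script, write bytes to stdout's underlying buffer.
        out = sys.stdout.buffer
    page = (
        "Content-Type: text/html; charset=utf-8\n\n"
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head>\n'
        "<body>" + body_text + "</body></html>\n"
    )
    # Encoding explicitly avoids relying on the ASCII default of stdout.
    out.write(page.encode("utf-8"))
    out.flush()


# Demonstration against an in-memory buffer.
demo = io.BytesIO()
write_page("你好世界", demo)
```

In the CGI script itself, calling write_page(mystr) with no second argument would send the bytes to sys.stdout.buffer.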