Here is my code:

import re
import urllib.request

def gh(url):
    html = urllib.request.urlopen(url).read().decode('utf-8')
    return html

def gi(x):
    r = r'src="(.+?\.jpg)"'
    imgre = re.findall(r, x)
    y = 0
    for iu in imgre:
        urllib.request.urlretrieve(iu, '%s.jpg' % y)
        y = y + 1

va = gh('http://tieba.baidu.com/p/3497570603')
print(gi(va))
When I run it, I get:
UnicodeEncodeError: 'ascii' codec can't encode character '\u65e5' in position 873: ordinal not in range(128)
I decoded the website content with 'utf-8', which turns it into a str, so where does the 'ascii' codec problem come from?
The problem is that the HTML content of http://tieba.baidu.com/p/3497570603 contains references to .png images, so the non-greedy regular expression still matches overly long strings of text such as
http://static.tieba.baidu.com/tb/editor/images/client/image_emoticon28.png" ><br><br><br><br>
...
title="蓝钻"><img src="http://imgsrc.baidu.com/forum/pic/item/bede9735e5dde711c981db20a0efce1b9f1661d5.jpg
Calling the urlretrieve() method with URLs consisting of long strings containing non-ASCII characters results in the UnicodeEncodeError being thrown while trying to convert the URL argument to ASCII.
A better regular expression to avoid matching too much text would be
r=r'src="([^"]+?\.jpg)"'
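To see the difference between the two patterns, here is a quick sketch; the HTML snippet below is a made-up stand-in for the page content, not the actual page:

```python
import re

# Hypothetical snippet mimicking the page: a .png reference followed by a .jpg
html = 'src="image_emoticon28.png" ><br>title="x"><img src="http://imgsrc.baidu.com/pic.jpg"'

# The original pattern: .+? happily crosses the closing quote of the .png URL
print(re.findall(r'src="(.+?\.jpg)"', html))
# → ['image_emoticon28.png" ><br>title="x"><img src="http://imgsrc.baidu.com/pic.jpg']

# The fixed pattern: [^"]+? cannot cross a quote, so only the real .jpg URL matches
print(re.findall(r'src="([^"]+?\.jpg)"', html))
# → ['http://imgsrc.baidu.com/pic.jpg']
```

Because `[^"]` excludes the quote character, the match can never spill past the end of one attribute value into the next.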
Debugging
In the spirit of teaching someone to fish rather than simply giving them a fish for one day, I’d recommend that you use print statements to debug problems such as this. I was able to diagnose this particular issue by replacing the urllib.request.urlretrieve(iu, '%s.jpg' %y) line with print(iu).
Related
I am using the Stack Overflow API to fetch some data, but the response contains Unicode escapes and other encodings that I am unable to decode. Here's one sample text:
you can use numpy \u0026#39;\u0026#39;\u0026#39;
python
import numpy as np
np.random.random((pN,C,K))\u0026#39;\u0026#39;\u0026#39;
What's the best way to decode this response in Go?
The sample text is a Python answer I received as response from Stackoverflow API.
Several layers of encoding seem to be in place here:
first, the \u0026 could be a & character
this gives &#39;, which is an HTML/XML character reference for '
the resulting ''' appears to fence the enclosed text as a literal block, including the line breaks; the python on the following line likely indicates the language for syntax highlighting of the enclosed text.
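The question asks about Go, but to keep all examples in one language, here is the same two-layer unwrapping sketched in Python; `codecs.decode(..., 'unicode-escape')` and `html.unescape` are standard-library calls, and the sample string is taken from the question:

```python
import codecs
import html

raw = r'np.random.random((pN,C,K))\u0026#39;\u0026#39;\u0026#39;'

# Layer 1: resolve the \uXXXX escapes, turning \u0026 into a real '&'
step1 = codecs.decode(raw, 'unicode-escape')  # ends in &#39;&#39;&#39;
# Layer 2: resolve the HTML character references (&#39; -> ')
step2 = html.unescape(step1)

print(step2)  # np.random.random((pN,C,K))'''
```

The equivalent Go code would perform the same two steps in the same order: unescape the `\uXXXX` sequences first, then decode the HTML entities.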
Here's a problem I am facing with encoding and decoding texts.
I am trying to write code that finds a 'string' or 'bytes' in a file and returns the path of the file.
The files I am opening appear to have 'windows-1252' (cp1252) encoding, so I have been trying to:
1. encode my string into a byte corresponding to the encoding of the file
2. match the file and get the path of that file
I have a file, say 'f', that has the 'windows-1252' (cp1252) encoding. It includes some text in Chinese: '[跑Online農場]'
with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    print(encoding(text))  # encoding() is a separate function that I wrote that returns the encoding of the file
    print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...
As you can see, the raw bytes for [跑Online農場] are [\xb6]Online\xb9A\xb3\xf5]
However, the funny thing is that if I literally convert the string into bytes, I get:
enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>
On the other hand, opening the file using
with open(os.path.join(root, filename), mode='r', encoding='cp1252') as f ...
I get:
StaticText [¶]Online¹A³õ] €?‹ Œ î...
and I am not sure how '[跑Online農場]' gets 'translated' into '[¶]Online¹A³õ]'. An answer to this may also solve the problem.
What should I do to correctly 'encode' the Chinese/Foreign characters so that it matches the 'rb' bytes that the Python returns?
Thank you!
Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.
Note: guessing the encoding of a given byte string is always approximate.
There's no safe way of determining the encoding for sure.
If you have a byte string like
b'[\xb6]Online\xb9A\xb3\xf5]'
and you know it must translate (be decoded) into
'[跑Online農場]'
then what you can do is trial and error with a few codecs.
I did this with the list of codecs supported by Python, searching for codecs for Chinese.
When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'
When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'
So: use CP-950 for reading the file.
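To get back to the original goal (finding the text inside the raw file contents), it should then be enough to encode the search string with the same codec and search the bytes directly; the sketch below uses only the values from the question:

```python
# Encode the search text with the file's actual codec (CP950, not CP1252)
needle = '[跑Online農場]'.encode('cp950')
print(needle)  # b'[\xb6]Online\xb9A\xb3\xf5]'

# Searching the raw file bytes then works without any UnicodeEncodeError
data = b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00'
print(needle in data)  # True
```

This avoids decoding the whole file at all: you match bytes against bytes, which is exactly what the `mode='rb'` read gives you.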
I am doing some scraping work with Python 3.6 and retrieved some URLs in strings following this format:
someURL = 'http:\u002F\u002Fsomewebsite.com\u002Fsomefile.jpg'
I have been trying to convert the Unicode escape sequences (\u002F, an escaped forward slash) in those strings in order to use the URLs (using regex methods, encode() on the strings, etc.), to no avail. The string still keeps the escape sequences and, if I pass it on to Requests' get(), for example, I am greeted by the following error message:
InvalidURL: Failed to parse: http:\u002F\u002Fsomewebsite.com\u002Fsomefile.jpg"
I searched for solutions in this forum and others, but can't seem to put my finger on it. I am sure it is simple...
Use codecs.decode with the encoding named 'unicode-escape':
import codecs
print(codecs.decode(someURL, 'unicode-escape'))
# prints 'http://somewebsite.com/somefile.jpg'
So I've got the string:
YDNhZip1cDg1YWg4cCFoKg==
that needs to be decoded using Python's base64 module.
I've written the code
import base64
test = 'YDNhZip1cDg1YWg4cCFoKg=='
print(test)
print(base64.b64decode(test))
which gives the answer
b'`3af*up85ah8p!h*'
when, according to the website decoders I've used, it's really
`3af*up85ah8p!h*
I'm guessing that it's decoding the additional quotes.
Is there some way that I can save this variable with a delimiter, as another type of variable, or run the b64encode on a section of the string, as slicing doesn't seem to work?
The b'...' prefix is Python's way of marking a bytes literal, as opposed to a str; see: What does the 'b' character do in front of a string literal?
In other words, it is decoding correctly.
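If you want a plain str rather than a bytes object, decode the result; since Base64 here wraps ASCII text, 'ascii' is a safe codec choice:

```python
import base64

test = 'YDNhZip1cDg1YWg4cCFoKg=='
decoded = base64.b64decode(test).decode('ascii')  # bytes -> str
print(decoded)  # `3af*up85ah8p!h*
```

The `b''` wrapper disappears because it was never part of the data, only Python's repr of the bytes object.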
I created a CGI script (running on localhost with Apache) that loads text from a textarea so I can work with it. I have problems with characters like š, ť, é,..: they are not displayed correctly. I have tried many approaches. Here is one short version of my code, in which I am just looking for the right way to deal with it.
#!C:/Python33/python
# -*- coding: UTF-8 -*-
import cgi
import cgitb
cgitb.enable()

form = cgi.FieldStorage()
if form.getvalue('textcontent'):
    text_content = form.getvalue('textcontent')
else:
    text_content = ""

print("Content-type:text/html")
print()
print("<!DOCTYPE html>")
print("<html>")
print("<head>")
print("<meta charset='UTF-8'></meta>")
print("</head>")
print("<body>")
print("<form>")
print("text_area:<br />")
print("<textarea name='textcontent' rows='5' cols='20'></textarea>")
print("<br />")
print("<input type='submit' value='submit form' />")
print("</form>")
print("<p>")
print(text_content)
print("</p>")
print("</body>")
print("</html>")
This version uses UTF-8; when I type something into the textarea and submit, it comes out like this:
čítam -> ��tam
When I use latin-1 as the Python encoding and utf-8 as the charset in the HTML part, it works like this:
časa -> časa (correctly)
but with accented characters (for example áno) it returns an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\\ufffd' in position 0: character maps to <undefined>\r
sys.stdout.encoding reports cp1250 (I work under Windows), while sys.getdefaultencoding() returns utf-8.
I also tried text_content = (form.getvalue('textcontent')).encode('utf-8'); for example, for the word číslo the result is b'\xef\xbf\xbd\xef\xbf\xbdslo'
I don't know how to handle this problem.
I need číslo -> číslo, for example.
UPDATE: Now I have UTF-8 as both the Python and HTML encoding. Working with the text (comparing words with the dictionary, ...) goes well, so the only remaining problem is that the output looks like ��tam; I need it to look like čítam instead.
UPDATE 2: When the encoding is UTF-8 and the browser uses UTF-8 too, it displays �s. When I change the browser encoding to cp1250 it displays correctly, but when I refresh the page or click the Submit button it raises UnicodeEncodeError: 'charmap' codec can't encode character '\\ufffd'
UPDATE 3: I tried it on Linux and, after a few problems, found that the Apache server is using the wrong encoding (ascii), but I haven't been able to fix this yet. I modified /etc/apache2/envvars to set LANG="sk_SK.UTF-8" but got a warning from gedit in the terminal that the edit was not good. So the encoding is still ascii.
Write your form this way:
<form accept-charset="utf-8">
Putting accept-charset="utf-8" in your forms can solve this problem.