urlopen() throwing error in python 3.3 - python-3.x

from urllib.request import urlopen
def ShowResponse(param):
    uri = str("mysite.com/?param="+param+"&submit=submit")
    print(urlopen(uri).read())
file = open("myfile.txt","r")
if file.mode == "r":
    filelines = file.readlines()
    for line in filelines:
        line = line.strip()
        ShowResponse(line)
This is my Python code, but when I run it, it raises the error "UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-49: ordinal not in range(128)".
I don't know how to fix this. I'm new to Python.

I'm going to assume that myfile.txt contains non-ASCII characters (for example, UTF-8 encoded text) and that the stack trace ends inside the urlopen() call.
In Python 3, str is already Unicode, so building the uri string itself succeeds. The error most likely occurs when urlopen() tries to encode the request as ASCII, and ASCII simply cannot represent your characters.
URIs (including the query string) must represent non-ASCII characters as percent-encoded UTF-8 bytes. Example:
€ (EURO SIGN) encoded in UTF-8 is:
0xE2 0x82 0xAC
Percent-encoded, it's:
%E2%82%AC
Therefore, your code needs to encode your parameter as UTF-8 and then percent-encode it:
from urllib.request import urlopen
from urllib.parse import quote
def ShowResponse(param):
    # Encode the text as UTF-8 bytes, then percent-encode those bytes.
    param_utf8 = param.encode("utf-8")
    param_perc_encoded = quote(param_utf8)
    # or uri = str("mysite.com/?param="+param_perc_encoded+"&submit=submit")
    uri = str("mysite.com/?param={0}&submit=submit".format(param_perc_encoded) )
    print(urlopen(uri).read())
You'll also see I've changed your uri = definition slightly to use str.format() (https://docs.python.org/2/library/string.html#format-string-syntax), which I find easier for building complex strings than concatenation with +. In this example, {0} is replaced with the first argument to .format().
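As a quick check of the euro-sign example above, here is a minimal sketch using urllib.parse. It assumes Python 3, where quote() encodes str input as UTF-8 by default. The parameter names are the ones from the question, and urlencode() is shown only as an alternative way to build the whole query string.

```python
from urllib.parse import quote, urlencode

euro = "\u20ac"  # EURO SIGN

# UTF-8 bytes of the character: 0xE2 0x82 0xAC
utf8_bytes = euro.encode("utf-8")

# quote() percent-encodes each UTF-8 byte of a non-ASCII character.
percent_encoded = quote(euro)

# urlencode() builds a full query string and quotes each value for you.
query_string = urlencode({"param": euro, "submit": "submit"})

print(utf8_bytes.hex(), percent_encoded, query_string)
```

Using urlencode() also takes care of characters such as & or = inside a value, which plain string formatting would leave unescaped.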

Related

Raw string encoding from input()

I'm trying to scan a directory whose path is read with input(), yet the program raises this error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 320-321: truncated \UXXXXXXXX escape
Current code: Not Working
import os
import time
print("Enter directory name.\nDirectory name example:\nC:\Users\example\Documents")
dirname = input()
with os.scandir(dirname) as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        file_name = os.path.basename(entry)
        my_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(info.st_ctime))
        rawmb = (info.st_size/(1024*1024))
        truncated = round(rawmb, 3)
        print(file_name)
        print(my_time)
        print(truncated,"MB")
        print('===========================')
I considered using \\ or / instead of \ during the input, but that makes copying the directory from the file explorer impossible.
I have no idea how to put an r in front of the input() string.
.decode and .encode didn't seem to work for me either, but I most likely just used them wrong.
Edit #1
Tried the solution from J_H
Do this after input(): for ch in dirname: print(ch, ord(ch))
Result:
Same error.
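One likely explanation, assuming the error is reported before the program ever asks for input: the message comes from the string literal passed to print(), not from input(). In a normal (non-raw) literal, \U starts an eight-digit Unicode escape, so "C:\Users" fails to compile, whereas text typed at an input() prompt is never escape-processed. The sketch below uses compile() only to demonstrate this; the names bad_source and good_source are illustrative.

```python
# Source text containing a normal string literal with "\U" in it.
bad_source = 'example = "C:\\Users\\example\\Documents"'
# The same literal with an r prefix (raw string): backslashes stay literal.
good_source = 'example = r"C:\\Users\\example\\Documents"'

try:
    compile(bad_source, "<demo>", "exec")
    error_message = None
except SyntaxError as exc:
    error_message = str(exc)

# The raw-string version compiles and yields the path unchanged.
namespace = {}
exec(compile(good_source, "<demo>", "exec"), namespace)

print(error_message)
print(namespace["example"])
```

If that is the cause here, doubling the backslashes in the example path inside the prompt (C:\\Users\\example\\Documents) should be enough, and the value returned by input() can be passed to os.scandir() unchanged.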

'charmap' codec can't encode characters in position XX

I have a simple script that is attempting to extract multiple JSON objects from a single file and store them as a list:
import json
URL = r"C:\Users\Kenneth\Youtube_comment_parser\Testing.txt"
with open(URL, 'r', encoding="utf-8") as handle:
    json_data = [json.loads(line) for line in handle]
print(json_data) # Can't .encode() because it's a list
Even after specifying utf-8 encoding, I'm still running into a codec error. If possible, I would also like to change this object into a dictionary, but this is as far as I've got.
The exact error reads:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 394-395: character maps to <undefined>
Thanks in advance.
I was able to solve this issue by removing the one Unicode character that was producing "<undefined>", the string '\ufeff', and then the rest displayed nicely. This required me to iterate over the keys in the list of dictionaries and replace as necessary.
import json
URL = r"C:\Users\Kenneth\Youtube_comment_parser\Testing.txt"
json1_file = open(URL, encoding='utf-8')
json1_str = json1_file.read()
json1_str = [d.strip() for d in json1_str.splitlines()]
json1_data = [json.loads(i) for i in json1_str]
json1_data = [{key:value.replace(u'\ufeff', '') for
               key, value in json1_data[index].items()} for
              index in range(len(json1_data))]
print(json1_data[1]['text'].encode('utf-8'))
Still not sure why I have to open with utf-8 and then encode again with my print statement, but it produced the string nicely.
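On that last point: open(..., encoding='utf-8') controls how the file's bytes are decoded into str, while print() separately encodes that str for the console, using the console's own encoding (a legacy code page on many Windows setups). Here is a minimal sketch, assuming the '\ufeff' came from a byte order mark at the start of the file: the "utf-8-sig" codec drops a leading BOM while decoding. The sample data and the temporary file are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample file: one JSON object per line, saved with a UTF-8 BOM.
sample_bytes = b'\xef\xbb\xbf{"text": "first"}\n{"text": "second"}\n'

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample_bytes)
    sample_path = tmp.name

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
with open(sample_path, "r", encoding="utf-8-sig") as handle:
    records = [json.loads(line) for line in handle if line.strip()]

os.remove(sample_path)

# ascii() escapes non-ASCII characters, so this print works on any console.
print(ascii(records))
```

A '\ufeff' that appears in the middle of the data is not removed by this codec, so the replace() approach above would still be needed in that case.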

Degree character (°) encoding/decoding

I am on Windows and I use Python 3.
I have a text file which contains degree characters (°).
I want to read the whole text file, do some processing and write it back with the modifications applied. Here is a sample of my code:
with io.open('myTextFile.txt',encoding='ASCII') as f:
    for item in allItem:
        i=0
        myData = pd.DataFrame(data=np.zeros((n,1)))
        for line in f:
            myRegex = "(AD"+item+")"
            if re.match(myRegex,line):
                myData.loc[i,0] = line
                i+=1
        myData = myData[(myData.T != 0).any()]
        myData = myData.append(pd.DataFrame(["\n"],index=[myData.index[-1]+1]))
        myData = myData[0].map(lambda x: x.strip()).to_frame()
        myData.to_csv('myModifiedTextFile.txt', header = False, index = False, mode='a', quoting=csv.QUOTE_NONE, escapechar=' ', encoding = 'ASCII')
However, I am getting Unicode errors even though I tried specifying the encoding:
'ascii' codec can't decode byte 0xe9 in position 512: ordinal not in range(128)
ASCII is not very useful here, since it only knows 128 characters (the ones in the standard ASCII table), and the degree sign is not among them. I am unsure what the actual encoding of your file is; Unicode and the commonly used Windows code pages (1250/1252) have the degree sign at 0xB0.
I assume that the byte at position 512 in your file is a degree sign and that it is causing the error. If so, you need to be more specific with your encoding argument: figure out which code page or encoding was used to save the file, and confirm it by looking up that code page and finding the degree sign at 0xE9.
If there is a different character at position 512 ("é" is a good candidate), then simply specify an encoding like cp1250, cp1252, or cp1257.
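To see what the candidate code pages make of the two byte values discussed above, here is a small sketch. The byte string is a made-up sample, not the contents of the actual file; it only assumes that 0xB0 and 0xE9 appear somewhere in the data.

```python
# Hypothetical sample bytes: 0xB0 and 0xE9 are both outside the ASCII range.
raw = b"temp 25\xb0 caf\xe9"

# Decoding as ASCII fails on the first byte above 0x7F.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# The Windows code pages named above each map every byte in the sample.
decoded = {name: raw.decode(name) for name in ("cp1250", "cp1252", "cp1257")}

print(ascii_error)
for name, text in decoded.items():
    print(name, ascii(text))
```

Once the right code page is known, it would replace 'ASCII' in both the io.open() call and the to_csv() call.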

Remove non utf-8 characters from string in python

I am attempting to read in tweets and write these tweets to a file. However, I am getting UnicodeEncodeErrors when I try to write some of these tweets to a file. Is there a way to remove these non utf-8 characters so I can write out the rest of the tweet?
For example, a problem tweet may look like this:
Camera? 🎥
This is the code I am using:
with open("Tweets.txt",'w') as f:
    for user_tws in twitter.get_user_timeline(screen_name='camera',
                                              count = 200):
        try:
            f.write(user_tws["text"] + '\n')
        except UnicodeEncodeError:
            print("skipped: " + user_tws["text"])
            mod_tw = user_tws["text"]
            mod_tw=mod_tw.encode('utf-8','replace').decode('utf-8')
            print(mod_tw)
            f.write(mod_tw)
The error is this:
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3a5' in position 56: character maps to <undefined>
You are not writing a UTF-8 encoded file. Add the encoding parameter to the open() call:
with open("Tweets.txt",'w', encoding='utf8') as f:
    ...
Have fun 🎥
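A minimal sketch of why this works, assuming the default file encoding on the asker's machine is a legacy code page such as cp1252 (the 'charmap' in the error suggests one). The emoji is valid Unicode and encodes fine as UTF-8; it just has no slot in cp1252. The temporary path is hypothetical and used only for the demonstration.

```python
import os
import tempfile

tweet = "Camera? \U0001F3A5"  # sample text from the question

# cp1252 is a common default on Windows; it has no mapping for the emoji.
try:
    tweet.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# Hypothetical temporary path, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "Tweets.txt")

# With an explicit UTF-8 encoding the write succeeds.
with open(path, "w", encoding="utf-8") as f:
    f.write(tweet + "\n")

with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(fits_cp1252, ascii(round_trip))
```

With the encoding set on open(), the try/except fallback in the original loop should no longer be needed for this kind of character.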

Python 3, UnicodeEncodeError with decode set to ignore

This code makes an http call to a solr index.
query_uri = prop.solr_base_uri + "?q=" + query + "&wt=json&indent=true"
with urllib.request.urlopen(query_uri) as response:
    data = response.read()
    #data is bytes
    data_str=data.decode('utf-8', 'ignore')
    print(data_str)
The print statement throws:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2715' in position 149273: character maps to <undefined>
I thought decode('utf-8', 'ignore') was supposed to ignore non-UTF-8 characters and leave them out of the result? How is it that I get a UnicodeEncodeError in the print statement? How do I handle characters that can't be encoded? Thanks!
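The error is caused by print() (and any file.write() on a file opened without an explicit encoding) encoding the text with the default encoding of the output stream. The 'charmap' in the message points to a legacy code page (typically that of the Windows console), which has no mapping for '\u2715'. The decode('utf-8', 'ignore') call succeeded; it is the later encoding for output that fails.
The recommended approach is to set PYTHONIOENCODING=UTF-8 in your environment, or to encode each string before printing (note that this prints the bytes representation, b'...'):
print(data_str.encode("utf-8"))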
For file writing, set the encoding for the file when you open it:
file = open("/temp/test.txt", "w", encoding="UTF-8")
file.write('\u2715')
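If changing the environment is not an option, another approach is to replace the characters the output encoding cannot represent before printing. The helper below is a hypothetical sketch, not part of any library; the name printable and its fallback to "utf-8" are assumptions made for illustration.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to UTF-8 if stdout does not report an encoding.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)

print(printable("close \u2715"))
```

On a UTF-8 terminal the text passes through unchanged; on a cp1252 console the '\u2715' is shown as a question mark instead of raising an error.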
