json() on "requests" response raises UnicodeEncodeError - python-3.x

I'm querying Github's Jobs API with python3, using the requests library, but running into an error parsing the response.
Library: http://docs.python-requests.org/en/latest/
Code:
import requests
import json
url = 'https://jobs.github.com/positions.json?'
response = requests.get(url)
print(response.json())
Error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in
position 321: ordinal not in range(128)
Using this API in the past with Ruby, I've never run into this issue.
I also tried converting the response to a dictionary, but that resulted in the same error.
There are other questions on SO about UnicodeEncodeError (mostly about opening files), but I'm not familiar with Python and didn't find them helpful.

First, check that the response is actually JSON: print response.text and see if it looks like a valid JSON object.
Assuming it is JSON: it's hacky, but you can replace the non-ASCII characters with their escaped Unicode representations:
import re
import json

def escape_unicode(c):
    return c.encode('ascii', 'backslashreplace').decode('ascii')

response = ...
text = response.text
escaped = re.sub(r'[^\x00-\x7F]', lambda m: escape_unicode(m.group(0)), text)
json_response = json.loads(escaped)
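The escaping can be verified in isolation with a toy payload standing in for the API response (the string below is a made-up example, not real Jobs API output):

```python
import json
import re

def escape_unicode(c):
    # Replace one non-ASCII character with its \uXXXX escape sequence
    return c.encode('ascii', 'backslashreplace').decode('ascii')

# Toy payload containing the right single quote (U+2019) from the traceback
text = '{"title": "We\u2019re hiring"}'
escaped = re.sub(r'[^\x00-\x7F]', lambda m: escape_unicode(m.group(0)), text)
print(escaped)              # the apostrophe is now the literal text \u2019
print(json.loads(escaped))  # json.loads turns the escape back into the character
```

Because JSON itself understands \uXXXX escapes, the escaped string still parses to the original data; only the printed representation becomes pure ASCII.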

Related

Open URL with non-ASCII characters (emojis) in it as parameters with urllib

I'm making a Telegram bot that sends messages to a channel. Everything works fine until I try to send a message with a non-ASCII character (like an emoji) inside the URL's message parameter. Whenever I try to run something like this:
botMessage = '🚨'
urlRequest = f'https://api.telegram.org/bot{telegram_token}/sendMessage?chat_id={chat_id}&text={botMessage}'
urlRequest = urlRequest.replace(" ", "%20")
urllib.request.urlopen(urlRequest)
I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f6a8' in position 95: ordinal not in range(128)
Non-ASCII characters are forbidden in URLs. This is a limitation of the HTTP protocol and is not related to Telegram. Use the urllib.parse.quote function to percent-encode the UTF-8 bytes as ASCII, as follows:
import urllib.parse
import urllib.request

botMessage = urllib.parse.quote('🚨')
urlRequest = f'https://api.telegram.org/bot{telegram_token}/sendMessage?chat_id={chat_id}&text={botMessage}'
urllib.request.urlopen(urlRequest)
Note that quote also encodes spaces, so the replace(" ", "%20") call is no longer needed. There are many Python libraries for the Telegram Bot API; they are easy to use and hide these details.
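For reference, urllib.parse.quote percent-encodes each byte of the character's UTF-8 encoding, which is exactly what the URL needs:

```python
from urllib.parse import quote

# U+1F6A8 (the police-light emoji) is four bytes in UTF-8: F0 9F 9A A8
encoded = quote('🚨')
print(encoded)  # %F0%9F%9A%A8
```

The resulting string is pure ASCII, so it can be embedded safely in any URL.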

TypeError after running Python scraping code (O'Reilly example code)

I am following the example code from "O'Reilly Web Scraping with Python: Collecting More Data from the Modern Web" and it raises an error.
The versions are:
python3.7.3, BeautifulSoup4
The code is as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random
import datetime
import codecs
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
random.seed(datetime.datetime.now())
def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}',format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a',
        href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
links.encoding = 'utf8'
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)
TypeError: POST data should be bytes, an iterable of bytes, or a file
object. It cannot be of type str.
Looking at this rather old question, I see the problem is a typo (and I VTCed as such):
html = urlopen('http://en.wikipedia.org{}',format(articleUrl))
^
That comma (,) should be a dot (.); otherwise, according to the documentation, we are passing a second positional parameter, data:
data must be an object specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data. The supported object types include bytes, file-like objects, and iterables of bytes-like objects.
Note the last sentence; the function expects an iterable of bytes, but format() returns a string, thus the error:
TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
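The two call shapes can be compared without any network access; only the string formatting differs:

```python
articleUrl = '/wiki/Kevin_Bacon'

# With a comma, format(articleUrl) becomes a second positional argument,
# which urlopen treats as POST data:
#   urlopen('http://en.wikipedia.org{}', format(articleUrl))

# With a dot, str.format fills the placeholder as intended:
url = 'http://en.wikipedia.org{}'.format(articleUrl)
print(url)  # http://en.wikipedia.org/wiki/Kevin_Bacon
```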

bs4 can't recognize encoding python3

I am trying to scrape a few pages using Python 3 for the first time. I have used Python 2 with bs4 many times without any trouble, but I can't seem to switch to Python 3, as I keep getting encoding errors.
For example, I am trying to scrape https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html
I have searched through a few threads here that have similar questions, without success.
Here is my code:
import requests
import bs4 as bs

r = requests.get('https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html')
r.encoding = r.apparent_encoding
soup = bs.BeautifulSoup(r.text, 'html5lib')
print(soup)
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xd7' in position 28935: ordinal not in range(128)
I also tried to change r.encoding = r.apparent_encoding to r.encoding = "utf-8", getting the same error.
You can change the encoding as follows; this should fix your error.
import requests
from bs4 import BeautifulSoup as BS

r = requests.get("https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html")
print(r.encoding)
soup = BS(r.content, 'html.parser').encode('utf-8')
print(soup)
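The answer's .encode('utf-8') works because UTF-8 can represent every character, while the ASCII codec cannot. A minimal reproduction using the '\xd7' (multiplication sign) character from the traceback:

```python
s = 'score \xd7 2'  # contains U+00D7, the character from the error message

# Encoding to ASCII fails exactly as in the question's traceback:
try:
    s.encode('ascii')
    raised = False
except UnicodeEncodeError:
    raised = True

# Encoding to UTF-8 always succeeds:
encoded = s.encode('utf-8')
print(raised, encoded)
```

The original error appears when print() implicitly encodes the string for a terminal whose encoding is ASCII; producing UTF-8 bytes yourself sidesteps that step.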

Why am I getting UnicodeEncode error?

I'm trying to make a small parsing script and am testing the waters.
I am not sure why I am getting this error.
My code is:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('http://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&s_kw=english-real-madrid')
data = r.text.encode()
soup = bs(data,'html.parser')
print (soup.prettify())
and the error
print (soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2153-2154: ordinal not in range(128)
however if I use .encode() in my print line, it works fine.
I just want to be 100% sure I am accurate with this. Have 0 experience parsing HTML/XML
The solution is this:
from bs4 import BeautifulSoup as bs
import requests
req = requests.get('http://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&s_kw=english-real-madrid')
data = req.text
soup = bs(data,'html.parser')
print (soup.prettify('latin-1'))
with the help of this question
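A note on why adding .encode() makes print work: printing a bytes object shows its repr, which is pure ASCII, so the terminal's encoding never comes into play. A minimal sketch (the string is a made-up example of the accented text on that page):

```python
s = 'Mbapp\xe9'          # str containing a non-ASCII character (é)
b = s.encode('latin-1')  # bytes object; its repr escapes non-ASCII bytes
print(repr(b))           # b'Mbapp\xe9'
```

This is also what prettify('latin-1') does internally: given an encoding argument, it returns bytes rather than str.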

How do we use POST method in Python using urllib.request?

I have to use the POST method with urllib.request in Python and have written the following code:
values = {"abcd":"efgh"}
headers = {"Content-Type": "application/json", "Authorization": "Basic"+str(authKey)}
req = urllib.request.Request(url,values,headers=headers,method='POST')
response = urllib.request.urlopen(req)
print(response.read())
I am able to use 'GET' and 'DELETE' but not 'POST'. Could anyone help me out in solving this?
Thanks
If you really have to use urllib.request for POST, you have to:
Encode your data using urllib.parse.urlencode() (if sending a form)
Convert the encoded data to bytes
Specify the Content-Type header (application/octet-stream for raw binary data, application/x-www-form-urlencoded for forms, multipart/form-data for forms containing files, and application/json for JSON)
If you do all of this, your code should be like:
import urllib.parse
import urllib.request

req = urllib.request.Request(url,
    urllib.parse.urlencode(data).encode(),
    headers={"Content-Type": "application/x-www-form-urlencoded"}
)
response = urllib.request.urlopen(req).read()
(for forms)
or
import json
import urllib.request

req = urllib.request.Request(url,
    json.dumps(data).encode(),
    headers={"Content-Type": "application/json"}
)
response = urllib.request.urlopen(req).read()
(for JSON).
Sending files is a bit more complicated.
From urllib.request's official documentation:
For an HTTP POST request method, data should be a buffer in the
standard application/x-www-form-urlencoded format. The
urllib.parse.urlencode() function takes a mapping or sequence of
2-tuples and returns an ASCII string in this format. It should be
encoded to bytes before being used as the data parameter.
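The quoted requirement can be seen directly: urlencode returns an ASCII str in a=b&c=d form, which must then be encoded to bytes before being used as the data parameter:

```python
from urllib.parse import urlencode

data = {'abcd': 'efgh', 'count': 2}
body = urlencode(data)   # ASCII str: 'abcd=efgh&count=2'
print(body)
payload = body.encode()  # bytes, ready for Request's data parameter
```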
Read more:
Python - make a POST request using Python 3 urllib
RFC 7578 - Returning Values from Forms: multipart/form-data
You can use the requests module for this.
import requests
...
url = "https://example.com/"
print(url)
data = {'id': "1", 'value': 1}
r = requests.post(url, data=data)
print(r.text)
print(r.status_code, r.reason)
You can send requests without installing any additional packages.
Call this function with your input data and URL; the function will return the decoded response body.
from urllib import request
import json

def make_request(input_data, url):
    # dict to JSON string, then encode to bytes
    input_data = json.dumps(input_data).encode('utf-8')
    # the POST method is used whenever data is not None
    req = request.Request(url, data=input_data)
    return request.urlopen(req).read().decode('utf-8')
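The request body that the function builds can be inspected without any network call:

```python
import json

payload = {"abcd": "efgh"}
# Same transformation as inside make_request: dict -> JSON str -> UTF-8 bytes
body = json.dumps(payload).encode('utf-8')
print(body)  # b'{"abcd": "efgh"}'
```

Note that the data parameter only determines the body and the implicit switch to POST; if the server requires a Content-Type: application/json header, it still has to be set explicitly as in the earlier examples.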
