bs4 can't recognize encoding in Python 3

I am trying to scrape a few pages using Python 3 for the first time. I have used Python 2 with bs4 many times without any trouble, but I can't seem to switch to Python 3, as I keep getting encoding errors.
For example, I am trying to scrape https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html
I have searched through a few threads here that have similar questions, without success.
Here is my code:
import requests
import bs4 as bs

r = requests.get('https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html')
r.encoding = r.apparent_encoding
soup = bs.BeautifulSoup(r.text, 'html5lib')
print(soup)
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xd7' in position 28935: ordinal not in range(128)
I also tried changing r.encoding = r.apparent_encoding to r.encoding = "utf-8", but got the same error.

You can change the encoding as follows; this should fix your error. Note that .encode('utf-8') turns the parsed soup into bytes, so print() no longer has to encode it for the console.
import requests
from bs4 import BeautifulSoup as BS

r = requests.get("https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html")
print(r.encoding)
soup = BS(r.content, 'html.parser').encode('utf-8')
print(soup)
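The traceback in the question points at print(), not at BeautifulSoup: an ASCII console codec cannot represent '\xd7'. A minimal stdlib-only sketch of why the answer's .encode('utf-8') works:

```python
# '\xd7' is the multiplication sign from the traceback.
ch = "\xd7"

try:
    # This is what printing to an ASCII console attempts:
    ch.encode("ascii")
except UnicodeEncodeError as e:
    print("ascii failed:", e.reason)

# Encoding to UTF-8 (what .encode('utf-8') on the soup does) succeeds
# and yields bytes, which print() shows without re-encoding:
print(ch.encode("utf-8"))  # b'\xc3\x97'
```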

Related

Unable to read wiki page by BeautifulSoup

I tried to read a wiki page using urllib and Beautiful Soup as follows.
I tried according to this.
import urllib.parse as parse, urllib.request as request
from bs4 import BeautifulSoup
name = "メインページ"
root = 'https://ja.wikipedia.org/wiki/'
url = root + parse.quote_plus(name)
response = request.urlopen(url)
html = response.read()
print (html)
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
print (soup)
The code runs without error, but the Japanese characters come out unreadable.
Your approach seems correct and works for me.
Try printing the parsed data with the following code and check the output.
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
some_japanese = soup.find('div', {'id': 'mw-content-text'}).text.strip()
print(some_japanese)
In my case, I get the following (this is part of the output):
ウィリアム・バトラー・イェイツ(1865年6月13日 - 1939年1月28日)は、アイルランドの詩人・劇作家。幼少のころから親しんだアイルランドの妖精譚などを題材とする抒情詩で注目されたのち、民族演劇運動を通じてアイルランド文芸復興の担い手となった。……
If this is not working for you, try saving the HTML content to a file and opening the page in a browser to check whether the Japanese text is fetched properly. (Again, it's working fine for me.)
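A minimal sketch of the save-to-file check suggested above; the inline string stands in for `html = response.read()` so it runs offline, and the key point is passing an explicit encoding to open():

```python
# Stand-in for the bytes returned by response.read().
html = "<p>メインページ</p>".encode("utf-8")

# Write with an explicit encoding; omitting it makes open() use the
# platform default, which may not be able to represent Japanese text.
with open("wiki_page.html", "w", encoding="utf-8") as f:
    f.write(html.decode("utf-8"))

# Reading it back with the same encoding round-trips the text.
with open("wiki_page.html", encoding="utf-8") as f:
    print(f.read())
```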

Why am I getting UnicodeEncode error?

I'm trying to write a small parsing script and testing the waters.
I am not sure why I am getting this error.
My code is:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('http://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&s_kw=english-real-madrid')
data = r.text.encode()
soup = bs(data,'html.parser')
print (soup.prettify())
and the error:
print (soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2153-2154: ordinal not in range(128)
However, if I use .encode() in my print line, it works fine.
I just want to be 100% sure I'm doing this correctly; I have zero experience parsing HTML/XML.
The solution is this:
from bs4 import BeautifulSoup as bs
import requests
req = requests.get('http://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&s_kw=english-real-madrid')
data = req.text
soup = bs(data,'html.parser')
print (soup.prettify('latin-1'))
with the help of this question
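Both workarounds rely on the same fact: print() only runs text through the console codec for str, not for bytes, and prettify('latin-1') returns an encoded bytestring. A stdlib-only sketch of the distinction (the sample string is made up for illustration):

```python
s = "El Cl\xe1sico"  # '\xe1' ('a' with acute) is outside ASCII but inside latin-1

# print(s) on an ASCII console raises UnicodeEncodeError, as in the question.
# Encoding first yields bytes, which print() shows via their repr without
# touching the console codec:
print(s.encode())           # UTF-8 bytes, like the .encode() workaround
print(s.encode("latin-1"))  # like prettify('latin-1'), which returns bytes
```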

json() on "requests" response raises UnicodeEncodeError

I'm querying Github's Jobs API with python3, using the requests library, but running into an error parsing the response.
Library: http://docs.python-requests.org/en/latest/
Code:
import requests
import json
url = 'https://jobs.github.com/positions.json?'
response = requests.get(url)
print(response.json())
Error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in
position 321: ordinal not in range(128)
Using this API in the past with Ruby, I've never run into this issue.
I also tried converting it to a dictionary but it resulted in the same errors.
There are other questions on SO about UnicodeEncodeError (mostly about opening files), but I'm not familiar with Python and didn't find them helpful.
First, check that the response is actually JSON. Try printing response.text and see if it looks like a valid JSON object.
Assuming it is JSON: it's hacky, but you can replace the non-ASCII characters with their escaped Unicode representation:
import json
import re

def escape_unicode(c):
    return c.encode('ascii', 'backslashreplace').decode('ascii')

response = ...
text = response.text
escaped = re.sub(r'[^\x00-\x7F]', lambda m: escape_unicode(m.group(0)), text)
json_response = json.loads(escaped)
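A quick offline check of the snippet above; the sample text stands in for response.text, and '\u2019' is the right single quote from the error message (the field name is made up for illustration):

```python
import json
import re

def escape_unicode(c):
    # backslashreplace turns each non-ASCII character into its \uXXXX escape
    return c.encode("ascii", "backslashreplace").decode("ascii")

text = '{"title": "Engineer\u2019s role"}'

escaped = re.sub(r"[^\x00-\x7F]", lambda m: escape_unicode(m.group(0)), text)
print(escaped)                       # pure ASCII: the quote becomes \u2019
print(json.loads(escaped)["title"])  # JSON decoding restores the character
```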

Python - Issue Scraping with BeautifulSoup

I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib as a personal project. I'm trying to collect the links to the 50 jobs listed on each page, using a regex to identify them. Even though I reference the tag properly, I am facing two specific issues:
Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for and removing an initial irrelevant link).
There's a difference between how the links are ordered in the source code and in my output.
Here's my code. Any help on this will be greatly appreciated:
import bs4
import urllib.request
import re
#Obtaining source code to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()
soup = bs4.BeautifulSoup(sauce, 'html.parser')
snippet = soup.find_all("script",type="application/ld+json")
strsnippet = str(snippet)
print(strsnippet)
joburls = re.findall('https://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)
print("Urls: ",joburls)
print(len(joburls))
Disclaimer: I did some asking of my own for a part of this answer.
from bs4 import BeautifulSoup
import requests
import json
# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(len(urls))
50
Process:
Use soup.find rather than soup.find_all. This gives a single bs4.element.Tag whose text is the JSON.
json.loads(s.text) returns a nested dict. Access the value of the itemListElement key to get the list of items, then collect each item's url into a list.
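The two steps above can be sketched offline; the two-item string below stands in for the ld+json script contents (structure assumed from schema.org's ItemList, which the real page embeds with 50 ListItem entries):

```python
import json

# Two-item stand-in for s.text, the contents of the ld+json <script> tag.
ld_json = """
{
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "url": "https://stackoverflow.com/jobs/1"},
    {"@type": "ListItem", "position": 2, "url": "https://stackoverflow.com/jobs/2"}
  ]
}
"""

# Parse the JSON and pull each item's url, as in the answer above.
urls = [el["url"] for el in json.loads(ld_json)["itemListElement"]]
print(len(urls))  # 2 here; 50 against the live page
print(urls)
```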

Python3 encode unicode

When I do a GET request with the requests library in Python 3, I get this response:
{"status":true,"data":[{"koatuu":7121586600,"zona":8,"kvartal":2,"parcel":501,"cadnum":"7121586600:08:002:0501","ownershipcode":100,"purpose":"\u0414\u043b\u044f \u0432\u0435\u0434\u0435\u043d\u043d\u044f \u0442\u043e\u0432\u0430\u0440\u043d\u043e\u0433\u043e \u0441\u0456\u043b\u044c\u0441\u044c\u043a\u043e\u0433\u043e\u0441\u043f\u043e\u0434\u0430\u0440\u0441\u044c\u043a\u043e\u0433\u043e \u0432\u0438\u0440\u043e\u0431\u043d\u0438\u0446\u0442\u0432\u0430","use":"\u0414\u043b\u044f \u0432\u0435\u0434\u0435\u043d\u043d\u044f \u0442\u043e\u0432\u0430\u0440\u043d\u043e\u0433\u043e \u0441\u0456\u043b\u044c\u0441\u044c\u043a\u043e\u0433\u043e\u0441\u043f\u043e\u0434\u0430\u0440\u0441\u044c\u043a\u043e\u0433\u043e \u0432\u0438\u0440\u043e\u0431\u043d\u0438\u0446\u0442\u0432\u0430","area":"1.3397","unit_area":"\u0433\u0430 ","ownershipvalue":null,"id_office":630}]}
How can I get readable (cp1252) letters in the response instead of the escape sequences?
My code is:
import requests
url = 'http://map.land.gov.ua/kadastrova-karta/get-parcel-Info?koatuu=7121586600&zone=08&quartal=002&parcel=0004'
page = requests.get(url)
print(page.text)
print(page.json())
print(page.json()) solved my problem :D
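That works because the \uXXXX sequences are plain JSON string escapes, and page.json() runs the body through a JSON decoder, which converts them back to real characters. A sketch with one field from the response (the escapes encode Ukrainian Cyrillic):

```python
import json

# One field from the response body; \u0433\u0430 is "га" (hectare).
raw = '{"unit_area": "\\u0433\\u0430 "}'

# json.loads (what Response.json() uses under the hood) decodes the escapes.
data = json.loads(raw)
print(data["unit_area"])  # га
```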