Why am I getting UnicodeEncode error? - python-3.x

I'm trying to write a small parsing script to test the waters.
I'm not sure why I'm getting this error.
My code is:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('http://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&s_kw=english-real-madrid')
data = r.text.encode()
soup = bs(data,'html.parser')
print (soup.prettify())
and the error is:
print (soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2153-2154: ordinal not in range(128)
However, if I use .encode() in my print line, it works fine.
I just want to be 100% sure I'm doing this correctly. I have no experience parsing HTML/XML.

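The solution is to pass an encoding to prettify(), which then returns bytes instead of a string that print() would have to encode: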
from bs4 import BeautifulSoup as bs
import requests
req = requests.get('http://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&s_kw=english-real-madrid')
data = req.text
soup = bs(data,'html.parser')
print (soup.prettify('latin-1'))
I found it with the help of this question.
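For context, the traceback comes from print() encoding the string for a terminal whose encoding is ASCII, not from BeautifulSoup itself. Here is a minimal sketch of how the codecs differ, using made-up sample strings rather than the actual page content; the helper name try_encode is hypothetical:

```python
# Sample strings are illustrative, not taken from the page in the question.
accented = "Real Madrid \u2013 Atl\u00e9tico"  # contains an en dash and an e-acute
latin_only = "Atl\u00e9tico"                    # e-acute exists in Latin-1


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


print(try_encode(latin_only, "ascii"))    # None: e-acute is outside ASCII
print(try_encode(latin_only, "latin-1"))  # b'Atl\xe9tico'
print(try_encode(accented, "latin-1"))    # None: the en dash is not in Latin-1
print(try_encode(accented, "utf-8"))      # works: UTF-8 covers the Unicode range
```

A plain str.encode('latin-1') can still fail on characters outside Latin-1, so UTF-8 is usually the safer choice when the page content is unknown.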

Related

Beautiful Soup find td by id why isn't this working

I'm trying to get the Real Estimate price, i.e. the 187.40, from
https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/?type_recherche=rapide&mots=MSFT
It is in the HTML element td#zbjsfv_dr
So I did the following using Beautiful Soup:
Comp = soup.find("td", id="zbjsfv_dr")
print(Comp)
But this isn't returning anything, and I don't understand why.
I think something is wrong with how you fetch or parse the page, because I can get the value of the td with id=zbjsfv_dr. You didn't share all of your code, so this is just an example:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/?type_recherche=rapide&mots=MSFT')
source = BeautifulSoup(r.content,'html')
comp = source.find("td", id="zbjsfv_dr")
print(comp.text)
OUTPUT:
188.085
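Since find() returns None when nothing matches, calling .text on the result would raise AttributeError. A small sketch that guards against that, using an inline HTML snippet as a stand-in for the real page (the snippet and the helper name cell_text are made up for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in markup; the real page is fetched with requests in the answer above.
html = '<table><tr><td id="zbjsfv_dr">188.085</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")


def cell_text(soup, cell_id):
    """Return the stripped text of the td with the given id, or None if absent."""
    cell = soup.find("td", id=cell_id)
    return cell.get_text(strip=True) if cell is not None else None


print(cell_text(soup, "zbjsfv_dr"))   # 188.085
print(cell_text(soup, "missing_id"))  # None
```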

BeautifulSoup prints none even the content is there

I am trying to build a Hacker News scraper, but when I run my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://news.ycombinator.com/")
soup = BeautifulSoup(res.text,'html.parser')
print(soup.find(id="score_23174015"))
I don't understand why BeautifulSoup keeps returning None. I'm still learning, and I'm new to Python 3 as well.
I checked the URL, and there is no element with id="score_23174015" on the page.
Anyway, try this code if you want to find an element by its attributes:
soup.find(attrs = {'id':"score_23167794"})
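Because the ids on a front page change as stories rotate, a hard-coded id may simply no longer exist. One possible approach, sketched on made-up markup (the ids and text below are illustrative, not copied from the site), is to match every id with a given prefix using a regular expression:

```python
import re
from bs4 import BeautifulSoup

# Stand-in markup loosely modelled on score spans; ids are illustrative.
html = """
<span class="score" id="score_101">12 points</span>
<span class="score" id="score_102">34 points</span>
"""
soup = BeautifulSoup(html, "html.parser")

missing = soup.find(id="score_999")                # None: no such id in the markup
scores = soup.find_all(id=re.compile(r"^score_"))  # every id starting with score_

print(missing)
print([s.get_text() for s in scores])
```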

Beautifulsoup response does not match with view source code output

While comparing the response from my code with the page source in Chrome, I noticed that the response parsed by BeautifulSoup does not match the page source. I want to fetch the elements with class="rc". I can see that class in the page source, but I cannot find it in the printed response. I tried both "lxml" and "html.parser" too.
I am a beginner in Python, so my question might sound basic. I have already checked a few articles related to my problem (BeautifulSoup returning different html than view source) but could not find a solution.
Below is my code:
import sys, requests
import re
import docx
import webbrowser
from bs4 import BeautifulSoup
query = sys.argv
url = "https://google.com/search?q=" + "+".join(query[1:])
print(url)
res = requests.get(url)
# print(res[:1000])
if res.status_code == 200:
    soup = BeautifulSoup(res.text, "html5lib")
    print(type(soup))
    all_select = soup.select("div", {"class": "rc"})
    print("All Select ", all_select)
I had the same problem. Try using another parser, such as "lxml", instead of "html5lib".
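One thing worth checking in the question's code, independent of the parser: select() takes a CSS selector string, so the class filter belongs in the selector itself, while the attribute-dict form belongs to find_all(). A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Stand-in markup; not what the search page actually returns.
html = '<div class="rc">result</div><div class="other">noise</div>'
soup = BeautifulSoup(html, "html.parser")

by_selector = soup.select("div.rc")                  # CSS selector form
by_find_all = soup.find_all("div", {"class": "rc"})  # attribute-dict form

print([d.get_text() for d in by_selector])
print([d.get_text() for d in by_find_all])
```

Both calls return only the div whose class is rc.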

bs4 can't recognize encoding python3

I am trying to scrape a few pages using Python3 for the first time. I have used Python2 many times with bs4 without any trouble, but I can't seem to be able to switch to python3, as I am always getting encoding errors.
For example, I am trying to scrape https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html
I have searched through a few threads here that have similar questions, without success.
Here is my code:
import bs4 as bs
import requests
r = requests.get('https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html')
r.encoding = r.apparent_encoding
soup = bs.BeautifulSoup(r.text,'html5lib')
print(soup)
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xd7' in position 28935: ordinal not in range(128)
I also tried to change r.encoding = r.apparent_encoding to r.encoding = "utf-8", getting the same error.
You can change the encoding as follows. This should fix your error.
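import requests
from bs4 import BeautifulSoup as BS
r = requests.get("https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html")
print(r.encoding)
# encode() returns UTF-8 bytes, so print() does not have to encode to ASCII
soup = BS(r.content, 'html.parser').encode('utf-8')
print(soup)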
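Note that printing bytes shows a b'...' literal rather than readable text. An alternative is to leave the soup as a string and fix the output side instead: on Python 3.7+ sys.stdout.reconfigure(encoding="utf-8") or the PYTHONIOENCODING environment variable can change the terminal encoding. The sketch below shows a third option, writing UTF-8 bytes straight to the stream's binary buffer. The helper name write_utf8 is hypothetical, and an in-memory stream stands in for an ASCII-only terminal:

```python
import io
import sys


def write_utf8(text, stream=None):
    """Write text as UTF-8 bytes to the stream's underlying binary buffer."""
    stream = sys.stdout if stream is None else stream
    stream.flush()  # keep ordering with earlier text writes
    stream.buffer.write(text.encode("utf-8") + b"\n")
    stream.buffer.flush()


# Simulate a terminal whose encoding is ASCII, as in the traceback above.
raw = io.BytesIO()
ascii_stdout = io.TextIOWrapper(raw, encoding="ascii")

try:
    print("3 \xd7 4", file=ascii_stdout)
    failed = False
except UnicodeEncodeError:
    failed = True  # the ASCII codec cannot represent '\xd7'

# Bypassing the text layer avoids the ASCII codec entirely.
write_utf8("3 \xd7 4", ascii_stdout)
```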

BeautifulSoup parser and Cyrillic characters

Hi guys!
I'm trying to parse this URL http://mapia.ua/ru/search?&city=%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B5%D0%B2&page=1&what=%D0%BE%D0%BE%D0%BE using BeautifulSoup.
But I get strange characters like this: ��� �1 ��� "����"
Here is my code:
from bs4 import BeautifulSoup
import urllib.request
URL = urllib.request.urlopen('http://mapia.ua/ru/search?city=%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B5%D0%B2&what=%D0%BE%D0%BE%D0%BE&page=1').read()
soup = BeautifulSoup(URL, 'html.parser')
print(soup.h3.get_text())
Can anybody help me?
P.S. I'm using python 3
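I found this:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
# drop bytes that are not valid UTF-8, then hand clean UTF-8 bytes to the parser
soup = BeautifulSoup(html.decode('utf-8', 'ignore').encode("utf-8"), 'html.parser')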
From:
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
Also:
Delete every non utf-8 symbols from string
Hope it helps ;)
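Replacement characters like those in the question typically appear when bytes are decoded with a codec other than the one that produced them. Dropping invalid bytes with 'ignore' hides the symptom but also loses the text. A minimal sketch of the effect, assuming purely for illustration that the page is served as cp1251 (the actual encoding of the site is not known here):

```python
# The sample text and the cp1251 encoding are assumptions for illustration.
raw = "ООО".encode("cp1251")  # bytes as a server might send them

wrong = raw.decode("utf-8", errors="replace")  # wrong codec: replacement characters
right = raw.decode("cp1251")                   # matching codec: text round-trips

print(wrong)
print(right)
```

If the real encoding is known, BeautifulSoup also accepts a from_encoding argument when it is given bytes, which avoids decoding by hand.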
