BeautifulSoup parser and Cyrillic characters - python-3.x

Hi guys!
I'm trying to parse this URL http://mapia.ua/ru/search?&city=%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B5%D0%B2&page=1&what=%D0%BE%D0%BE%D0%BE using BeautifulSoup.
But I get strange characters like this: ��� �1 ��� "����"
Here is my code:
from bs4 import BeautifulSoup
import urllib.request
URL = urllib.request.urlopen('http://mapia.ua/ru/search?city=%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B5%D0%B2&what=%D0%BE%D0%BE%D0%BE&page=1').read()
soup = BeautifulSoup(URL, 'html.parser')
print(soup.h3.get_text())
Can anybody help me?
P.S. I'm using python 3

I found this:
from bs4 import BeautifulSoup
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
# Decode as UTF-8, silently dropping invalid bytes, then re-encode for the parser.
soup = BeautifulSoup(html.decode('utf-8', 'ignore').encode('utf-8'), 'html.parser')
From:
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
Also:
Delete every non utf-8 symbols from string
Hope it helps ;)
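The snippet above forces UTF-8 and throws away anything that does not decode. If the page is actually served in a legacy Cyrillic encoding (this is only an assumption, I have not checked what mapia.ua sends), another option is to tell BeautifulSoup the encoding through its from_encoding argument. A minimal sketch on made-up sample bytes rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a small page encoded as windows-1251, not UTF-8.
raw = "<html><body><h3>ООО Ромашка</h3></body></html>".encode("windows-1251")

# from_encoding tells BeautifulSoup how to decode the raw bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="windows-1251")
heading = soup.h3.get_text()
print(heading)
```

If the text still looks wrong when printed but is correct inside Python, the console encoding is the more likely culprit.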

Related

How can I scrape a <h1> tag using BeautifulSoup? [Python]

I am currently coding a price tracker for different websites, but I have run into an issue.
I'm trying to scrape the contents of an h1 tag using BeautifulSoup4, but I don't know how. I've tried to use a dictionary, as suggested in
https://stackoverflow.com/a/40716482/14003061, but it returned None.
Can someone please help? It would be appreciated!
Here's the code:
from termcolor import colored
import requests
from bs4 import BeautifulSoup
import smtplib
def choice_bwfo():
    print(colored("You have selected Buy Whole Foods Online [BWFO]", "blue"))
    url = input(colored("\n[ 2 ] Paste a product link from BWFO.\n", "magenta"))
    url_verify = requests.get(url, headers=headers)
    soup = BeautifulSoup(url_verify.content, 'html5lib')
    item_block = BeautifulSoup.find('h1', {'itemprop' : 'name'})
    print(item_block)
choice_bwfo()
Here's an example URL you can use:
https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html
Thanks :)
This script prints the content of the <h1> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html'
# create `soup` variable from the URL:
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# print text of first `<h1>` tag:
print(soup.h1.get_text())
Prints:
Organic Spanish Bee Pollen 250g
Or you can do:
print(soup.find('h1', {'itemprop' : 'name'}).get_text())
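One thing worth pointing out: the code in the question calls find on the BeautifulSoup class, while both variants above call it on the parsed soup object. Here is a small offline sketch of the same lookup, using an inline HTML string that I made up to stand in for the product page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded product page.
html = '<html><body><h1 itemprop="name">Organic Spanish Bee Pollen 250g</h1></body></html>'

soup = BeautifulSoup(html, "html.parser")

# Call find() on the parsed document (the instance), not on the class.
heading = soup.find("h1", {"itemprop": "name"})

# find() returns None when nothing matches, so guard before reading the text.
title = heading.get_text(strip=True) if heading is not None else None
print(title)
```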

Unable to read wiki page by BeautifulSoup

I tried to read a Wikipedia page using urllib and BeautifulSoup, as follows.
I based my attempt on this.
import urllib.parse as parse, urllib.request as request
from bs4 import BeautifulSoup
name = "メインページ"
root = 'https://ja.wikipedia.org/wiki/'
url = root + parse.quote_plus(name)
response = request.urlopen(url)
html = response.read()
print (html)
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
print (soup)
The code runs without error, but the Japanese characters are not read correctly.
Your approach seems correct, and it works for me.
Try printing the parsed data with the following code and check the output.
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
some_japanese = soup.find('div', {'id': 'mw-content-text'}).text.strip()
print(some_japanese)
In my case, I get the following (this is part of the output):
ウィリアム・バトラー・イェイツ(1865年6月13日 - 1939年1月28日)は、アイルランドの詩人・劇作家。幼少のころから親しんだアイルランドの妖精譚などを題材とする抒情詩で注目されたのち、民族演劇運動を通じてアイルランド文芸復興の担い手となった。……
If this does not work for you, try saving the HTML content to a file and opening it in a browser to check whether the Japanese text was fetched properly. (Again, it works fine for me.)

Beautifulsoup response does not match with view source code output

While comparing the response from my code with the page source in Chrome, I noticed that the HTML returned through BeautifulSoup does not match the page source. I want to fetch the elements with class="rc". I can see that class in the page source, but I cannot find it in the printed response. I tried both "lxml" and "html.parser" too.
I am a beginner in Python, so my question might sound basic. I have also already checked a few articles related to my problem (BeautifulSoup returning different html than view source) but could not find a solution.
Below is my code:
import sys, requests
import re
import docx
import webbrowser
from bs4 import BeautifulSoup
query = sys.argv
url = "https://google.com/search?q=" + "+".join(query[1:])
print(url)
res = requests.get(url)
# print(res[:1000])
if res.status_code == 200:
    soup = BeautifulSoup(res.text, "html5lib")
    print(type(soup))
    all_select = soup.select("div", {"class": "rc"})
    print("All Select ", all_select)
I had the same problem. Try using another parser, such as "lxml", instead of "html5lib".
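Separately from the parser choice, note that select() takes a CSS selector string, and the attribute-dictionary form belongs to find_all(). A short sketch of both spellings on a made-up HTML fragment (not real Google output, which may well differ from what the browser shows):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a results page.
html = (
    '<div class="rc"><a href="https://example.com">Result</a></div>'
    '<div class="other">skip me</div>'
)
soup = BeautifulSoup(html, "html.parser")

# CSS selector form: tag name plus class.
by_selector = soup.select("div.rc")

# Equivalent find_all form with a class filter.
by_find_all = soup.find_all("div", class_="rc")

print(len(by_selector), len(by_find_all))
```

If both lists come back empty on the real response, the class is simply not in the HTML that requests received.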

Why is 'amp;' included in many of the links ('a') that I'm trying to scrape using BeautifulSoup in Python? What's the best way to remove it?

I am using findAll('a') or variations of it to extract a particular tag or class, but I'm getting 'amp;' inside the link in many places.
Example:
The two links, the actual one and the erroneous one (with 'amp;'):
https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=VIEW_ARTICLE&ARTICLE_ID=14311&CUST_PREV_CMD=null
https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=111)3&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=VIEW_ARTICLE&ARTICLE_ID=14311&CUST_PREV_CMD=null
"selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=false&TIMEZONE_OFFSET=&CMD=VIEW_ARTICLE&ARTICLE_ID=14271&CUST_PREV_CMD=BROWSE_TOPIC"
I can get rid of it using a regex, but is there a better way to do it?
The website I'm having a problem with is cybonline.
I don't see that problem at all with lxml. Can you try running the following?
import requests
from bs4 import BeautifulSoup as bs
base_url = 'https://help.cybonline.co.uk/system/'
r = requests.get('https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956')
soup = bs(r.content, 'lxml')
links = [base_url + item['href'] for item in soup.select('.articleAnchor')]
print(links)
If that doesn't help, you can use replace:
base_url + item['href'].replace('amp;', '')
If you want to remove the 'amp;' fragment, you can simply call replace when reading the value.
import requests
from bs4 import BeautifulSoup
html=requests.get("https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956").text
soup=BeautifulSoup(html,'html.parser')
for a in soup.find_all('a', class_='articleAnchor'):
    # Strip only the leftover 'amp;' so the '&' separators stay intact.
    link = a['href'].replace('amp;', '')
    print(link)
OR
import requests
from bs4 import BeautifulSoup
html=requests.get("https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956").text
soup=BeautifulSoup(html,'html.parser')
for a in soup.select('a.articleAnchor'):
    # Strip only the leftover 'amp;' so the '&' separators stay intact.
    link = a['href'].replace('amp;', '')
    print(link)

Why am I getting UnicodeEncode error?

I'm trying to make a small parsing script and am testing the waters.
I am not sure why I am getting this error.
My code is:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('http://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&s_kw=english-real-madrid')
data = r.text.encode()
soup = bs(data,'html.parser')
print (soup.prettify())
And the error is:
print (soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2153-2154: ordinal not in range(128)
However, if I use .encode() in my print line, it works fine.
I just want to be 100% sure I am doing this correctly. I have no experience parsing HTML/XML.
The solution is this:
from bs4 import BeautifulSoup as bs
import requests
req = requests.get('http://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&s_kw=english-real-madrid')
data = req.text
soup = bs(data,'html.parser')
print (soup.prettify('latin-1'))
I found it with the help of this question.
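For what it's worth, the error comes from print trying to encode the text for an output stream that only accepts ASCII. Another way around it is to replace the characters the stream cannot represent before printing. A rough sketch with a hypothetical helper named printable (not part of bs4 or requests):

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to the console encoding, then to UTF-8 if that is unknown.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)

# Forcing ASCII here to mimic the console from the question.
sample = printable("Real Madrid: campeón", "ascii")
print(sample)
```

In the question's script this would be used as print(printable(soup.prettify())). On Python 3.7+ you can also call sys.stdout.reconfigure(encoding="utf-8") once at the start if the terminal can actually display UTF-8.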

Resources