Beautifulsoup response does not match with view source code output - python-3.x

While comparing response from code and chrome source code. I observe that response returned from beautifulsoup does not match with page source code. I want to fetch class="rc"and I can see the class with "rc" on page source code, but could not find it in the response printed. I checked with "lxml" and "html.parser" too.
I am beginner in python so my question might sound basic. Also, I already checked few articles related to my problem(BeautifulSoup returning different html than view source) but could not find solution.
Below is my code:
import sys, requests
import re
import docx
import webbrowser
from bs4 import BeautifulSoup
query = sys.argv
url = "https://google.com/search?q=" + "+".join(query[1:])
print(url)
res = requests.get(url)
# print(res[:1000])
if res.status_code == 200:
soup = BeautifulSoup(res.text, "html5lib")
print(type(soup))
all_select = soup.select("div", {"class": "rc"})
print("All Select ", all_select)

I had the same problem, try using another parser such as "lxml" instead of "html5lib".

Related

Unable to read wiki page by BeautifulSoup

I tried to read wiki page using urllib and beautiful soup as follows.
I tried according to this.
import urllib.parse as parse, urllib.request as request
from bs4 import BeautifulSoup
name = "メインページ"
root = 'https://ja.wikipedia.org/wiki/'
url = root + parse.quote_plus(name)
response = request.urlopen(url)
html = response.read()
print (html)
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
print (soup)
The code run without error but could not read Japanese characters.
Your approach seems correct and working for me.
Try printing soup parsed data using following code and check the output.
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
some_japanese = soup.find('div', {'id': 'mw-content-text'}).text.strip()
print(some_japanese)
In my case, I am getting the following(this is part of the output) -
ウィリアム・バトラー・イェイツ(1865年6月13日 - 1939年1月28日)は、アイルランドの詩人・劇作家。幼少のころから親しんだアイルランドの妖精譚などを題材とする抒情詩で注目されたのち、民族演劇運動を通じてアイルランド文芸復興の担い手となった。……
If this is not working for you, then try to save html content to file, and check the page in browser, if japanese text is fetching properly or not. (Again, its working fine for me)

web scraping with beautiful soup in python

I want to crawl the homepage of youtube to pull out all the links of videos. Following is the code
from bs4 import BeautifulSoup
import requests
s='https://www.youtube.com/'
html=requests.get(s)
html=html.text
s=BeautifulSoup(html,features="html.parser")
for e in s.find_all('a',{'id':'video-title'}):
link=e.get('href')
text=e.string
print(text)
print(link)
print()
Nothing is happenning when I run the above code. It seems like the id is not getting discovered. What am I doing wrong
It is because you are not getting the same HTML as your browser have.
import requests
from bs4 import BeautifulSoup
s = requests.get("https://youtube.com").text
soup = BeautifulSoup(s,'lxml')
print(soup)
Save this code's output to a file named test.html and run. You will see that it is not the same as the browser's, as it looks corrupted.
See these questions below.
HTML in browser doesn't correspond to scraped data in python
Python requests not giving me the same HTML as my browser is
Basically, I recommend you to use Selenium Webdriver as it reacts as a browser.
Yes, this is a strange scrape, but if you scrape at the 'div id="content"' level, you are able to get the data you are requesting. I was able to get the titles of each video, but it appears youtube has some rate limiting or throttling, so I do not think you will be able to get ALL of the titles and links. At any rate, below is what I got working for the titles:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('div', id='content')
for each in links:
print(each.text)
May be this could help for scraping all videos from youtube home page,
from bs4 import BeautifulSoup
import requests
r = 'https://www.youtube.com/'
html = requests.get(r)
all_videos = []
soup = BeautifulSoup(html.text, 'html.parser')
for i in soup.find_all('a'):
if i.has_attr('href'):
text = i.attrs.get('href')
if text.startswith('/watch?'):
urls = r+text
all_videos.append(urls)
print('Total Videos', len(all_videos))
print('LIST OF VIDEOS', all_videos)
This code snippet will selects all links from youtube.com homepage that contains /watch? in their href attribute (links to videos):
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get('https://www.youtube.com/').text, 'lxml')
for a in soup.select('a[href*="/watch?"]'):
print('https://www.youtube.com{}'.format(a['href']))
Prints:
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=2_mDOWLhkVU
https://www.youtube.com/watch?v=2_mDOWLhkVU
...and so on

How can i get the links under a specific class

So 2 days ago i was trying to parse the data between two same classes and Keyur helped me a lot then after he left other problems behind.. :D
Page link
Now I want to get the links under a specific class, here is my code, and here are the errors.
from bs4 import BeautifulSoup
import urllib.request
import datetime
headers = {} # Headers gives information about you like your operation system, your browser etc.
headers['User-Agent'] = 'Mozilla/5.0' # I defined a user agent because HLTV perceive my connection as bot.
hltv = urllib.request.Request('https://www.hltv.org/matches', headers=headers) # Basically connecting to website
session = urllib.request.urlopen(hltv)
sauce = session.read() # Getting the source of website
soup = BeautifulSoup(sauce, 'lxml')
a = 0
b = 1
# Getting the match pages' links.
for x in soup.find('span', text=datetime.date.today()).parent:
print(x.find('a'))
Error:
Actually there isn't any error but it outputs like:
None
None
None
-1
None
None
-1
Then i researched and saw that if there isn't any data to give, find function gives you nothing which is none.
Then i tried to use find_all
Code:
print(x.find_all('a'))
Output:
AttributeError: 'NavigableString' object has no attribute 'find_all'
This is the class name:
<div class="standard-headline">2018-05-01</div>
I don't want to post all the code to here so here is the link hltv.org/matches/ so you can check the classes more easily.
I'm not quite sure I could understand what links OP really wants to grab. However, I took a guess. The links are within compound classes a-reset block upcoming-match standard-box and if you can spot the right class then one individual calss will suffice to fetch you the data like selectors do. Give it a shot.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.parse import urljoin
import datetime
url = 'https://www.hltv.org/matches'
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res, 'lxml')
for links in soup.find(class_="standard-headline",text=(datetime.date.today())).find_parent().find_all(class_="upcoming-match")[:-2]:
print(urljoin(url,links.get('href')))
Output:
https://www.hltv.org/matches/2322508/yeah-vs-sharks-ggbet-ascenso
https://www.hltv.org/matches/2322633/team-australia-vs-team-uk-showmatch-csgo
https://www.hltv.org/matches/2322638/sydney-saints-vs-control-fe-lil-suzi-winner-esl-womens-sydney-open-finals
https://www.hltv.org/matches/2322426/faze-vs-astralis-iem-sydney-2018
https://www.hltv.org/matches/2322601/max-vs-fierce-tiger-starseries-i-league-season-5-asian-qualifier
and so on ------

BeautifulSoup returns urls of pages on same website shortened

My code for reference:
import httplib2
from bs4 import BeautifulSoup
h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")
urls = []
for tag in soup.findAll('a', href=True):
urls.append(tag['href'])
responses = []
contents = []
for url in urls:
try:
response1, content1 = h.request(url)
responses.append(response1)
contents.append(content1)
except:
pass
The idea is, I get the payload of a webpage, and then scrape that for hyperlinks. One of the links is to yahoo.com, the other to 'http://csb.stanford.edu/class/public/index.html'
However the result I'm getting from BeautifulSoup is:
>>> urls
['http://www.yahoo.com/', '../../index.html']
This presents a problem, because the second part of the script cannot be executed on the second, shortened url. Is there any way to make BeautifulSoup retrieve the full url?
That's because the link on the webpage is actually of that form. The HTML from the page is:
<p>Or let's just link to <a href=../../index.html>another page on this server</a></p>
This is called a relative link.
To convert this to an absolute link, you can use urljoin from the standard library.
from urllib.parse import urljoin # Python3
urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html`,
'../../index.html')
# returns http://csb.stanford.edu/class/public/index.html

Beautiful Soup Questions

I am trying to scrap the following website, however, I have encountered some problems. As you can see, I am not really familiar with regex and I hope you can give me some pointers to how to solve this problem.
Basically, I hope I can download all the transaction data into the database. However, I need to retrieve it first.
Thank you
Below is my quote:
from bs4 import BeautifulSoup
import requests
import re
url = 'https://bochk.etnet.com.hk/content/bochkweb/eng/quote_transaction_daily_history.php?services=STK&code=6881\
&ip=boc.etnet.com.hk&host=BochkUMSContent&sessionId=44c99b61679e019666f0570db51ad932'
pattern = re.compile('var json_result = (.*?);')
def turnover_detail(url):
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,"html.parser")
data = soup.find_all("script")
for json in data:
print(json)
turnover_detail(url)

Resources