Python 3 web scraping options - python-3.x

I'm new to Python so I'm sorry if this is a newbie question.
I'm trying to build a program involving webscraping and I've noticed that Python 3 seems to have significantly fewer web-scraping modules than the Python 2.x series.
Beautiful Soup, mechanize, and Scrapy -- the three modules recommended to me -- all seem to be incompatible with Python 3.
I'm wondering if anyone on this forum has a good option for web scraping using Python 3.
Any suggestions would be greatly appreciated.
Thanks,
Will

lxml.html works on Python 3, and gets you html parsing, at least.
BeautifulSoup 4, which is in the works, should support Python 3 (I've done some work on this).

I'm fairly new to this, but I've found BeautifulSoup 4 to be really good, and I'm learning and using it together with the requests and lxml modules. The requests module fetches the URL, and lxml does the parsing (you can also use the built-in html.parser, but lxml is usually faster).
Simple usage is:
import requests
from bs4 import BeautifulSoup
url = 'someUrl'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
A slightly less simple example, collecting the href values from the HTML:
links = set()
for link in soup.find_all('a'):
    if 'href' in link.attrs:
        links.add(link['href'])  # store the URL itself rather than the whole tag
This gives you a set of the unique links found at your URL.
Another example, parsing specific parts of the HTML, e.g. if you want every <p> tag that has the class testClass:
list_of_p = []
for p in soup.find_all('p', {'class': 'testClass'}):
    for item in p:
        list_of_p.append(item)
There is a lot more you can do with it, just as easily.
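If installing third-party packages is not an option, the standard library's html.parser module can also collect links on Python 3. The following is only a minimal sketch: the LinkCollector class, the sample markup, and the example.com URLs are made up for illustration, and it parses an inline string instead of downloading a page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the unique href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.add(urljoin(self.base_url, value))


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<p>See <a href="/docs">the docs</a>, <a href="/docs">again</a>,
<a href="https://example.org/faq">the FAQ</a> and <a name="top">an anchor</a>.</p>
"""

collector = LinkCollector("https://example.com/start")
collector.feed(SAMPLE_HTML)
links = sorted(collector.links)
print(links)
```

The duplicate /docs link collapses into one entry because the links are kept in a set, and the anchor without an href is skipped.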

Related

Python web-scraping misses an element from the list of searched objects

I'm trying to scrape some data using beautifulsoup and requests libraries in Python 3.7. For each of the items (tag article) on this webpage, there is a youtube link. After finding all the instances of article, I can successfully extract the headlines. This code also successfully finds instances of youtube-player class inside each article, except at index 7, where the output is None.
from bs4 import BeautifulSoup
import requests
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
articles = soup.find_all('article')
for article in articles:
    headline = article.h2.a.text
    print(headline)
    link = article.find('iframe', {'class': 'youtube-player'})
    print(link)
However, from the source (output of beautifulsoup), if I directly search for youtube-player, I get all the instances correctly.
links = soup.find_all('iframe', {'class': 'youtube-player'})
for link in links:
    print(link)
How can I improve my code to get all the youtube-player instances within article loop?
You can use the built-in zip() function to tie the titles and YouTube links together.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for title, player in zip(soup.select('.entry-title'),
                         soup.select('iframe.youtube-player')):
    print('{:<75}{}'.format(title.text, player['src']))
Prints:
Git: Difference between “add -A”, “add -u”, “add .”, and “add *” https://www.youtube.com/embed/tcd4txbTtAY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Programming Terms: Combinations and Permutations https://www.youtube.com/embed/QI9EczPQzPQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Chrome Quick Tip: Quickly Bookmark Open Tabs for Later Viewing https://www.youtube.com/embed/tsiSg_beudo?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Comprehensions – How they work and why you should be using them https://www.youtube.com/embed/3dt4OGnU5sM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Generators – How to use them and the benefits you receive https://www.youtube.com/embed/bD05uGo_sVI?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Quickest and Easiest Way to Run a Local Web-Server https://www.youtube.com/embed/lE6Y6M9xPLw?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Git for Beginners: Command-Line Fundamentals https://www.youtube.com/embed/HVsySz-h9r4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Time-Saving Keyboard Shortcuts for the Mac Terminal https://www.youtube.com/embed/TXzrk3b9sKM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Overview of Online Learning Resources in 2015 https://www.youtube.com/embed/QGy6M8HZSC4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Else Clauses on Loops https://www.youtube.com/embed/Dh-0lAyc3Bc?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
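One caveat with zip(): it pairs items purely by position, so the titles and players only line up when every article has exactly one player. The sketch below uses made-up data (plain dictionaries, no HTML) to show what happens when one article has no player, compared with looking the player up per article.

```python
# Hypothetical data: the second article has no embedded player.
articles = [
    {"title": "First post", "player": "https://www.youtube.com/embed/aaa"},
    {"title": "Second post", "player": None},
    {"title": "Third post", "player": "https://www.youtube.com/embed/ccc"},
]

titles = [a["title"] for a in articles]
# A page-wide search only returns the players that exist.
players = [a["player"] for a in articles if a["player"]]

# Page-wide pairing: zip() stops at the shorter list and pairs by position only,
# so "Second post" ends up with the third article's player.
zipped = list(zip(titles, players))

# Per-article pairing: a missing player stays attached to its own title.
per_article = [(a["title"], a["player"] or "") for a in articles]

print(zipped)
print(per_article)
```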
EDIT: It seems that with html.parser, BeautifulSoup fails to recognize the YouTube iframe in one of the articles. Use lxml or html5lib instead:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "lxml")
for article in soup.select('article'):
    title = article.select_one('.entry-title')
    player = article.select_one('iframe.youtube-player') or {'src': ''}
    print('{:<75}{}'.format(title.text, player['src']))

Web Scraping problem through python, can't read html file?

Been web scraping a while with Python and recently I came across this problem.
BeautifulSoup doesn't seem to be able to read the html file.
For example, I'm trying to scrape this website:
https://www.thetvdb.com/series/initial-d/episodes/4889010
And this is my code:
from bs4 import BeautifulSoup
import requests
url_episode = 'https://www.thetvdb.com/series/initial-d/episodes/4889010'
print(url_episode)
getdetail_episode = requests.get(url_episode)
soup = BeautifulSoup(getdetail_episode.content,'html.parser')
print(soup.prettify())
I was able to scrape data from other links, but not this one.
What else should I be doing to get this working?
Thanks
UPDATE
So I checked with Repl.it and other online Python compilers, and the code worked. WTF?
But it doesn't work with Sublime Text or Python IDLE on my computer?
I am confused.
Okay so I think I figured it out.
The whole trouble was caused by the delay in the data loading on the webpage, so the script ran as if there were no data to scrape.
I ended up fetching the page with requests-html (HTMLSession) instead of plain requests, still parsing it with BeautifulSoup, and that resolved it.
So pretty much like this:
from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession
session = HTMLSession()
url_episode = 'https://www.thetvdb.com/series/initial-d/episodes/4889010'
getdetail_episode = session.get(url_episode)
soup = BeautifulSoup(getdetail_episode.content,'html.parser')
print(soup.prettify())

Python - requests, lxml and xpath not working

I am trying to write some python to scrape the web for firmware/driver updates but different web pages are responding differently.
I've used the requests and lxml packages to find the information based on XPath. I found the XPath by opening the URL in Chrome, right-clicking the data and inspecting it, then right-clicking the highlighted code and selecting Copy XPath.
WORKING EXAMPLE
Intel NUC at https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK.
As of 2019-12-25, the value it correctly picks up is "24.3".
import requests
from lxml import html
url = "https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK"
page = requests.get(url)
tree = html.fromstring(page.content)  # parse the response so tree.xpath() is available
XpathToFWtype = '//*[@id="search-results"]/tbody/tr[1]/td[4]/text()'
tree.xpath(XpathToFWtype)
FAILING EXAMPLE
Similar logic fails for the ASUS website, where it should scrape the firmware text Version 1.1.2.3_790:
https://www.asus.com/lk/Networking/DSL-AC56U/HelpDesk_BIOS/
The XPath copied from the inspector for the failing case is:
//*[@id="Manual-Download"]/div[2]/div[2]/div/div/section/div[1]/div[1]/span[1]
Everything I try fails, whether I add "/text()" or any variation. The webpages differ in that "view source" shows the text for the Intel URL but not for the ASUS one, so it is being generated dynamically somewhere. After days of trying everything, I am unsure what to do next.
import requests
from lxml import html
url = "https://www.asus.com/lk/Networking/DSL-AC56U/HelpDesk_BIOS/"
page = requests.get(url)
tree = html.fromstring(page.content)  # parse the response so tree.xpath() is available
XpathToFWtype = '//*[@id="Manual-Download"]/div[2]/div[2]/div/div/section/div[1]/div[1]/span[1]/text()'
tree.xpath(XpathToFWtype)
# etc -> many traceback errors from lxml :-(
Thanks for any suggestion or direction, it's really appreciated.
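For reference, XPath tests an attribute with @, as in [@id='...']. Here is a small offline sketch of that kind of expression, using the standard library's xml.etree.ElementTree (which supports only a subset of XPath) on a made-up, well-formed fragment. The table contents are invented. With lxml the full expression, including a trailing /text(), would go to tree.xpath() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like a results table.
SAMPLE = """
<html><body>
<table id="search-results">
  <tbody>
    <tr><td>BIOS Update</td><td>BIOS</td><td>OS Independent</td><td>24.3</td></tr>
    <tr><td>Graphics Driver</td><td>Driver</td><td>Windows</td><td>0054</td></tr>
  </tbody>
</table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree has no text() step, so select the cell and read .text afterwards.
cells = root.findall(".//*[@id='search-results']/tbody/tr[1]/td[4]")
versions = [cell.text.strip() for cell in cells]
print(versions)
```

If the id is not present in the HTML that was actually downloaded, the same query simply returns an empty list, which is a quick way to tell static markup from content filled in later by JavaScript.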
For INTEL website you can do the following:
import requests
from bs4 import BeautifulSoup
r = requests.get(
"https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("td", {'class': 'dc-version collapsible-col collapsible1'}):
    item = item.text
    print(item[0:item.find("L")])
Output:
24.3
0054
1.0.0
6.1.9
15.40.41.5058
1.01
1
6.0.1.7982
11.0.6.1194
15.36.28.4332
15.40.13.4331
15.36.26.4294
14.5.0.1081
2.4.2013.711
10.1.1.8
10.0.27
2.4.2013.711
2.4.2013.711
The ASUS website actually uses JavaScript to render its content, so you would normally need Selenium or PhantomJS. However, I was able to locate the XHR call to the JSON API and call it directly with requests :).
import requests
r = requests.get(
"https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=").json()
for item in r['Result']['Obj']:
    for data in item['Files']:
        print(data['Version'])
Output:
1.1.2.3_790
1.1.2.3_743
1.1.2.3_674
1.1.2.3_617
1.1.2.3_552
1.1.2.3_502
1.1.2.3_473
You can parse whatever from here :) https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=
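To experiment with that structure without hitting the API, the same traversal can be run on a saved or hand-written payload. The sketch below assumes only the keys that the loop above reads (Result, Obj, Files, Version). The sample payload is a heavily trimmed, hypothetical stand-in for the real response, and extract_versions is an illustrative helper name.

```python
import json

# Hypothetical, heavily trimmed payload; only the keys read by the loop are kept.
SAMPLE_RESPONSE = """
{"Result": {"Obj": [
    {"Files": [
        {"Version": "1.1.2.3_790"},
        {"Version": "1.1.2.3_743"}
    ]}
]}}
"""


def extract_versions(payload):
    """Return every 'Version' string found under Result -> Obj -> Files."""
    versions = []
    # .get() with a default means a missing key yields no results instead of a KeyError.
    for group in payload.get("Result", {}).get("Obj", []):
        for entry in group.get("Files", []):
            if "Version" in entry:
                versions.append(entry["Version"])
    return versions


versions = extract_versions(json.loads(SAMPLE_RESPONSE))
print(versions)
```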

Web scraping issue searching for contents in Youtube trending page with BeautifulSoup

I am trying to build an app that returns the top 10 youtube trending videos into an excel file but ran into an issue right at the beginning. For some reason, whenever I try to use "soup.find" on any of the id's on this YouTube page, it returns "None" as the result.
I have made sure that my spelling is perfect and everything but it still won't work. I have tried this same code using other sites and get the same error.
#What I did for Youtube which resulted in output being "None"
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
videos = soup.find(id= "contents")
print(videos)
I expect it to provide me with the HTML code that has this id that I have specified but it keeps saying "None".
The page relies heavily on JavaScript to modify the classes and attributes of tags. What you see in Developer Tools isn't always what requests gives you. I recommend calling print(soup.prettify()) to see what markup you're actually working with.
You can use this script to get first 10 trending videos:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
for i, a in enumerate(soup.select('h3.yt-lockup-title a[title]')[:10], 1):
    print('{: <4}{}'.format(str(i)+'.', a['title']))
Prints (in my case in Estonia):
1. Jaanus Saks - Su säravad silmad
2. Егор Крид - Сердцеедка (Премьера клипа, 2019)
3. Comment Out #11/ Ольга Бузова х Фёдор Смолов
4. 5MIINUST x NUBLU - (ei ole) aluspükse
5. Артур Пирожков - Алкоголичка (Премьера клипа 2019)
6. Slav school of driving - driving instructor Boris
7. ЧТО ЕДЯТ В АРМИИ США VS РОССИИ?
8. RC Airplane Battle | Dude Perfect
9. ЧЕЙ КОРАБЛИК ОСТАНЕТСЯ ПОСЛЕДНИЙ, ПОЛУЧИТ 1000$ !
10. Khloé Kardashian's New Mom Beauty Routine | Beauty Secrets | Vogue
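A quick way to check whether an id such as "contents" exists in the raw HTML at all, before blaming the selector, is to list every id the downloaded markup contains. This is only a sketch using the standard library's html.parser. IdCollector, ids_in, and the sample markup are made up for illustration and stand in for the text of a real response.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Record the value of every id attribute seen in the markup."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def ids_in(html_text):
    """Return the set of id values present in html_text."""
    collector = IdCollector()
    collector.feed(html_text)
    return collector.ids


# Hypothetical raw response: the "contents" element would only be created by a script.
RAW_HTML = (
    '<html><body><div id="player"></div>'
    "<script>/* builds the contents element later */</script></body></html>"
)

ids = ids_in(RAW_HTML)
print("contents" in ids)
```

If the id is missing here, it is being added by JavaScript after the page loads, and no selector run against the raw response will find it.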
Since YouTube relies heavily on JavaScript to render pages and modify how they load, it's better to load the page in a browser and then pass its page source to BeautifulSoup. Selenium serves this purpose. Once you have the soup object, you can do whatever you want with it.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import os
driver = webdriver.Firefox(executable_path="/home/rishabh/Documents/pythonProjects/webScarapping/geckodriver")
driver.get('https://www.youtube.com/feed/trending')
content = driver.page_source
driver.close()
soup = BeautifulSoup(content, 'html.parser')
#Do whatever you want with it
To configure Selenium, see https://selenium-python.readthedocs.io/installation.html

Python BeautifulSoup

I am using Python BeautifulSoup to extract some data from a famous song site.
Here is the snippet of code:
import requests
from bs4 import BeautifulSoup
url= 'https://gaana.com/playlist/gaana-dj-bollywood-top-50-1'
res = requests.get(url)
while res.status_code != 200:
    try:
        res = requests.get(url)  # pass the variable url, not the string 'url'
    except requests.RequestException:
        pass
print(res)
soup = BeautifulSoup(res.text,'lxml')
songs = soup.find_all('meta',{'property':'music:song'})
print (songs[0])
Here is the sample output:
<Response [200]>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
Now I want to extract the URL in the content attribute as a string so that I can use it further in my program.
Could someone please help me?
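As a side note on the retry loop in the question: a loop that repeats until the status is 200 can spin forever if the site keeps failing. A bounded version might look like the sketch below. The names fetch_with_retries and fake_fetch are hypothetical, and the network call is replaced by a stand-in function so the example runs offline.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) until it reports status 200 or the attempts run out.

    fetch must return a (status_code, body) tuple.
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        try:
            status, body = fetch(url)
        except OSError:
            # Treat a network-level error like a failed attempt.
            status, body = None, None
        if status == 200:
            return body
        last_status = status
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(
        f"gave up on {url} after {attempts} attempts (last status: {last_status})"
    )


# Hypothetical stand-in for a network call: fails twice, then succeeds.
calls = []


def fake_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        return 503, ""
    return 200, "<html>ok</html>"


body = fetch_with_retries(fake_fetch, "https://example.com/playlist")
print(body, len(calls))
```

With requests, the fetch argument could be a small wrapper that calls requests.get(url) and returns (response.status_code, response.text).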
It's in the comments, but I just want to explain: BeautifulSoup returns most results as a list or other iterable object. You show that you understand this in your code by using songs[0], but in this case the item returned is a tag whose attributes are accessed like a dictionary.
As explained in this StackOverflow post, you need to query not only songs[0] but also the attribute name within that dictionary, i.e. songs[0]['content'] (a key and its value together are called a key-value pair, and looking up by key is the chief way to get data out of a dictionary).
Last note: while I've been a big fan of BeautifulSoup4 for basic web scraping, you may want to consider the lxml library. It's pretty well documented. To really take advantage of it you have to learn XPath expressions, which are sort of like regex for XML/HTML, but for advanced scraping it's probably the best option short of Selenium, and it returns cleaner data than bs4.
Good luck!
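To make the dictionary-style lookup concrete, here is a minimal sketch that runs on an inline string instead of the live page. The first meta tag is the one from the sample output in the question. The second song URL and the og:title tag are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline markup standing in for the downloaded page; only the first URL comes from the question.
SAMPLE_HTML = """
<head>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
<meta content="https://example.com/song/placeholder" property="music:song"/>
<meta content="ignored" property="og:title"/>
</head>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
songs = soup.find_all("meta", {"property": "music:song"})

# Index into the result list, then look up the attribute by name.
first_url = songs[0]["content"]

# .get() returns None instead of raising KeyError if a tag lacks the attribute.
song_urls = [tag.get("content") for tag in songs]

print(first_url)
print(song_urls)
```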

Resources