I am scraping a website for flight tickets. I use Chrome Developer Tools to identify the class of the HTML element I want to scrape, but my code does not find it. It looks like I am not downloading the same HTML that I can see in Chrome Developer Tools (Inspect).
import requests
from BeautifulSoup import BeautifulSoup
url = 'http://www.momondo.de/flightsearch/?Search=true&TripType=2&SegNo=2&SO0=BOS&SD0=LON&SDP0=07-09-2016&SO1=LON&SD1=BOS&SDP1=12-09-2016&AD=1&TK=ECO&DO=false&NA=false'
req = requests.get(url)
soup = BeautifulSoup(req.content)
x = soup.findAll("span" ,{"class":"value"} )
Please try the following:
from bs4 import BeautifulSoup
import urllib.request
source = urllib.request.urlopen('http://www.momon...e&NA=false').read()
soup = BeautifulSoup(source,'html5lib')
for item in soup.find_all("span", class_="value"):
    print(item.text)
With this you can scrape all the spans of the webpage with the class "value". If you want to see the whole html element and its attributes instead of just the content, remove .text from print(item.text).
You will probably need to install html5lib with pip. If you have trouble doing this, try running CMD as administrator (assuming you are using Windows).
You can also try this:
for values_in_x in x:
    print(values_in_x.text)
Related
Is there any logic which tags should be used in scraping?
Right now I'm just doing "trial-and-error" on different tag variations to see which works. It takes a lot of time and is really frustrating. I can't understand the logic as to why some tags work and some don't. For example, the code below works fine:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, 'html.parser')
test1 = soup.find_all('div', attrs={'id':'app'})
print(test1)
However, just a slight change to the code and the result is "None":
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, 'html.parser')
test2 = soup.find_all('div', attrs={'id':'YDC-Lead-Stack-Composite'})
print(test2)
Is there any logical explanation why the first example (test1) returns values and why the second example (test2) doesn't return any value?
Is there an efficient way to know which tags will work?
Looks to me like you're trying to scrape a React web app, which won't work with the usual web scraping methods.
If you view the raw source (before the scripts are loaded), you'll find that the app is not loaded yet, because it runs in JavaScript and fetches the data afterwards.
There are two options here:
Find out if there is an API you can query (instead of scraping)
Load the page in a browser and use selenium to scrape (see https://selenium-python.readthedocs.io/getting-started.html)
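Before choosing between these options, it can help to check which ids actually exist in the HTML the server sends. The following is a minimal sketch using only the standard library; `IdCollector`, `ids_in_raw_html` and `sample_html` are hypothetical names, and the sample markup is an assumed stand-in for a JavaScript-driven page, not the real Yahoo response.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute that appears in the static HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def ids_in_raw_html(html_text):
    """Return the set of element ids present before any JavaScript runs."""
    collector = IdCollector()
    collector.feed(html_text)
    return collector.ids


# Hypothetical stand-in for what the server sends for a JavaScript-driven page:
# an empty mount point plus a script that would build the rest in the browser.
sample_html = """
<html><body>
  <div id="app"></div>
  <script>/* would create <div id="YDC-Lead-Stack-Composite"> at runtime */</script>
</body></html>
"""

found = ids_in_raw_html(sample_html)
print("app" in found)                       # True
print("YDC-Lead-Stack-Composite" in found)  # False
```

Feeding it the text of a real response (for example `page.text` from requests) should show whether the id you picked in the browser is present in the raw HTML at all.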
I'm trying to scrape some data using beautifulsoup and requests libraries in Python 3.7. For each of the items (tag article) on this webpage, there is a youtube link. After finding all the instances of article, I can successfully extract the headlines. This code also successfully finds instances of youtube-player class inside each article, except at index 7, where the output is None.
from bs4 import BeautifulSoup
import requests
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
articles = soup.find_all('article')
for article in articles:
    headline = article.h2.a.text
    print(headline)
    link = article.find('iframe', {'class': 'youtube-player'})
    print(link)
However, from the source (output of beautifulsoup), if I directly search for youtube-player, I get all the instances correctly.
links = soup.find_all('iframe', {'class': 'youtube-player'})
for link in links:
    print(link)
How can I improve my code to get all the youtube-player instances within the article loop?
You can use the built-in zip() function to tie titles and YouTube links together.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for title, player in zip(soup.select('.entry-title'),
                         soup.select('iframe.youtube-player')):
    print('{:<75}{}'.format(title.text, player['src']))
Prints:
Git: Difference between “add -A”, “add -u”, “add .”, and “add *” https://www.youtube.com/embed/tcd4txbTtAY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Programming Terms: Combinations and Permutations https://www.youtube.com/embed/QI9EczPQzPQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Chrome Quick Tip: Quickly Bookmark Open Tabs for Later Viewing https://www.youtube.com/embed/tsiSg_beudo?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Comprehensions – How they work and why you should be using them https://www.youtube.com/embed/3dt4OGnU5sM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Generators – How to use them and the benefits you receive https://www.youtube.com/embed/bD05uGo_sVI?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Quickest and Easiest Way to Run a Local Web-Server https://www.youtube.com/embed/lE6Y6M9xPLw?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Git for Beginners: Command-Line Fundamentals https://www.youtube.com/embed/HVsySz-h9r4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Time-Saving Keyboard Shortcuts for the Mac Terminal https://www.youtube.com/embed/TXzrk3b9sKM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Overview of Online Learning Resources in 2015 https://www.youtube.com/embed/QGy6M8HZSC4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Else Clauses on Loops https://www.youtube.com/embed/Dh-0lAyc3Bc?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
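One caveat with zip(): it pairs items purely by position, so it only lines up correctly when every title has exactly one player. A small sketch with made-up data (the titles and example.com links are hypothetical) shows what happens when one entry has no player:

```python
from itertools import zip_longest

# Hypothetical data: the second article has no embedded player.
titles = ["Post A", "Post B", "Post C"]
players = ["https://example.com/embed/a", "https://example.com/embed/c"]

# zip() pairs by position and stops at the shorter input, so "Post B" is
# paired with the link that belongs to "Post C" and "Post C" is dropped.
paired = list(zip(titles, players))
print(paired)

# zip_longest() keeps every title but still cannot tell which one lacks a link.
padded = list(zip_longest(titles, players, fillvalue=""))
print(padded)
```

If some articles might have no player, looking the player up inside each article keeps the pairing explicit.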
EDIT: It seems that with html.parser, BeautifulSoup doesn't recognize the YouTube link in one place. Use lxml or html5lib instead:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "lxml")
for article in soup.select('article'):
    title = article.select_one('.entry-title')
    player = article.select_one('iframe.youtube-player') or {'src':''}
    print('{:<75}{}'.format(title.text, player['src']))
Been web scraping a while with Python and recently I came across this problem.
BeautifulSoup doesn't seem to be able to read the html file.
For example, I'm trying to scrape from this website:
https://www.thetvdb.com/series/initial-d/episodes/4889010
And this is my code:
from bs4 import BeautifulSoup
import requests
url_episode = 'https://www.thetvdb.com/series/initial-d/episodes/4889010'
print(url_episode)
getdetail_episode = requests.get(url_episode)
soup = BeautifulSoup(getdetail_episode.content,'html.parser')
print(soup.prettify())
I was able to scrape data from other links, but not this one.
What else should I be doing to get this working?
Thanks
UPDATE
So I checked with Repl.it and other online Python environments, and the code worked. WTF?
But it's not working with Sublime Text or Python IDLE on my computer?
I am confused.
Okay so I think I figured it out.
The whole trouble seems to have been caused by a delay in the webpage loading its data, so the response looked as if there was no data to scrape.
I ended up fetching the page with requests-html instead of plain requests to resolve it.
So pretty much like this:
from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession
session = HTMLSession()
url_episode = 'https://www.thetvdb.com/series/initial-d/episodes/4889010'
getdetail_episode = session.get(url_episode)
soup = BeautifulSoup(getdetail_episode.content,'html.parser')
print(soup.prettify())
I am trying to write some Python to scrape the web for firmware/driver updates, but different web pages are responding differently.
I've used the requests and lxml packages to find the information based on XPath. I found the XPath by opening the URL in Chrome, right-clicking the data and choosing Inspect, then right-clicking the highlighted code and selecting Copy XPath.
WORKING EXAMPLE
Intel NUC at https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK.
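As of 2019-12-25, the value it correctly picks up is "24.3".
import requests
from lxml import html
url = "https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK"
page = requests.get(url)
# Parse the response so the XPath can be evaluated against it
tree = html.fromstring(page.content)
# Attribute tests in XPath are written with @id
XpathToFWtype = '//*[@id="search-results"]/tbody/tr[1]/td[4]/text()'
print(tree.xpath(XpathToFWtype))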
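To try an XPath like this offline, here is a minimal sketch that uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) instead of lxml. The `sample` markup is a hypothetical, well-formed stand-in for the results table; the real page is assumed to be more complex.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the results table; the real page
# markup is an assumption and is likely more complex.
sample = """
<html><body>
  <table id="search-results">
    <tbody>
      <tr><td>BIOS Update</td><td>BIOS</td><td>OS Independent</td><td>24.3</td></tr>
      <tr><td>Driver</td><td>Drivers</td><td>Windows</td><td>1.0.0</td></tr>
    </tbody>
  </table>
</body></html>
"""

root = ET.fromstring(sample.strip())

# Attribute predicates use "@id". ElementTree paths must be relative, hence
# the leading ".", and .text is read instead of using a trailing /text() step.
cell = root.find(".//*[@id='search-results']/tbody/tr[1]/td[4]")
version = cell.text if cell is not None else None
print(version)  # 24.3
```

If the same expression returns nothing against a real response, the element is probably not in the raw HTML that the server sent.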
FAILING EXAMPLE
Similar logic fails for the ASUS website, where it should scrape the firmware text "Version 1.1.2.3_790":
https://www.asus.com/lk/Networking/DSL-AC56U/HelpDesk_BIOS/
The failing XPath, as copied from the Inspect panel, is:
//*[@id="Manual-Download"]/div[2]/div[2]/div/div/section/div[1]/div[1]/span[1]
Everything I try fails, whether I add "/text()" or any variation. The webpages differ in that the "view source" shows the text for the Intel url, and not the Asus so it is being dynamically generated somewhere - but I am unsure after days of trying everything what to do next.
import requests
from lxml import html
url = "https://www.asus.com/lk/Networking/DSL-AC56U/HelpDesk_BIOS/"
page = requests.get(url)
# Parse the response so the XPath can be evaluated against it
tree = html.fromstring(page.content)
XpathToFWtype = '//*[@id="Manual-Download"]/div[2]/div[2]/div/div/section/div[1]/div[1]/span[1]/text()'
tree.xpath(XpathToFWtype)
# etc -> many traceback errors from lxml :-(
Thanks for any suggestion or direction, it's really appreciated.
For the Intel website you can do the following:
import requests
from bs4 import BeautifulSoup
r = requests.get(
"https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("td", {'class': 'dc-version collapsible-col collapsible1'}):
    item = item.text
    print(item[0:item.find("L")])
Output:
24.3
0054
1.0.0
6.1.9
15.40.41.5058
1.01
1
6.0.1.7982
11.0.6.1194
15.36.28.4332
15.40.13.4331
15.36.26.4294
14.5.0.1081
2.4.2013.711
10.1.1.8
10.0.27
2.4.2013.711
2.4.2013.711
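The slice item[0:item.find("L")] depends on an "L" being present in the cell text: str.find returns -1 when it is not, which silently drops the last character. Here is a slightly more defensive sketch; `version_prefix` is a hypothetical helper, and the assumption that cells look like "24.3Latest" is mine, not something shown above.

```python
def version_prefix(cell_text, marker="L"):
    """Return the text before the first marker, or the whole text if absent."""
    head, _sep, _tail = cell_text.partition(marker)
    return head.strip()


# Hypothetical cell texts; the real cells are assumed to look like "24.3Latest".
print(version_prefix("24.3Latest"))   # 24.3
print(version_prefix("0054"))         # 0054

# For comparison, slicing with find() drops the last character when the
# marker is missing, because find() returns -1.
text = "0054"
print(text[0:text.find("L")])         # 005
```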
The ASUS website uses JavaScript to render its content, so you would normally need Selenium or PhantomJS. However, I was able to locate the XHR call to the JSON API and request it directly :).
import requests
r = requests.get(
"https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=").json()
for item in r['Result']['Obj']:
    for data in item['Files']:
        print(data['Version'])
Output:
1.1.2.3_790
1.1.2.3_743
1.1.2.3_674
1.1.2.3_617
1.1.2.3_552
1.1.2.3_502
1.1.2.3_473
You can parse whatever from here :) https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=
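If the API ever returns a group without files, or omits a key, direct indexing raises KeyError or TypeError. Below is a small sketch of a more tolerant walk over the same keys (Result, Obj, Files, Version). The `sample_payload` is hypothetical and only mirrors the structure used above, including the made-up "Name" field; the real response is assumed to carry more fields.

```python
import json

# Hypothetical payload shaped like the keys used above (Result -> Obj -> Files
# -> Version); the real response is assumed to carry many more fields.
sample_payload = json.loads("""
{
  "Result": {
    "Obj": [
      {"Name": "Firmware", "Files": [{"Version": "1.1.2.3_790"}, {"Version": "1.1.2.3_743"}]},
      {"Name": "Empty group", "Files": []}
    ]
  }
}
""")


def extract_versions(payload):
    """Collect every Version string, tolerating missing or empty keys."""
    versions = []
    result = payload.get("Result") or {}
    for group in result.get("Obj") or []:
        for entry in group.get("Files") or []:
            version = entry.get("Version")
            if version:
                versions.append(version)
    return versions


print(extract_versions(sample_payload))    # ['1.1.2.3_790', '1.1.2.3_743']
print(extract_versions({"Result": None}))  # []
```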
I am trying to build an app that returns the top 10 youtube trending videos into an excel file but ran into an issue right at the beginning. For some reason, whenever I try to use "soup.find" on any of the id's on this YouTube page, it returns "None" as the result.
I have made sure that my spelling is perfect and everything but it still won't work. I have tried this same code using other sites and get the same error.
#What I did for Youtube which resulted in output being "None"
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
videos = soup.find(id= "contents")
print(videos)
I expect it to provide me with the HTML code that has this id that I have specified but it keeps saying "None".
The page relies heavily on JavaScript to modify the classes and attributes of tags, so what you see in Developer Tools isn't always what requests gives you. I recommend calling print(soup.prettify()) to see what markup you're actually working with.
You can use this script to get first 10 trending videos:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
for i, a in enumerate(soup.select('h3.yt-lockup-title a[title]')[:10], 1):
    print('{: <4}{}'.format(str(i)+'.', a['title']))
Prints (in my case in Estonia):
1. Jaanus Saks - Su säravad silmad
2. Егор Крид - Сердцеедка (Премьера клипа, 2019)
3. Comment Out #11/ Ольга Бузова х Фёдор Смолов
4. 5MIINUST x NUBLU - (ei ole) aluspükse
5. Артур Пирожков - Алкоголичка (Премьера клипа 2019)
6. Slav school of driving - driving instructor Boris
7. ЧТО ЕДЯТ В АРМИИ США VS РОССИИ?
8. RC Airplane Battle | Dude Perfect
9. ЧЕЙ КОРАБЛИК ОСТАНЕТСЯ ПОСЛЕДНИЙ, ПОЛУЧИТ 1000$ !
10. Khloé Kardashian's New Mom Beauty Routine | Beauty Secrets | Vogue
Since YouTube relies heavily on JavaScript to render pages and modify how they load, it's better to load the page in a browser and then pass its page source to BeautifulSoup. We use Selenium for this. Once you have the soup object, you can do whatever you want with it.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import os
driver = webdriver.Firefox(executable_path="/home/rishabh/Documents/pythonProjects/webScarapping/geckodriver")
driver.get('https://www.youtube.com/feed/trending')
content = driver.page_source
driver.close()
soup = BeautifulSoup(content, 'html.parser')
#Do whatever you want with it
Configure Selenium https://selenium-python.readthedocs.io/installation.html