Web Scraping problem through python, can't read html file? - python-3.x

I have been web scraping with Python for a while and recently came across this problem.
BeautifulSoup doesn't seem to be able to read the HTML file.
For example, I'm trying to scrape this page:
https://www.thetvdb.com/series/initial-d/episodes/4889010
And this is my code:
from bs4 import BeautifulSoup
import requests
url_episode = 'https://www.thetvdb.com/series/initial-d/episodes/4889010'
print(url_episode)
getdetail_episode = requests.get(url_episode)
soup = BeautifulSoup(getdetail_episode.content,'html.parser')
print(soup.prettify())
I was able to scrape data from other links, but not this one.
What else should I be doing to get this working?
Thanks
UPDATE
So I checked with Repl.it and other online Python interpreters, and the code worked there. WTF?
But it's not working when I run it from Sublime Text or Python IDLE on my computer?
I am confused.

Okay, so I think I figured it out.
The trouble seems to have been caused by the page's data loading with a delay, so the response my script received had no data to scrape.
I ended up fetching the page with requests-html (an HTMLSession) instead of plain requests, and I still parse it with BeautifulSoup.
So pretty much like this:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
# HTMLSession replaces the plain requests.get() call used before
session = HTMLSession()
url_episode = 'https://www.thetvdb.com/series/initial-d/episodes/4889010'
getdetail_episode = session.get(url_episode)
# The downloaded bytes are still parsed with BeautifulSoup
soup = BeautifulSoup(getdetail_episode.content,'html.parser')
print(soup.prettify())
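When the same script behaves differently in two environments, it can help to compare what each one actually receives before blaming the parser. The following is only a diagnostic sketch and not part of the fix above: `summarize_response` is a hypothetical helper, the marker string is whatever text you expect to find on the page, and the stand-in response object exists only so the example runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response, marker):
    """Return basic facts about an HTTP response for side-by-side comparison.

    `response` is anything with `status_code` and `content` (bytes), such as
    the object returned by requests.get(). `marker` is text you expect to
    see in the page when it loaded correctly.
    """
    body = response.content or b""
    return {
        "status": response.status_code,
        "bytes": len(body),
        "has_marker": marker.encode("utf-8") in body,
    }


# Stand-in object (an assumption for illustration) instead of a real request.
fake = SimpleNamespace(status_code=200, content=b"<html><h1>Initial D</h1></html>")
print(summarize_response(fake, "Initial D"))
```

Running the same helper on the real response on both machines shows whether the status code, the body size, or the expected text differs between them.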

Related

Why do I get gibberish when I try to web scrape Google search results?

I am trying to make a web scraper using Python 3. I am trying to scrape the URLs from the Google search page.
My code is as follows:
from bs4 import BeautifulSoup
import requests
e = requests.get("https://www.google.com/search?q=Keyword", verify=True)
soup = BeautifulSoup(e.content, 'html.parser')
print(soup.cite)
Now when I try to get the URLs (which are marked with the tag 'cite'), I don't get any.
Everything seems ok, but I don't get any results. What's going on?

How to know which tags to use in scraping?

Is there any logic which tags should be used in scraping?
Right now I'm just doing "trial-and-error" on different tag variations to see which works. It takes a lot of time and is really frustrating. I can't understand the logic as to why some tags work and some don't. For example, the code below works fine:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, 'html.parser')
test1 = soup.find_all('div', attrs={'id':'app'})
print(test1)
However, just a slight change to the code and the result is "None":
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, 'html.parser')
test2 = soup.find_all('div', attrs={'id':'YDC-Lead-Stack-Composite'})
print(test2)
Is there any logical explanation why the first example (test1) returns values and why the second example (test2) doesn't return any value?
Is there an efficient way to know which tags will work?
It looks to me like you're trying to scrape a React web app, which won't work with the usual approach of downloading the page and parsing it.
If you view the raw source (before the scripts have run), you'll find that the app is not loaded yet, because it runs in JavaScript and fetches its data afterwards.
There are two options here:
1. Find out if there is an API you can query (instead of scraping).
2. Load the page in a browser and use Selenium to scrape it (see https://selenium-python.readthedocs.io/getting-started.html).
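The raw-source check described in this answer can be tried without a browser. Below is a minimal sketch using only the standard library: `IdCollector` and `ids_in_raw_html` are hypothetical names, and the `raw` string is a made-up stand-in for the HTML a JavaScript-driven site might serve, not the real Yahoo Finance markup.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute present in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def ids_in_raw_html(html):
    """Return the set of ids that exist before any script has run."""
    collector = IdCollector()
    collector.feed(html)
    return collector.ids


# Made-up raw source: only an empty mount point is served, and the rest
# of the page would be created later by the script in a real browser.
raw = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'

ids = ids_in_raw_html(raw)
print("app" in ids)                       # the mount point is in the raw HTML
print("YDC-Lead-Stack-Composite" in ids)  # absent from this sample's raw HTML
```

If an id you see in the browser's developer tools is missing from the set built from the downloaded text, a plain HTTP request will not find it, and one of the two options above is needed.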

Web scraping issue searching for contents in Youtube trending page with BeautifulSoup

I am trying to build an app that writes the top 10 YouTube trending videos into an Excel file, but I ran into an issue right at the beginning. For some reason, whenever I try to use "soup.find" on any of the ids on this YouTube page, it returns "None" as the result.
I have made sure that my spelling is perfect and everything but it still won't work. I have tried this same code using other sites and get the same error.
#What I did for Youtube which resulted in output being "None"
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
videos = soup.find(id= "contents")
print(videos)
I expect it to provide me with the HTML code that has this id that I have specified but it keeps saying "None".
The page uses JavaScript heavily to modify the classes and attributes of tags. What you see in Developer Tools isn't always what requests gives you. I recommend calling print(soup.prettify()) to see what markup you're actually working with.
You can use this script to get the first 10 trending videos:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
for i, a in enumerate(soup.select('h3.yt-lockup-title a[title]')[:10], 1):
    print('{: <4}{}'.format(str(i)+'.', a['title']))
Prints (in my case in Estonia):
1. Jaanus Saks - Su säravad silmad
2. Егор Крид - Сердцеедка (Премьера клипа, 2019)
3. Comment Out #11/ Ольга Бузова х Фёдор Смолов
4. 5MIINUST x NUBLU - (ei ole) aluspükse
5. Артур Пирожков - Алкоголичка (Премьера клипа 2019)
6. Slav school of driving - driving instructor Boris
7. ЧТО ЕДЯТ В АРМИИ США VS РОССИИ?
8. RC Airplane Battle | Dude Perfect
9. ЧЕЙ КОРАБЛИК ОСТАНЕТСЯ ПОСЛЕДНИЙ, ПОЛУЧИТ 1000$ !
10. Khloé Kardashian's New Mom Beauty Routine | Beauty Secrets | Vogue
Since YouTube relies heavily on JavaScript to render and modify its pages, it's a better idea to load the page in a browser and then pass its page source to BeautifulSoup. We use Selenium for this purpose. Once the soup object is obtained, you can do whatever you want with it.
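from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
# Path to the geckodriver binary on the answerer's machine; adjust it for yours.
# Selenium 4 takes the driver path through a Service object instead of executable_path.
service = Service(executable_path="/home/rishabh/Documents/pythonProjects/webScarapping/geckodriver")
driver = webdriver.Firefox(service=service)
driver.get('https://www.youtube.com/feed/trending')
content = driver.page_source  # HTML after the browser has run the page's scripts
driver.quit()  # close the browser and end the driver session
soup = BeautifulSoup(content, 'html.parser')
# Do whatever you want with it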
Configure Selenium https://selenium-python.readthedocs.io/installation.html

How to webscrape flights using Python

I am scraping a website for flight tickets. My problem is this: I am using the Chrome developer tools to identify the class of the HTML object I want to scrape, but my code does not find it. It looks like I am not downloading the same HTML that I can see in the Chrome developer tools (Inspect).
import requests
from BeautifulSoup import BeautifulSoup
url = 'http://www.momondo.de/flightsearch/?Search=true&TripType=2&SegNo=2&SO0=BOS&SD0=LON&SDP0=07-09-2016&SO1=LON&SD1=BOS&SDP1=12-09-2016&AD=1&TK=ECO&DO=false&NA=false'
req = requests.get(url)
soup = BeautifulSoup(req.content)
x = soup.findAll("span" ,{"class":"value"} )
Please try the following:
from bs4 import BeautifulSoup
import urllib.request
source = urllib.request.urlopen('http://www.momon...e&NA=false').read()
soup = BeautifulSoup(source,'html5lib')
for item in soup.find_all("span", class_="value"):
    print(item.text)
With this you can scrape all the spans on the page that have the class "value". If you want to see the whole HTML element and its attributes instead of just the content, remove .text from print(item.text).
You will probably need to install html5lib with pip. If you have trouble doing this, try running CMD as admin (assuming you are using Windows).
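If installing html5lib is a problem, BeautifulSoup can also work with other parsers. The sketch below is one possible way to fall back to whichever parser is available: `make_soup` is a hypothetical helper and the sample markup is made up. `FeatureNotFound` is the exception BeautifulSoup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Parse markup with the first parser from `parsers` that is installed.

    Returns the soup together with the name of the parser that was used.
    """
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


# Made-up sample markup standing in for the downloaded page.
soup, used = make_soup('<span class="value">123</span>')
print(used, [s.text for s in soup.find_all("span", class_="value")])
```

The built-in html.parser comes last because it ships with Python, so the loop only fails if an empty or invalid parser list is passed in.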
You can also try this:
for values_in_x in x:
    print(values_in_x.text)

Python 3 web scraping options

I'm new to Python so I'm sorry if this is a newbie question.
I'm trying to build a program involving webscraping and I've noticed that Python 3 seems to have significantly fewer web-scraping modules than the Python 2.x series.
Beautiful Soup, mechanize, and scrapy -- the three modules recommended to me -- all seem to be incompatible.
I'm wondering if anyone on this forum has a good option for webscraping using python 3.
Any suggestions would be greatly appreciated.
Thanks,
Will
lxml.html works on Python 3, and gets you html parsing, at least.
BeautifulSoup 4, which is in the works, should support Python 3 (I've done some work on this).
I'm kind of new to this, but I found BeautifulSoup 4 to be really good, and I'm learning and using it together with the requests and lxml modules. The requests module fetches the URL, and lxml does the parsing (you can also use the built-in html.parser, but lxml is faster, I guess).
Simple usage is:
import requests
from bs4 import BeautifulSoup
url = 'someUrl'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
A less simple example, collecting the hrefs from the HTML:
links = set()
for link in soup.find_all('a'):
    if 'href' in link.attrs:
        links.add(link['href'])  # store the URL string, not the whole tag
Then you will have a set of the unique links found at your URL.
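The href values in a page are often relative or carry a fragment, so the same target can show up in several forms. As a hedged sketch of one way to normalize them with the standard library: `absolute_links` is a hypothetical helper, and the base URL and href values are made-up examples.

```python
from urllib.parse import urldefrag, urljoin


def absolute_links(base_url, hrefs):
    """Turn raw href values into a set of absolute http(s) URLs without fragments."""
    result = set()
    for href in hrefs:
        absolute = urljoin(base_url, href)         # resolve relative paths
        absolute, _fragment = urldefrag(absolute)  # drop any "#section" part
        if absolute.startswith(("http://", "https://")):
            result.add(absolute)                   # skip mailto:, javascript:, etc.
    return result


# Made-up href values as they might appear in a page.
hrefs = ["/about", "about", "https://example.com/about#team", "mailto:a@example.com"]
print(sorted(absolute_links("https://example.com/", hrefs)))
```

Whether dropping fragments and non-HTTP schemes is right depends on what the links are collected for; both steps are choices made for this illustration.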
Another example shows how to parse specific parts of the HTML, e.g. if you wish to parse all <p> tags that have the class testClass:
list_of_p = []
for p in soup.find_all('p', {'class': 'testClass'}):
    for item in p:
        list_of_p.append(item)
There is a lot more you can do with it, and it is as easy as it seems.

Resources