Web Scraping: My first project and no idea where to start - python-3.x

I have very little experience but a lot of persistence. One of my hobbies is football, and I help out at a local team. I'm a big fan of statistics, and unfortunately the only way to collect data from the local football federation is by web scraping. I have seen that Python with the BeautifulSoup package can help, but I cannot identify the tags, as I believe the data sits in a table.
My goal is to automatically collect the information to build up a database with players, fixtures, teams, referees, etc., and derive stats such as how many times a player has been in the starting line-up, when teams are more likely to score a goal, and when they are more likely to concede one.
I have two links for reference.
The first is for the general fixtures of a given group.
The second is for the details of a given match.
Any clues on where to start with the parsing would be great.

You should first fetch the page you want using a native Python library like requests. Then, depending on the page, use bs4 (BeautifulSoup) to find what you are looking for inside it. Next, transform and refine that information into variables (lists or DataFrames, based on your needs), and finally save those variables into a dataset. Here is an example:
import requests
from bs4 import BeautifulSoup  # assuming you already have bs4 installed

# requests needs the scheme (http:// or https://) in the URL
page = requests.get('http://www.DESTINATIONSITE.com/BLAHBLAH')
soup = BeautifulSoup(page.text, 'html.parser')
soup_str = soup.text  # plain-text content of the whole page
Up to this point, you have the text of the entire page at hand. Now you can use regular expressions or any other Python code to separate the information you want from soup_str and store it in variable(s).
For the rest of the process, if you want to save that data, I suggest you look into libraries like pandas.
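For example, here is a minimal sketch of that last step with pandas, assuming you have already refined the scraped data into a list of dicts (the column names below are just placeholders):

import pandas as pd

# hypothetical rows produced by the parsing step above
rows = [{'player': 'Jane Doe', 'team': 'Team A', 'goals': 1}]
df = pd.DataFrame(rows)
df.to_csv('stats.csv', index=False)  # persist the dataset for later analysis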

I have made some progress:
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

# send request
url = 'http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# read acta
acta_text = []
acta_text_element = soup.find_all(class_='acta-table')
for item in acta_text_element:
    acta_text.append(item.text)
When I print the items I get many \n characters.
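Those \n characters come from the nested markup inside the table; get_text() with a separator and strip=True cleans most of them up. A minimal sketch, reusing acta_text_element from the code above:

acta_text = [item.get_text(separator=' ', strip=True)
             for item in acta_text_element]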

Mahdi_J, thanks for the initial response. I have already started on the approach:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c,"html.parser")
What I need is some support on where to start with the parsing. From the URL I need to collect the players for both teams, the goals, and the bookings, so I can save them elsewhere. What I'm having trouble with is how to parse. I have very few skills and need some starting point for the parsing.
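One possible starting point is to walk the rows and cells of each acta-table. This is only a sketch, since the exact classes and table layout on fcf.cat may differ:

for table in soup.find_all(class_='acta-table'):
    for row in table.find_all('tr'):
        # collect the cell texts of one row (e.g. a player, a goal, a booking)
        cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
        if cells:
            print(cells)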

Related

Python web-scraping misses an element from the list of searched objects

I'm trying to scrape some data using the beautifulsoup and requests libraries in Python 3.7. For each of the items (tag article) on this webpage, there is a YouTube link. After finding all the instances of article, I can successfully extract the headlines. This code also successfully finds instances of the youtube-player class inside each article, except at index 7, where the output is None.
from bs4 import BeautifulSoup
import requests

url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")

articles = soup.find_all('article')
for article in articles:
    headline = article.h2.a.text
    print(headline)
    link = article.find('iframe', {'class': 'youtube-player'})
    print(link)
However, from the source (output of beautifulsoup), if I directly search for youtube-player, I get all the instances correctly.
links = soup.find_all('iframe', {'class': 'youtube-player'})
for link in links:
    print(link)
How can I improve my code to get all the youtube-player instances within the article loop?
You can use the zip() built-in function to tie titles and YouTube links together.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for title, player in zip(soup.select('.entry-title'),
                         soup.select('iframe.youtube-player')):
    print('{:<75}{}'.format(title.text, player['src']))
Prints:
Git: Difference between “add -A”, “add -u”, “add .”, and “add *” https://www.youtube.com/embed/tcd4txbTtAY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Programming Terms: Combinations and Permutations https://www.youtube.com/embed/QI9EczPQzPQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Chrome Quick Tip: Quickly Bookmark Open Tabs for Later Viewing https://www.youtube.com/embed/tsiSg_beudo?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Comprehensions – How they work and why you should be using them https://www.youtube.com/embed/3dt4OGnU5sM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Generators – How to use them and the benefits you receive https://www.youtube.com/embed/bD05uGo_sVI?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Quickest and Easiest Way to Run a Local Web-Server https://www.youtube.com/embed/lE6Y6M9xPLw?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Git for Beginners: Command-Line Fundamentals https://www.youtube.com/embed/HVsySz-h9r4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Time-Saving Keyboard Shortcuts for the Mac Terminal https://www.youtube.com/embed/TXzrk3b9sKM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Overview of Online Learning Resources in 2015 https://www.youtube.com/embed/QGy6M8HZSC4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Else Clauses on Loops https://www.youtube.com/embed/Dh-0lAyc3Bc?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
EDIT: It seems that when you use html.parser, BeautifulSoup doesn't recognize the YouTube link in one place; use lxml or html5lib instead:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "lxml")
for article in soup.select('article'):
    title = article.select_one('.entry-title')
    player = article.select_one('iframe.youtube-player') or {'src': ''}
    print('{:<75}{}'.format(title.text, player['src']))

Interacting with a website and getting data using python

I am trying to interact with a website. For my data analysis project, I have a list of 1 million websites and I want to find the category of each one, which is why I am using that site.
Now, I want to automate the process of typing out 1 million websites and getting their categories. I want to use Python for this. Can anyone suggest ideas on how I can do this?
You can use BeautifulSoup, e.g.:
import requests, traceback
from bs4 import BeautifulSoup

domains = ["duckduckgo.com", "opensource.com"]
for dom in domains:
    try:
        req = requests.get(f"https://fortiguard.com/webfilter?q={dom}&version=8")
        if req.status_code == 200:
            soup = BeautifulSoup(req.text, 'html.parser')
            cat = soup.find("meta", property="description")["content"].split(":")[1].strip()
            print(dom, cat)
    except:
        print(traceback.format_exc())
Output:
duckduckgo.com Search Engines and Portals
opensource.com Information Technology
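A side note for a list as large as a million sites: a requests.Session reuses connections instead of opening a new one per request, and a short delay between calls keeps the scraping polite. A sketch under those assumptions, using the same endpoint as above:

import time
import requests

domains = ["duckduckgo.com", "opensource.com"]  # stand-in for the full list
session = requests.Session()  # reuses the underlying connection across requests
for dom in domains:
    req = session.get(f"https://fortiguard.com/webfilter?q={dom}&version=8")
    print(dom, req.status_code)  # parse req.text as in the snippet above
    time.sleep(1)  # a short delay avoids hammering the service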

Python - requests, lxml and xpath not working

I am trying to write some Python to scrape the web for firmware/driver updates, but different web pages respond differently.
I've used the requests and lxml packages to find the information based on XPath. The XPath was found by opening the URL in Chrome, right-clicking on the data and inspecting it, then right-clicking again on the highlighted code and selecting Copy XPath.
WORKING EXAMPLE
Intel NUC at https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK.
As of 2019-12-25, the data value it correctly picks up is "24.3".
import requests
from lxml import html

url = "https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK"
page = requests.get(url)
tree = html.fromstring(page.content)  # build the element tree before querying it
XpathToFWtype = '//*[@id="search-results"]/tbody/tr[1]/td[4]/text()'
tree.xpath(XpathToFWtype)
FAILING EXAMPLE
Similar logic fails for the ASUS website, where it should scrape the firmware text Version 1.1.2.3_790:
https://www.asus.com/lk/Networking/DSL-AC56U/HelpDesk_BIOS/
The failing XPath, as copied from the Inspect panel, is:
//*[@id="Manual-Download"]/div[2]/div[2]/div/div/section/div[1]/div[1]/span[1]
Everything I try fails, whether I add "/text()" or any variation. The web pages differ in that "view source" shows the text for the Intel URL but not for the ASUS one, so the content is being generated dynamically somewhere. After days of trying everything, I am unsure what to do next.
import requests
from lxml import html

url = "https://www.asus.com/lk/Networking/DSL-AC56U/HelpDesk_BIOS/"
page = requests.get(url)
tree = html.fromstring(page.content)
XpathToFWtype = '//*[@id="Manual-Download"]/div[2]/div[2]/div/div/section/div[1]/div[1]/span[1]/text()'
tree.xpath(XpathToFWtype)
# etc -> many traceback errors from lxml :-(
Thanks for any suggestion or direction, it's really appreciated.
For the Intel website you can do the following:
import requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://downloadcenter.intel.com/product/76977/Intel-NUC-Kit-D54250WYK")
soup = BeautifulSoup(r.text, 'html.parser')

for item in soup.findAll("td", {'class': 'dc-version collapsible-col collapsible1'}):
    item = item.text
    print(item[0:item.find("L")])
Output:
24.3
0054
1.0.0
6.1.9
15.40.41.5058
1.01
1
6.0.1.7982
11.0.6.1194
15.36.28.4332
15.40.13.4331
15.36.26.4294
14.5.0.1081
2.4.2013.711
10.1.1.8
10.0.27
2.4.2013.711
2.4.2013.711
The ASUS website is actually using JavaScript to render its content, so you would normally need Selenium or PhantomJS. But I've been able to locate the XHR call to the JSON API and hit it with a plain request :).
import requests

r = requests.get(
    "https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=").json()

for item in r['Result']['Obj']:
    for data in item['Files']:
        print(data['Version'])
Output:
1.1.2.3_790
1.1.2.3_743
1.1.2.3_674
1.1.2.3_617
1.1.2.3_552
1.1.2.3_502
1.1.2.3_473
You can parse whatever you need from here :) https://www.asus.com/support/api/product.asmx/GetPDBIOS?website=lk&pdhashedid=RtHWWdjImSzhdG92&model=DSL-AC56U&cpu=

Web scraping issue searching for contents in Youtube trending page with BeautifulSoup

I am trying to build an app that returns the top 10 YouTube trending videos in an Excel file, but I ran into an issue right at the beginning. For some reason, whenever I try to use soup.find on any of the ids on this YouTube page, it returns None as the result.
I have made sure that my spelling is perfect, but it still won't work. I have tried the same code on other sites and get the same error.
#What I did for Youtube which resulted in output being "None"
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
videos = soup.find(id="contents")
print(videos)
I expect it to give me the HTML code for the id I specified, but it keeps returning None.
The page is using heavy JavaScript to modify the classes and attributes of tags. What you see in Developer Tools isn't always what requests gives you. I recommend calling print(soup.prettify()) to see what markup you're actually working with.
You can use this script to get the first 10 trending videos:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.youtube.com/feed/trending')
soup = BeautifulSoup(page.content, 'html.parser')
for i, a in enumerate(soup.select('h3.yt-lockup-title a[title]')[:10], 1):
    print('{: <4}{}'.format(str(i)+'.', a['title']))
Prints (in my case in Estonia):
1. Jaanus Saks - Su säravad silmad
2. Егор Крид - Сердцеедка (Премьера клипа, 2019)
3. Comment Out #11/ Ольга Бузова х Фёдор Смолов
4. 5MIINUST x NUBLU - (ei ole) aluspükse
5. Артур Пирожков - Алкоголичка (Премьера клипа 2019)
6. Slav school of driving - driving instructor Boris
7. ЧТО ЕДЯТ В АРМИИ США VS РОССИИ?
8. RC Airplane Battle | Dude Perfect
9. ЧЕЙ КОРАБЛИК ОСТАНЕТСЯ ПОСЛЕДНИЙ, ПОЛУЧИТ 1000$ !
10. Khloé Kardashian's New Mom Beauty Routine | Beauty Secrets | Vogue
Since YouTube uses a lot of JavaScript to render and modify the way pages load, it's a better idea to load the page in a browser and then feed its page source to BeautifulSoup. We use Selenium for this purpose. Once the soup object is obtained, you can do whatever you want with it.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox(executable_path="/home/rishabh/Documents/pythonProjects/webScarapping/geckodriver")
driver.get('https://www.youtube.com/feed/trending')
content = driver.page_source  # fully rendered HTML after JavaScript has run
driver.close()

soup = BeautifulSoup(content, 'html.parser')
# Do whatever you want with it
Configure Selenium https://selenium-python.readthedocs.io/installation.html

Python BeautifulSoup

I am using Python BeautifulSoup to extract some data from a famous song site.
Here is the snippet of code:
import requests
from bs4 import BeautifulSoup

url = 'https://gaana.com/playlist/gaana-dj-bollywood-top-50-1'
res = requests.get(url)
while res.status_code != 200:
    try:
        res = requests.get(url)  # retry until the request succeeds
    except:
        pass
print(res)

soup = BeautifulSoup(res.text, 'lxml')
songs = soup.find_all('meta', {'property': 'music:song'})
print(songs[0])
Here is the sample output:
<Response [200]>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
Now I want to extract the URL inside the content attribute as a string so that I can use it further in my program.
Can someone please help me?
It's in the comments, but I just want to explain: BeautifulSoup returns most results as a list or other iterable object. You show that you understand this in your code by using songs[0], but in this case the element's attributes behave like a dictionary.
As explained in this StackOverflow post, you need to query not only songs[0] but also the attribute key within it (the key and its value together are called a key-value pair, and keys are the chief way to get data out of a dictionary).
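For example, with the songs list from the question:

url = songs[0]['content']  # tag attributes support dictionary-style access
print(url)  # https://gaana.com/song/o-saathi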
Last note: while I've been a big fan of BeautifulSoup4 for basic web scraping, you may want to consider the lxml library. It's pretty well documented; to really take advantage of it you have to learn XPath, which is sort of like regex for XML/HTML. For advanced scraping it's probably the best option short of Selenium, and it returns cleaner data than bs4.
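As a rough sketch of the same extraction with lxml and XPath (assuming the page serves the same meta tags to a plain request):

from lxml import html
import requests

res = requests.get('https://gaana.com/playlist/gaana-dj-bollywood-top-50-1')
tree = html.fromstring(res.content)
urls = tree.xpath('//meta[@property="music:song"]/@content')  # all song URLs at once
print(urls[:3])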
Good luck!
