How can I parse a selected table from webpage with Python BeautifulSoup - python-3.x

I want to parse the results table from a local sports event (the page basically just contains a table), but when I try the script below I only get the "menu", not the actual result list. What am I missing?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
site = "https://rittresultater.no/nb/sb_tid/923?pv2=11027&pv1=U"
html = urlopen(site)
soup = BeautifulSoup(html, "lxml") #BeautifulSoup(urlopen(html, "lxml"))
table = soup.select("table")
df = pd.read_html(str(table))[0]
print(df)

This is happening because there are two <table>s on that page. You can either query on the class name of the table you want (in this case .table-condensed) using the class_ parameter of the find() function, or you can just grab the second table in the list of all tables using the find_all() function.
Solution 1:
table = soup.find('table', class_='table-condensed')
print(table)
Solution 2:
tables = soup.find_all('table')
print(tables[1])
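As a minimal sketch of both solutions, the snippet below runs against a small inline HTML string rather than the live page. The markup, the `menu` class and the row contents are made up for illustration; only the `table-condensed` class comes from the answer above. It uses the built-in `html.parser`, so nothing beyond bs4 is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: a menu table followed by the results table.
SAMPLE_HTML = """
<table class="menu"><tr><td>Home</td></tr></table>
<table class="table-condensed"><tr><td>1</td><td>Runner A</td></tr></table>
"""


def pick_results_table(html):
    """Return the results table twice: found by class name and by position."""
    soup = BeautifulSoup(html, "html.parser")
    by_class = soup.find("table", class_="table-condensed")  # Solution 1
    by_index = soup.find_all("table")[1]                     # Solution 2
    return by_class, by_index


by_class, by_index = pick_results_table(SAMPLE_HTML)
print(by_class.get_text(" ", strip=True))
```

Selecting by class is usually the more robust of the two, since it keeps working if the page adds or reorders tables.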

Related

How to filter on this artifact in the HTML?

I am using this code and it works:
from bs4 import BeautifulSoup
import sys
import requests
page = requests.get("https://finance.yahoo.com/quote/GOOGL?p=GOOGL")
soup = BeautifulSoup(page.content, 'html.parser')
fin_streamer = soup.find("fin-streamer", class_="Fz(36px)")
print(fin_streamer)
print(fin_streamer.get_text())
It prints this for fin-streamer:
<fin-streamer active="" class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-field="regularMarketPrice" data-pricehint="2" data-reactid="47" data-symbol="GOOGL" data-test="qsp-price" data-trend="none" value="2897.04">2,897.04</fin-streamer>
What I'd like to do is filter on something more useful than the Fz(36px) class, such as
data-symbol="GOOGL"
but I don't know the syntax for that.
How to select?
To select the element more specifically, use CSS selectors:
soup.select_one('fin-streamer[data-symbol="GOOGL"]')
The line above selects the first <fin-streamer> with the attribute data-symbol="GOOGL". To get its value, index it with ['value']; alternatively, read its .text attribute.
Note: the two differ in format: the value attribute is 2897.04, while the text is 2,897.04.
Example
from bs4 import BeautifulSoup
import requests
page = requests.get("https://finance.yahoo.com/quote/GOOGL?p=GOOGL")
soup = BeautifulSoup(page.content, 'html.parser')
soup.select_one('fin-streamer[data-symbol="GOOGL"]')['value']
You can also pass a dictionary of attributes to find():
fin_streamer = soup.find("fin-streamer", {"data-symbol":"GOOGL"})
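The two answers can be compared side by side in a small sketch. The HTML string below is a shortened copy of the <fin-streamer> element printed in the question, so the example does not depend on the live Yahoo page; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <fin-streamer> element shown in the question.
SAMPLE_HTML = (
    '<fin-streamer class="Fw(b) Fz(36px)" data-symbol="GOOGL" '
    'data-field="regularMarketPrice" value="2897.04">2,897.04</fin-streamer>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Approach 1: CSS attribute selector.
via_select = soup.select_one('fin-streamer[data-symbol="GOOGL"]')
# Approach 2: attribute dictionary passed to find().
via_find = soup.find("fin-streamer", {"data-symbol": "GOOGL"})

raw_value = via_select["value"]    # attribute value, no thousands separator
shown_text = via_find.get_text()   # visible text, with thousands separator
print(raw_value, shown_text)
```

If the number is needed for arithmetic, the `value` attribute is the easier one to convert, since `float(raw_value)` works without stripping the comma first.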

My code doesn't find a table in Wikipedia

I'm trying to grab the last table (titled "Registro de los casos") on this Wikipedia page
with this Python 3.7 code
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
def webcrawler():
    url = "https://es.wikipedia.org/wiki/Pandemia_de_enfermedad_por_coronavirus_de_2020_en_Argentina"#Cronolog%C3%ADa"
    page = requests.get(url)
    soup = BeautifulSoup(page.text,"html.parser")
    tables = soup.findAll("table", class_='wikitable')[0]
    #print(tables)
    for table in tables:
        if isinstance(table, NavigableString):
            continue
        ths = table.find_all('th')
        headings = [th.text.strip() for th in ths]
        print(headings)
webcrawler()
But it only finds the first table, and not the last. What am I doing wrong?
You set tables to the first item returned by soup.findAll("table", class_='wikitable') by indexing it with [0], so the loop then iterates over the children of that first table. If you remove [0], tables holds every table with that class, and the last one is tables[-1].
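A minimal sketch of the fix, assuming a tiny inline document instead of the Wikipedia page. The two tables and their heading texts are hypothetical; only the `wikitable` class comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables of the same class.
SAMPLE_HTML = """
<table class="wikitable"><tr><th>First</th></tr></table>
<table class="wikitable"><tr><th>Fecha</th><th>Casos</th></tr></table>
"""


def last_table_headings(html):
    """Return the <th> texts of the last table with class 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table", class_="wikitable")  # no [0]: keep the whole list
    last = tables[-1]                                     # negative index picks the last table
    return [th.get_text(strip=True) for th in last.find_all("th")]


print(last_table_headings(SAMPLE_HTML))
```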

Beautiful Soup Error: Trying to retrieve data from web page returns empty array

I am trying to download a list of voting intention opinion polls from this web page using beautiful soup. However, the code I wrote returns an empty array or nothing. The code I used is below:
The page code is like this:
<div class="ST-c2-dv1 ST-ch ST-PS" style="width:33px"></div>
<div class="ST-c2-dv2">41.8</div>
That's what I tried:
import requests
from bs4 import BeautifulSoup
request = requests.get(quote_page) # take the page link
page = request.content # extract page content
soup = BeautifulSoup(page, "html.parser")
# extract all the divs
for each_div in soup.findAll('div',{'class':'ST-c2-dv2'}):
    print(each_div)
At this point, it prints nothing.
I've tried also this:
tutti_a = soup.find_all("html_element", class_="ST-c2-dv2")
and also:
tutti_a = soup.find_all("div", class_="ST-c2-dv2")
But I get an empty array [] or nothing at all
I think you can request the following URL directly:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.marktest.com/wap/a/sf/v~[73D5799E1B0E]/name~Dossier_5fSondagensLegislativas_5f2011.HighCharts.Sondagens.xml.aspx')
soup = bs(r.content, 'lxml')
results = []
for record in soup.select('p'):
    results.append([item.text for item in record.select('b')])
df = pd.DataFrame(results)
print(df)
Columns 5, 6, 7, 8, 9 and 10 correspond to PS, PSD, CDS, CDU, Bloco and Outros/Brancos/Nulos.
You can drop unwanted columns, add appropriate headers etc.
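Dropping and renaming columns could look roughly like the sketch below. It assumes the DataFrame has the default integer column labels 0 to 10 and that "columns 5 to 10" refers to those labels; the helper name `label_party_columns` and the placeholder row are hypothetical and do not reflect real poll data.

```python
import pandas as pd

# Mapping from default integer column labels to the party names given in the answer.
PARTY_COLUMNS = {
    5: "PS",
    6: "PSD",
    7: "CDS",
    8: "CDU",
    9: "Bloco",
    10: "Outros/Brancos/Nulos",
}


def label_party_columns(df):
    """Keep only the party columns and give them readable headers."""
    trimmed = df[list(PARTY_COLUMNS)]            # drop the unwanted columns
    return trimmed.rename(columns=PARTY_COLUMNS)  # add appropriate headers


# Placeholder row only, so the sketch runs without the network.
demo = pd.DataFrame([[f"c{i}" for i in range(11)]])
labelled = label_party_columns(demo)
print(labelled.columns.tolist())
```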

Scraping with Python 3

Python3:
I'm new to scraping and, as practice, I'm trying to get all the functions from this page:
https://www.w3schools.com/python/python_ref_functions.asp
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
print(soup.td.text)
# Output: abs()
No matter what I try, I only get the first one: abs()
Can you help me get them all from abs() to zip()?
To get all matching tags from a page, use find_all(), which returns a list of items.
To get a single tag, use find(), which returns a single item.
The trick is to get the parent tag of all the elements you need, then use whichever methods are most convenient.
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
#scrape table which contains all functions
tabledata = soup.find("table", attrs={"class": "w3-table-all notranslate"})
#print(tabledata)
#from table data get all a tags of functions
functions = tabledata.find_all("a")
#find_all() method returns list of elements iterate over it
for func in functions:
    print(func.contents)
You can use find_all to iterate through all the descendants that match the tag name:
for tag in soup.find_all('td'):
    print(tag.text)
This will include the Description column though, so you'll need to change it to ignore those cells.
soup.td only returns the first matching tag.
So one solution would be:
for tag in soup.find_all('tr'):
    cell = tag.td
    if cell:
        print(cell.text)
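The row-by-row approach can be tried offline with a sketch like this one. The inline table is a hypothetical, shortened stand-in for the w3schools page (two rows, with made-up descriptions), and `first_column` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the reference table on the page.
SAMPLE_HTML = """
<table>
  <tr><th>Function</th><th>Description</th></tr>
  <tr><td>abs()</td><td>Placeholder description</td></tr>
  <tr><td>zip()</td><td>Placeholder description</td></tr>
</table>
"""


def first_column(html):
    """Collect the text of the first <td> in every row that has one."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for row in soup.find_all("tr"):
        cell = row.td          # first <td> in this row, or None for the header row
        if cell is not None:
            names.append(cell.get_text(strip=True))
    return names


print(first_column(SAMPLE_HTML))
```

The header row contains only <th> cells, so `row.td` is None there and the row is skipped.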

Using Beautifulsoup to parse a big comment?

I'm using BS4 to parse this webpage:
You'll notice there are two separate tables on the page. Here's the relevant snippet of my code, which successfully returns the data I want from the first table but does not find anything from the second table:
# import packages
import urllib3
import certifi
from bs4 import BeautifulSoup
import pandas as pd
#settings
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())
gamelog_offense = []
#scrape the data and write the .csv files
url = "https://www.sports-reference.com/cfb/schools/florida/2018/gamelog/"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="html.parser")
cnt = 0
for row in soup.findAll('tr'):
    try:
        col=row.findAll('td')
        Pass_cmp = col[4].get_text()
        Pass_att = col[5].get_text()
        gamelog_offense.append([Pass_cmp, Pass_att])
        cnt += 1
    except:
        pass
print("Finished writing with " + str(cnt) + " records")
Finished writing with 13 records
I've verified that the data from the second table is contained in the soup (I can see it). After a lot of troubleshooting, I discovered that the entire second table sits inside one big HTML comment (why?). I've managed to extract this comment into a single Comment object using the code below, but I can't figure out what to do with it next to extract the data I want. Ideally, I'd like to parse the comment in the same way I'm successfully parsing the first table. I've tried ideas from similar Stack Overflow questions (Selenium, PhantomJS) with no luck.
import bs4
defense = soup.find(id="all_defense")
for item in defense.children:
    if isinstance(item, bs4.element.Comment):
        big_comment = item
print(big_comment)
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
...and so on....
Posting an answer here in case others find it helpful. Many thanks to @TomasCarvalho for pointing me toward a solution. I was able to pass the big comment as HTML into a second soup instance using the following code, and then use the original parsing code on the new soup instance. (Note: the try/except is there because some teams have no gamelog, and you can't call .children on a NoneType.)
try:
    defense = soup.find(id="all_defense")
    for item in defense.children:
        if isinstance(item, bs4.element.Comment):
            html = item
    Dsoup = BeautifulSoup(html, features="html.parser")
except:
    html = ''
    Dsoup = BeautifulSoup(html, features="html.parser")
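The same idea can be written without a bare try/except, as in the sketch below. It is a variation on the answer rather than the author's exact code: `soup_from_comment` is a hypothetical helper, and the inline HTML (a table with two made-up cells wrapped in a comment inside the `all_defense` container) stands in for the real page.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in: a table hidden inside an HTML comment.
SAMPLE_HTML = """
<div id="all_defense">
<!--
<table id="defense"><tr><td>12</td><td>20</td></tr></table>
-->
</div>
"""


def soup_from_comment(html, container_id):
    """Parse the first comment inside the given container as its own soup.

    Returns an empty soup when the container or the comment is missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:                       # e.g. a team with no gamelog
        return BeautifulSoup("", "html.parser")
    for item in container.children:
        if isinstance(item, Comment):
            return BeautifulSoup(str(item), "html.parser")
    return BeautifulSoup("", "html.parser")


inner = soup_from_comment(SAMPLE_HTML, "all_defense")
cells = [td.get_text() for td in inner.find_all("td")]
print(cells)
```

Checking for None explicitly keeps unrelated errors visible, which a bare `except:` would silently swallow.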

Resources