Placing BeautifulSoup data into a Pandas dataframe - coming up blank - python-3.x

Goal: I'd like to create a data frame after scraping data from a website and narrowing it down to the table of interest (average meat consumption per capita for every country in the world).
Problem: I have isolated the table of interest, but everything I try to load it into a data frame ends up with a blank data frame.
Output:
<table class="wikitable sortable">
<caption>Countries by meat consumption per capita
</caption>
<tbody><tr>
<th>Country</th>
<th>kg/person (2002)<sup class="reference" id="cite_ref-9">[9]</sup><sup class="reference" id="cite_ref-11">[note 1]</sup></th>
<th>kg/person (2009)<sup class="reference" id="cite_ref-FAO2013_10-1">[10]</sup></th>
<th>kg/person (2017)<sup class="reference" id="cite_ref-12">[11]</sup>
</th></tr>
<tr>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="700" data-file-width="980" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Flag_of_Albania.svg/21px-Flag_of_Albania.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Flag_of_Albania.svg/32px-Flag_of_Albania.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/36/Flag_of_Albania.svg/42px-Flag_of_Albania.svg.png 2x" width="21"/> </span>Albania</td>
<td>38.2</td>
<td></td>
<td>
</td></tr>
<tr>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/77/Flag_of_Algeria.svg/23px-Flag_of_Algeria.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/77/Flag_of_Algeria.svg/35px-Flag_of_Algeria.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/77/Flag_of_Algeria.svg/45px-Flag_of_Algeria.svg.png 2x" width="23"/> </span>Algeria</td>
<td>18.3</td>
<td>19.5</td>
<td>17.33
</td></tr>
<tr>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="500" data-file-width="1000" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Flag_of_American_Samoa.svg/23px-Flag_of_American_Samoa.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Flag_of_American_Samoa.svg/35px-Flag_of_American_Samoa.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/87/Flag_of_American_Samoa.svg/46px-Flag_of_American_Samoa.svg.png 2x" width="23"/> </span>American Samoa</td>
<td>24.9</td>
<td>26.8</td>
<td>
</td></tr>
<tr>
I am looking to pull the following column titles for a chart on meat consumption per capita for all of the countries in the world: Country, kg/person (2002), kg/person (2009), kg/person (2017)
My Code:
A = []
B = []
C = []
for row in table_meat1.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
Need help placing the data into a data frame!
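The table pasted above likely explains the blank result: each data row has four <td> cells (country plus three yearly values), so the len(cells)==3 check never matches and the lists stay empty. A minimal sketch of the fix plus the DataFrame step, run against the rows shown in the question (assumes pandas and BeautifulSoup are installed):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for the scraped wikitable; in the real script, `html` would be
# the markup of `table_meat1` from the question.
html = """
<table class="wikitable sortable">
<tr><th>Country</th><th>kg/person (2002)</th><th>kg/person (2009)</th><th>kg/person (2017)</th></tr>
<tr><td>Albania</td><td>38.2</td><td></td><td></td></tr>
<tr><td>Algeria</td><td>18.3</td><td>19.5</td><td>17.33</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 4:  # country + three yearly columns, not 3
        rows.append([td.get_text(strip=True) for td in cells])

df = pd.DataFrame(rows, columns=["Country", "kg/person (2002)",
                                 "kg/person (2009)", "kg/person (2017)"])
print(df)
```

Collecting each row as a list and building the DataFrame once at the end avoids keeping three parallel lists in sync.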

The answer to this question would be:
Use Selenium with the Chrome driver. To do so, install the package:
pip install selenium
Then download the appropriate ChromeDriver for your OS from here (I checked version 86.0.4240.22 and it worked fine).
Unzip it and put it somewhere like /Users/admin/software/chromedriver.
Then run this code:
from selenium import webdriver
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Metagenics-Ultra-Potent-C-1000-Count/dp/B004GLEUHI/ref=sr_1_2_sspa?crid=11YWA9XFVALBP&dchild=1&keywords=metagenics&qid=1603050330&sprefix=metageni%2Caps%2C224&sr=8-2-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFRRDdMVU5GNDFKQ1QmZW5jcnlwdGVkSWQ9QTA1NTc3NzAxSFYxV0k5MlFGUUZTJmVuY3J5cHRlZEFkSWQ9QTA2MzM0MzAyWDBDSjNCNlFGRVJNJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
options = webdriver.ChromeOptions()
options.add_argument('headless')  # run Chrome without opening a window
driver = webdriver.Chrome("/Users/admin/software/chromedriver", chrome_options=options)
driver.implicitly_wait(5)
driver.get(URL)
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
price = soup.find('table', class_='wikitable sortable')
print(price)
But be aware that web scraping is forbidden on some websites, in which case you have to use the web API they provide.
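For a static Wikipedia table like the one in the question, Selenium may not even be needed: pd.read_html can parse the table markup directly. A small sketch against a trimmed snippet (assumes lxml, which read_html uses by default, is installed):

```python
from io import StringIO
import pandas as pd

# Trimmed stand-in for the page source; in practice this would be the
# fetched HTML of the Wikipedia article.
html = """
<table class="wikitable sortable">
<tr><th>Country</th><th>kg/person (2002)</th><th>kg/person (2009)</th><th>kg/person (2017)</th></tr>
<tr><td>Algeria</td><td>18.3</td><td>19.5</td><td>17.33</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
df = pd.read_html(StringIO(html))[0]
print(df)
```

read_html also infers the header row from the <th> cells and converts the numeric columns automatically.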

Related

Python + BeautifulSoup: Can't seem to scrape the specific data that I want from a website due to the website's formatting

I'm a brand new programmer. This is the first program I have ever written, and this is the first post I have ever made on this website.
I am trying to web scrape data for my own personal stock uses and I can't seem to get the proper information to be extracted due to the way the website is formatted. I was wondering if someone could help me. I have tried searching around, but can't find an answer to my problem.
I need to scrape the second-to-last line, which reads "3.60/2.56%", but I'm having problems getting to it. I was wondering if there is a way to target that specific line from this section.
<table class="name-value-pair hide-for-960">
<tr>
<td>Beta
<div class="tooltip">
<h3>Beta</h3>
<p>A measure of the volatility, or systematic risk, of a security or a portfolio in comparison to the market as a whole.</p>
</div>
</td>
<td class="num">0.674</td>
</tr>
<tr>
<td>Volume
<div class="tooltip">
<h3>Volume</h3>
<p>The number of shares or contracts traded in a security or an entire market during a given period of time.</p>
</div>
</td>
<td class="num" id="quoteVolume">1,513,740.00</td>
</tr>
<tr>
<td>Div & Yield
<div class="tooltip">
<h3>Dividend / Dividend Yield</h3>
<p>A dividend is a distribution of a portion of a company's earnings, decided by the board of directors, to a class of its shareholders. Dividends can be issued as cash payments, as shares of stock, or other property. A dividend yield indicates how much a company pays out in dividends each year relative to its share price.</p>
</div>
</td>
<td class="num">3.60/2.56% </td>
</tr>
This is what my code looks like right now.
#Importing Packages
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#Asking For Company's Stock Market Ticker
Ticker = input("Enter the Company's Ticker:")
#Adding The Ticker To The Website Search URL
my_url = 'https://www.investopedia.com/markets/stocks/' + Ticker + "/"
#Opening Up Connection, Grabbing The Page And Inputting "my_url" Variable
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#Parsing the HTML Code
page_soup = soup(page_html, "html.parser")
#Finding The Company Name
company_name = page_soup.find("span", {"id": "quoteName"})
#Converting The Company Name To Text Without HTML
print(company_name.text)
#Finding The Company's Price Per Share
share_cost = page_soup.find("td", {"class": "value-price"})
#Converting The Share Cost To Just The Number Without HTML
print("Price Per Share: $" + share_cost.text.strip())
#Finding The Share's Daily Change
share_change = page_soup.find("span", {"id": "quoteChange"})
#Converting The Rate of Change To Just The Number Without HTML
print("Daily Rate of Change: $" + share_change.text.strip())
share_dividend_yield = page_soup.find("table", {"class": "name-value-pair hide-for-960"})
print(share_dividend_yield)
I tried appending ".tr.td.div.h3.p" to share_dividend_yield inside the print() call to drill down to the line that I wanted, but it won't let me go further than h3.
Any help would be greatly appreciated. Sorry, if my post wasn't formatted properly, and thanks for taking the time to read my post!
If I understood you correctly, you need a number that comes after "Dividend / Dividend Yield" phrase.
If so, then you can do something like this:
...(your code above)...
share_dividend_yield = page_soup.find("table", {"class": "name-value-pair hide-for-960"})
tds = share_dividend_yield.find_all('td')
for i in tds:
    if 'Dividend' in i.text:
        print(i.find_next('td').text)
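A self-contained version of the same idea, run against a condensed copy of the table from the question (the tooltip markup is trimmed for brevity):

```python
from bs4 import BeautifulSoup

# Condensed version of the question's table
html = """
<table class="name-value-pair hide-for-960">
<tr><td>Beta</td><td class="num">0.674</td></tr>
<tr><td>Volume</td><td class="num" id="quoteVolume">1,513,740.00</td></tr>
<tr><td>Div &amp; Yield</td><td class="num">3.60/2.56% </td></tr>
</table>
"""

page_soup = BeautifulSoup(html, "html.parser")
table = page_soup.find("table", {"class": "name-value-pair hide-for-960"})
for td in table.find_all("td"):
    if "Div" in td.text:  # label cell; the value sits in the next <td>
        value = td.find_next("td").text.strip()
print(value)  # "3.60/2.56%"
```

find_next walks forward through the document from the matched label cell, so the first <td> it encounters is the paired value cell.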

select specific rows with class names

I am parsing an HTML document which has a bunch of rows that I want to select. Here are examples of those rows:
<tr class="constantstring-randomvalue1-row" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue1-row'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue1-row" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue1-row'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
What I was trying to do is use BeautifulSoup4's find_all with a regex, i.e. find_all(re.compile(regex)).
However, the problem is that I am unable to come up with a good regex that will select all the rows I am interested in.
All the rows that I want start with constantstring-; I don't care what follows it. What would be the proper way? Should I use re.compile, and if so, what is the correct regex?
If you want to accomplish this with a regular expression, the following will do; I added an extra row to demonstrate that the final row is not picked up.
http://rextester.com/OSSFB8621
from bs4 import BeautifulSoup
import re
html ="""
<tr class="constantstring-randomvalue1-row" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue1-row'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue1-row" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue1-row'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="axcconstantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
"""
bs = BeautifulSoup(html, 'lxml')
for tr in bs.find_all("tr", {"class": re.compile('^(constantstring)')}):
    print(tr)
Instead of regex you can use the built-in string methods for the same task. Like,
rows = soup.find_all('tr')
selected_rows = [i for i in rows if str(i).startswith('<tr class="constantstring')]
If you skip the str() call the condition will error out, since each row is a Tag object rather than a string.
Hope this helps! Cheers!
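A third option, as a sketch: a CSS attribute selector via soup.select, which matches class attributes by prefix without any regex.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr class="constantstring-randomvalue1-row"><td>a</td></tr>
<tr class="constantstring-randomvalue2-row-2"><td>b</td></tr>
<tr class="axcconstantstring-randomvalue2-row-2"><td>c</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# [class^=...] matches elements whose class attribute *starts* with the prefix,
# so the axc... row is excluded
rows = soup.select('tr[class^="constantstring-"]')
print(len(rows))  # 2
```
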

Pandas read_html misses table with several rows

I'm scraping a website and in order to get the table I'm using pd.read_html.
I get the node doing this:
table = WebDriverWait(browser, 10).until(EC.presence_of_element_located((
    By.XPATH, '//tbody[ancestor::div[contains(@id,"cornerOddsDiv")]]')))
newt = pd.read_html(table.get_attribute('outerHTML'))
This returns:
ValueError: No tables found
Giving the table node this output:
table.get_attribute('outerHTML')
>>'<tbody><tr><th colspan="10" align="center" class="bg1">365 Corner Odds</th></tr><tr bgcolor="#FCEAAB"><td colspan="10" align="center"><strong>Over/Under</strong></td></tr><tr onclick="goCorner(1510721)" style="cursor:pointer;" align="center" class="bg1" id="trCornerTotal" odds="1.19,0.25,0.72"><td width="14%" bgcolor="#EBF2F8">early</td><td width="10%" class="bg2">1 </td><td width="10%" class="bg2">10.5</td><td width="10%" class="bg2">0.8</td><td width="6%" class="bg2">detail</td><td width="14%" bgcolor="#EBF2F8">0.25</td><td width="10%" class="bg2">1</td><td width="10%" class="bg2">0.72</td><td width="10%" class="bg2">0.8</td><td width="6%" class="bg2">detail</td></tr></tbody>'
Why is it not working? I have followed the same procedure for other tables and they did work.
I finally found the answer. The node is of a structure like the following
<div>
<div>
<table>
<tbody>
<tr>..</tr>
<tr>..</tr>
...
</tbody>
Etc
The key is to pass the table node instead of the tbody node: pd.read_html only looks for <table> elements, so a bare <tbody> fragment contains no table for it to find.
So, it would be:
table = WebDriverWait(browser, 10).until(EC.presence_of_element_located((
    By.XPATH, '//table[contains(@class,"bhTable") and ancestor::div[contains(@id,"cornerOddsDiv")]]')))
and that returns the desired output
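Equivalently, if the <tbody> node is all you can grab, wrapping its outerHTML in a <table> element before calling pd.read_html also works, since read_html only matches <table> tags. A small sketch (assumes lxml, which read_html needs, is installed):

```python
from io import StringIO
import pandas as pd

# A bare <tbody> fragment, as returned by tbody.get_attribute('outerHTML')
tbody_html = '<tbody><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></tbody>'

# read_html only matches <table> elements, so wrap the fragment first
df = pd.read_html(StringIO('<table>' + tbody_html + '</table>'))[0]
print(df)
```
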

xpath: How do I extract text within the "strong" tag?

I'm using scrapy and need to extract "Gray / Gray" using xpath selectors.
Here's the html snippet:
<div class="Vehicle-Overview">
<div class="Txt-YMM">
2006 GMC Sierra 1500
</div>
<div class="Txt-Price">
Price : $8,499
</div>
<table width="100%" border="0" cellpadding="0" cellspacing="0"
class="Table-Specs">
<tr>
<td>
<strong>2006 GMC Sierra 1500 Crew Cab 143.5 WB 4WD
SLE</strong>
<strong class="text-right t-none"></strong>
</td>
</tr>
<tr>
<td>
<strong>Gray / Gray</strong><br />
<strong>209,123
Miles
/ VIN: XXXXXXXXXX
</td>
</tr>
</table>
I'm stuck trying to extract "Gray / Gray" within the "strong" tag. Any help is appreciated.
This XPath will work in Scrapy and also in Google/Firefox Developer's Console:
//div[@class='Vehicle-Overview']/table[@class='Table-Specs']//tr[2]/td[1]/strong[1]/text()
You can use this code in your spider:
color = response.xpath("//div[@class='Vehicle-Overview']/table[@class='Table-Specs']//tr[2]/td[1]/strong[1]/text()").extract_first()
You can use this XPath expression with your sample XML/HTML:
//div[@class='Vehicle-Overview']/table[@class='Table-Specs']/tr[2]/td[1]/strong[1]
A full XPath, given the full file mentioned below and the "http://www.w3.org/1999/xhtml" namespace, would be:
/html/body/div/div/div[@class='content-bg']/div/div/div[@class='Vehicle-Overview']/table[@class='Table-Specs']/tr[2]/td[1]/strong[1]
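To check an expression like this outside of Scrapy, lxml.html (which Scrapy's selectors build on) can parse the markup; a trimmed, self-contained sketch:

```python
import lxml.html

# Trimmed version of the snippet from the question
html = """
<div class="Vehicle-Overview">
<table class="Table-Specs">
<tr><td><strong>2006 GMC Sierra 1500 Crew Cab</strong></td></tr>
<tr><td><strong>Gray / Gray</strong><br/><strong>209,123 Miles</strong></td></tr>
</table>
</div>
"""

doc = lxml.html.fromstring(html)
# Second row, first cell, first <strong> holds the color
color = doc.xpath(
    "//div[@class='Vehicle-Overview']//table[@class='Table-Specs']"
    "//tr[2]/td[1]/strong[1]/text()")[0]
print(color)  # "Gray / Gray"
```
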

Get href within a table

Sorry, this has most likely been asked before, but I can't seem to find an answer on Stack Overflow or via a search engine.
I'm trying to scrape some data from a table, but there are href links which I need to get. Html as follows:
<table class="featprop results">
<tr>
**1)**<td class="propname" colspan="2"> West Drayton</td>
</tr>
<tr><td class="propimg" colspan="2">
<div class="imgcrop">
**2)**<img src="content/images/1/1/641/w296/858.jpg" alt=" Ashford" width="148"/>
<div class="let"> </div>
</div>
</td></tr>
<tr><td class="proprooms">
So far I have used the following:
for table in soup.findAll('table', {'class': 'featprop results'}):
    for tr in table.findAll('tr'):
        for a in tr.findAll('a'):
            print(a)
This returns both 1) and 2) in the above HTML. Could anyone help me extract just the href link?
for table in soup.findAll('table', {'class': 'featprop results'}):
    for tr in table.findAll('tr'):
        for a in tr.findAll('a'):
            print(a['href'])
out:
/lettings-search-results?task=View&itemid=136
/lettings-search-results?task=View&itemid=136
See the BeautifulSoup documentation on Attributes.
EDIT:
links = set()  # a set removes the duplicates
for a in tr.findAll('a', href=re.compile(r'^/lettings-search-results\?')):
    links.add(a['href'])
See Python's regular expression documentation (re).
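A runnable sketch of the href-filtering approach, using a small stand-in snippet, since the question's pasted HTML omits the <a> tags (the /about link is hypothetical, added only to show non-matching links being filtered out):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real listing markup
html = """
<table class="featprop results">
<tr><td><a href="/lettings-search-results?task=View&amp;itemid=136">West Drayton</a></td></tr>
<tr><td><a href="/lettings-search-results?task=View&amp;itemid=136">Ashford</a></td></tr>
<tr><td><a href="/about">About</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
links = set()  # a set removes the duplicate hrefs
for a in soup.find_all('a', href=re.compile(r'^/lettings-search-results\?')):
    links.add(a['href'])
print(links)
```

Note the escaped \? in the pattern: an unescaped ? would make the preceding character optional instead of matching a literal question mark.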
This provides you with a list of the <a> tags under the element with the selected class name:
result = soup.select(".featprop a")
for a in result:
    print(a['href'])
This gives you the result below:
/lettings-search-results?task=View&itemid=136
/lettings-search-results?task=View&itemid=136
