Finding item in beautiful soup by text not tag - python-3.x

So I'm trying to get the Area for certain locations by scraping it from their Wikipedia pages. Using Cumbria as an example (https://en.wikipedia.org/wiki/Cumbria), I can get the infobox with:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Cumbria'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
value = soup.find('table', {"class": "infobox geography vcard"}) \
    .find('tr', {"class": "mergedrow"}).text
However, the infobox geography vcard table has multiple <tr class="mergedrow"> rows, and within each is a <th scope='row'>.
The <th scope='row'> that I want is <th scope="row">Area</th>, and I was wondering if I could get the text from that row by searching for 'Area' instead of by the tags, since the tags themselves look the same everywhere else under the infobox geography vcard.

You can search for all th with scope=row directly. Then iterate over them and see which ones have Area as text, and use find_next_sibling to get the next sibling (which will be the td with the data you need).
Note that this table has 2 Area entries, one for 'Ceremonial county' and one for 'Non-metropolitan county', whatever that means ;).
ths = soup.find_all('th', {'scope': 'row'})
for th in ths:
    if th.text == 'Area':
        area = th.find_next_sibling().text
        print(area)
# 6,768 km2 (2,613 sq mi)
# 6,768 km2 (2,613 sq mi)
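If you would rather match on the text itself, as the question title suggests, find_all also accepts a string argument. A minimal sketch of that variant, assuming the header cell contains only the bare text 'Area':
# string='Area' only matches <th> tags whose sole content is the text 'Area';
# if Wikipedia wraps the label in extra markup, fall back to the loop above.
for th in soup.find_all('th', {'scope': 'row'}, string='Area'):
    print(th.find_next_sibling('td').text)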

Related

How to scrape nested text between tags using BeautifulSoup?

I found a website using the following HTML structure somewhere:
...
<td>
<span>some span text</span>
some td text
</td>
...
I'm interested in retrieving the "some td text" and not the "some span text" but the get_text() method seems to return all the text as "some span textsome td text". Is there a way to get just the text inside a certain element using BeautifulSoup?
Not all the tds follow the same structure, so unfortunately I cannot predict the structure of the resulting string to trim it where necessary.
Each element has a name attribute, which tells you the type of tag, e.g. div, td, span. When a child is bare text rather than a tag (a NavigableString), its name is None.
So you can use a simple list comprehension to filter out all the tag elements.
from bs4 import BeautifulSoup
html = '''
<td>
<span>some span text</span>
some td text
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
content = soup.find('td')
# Keep only the direct children that are bare strings (name is None) and non-empty.
text = [c.strip() for c in content if c.name is None and c.strip() != '']
print(text)
This will print:
['some td text']
after some cleaning of newlines and empty strings.
If you wanted to join up the content afterwards, you could use join:
print('\n'.join(text))
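As an aside, you can get a similar result without the comprehension by asking BeautifulSoup for the direct string children only; a small sketch reusing content from above:
# string=True matches the tag's own text nodes; recursive=False skips text
# nested inside child tags such as <span>.
direct_text = [s.strip() for s in content.find_all(string=True, recursive=False) if s.strip()]
print(direct_text)  # ['some td text']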

Beautiful soup class selector

I have two types of tables in an HTML page.
<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">
<table class="ui-table hp-raceRecords ui-table_type2">
I need to select only the first one.
If I try something like this:
BSdata1 = BeautifulSoup(driver1.page_source, 'lxml')
Parameters = BSdata1.find_all('table',{'class':'ui-table hp-formTable ui-table_type1 ui-table_sortable'})
It keeps selecting both tables. How can I select only the first one?
I am using Python 3, BS4 and the lxml parser on a Windows machine.
The sample tables in your question aren't properly formatted, but using CSS selectors this should work:
from bs4 import BeautifulSoup

# Standing in for driver1.page_source from the question.
page_source = """
<doc>
<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">First Table</table>
<table class="ui-table hp-raceRecords ui-table_type2">Second Table</table>
</doc>
"""
BSdata1 = BeautifulSoup(page_source, 'lxml')
Parameters = BSdata1.select('table.ui-table.hp-formTable.ui-table_type1.ui-table_sortable')
print(Parameters)
Output:
[<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">First Table</table>]
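If you only ever want the first matching table, select_one (or indexing the list returned by select) narrows it down; a quick sketch with the same soup:
# select_one returns the first element matching the selector, or None if nothing matches.
first_table = BSdata1.select_one('table.ui-table.hp-formTable.ui-table_type1.ui-table_sortable')
print(first_table.get_text())  # First Table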

Using BS4 - how to get text only, and not tags?

I am trying to scrape a page on medicine and the market asset for some companies on https://www.formularylookup.com/
The code below gets me the desired data (the number of plans, which pharmacies are covering the medicine, and the status in %), but as whole tags. Here is an example of my output, where the desired output would just be "1330 plans":
Number of plans:
<td class="plan-count" role="gridcell">1330 plans</td>
I have tried using .text after each tag.find, but it doesn't work.
Here's my code for this specific part. There's a whole lot more going on above it, but that includes login information I cannot share.
total = []
soup = BeautifulSoup(html, "lxml")
for tag in soup.find_all("tbody", {"role":"rowgroup"}):
    #name = tag.find("td", {"class":"payer-name"}) #gives me whole tag
    name = tag.find("tr", {"role":"row"}).find("td").get("payer-name") #gives me None output
    plan = tag.find("td", {"class":"plan-count"}) #gives me whole tag
    stat = tag.find("td", {"class":"icon-status"}) #gives me whole tag
    data = {"Payer": name, "Number of plans": plan, "Status": stat}
    total.append(data)
df = pd.DataFrame(total)
print(df)
Here is a snippet using the inspect function.
<tbody role="rowgroup">
<tr data-uid="a5795205-1518-4a74-b039-abcd1b35b409" role="row">
<td class="payer-name" role="gridcell">CVS Caremark RX</td>
<td class="plan-count" role="gridcell">1330 plans</td>
<td role="gridcell" class="icon-status icon-status-not-covered">98% Not Covered</td>
</tr>
EDIT: After diving deeper into SO I see a solution could be using the .contents attribute of BS4. Will report back if it works.
- This didn't work:
"AttributeError: 'NoneType' object has no attribute 'contents'"
I figured it out. Apparently there are other tbody rowgroup tags further up the page for which my finds return None, so .text cannot be taken from them; the code fails on those before it ever reaches the parts I want.
I just need to change this line:
for tag in soup.find_all("tbody", {"role":"rowgroup"}):
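For reference (not necessarily the fix the asker settled on), here is a minimal sketch of how the loop could pull out the text values, assuming html holds page markup like the snippet above; the guard skips any row groups that lack the expected cells:
import pandas as pd
from bs4 import BeautifulSoup

total = []
soup = BeautifulSoup(html, "lxml")
for tag in soup.find_all("tbody", {"role": "rowgroup"}):
    name = tag.find("td", {"class": "payer-name"})
    plan = tag.find("td", {"class": "plan-count"})
    stat = tag.find("td", {"class": "icon-status"})
    # Skip row groups (e.g. header placeholders) that don't contain all three cells.
    if not (name and plan and stat):
        continue
    total.append({
        "Payer": name.get_text(strip=True),            # e.g. "CVS Caremark RX"
        "Number of plans": plan.get_text(strip=True),  # e.g. "1330 plans"
        "Status": stat.get_text(strip=True),           # e.g. "98% Not Covered"
    })
df = pd.DataFrame(total)
print(df)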

Python + BeautifulSoup: Can't seem to scrape the specific data that I want from a website due to the website's formatting

I'm a brand new programmer. This is the first program I have ever written, and this is the first post I have ever made on this website.
I am trying to web scrape data for my own personal stock uses and I can't seem to get the proper information to be extracted due to the way the website is formatted. I was wondering if someone could help me. I have tried searching around, but can't find an answer to my problem.
I need to scrape the second-to-last line, which reads "3.60/2.56%", but I'm having problems getting to it. I was wondering if there is a way to target that specific line within this section.
<table class="name-value-pair hide-for-960">
<tr>
<td>Beta
<div class="tooltip">
<h3>Beta</h3>
<p>A measure of the volatility, or systematic risk, of a security or a portfolio in comparison to the market as a whole.</p>
</div>
</td>
<td class="num">0.674</td>
</tr>
<tr>
<td>Volume
<div class="tooltip">
<h3>Volume</h3>
<p>The number of shares or contracts traded in a security or an entire market during a given period of time.</p>
</div>
</td>
<td class="num" id="quoteVolume">1,513,740.00</td>
</tr>
<tr>
<td>Div & Yield
<div class="tooltip">
<h3>Dividend / Dividend Yield</h3>
<p>A dividend is a distribution of a portion of a company's earnings, decided by the board of directors, to a class of its shareholders. Dividends can be issued as cash payments, as shares of stock, or other property. A dividend yield indicates how much a company pays out in dividends each year relative to its share price.</p>
</div>
</td>
<td class="num">3.60/2.56% </td>
</tr>
This is what my code looks like right now.
#Importing Packages
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#Asking For Company's Stock Market Ticker
Ticker = input("Enter the Company's Ticker:")
#Adding The Ticker To The Website Search URL
my_url = 'https://www.investopedia.com/markets/stocks/' + Ticker + "/"
#Opening Up Connection, Grabbing The Page And Inputting "my_url" Variable
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#Parsing the HTML Code
page_soup = soup(page_html, "html.parser")
#Finding The Company Name
company_name = page_soup.find("span", {"id": "quoteName"})
#Converting The Company Name To Text Without HTML
print(company_name.text)
#Finding The Company's Price Per Share
share_cost = page_soup.find("td", {"class": "value-price"})
#Converting The Share Cost To Just The Number Without HTML
print("Price Per Share: $" + share_cost.text.strip())
#Finding The Share's Daily Change
share_change = page_soup.find("span", {"id": "quoteChange"})
#Converting The Rate of Change To Just The Number Without HTML
print("Daily Rate of Change: $" + share_change.text.strip())
share_dividend_yield = page_soup.find("table", {"class": "name-value-pair hide-for-960"})
print(share_dividend_yield)
I tried modifying print(share_dividend_yield) by appending ".tr.td.div.h3.p" to share_dividend_yield to drill down to the line that I wanted, but it won't let me go further than h3.
Any help would be greatly appreciated. Sorry, if my post wasn't formatted properly, and thanks for taking the time to read my post!
If I understood you correctly, you need the number that comes after the "Dividend / Dividend Yield" label.
If so, then you can do something like this:
# ...(your code above)...
share_dividend_yield = page_soup.find("table", {"class": "name-value-pair hide-for-960"})
tds = share_dividend_yield.find_all('td')
for i in tds:
    if 'Dividend' in i.text:
        print(i.find_next('td').text)
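As a variant, you can also walk the rows and pair each label cell with its value cell (the td with class "num"), so the match is anchored to the row rather than to whichever cell happens to mention "Dividend"; a small sketch reusing share_dividend_yield from above:
for row in share_dividend_yield.find_all("tr"):
    label = row.find("td")
    value = row.find("td", {"class": "num"})
    # Only the "Div & Yield" row carries the figure we're after.
    if label and value and "Div & Yield" in label.get_text():
        print(value.get_text(strip=True))  # 3.60/2.56%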

Beautifulsoup placing </li> in the wrong place because there is no closing tag [duplicate]

I would like to scrape the table from this HTML using BeautifulSoup. A snippet of the HTML is shown below. When I use table.findAll('tr') I get the entire table back rather than the individual rows (probably because the closing tags are missing from the HTML?).
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
As stated in the documentation, html5lib parses the document the way a web browser does (as lxml does in this case). It will try to fix your document tree by adding/closing tags where needed.
In your example I've used lxml as the parser and it gave the following result:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added html and body tags because they weren't present in the source (it tries to create a well-formed document, as previously stated).
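For comparison, a small sketch of the same thing with the html5lib parser (pip install html5lib), which likewise repairs the unclosed tags; data is assumed to hold the table markup above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html5lib")
for tr in soup.find("table").find_all("tr"):
    # After repair, each row has its own <td> cells.
    print([td.get_text(strip=True) for td in tr.find_all("td")])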
