Using BS4 - how to get text only, and not tags? - python-3.x

I am trying to scrape a page on medicine and the market asset for some companies on https://www.formularylookup.com/
The code below gets me the data I want (number of plans, which pharmacies cover the medicine, and the status in %), but as whole tags rather than as text. Here is an example of my output, where the desired output would just be "1330 plans":
Number of plans:
<td class="plan-count" role="gridcell">1330 plans</td>
I have tried using .text after each tag.find, but it doesn't work.
Here's my code concerning this specific part. There's a whole lot more going on above, but it includes log in information I cannot share.
total = []
soup = BeautifulSoup(html, "lxml")
for tag in soup.find_all("tbody", {"role":"rowgroup"}):
    #name = tag.find("td", {"class":"payer-name"}) #gives me whole tag
    name = tag.find("tr", {"role":"row"}).find("td").get("payer-name") #gives me None output
    plan = tag.find("td", {"class":"plan-count"}) #gives me whole tag
    stat = tag.find("td", {"class":"icon-status"}) #gives me whole tag
    data = {"Payer": name, "Number of plans": plan, "Status": stat}
    total.append(data)
df = pd.DataFrame(total)
print(df)
Here is a snippet from the browser's Inspect view.
<tbody role="rowgroup">
<tr data-uid="a5795205-1518-4a74-b039-abcd1b35b409" role="row">
<td class="payer-name" role="gridcell">CVS Caremark RX</td>
<td class="plan-count" role="gridcell">1330 plans</td>
<td role="gridcell" class="icon-status icon-status-not-covered">98% Not Covered</td>
</tr>
EDIT: After diving deeper into SO I see a solution could be using BS4's .contents attribute. Will report back if it works.
- This didn't work:
"AttributeError: 'NoneType' object has no attribute 'contents'"

I figured it out. Apparently there are other tbody rowgroup tags further up the page whose cell lookups come back as None, so calling .text on those fails before my code ever reaches the rows I want.
I just need to change this line:
for tag in soup.find_all("tbody", {"role":"rowgroup"}):
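For completeness, here is a minimal sketch of the text-only version with None guards, run against the HTML fragment from the question (html.parser is used here because lxml may drop table cells that appear outside a <table>):

```python
from bs4 import BeautifulSoup

html = """
<tbody role="rowgroup">
<tr data-uid="a5795205-1518-4a74-b039-abcd1b35b409" role="row">
<td class="payer-name" role="gridcell">CVS Caremark RX</td>
<td class="plan-count" role="gridcell">1330 plans</td>
<td role="gridcell" class="icon-status icon-status-not-covered">98% Not Covered</td>
</tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")
total = []
for tag in soup.find_all("tbody", {"role": "rowgroup"}):
    name = tag.find("td", {"class": "payer-name"})
    plan = tag.find("td", {"class": "plan-count"})
    stat = tag.find("td", {"class": "icon-status"})
    # Skip rowgroups that lack the expected cells instead of calling .text on None
    if not (name and plan and stat):
        continue
    total.append({
        "Payer": name.get_text(strip=True),
        "Number of plans": plan.get_text(strip=True),
        "Status": stat.get_text(strip=True),
    })
print(total)
```

get_text(strip=True) returns just the cell text ("1330 plans"), and the guard means rowgroups without the expected cells are simply skipped.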

Related

How do you use find_previous() in a select query in Python?

I am trying to pull the span (let's call it AAA) that comes directly before a specific span, BBB. This BBB span only shows up a few times on the page, and I only want the AAA's that directly precede the BBB's.
Is there a way to select only the AAA's that are followed by a BBB? Or, to get to my proposed question, how can you use find_previous() when you're running a select query? I am successful if I just use select_one:
AAA = selsoup.select_one('span.BBB').find_previous().text
but when I try to use select to pull all entries I get an error message ("You're probably treating a list of elements like a single element.")
I've tried applying .find_previous in a for loop but that doesn't work either. Any suggestions?
Sorry, I probably should have added this before:
Adding code from the page -
<tr class="tree">
<th class="AAA">What I want right here<span class="BBB">(Aba: The New Look)</span></th>
Instead of .find_previous() you can use + in your CSS selector:
from bs4 import BeautifulSoup
html_doc = """
<span class="ccc">txt</span>
<span class="aaa">This I don't Want</span>
<span class="bbb">txt</span>
<span class="aaa">* This I Want *</span>
<span class="ccc">txt</span>
<span class="aaa">This I don't Want</span>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for aaa in soup.select(".bbb + .aaa"):
    print(aaa.text)
Prints:
* This I Want *
EDIT: Based on your edit:
bbb = soup.select_one(".AAA .BBB")
print(bbb.text)
Prints:
(Aba: The New Look)
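If you'd rather keep find_previous() as the question asked, the for-loop approach does work; the trick is to call find_previous() on each element of the list that select() returns, never on the list itself. A sketch against the HTML from the question's edit:

```python
from bs4 import BeautifulSoup

html_doc = """
<tr class="tree">
<th class="AAA">What I want right here<span class="BBB">(Aba: The New Look)</span></th>
</tr>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# select() returns a list, so iterate and call find_previous() on each element
results = [bbb.find_previous().text for bbb in soup.select("span.BBB")]
print(results)
```

Note that with this markup find_previous() lands on the parent <th>, so its .text includes the span's text as well; strip or slice it off if you only want the leading part.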

Python + BeautifulSoup: Can't seem to scrape the specific data that I want from a website due to the website's formatting

I'm a brand new programmer. This is the first program I have ever written, and this is the first post I have ever made on this website.
I am trying to web scrape data for my own personal stock uses and I can't seem to get the proper information to be extracted due to the way the website is formatted. I was wondering if someone could help me. I have tried searching around, but can't find an answer to my problem.
I need to scrape the second-to-last line, which reads "3.60/2.56%", but I'm having trouble getting to it. I was wondering if maybe there is a way to call a specific line from this section.
<table class="name-value-pair hide-for-960">
<tr>
<td>Beta
<div class="tooltip">
<h3>Beta</h3>
<p>A measure of the volatility, or systematic risk, of a security or a portfolio in comparison to the market as a whole.</p>
</div>
</td>
<td class="num">0.674</td>
</tr>
<tr>
<td>Volume
<div class="tooltip">
<h3>Volume</h3>
<p>The number of shares or contracts traded in a security or an entire market during a given period of time.</p>
</div>
</td>
<td class="num" id="quoteVolume">1,513,740.00</td>
</tr>
<tr>
<td>Div & Yield
<div class="tooltip">
<h3>Dividend / Dividend Yield</h3>
<p>A dividend is a distribution of a portion of a company's earnings, decided by the board of directors, to a class of its shareholders. Dividends can be issued as cash payments, as shares of stock, or other property. A dividend yield indicates how much a company pays out in dividends each year relative to its share price.</p>
</div>
</td>
<td class="num">3.60/2.56% </td>
</tr>
This is what my code looks like right now.
#Importing Packages
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#Asking For Company's Stock Market Ticker
Ticker = input("Enter the Company's Ticker:")
#Adding The Ticker To The Website Search URL
my_url = 'https://www.investopedia.com/markets/stocks/' + Ticker + "/"
#Opening Up Connection, Grabbing The Page And Inputting "my_url" Variable
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#Parsing the HTML Code
page_soup = soup(page_html, "html.parser")
#Finding The Company Name
company_name = page_soup.find("span", {"id": "quoteName"})
#Converting The Company Name To Text Without HTML
print(company_name.text)
#Finding The Company's Price Per Share
share_cost = page_soup.find("td", {"class": "value-price"})
#Converting The Share Cost To Just The Number Without HTML
print("Price Per Share: $" + share_cost.text.strip())
#Finding The Share's Daily Change
share_change = page_soup.find("span", {"id": "quoteChange"})
#Converting The Rate of Change To Just The Number Without HTML
print("Daily Rate of Change: $" + share_change.text.strip())
share_dividend_yield = page_soup.find("table", {"class": "name-value-pair hide-for-960"})
print(share_dividend_yield)
I tried modifying print(share_dividend_yield) by appending ".tr.td.div.h3.p" before the closing parenthesis to drill down to the line I wanted, but it won't let me go further than h3.
Any help would be greatly appreciated. Sorry, if my post wasn't formatted properly, and thanks for taking the time to read my post!
If I understood you correctly, you need a number that comes after "Dividend / Dividend Yield" phrase.
If so, then you can do something like this:
...(your code above)...
share_dividend_yield = page_soup.find("table", {"class": "name-value-pair hide-for-960"})
tds = share_dividend_yield.find_all('td')
for i in tds:
    if 'Dividend' in i.text:
        print(i.find_next('td').text)
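The label/value pattern generalizes: each row pairs a label <td> with a <td class="num">, so you can collect every metric into a dict in one pass. A sketch against a trimmed copy of the table from the question (tooltip text shortened; html.parser assumed):

```python
from bs4 import BeautifulSoup

html = """
<table class="name-value-pair hide-for-960">
<tr>
<td>Beta
<div class="tooltip"><h3>Beta</h3><p>...</p></div>
</td>
<td class="num">0.674</td>
</tr>
<tr>
<td>Div &amp; Yield
<div class="tooltip"><h3>Dividend / Dividend Yield</h3><p>...</p></div>
</td>
<td class="num">3.60/2.56% </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
metrics = {}
for row in soup.select("table.name-value-pair tr"):
    tds = row.find_all("td")
    if len(tds) == 2:
        # The label is the text node that comes before the tooltip <div>
        metrics[tds[0].contents[0].strip()] = tds[1].get_text(strip=True)
print(metrics)
```

This way "Div & Yield" can be looked up by name instead of substring-matching the tooltip text.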

Finding item in beautiful soup by text not tag

So I'm trying to get the Area for certain locations by scraping it from their Wikipedia page. Using Cumbria as an example (https://en.wikipedia.org/wiki/Cumbria) I can get the info box by:
url = 'https://en.wikipedia.org/wiki/Cumbria'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
value = soup.find('table', {"class": "infobox geography vcard"}) \
.find('tr', {"class":"mergedrow"}).text
however, the infobox geography vcard table has multiple <tr class='mergedrow'> subsets, and within each is a <th scope='row'>.
The <th scope='row'> that I want is <th scope="row">Area</th>, and I was wondering if I could get the text from its subset by searching for 'Area' instead of the tags, since everything else looks the same under the infobox geography vcard.
You can search for all th with scope=row directly. Then iterate over them and see which ones have Area as text, and use find_next_sibling to get the next sibling (which will be the td with the data you need).
Note that this table has 2 Area entries, one for 'Ceremonial county' and one for 'Non-metropolitan county', whatever that means ;).
ths = soup.find_all('th', {'scope': 'row'})
for th in ths:
    if th.text == 'Area':
        area = th.find_next_sibling().text
        print(area)
# 6,768 km2 (2,613 sq mi)
# 6,768 km2 (2,613 sq mi)
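BeautifulSoup can also match on the text itself via the string argument (called text in versions before 4.4), which is exactly the "search for 'Area' instead of the tags" the question asks about. A sketch against a minimal two-row table (the Population value is an illustrative placeholder):

```python
from bs4 import BeautifulSoup

html = """
<table class="infobox geography vcard">
<tr class="mergedrow"><th scope="row">Area</th><td>6,768 km2 (2,613 sq mi)</td></tr>
<tr class="mergedrow"><th scope="row">Population</th><td>500,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# string= matches tags whose text is exactly "Area"
th = soup.find("th", string="Area")
area = th.find_next_sibling("td").text
print(area)
```

Note that find returns only the first match, so on the real page (which has two Area rows) you would still want find_all plus the sibling lookup shown above.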

Return value from an HTML class/scope?

I have a webpage that I'm trying to return a value from, however I can't find the right way to grab it with Selenium.
Here's the relevant HTML part:
<table class="table table-striped">
<tbody>
<tr class="hidden-sm hidden-xs">
<th scope="row"><a style="cursor: pointer"
onClick="document.formShip.P_IMO.value='9526942';document.formShip.submit();">
9526942</a>
</th>
I'm trying to get 9526942.
I've tried:
imo = driver.find_element_by_xpath("//*[contains(text(), 'document.formShip.P_IMO.value')]")
and looked around here, but don't know what element this is. I tried looking for the class hidden-sm hidden-xs, to no avail:
imo = driver.find_element_by_class_name('hidden-sm hidden-xs')
If you want to get the text, you need to use .text. The .text property can be used on any WebElement that contains text.
In the first example you tried, you are passing the wrong thing to text(): in XPath, text() matches the visible text between the opening and closing tags (the text you see on screen), not attribute values like the onClick handler.
You can simply try this:
imo = driver.find_element_by_xpath(".//tr[@class='hidden-sm hidden-xs']").text
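Since the 9526942 lives in the element text rather than in an attribute, you can sanity-check an XPath against the snippet outside the browser before handing it to Selenium, for instance with lxml (a sketch, assuming lxml is installed):

```python
from lxml import html as lxml_html

snippet = """
<table class="table table-striped">
<tbody>
<tr class="hidden-sm hidden-xs">
<th scope="row"><a style="cursor: pointer"
onClick="document.formShip.P_IMO.value='9526942';document.formShip.submit();">
9526942</a>
</th>
</tr>
</tbody>
</table>
"""

tree = lxml_html.fromstring(snippet)
# The same expression would then go into driver.find_element_by_xpath(...)
imo = tree.xpath("//tr[@class='hidden-sm hidden-xs']//a/text()")[0].strip()
print(imo)
```

Once the expression returns what you expect here, the Selenium call with the same XPath should find the element too.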

Beautifulsoup placing </li> in the wrong place because there is no closing tag [duplicate]

I would like to scrape the table from html code using beautifulsoup. A snippet of the html is shown below. When using table.findAll('tr') I get the entire table and not only the rows. (probably because the closing tags are missing from the html code?)
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
As stated in its documentation, html5lib parses the document the same way a web browser does (and lxml behaves similarly in this case). It'll try to fix your document tree by adding/closing tags when needed.
In your example I've used lxml as the parser and it gave the following result:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added html & body tags because they weren't present in the source (it'll try to create a well-formed document, as previously stated).
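A quick way to see the repair in action is to parse a trimmed copy of the question's snippet and print what each row actually contains once the parser has closed the dangling tags (a sketch, assuming lxml is installed):

```python
from bs4 import BeautifulSoup

data = """
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
</TABLE>
"""

# lxml inserts the implied </td> and </tr> tags, so each row really ends
# before the next one starts
soup = BeautifulSoup(data, "lxml")
rows = soup.find("table").find_all("tr")
for tr in rows:
    print([td.get_text(strip=True) for td in tr.find_all("td")])
```

With a lenient parser each tr holds only its own cells, instead of the first tr swallowing the rest of the table.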
