Beautiful soup class selector - python-3.x

I have two types of tables in an HTML page.
<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">
<table class="ui-table hp-raceRecords ui-table_type2">
I need to select only the first one.
If I try something like this:
BSdata1 = BeautifulSoup(driver1.page_source, 'lxml')
Parameters = BSdata1.find_all('table',{'class':'ui-table hp-formTable ui-table_type1 ui-table_sortable'})
It keeps selecting both tables. How can I select only the first one?
I am using Python 3, BS4 and the lxml parser on a Windows machine.

The sample tables in your question aren't properly formatted, but using CSS selectors this should work:
from bs4 import BeautifulSoup

# Sample markup standing in for driver1.page_source
page_source = """
<doc>
<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">First Table</table>
<table class="ui-table hp-raceRecords ui-table_type2">Second Table</table>
</doc>
"""
BSdata1 = BeautifulSoup(page_source, 'lxml')
# Chain every class with a dot so only tables carrying all of them match
Parameters = BSdata1.select('table.ui-table.hp-formTable.ui-table_type1.ui-table_sortable')
print(Parameters)
Output:
[<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">First Table</table>]
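If one class is enough to tell the two tables apart, the selector can be shorter. The following is a minimal sketch, assuming hp-formTable only appears on the table you want; it uses html.parser so it runs without lxml, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the answer above
html = """
<doc>
<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">First Table</table>
<table class="ui-table hp-raceRecords ui-table_type2">Second Table</table>
</doc>
"""
soup = BeautifulSoup(html, "html.parser")

# Assumption: hp-formTable is unique to the table we want
form_tables = soup.select("table.hp-formTable")

# select_one returns the first match (or None) instead of a list
first_form_table = soup.select_one("table.hp-formTable")

# A class shared by both tables matches both of them
shared = soup.select("table.ui-table")

print(len(form_tables), len(shared))
```

Selecting on ui-table alone matches both tables, so the selector needs at least one class that only the first table carries.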

Related

Using BS4 - how to get text only, and not tags?

I am trying to scrape a page on medicine and the market asset for some companies on https://www.formularylookup.com/
The code below gets me the data I want (the number of plans, which pharmacies cover the medicine, and the status in %), but as whole tags instead of text. Here is an example of my output, where the desired output would be just "1330 plans":
Number of plans:
<td class="plan-count" role="gridcell">1330 plans</td>
I have tried using .text after each tag.find, but it doesn't work.
Here's my code concerning this specific part. There's a whole lot more going on above, but it includes log in information I cannot share.
total = []
soup = BeautifulSoup(html, "lxml")
for tag in soup.find_all("tbody", {"role":"rowgroup"}):
    #name = tag.find("td", {"class":"payer-name"}) #gives me whole tag
    name = tag.find("tr", {"role":"row"}).find("td").get("payer-name") #gives me None output
    plan = tag.find("td", {"class":"plan-count"}) #gives me whole tag
    stat = tag.find("td", {"class":"icon-status"}) #gives me whole tag
    data = {"Payer": name, "Number of plans": plan, "Status": stat}
    total.append(data)
df = pd.DataFrame(total)
print(df)
Here is a snippet using the inspect function.
<tbody role="rowgroup">
<tr data-uid="a5795205-1518-4a74-b039-abcd1b35b409" role="row">
<td class="payer-name" role="gridcell">CVS Caremark RX</td>
<td class="plan-count" role="gridcell">1330 plans</td>
<td role="gridcell" class="icon-status icon-status-not-covered">98% Not Covered</td>
</tr>
EDIT: After digging deeper into SO, I see that a solution could be the .contents attribute in BS4. Will report back if it works.
- This didn't work:
"AttributeError: 'NoneType' object has no attribute 'contents'"
I figured it out. Apparently there are other tbody elements with role="rowgroup" further up the page. For those, tag.find(...) returns None, so calling .text on the result fails before my code reaches the rows I want.
I just need to change this line:
for tag in soup.find_all("tbody", {"role":"rowgroup"}):
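The changed line is not shown in the post, so the following is only one possible way to handle it, not necessarily the fix the asker used. It is a minimal sketch that skips any rowgroup without the expected cells and calls get_text on the rest. The first tbody in the sample markup is invented to stand in for the unrelated rowgroups; the second is the snippet from the question.

```python
from bs4 import BeautifulSoup

# The first tbody is a made-up stand-in for the unrelated rowgroups
html = """
<table>
<tbody role="rowgroup"><tr role="row"><td>unrelated</td></tr></tbody>
<tbody role="rowgroup">
<tr data-uid="a5795205-1518-4a74-b039-abcd1b35b409" role="row">
<td class="payer-name" role="gridcell">CVS Caremark RX</td>
<td class="plan-count" role="gridcell">1330 plans</td>
<td role="gridcell" class="icon-status icon-status-not-covered">98% Not Covered</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

total = []
for tag in soup.find_all("tbody", {"role": "rowgroup"}):
    name = tag.find("td", {"class": "payer-name"})
    plan = tag.find("td", {"class": "plan-count"})
    stat = tag.find("td", {"class": "icon-status"})
    # find() returns None when nothing matches, so skip incomplete rowgroups
    if name is None or plan is None or stat is None:
        continue
    total.append({
        "Payer": name.get_text(strip=True),
        "Number of plans": plan.get_text(strip=True),
        "Status": stat.get_text(strip=True),
    })

print(total)
```

The resulting list of dictionaries can be passed to pd.DataFrame exactly as in the question.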

Python: Beautiful soup to get text

I'm trying to get the link from each a href and also the text in the next <td scope="row">.
I've tried
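import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
records = []
for link in soup.find_all('a'):
    Name = link.text
    Links = link.get('href')
    records.append((Name, Links))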
However, this gives me eps8453.htm as the text, since that is the text inside the <a href> tag. Is there any way to get the text, i.e. "10-K", from the <td scope="row"> next to the <a href> tag?
Please help!
Use find_next to get the <td> tag that follows each <a> tag inside the table.
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
records = []
for link in soup.find('table', class_='tableFile').find_all('a'):
    Name = link.text
    Links = link.get('href')
    text = link.find_next('td').contents[0]
    print(Name, text)
    records.append((Name, Links, text))
Output:
eps8453.htm 10-K
ex31-1.htm EX-31.1
ex31-2.htm EX-31.2
ex32-1.htm EX-32.1
yu-logo.jpg GRAPHIC
yu_sig.jpg GRAPHIC
0001171520-19-000171.txt
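The same idea can be tried offline. This is a minimal sketch on invented markup that only imitates the layout described above (a link cell followed by a type cell); the file names and hrefs are hypothetical. It also guards against a link that has no following <td>.

```python
from bs4 import BeautifulSoup

# Invented markup: each link cell is followed by a cell holding the type
html = """
<table class="tableFile">
<tr><td scope="row"><a href="/doc/main.htm">main.htm</a></td><td scope="row">10-K</td></tr>
<tr><td scope="row"><a href="/doc/ex31.htm">ex31.htm</a></td><td scope="row">EX-31.1</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for link in soup.find("table", class_="tableFile").find_all("a"):
    # find_next walks forward in document order, so it skips the cell
    # that contains the link and lands on the cell after it
    cell = link.find_next("td")
    doc_type = cell.get_text(strip=True) if cell is not None else ""
    records.append((link.get_text(strip=True), link.get("href"), doc_type))

print(records)
```

Using get_text(strip=True) instead of contents[0] avoids an IndexError on an empty cell and trims stray whitespace.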
 

Beautiful Soup extract tag attributes, then find_all with multiple attributes

I am trying to extract the same information which appears numerous times on the same page. I am able to find the tag that it fits in which looks like this:
<div class="title" style="visibility: visible">
From this, I'd like to extract:
class="title"
AND
style="visibility: visible"
Then do a:
find_all('div', {'class': 'title', 'style': 'visibility: visible'})
This is going to happen in numerous instances, so I can't hardcode it. Sometimes the tag will have a class, sometimes a class and style....sometimes more....
Is this possible?
Really appreciate any direction on this.
Many thanks,
You can read a tag's attributes from its .attrs dictionary. Use find_all instead of find if you want more than one div in the content.
code:
from bs4 import BeautifulSoup
data = """<div class="title" style="visibility: visible"> </div>"""
soup = BeautifulSoup(data, 'html.parser')  # parse the content with BeautifulSoup
div_content = dict(soup.find("div").attrs)  # copy of the tag's attribute dictionary
print("div_content : {0}".format(div_content))  # all attributes of the div
print("style_content : {0}".format(div_content.get("style")))  # style attribute
print("class_content : {0}".format(div_content.get("class")[0]))  # first class value
output:
div_content : {u'style': u'visibility: visible', u'class': [u'title']}
style_content : visibility: visible
class_content : title
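To avoid hardcoding, the extracted attributes can be passed straight back into find_all. This is a minimal sketch, assuming the first div on the page is the reference tag; the sample markup and variable names are made up. A list value such as the one stored for class matches a tag that has any of the listed classes.

```python
from bs4 import BeautifulSoup

# Made-up markup: two divs share the reference tag's class and style
html = """
<div class="title" style="visibility: visible">First</div>
<div class="title">No style</div>
<div class="other" style="visibility: visible">Other class</div>
<div class="title" style="visibility: visible">Second</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Assumption: the first div is the tag whose attributes we want to reuse
reference = soup.find("div")
wanted = dict(reference.attrs)

# Reuse the tag name and whatever attributes the reference tag happens to have
matches = soup.find_all(reference.name, attrs=wanted)
texts = [m.get_text() for m in matches]

print(texts)
```

Because the attribute dictionary is built at run time, the same code works whether the reference tag has only a class, a class and a style, or more attributes.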

Finding item in beautiful soup by text not tag

I'm trying to get the area for certain locations by scraping it from their Wikipedia pages. Using Cumbria as an example (https://en.wikipedia.org/wiki/Cumbria), I can get the infobox with:
url = 'https://en.wikipedia.org/wiki/Cumbria'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
value = soup.find('table', {"class": "infobox geography vcard"}) \
    .find('tr', {"class":"mergedrow"}).text
However, the infobox geography vcard table has multiple <tr class='mergedrow'> rows, and each one contains a <th scope='row'>.
The <th scope='row'> that I want is <th scope="row">Area</th>. Can I get the text for that row by searching for 'Area' instead of searching by tag, since all the other tags and attributes repeat throughout the infobox geography vcard table?
You can search for all th with scope=row directly. Then iterate over them and see which ones have Area as text, and use find_next_sibling to get the next sibling (which will be the td with the data you need).
Note that this table has 2 Area entries, one for 'Ceremonial county' and one for 'Non-metropolitan county', whatever that means ;).
ths = soup.find_all('th', {'scope': 'row'})
for th in ths:
    if th.text == 'Area':
        area = th.find_next_sibling().text
        print(area)
# 6,768 km2 (2,613 sq mi)
# 6,768 km2 (2,613 sq mi)
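The approach can be checked without fetching the page. This is a minimal sketch on invented markup that imitates the infobox rows; only the Area value is taken from the output above, and the second row is a placeholder.

```python
from bs4 import BeautifulSoup

# Invented markup imitating the infobox; the second row is a placeholder
html = """
<table class="infobox geography vcard">
<tr class="mergedrow"><th scope="row">Area</th><td>6,768 km2 (2,613 sq mi)</td></tr>
<tr class="mergedrow"><th scope="row">Example row</th><td>example value</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

areas = []
for th in soup.find_all("th", {"scope": "row"}):
    # Compare stripped text so surrounding whitespace does not break the match
    if th.get_text(strip=True) == "Area":
        td = th.find_next_sibling("td")
        if td is not None:
            areas.append(td.get_text(strip=True))

print(areas)
```

Passing "td" to find_next_sibling makes the intent explicit and skips any sibling that is not a data cell.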

Beautifulsoup placing </li> in the wrong place because there is no closing tag [duplicate]

I would like to scrape the table from HTML code using BeautifulSoup. A snippet of the HTML is shown below. When using table.findAll('tr') I get the entire table and not only the rows (probably because the closing tags are missing from the HTML code?).
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
As stated in the Beautiful Soup documentation, html5lib parses the document the way a web browser does (lxml behaves similarly in this case). It will try to fix your document tree by adding or closing tags where needed.
For your example I used lxml as the parser, and the following code gave the expected rows:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added html and body tags because they weren't present in the source (it tries to create a well-formed document, as stated above).
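To see the difference yourself, you can print the first row under each parser. This is a minimal sketch on a shortened version of the table from the question; first_row_text is a helper name made up for this example, and parsers that are not installed are skipped. With html.parser the first <tr> typically swallows the later rows, because nothing closes the open tags.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened version of the table from the question, closing tags still missing
data = """<TABLE>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TR><TD>ACTIQ 200 Mikrogramm
<TD>Orifarm
</TABLE>"""

def first_row_text(markup, parser):
    """Return the text of the first <tr> as seen by the given parser."""
    soup = BeautifulSoup(markup, parser)
    first_row = soup.find("table").find("tr")
    return first_row.get_text(" ", strip=True)

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        print(parser, "->", first_row_text(data, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is missing
        print(parser, "is not installed")
```

If the first row printed for a parser contains text from the second row, that parser did not repair the missing closing tags.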
