Beautifulsoup placing </li> in the wrong place because there is no closing tag [duplicate] - python-3.x

I would like to scrape a table from HTML using BeautifulSoup. A snippet of the HTML is shown below. When using table.findAll('tr') I get the entire table rather than the individual rows (probably because the closing tags are missing from the HTML?).
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)

As stated in the documentation, html5lib parses the document the same way a web browser does (as lxml does in this case). It will try to fix your document tree by adding/closing tags where needed.
In your example I've used lxml as the parser and it gave the following result:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")
table = soup.find_all("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added html and body tags because they weren't present in the source (it tries to create a well-formed document, as stated above).
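The difference can be sketched with a cut-down version of the markup above (assuming lxml is installed alongside BeautifulSoup):

```python
from bs4 import BeautifulSoup

# Cut-down version of the markup: the <td>/<tr> tags are never closed.
html = "<table><tr><td>A<td>B<tr><td>C<td>D</table>"

# html.parser takes the markup literally, so the unclosed tags nest and
# the first <tr> ends up containing every cell in the table.
first_builtin = BeautifulSoup(html, "html.parser").find("tr")
print(first_builtin.get_text())  # ABCD

# lxml repairs the tree the way a browser would, closing each <td> and
# <tr> before the next one starts, giving two separate rows.
rows = BeautifulSoup(html, "lxml").find_all("tr")
print([tr.get_text() for tr in rows])  # ['AB', 'CD']
```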

Related

How to scrape nested text between tags using BeautifulSoup?

I found a website using the following HTML structure somewhere:
...
<td>
<span>some span text</span>
some td text
</td>
...
I'm interested in retrieving the "some td text" and not the "some span text" but the get_text() method seems to return all the text as "some span textsome td text". Is there a way to get just the text inside a certain element using BeautifulSoup?
Not all the tds follow the same structure, so unfortunately I cannot predict the structure of the resulting string to trim it where necessary.
Each element has a name attribute, which tells you the type of tag, e.g. div, td, span. For a bare text node (no tag), name is None.
So you can just use a simple list comprehension to filter out all the tag elements.
from bs4 import BeautifulSoup
html = '''
<td>
<span>some span text</span>
some td text
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
content = soup.find('td')
text = [c.strip() for c in content if c.name is None and c.strip() != '']
print(text)
This will print:
['some td text']
after some cleaning of newlines and empty strings.
If you wanted to join up the content afterwards, you could use join:
print('\n'.join(text))
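An alternative sketch, assuming the same snippet: rather than filtering on name, ask find_all() for only the direct text children:

```python
from bs4 import BeautifulSoup

html = "<td><span>some span text</span>some td text</td>"
td = BeautifulSoup(html, "html.parser").td

# string=True matches only text nodes, and recursive=False restricts the
# search to direct children of <td>, so the text inside <span> is skipped.
direct = [s.strip() for s in td.find_all(string=True, recursive=False) if s.strip()]
print(direct)  # ['some td text']
```

(string= is the modern spelling of the older text= argument.)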

Beautiful soup class selector

I have two types of tables in a html page.
<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">
<table class="ui-table hp-raceRecords ui-table_type2">
I need to select only the first one.
If I try something like this:
BSdata1 = BeautifulSoup(driver1.page_source, 'lxml')
Parameters = BSdata1.find_all('table',{'class':'ui-table hp-formTable ui-table_type1 ui-table_sortable'})
It keeps selecting both Tables. How can I select only the first one?
I am using Python3, BS4 and lxml parser on Windows machine.
The sample tables in your question aren't properly formatted, but using CSS selectors this should work:
page_source = """
<doc>
<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">First Table</table>
<table class="ui-table hp-raceRecords ui-table_type2">Second Table</table>
</doc>
"""
BSdata1 = BeautifulSoup(page_source, 'lxml')
Parameters = BSdata1.select('table.ui-table.hp-formTable.ui-table_type1.ui-table_sortable')
print(Parameters)
Output:
[<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">First Table</table>]
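Since only the first table is wanted, select_one() is a slightly shorter variant of the same idea (a sketch using the question's two tables):

```python
from bs4 import BeautifulSoup

page_source = """
<doc>
<table class="ui-table hp-formTable ui-table_type1 ui-table_sortable">First Table</table>
<table class="ui-table hp-raceRecords ui-table_type2">Second Table</table>
</doc>
"""
soup = BeautifulSoup(page_source, "html.parser")

# select_one() returns the first element matching the selector (or None);
# here the ui-table_type1 class alone is enough to identify the table.
first = soup.select_one("table.ui-table_type1")
print(first.get_text())  # First Table
```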

Using BS4 - how to get text only, and not tags?

I am trying to scrape a page on medicine and the market asset for some companies on https://www.formularylookup.com/
The code below gets me the desired data: the number of plans, which pharmacies cover the medicine, and the coverage status in %. Here is an example of my output, where the desired output would be just "1330 plans":
Number of plans:
<td class="plan-count" role="gridcell">1330 plans</td>
I have tried using .text after each tag.find, but it doesn't work.
Here's my code concerning this specific part. There's a whole lot more going on above, but it includes log in information I cannot share.
total = []
soup = BeautifulSoup(html, "lxml")
for tag in soup.find_all("tbody", {"role":"rowgroup"}):
    #name = tag.find("td", {"class":"payer-name"})  # gives me the whole tag
    name = tag.find("tr", {"role":"row"}).find("td").get("payer-name")  # gives me None output
    plan = tag.find("td", {"class":"plan-count"})  # gives me the whole tag
    stat = tag.find("td", {"class":"icon-status"})  # gives me the whole tag
    data = {"Payer": name, "Number of plans": plan, "Status": stat}
    total.append(data)
df = pd.DataFrame(total)
print(df)
Here is a snippet taken from the browser's Inspect tool.
<tbody role="rowgroup">
<tr data-uid="a5795205-1518-4a74-b039-abcd1b35b409" role="row">
<td class="payer-name" role="gridcell">CVS Caremark RX</td>
<td class="plan-count" role="gridcell">1330 plans</td>
<td role="gridcell" class="icon-status icon-status-not-covered">98% Not Covered</td>
</tr>
EDIT: After diving deeper into SO I see a solution could be using the contents attribute of BS4. Will report back if it works.
This didn't work:
"AttributeError: 'NoneType' object has no attribute 'contents'"
I figured it out. Apparently there are other <tbody role="rowgroup"> tags further up the page whose cells come back as None, so it is not possible to take .text of them before my code reaches the parts I want.
I just need to change this line:
for tag in soup.find_all("tbody", {"role":"rowgroup"}):
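A minimal sketch of the fixed loop, using the snippet from the question: take .text of each cell, and skip any rowgroup whose cells come back as None:

```python
from bs4 import BeautifulSoup

html = """
<tbody role="rowgroup">
<tr data-uid="a5795205-1518-4a74-b039-abcd1b35b409" role="row">
<td class="payer-name" role="gridcell">CVS Caremark RX</td>
<td class="plan-count" role="gridcell">1330 plans</td>
<td role="gridcell" class="icon-status icon-status-not-covered">98% Not Covered</td>
</tr>
</tbody>
"""
soup = BeautifulSoup(html, "html.parser")
total = []
for tag in soup.find_all("tbody", {"role": "rowgroup"}):
    name = tag.find("td", {"class": "payer-name"})
    plan = tag.find("td", {"class": "plan-count"})
    stat = tag.find("td", {"class": "icon-status"})
    if name is None or plan is None or stat is None:
        continue  # other rowgroups on the page lack these cells
    total.append({"Payer": name.text,
                  "Number of plans": plan.text,
                  "Status": stat.text})
print(total)
```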

Python: Beautiful soup to get text

I'm trying to get the link under an <a href> tag and also the text in the next <td scope="raw">.
I've tried
url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
records = []
for link in soup.find_all('a'):
    Name = link.text
    Links = link.get('href')
    records.append((Name, Links))
However, this gives me eps8453.htm as the text, since that is the text under the <a href> tag. Is there any way to look for the text, i.e. "10-K", in the <td scope="raw"> tag next to the <a href> tag?
Please help!
Use find_next() to get the <td> tag after each <a> tag inside the table.
import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
records = []
for link in soup.find('table', class_='tableFile').find_all('a'):
    Name = link.text
    Links = link.get('href')
    text = link.find_next('td').contents[0]
    print(Name, text)
    records.append((Name, Links, text))
Output:
eps8453.htm 10-K
ex31-1.htm EX-31.1
ex31-2.htm EX-31.2
ex32-1.htm EX-32.1
yu-logo.jpg GRAPHIC
yu_sig.jpg GRAPHIC
0001171520-19-000171.txt
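The behaviour of find_next() can be sketched on a hypothetical row shaped like the filing index table:

```python
from bs4 import BeautifulSoup

# Hypothetical row shaped like the filing index table on the SEC page.
html = ("<table class='tableFile'><tr>"
        "<td><a href='eps8453.htm'>eps8453.htm</a></td>"
        "<td scope='raw'>10-K</td>"
        "</tr></table>")
a = BeautifulSoup(html, "html.parser").find("a")

# find_next('td') walks forward in document order from the <a> tag, past
# its own parent cell, and lands on the next <td>, which holds the type.
filing_type = a.find_next("td").get_text()
print(filing_type)  # 10-K
```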
 

Get href within a table

Sorry, this has most likely been asked before, but I can't seem to find an answer on Stack Overflow or via a search engine.
I'm trying to scrape some data from a table, but there are href links which I need to get. The HTML is as follows:
<table class="featprop results">
<tr>
**1)**<td class="propname" colspan="2"> West Drayton</td>
</tr>
<tr><td class="propimg" colspan="2">
<div class="imgcrop">
**2)**<img src="content/images/1/1/641/w296/858.jpg" alt=" Ashford" width="148"/>
<div class="let"> </div>
</div>
</td></tr>
<tr><td class="proprooms">
So far I have used the following:
for table in soup.findAll('table', {'class': 'featprop results'}):
    for tr in table.findAll('tr'):
        for a in tr.findAll('a'):
            print(a)
This returns both 1) and 2) in the above HTML. Could anyone help me pull out just the href link?
for table in soup.findAll('table', {'class': 'featprop results'}):
    for tr in table.findAll('tr'):
        for a in tr.findAll('a'):
            print(a['href'])
Output:
/lettings-search-results?task=View&itemid=136
/lettings-search-results?task=View&itemid=136
EDIT:
import re

links = set()  # a set removes the duplicate hrefs
for a in soup.find_all('a', href=re.compile(r'^/lettings-search-results\?')):
    links.add(a['href'])
This gives you a list of the <a> tags under the element with the selected class name:
result = soup.select(".featprop a")
for a in result:
    print(a['href'])
This gives you the result below:
/lettings-search-results?task=View&itemid=136
/lettings-search-results?task=View&itemid=136
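Since both matches point at the same listing, the printed hrefs are duplicates; a small sketch of de-duplicating them while keeping document order (dict.fromkeys(), unlike a plain set, preserves insertion order):

```python
from bs4 import BeautifulSoup

html = """
<table class="featprop results">
<tr><td><a href="/lettings-search-results?task=View&amp;itemid=136">West Drayton</a></td></tr>
<tr><td><a href="/lettings-search-results?task=View&amp;itemid=136">Ashford</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = [a['href'] for a in soup.select('table.featprop a')]
# dict.fromkeys() drops duplicates while preserving document order.
unique = list(dict.fromkeys(hrefs))
print(unique)
```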
