How to scrape nested text between tags using BeautifulSoup? - python-3.x

I found a website using the following HTML structure somewhere:
...
<td>
<span>some span text</span>
some td text
</td>
...
I'm interested in retrieving the "some td text" and not the "some span text" but the get_text() method seems to return all the text as "some span textsome td text". Is there a way to get just the text inside a certain element using BeautifulSoup?
Not all the tds follow the same structure, so unfortunately I cannot predict the structure of the resulting string to trim it where necessary.

Each element has a name attribute, which tells you the type of tag, e.g. div, td, span. In the case there is no tag (bare content), it will be None.
So you can just use a simple list comprehension to filter out all the tag elements.
from bs4 import BeautifulSoup
html = '''
<td>
<span>some span text</span>
some td text
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
content = soup.find('td')
text = [c.strip() for c in content if c.name is None and c.strip() != '']
print(text)
This will print:
['some td text']
after some cleaning of newlines and empty strings.
If you wanted to join up the content afterwards, you could use join:
print('\n'.join(text))

Related

extract content wherever we have div tag followed by hearder tag by using beautifulsoup

I am trying to extract div tags and header tags when they are together.
ex:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried solution provided in below link.
here the header tag inside div tag...
but my requirement is div tag after header tag.
Scraping text in h3 and div tags using beautifulSoup, Python
also i tried something like this but not worked
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup..find_all({"h3", "div"})
I need content from H3 tag and all the content inside div tag where ever these two combination exists.
You could use CSS selector h3:has(+div) - this will select all <h3> which have div immediately after it:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Beacause this is P tag, not DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
print('Header:')
print(h3.text)
print('Next <div>:')
print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
Further reading:
CSS Selectors reference

Python: Beautiful soup to get text

I'm trying to get link under a href and also the text available in the next <td scope = "raw">
I've tried
url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
records = []
for link in soup.find_all('a'):
Name = link.text
Links = link.get('href')
records.append((Name, Links))
However this gives me eps8453.htm as text since this is the text under tag <a href>. Is there any way we can look for the text i.e. "10-K" in the tag <td scope = "raw"> next to tag <a href>
Please help!
Use find_next <td> tag after <a> tag inside table.
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
html=requests.get(url).text
soup=BeautifulSoup(html,'html.parser')
records = []
for link in soup.find('table', class_='tableFile').find_all('a'):
Name = link.text
Links = link.get('href')
text=link.find_next('td').contents[0]
print(Name,text)
records.append((Name, Links,text))
Output:
eps8453.htm 10-K
ex31-1.htm EX-31.1
ex31-2.htm EX-31.2
ex32-1.htm EX-32.1
yu-logo.jpg GRAPHIC
yu_sig.jpg GRAPHIC
0001171520-19-000171.txt
 

beautifulsoup get value of attribute using get_attr method

I'd like to print all items in the list, but not containing the style tag = the following value: "text-align: center"
test = soup.find_all("p")
for x in test:
if not x.has_attr('style'):
print(x)
Essentially, return me all items in list where style is not equal to: "text-align: center". Probably just a small error here, but is it possible to define the value of style in has_attr?
Just check if the specific style is present in the Tag's style. Style is not considered a multi-valued attribute and the entire string inside quotes is the value of style attribute. Using x.get("style",'') instead of x['style'] also handles cases in which there is no style attribute and avoids KeyError.
for x in test:
if 'text-align: center' not in x.get("style",''):
print(x)
You can also use list comprehension to skip a few lines.
test=[x for x in soup.find_all("p") if 'text-align: center' not in x.get("style",'')]
print(test)
If you wanted to consider a different approach you could use the :not selector
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head>
<title>Try jsoup</title>
</head>
<body>
<p style="color:green">This is the chosen paragraph.</p>
<p style="text-align: center">This is another paragraph.</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
items = [item.text for item in soup.select('p:not([style="text-align: center"])')]
print(items)

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from the span tag within multiple span tags with similar names. Using python, beautifulsoup to parse the website.
Just cannot uniquely identify that specific gross-amount span element.
The span tag has name=nv and a data value but the other one has that too. I just wanna extract the gross numerical dollar figure in millions.
Please advise.
this is the structure :
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
Want the text from second span under span class= text muted Gross.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701

Beautifulsoup placing </li> in the wrong place because there is no closing tag [duplicate]

I would like to scrape the table from html code using beautifulsoup. A snippet of the html is shown below. When using table.findAll('tr') I get the entire table and not only the rows. (probably because the closing tags are missing from the html code?)
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
print(tr.text)
As stated in their documentation html5lib parses the document as the web browser does (Like lxmlin this case). It'll try to fix your document tree by adding/closing tags when needed.
In your example I've used lxml as the parser and it gave the following result:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
print(tr.get_text(strip=True))
Note that lxml added html & body tags because they weren't present in the source (It'll try to create a well formed document as previously state).

Resources