Find tags that contain a partial string in their text - python-3.x

<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
soup.findAll('div', text = re.compile('456'))
Only returns div b, no others.
soup.findAll('div', text = re.compile('45'))
Only returns div b, no others.
How can I also return the other divs, whose text only partially matches the given string?

The solution is very similar to an existing answer that filters with a lambda. All you have to do is tweak the lambda so that it tests whether the substring occurs in the tag's text. Here is the full code:
from bs4 import BeautifulSoup
html = '''
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
'''
soup = BeautifulSoup(html,'html5lib')
divs = soup.find_all("div", text=lambda text: text and '456' in text)
Output:
>>> divs
[<div id="a">123 456</div>, <div id="b">456</div>, <div id="c">123 456 789</div>]
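One caveat worth illustrating: the text filter is applied to a tag's .string, which is None when the tag contains child tags as well as text. The sketch below adds a hypothetical fourth div (id="d", not part of the question) and compares the string filter with a tag filter based on get_text(). It uses the newer string= keyword and the built-in html.parser; both are assumptions about your setup.

```python
from bs4 import BeautifulSoup

html = """
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
<div id="d"><b>123</b> 456</div>
"""  # div "d" is a made-up case with nested markup

soup = BeautifulSoup(html, "html.parser")

# String filter: the lambda receives tag.string, which is None for div "d".
by_string = soup.find_all("div", string=lambda s: s and "456" in s)

# Tag filter: the lambda receives the tag and checks its combined text.
by_text = soup.find_all(lambda tag: tag.name == "div" and "456" in tag.get_text())

string_ids = [d["id"] for d in by_string]
text_ids = [d["id"] for d in by_text]
print(string_ids)
print(text_ids)
```

If your real page has nested tags inside the divs, the get_text() variant is probably the one you want.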

Related

How to get the text of all the elements in a given html in Python3?

How to extract all the text of elements from the following html:
from bs4 import BeautifulSoup
html3 = """
<div class="tab-cell l1">
<span class="cyan-90">***</span>
<h2 class="white-80">
<a class="k-link" href="#" title="Jump">Jump</a>
</h2>
<h3 class="black-70">
<span>Red</span>
<span class="black-50">lock</span>
</h3>
<div class="l-block">
<a class="lang-menu" href="#">A</a>
<a class="lang-menu" href="#">B</a>
<a class="lang-menu" href="#">C</a>
</div>
<div class="black-50">
<div class="p-bold">Period</div>
<div class="tab--cell">$</div><div class="white-90">Method</div>
<div class="tab--cell">$</div><div class="tab--cell">Type</div>
</div>
</div>
"""
soup = BeautifulSoup(html3, "lxml")
if soup.find('div', attrs={'class': 'tab-cell l1'}):
    div_descendants = soup.div.descendants
    for des in div_descendants:
        if des.name is not None:
            print(des.name)
            if des.find(class_='k-link'):
                print(des.a.string)
            if des.find(class_='black-70'):
                print('span')
                print(des.span.text)
I'm getting the text of only the first link; after that I'm unable to get anything.
I would like to crawl line by line and get whatever I want. If anyone has any idea, please let me know.
Your own if-conditions prevent you from getting everything. You only print text in two cases, each guarded by a class_=... condition, so every other tag is skipped:
# html3 = see above
from bs4 import BeautifulSoup
import lxml
soup = BeautifulSoup(html3, "lxml")
if soup.find('div', attrs={'class': 'tab-cell l1'}):
    div_descendants = soup.div.descendants
    for des in div_descendants:
        if des.name is not None:
            print(des.name)
            found = False
            if des.find(class_='k-link'):
                print(des.a.string)
                found = True
            if des.find(class_='black-70'):
                print('span')
                print(des.span.text)
                found = True
            # find all others that are not already reported:
            if not found:
                print(f"Other {des.name}: {des.string}")
Output:
span
Other span: ***
h2
Jump
a
Other a: Jump
h3
Other h3: None
span
Other span: Red
span
Other span: lock
div
Other div: None
a
Other a: A
a
Other a: B
a
Other a: C
div
Other div: None
div
Other div: Period
div
Other div: $
div
Other div: Method
div
Other div: $
div
Other div: Type
Sorted out the issue like this:
from bs4 import BeautifulSoup
import lxml
html3 = """
<div class="tab-cell l1">
<span class="cyan-90">***</span>
<h2 class="white-80">
<a class="k-link" href="#" title="Jump">Jump</a>
</h2>
<h3 class="black-70">
<span>Red</span>
<span class="black-50">lock</span>
</h3>
<div class="l-block">
<a class="lang-menu" href="#">A</a>
<a class="lang-menu" href="#">B</a>
<a class="lang-menu" href="#">C</a>
</div>
<div class="black-50">
<div class="p-bold">Period</div>
<div class="tab--cell">$</div><div class="white-90">Method</div>
<div class="tab--cell">$</div><div class="tab--cell">Type</div>
</div>
</div>
"""
soup = BeautifulSoup(html3, "lxml")
if soup.find('div', attrs={'class': 'tab-cell l1'}):
    div_descendants = soup.div.descendants
    for des in div_descendants:
        if des.name is not None and des.string is not None:
            print(f"{des.name}: {des.string}")
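If the goal is simply every piece of text in document order, a shorter route may be stripped_strings, which yields each non-empty text node with surrounding whitespace removed. The sketch below uses a shortened copy of html3 (an assumption, to keep the example small) and the built-in html.parser instead of lxml.

```python
from bs4 import BeautifulSoup

# Shortened copy of html3 from the question.
html3 = """
<div class="tab-cell l1">
  <span class="cyan-90">***</span>
  <h2 class="white-80"><a class="k-link" href="#" title="Jump">Jump</a></h2>
  <h3 class="black-70"><span>Red</span> <span class="black-50">lock</span></h3>
  <div class="black-50">
    <div class="p-bold">Period</div>
    <div class="tab--cell">$</div><div class="white-90">Method</div>
  </div>
</div>
"""

soup = BeautifulSoup(html3, "html.parser")
container = soup.find("div", class_="tab-cell")

# All text nodes, in order, without whitespace-only entries.
texts = list(container.stripped_strings)

# The same text paired with the name of the tag that directly holds it.
pairs = [
    (s.parent.name, s.strip())
    for s in container.find_all(string=True)
    if s.strip()
]

for tag_name, value in pairs:
    print(f"{tag_name}: {value}")
```

Unlike the des.string check, this also picks up text from tags that mix text with child tags.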

Python 3 BeautifulSoup4: extracting text from multiple sibling tags in a container

I am new to Python, and I am trying to use BeautifulSoup to extract only the text from a group of tags. The first tag is 'name' and the second is 'date'. I can grab the text from either name or date, just not from both together. Here is the HTML of the page I am trying to scrape:
<div class="results">
<h1>
Info Records
</h1>
<div class="group">
<a class="name" href="https://" target="_blank">
Firstname, Lastname
</a>
<br/>
<span class="date">
8/24/2020: Text info
</span>
</div>
<div class="group">
<a class="name" href="https://" target="_blank">
Different Firstname, Different Lastname
</a>
<br/>
<span class="date">
8/23/2020: Different Text Info
</span>
</div>
For the names I use this code, which pulls the names and prints them to the terminal. For the dates I change the class name to 'date'.
for arrest in soup.find_all('a', {'class': 'name'}):
    name = arrest.text
    print(name)
The HTML has about 20 names and dates; I only posted the first two. When I try to get both classes together, it doesn't work:
test = soup.find_all("div", {"class": ["name", "date"]})
print(test)
Also, the part that does work doesn't write to a text file. Ideally, I would like something like this to be added to an output file:
firstname lastname
8/24/2020 Text info
firstname lastname
8/23/2020 different text info
Any advice would be helpful. I've been reading all day trying to figure it out.
There are two options: use a CSS selector, or use zip().
Option 1 (CSS selector):
from bs4 import BeautifulSoup
html = """YOUR ABOVE HTML SNIPPET"""
soup = BeautifulSoup(html, "html.parser")
with open("output.txt", "w") as f:
    # select `name` and `date` class
    for tags in soup.select(".name, .date"):
        f.write(tags.text.strip() + "\n")
Option 2 (zip()):
with open("output.txt", "w") as f:
    for name, date in zip(
        soup.find_all("a", {"class": "name"}), soup.find_all("span", {"class": "date"})
    ):
        f.write(name.text.strip() + "\n")
        f.write(date.text.strip() + "\n")
output.txt:
Firstname, Lastname
8/24/2020: Text info
Different Firstname, Different Lastname
8/23/2020: Different Text Info
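Both options assume every name has a matching date. A possible third approach, sketched below, walks each div.group container and reads the name and date inside it, so a missing tag cannot shift the pairing. The third group without a date is a made-up case added for illustration, and the records list is a hypothetical name.

```python
from bs4 import BeautifulSoup

html = """
<div class="results">
  <div class="group">
    <a class="name" href="https://" target="_blank">
      Firstname, Lastname
    </a>
    <br/>
    <span class="date">
      8/24/2020: Text info
    </span>
  </div>
  <div class="group">
    <a class="name" href="https://" target="_blank">
      Different Firstname, Different Lastname
    </a>
    <br/>
    <span class="date">
      8/23/2020: Different Text Info
    </span>
  </div>
  <div class="group">
    <a class="name" href="https://" target="_blank">No Date, Person</a>
  </div>
</div>
"""  # the last group is hypothetical: it has no date

soup = BeautifulSoup(html, "html.parser")

records = []
for group in soup.select("div.group"):
    name_tag = group.select_one("a.name")
    date_tag = group.select_one("span.date")
    # Fall back to an empty string when a tag is missing from the group.
    name = name_tag.get_text(strip=True) if name_tag else ""
    date = date_tag.get_text(strip=True) if date_tag else ""
    records.append((name, date))

for name, date in records:
    print(name)
    print(date)
```

The records can be written to output.txt with the same with open(...) pattern used in the two options.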

How to match several class values in soup.findAll (Python 3.2)

How can I match a tag whose class attribute holds several class names in one string?
<div class="col-md-9 bt-product-main-info"></div>
I'm using
soup.findAll(match_class("col-lg-3 col-md-4 col-sm-6 bt-product-list"))
But it's not working.
Thank You.
Given the following HTML text:
text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""
If you want only the elements whose class attribute matches exactly, for example col-md-9 bt-product-main-info, then do:
soup.find_all('div', class_ = 'col-md-9 bt-product-main-info')
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>]
If you want the elements that have any of the given class names, for example col-md-9 or bt-product-main-info, then do:
soup.find_all('div', class_ = ['col-md-9', 'bt-product-main-info'])
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>,
<div class="col-md-9">asdas</div>]
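A third case is matching elements that carry all of the given classes, regardless of order or extra classes. One way to sketch this is a CSS selector that chains the class names. The reordered div below is a made-up addition to the sample text, and select() is assumed to be available (Beautiful Soup 4.7 or later).

```python
from bs4 import BeautifulSoup

text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="bt-product-main-info col-md-9">reordered</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""  # the "reordered" div is hypothetical

soup = BeautifulSoup(text, "html.parser")

# Chained class selectors: the div must have both classes, in any order.
both = soup.select("div.col-md-9.bt-product-main-info")
both_texts = [d.get_text() for d in both]
print(both_texts)
```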

Scrape the text of one span among several spans with the same name inside a p tag

I want to scrape the text of one span tag that sits among several span tags with similar attributes. I am using Python and BeautifulSoup to parse the website.
I just cannot uniquely identify that specific gross-amount span element.
The span tag has name="nv" and a data-value attribute, but the votes span has those too. I just want to extract the gross dollar figure in millions.
Please advise.
This is the structure:
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the span that follows the span with class text-muted and text Gross:.
What you can do is find the <span> tag whose text is 'Gross:'. Once you have that tag, find the next <span> tag (which holds the amount) and get its text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
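The chained calls above raise an error when the label is missing from the page. A more defensive variant could look like the sketch below, where value_after_label is a hypothetical helper name. It looks up the label span, then the next span with name="nv" (passed through attrs, because name is already the tag-name parameter of the find methods), and returns None when either is absent.

```python
from bs4 import BeautifulSoup

html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''


def value_after_label(soup, label):
    """Return the integer data-value of the span following `label`, or None."""
    label_tag = soup.find("span", string=label)
    if label_tag is None:
        return None
    # The attribute called "name" has to go through attrs.
    value_tag = label_tag.find_next("span", attrs={"name": "nv"})
    if value_tag is None:
        return None
    return int(value_tag["data-value"].replace(",", ""))


soup = BeautifulSoup(html, "html.parser")
gross = value_after_label(soup, "Gross:")
votes = value_after_label(soup, "Votes:")
missing = value_after_label(soup, "Budget:")
print(gross, votes, missing)
```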

Selecting proper div class with BeautifulSoup

I have an HTML document with 3 types of div classes:
<div class="message">
<div class="message message__current">
<div class="message message__current message--grouped">
When I do
all_messages_2 = soup.find_all("div", class_="message message__current")
it selects only the type 2 divs.
But then when I want to select only type 1 and I do
all_messages_1 = soup.find_all("div", class_="message")
it selects all 3 types of div.
Could you help, please?
Use a lambda to select each div tag whose class attribute matches exactly the list of classes you want.
from bs4 import BeautifulSoup
html = """
<div class="message">
<div class="message message__current">
<div class="message message__current message--grouped">
"""
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['message'])
print (len(tags))
tags = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['message', 'message__current'])
print (len(tags))
tags = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['message', 'message__current', 'message--grouped'])
print (len(tags))
Outputs:
1
1
1
Note that without closing tags, the parser treats all three divs as still open until the end of the HTML, so they end up nested. Selecting text from the first div will therefore include the text of the other two, and selecting text from the second will include the text of the third.
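The list comparison in the lambda depends on the order of the class names. A small variation, sketched below with properly closed divs and made-up text content, compares sets instead, so the order in the attribute does not matter. has_exact_classes is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

html = """
<div class="message">first</div>
<div class="message message__current">second</div>
<div class="message message__current message--grouped">third</div>
"""  # closing tags and text are added for illustration


def has_exact_classes(*wanted):
    """Build a filter matching divs whose class set equals `wanted`."""
    wanted_set = set(wanted)

    def matcher(tag):
        return tag.name == "div" and set(tag.get("class", [])) == wanted_set

    return matcher


soup = BeautifulSoup(html, "html.parser")

only_message = soup.find_all(has_exact_classes("message"))
# Class names given in a different order than in the HTML still match.
current = soup.find_all(has_exact_classes("message__current", "message"))

print([t.get_text() for t in only_message])
print([t.get_text() for t in current])
```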
