Get text after dynamic class with title text Python/bs4 - python-3.x

The div with class "label" and text "Owner 1 Name" appears at a different position in each record, so indexing by class name isn't reliable. I'm trying to grab the name "Joe Smith" that follows that label. Some records have "Company Name" first.
<div>
    <div class="label">Owner 1 Name</div>
    <div class="value">
        <div>Joe Smith</div>
    </div>
</div>
<div>
    <div class="label">Company Name</div>
    <div class="value">
        <div>ACME CO</div>
    </div>
</div>
There are roughly ten "label" divs in a row like the ones above.
The "Owner 1 Name" label ends up in a different position every time, depending on the record. I just need the name value for each record.

Try it this way:
company = """your html above"""
from bs4 import BeautifulSoup as bs
soup = bs(company,'lxml')
target = soup.select('div[class="label"]:contains("Owner")+div>div')
print(target[0].text)
Output:
Joe Smith

This did the trick:
target = soup.find("div", text="Owner 1 Name")
print(target.find_next_sibling("div").get_text())
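The sibling lookup above can be wrapped in a small helper so that a missing label does not raise an error. This is only a sketch: the function name `value_for_label` is made up for illustration, the closing tags in the sample markup are assumed, and it uses the `string=` argument, which newer Beautiful Soup 4 releases prefer over `text=`.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the closing tags are assumed.
html = """
<div>
  <div class="label">Company Name</div>
  <div class="value"><div>ACME CO</div></div>
</div>
<div>
  <div class="label">Owner 1 Name</div>
  <div class="value"><div>Joe Smith</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for_label(soup, label_text):
    """Return the text of the 'value' div following the given label, or None."""
    # Find the label by its exact text, wherever it sits in the record.
    label = soup.find("div", class_="label", string=label_text)
    if label is None:
        return None
    # The value is the next sibling div with class "value".
    value = label.find_next_sibling("div", class_="value")
    return value.get_text(strip=True) if value else None


owner = value_for_label(soup, "Owner 1 Name")
print(owner)
```

Because the lookup is keyed on the label text rather than on position, it should not matter whether "Company Name" comes first.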

Related

find tags contain partial strings in text

<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
soup.findAll('div', text = re.compile('456'))
Only returns div b, no others.
soup.findAll('div', text = re.compile('45'))
Only returns div b, no others.
How can I return the other divs that partially match the specific string?
You can pass a function as the text filter and have it test for a substring; all you have to do is tweak the lambda function a bit. Here is the full code:
from bs4 import BeautifulSoup
html = '''
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
'''
soup = BeautifulSoup(html,'html5lib')
divs = soup.find_all("div", text=lambda text: text and '456' in text)
Output:
>>> divs
[<div id="a">123 456</div>, <div id="b">456</div>, <div id="c">123 456 789</div>]
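For comparison, a compiled regular expression can also serve as the filter. The sketch below assumes a reasonably recent Beautiful Soup 4, where the pattern is applied with `search()` and so matches anywhere in the tag's string. The extra div with id "d" is added here only to show a non-match.

```python
import re

from bs4 import BeautifulSoup

html = """
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
<div id="d">no match here</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The pattern is searched for anywhere inside each div's string.
divs = soup.find_all("div", string=re.compile("456"))
ids = [div["id"] for div in divs]
print(ids)
```

Both the lambda and the regex only see tags that contain a single string; a div with nested markup has no `.string`, so it would need a filter based on `get_text()` instead.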

Python3 Beautifulsoup4 extracting text from multiple container siblings

I am new to Python, and I am trying to use BeautifulSoup to extract only the text from a group of tags. The first tag is 'name' and the second is 'date'. I can grab the text from either name or date, just not both together. Here is the HTML of the page I am trying to scrape:
<div class="results">
<h1>
Info Records
</h1>
<div class="group">
<a class="name" href="https://" target="_blank">
Firstname, Lastname
</a>
<br/>
<span class="date">
8/24/2020: Text info
</span>
</div>
<div class="group">
<a class="name" href="https://" target="_blank">
Different Firstname, Different Lastname
</a>
<br/>
<span class="date">
8/23/2020: Different Text Info
</span>
</div>
For the names I use this code, which pulls the names and prints them to the terminal. For the dates I change the class name to 'date'.
for arrest in soup.find_all('a', {'class': 'name'}):
    name = arrest.text
    print(name)
The HTML has about 20 names and dates; I only posted the first two. When I try to print both classes together, it doesn't work.
test = soup.find_all("div", {"class": ["name", "date"]})
print(test)
Also, the code that does work doesn't write to a text file. Ideally, I would like something like this to be added to an output file:
firstname lastname
8/24/2020 Text info
firstname lastname
8/23/2020 different text info
Any advice would be helpful. I've been reading all day trying to figure it out.
Option 1: use a CSS selector.
Option 2: use zip()
1:
from bs4 import BeautifulSoup
html = """YOUR ABOVE HTML SNIPPET"""
soup = BeautifulSoup(html, "html.parser")
with open("output.txt", "w") as f:
    # select `name` and `date` class
    for tags in soup.select(".name, .date"):
        f.write(tags.text.strip() + "\n")
2:
with open("output.txt", "w") as f:
    for name, date in zip(
        soup.find_all("a", {"class": "name"}), soup.find_all("span", {"class": "date"})
    ):
        f.write(name.text.strip() + "\n")
        f.write(date.text.strip() + "\n")
output.txt:
Firstname, Lastname
8/24/2020: Text info
Different Firstname, Different Lastname
8/23/2020: Different Text Info
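If some groups might lack a name or a date, pairing two separate lists with zip() could misalign the records. A possible alternative, sketched here on the assumption that every record sits in its own `div.group` as in the question, is to iterate over the groups and look up both tags inside each one. The second group below deliberately has no date to show the fallback.

```python
from bs4 import BeautifulSoup

html = """
<div class="results">
  <div class="group">
    <a class="name" href="https://" target="_blank"> Firstname, Lastname </a>
    <br/>
    <span class="date"> 8/24/2020: Text info </span>
  </div>
  <div class="group">
    <a class="name" href="https://" target="_blank"> Different Firstname, Different Lastname </a>
    <br/>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for group in soup.select("div.group"):
    # Look up both tags within the same group so they stay paired.
    name = group.select_one("a.name")
    date = group.select_one("span.date")
    records.append((
        name.get_text(strip=True) if name else "",
        date.get_text(strip=True) if date else "",
    ))

for name, date in records:
    print(name)
    print(date)
```

The same loop can write to a file instead of printing, as in the two options above.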

Python - Web Scraping :How to access div tag of 1 class when I am getting data for div tags for multiple classes

I want div tags of two different classes in my result.
I am using the following command to scrape the data -
'''
result = soup.select('div', {'class' : ['col-s-12', 'search-page-text clearfix row'] })
'''
I have one set of information in class 'col-s-12' and another set of information in class 'search-page-text clearfix row'.
I want to find the children of only the div tag with class 'col-s-12'. When I run the code below, it looks for children of both div tags, since I have not specified anywhere which class I want to search.
'''
for div in result:
    print(div)
    prod_name = div.find("a" , recursive=False)[0] #should come from 'col-s-12' only
    prod_info = div.find("a" , recursive=False)[0] # should come from 'search-page-text clearfix row' only
'''
Example -
'''
<div class = 'col-s-12'>
<a> This is what I want or variable **prod_name** </a>
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>
'''
You can search for the first <a> tag under the tag with class="col-s-12" and then use .find_next('a') to find the next <a> tag.
Note: .select() method accepts only CSS selectors, not dictionaries.
For example:
txt = '''<div class = 'col-s-12'>
<a> This is what I want or variable **prod_name** </a>
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
prod_name = soup.select_one('.col-s-12 > a')
prod_info = prod_name.find_next('a')
print(prod_name.get_text(strip=True))
print(prod_info.get_text(strip=True))
Prints:
This is what I want or variable **prod_name**
This should be stored in variable **prod_info**
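If the two divs are not guaranteed to follow each other, another option is to select each one by its own class with a separate CSS selector. This is a sketch with assumed markup (an <a> tag inside each div) and placeholder texts. Chaining class names without spaces requires all of them on the same element.

```python
from bs4 import BeautifulSoup

# Assumed markup: one <a> directly inside each div.
txt = """
<div class="col-s-12">
  <a>Product name</a>
</div>
<div class="search-page-text clearfix row">
  <a>Product info</a>
</div>
"""
soup = BeautifulSoup(txt, "html.parser")

# Each selector targets exactly one kind of div.
name_tag = soup.select_one("div.col-s-12 > a")
info_tag = soup.select_one("div.search-page-text.clearfix.row > a")

# select_one() returns None when nothing matches, so guard before reading text.
prod_name = name_tag.get_text(strip=True) if name_tag else None
prod_info = info_tag.get_text(strip=True) if info_tag else None

print(prod_name)
print(prod_info)
```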

How to get some class value in soup.findAll python 3.2

How can I match several class values given in one string?
<div class="col-md-9 bt-product-main-info"></div>
I'm using
soup.findAll(match_class("col-lg-3 col-md-4 col-sm-6 bt-product-list"))
But it's not working.
Thank You.
Given the following HTML text:
from bs4 import BeautifulSoup
text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""
soup = BeautifulSoup(text, 'html.parser')
If you want only the elements whose class attribute exactly matches a string, for example: col-md-9 bt-product-main-info, then do:
soup.find_all('div', class_='col-md-9 bt-product-main-info')
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>]
If you want the elements that match any of several class names, for example: col-md-9 or bt-product-main-info, then do:
soup.find_all('div', class_=['col-md-9', 'bt-product-main-info'])
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>,
<div class="col-md-9">asdas</div>]
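A third case is not covered by the two calls above: elements that carry all of the given classes, in any order. The exact-string match depends on the order in which the classes are written in the attribute, whereas a CSS selector with chained class names does not. The sketch below adds one div with the classes reversed (an assumption, not part of the original example) to show the difference.

```python
from bs4 import BeautifulSoup

text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="bt-product-main-info col-md-9">klm</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""
soup = BeautifulSoup(text, "html.parser")

# Exact match on the attribute string: class order matters.
exact = soup.find_all("div", class_="col-md-9 bt-product-main-info")

# Chained class selectors: both classes required, in any order.
both = soup.select("div.col-md-9.bt-product-main-info")

print([div.get_text() for div in exact])
print([div.get_text() for div in both])
```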

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from one span tag among several span tags with similar names. I am using Python and BeautifulSoup to parse the website.
I just cannot uniquely identify the specific gross-amount span element.
The span tag has name=nv and a data-value attribute, but the other one has those too. I just want to extract the gross dollar figure in millions.
Please advise.
This is the structure:
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the span that follows the span with class="text-muted" and text "Gross:".
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
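If both figures are needed, the same label-then-value idea can be applied to every "text-muted" span in the paragraph. The following is only a sketch: the dictionary name `stats` and the choice to strip the trailing colon from each label are illustrative. The `name` attribute is passed through `attrs` because the first positional argument of the find methods is the tag name.

```python
from bs4 import BeautifulSoup

html = """<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>"""
soup = BeautifulSoup(html, "html.parser")

stats = {}
for label in soup.select("p.sort-num_votes-visible span.text-muted"):
    # The value is the next sibling span carrying name="nv".
    value = label.find_next_sibling("span", attrs={"name": "nv"})
    if value is not None:
        key = label.get_text(strip=True).rstrip(":")
        # data-value may contain thousands separators, so remove them first.
        stats[key] = int(value["data-value"].replace(",", ""))

print(stats)
```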
