Python 3 BeautifulSoup 4: extracting text from multiple container siblings

I am new to Python and am trying to use BeautifulSoup to extract only the text from a group of tags. The first tag has the class 'name' and the second has the class 'date'. I can grab the text from either name or date, just not both together. Here is the HTML of the page I am trying to scrape:
<div class="results">
<h1>
Info Records
</h1>
<div class="group">
<a class="name" href="https://" target="_blank">
Firstname, Lastname
</a>
<br/>
<span class="date">
8/24/2020: Text info
</span>
</div>
<div class="group">
<a class="name" href="https://" target="_blank">
Different Firstname, Different Lastname
</a>
<br/>
<span class="date">
8/23/2020: Different Text Info
</span>
</div>
For the names I use this code, which pulls the names and prints them to the terminal. For the dates I change the class name to 'date'.
for arrest in soup.find_all('a', {'class': 'name'}):
    name = arrest.text
    print(name)
The HTML has about 20 names and dates; I only posted the first two. When I try to print both classes together, it doesn't work.
test = soup.find_all("div", {"class": ["name", "date"]})
print(test)
Also, the part that does work doesn't write to a text file. Ideally, I want something like this added to an output file:
firstname lastname
8/24/2020 Text info
firstname lastname
8/23/2020 different text info
Any advice would be helpful. I've been reading all day trying to figure it out.

Option 1: use a CSS selector.
Option 2: use zip()
1:
from bs4 import BeautifulSoup
html = """YOUR ABOVE HTML SNIPPET"""
soup = BeautifulSoup(html, "html.parser")
with open("output.txt", "w") as f:
    # select every element with the `name` or `date` class, in document order
    for tags in soup.select(".name, .date"):
        f.write(tags.text.strip() + "\n")
2:
with open("output.txt", "w") as f:
    # pair the n-th name link with the n-th date span
    for name, date in zip(
        soup.find_all("a", {"class": "name"}),
        soup.find_all("span", {"class": "date"}),
    ):
        f.write(name.text.strip() + "\n")
        f.write(date.text.strip() + "\n")
output.txt:
Firstname, Lastname
8/24/2020: Text info
Different Firstname, Different Lastname
8/23/2020: Different Text Info
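Both options above assume every group has exactly one name and one date. If that might not hold, a third approach is to loop over each "group" container and look up the two children inside it, so a name never gets paired with the wrong date. This is only a sketch: the second group below deliberately has no date (an assumption made up for illustration), and it prints instead of writing to output.txt.

```python
from bs4 import BeautifulSoup

html = """
<div class="results">
  <div class="group">
    <a class="name" href="https://" target="_blank">Firstname, Lastname</a>
    <br/>
    <span class="date">8/24/2020: Text info</span>
  </div>
  <div class="group">
    <a class="name" href="https://" target="_blank">Different Firstname, Different Lastname</a>
    <br/>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for group in soup.select("div.group"):
    # Look up both tags inside this container only
    name = group.select_one("a.name")
    date = group.select_one("span.date")
    records.append((
        name.get_text(strip=True) if name else "",
        date.get_text(strip=True) if date else "",
    ))

for name, date in records:
    print(name)
    print(date)
```

With zip(), a missing date in one group would shift every later date up by one; looping per container keeps each pair together at the cost of an extra lookup per group.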

Related

find tags contain partial strings in text

<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
soup.findAll('div', text = re.compile('456'))
Only returns div b, no others.
soup.findAll('div', text = re.compile('45'))
Only returns div b, no others.
How can I return the other divs that partially match the string?
The answer to your question is very similar to this answer; all you have to do is tweak the lambda function a bit. Here is the full code:
from bs4 import BeautifulSoup
html = '''
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
'''
soup = BeautifulSoup(html,'html5lib')
divs = soup.find_all("div", text=lambda text: text and '456' in text)
Output:
>>> divs
[<div id="a">123 456</div>, <div id="b">456</div>, <div id="c">123 456 789</div>]
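One possible reason a text filter skips a div is nested markup: the filter is checked against the tag's .string, which is None as soon as the tag has more than one child. The sketch below assumes that situation (the <b> inside div a is invented for illustration, it is not in the question) and compares a string filter with a function that checks get_text() instead. The helper name div_contains_456 is hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="a">123 <b>456</b></div>
<div id="b">456</div>
<div id="c">123 456 789</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The string filter is tested against tag.string, which is None for div a
by_string = soup.find_all("div", string=re.compile("456"))

def div_contains_456(tag):
    # get_text() joins the text of all descendants, nested tags included
    return tag.name == "div" and "456" in tag.get_text()

by_text = soup.find_all(div_contains_456)

print([d["id"] for d in by_string])  # ['b', 'c']
print([d["id"] for d in by_text])    # ['a', 'b', 'c']
```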

How to get data from a tag if it's present in HTML else Empty String if the tag is not present in web scraping Python

Picture contains HTML code for the situation
case 1:
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
case 2:
<li>
<a> some text: </a>
</li>
I want to scrape the identifier value if it is present; otherwise I want an empty string for that case.
I am using Scrapy, but help with BeautifulSoup would be just as welcome.
It's a little unclear what you want exactly, because your screenshot differs slightly from the example in your question. I assume you want to search for the text "some text:" and then get the following value inside <strong> (or an empty string if there isn't one):
from bs4 import BeautifulSoup
txt = '''
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
<li>
<a> some text: </a>
</li>
'''
soup = BeautifulSoup(txt, 'html.parser')
for t in soup.find_all(lambda t: t.contents[0].strip() == 'some text:'):
    identifier = t.parent.find('strong')
    identifier = identifier.get_text(strip=True) if identifier else ''
    print('Found:', identifier)
Prints:
Found: 'identifier:''random words'
Found:
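If the label text does not matter and you simply want one value per <li>, a simpler variant is to loop over the list items and look for <strong> in each. This is a sketch under that assumption, not a replacement for the answer above.

```python
from bs4 import BeautifulSoup

txt = """
<li><a> some text: </a><strong> 'identifier:''random words' </strong></li>
<li><a> some text: </a></li>
"""

soup = BeautifulSoup(txt, "html.parser")

identifiers = []
for li in soup.find_all("li"):
    strong = li.find("strong")
    # find() returns None when the tag is absent, so fall back to ''
    identifiers.append(strong.get_text(strip=True) if strong else "")

print(identifiers)
```

The list ends up with one entry per <li>, which keeps it aligned with any other per-item fields you scrape.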

Get text after dynamic class with title text Python/bs4

The position of the "label" element with the text "Owner 1 Name" changes from record to record, so indexing by class name isn't reliable. I'm trying to grab the name "Joe Smith" that follows that label. Some records have "Company Name" first.
<div>
<div class="label">Owner 1 Name</div>
<div class="value">
<div>Joe Smith</div>
</div>
<div>
<div class="label">Company Name</div>
<div class="value">
<div>ACME CO</div>
</div>
There are roughly ten "label" elements in a row like the ones above.
"Owner 1 Name" ends up in a different position every time; I just need the name value for each record.
Try it this way:
from bs4 import BeautifulSoup as bs
company = """your html above"""
soup = bs(company, 'lxml')
# :contains() is a non-standard selector; newer soupsieve releases name it :-soup-contains()
target = soup.select('div[class="label"]:contains("Owner")+div>div')
print(target[0].text)
Output:
Joe Smith
This did the trick:
target = soup.find("div", text="Owner 1 Name")
print(target.find_next_sibling("div").get_text())
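Since there are about ten label/value pairs per record, it may be easier to collect them all into a dict once and then look up whichever label you need. A minimal sketch, assuming each "label" div is followed by a sibling "value" div as in the question (the closing tags below are added so the snippet is well-formed):

```python
from bs4 import BeautifulSoup

html = """
<div>
  <div class="label">Company Name</div>
  <div class="value"><div>ACME CO</div></div>
</div>
<div>
  <div class="label">Owner 1 Name</div>
  <div class="value"><div>Joe Smith</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

fields = {}
for label in soup.select("div.label"):
    # The value is the next sibling div with class "value"
    value = label.find_next_sibling("div", class_="value")
    if value is not None:
        fields[label.get_text(strip=True)] = value.get_text(strip=True)

print(fields.get("Owner 1 Name", ""))  # Joe Smith
```

The lookup no longer depends on where "Owner 1 Name" appears in the record, and a missing label yields an empty string instead of an IndexError.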

How to get some class value in soup.findAll python 3.2

How to get some class values in one string
<div class="col-md-9 bt-product-main-info"></div>
I'm using
soup.findAll(match_class("col-lg-3 col-md-4 col-sm-6 bt-product-list"))
But it's not working.
Thank You.
Given the following HTML text:
text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""
If you want only the elements whose class attribute matches exactly, for example col-md-9 bt-product-main-info, then do:
soup.find_all('div', class_ = 'col-md-9 bt-product-main-info')
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>]
If you want elements that match any of several class names, for example col-md-9 or bt-product-main-info, then do:
soup.find_all('div', class_ = ['col-md-9', 'bt-product-main-info'])
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>,
<div class="col-md-9">asdas</div>]
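Note that the exact match compares the whole class string, so it depends on the order in which the classes are written. If you need elements that carry both classes in any order, a CSS selector is one way to do it. In this sketch the first div has its classes swapped on purpose to show the difference:

```python
from bs4 import BeautifulSoup

text = """
<div class="bt-product-main-info col-md-9">hij</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""

soup = BeautifulSoup(text, "html.parser")

# Exact string match: fails here because the classes are in a different order
exact = soup.find_all("div", class_="col-md-9 bt-product-main-info")

# CSS selector: requires both classes, in any order
both = soup.select("div.col-md-9.bt-product-main-info")

print(exact)                         # []
print([d.get_text() for d in both])  # ['hij']
```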

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from one span among several span tags with the same name, using Python and BeautifulSoup to parse the site.
I just cannot uniquely identify the gross-amount span element.
That span has name=nv and a data-value attribute, but so does the votes span. I just want to extract the gross dollar figure in millions.
Please advise.
This is the structure:
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the span that follows the text-muted span containing "Gross:".
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
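If you want the votes as well as the gross, the same idea extends to every label in the paragraph. The sketch below pairs each text-muted label with the next sibling span that has name="nv" and stores the integer values in a dict. It assumes every label has its own value span; if one were missing, the lookup would pick up the next label's value instead.

```python
from bs4 import BeautifulSoup

html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''

soup = BeautifulSoup(html, "html.parser")

stats = {}
for label in soup.select("p.sort-num_votes-visible span.text-muted"):
    # name="nv" must go through attrs, because `name` is the tag-name argument
    value = label.find_next_sibling("span", attrs={"name": "nv"})
    if value is not None:
        key = label.get_text(strip=True).rstrip(":")
        stats[key] = int(value["data-value"].replace(",", ""))

print(stats)  # {'Votes': 93122, 'Gross': 69645701}
```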
