Selecting proper div class with BeautifulSoup - python-3.x

I have an HTML document with three types of div classes:
<div class="message">
<div class="message message__current">
<div class="message message__current message--grouped">
When I do
all_messages_2 = soup.find_all("div", class_="message message__current")
it selects only the second type of div.
But then when I want to select only type 1 and I do
all_messages_1 = soup.find_all("div", class_="message")
it selects all 3 types of div.
Could you help, please?

Use a lambda to select each div whose class attribute matches exactly what you want. tag.get('class') returns the classes as a list, so comparing it against a list gives an exact match.
from bs4 import BeautifulSoup
html = """
<div class="message">
<div class="message message__current">
<div class="message message__current message--grouped">
"""
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['message'])
print (len(tags))
tags = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['message', 'message__current'])
print (len(tags))
tags = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['message', 'message__current', 'message--grouped'])
print (len(tags))
Outputs:
1
1
1
Note that because the example HTML has no closing tags, the parser treats every div as closing at the end of the document, so the divs end up nested. Selecting the text of the first div will therefore also return the text of the other two, and selecting the text of the second will also return the text of the third.
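A minimal sketch of the same exact-match idea with properly closed divs, so each tag holds only its own text. The helper name divs_with_exact_classes and the sample text inside the divs are assumptions made for illustration; the CSS attribute selector on the last line is an alternative that compares the whole class attribute as one string.

```python
from bs4 import BeautifulSoup

# Sample markup (hypothetical text content), with every div closed.
html = """
<div class="message">one</div>
<div class="message message__current">two</div>
<div class="message message__current message--grouped">three</div>
"""
soup = BeautifulSoup(html, "html.parser")


def divs_with_exact_classes(soup, *classes):
    """Return <div> tags whose class list equals `classes`, in the same order."""
    wanted = list(classes)
    return soup.find_all(lambda tag: tag.name == "div" and tag.get("class") == wanted)


only_message = divs_with_exact_classes(soup, "message")
current = divs_with_exact_classes(soup, "message", "message__current")

# Alternative: match the full class attribute value with a CSS attribute selector.
exact_attr = soup.select('div[class="message"]')

print([d.get_text() for d in only_message])
print([d.get_text() for d in current])
print([d.get_text() for d in exact_attr])
```

Both approaches depend on the order of the classes in the attribute, so "message__current message" would not match a search for "message message__current".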

Related

Find tags that contain a partial string in their text

<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
soup.findAll('div', text = re.compile('456'))
Only returns div b, no others.
soup.findAll('div', text = re.compile('45'))
Only returns div b, no others.
How can I return the other divs that partially match the string?
The answer to your question is very similar to this answer. All you have to do is tweak the lambda function a bit. Here is the full code:
from bs4 import BeautifulSoup
html = '''
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
'''
soup = BeautifulSoup(html,'html5lib')
divs = soup.find_all("div", text=lambda text: text and '456' in text)
Output:
>>> divs
[<div id="a">123 456</div>, <div id="b">456</div>, <div id="c">123 456 789</div>]
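One caveat worth sketching: a string filter is checked against a tag's .string, which is None when the tag has more than one child, so a div with nested markup can be missed. The example below is an assumption-laden illustration (div "c" is modified to contain a hypothetical <b> tag, and string= is the newer name for the text= argument); it compares the string filter with a lambda that checks the full get_text() instead.

```python
import re
from bs4 import BeautifulSoup

# Same divs as above, except that "c" now contains a nested <b> tag (hypothetical).
html = '''
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 <b>456</b> 789</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Matches only divs whose single text child contains "456".
by_string = soup.find_all("div", string=re.compile("456"))

# Matches any div whose combined text (including nested tags) contains "456".
by_text = soup.find_all(lambda tag: tag.name == "div" and "456" in tag.get_text())

print([d["id"] for d in by_string])
print([d["id"] for d in by_text])
```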

How to get data from a tag if it's present in HTML else Empty String if the tag is not present in web scraping Python

Picture contains HTML code for the situation
case 1:
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
case 2:
<li>
<a> some text: </a>
</li>
I want to scrape the identifier value if it is present; otherwise I want to store an empty string for that case.
I am using Scrapy, but help with BeautifulSoup would also be much appreciated.
It's a little unclear what exactly you want, because your screenshot is a little different from the example in your question. I suppose you want to find the text "some text:" and then get the value inside the following <strong> (or an empty string if there isn't one):
from bs4 import BeautifulSoup
txt = '''
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
<li>
<a> some text: </a>
</li>
'''
soup = BeautifulSoup(txt, 'html.parser')
for t in soup.find_all(lambda t: t.contents[0].strip() == 'some text:'):
    identifier = t.parent.find('strong')
    identifier = identifier.get_text(strip=True) if identifier else ''
    print('Found:', identifier)
Prints:
Found: 'identifier:''random words'
Found:
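The same "value or empty string" pattern can also be sketched per <li>, which avoids relying on the first child of every tag being a string. The function name identifiers is a made-up helper for illustration, and the sample markup is the one from the question.

```python
from bs4 import BeautifulSoup

sample = '''
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
<li>
<a> some text: </a>
</li>
'''


def identifiers(html):
    """Return one entry per <li>: the <strong> text, or '' when there is none."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for li in soup.find_all("li"):
        strong = li.find("strong")  # None when the tag is missing
        results.append(strong.get_text(strip=True) if strong else "")
    return results


values = identifiers(sample)
print(values)
```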

Python - Web Scraping :How to access div tag of 1 class when I am getting data for div tags for multiple classes

I want div tags of two different classes in my result.
I am using the following command to scrape the data:
'''
result = soup.select('div', {'class' : ['col-s-12', 'search-page-text clearfix row'] })
'''
I have one set of information in class 'col-s-12' and another set of information in class 'search-page-text clearfix row'.
I want to find the children of only the div tag with class 'col-s-12'. When I run the code below, it looks for the children of both div tags, since I have not specified anywhere which class I want to search:
'''
for div in result:
    print(div)
    prod_name = div.find("a" , recursive=False)[0] #should come from 'col-s-12' only
    prod_info = div.find("a" , recursive=False)[0] # should come from 'search-page-text clearfix row' only
'''
Example -
'''
<div class = 'col-s-12'>
This is what I want or variable **prod_name**
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>
'''
You can search for the first <a> tag under the tag with class="col-s-12" and then use .find_next('a') to find the next <a> tag.
Note: .select() method accepts only CSS selectors, not dictionaries.
For example:
txt = '''<div class = 'col-s-12'>
<a> This is what I want or variable **prod_name** </a>
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
prod_name = soup.select_one('.col-s-12 > a')
prod_info = prod_name.find_next('a')
print(prod_name.get_text(strip=True))
print(prod_info.get_text(strip=True))
Prints:
This is what I want or variable **prod_name**
This should be stored in variable **prod_info**
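Since .select() takes CSS selectors, each class can also be queried on its own. The following is a rough sketch under the assumption that every block has a direct <a> child; the sample markup and the names names, infos and pairs are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two name/info blocks.
html = """
<div class="col-s-12"><a>name 1</a></div>
<div class="search-page-text clearfix row"><a>info 1</a></div>
<div class="col-s-12"><a>name 2</a></div>
<div class="search-page-text clearfix row"><a>info 2</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# One selector per class: only direct <a> children of each kind of div.
names = [a.get_text(strip=True) for a in soup.select("div.col-s-12 > a")]

# Chained class selectors require all three classes on the same div.
infos = [
    a.get_text(strip=True)
    for a in soup.select("div.search-page-text.clearfix.row > a")
]

pairs = list(zip(names, infos))
print(pairs)
```

Pairing with zip assumes the two kinds of div always alternate; if some blocks can be missing, find_next('a') from each name, as in the answer above, may be the safer choice.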

Extract content wherever a header tag is followed by a div tag using BeautifulSoup

I am trying to extract div tags and header tags when they are together.
ex:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried the solution from the question linked below, but there the header tag is inside the div tag, whereas I need the div tag that comes after the header tag:
Scraping text in h3 and div tags using beautifulSoup, Python
I also tried something like this, but it did not work:
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup.find_all({"h3", "div"})
I need the content of the h3 tag and all the content inside the div tag wherever this combination occurs.
You could use the CSS selector h3:has(+div), which selects every <h3> that has a <div> immediately after it:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is P tag, not DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
    print('Header:')
    print(h3.text)
    print('Next <div>:')
    print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
Further reading:
CSS Selectors reference
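If the :has() selector is not available, the same pairing can be sketched with plain navigation: take each <h3>, look at its next sibling tag, and keep the pair only when that sibling is a <div>. The function name header_div_pairs is an assumption made for this example.

```python
from bs4 import BeautifulSoup

data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is P tag, not DIV</p>
'''


def header_div_pairs(html):
    """Return (h3 text, div text) for every <h3> directly followed by a <div>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for h3 in soup.find_all("h3"):
        # With no arguments this returns the next sibling *tag*,
        # skipping whitespace-only text nodes in between.
        nxt = h3.find_next_sibling()
        if nxt is not None and nxt.name == "div":
            pairs.append((h3.get_text(strip=True), nxt.get_text(" ", strip=True)))
    return pairs


pairs = header_div_pairs(data)
print(pairs)
```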

How to fix missing ul tags in html list snippet with Python and Beautiful Soup

If I have a snippet of html like this:
<p><br><p>
<li>stuff</li>
<li>stuff</li>
Is there a way to clean this and add the missing ul/ol tags using beautiful soup, or another python library?
I tried soup.prettify() but it left the snippet as is.
It doesn't seem like there's a built-in method which wraps groups of li elements into an ul. However, you can simply loop over the li elements, identify the first element of each li group and wrap it in ul tags. The next elements in the group are appended to the previously created ul:
from bs4 import BeautifulSoup
# `html` holds the input markup (see the example input further down)
soup = BeautifulSoup(html, "html.parser")
ulgroup = 0
uls = []
for li in soup.findAll('li'):
    previous_element = li.findPrevious()
    # if <li> already wrapped in <ul>, do nothing
    if previous_element and previous_element.name == 'ul':
        continue
    # if <li> is the first element of a <li> group, wrap it in a new <ul>
    if not previous_element or previous_element.name != 'li':
        ulgroup += 1
        ul = soup.new_tag("ul")
        li.wrap(ul)
        uls.append(ul)
    # append rest of <li> group to previously created <ul>
    elif ulgroup > 0:
        uls[ulgroup-1].append(li)
print(soup.prettify())
For example, the following input:
html = '''
<p><br><p>
<li>stuff1</li>
<li>stuff2</li>
<div></div>
<li>stuff3</li>
<li>stuff4</li>
<li>stuff5</li>
'''
outputs:
<p>
 <br/>
 <p>
  <ul>
   <li>
    stuff1
   </li>
   <li>
    stuff2
   </li>
  </ul>
  <div>
  </div>
  <ul>
   <li>
    stuff3
   </li>
   <li>
    stuff4
   </li>
   <li>
    stuff5
   </li>
  </ul>
 </p>
</p>
Demo: https://repl.it/#glhr/55619920-fixing-uls
First, you have to decide which parser you are going to use, because different parsers treat malformed HTML differently.
The following BeautifulSoup methods will help you accomplish what you require:
new_tag() - creates a new ul tag.
append() - appends the newly created ul tag somewhere in the soup tree.
extract() - extracts the li tags one by one (so that they can be appended to the ul tag).
decompose() - removes any unwanted tags from the tree, which may be formed as a result of the parser's interpretation of the malformed HTML.
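To see how much the parser choice matters, a quick comparison can be sketched as below. It assumes that lxml and html5lib may or may not be installed, so any missing parser is skipped; the variable names snippet and parsed are made up for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

snippet = "<p><br><p>\n<li>stuff</li>\n<li>stuff</li>\n"

parsed = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        parsed[parser] = str(BeautifulSoup(snippet, parser))
    except FeatureNotFound:
        # This parser is not installed in the current environment; skip it.
        continue

# Print each parser's interpretation of the same malformed snippet.
for parser, markup in parsed.items():
    print(parser, "->", repr(markup))
```

Comparing the printed trees shows where each parser decides to close the <p> tags and whether it adds <html> and <body> wrappers, which in turn determines which cleanup steps are needed.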
My Solution
Let's create a soup object using html5lib parser and see what we get
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
print(soup)
Outputs:
<html><head></head><body><p><br/></p><p>
</p><li>stuff</li>
<li>stuff</li>
</body></html>
The next step depends on what you want to accomplish. Here I remove the second, empty p, add a new ul tag, and move all the li tags inside it.
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
second_p=soup.find_all('p')[1]
second_p.decompose()
ul_tag=soup.new_tag('ul')
soup.find('body').append(ul_tag)
for li_tag in soup.find_all('li'):
    ul_tag.append(li_tag.extract())
print(soup.prettify())
Outputs:
<html>
 <head>
 </head>
 <body>
  <p>
   <br/>
  </p>
  <ul>
   <li>
    stuff
   </li>
   <li>
    stuff
   </li>
  </ul>
 </body>
</html>
