How to find a tag without a specific attribute using BeautifulSoup? - python-3.x

I'm trying to get the content of the 'p' tags that don't have a specific attribute.
I have some tags with 'class'='cost', and some tags with both 'class'='cost' and 'itemprop'='price':
all_cars = soup.find_all('div', attrs={'class': 'listdata'})
...
...
total_cost= car.findChildren('p', attrs={'class': 'cost'})
cost= car.findChildren('p', attrs={'class': 'cost', 'itemprop':'price'})
I am trying to find 'p' tags without the 'itemprop' attribute, but I can't find any solution.

BeautifulSoup's built-in attribute filters are enough for this. Pass True as the value to simply check that the attribute is present. Pass None to require that the attribute is absent. Otherwise, the value can be any attribute value to match (e.g. 'cost').
from bs4 import BeautifulSoup
html="""
<p class="cost">paragraph 1</p>
<p class="cost">paragraph 2</p>
<p class="cost">paragraph 3</p>
<p class="cost" itemprop="1">paragraph 4</p>
<p class="somethingelse">paragraph 5</p>
"""
soup=BeautifulSoup(html,'html.parser')
print("---without 'itemprop' attribute")
print(soup.find_all('p',itemprop=None))
print("---with class = 'cost' and without 'itemprop' attribute----")
print(soup.find_all('p',attrs={'itemprop':None,"class":'cost'}))
#below is an alternative way to specify this
#print(soup.find_all('p',itemprop=None,class_='cost'))
Output
---without 'itemprop' attribute
[<p class="cost">paragraph 1</p>, <p class="cost">paragraph 2</p>, <p class="cost">paragraph 3</p>, <p class="somethingelse">paragraph 5</p>]
---with class = 'cost' and without 'itemprop' attribute----
[<p class="cost">paragraph 1</p>, <p class="cost">paragraph 2</p>, <p class="cost">paragraph 3</p>]
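A CSS selector is another way to express the same condition. The following is only a minimal sketch on a shortened copy of the sample HTML; it assumes a bs4 version where select() supports :not() and attribute selectors (the soupsieve package that recent bs4 releases install).

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for illustration)
html = """
<p class="cost">paragraph 1</p>
<p class="cost" itemprop="1">paragraph 4</p>
<p class="somethingelse">paragraph 5</p>
"""
soup = BeautifulSoup(html, "html.parser")

# p.cost            -> <p> tags whose class list contains "cost"
# :not([itemprop])  -> skip tags that carry an itemprop attribute
matches = soup.select("p.cost:not([itemprop])")
texts = [p.get_text() for p in matches]
print(texts)
```

Only the first paragraph satisfies both conditions here.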

BeautifulSoup lets you define a function and pass it into its find_all() method:
def has_class_but_not_itemprop(tag):
    return tag.has_attr('class') and not tag.has_attr('itemprop')
# Pass this function into find_all() and you’ll pick up all the <p>
# tags you're after:
soup.find_all(has_class_but_not_itemprop)
# [<p class="cost">...</p>,
# <p class="cost">...</p>,
# <p class="cost">...</p>]
For more information, see the BeautifulSoup documentation.
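Note that a filter function that only checks for a class attribute will match any tag that has one, whatever the tag name or class value. A stricter variant could look like the sketch below; the function name and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration)
html = """
<p class="cost">paragraph 1</p>
<p class="cost" itemprop="price">paragraph 2</p>
<p class="somethingelse">paragraph 3</p>
<div class="cost">not a paragraph</div>
"""
soup = BeautifulSoup(html, "html.parser")

def is_cost_p_without_itemprop(tag):
    # Hypothetical helper: require a <p>, the "cost" class, and no itemprop.
    return (
        tag.name == "p"
        and "cost" in tag.get("class", [])  # class is exposed as a list
        and not tag.has_attr("itemprop")
    )

result = soup.find_all(is_cost_p_without_itemprop)
print(result)
```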

Related

How to get some class value in soup.findAll python 3.2

How to get some class values in one string
<div class="col-md-9 bt-product-main-info"></div>
I'm using
soup.findAll(match_class("col-lg-3 col-md-4 col-sm-6 bt-product-list"))
But it's not working.
Thank You.
Given the following HTML text:
from bs4 import BeautifulSoup
text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""
soup = BeautifulSoup(text, 'html.parser')
If you want only the elements whose class attribute matches the exact string, for example: col-md-9 bt-product-main-info, then do:
soup.find_all('div', class_ = 'col-md-9 bt-product-main-info')
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>]
If you want the elements that have any of the given class names, for example: col-md-9 or bt-product-main-info, then do:
soup.find_all('div', class_ = ['col-md-9', 'bt-product-main-info'])
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>,
<div class="col-md-9">asdas</div>]
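One caveat: the exact-string form depends on the order of the class names in the markup. If you need elements that carry both classes in any order, a CSS selector is one option. The sketch below adds a reordered div to the sample text for illustration and assumes select() is available in your bs4 install.

```python
from bs4 import BeautifulSoup

# The second <div> is an added example with the class names reversed
text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="bt-product-main-info col-md-9">reordered</div>
<div class="col-md-9">asdas</div>
"""
soup = BeautifulSoup(text, "html.parser")

# Exact string match: only the div whose class attribute is written this way
exact = [d.get_text() for d in soup.find_all("div", class_="col-md-9 bt-product-main-info")]

# CSS selector: divs that have both classes, regardless of order
both = [d.get_text() for d in soup.select("div.col-md-9.bt-product-main-info")]

print(exact)
print(both)
```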

Why does attribute splitting happen in BeautifulSoup?

I try to get the attribute of the parent element:
<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
print(span_autogoal.find_parent('div')['class'])
# print(span_autogoal.find_parent('div').get('class'))
Output:
<span class="note-name">(Autogoal)</span>
['detailMS__incidentRow', 'incidentRow--away', 'odd']
I know I can do something like this:
print(' '.join(span_autogoal.find_parent('div')['class']))
But I want to know why this happens, and whether there is a more correct way to do it.
The other answer is correct. However, if you want a multi-valued attribute returned as a single string, try re-parsing the parent element with the XML parser once you have found it.
from bs4 import BeautifulSoup
data='''<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>'''
soup=BeautifulSoup(data,'lxml')
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
parentdiv=span_autogoal.find_parent('div')
data=str(parentdiv)
soup=BeautifulSoup(data,'xml')
print(soup.div['class'])
Output on console:
<span class="note-name">(Autogoal)</span>
detailMS__incidentRow incidentRow--away odd
According to the BeautifulSoup documentation:
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is class (that is, a tag can have more than one
CSS class). Others include rel, rev, accept-charset, headers, and
accesskey. Beautiful Soup presents the value(s) of a multi-valued
attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
So in your case, the class attribute in <div class="detailMS__incidentRow incidentRow--away odd"> is multi-valued.
That's why span_autogoal.find_parent('div')['class'] gives you a list as output.
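If you would rather get the attribute back as one string without re-parsing, two options are sketched below: join the list yourself, or turn the splitting off with the multi_valued_attributes=None constructor argument. The second option assumes a reasonably recent bs4 release that accepts that argument; the one-line markup is a trimmed copy of the question's div.

```python
from bs4 import BeautifulSoup

data = '<div class="detailMS__incidentRow incidentRow--away odd"></div>'

# Default behaviour: class is split on whitespace into a list
soup = BeautifulSoup(data, "html.parser")
as_list = soup.div["class"]
joined = " ".join(as_list)

# With multi_valued_attributes=None, attribute values stay plain strings
plain_soup = BeautifulSoup(data, "html.parser", multi_valued_attributes=None)
as_string = plain_soup.div["class"]

print(as_list)
print(joined)
print(as_string)
```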

How to fix missing ul tags in html list snippet with Python and Beautiful Soup

If I have a snippet of html like this:
<p><br><p>
<li>stuff</li>
<li>stuff</li>
Is there a way to clean this up and add the missing ul/ol tags using Beautiful Soup, or another Python library?
I tried soup.prettify() but it left the markup as is.
It doesn't seem like there's a built-in method which wraps groups of li elements into an ul. However, you can simply loop over the li elements, identify the first element of each li group and wrap it in ul tags. The next elements in the group are appended to the previously created ul:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
ulgroup = 0
uls = []
for li in soup.findAll('li'):
    previous_element = li.findPrevious()
    # if <li> already wrapped in <ul>, do nothing
    if previous_element and previous_element.name == 'ul':
        continue
    # if <li> is the first element of a <li> group, wrap it in a new <ul>
    if not previous_element or previous_element.name != 'li':
        ulgroup += 1
        ul = soup.new_tag("ul")
        li.wrap(ul)
        uls.append(ul)
    # append rest of <li> group to previously created <ul>
    elif ulgroup > 0:
        uls[ulgroup-1].append(li)
print(soup.prettify())
For example, the following input:
html = '''
<p><br><p>
<li>stuff1</li>
<li>stuff2</li>
<div></div>
<li>stuff3</li>
<li>stuff4</li>
<li>stuff5</li>
'''
outputs:
<p>
<br/>
<p>
<ul>
<li>
stuff1
</li>
<li>
stuff2
</li>
</ul>
<div>
</div>
<ul>
<li>
stuff3
</li>
<li>
stuff4
</li>
<li>
stuff5
</li>
</ul>
</p>
</p>
Demo: https://repl.it/#glhr/55619920-fixing-uls
First, you have to decide which parser you are going to use. Different parsers treat malformed html differently.
The following BeautifulSoup methods will help you accomplish what you require:
new_tag() - creates a new ul tag
append() - appends the newly created ul tag somewhere in the soup tree
extract() - detaches the li tags one by one (so that we can append them to the ul tag)
decompose() - removes any unwanted tags from the tree, such as tags created by the parser's interpretation of the malformed HTML
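To see what each of these methods does in isolation, here is a small sketch on a well-formed fragment. The fragment and the stray span are made up for illustration, and html.parser is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two bare <li> tags plus an unwanted <span>
soup = BeautifulSoup("<div><span>junk</span><li>a</li><li>b</li></div>", "html.parser")

soup.span.decompose()              # remove the unwanted <span> from the tree
ul_tag = soup.new_tag("ul")        # create a new, detached <ul>
soup.div.append(ul_tag)            # attach it at the end of the <div>
for li_tag in soup.find_all("li"):
    ul_tag.append(li_tag.extract())  # detach each <li>, then add it to the <ul>

print(soup)
```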
My Solution
Let's create a soup object using html5lib parser and see what we get
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
print(soup)
Outputs:
<html><head></head><body><p><br/></p><p>
</p><li>stuff</li>
<li>stuff</li>
</body></html>
The next step may vary according to what you want to accomplish. Here, I remove the second (empty) p, add a new ul tag, and move all the li tags inside it.
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
second_p=soup.find_all('p')[1]
second_p.decompose()
ul_tag=soup.new_tag('ul')
soup.find('body').append(ul_tag)
for li_tag in soup.find_all('li'):
    ul_tag.append(li_tag.extract())
print(soup.prettify())
Outputs:
<html>
<head>
</head>
<body>
<p>
<br/>
</p>
<ul>
<li>
stuff
</li>
<li>
stuff
</li>
</ul>
</body>
</html>

How to get soup.find_all to work in BeautifulSoup?

I'm trying to scrape information from a page listing the names of attorneys using BeautifulSoup
#importing libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
Following is an example of each attorney's names that are nested in HTML tags
</a>
<div class="person-info search-person-info people-search-person-info">
<div class="col person-name-position">
<a href="https://www.foxrothschild.com/richard-s-caputo/">
Richard S. Caputo
</a>
I tried using the following script to extract the name of each of the attorneys using 'a' as the tag and "col person-name-position" as the class. But it does not seem to work. Instead it prints out an empty list.
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('a',class_='col person-name-position')
print(find_name)
You need to search for div in your soup.find_all call, since the class belongs to the div and not to the a:
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('div',class_='col person-name-position')
print(find_name)
class="col person-name-position" is an attribute of the div element, not of the a element inside it, so search for the div and then look up the a within each match:
find_name=soup.find_all('div',class_='col person-name-position')
for entry in find_name:
    a_element = entry.find("a")
    #...
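Putting that together on the snippet from the question, a minimal sketch could look like this. It works on a static copy of the HTML rather than the live page, and the closing div tags are added as an assumption, since the snippet in the question is cut off.

```python
from bs4 import BeautifulSoup

# Static copy of the question's snippet; closing </div> tags are assumed
html = """
<div class="person-info search-person-info people-search-person-info">
  <div class="col person-name-position">
    <a href="https://www.foxrothschild.com/richard-s-caputo/">
      Richard S. Caputo
    </a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

names = []
# Matching on one of the div's classes is enough to find it
for entry in soup.find_all("div", class_="person-name-position"):
    a_element = entry.find("a")
    if a_element is not None:
        # strip=True drops the whitespace around the name
        names.append(a_element.get_text(strip=True))

print(names)
```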

How to extract value from href in python?

Hi developers. I am facing a problem extracting an href value in Python.
There is a button on the page; clicking "View Answers" takes me to another link, and I want to extract the data available at that link.
<div class="col-md-11 col-xs-12">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<div class="hover-div">
<h2 itemprop="name">i need a good Orthopedic dr</h2>
</div>
</a>
<div class="thread-details">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<p class="pull-left"><span class="glyphicon glyphicon-comment"></span> View Answers (<span itemprop="answerCount">1</span>) </p>
</a>
</div>
</div>
I need to extract this href value.
You can use web scraping in Python.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen("Your URL WILL GO HERE").read()
soup = bs.BeautifulSoup(sauce,'html5lib')
print(soup)
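Once the page is parsed, the href itself can be read from the a tag like a dictionary key. The following is a rough sketch on a trimmed, static copy of the question's HTML; for a live page you would parse the downloaded content instead.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question
html = """
<div class="thread-details">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<p class="pull-left">View Answers (<span itemprop="answerCount">1</span>)</p>
</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the container first, then the first <a> inside it that has an href
details = soup.find("div", class_="thread-details")
link = details.find("a", href=True)
href = link["href"]

print(href)
```

The extracted URL can then be fetched and parsed in the same way to get the answers page.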
