How to get some class value in soup.findAll python 3.2 - python-3.x

How to get some class values in one string
<div class="col-md-9 bt-product-main-info"></div>
I'm using
soup.findAll(match_class("col-lg-3 col-md-4 col-sm-6 bt-product-list"))
But it's not working.
Thank You.

Given the following HTML text:
from bs4 import BeautifulSoup
text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""
soup = BeautifulSoup(text, 'html.parser')
If you want only the elements whose class attribute matches exactly, for example col-md-9 bt-product-main-info, then do:
soup.find_all('div', class_ = 'col-md-9 bt-product-main-info')
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>]
If you want elements that match any of several class names, for example col-md-9 or bt-product-main-info, pass a list:
soup.find_all('div', class_ = ['col-md-9', 'bt-product-main-info'])
The output will be:
[<div class="col-md-9 bt-product-main-info">hij</div>,
<div class="col-md-9">asdas</div>]
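One thing the exact-string form does not cover is class order: a string passed to class_ is compared against the attribute value as written. If the goal is "has both classes, in any order", a CSS selector is one way to express it. The sketch below adds a hypothetical fourth div with the classes swapped (not in the original question) to show the difference:

```python
from bs4 import BeautifulSoup

text = """
<div class="col-md-9 bt-product-main-info">hij</div>
<div class="bt-product-main-info col-md-9">swapped</div>
<div class="col-md-9">asdas</div>
<div class="bt-product-list">sdshij</div>
"""
soup = BeautifulSoup(text, "html.parser")

# Exact string match: only the div whose class attribute is written in this order
exact = soup.find_all("div", class_="col-md-9 bt-product-main-info")

# CSS selector: any div that carries both classes, regardless of order
both = soup.select("div.col-md-9.bt-product-main-info")

exact_texts = [d.get_text() for d in exact]
both_texts = [d.get_text() for d in both]
print(exact_texts)
print(both_texts)
```

Here exact_texts should contain only "hij", while both_texts should also pick up the swapped div.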

Related

find tags contain partial strings in text

<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
soup.findAll('div', text = re.compile('456'))
Only returns div b, no others.
soup.findAll('div', text = re.compile('45'))
Only returns div b, no others.
How can I return the other divs that also partially match the string?
The solution is to pass a small lambda function as the text filter. Here is the full code:
from bs4 import BeautifulSoup
html = '''
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
'''
soup = BeautifulSoup(html,'html5lib')
divs = soup.find_all("div", text=lambda text: text and '456' in text)
Output:
>>> divs
[<div id="a">123 456</div>, <div id="b">456</div>, <div id="c">123 456 789</div>]
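A caveat worth knowing: the text filter (spelled string in newer Beautiful Soup releases) is checked against a tag's .string, which is None when the tag has more than one child. The sketch below changes div c to contain a nested <b> tag (an assumption, not part of the original question) and compares the string filter with a function that tests get_text() instead:

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 <b>456</b> 789</div>
"""
soup = BeautifulSoup(html, "html.parser")

# string= only considers tags whose .string is set (a single text child)
by_string = soup.find_all("div", string=re.compile("456"))

# A function filter can look at all text inside the tag, nested or not
by_text = soup.find_all(lambda tag: tag.name == "div" and "456" in tag.get_text())

string_ids = [d["id"] for d in by_string]
text_ids = [d["id"] for d in by_text]
print(string_ids)
print(text_ids)
```

If some divs seem to be skipped by a text search, nested markup like this is a likely cause, and the get_text() variant may be the safer choice.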

Python3 Beautifulsoup4 extracting text from multiple container siblings

I am new to Python, and I am trying to use BeautifulSoup to extract only the text from a group of tags. The first tag is 'name' and the second is 'date'. I can grab the text from either name or date, just not both together. Here is the HTML of the page I am trying to scrape:
<div class="results">
<h1>
Info Records
</h1>
<div class="group">
<a class="name" href="https://" target="_blank">
Firstname, Lastname
</a>
<br/>
<span class="date">
8/24/2020: Text info
</span>
</div>
<div class="group">
<a class="name" href="https://" target="_blank">
Different Firstname, Different Lastname
</a>
<br/>
<span class="date">
8/23/2020: Different Text Info
</span>
</div>
For the names I use this code, which pulls the names and prints them to the terminal; for the dates I change the class name to 'date':
for arrest in soup.find_all('a', {'class': 'name'}):
    name = arrest.text
    print(name)
The HTML has about 20 names and dates; I only posted the first 2. When I try to print both classes together, it doesn't work:
test = soup.find_all("div", {"class": ["name", "date"]})
print(test)
Also, the code that does work doesn't write to a text file. Ideally, what I am trying to accomplish is to add something like this to an output file:
firstname lastname
8/24/2020 Text info
firstname lastname
8/23/2020 different text info
Any advice would be helpful. I've been reading all day trying to figure it out.
Option 1: use a CSS selector.
Option 2: use zip()
1:
from bs4 import BeautifulSoup
html = """YOUR ABOVE HTML SNIPPET"""
soup = BeautifulSoup(html, "html.parser")
with open("output.txt", "w") as f:
    # select `name` and `date` class
    for tags in soup.select(".name, .date"):
        f.write(tags.text.strip() + "\n")
2:
with open("output.txt", "w") as f:
    for name, date in zip(
        soup.find_all("a", {"class": "name"}), soup.find_all("span", {"class": "date"})
    ):
        f.write(name.text.strip() + "\n")
        f.write(date.text.strip() + "\n")
output.txt:
Firstname, Lastname
8/24/2020: Text info
Different Firstname, Different Lastname
8/23/2020: Different Text Info
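Both options rely on names and dates appearing in matching order. A third possibility, sketched here under the assumption that every record sits in its own div with class "group" as in the posted HTML, is to walk the groups and read the name and date from inside each one, so a missing field cannot shift the pairing:

```python
from bs4 import BeautifulSoup

html = """
<div class="results">
  <div class="group">
    <a class="name" href="https://">Firstname, Lastname</a><br/>
    <span class="date">8/24/2020: Text info</span>
  </div>
  <div class="group">
    <a class="name" href="https://">Different Firstname, Different Lastname</a><br/>
    <span class="date">8/23/2020: Different Text Info</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for group in soup.select("div.group"):
    name = group.select_one("a.name")
    date = group.select_one("span.date")
    # Fall back to an empty string if a group lacks one of the two tags
    records.append((
        name.get_text(strip=True) if name else "",
        date.get_text(strip=True) if date else "",
    ))

# Flatten to one field per line, the layout the question asks for
lines = [field for record in records for field in record]
print("\n".join(lines))
```

The lines list can be written to output.txt with the same with open(...) pattern used in the two options above.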

Get text after dynamic class with title text Python/bs4

The position of the div with class "label" and text "Owner 1 Name" changes from record to record, so indexing by class name isn't reliable. I'm trying to grab the name "Joe Smith" that follows that label. Some records have "Company Name" first.
<div>
<div class="label">Owner 1 Name</div>
<div class="value">
<div>Joe Smith</div>
</div>
<div>
<div class="label">Company Name</div>
<div class="value">
<div>ACME CO</div>
</div>
There are roughly ten "label" divs in a row like the code above.
The position of Owner 1 Name changes by record and ends up in a different location every time. I just need the name value for each record.
Try it this way:
company = """your html above"""
from bs4 import BeautifulSoup as bs
soup = bs(company,'lxml')
# :-soup-contains is the current spelling; older soupsieve releases call it :contains
target = soup.select('div[class="label"]:-soup-contains("Owner")+div>div')
print(target[0].text)
Output:
Joe Smith
This did the trick:
target = soup.find("div", text="Owner 1 Name")
print(target.find_next_sibling("div").get_text())
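If several of the label/value pairs are needed, the same sibling lookup can be run for every label and collected into a dictionary, which makes the position of each label irrelevant. This is only a sketch: the HTML below is a tidied-up version of the snippet in the question (closing tags added), and the fields name is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <div class="label">Company Name</div>
  <div class="value"><div>ACME CO</div></div>
</div>
<div>
  <div class="label">Owner 1 Name</div>
  <div class="value"><div>Joe Smith</div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

fields = {}
for label in soup.select("div.label"):
    # The value div is the next sibling div of each label
    value = label.find_next_sibling("div", class_="value")
    if value is not None:
        fields[label.get_text(strip=True)] = value.get_text(strip=True)

print(fields.get("Owner 1 Name"))
```

Looking up fields.get("Owner 1 Name") then works no matter where that label appears in the record.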

Why does attribute splitting happen in BeautifulSoup?

I am trying to get the class attribute of the parent element:
<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
print(span_autogoal.find_parent('div')['class'])
# print(span_autogoal.find_parent('div').get('class'))
Output:
<span class="note-name">(Autogoal)</span>
['detailMS__incidentRow', 'incidentRow--away', 'odd']
I know I can do something like this:
print(' '.join(span_autogoal.find_parent('div')['class']))
But I want to know why this is happening, and whether there is a more correct way to do this.
The other answer is correct. However, if you want a multi-valued attribute returned as a single string, try re-parsing the parent element with the XML parser.
from bs4 import BeautifulSoup
data='''<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>'''
soup=BeautifulSoup(data,'lxml')
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
parentdiv=span_autogoal.find_parent('div')
data=str(parentdiv)
soup=BeautifulSoup(data,'xml')
print(soup.div['class'])
Output on console:
<span class="note-name">(Autogoal)</span>
detailMS__incidentRow incidentRow--away odd
According to the BeautifulSoup documentation:
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is class (that is, a tag can have more than one
CSS class). Others include rel, rev, accept-charset, headers, and
accesskey. Beautiful Soup presents the value(s) of a multi-valued
attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
So in your case, in <div class="detailMS__incidentRow incidentRow--away odd">, the class attribute is multi-valued.
That's why span_autogoal.find_parent('div')['class'] gives you a list as output.
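The Beautiful Soup documentation also describes a way to turn this splitting off without switching to the XML parser: passing multi_valued_attributes=None to the BeautifulSoup constructor. A minimal sketch, assuming a reasonably recent bs4 release and a shortened version of the markup from the question:

```python
from bs4 import BeautifulSoup

data = (
    '<div class="detailMS__incidentRow incidentRow--away odd">'
    '<span class=" note-name">(Autogoal)</span>'
    "</div>"
)

# Default behaviour: class is split into a list of names
as_list = BeautifulSoup(data, "html.parser").div["class"]

# With multi_valued_attributes=None the attribute value stays a plain string
as_string = BeautifulSoup(
    data, "html.parser", multi_valued_attributes=None
).div["class"]

print(as_list)
print(as_string)
```

Joining the list with ' '.join(...) as in the question remains a perfectly reasonable alternative when only one attribute needs to be a string.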

How to Find a tag without specific attribute using beautifulsoup?

I'm trying to get the content of the 'p' tags that didn't have the specific attribute.
I have some tags with 'class'='cost', and some tags with 'class'='cost' and 'itemprop'='price'
all_cars = soup.find_all('div', attrs={'class': 'listdata'})
...
...
tatal_cost= car.findChildren('p', attrs={'class': 'cost'})
cost= car.findChildren('p', attrs={'class': 'cost', 'itemprop':'price'})
I am trying to find 'p' tags without the 'itemprop' attribute, but I can't find any solution.
BeautifulSoup's built-in attribute filters are enough for this. You can pass True as the value to simply check that the attribute is present. None specifies that the attribute should not be present. Likewise, the value can be any attribute value (e.g. 'cost').
from bs4 import BeautifulSoup
html="""
<p class="cost">paragraph 1</p>
<p class="cost">paragraph 2</p>
<p class="cost">paragraph 3</p>
<p class="cost" itemprop="1">paragraph 4</p>
<p class="somethingelse">paragraph 5</p>
"""
soup=BeautifulSoup(html,'html.parser')
print("---without 'itemprop' attribute")
print(soup.find_all('p',itemprop=None))
print("---with class = 'cost' and without 'itemprop' attribute----")
print(soup.find_all('p',attrs={'itemprop':None,"class":'cost'}))
#below is an alternative way to specify this
#print(soup.find_all('p',itemprop=None,class_='cost'))
Output
---without 'itemprop' attribute
[<p class="cost">paragraph 1</p>, <p class="cost">paragraph 2</p>, <p class="cost">paragraph 3</p>, <p class="somethingelse">paragraph 5</p>]
---with class = 'cost' and without 'itemprop' attribute----
[<p class="cost">paragraph 1</p>, <p class="cost">paragraph 2</p>, <p class="cost">paragraph 3</p>]
BeautifulSoup lets you define a function and pass it into its find_all() method:
def has_class_but_not_itemprop(tag):
    return tag.has_attr('class') and not tag.has_attr('itemprop')
# Pass this function into find_all() and you’ll pick up all the <p>
# tags you're after:
soup.find_all(has_class_but_not_itemprop)
# [<p class="cost">...</p>,
# <p class="cost">...</p>,
# <p class="cost">...</p>]
For more information, see the BeautifulSoup documentation.
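The same filter can also be written as a CSS selector, which some people find easier to read. This is a sketch using the sample HTML from the first answer; it assumes select() is backed by soupsieve, as it is in current bs4 releases, so the :not() pseudo-class is available:

```python
from bs4 import BeautifulSoup

html = """
<p class="cost">paragraph 1</p>
<p class="cost">paragraph 2</p>
<p class="cost">paragraph 3</p>
<p class="cost" itemprop="1">paragraph 4</p>
<p class="somethingelse">paragraph 5</p>
"""
soup = BeautifulSoup(html, "html.parser")

# <p> tags with class "cost" that do not have an itemprop attribute at all
without_itemprop = soup.select("p.cost:not([itemprop])")

texts = [p.get_text() for p in without_itemprop]
print(texts)
```

Unlike the has_class_but_not_itemprop function, this selector is limited to <p> tags with the class cost, so paragraph 5 is left out.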
