How to not include a particular element from soup.select()?


I use soup.select('.c-w a') to select elements. Inside c-w there is a c-s element, which I would like to exclude from this selection.
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3"><img class="soundpng" src="file://sound.png"/></a>
</div></div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    a['href'] = 'entry://'
and the result is
<div class="c-w">
<div class="c-s">
<a href="entry://"><img class="soundpng" src="file://sound.png"/></a>
</div></div>
My goal is to leave out .c-s a during this replacement. That is, when the search reaches c-s, it should skip that element and keep searching the other ones. Could you please explain how to achieve this?

Based on your comments, you can use .find_parent() to check whether the <a> tag is inside a tag with class="c-s":
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3"><img class="soundpng" src="file://sound.png"/></a>
</div>
<div>
<a href="...">...</a>
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    if a.find_parent(class_='c-s'):
        continue
    a['href'] = 'entry://'
print(soup.prettify())
Prints:
<div class="c-w">
 <div class="c-s">
  <a href="sound://english-french/sound/M000001099.mp3">
   <img class="soundpng" src="file://sound.png"/>
  </a>
 </div>
 <div>
  <a href="entry://">
   ...
  </a>
 </div>
</div>
EDIT: To exclude both .c-s and .c-v, you can do this:
from bs4 import BeautifulSoup
txt = '''
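<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3"><img class="soundpng" src="file://sound.png"/></a>
</div>
<div class="c-v">
<a href="sound://english-french/sound/M000001099.mp3"><img class="soundpng" src="file://sound.png"/></a>
</div>
<div>
<a href="...">...</a>
</div>
</div>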
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    if a.find_parent(class_=['c-s', 'c-v']):
        continue
    a['href'] = 'entry://'
print(soup.prettify())
Prints:
<div class="c-w">
 <div class="c-s">
  <a href="sound://english-french/sound/M000001099.mp3">
   <img class="soundpng" src="file://sound.png"/>
  </a>
 </div>
 <div class="c-v">
  <a href="sound://english-french/sound/M000001099.mp3">
   <img class="soundpng" src="file://sound.png"/>
  </a>
 </div>
 <div>
  <a href="entry://">
   ...
  </a>
 </div>
</div>
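
The same exclusion can probably be pushed into the selector itself. The sketch below assumes a Beautiful Soup release whose select() is backed by Soup Sieve (4.7 or later), which accepts descendant selectors inside :not(). The markup and href values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: two excluded containers (.c-s, .c-v) and one plain <div>.
txt = '''
<div class="c-w">
<div class="c-s"><a href="sound://example/one.mp3">sound</a></div>
<div class="c-v"><a href="sound://example/two.mp3">sound</a></div>
<div><a href="old://entry">entry</a></div>
</div>
'''

soup = BeautifulSoup(txt, 'html.parser')

# Select every <a> under .c-w except those that sit inside .c-s or .c-v.
for a in soup.select('.c-w a:not(.c-s a, .c-v a)'):
    a['href'] = 'entry://'

hrefs = [a['href'] for a in soup.select('a')]
print(hrefs)
```

If the installed version rejects the selector, the find_parent() loop above remains the safer choice.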

Related

Parse a challenging block of html with beautifulsoup

I've tried a lot of options with beautifulsoup but cannot seem to figure how to parse the following:
<div class="docSection profileQuestionsSection">
<div id="D_memberProfileQuestions" class="dotted-section">
<div id="D_memberProfileMeta" class="line">
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Location:</h4>
<p itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span class="locality" itemprop="addressLocality">website</span>, <span class="region" itemprop="addressRegion">WA</span><span class="display-none country-name" itemprop="addressCountry">USA</span>
</p>
</div>
</div>
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Member since:</h4>
<p>July 14, 2021</p>
</div>
</div>
<div class="size1of3 lastUnit">
<div class="D_memberProfileContentItem">
</div>
</div>
</div>
<div class="line">
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">What types of events in the area interest you?</h4>
<p class="D_empty">No answer yet</p>
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Introduction</h4>
<p class="D_empty">No introduction yet</p>
</div>
</div>
</div>
From the above snippet I'm trying to parse the following bolded text from <p>:
What types of events in the area interest you?
No answer yet
If I try the following, it just prints blank lists []. What might I be doing wrong?
import requests
from bs4 import BeautifulSoup

req = requests.get(member)  # member holds the URL of the profile page
soupp = BeautifulSoup(req.text, "html.parser")
div = soupp.find('div', attrs={"class": "D_memberProfileContentItem"})
children = div.findChildren("div", recursive=True)
for child in children:
    print(child)
Any thoughts? Thanks.

from bs4 import BeautifulSoup
html = '''<div class="docSection profileQuestionsSection">
<div id="D_memberProfileQuestions" class="dotted-section">
<div id="D_memberProfileMeta" class="line">
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Location:</h4>
<p itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<a href="https://www.website.com/cities/us/97298/"><span class="locality"
itemprop="addressLocality">website</span>, <span class="region"
itemprop="addressRegion">WA</span></a><span class="display-none country-name"
itemprop="addressCountry">USA</span>
</p>
</div>
</div>
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Member since:</h4>
<p>July 14, 2021</p>
</div>
</div>
<div class="size1of3 lastUnit">
<div class="D_memberProfileContentItem">
</div>
</div>
</div>
<div class="line">
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">What types of events in the area interest you?</h4>
<p class="D_empty">No answer yet</p>
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Introduction</h4>
<p class="D_empty">No introduction yet</p>
</div>
</div>
</div>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('.D_memberProfileContentItem:nth-child(3) > p').text)
Output:
No answer yet
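
The nth-child selector depends on the position of the block in the page. A position-independent variant, sketched here under the assumption that the question text in the h4 is known in advance, is to find the heading by its text and take the p that follows it. The markup is a trimmed copy of the snippet above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the profile markup; only two content items are kept.
html = '''
<div id="D_memberProfileQuestions" class="dotted-section">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Member since:</h4>
<p>July 14, 2021</p>
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">What types of events in the area interest you?</h4>
<p class="D_empty">No answer yet</p>
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

question = 'What types of events in the area interest you?'
# Locate the heading by its exact text, then step to the next <p> sibling.
heading = soup.find('h4', string=question)
answer = heading.find_next_sibling('p').get_text(strip=True)

print(heading.get_text(strip=True))
print(answer)
```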

Find string with tag search inside a line using BeautifulSoup

I want to extract holy place from <p class="answer"> <i class="fa fa-circle" aria-hidden="true"></i> holy place</p>
and plays from
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> plays</p>
HTML Source Code
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">A pilgrim is a person who undertakes a journey to a --- <br>
</h1>
<div class="card-body">
<div class="row">
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> holy place</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a mosque</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a bazar</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a new country</p>
</div>
</div>
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">Shakespeare is known mostly for his--- <br>
</h1>
<div class="card-body">
<div class="row">
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> poetry</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> novels</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> autobiography</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> plays</p>
</div>
</div>
My code
question_block = soup.find_all('div', attrs = {'class':'card card-custom custom-color'})
right_answer = question_block.find('p', attrs={'class':'answer','i':'fa fa-circle'}).get_text(strip=True)
Getting output: None
Thanks in advance and your answer will be highly appreciated.
Happy Coding :)

You want to call the appropriate css pattern on each question block. In this case .answer > .fa-circle will move you adjacent to the value you want, and next_sibling will then return the value you want:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
question_blocks = soup.find_all('div', attrs = {'class':'card card-custom custom-color'})
for q in question_blocks:
    # print(q)
    print(q.select_one('.card-header').text)
    print(q.select_one('.answer > .fa-circle').next_sibling.strip())
    print('*' * 50)

I treated your data as html, used a CSS selector to locate the matching i tags, and then looped over them, calling find_previous('p') to reach the enclosing p tag, which holds the correct answer text:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
main_div = soup.select("p > i.fa.fa-circle")
for data in main_div:
    print(data.find_previous('p').text)
Output:
holy place
plays

You can directly select the p with class name answer and extract the text inside it.
x = soup.find('p', class_="answer")
print(x.text)
This code will extract only holy place and plays from p tags.
p = soup.findAll('p', class_='answer')
for i in p:
    if i.text.strip() in ('plays', 'holy place'):
        print(i.text.strip())
Output:
holy place
plays
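
The question's code returns nothing useful for two reasons: find_all() returns a list-like ResultSet rather than a single tag, and a key such as 'i' inside attrs is read as an attribute name, not as a child tag. One way to state "a p.answer that has an i.fa-circle child" directly is the :has() pseudo-class. This is a sketch that assumes a Beautiful Soup version with Soup Sieve behind select(); the markup is a shortened copy of the question's HTML.

```python
from bs4 import BeautifulSoup

# Shortened copy of the two question cards.
html = '''
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">A pilgrim is a person who undertakes a journey to a --- <br></h1>
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> holy place</p>
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a mosque</p>
</div>
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">Shakespeare is known mostly for his--- <br></h1>
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> poetry</p>
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> plays</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# .fa-circle matches the class token "fa-circle" only, so "fa-circle-o" is skipped.
right_answers = [
    p.get_text(strip=True)
    for p in soup.select('p.answer:has(> i.fa-circle)')
]
print(right_answers)
```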

How to move sub-tags to right after a mother tag?

I have an html document in which there are many <div class="ex_example"> .. </div> elements inside <div class="c-s">, which is in turn inside <div class="c-w">, i.e.
<div class="c-w">
<div class="c-s">
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
</div></div>
Could you please explain how to move all the <div class="ex_example"> .. </div> elements to right before <div class="c-w">? I mean
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="c-w">
<div class="c-s">
</div></div>
My code is
import requests
session = requests.Session()
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
r = session.get('https://dictionnaire.lerobert.com/definition/aimer', headers = headers)
soup = BeautifulSoup(r.content, 'html.parser')
Thank you so much for your help!
Update: There can be more than one <div class="c-w">, and some of them may not contain <div class="c-s">.
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> aa </div>
<div class="ex_example"> aa </div>
</div>
</div>
<div class="audio">link</div>
<div class="c-w">
<div class="c-s">
<div class="ex_example"> xx </div>
<div class="ex_example"> yy </div>
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')

I hope I understood your question well: You can use .insert_before() to insert tags/strings before some tag/string:
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> 1.. </div>
<div class="ex_example"> 2.. </div>
<div class="ex_example"> 3.. </div>
</div></div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for c in list(soup.select_one('div.c-s').contents):
    soup.select_one('div.c-w').insert_before(c)
print(soup)
Prints:
<div class="ex_example"> 1.. </div>
<div class="ex_example"> 2.. </div>
<div class="ex_example"> 3.. </div>
<div class="c-w">
<div class="c-s"></div></div>
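
The question's update mentions several c-w blocks, some of which have no c-s. A minimal sketch for that case follows, assuming each ex_example should land directly before its own c-w ancestor. The sample markup is made up to include a c-w without a c-s.

```python
from bs4 import BeautifulSoup

# Made-up markup: the middle c-w block has no c-s child.
txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> aa </div>
<div class="ex_example"> bb </div>
</div>
</div>
<div class="audio">link</div>
<div class="c-w"><p>no c-s here</p></div>
<div class="c-w">
<div class="c-s">
<div class="ex_example"> xx </div>
</div>
</div>
'''

soup = BeautifulSoup(txt, 'html.parser')

for cw in soup.select('div.c-w'):
    cs = cw.select_one('div.c-s')
    if cs is None:
        continue  # nothing to move out of this block
    # insert_before() detaches each tag from c-s and re-attaches it before c-w.
    for ex in cs.select('div.ex_example'):
        cw.insert_before(ex)

top_level = [d['class'][0] for d in soup.find_all('div', recursive=False)]
print(top_level)
```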

how to get to the next element of list inside for loop in django template?

I have a list like
lst = [
['img1','img2','img3','img4','img5'],
['img6','img7','img8','img9']
]
and a template like
<div class="row"> -- 1st row
<div class="col-lg-3" data-aos="fade-up"> --col-lg-3
<a href="{{img1}}"> --img1
<img src="{{img1}}"> -- img1
</a>
</div>
<div class="col-lg-6" data-aos="fade-up" data-aos-delay="100"> --col-lg-6
<a href="{{img2}}"> --img2
<img src="{{img2}}"> --img2
</a>
</div>
<div class="col-lg-3" data-aos="fade-up" data-aos-delay="200"> --col-lg-3
<a href="{{img}}"> --img3
<img src="{{img}}"> --img3
</a>
</div>
</div>
<div class="row"> ---2nd row
<div class="col-lg-8" data-aos="fade-up" data-aos-delay="100"> --col-lg-8
<a href="{{img2}}"> --img2
<img src="{{img2}}"> --img2
</a>
</div>
<div class="col-lg-4" data-aos="fade-up" data-aos-delay="200"> --col-lg-4
<a href="{{img}}"> --img3
<img src="{{img}}"> --img3
</a>
</div>
</div>
Please note a few things:
every col-lg column has a different size
every <a> and <img> tag in a column contains one image
I want to iterate over the whole template, but in a different way
E.g., in Python I can iterate over the list and assign the values as I go:
for data in lst:
    it = iter(data)
    print(next(it))  # print stands in for the img tag
    print(next(it))
    print(next(it))
    print(next(it))
    print(next(it))
But how can I do this in a Django template?

Django's {% for %} template tag renders the markup between {% for %} and {% endfor %} once for each item in data.
That being said, you could adapt your solution like so:
<div class="row align-items-stretch">
{% for img in data %}
<div class="col-6 col-md-6 col-lg-3" data-aos="fade-up">
<a href="{{ img }}" class="d-block photo-item" data-fancybox="gallery">
<img src="{{ img.url }}" alt="Image" class="img-fluid">
<div class="photo-text-more">
<span class="icon icon-camera"></span>
</div>
</a>
</div>
{% endfor %}
</div>
If data contains three items/images, the output will be something like:
<div class="row align-items-stretch">
# 1st iteration
<div class="col-6 col-md-6 col-lg-3" data-aos="fade-up">
<a href="href_according_to_data" class="d-block photo-item" data-fancybox="gallery">
<img src="link" alt="Image" class="img-fluid">
<div class="photo-text-more">
<span class="icon icon-camera"></span>
</div>
</a>
</div>
# 2nd iteration
<div class="col-6 col-md-6 col-lg-3" data-aos="fade-up">
<a href="href_according_to_data" class="d-block photo-item" data-fancybox="gallery">
<img src="link" alt="Image" class="img-fluid">
<div class="photo-text-more">
<span class="icon icon-camera"></span>
</div>
</a>
</div>
# 3rd iteration
<div class="col-6 col-md-6 col-lg-3" data-aos="fade-up">
<a href="href_according_to_data" class="d-block photo-item" data-fancybox="gallery">
<img src="link" alt="Image" class="img-fluid">
<div class="photo-text-more">
<span class="icon icon-camera"></span>
</div>
</a>
</div>
</div>
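
Because each column in the question has a different width, one option is to pair every image with its column class in the view before rendering. The sketch below is plain Python and only illustrates the pairing step. COLUMN_CLASSES and pair_with_columns are hypothetical names, and the class order is read off the question's template.

```python
# Column classes in the order they appear in the question's template.
COLUMN_CLASSES = ['col-lg-3', 'col-lg-6', 'col-lg-3', 'col-lg-8', 'col-lg-4']

lst = [
    ['img1', 'img2', 'img3', 'img4', 'img5'],
    ['img6', 'img7', 'img8', 'img9'],
]


def pair_with_columns(groups, classes=COLUMN_CLASSES):
    """Return one list of (css_class, image) pairs per group of images."""
    # zip() stops at the shorter sequence, so a short group is not an error.
    return [list(zip(classes, group)) for group in groups]


paired = pair_with_columns(lst)
print(paired[0])
```

With the paired list passed to the template context, a nested loop such as {% for css_class, img in group %} can emit the class and the image together, so no manual next() is needed.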

Not able to extract urls from HTML BeautifulSoup object

I am looking to extract following url "https://mania.bg/p/pulover-alexander-mcqueen-p409648" from html (BeautifulSoup object) named urls that looks like:
[<a class="product sellout product-sellout float-left status-1" data-id="409648" data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" data-status="1" href="https://mania.bg/p/pulover-alexander-mcqueen-p409648"> <div class="product-hover clearfix prevent-flicker"><div class="module-icons favourite tooltip" data-id="409648" data-title=" Любима находка на 24 клиент/и. "> <img alt="" class="favourite-product like-product unactivated" data-id="409648" src="dist/assets/icon_favourite_off.png"/></div> <div class="campaign" style="color: #FFF;background-color: #000000;"> NIGHT </div> <div class="profit-icons-wrapper clearfix"> </div> <div class="product-basic-info"> <div class="image-wrapper"> <img alt="Пуловер Alexander McQueen" class="front" data-url="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-2.jpg" src="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-1.jpg" title="Пуловер Alexander McQueen - Mania"> </img></div> <div class="clearfix brand-line"> <div class="brand float-left text-uppercase">Alexander McQueen</div> <div class="size float-right">S</div> </div> </div> <div class="prices-section"> <div class="prices-inner-section"> <div class="price-wrapper clearfix"> <div class="price-title text-uppercase float-left"> Начална цена </div> <div class="price old"> <span>98.00</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left"> -40% </div> <div class="price old"> <span>58.80</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left" style="color: #FFF;background-color: #000000"> -40% </div> <div class="price"> <span>35.28</span> <span class="currency">лв.</span> </div> </div> </div> </div> </div> <div class="button button-auction buy-now text-center float-left tooltip prevent-popup-close" data-id="409648" 
data-title="Може да добавите този продукт към количката.">ДОБАВЯМ<img alt="" class="bag-icon" src="dist/assets/icon_bag_button.svg"> </img></div> </a>]
With following code:
for num in range(len(urls)):
    url = urls[num - 1].a['href']
I also tried to use:
url = urls[num - 1].a['data-producturl']
I get "TypeError: 'NoneType' object is not subscriptable" as url is None.

import requests
import bs4
url = 'https://mania.bg/p/pulover-alexander-mcqueen-p409648'
data = requests.get(url)
soup = bs4.BeautifulSoup(data.text,'html.parser')
urls = soup.find_all('a', attrs={'class': 'product sellout product-sellout float-left status-1'})
for num in range(len(urls)):
    url = urls[num]['href']
    print(url)

Try this. Here's an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
[<a class="product sellout product-sellout float-left status-1" data-id="409648" data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" data-status="1" href="https://mania.bg/p/pulover-alexander-mcqueen-p409648"> <div class="product-hover clearfix prevent-flicker"><div class="module-icons favourite tooltip" data-id="409648" data-title=" Любима находка на 24 клиент/и. "> <img alt="" class="favourite-product like-product unactivated" data-id="409648" src="dist/assets/icon_favourite_off.png"/></div> <div class="campaign" style="color: #FFF;background-color: #000000;"> NIGHT </div> <div class="profit-icons-wrapper clearfix"> </div> <div class="product-basic-info"> <div class="image-wrapper"> <img alt="Пуловер Alexander McQueen" class="front" data-url="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-2.jpg" src="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-1.jpg" title="Пуловер Alexander McQueen - Mania"> </img></div> <div class="clearfix brand-line"> <div class="brand float-left text-uppercase">Alexander McQueen</div> <div class="size float-right">S</div> </div> </div> <div class="prices-section"> <div class="prices-inner-section"> <div class="price-wrapper clearfix"> <div class="price-title text-uppercase float-left"> Начална цена </div> <div class="price old"> <span>98.00</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left"> -40% </div> <div class="price old"> <span>58.80</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left" style="color: #FFF;background-color: #000000"> -40% </div> <div class="price"> <span>35.28</span> <span class="currency">лв.</span> </div> </div> </div> </div> </div> <div class="button button-auction buy-now text-center float-left tooltip prevent-popup-close" data-id="409648" 
data-title="Може да добавите този продукт към количката.">ДОБАВЯМ<img alt="" class="bag-icon" src="dist/assets/icon_bag_button.svg"> </img></div> </a>]
'''
doc = SimplifiedDoc(html)
urls = doc.selects('a.product sellout product-sellout float-left status-1')
print ([(url.href,url['data-producturl']) for url in urls])
Result:
[('https://mania.bg/p/pulover-alexander-mcqueen-p409648', 'https://mania.bg/p/pulover-alexander-mcqueen-p409648')]

find_all already gives you the list of a elements; you just need to get the href from each.
from bs4 import BeautifulSoup
import requests
url = 'https://mania.bg/p/pulover-alexander-mcqueen-p409648'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
for a in soup.find_all(
        'a',
        attrs={'class':
               'product sellout product-sellout float-left status-1'}):
    print(a['data-producturl'])
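
The TypeError in the question comes from the extra .a: each item in urls is already an <a> tag, so urls[i].a looks for another <a> nested inside it and returns None. A small sketch with a cut-down copy of the markup shows the difference.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the product link from the question.
html = (
    '<a class="product status-1" '
    'data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" '
    'href="https://mania.bg/p/pulover-alexander-mcqueen-p409648">'
    '<div class="brand">Alexander McQueen</div></a>'
)

soup = BeautifulSoup(html, 'html.parser')
urls = soup.find_all('a', class_='product')

first = urls[0]
nested = first.a       # searches inside the <a> for another <a>: None here
href = first['href']   # read the attribute from the tag itself

print(nested)
print(href)
```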
