Parse a challenging block of HTML with BeautifulSoup - python-3.x

I've tried a lot of options with BeautifulSoup but cannot figure out how to parse the following:
<div class="docSection profileQuestionsSection">
<div id="D_memberProfileQuestions" class="dotted-section">
<div id="D_memberProfileMeta" class="line">
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Location:</h4>
<p itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span class="locality" itemprop="addressLocality">website</span>, <span class="region" itemprop="addressRegion">WA</span><span class="display-none country-name" itemprop="addressCountry">USA</span>
</p>
</div>
</div>
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Member since:</h4>
<p>July 14, 2021</p>
</div>
</div>
<div class="size1of3 lastUnit">
<div class="D_memberProfileContentItem">
</div>
</div>
</div>
<div class="line">
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">What types of events in the area interest you?</h4>
<p class="D_empty">No answer yet</p>
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Introduction</h4>
<p class="D_empty">No introduction yet</p>
</div>
</div>
</div>
From the above snippet I'm trying to extract the following text (the heading and the <p> below it):
What types of events in the area interest you?
No answer yet
If I try the following, it just prints a blank list []. What might I be doing wrong?
req = requests.get(member)
soupp = BeautifulSoup(req.text, "html.parser")
div = soupp.find('div', attrs={"class": "D_memberProfileContentItem"})
children = div.findChildren("div", recursive=True)
for child in children:
    print(child)
Any thoughts? Thanks.

from bs4 import BeautifulSoup
html = '''<div class="docSection profileQuestionsSection">
<div id="D_memberProfileQuestions" class="dotted-section">
<div id="D_memberProfileMeta" class="line">
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Location:</h4>
<p itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<a href="https://www.website.com/cities/us/97298/"><span class="locality"
itemprop="addressLocality">website</span>, <span class="region"
itemprop="addressRegion">WA</span></a><span class="display-none country-name"
itemprop="addressCountry">USA</span>
</p>
</div>
</div>
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Member since:</h4>
<p>July 14, 2021</p>
</div>
</div>
<div class="size1of3 lastUnit">
<div class="D_memberProfileContentItem">
</div>
</div>
</div>
<div class="line">
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">What types of events in the area interest you?</h4>
<p class="D_empty">No answer yet</p>
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Introduction</h4>
<p class="D_empty">No introduction yet</p>
</div>
</div>
</div>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('.D_memberProfileContentItem:nth-child(3) > p').text)
Output:
No answer yet
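The original attempt printed nothing because find returns only the first matching div, and that div contains an <h4> and a <p> but no nested div, so asking it for div children gives an empty list. If the position of the block might change, nth-child(3) is also a little fragile. Here is a minimal sketch of an alternative, assuming each D_memberProfileContentItem holds one <h4> and one <p> as in the snippet above: pair them up by heading text. The helper name answers_by_question and the trimmed sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the profile markup from the question (illustrative only).
html = """
<div id="D_memberProfileQuestions" class="dotted-section">
  <div class="D_memberProfileContentItem">
    <h4 class="flush--bottom">Member since:</h4>
    <p>July 14, 2021</p>
  </div>
  <div class="D_memberProfileContentItem">
    <h4 class="flush--bottom">What types of events in the area interest you?</h4>
    <p class="D_empty">No answer yet</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The first matching item has no nested <div>, which is why the
# original findChildren("div") call came back empty.
first_item = soup.find("div", class_="D_memberProfileContentItem")
nested_divs = first_item.find_all("div")


def answers_by_question(parsed):
    """Map each <h4> heading to the text of the <p> in the same item."""
    pairs = {}
    for item in parsed.find_all("div", class_="D_memberProfileContentItem"):
        heading = item.find("h4")
        answer = item.find("p")
        # Skip empty items such as the blank third column in the question.
        if heading is not None and answer is not None:
            pairs[heading.get_text(strip=True)] = answer.get_text(" ", strip=True)
    return pairs


profile = answers_by_question(soup)
print(nested_divs)
print(profile.get("What types of events in the area interest you?"))
```

Looking answers up by heading text keeps working if the site adds or removes other profile blocks, at the cost of depending on the exact wording of the heading.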

Related

How to extract text from a div and a p tag and append to a dataframe using beautifulsoup

<div class="col-xs-10 fullWidth">
<div class="col-xs-8ths halfWidth">
<div class="title">
10 YEAR
</div>
<div class="header noimage">
<p class="numbers">7.48%</p>
</div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">
7 YEAR
</div>
<div class="header noimage">
<p class="numbers">7.07%</p>
</div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">
5 YEAR
</div>
<div class="header noimage">
<p class="numbers">5.32%</p>
</div>
I'm trying to extract the <div class="title"> text and the <p class="numbers"> text into a dataframe.
I can't get past the code below; for some reason I can't chain two find_all searches in succession.
result = soup.find_all('div', attrs={'id':Fund_Code})
periods = result.find_all('div', attrs={'class': 'title'})
periods
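find_all returns a ResultSet (a list of tags), and a list has no find_all of its own, which is why the second call fails. A sketch of one way around it, assuming the fund container is a div with an id (the value FUND1 and the wrapper div are hypothetical, since the question does not show them): use find to get a single tag, then search inside that tag.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the outer div and its id are assumed.
html = """
<div id="FUND1">
  <div class="col-xs-8ths halfWidth">
    <div class="title">
      10 YEAR
    </div>
    <div class="header noimage">
      <p class="numbers">7.48%</p>
    </div>
  </div>
  <div class="col-xs-8ths halfWidth">
    <div class="title">
      7 YEAR
    </div>
    <div class="header noimage">
      <p class="numbers">7.07%</p>
    </div>
  </div>
</div>
"""

fund_code = "FUND1"  # hypothetical stand-in for Fund_Code in the question
soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag (or None), so find_all() can be called on it.
container = soup.find("div", attrs={"id": fund_code})

titles = [d.get_text(strip=True) for d in container.find_all("div", class_="title")]
numbers = [p.get_text(strip=True) for p in container.find_all("p", class_="numbers")]

# Pair each period with its value, in document order.
rows = list(zip(titles, numbers))
print(rows)
```

If a dataframe is needed, rows can be handed to pandas.DataFrame(rows, columns=["period", "value"]); the column names here are only placeholders.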

Why is get route showing TypeError: Cannot read property 'title' of null and crashing the website?

I am building an Express app with MongoDB in Node.js. When I make a GET request to the show-posts route, all posts are rendered correctly when I pass them to the EJS file, but after the page is displayed an error appears in the terminal:
events.js:291
      throw er; // Unhandled 'error' event
      ^
TypeError: Cannot read property 'title' of null
This is my GET route in router/posts.js:
router.get("/commerce", (req, res) => {
    if (req.query.search) {
        var noMatch;
        // gives search results on author name, content and title of the post
        const regex = new RegExp(escapeRegex(req.query.search), 'gi')
        Post.find({$or: [{title: regex}, {content: regex}, {'author.username': regex}], subject: "commerce"}, function(err, allposts) {
            if (err) console.log(err)
            else {
                if (allposts.length < 1) {
                    noMatch = "No posts matched the search results , please try again"
                }
                console.log("searched", allposts)
                res.render("commerce", {posts: allposts, noMatch: noMatch, message: req.flash('success')});
            }
        })
    } else {
        Post.find({subject: "commerce"}, function(err, allposts) {
            if (err) {
                console.log(err);
                res.statusCode = 500;
                res.end('error');
            }
            else {
                console.log("actually all posts", allposts)
                res.render("commerce", {posts: allposts, noMatch: noMatch, message: req.flash('success')});
            }
        })
    }
})
commerce.ejs
<%- include("./partials/header1.ejs") %>
<link rel="stylesheet" href="/stylesheets/commerce.css">
<%- include("./partials/header2.ejs") %>
<div class="top-horiz-bar">
<div class="logo-section-in-bar">
<img class="top-img-logo" src="./finallogopic.png" alt="logo-pic" height="60px" width="60px">
<!-- <img class="top-name-logo" src="./zoomed-brand-name.png" alt="logo-name" height="80px" width="130px"> -->
<h2 class="top-name-logo">Backbanchers</h2>
</div>
<div class="other-icons">
<div class="search-box">
<input id="search-font" class="search-txt" type="text" name="" placeholder="Type to search">
<a class="search-btn" href="#">
<i class="fas fa-search"></i>
</a>
</div>
<div class="person-account">
<a class="account-info" href="#">
<i class="far fa-user-circle fa-2x"></i>
<span>Sign-in/up</span>
</a>
</div>
</div>
</div>
<div class="vertical-nav-bar">
Home<span><i class="fas fa-home"></i></span>
<a class="blog-section-left" href="#">Blogs<span><i class="fas fa-user-graduate"></i></span>
<!-- <ul class="sections-blog">
<a class="option-blog" href="#"><li>Business & Economics</li></a>
<a class="option-blog" href="#"><li>Commerce</li></a>
<a class="option-blog" href="#"><li>Personality Devlopment</li></a>
</ul> -->
</a>
Authors<span><i class="fas fa-pencil-alt"></i></span>
Newsletter<span><i class="fas fa-newspaper"></i></span>
Contact US<span><i class="fas fa-phone-alt"></i></span>
<!-- Services<span><i class="fas fa-question"></i></span> -->
About Us<span><i class="fas fa-users"></i></span>
<!-- Services<span><i class="fas fa-question"></i></span> -->
</div>
<div class="all-engineering-articles">
<div class="side-box">
<div class="one-of-3-heading">
<h2 class="side-box-heading-one">Trending</h2>
<div class="sub-part-section">
<h2 class="title-side-box">How 3D cameras with integrated data processing reduce the load on network and host PC</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
<div class="sub-part-section">
<h2 class="title-side-box">Indy Cars Get a Little Safer—Thanks to a 200+ mph Windshield</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
<div class="sub-part-section">
<h2 class="title-side-box">Additive Manufacturing Qualification & Certification During Crises</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
</div>
<div class="two-of-3-heading">
<h2 class="side-box-heading-two">Recommended</h2>
<div class="sub-part-section">
<h2 class="title-side-box">Autonomous Mobile Robots and Cobots Improve Worker Safety and Retention</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
<div class="sub-part-section">
<h2 class="title-side-box">Sustainable Control Panel Design</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
<div class="sub-part-section">
<h2 class="title-side-box">The Difference Between: Push-In Terminals versus Other Types of Connections</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
</div>
<div class="three-of-3-heading">
<h2 class="side-box-heading-three">Popular</h2>
<div class="sub-part-section">
<h2 class="title-side-box">
Why Using IIoT for Pneumatics is Simple, Yet Critical</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
<div class="sub-part-section">
<h2 class="title-side-box">
IR Camera Captures Defects in 3D Printing</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
<div class="sub-part-section">
<h2 class="title-side-box">Additive Manufacturing Qualification & Certification During Crises</h2>
<div class="date-side-box">
<span>15 August 2020</span>
</div>
</div>
</div>
</div>
<% posts.forEach(function(post){ %>
<div class="blog-post">
<div class="blog-post__img">
<img src="https://image.freepik.com/free-photo/cyborg-hand-pressing-keyboard-laptop-3d-rendering_117023-946.jpg" alt="article-pic" class="blog-post__article-img">
</div>
<div class="blog-post__info">
<div class="blog-post__date">
<span><%= post.publishDay%></span>
<span><%= post.publish_date %></span>
</div>
<div class="author-name">
<ul class="author-content">
<li class="name-aut"><%= post.author.username %>></li>
<li class="save-to-later"><i class="fas fa-plus "></i></li>
</ul>
</div>
console.log(<%= post.title %>)
<h1 class="blog-post__title"><%= post.title %></h1>
<p class="blog-post__text"><%= post.content.substring(0,120) %></p>
Read more
<ul class="btns-on-blogcard">
<!-- <li class="btns-blog"><i class="far fa-eye fa-2x "></i></li> -->
<li class="btns-blog"><i class="far fa-hand-peace fa-2x "></i></li>
<li class="btns-blog"><i class="fas fa-share fa-2x "></i></li>
</ul>
</div>
</div>
<% }) %>
<div class="page-end">
</div>
<div class="next-page">
<h2 class="new-page-blogs">-Page 1 of 1- <span class="next-page-btn">Next page</span></h2>
</div>
</div>
<section class="footer" >
<footer id="foo">
<div class="overall-footer">
<div class="first-part">
<h2 class="our-name">Backbenchers</h2>
<img id="logo-of-learners" src="finallogopic.png" alt="our-logo" height="40%" width="14%">
</div>
<div class="bordering-right">
</div>
<div class="second-part">
<h2 class="second-links">Quick Links</h2>
<ul class="ul-class-footer">
<li class="quick-buttons">Home</li>
<li class="quick-buttons">Buisness & Economics</li>
<li class="quick-buttons">Commerce</li>
<li class="quick-buttons">Engineering</li>
<li class="quick-buttons">Personality Devlopment</li>
</ul>
</div>
<div class="third-party">
<div class="name-social-footer">
<li class="quick-buttons-2">Facebook</li>
<li class="quick-buttons-2">Instagram</li>
<li class="quick-buttons-2">LinkedIn</li>
</div>
<style>
a{
text-decoration: none;
color: white;
}
</style>
<div class="media-buttons-footer">
<a class="btns-footer-class-1" href="#"><i id="facebook-1" class="fab fa-facebook-f "></i></a>
<a class="btns-footer-class-2" href="#"><i id="instagram-1" class="fab fa-instagram "></i></a>
<a class="btns-footer-class-3" href="#"><i id="linkedin-1" class="fab fa-linkedin-in "></i></a>
</div>
</div>
<div class="fourth-party">
<h2 class="fourth-links">Social</h2>
<a href="mailto:ritishgupta45#gmail.com">
<span id="envolope-footer" class="fas fa-envelope fa-2x"></span>
<span id="email-footer" class="text">Backbenchers#gmail.com</span>
</a>
<input id="name-id-id" type="email" placeholder="Name">
<input id="email-id-id" type="email" placeholder="Email-id" required>
<input id="leave-msg-id-id" type="text" placeholder="leave a message">
<div class="send-button-footer">
<button id="send-id-id" type="submit">Send</button>
</div>
</div>
<div class="fifth-party">
<h2 class="fifth-links">Connect with Us</h2>
<h2 class="heading-of-news-footer">Newsletter subscription</h2>
<p class="para-of-news-footer">Subscribe to our fortnightly newsletter <br> to gain great knowledge exposure. <br> Also, stand a chance to participate <br> in brainstorming competitions <br> and win exciting prizes.</p>
<div class="suscribe-footer">
<input id="footer-suscribe-btns" type="email" placeholder="enter your email id" required>
<div class="red-btn-suscribe">
<button id="suscribing" type="button">Subscribe</button>
</div>
</div>
</div>
<div class="final-part">
<div class="box-copy">
<div class="btns-margin">
<h2 class="bottom-copy-footer">About Us</h2>
<h2 class="bottom-copy-footer">Privacy Policy</h2></a>
<h2 class="bottom-copy-footer">Terms and conditions</h2>
</div>
<h2 class="final-copywrite">copyright © All Rights Reserved | Backbenchers</h2>
</div>
</div>
</footer>
</section>
<%- include("./partials/footer.ejs") %>
Screenshot 1: output of console.log("actually all posts", allposts)
Screenshot 2: output of console.log("actually all posts", allposts)
For these screenshots I added console.log("actually all posts", allposts) before res.render(). All the posts are printed, and then null is printed at the end. Where is that null coming from? After the null, the error is printed to the console, as visible in the screenshots above.

Not able to extract urls from HTML BeautifulSoup object

I want to extract the URL "https://mania.bg/p/pulover-alexander-mcqueen-p409648" from a BeautifulSoup result named urls that looks like this:
[<a class="product sellout product-sellout float-left status-1" data-id="409648" data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" data-status="1" href="https://mania.bg/p/pulover-alexander-mcqueen-p409648"> <div class="product-hover clearfix prevent-flicker"><div class="module-icons favourite tooltip" data-id="409648" data-title=" Любима находка на 24 клиент/и. "> <img alt="" class="favourite-product like-product unactivated" data-id="409648" src="dist/assets/icon_favourite_off.png"/></div> <div class="campaign" style="color: #FFF;background-color: #000000;"> NIGHT </div> <div class="profit-icons-wrapper clearfix"> </div> <div class="product-basic-info"> <div class="image-wrapper"> <img alt="Пуловер Alexander McQueen" class="front" data-url="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-2.jpg" src="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-1.jpg" title="Пуловер Alexander McQueen - Mania"> </img></div> <div class="clearfix brand-line"> <div class="brand float-left text-uppercase">Alexander McQueen</div> <div class="size float-right">S</div> </div> </div> <div class="prices-section"> <div class="prices-inner-section"> <div class="price-wrapper clearfix"> <div class="price-title text-uppercase float-left"> Начална цена </div> <div class="price old"> <span>98.00</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left"> -40% </div> <div class="price old"> <span>58.80</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left" style="color: #FFF;background-color: #000000"> -40% </div> <div class="price"> <span>35.28</span> <span class="currency">лв.</span> </div> </div> </div> </div> </div> <div class="button button-auction buy-now text-center float-left tooltip prevent-popup-close" data-id="409648" 
data-title="Може да добавите този продукт към количката.">ДОБАВЯМ<img alt="" class="bag-icon" src="dist/assets/icon_bag_button.svg"> </img></div> </a>]
With the following code:
for num in range(len(urls)):
    url = urls[num - 1].a['href']
I also tried to use:
url = urls[num - 1].a['data-producturl']
I get "TypeError: 'NoneType' object is not subscriptable" because urls[num - 1].a is None.
import requests
import bs4
url = 'https://mania.bg/p/pulover-alexander-mcqueen-p409648'
data = requests.get(url)
soup = bs4.BeautifulSoup(data.text,'html.parser')
urls = soup.find_all('a', attrs={'class': 'product sellout product-sellout float-left status-1'})
for num in range(len(urls)):
    url = urls[num]['href']
    print(url)
Try this. Here's an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
[<a class="product sellout product-sellout float-left status-1" data-id="409648" data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" data-status="1" href="https://mania.bg/p/pulover-alexander-mcqueen-p409648"> <div class="product-hover clearfix prevent-flicker"><div class="module-icons favourite tooltip" data-id="409648" data-title=" Любима находка на 24 клиент/и. "> <img alt="" class="favourite-product like-product unactivated" data-id="409648" src="dist/assets/icon_favourite_off.png"/></div> <div class="campaign" style="color: #FFF;background-color: #000000;"> NIGHT </div> <div class="profit-icons-wrapper clearfix"> </div> <div class="product-basic-info"> <div class="image-wrapper"> <img alt="Пуловер Alexander McQueen" class="front" data-url="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-2.jpg" src="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-1.jpg" title="Пуловер Alexander McQueen - Mania"> </img></div> <div class="clearfix brand-line"> <div class="brand float-left text-uppercase">Alexander McQueen</div> <div class="size float-right">S</div> </div> </div> <div class="prices-section"> <div class="prices-inner-section"> <div class="price-wrapper clearfix"> <div class="price-title text-uppercase float-left"> Начална цена </div> <div class="price old"> <span>98.00</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left"> -40% </div> <div class="price old"> <span>58.80</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left" style="color: #FFF;background-color: #000000"> -40% </div> <div class="price"> <span>35.28</span> <span class="currency">лв.</span> </div> </div> </div> </div> </div> <div class="button button-auction buy-now text-center float-left tooltip prevent-popup-close" data-id="409648" 
data-title="Може да добавите този продукт към количката.">ДОБАВЯМ<img alt="" class="bag-icon" src="dist/assets/icon_bag_button.svg"> </img></div> </a>]
'''
doc = SimplifiedDoc(html)
urls = doc.selects('a.product sellout product-sellout float-left status-1')
print ([(url.href,url['data-producturl']) for url in urls])
Result:
[('https://mania.bg/p/pulover-alexander-mcqueen-p409648', 'https://mania.bg/p/pulover-alexander-mcqueen-p409648')]
find_all already gives you the list of a elements; you just need to get the href from each.
from bs4 import BeautifulSoup
import requests
url = 'https://mania.bg/p/pulover-alexander-mcqueen-p409648'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
for a in soup.find_all(
        'a',
        attrs={'class': 'product sellout product-sellout float-left status-1'}):
    print(a['data-producturl'])
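To see where the NoneType error came from: on a tag, .a means "the first <a> nested inside this tag". Each item in urls is already the <a> itself and has no <a> inside it, so .a is None and indexing it fails. A small sketch with a shortened version of the markup (the trimmed class list and the CSS selector are my own simplification, not the page's full HTML):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the product link in the question.
html = (
    '<a class="product sellout status-1" '
    'data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" '
    'href="https://mania.bg/p/pulover-alexander-mcqueen-p409648">'
    '<div class="brand">Alexander McQueen</div></a>'
)

soup = BeautifulSoup(html, "html.parser")

# Matching on a couple of classes is usually less brittle than the full class string.
urls = soup.select("a.product.sellout")

# .a searches *inside* the tag; there is no nested <a>, so this is None.
nested = urls[0].a

# Read the attributes straight from each matched <a>; .get() returns None
# instead of raising if an attribute is missing.
hrefs = [a.get("href") for a in urls]
product_urls = [a.get("data-producturl") for a in urls]
print(nested, hrefs, product_urls)
```

Iterating over the result directly also avoids the urls[num - 1] indexing, which starts at the last element rather than the first.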

What would be the equivalent of the following code in a .hbs file

I was working on a project involving node, mongoose, handlebars, express. What would be the equivalent of the following code in handlebars?
<div class="panel panel-default" ng-init="getExams()">
<div class="panel-heading">
<h3 class="panel-title">Exams</h3>
</div>
<div class="panel-body">
<div class="row">
<div ng-repeat="exam in exams">
<div class="col-md-6">
<div class="col-md-6">
<h4>{{exam.examName}}</h4>
<a class="btn btn-primary" href="#/exams/details/{{exam._id}}">View Details</a>
</div>
</div>
</div>
</div>
I want to display all the rows of the output of the following function in my model.
module.exports.getExams = (callback, limit) => {
    Exam.find(callback).limit(limit);
}
In Node.js, pass the exams to the view when rendering:
res.render('your view', {exams: exams})
In Handlebars:
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">Exams</h3>
</div>
<div class="panel-body">
<div class="row">
{{#each exams}}
<div>
<div class="col-md-6">
<div class="col-md-6">
<h4>{{examName}}</h4>
<a class="btn btn-primary" href="#/exams/details/{{_id}}">View Details</a>
</div>
</div>
</div>
{{/each}}
</div>
</div>
</div>

CMSMS does not display HTML code in editor

I have a very weird problem with CMSMS. I have some HTML in my header file where there should be 2 links, but every time I paste the second one and click Apply, it disappears from the editor, although it still shows on the website.
CMSMS version: 1.11.9
<div class="header">
<div class="container">
<div class="row-fluid">
<div class="logo_wrapper">
<div class="logo"><img src="{root_url}/ui/images/logo.png" alt="" /></div>
</div>
<div class="span6 pull-right">{if $sid == 1 }
<div class="kabinet pull-right"><a class="rounded" href="apps/customer/web/profile/edit"> Профиль</a></div>
{else}
<div class="kabinet pull-right"><a class="rounded dark" href="#"> Статус доставки</a> <a class="rounded" href="apps/customer/web/login"> Личный кабинет</a></div>
{/if}</div>
</div>
<hr />
<div>
<div class="nav">
<div class="nav-inner">{menu loadprops=0}</div>
</div>
</div>
</div>
</div>
After saving, it should look like this:
<div class="header">
<div class="container">
<div class="row-fluid">
<div class="logo_wrapper">
<div class="logo"><img src="{root_url}/ui/images/logo.png" alt="" /></div>
</div>
<div class="span6 pull-right">{if $sid == 1 }
<div class="kabinet pull-right"><a class="rounded" href="apps/customer/web/profile/edit"> Профиль</a>
(this link keeps disappearing)
<a class="rounded no-bg-color" href="apps/customer/web/logout"><i class="icon-off icon-white"></i></a>
</div>
{else}
<div class="kabinet pull-right"><a class="rounded dark" href="#"> Статус доставки</a> <a class="rounded" href="apps/customer/web/login"> Личный кабинет</a></div>
{/if}</div>
</div>
<hr />
<div>
<div class="nav">
<div class="nav-inner">{menu loadprops=0}</div>
</div>
</div>
</div>
</div>
It's probably the HTML editor (MicroTiny, I expect) deciding that the link is invalid HTML and throwing it away.
Set the editor to HTML mode so that you can edit the HTML directly (not WYSIWYG). It should then accept your edits and not try to rewrite them.
