beautifulsoup4: how to parse a forum post? - python-3.x

I have multiple occurrences of the following (simplified) data structure, as found in forum software:
<li id="post12345" class="anchorFixedHeader" style="order: 1">
<div class="messagesidebar member" item-prop="author">
<div class="messageauthor">
<div class="messageauthorcontainer">
<a id="mac12">
<span class="username" itemprop="text">MostInnovativeUsernameEver</span>
</a>
</div>
</div>
</div>
<div class="messagecontent">
<div class="messagebody">
<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>
</div>
</div>
</li>
What I want to do for each occurrence is to extract the username and, for each username, the corresponding messagecontent. I could do that successfully if it weren't for one problem: the quote. When I print the extracted data in the console, the structure of the quote (naturally) gets messed up.
What I seem to need is the text before the quote, the quote itself, and the text after the quote, so that I can deal with them separately. I tried a bunch of things but haven't quite found my way around BeautifulSoup yet.
Does what I am trying to do make sense?

If I understood your question correctly, here is one way to solve it:
import re
import bs4
html = """<li id="post12345" class="anchorFixedHeader" style="order: 1">
<div class="messagesidebar member" item-prop="author">
<div class="messageauthor">
<div class="messageauthorcontainer">
<a id="mac12">
<span class="username" itemprop="text">MostInnovativeUsernameEver</span>
</a>
</div>
</div>
</div>
<div class="messagecontent">
<div class="messagebody">
<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>
</div>
</div>
</li>"""
forum_posts = []
# Match each line from its first non-whitespace character to the end of the line,
# which skips blank lines and any leading indentation.
re_break_line = re.compile(r'\S.*')
soup = bs4.BeautifulSoup(html, features='html.parser')
posts = soup.find_all('li')
for post in posts:
    username = post.find('span', {'class': 'username'})
    content = post.find('div', {'class': 'messagecontent'})
    forum_post = {
        'username': username.text,
        'content': re_break_line.findall(content.text)
    }
    forum_posts.append(forum_post)
print(forum_posts)
Output console:
[{'content': ['Text before the quote.', "Hello, I'm a quote.", 'Text after the class.'], 'username': 'MostInnovativeUsernameEver'}]
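If the three parts need to stay separate rather than end up in one flat list, a minimal sketch along these lines might help. It assumes the blockquote is a direct child of the messagetext div and that the author's own text sits in plain text nodes around it; the helper name split_post and the trimmed-down markup are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed-down copy of the post markup from the question.
html = """<li id="post12345">
<span class="username" itemprop="text">MostInnovativeUsernameEver</span>
<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>
</li>"""

soup = BeautifulSoup(html, "html.parser")


def split_post(post):
    """Hypothetical helper: split one post into username, before, quote, after."""
    body = post.find("div", class_="messagetext")
    quote = body.find("blockquote", class_="quotebox")
    before, after = [], []
    seen_quote = False
    # Walk only the direct children: text nodes go to "before" until the
    # blockquote is reached, and to "after" from then on.
    for child in body.children:
        if child is quote:
            seen_quote = True
        elif isinstance(child, NavigableString) and child.strip():
            (after if seen_quote else before).append(child.strip())
    return {
        "username": post.find("span", class_="username").get_text(strip=True),
        "before": " ".join(before),
        "quote": quote.get_text(strip=True) if quote else None,
        "after": " ".join(after),
    }


result = split_post(soup.find("li"))
print(result)
```

Other tags inside the message body (bold text, links and so on) are ignored by this sketch, so it would need extending for posts with richer markup.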

Related

How to extract text from a div and a p tag and append to a dataframe using beautifulsoup

<div class="col-xs-10 fullWidth">
<div class="col-xs-8ths halfWidth">
<div class="title">
10 YEAR
</div>
<div class="header noimage">
<p class="numbers">7.48%</p>
</div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">
7 YEAR
</div>
<div class="header noimage">
<p class="numbers">7.07%</p>
</div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">
5 YEAR
</div>
<div class="header noimage">
<p class="numbers">5.32%</p>
</div>
I am trying to extract the text of each <div class="title"> and each <p class="numbers"> into a dataframe.
I cannot get past the code below; for some reason I can't seem to chain two find_all searches in succession.
result = soup.find_all('div', attrs={'id':Fund_Code})
periods = result.find_all('div', attrs={'class': 'title'})
periods
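One likely cause is that find_all returns a list-like ResultSet, which has no find_all of its own, whereas find returns a single tag that can be searched again. A minimal sketch, assuming the wrapper div carries the fund's id (the id value FUND123 and the variable names are made up, since the question does not show them):

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question; the id on the wrapper is assumed.
html = """<div id="FUND123" class="col-xs-10 fullWidth">
<div class="col-xs-8ths halfWidth">
<div class="title">
10 YEAR
</div>
<div class="header noimage">
<p class="numbers">7.48%</p>
</div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">
7 YEAR
</div>
<div class="header noimage">
<p class="numbers">7.07%</p>
</div>
</div>
</div>"""

fund_code = "FUND123"  # hypothetical id, stands in for Fund_Code
soup = BeautifulSoup(html, "html.parser")

# find returns one Tag (or None), so a second search can be run on it.
container = soup.find("div", attrs={"id": fund_code})
periods = [d.get_text(strip=True) for d in container.find_all("div", class_="title")]
values = [p.get_text(strip=True) for p in container.find_all("p", class_="numbers")]

# One record per period, pairing each title with its number.
rows = [{"period": period, "return": value} for period, value in zip(periods, values)]
print(rows)
```

If pandas is available, passing rows to pandas.DataFrame should give a dataframe with one row per period.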

Parse a challenging block of html with beautifulsoup

I've tried a lot of options with beautifulsoup but cannot seem to figure how to parse the following:
<div class="docSection profileQuestionsSection">
<div id="D_memberProfileQuestions" class="dotted-section">
<div id="D_memberProfileMeta" class="line">
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Location:</h4>
<p itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span class="locality" itemprop="addressLocality">website</span>, <span class="region" itemprop="addressRegion">WA</span><span class="display-none country-name" itemprop="addressCountry">USA</span>
</p>
</div>
</div>
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Member since:</h4>
<p>July 14, 2021</p>
</div>
</div>
<div class="size1of3 lastUnit">
<div class="D_memberProfileContentItem">
</div>
</div>
</div>
<div class="line">
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">What types of events in the area interest you?</h4>
<p class="D_empty">No answer yet</p>
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Introduction</h4>
<p class="D_empty">No introduction yet</p>
</div>
</div>
</div>
From the above snippet I'm trying to extract the following text:
What types of events in the area interest you?
No answer yet
If I try the following, it just prints an empty list []. What might I be doing wrong?
req=requests.get(member)
soupp=BeautifulSoup(req.text, "html.parser")
div=soupp.find('div',attrs={"class":"D_memberProfileContentItem"})
children=div.findChildren("div", recursive=True)
for child in children:
    print(child)
Any thoughts? Thanks.
from bs4 import BeautifulSoup
html = '''<div class="docSection profileQuestionsSection">
<div id="D_memberProfileQuestions" class="dotted-section">
<div id="D_memberProfileMeta" class="line">
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Location:</h4>
<p itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<a href="https://www.website.com/cities/us/97298/"><span class="locality"
itemprop="addressLocality">website</span>, <span class="region"
itemprop="addressRegion">WA</span></a><span class="display-none country-name"
itemprop="addressCountry">USA</span>
</p>
</div>
</div>
<div class="unit size1of3">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Member since:</h4>
<p>July 14, 2021</p>
</div>
</div>
<div class="size1of3 lastUnit">
<div class="D_memberProfileContentItem">
</div>
</div>
</div>
<div class="line">
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">What types of events in the area interest you?</h4>
<p class="D_empty">No answer yet</p>
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Introduction</h4>
<p class="D_empty">No introduction yet</p>
</div>
</div>
</div>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('.D_memberProfileContentItem:nth-child(3) > p').text)
Output:
No answer yet
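The nth-child selector depends on the position of the block in the page. As an alternative sketch, each content item can be turned into a heading/answer pair and looked up by the question text; this assumes every item holds at most one h4 and one p, and the markup below is a shortened copy of the one above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the profile markup from the question.
html = """<div id="D_memberProfileQuestions" class="dotted-section">
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Member since:</h4>
<p>July 14, 2021</p>
</div>
<div class="D_memberProfileContentItem">
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">What types of events in the area interest you?</h4>
<p class="D_empty">No answer yet</p>
</div>
<div class="D_memberProfileContentItem">
<h4 class="flush--bottom">Introduction</h4>
<p class="D_empty">No introduction yet</p>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Map each heading to the paragraph that follows it inside the same item.
answers = {}
for item in soup.find_all("div", class_="D_memberProfileContentItem"):
    heading = item.find("h4")
    paragraph = item.find("p")
    if heading and paragraph:  # skip empty items
        answers[heading.get_text(strip=True)] = paragraph.get_text(strip=True)

print(answers["What types of events in the area interest you?"])
```

The attempt in the question probably printed nothing because find returns only the first content item, and that item contains no nested div for findChildren("div") to return.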

Find string with tag search inside a line using BeautifulSoup

I want to extract holy place from <p class="answer"> <i class="fa fa-circle" aria-hidden="true"></i> holy place</p>
and plays from
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> plays</p>
HTML Source Code
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">A pilgrim is a person who undertakes a journey to a --- <br>
</h1>
<div class="card-body">
<div class="row">
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> holy place</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a mosque</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a bazar</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a new country</p>
</div>
</div>
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">Shakespeare is known mostly for his--- <br>
</h1>
<div class="card-body">
<div class="row">
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> poetry</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> novels</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> autobiography</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> plays</p>
</div>
</div>
My code
question_block = soup.find_all('div', attrs = {'class':'card card-custom custom-color'})
right_answer = question_block.find('p', attrs={'class':'answer','i':'fa fa-circle'}).get_text(strip=True)
Getting output: None
Thanks in advance and your answer will be highly appreciated.
Happy Coding :)
You want to apply the appropriate CSS selector to each question block. In this case .answer > .fa-circle selects the i tag that marks the correct answer, and its next_sibling is the text node holding the value you want:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
question_blocks = soup.find_all('div', attrs = {'class':'card card-custom custom-color'})
for q in question_blocks:
    # print(q)
    print(q.select_one('.card-header').text)
    print(q.select_one('.answer > .fa-circle').next_sibling.strip())
    print('*' * 50)
I treated your data as HTML, used a CSS selector to locate each matching i tag, and then looped over the results to find the enclosing p tag, which contains the correct answer text:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"html.parser")
main_div=soup.select("p > i.fa.fa-circle")
for data in main_div:
    print(data.find_previous('p').text)
Output:
holy place
plays
You can directly select the p with class name answer and extract the text inside it. Note that find returns only the first match, which is not necessarily the correct answer:
x = soup.find('p', class_="answer")
print(x.text)
This code will extract only holy place and plays from p tags.
p = soup.find_all('p', class_='answer')
for i in p:
    if i.text.strip() in ('plays', 'holy place'):
        print(i.text.strip())
Output:
holy place
plays

Press "Visit product" button only if an <article> has an <li> class of "availability"

I have the following source code:
<form method="POST" data-component="compareForm" action="#">
<div class="row tsp" data-component="list-page-product">
<article id="123">
<div id='product'>
</div>
<div class="stock">
<ul class="simple" data-product="availability">
<li class="available">
<i class="icon-tick"></i>
<span>Delivery available</span></li>
</ul>
</div>
<div data-component="CT">
<button class="TT" type="button">Visit product</button>
</div>
</article>
<article id="1234">
<div id='product'>
</div>
<div class="stock">
<ul class="simple" data-product="availability">
<li class="available">
<i class="icon-tick"></i>
<span>Delivery available</span></li>
</ul>
</div>
<div data-component="CT">
<button class="TT" type="button">Visit product</button>
</div>
</article>
</div>
</form>
I would like to press the "Visit product" button if I find a class name of "available". In this example only article id="123" should be a match.
My code is:
if self.driver.find_elements_by_xpath("//li[@class='available']"):
    self.driver.find_element_by_xpath('//*[@class="TT"]').click()
The first error is that it cannot locate an element using XPath. I don't know what to do next. Any input is much appreciated. Thank you!
If I were you, I would find the articles, iterate over them, and click the button whenever the available class is found.
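from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
d.get('URL')
# find_elements(By.XPATH, ...) works in Selenium 3 and 4; the
# find_element(s)_by_* helpers were removed in Selenium 4.3.
articles = d.find_elements(By.XPATH, '//article')
for article in articles:
    try:
        # The leading "." keeps each search inside the current article.
        article.find_element(By.XPATH, './/li[@class="available"]')
        article.find_element(By.XPATH, './/button[@class="TT"]').click()
    except NoSuchElementException:
        # No available item (or no button) in this article; skip it.
        pass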
If the button's class attribute is not exactly 'TT', or the li's class attribute is not exactly 'available', this exact-match XPath will not work. In that case you could use find_element_by_class_name instead.
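The per-article logic can be checked without a browser. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Selenium, only to show how a search that starts with ".//" stays inside one article. The markup is a shortened copy of the question's, and the second article has been changed to class "unavailable" (an assumption, so that only article 123 matches).

```python
import xml.etree.ElementTree as ET

# Shortened markup; the "unavailable" class on the second article is assumed.
markup = """<div class="row tsp" data-component="list-page-product">
<article id="123">
<div class="stock">
<ul class="simple" data-product="availability">
<li class="available"><span>Delivery available</span></li>
</ul>
</div>
<div data-component="CT">
<button class="TT" type="button">Visit product</button>
</div>
</article>
<article id="1234">
<div class="stock">
<ul class="simple" data-product="availability">
<li class="unavailable"><span>Out of stock</span></li>
</ul>
</div>
<div data-component="CT">
<button class="TT" type="button">Visit product</button>
</div>
</article>
</div>"""

root = ET.fromstring(markup)

clickable = []
for article in root.findall(".//article"):
    # ".//" searches only below the current article, not the whole document.
    if article.find(".//li[@class='available']") is not None:
        button = article.find(".//button[@class='TT']")
        clickable.append((article.get("id"), button.text))

print(clickable)
```

In a real Selenium session the same relative expressions would be passed to find_element on each article element, and click() called on the button that is found.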

Loop through the same div class and grab text using Python WebDriver

I have done everything with Selenium and WebDriver, but I am not sure how to get the text from ALL divs with class Text3. I also have a problem with div id="TableStart_00023": the number changes now and then (TableStart_00023, TableStart_0283, etc.).
Here is the HTML part of the code:
<div data-reactroot="" id="TableStart_00023">
<ul>
<li class="FirstRow03">
<a class="aClass">
<div class="innerCl">
<div class="Text1"></div>
<div class="Text2"></div>
<div class="Text3">Wanted data</div>
<div class="Text4"></div>
</div>
</a>
</li>
<li class="FirstRow02">
<a class="aClass">
<div class="innerCl">
<div class="Text1"></div>
<div class="Text2"></div>
<div class="Text3">Wanted data 2</div>
<div class="Text4"></div>
</div>
</a>
</li>
</ul>
</div>
Here is the Python part of what I have done:
for content in driver.find_elements_by_id('TableStart_00023'):
    mytext = content.find_element_by_xpath('.//div[@class="Text3"]').text
    print(mytext)
How can I loop through all divs with class Text3 and get their text when the number in the TableStart ID changes? What am I doing wrong?
This XPath will return all div elements with class Text3 from your table:
//div[starts-with(@id,'TableStart')]//div[@class='Text3']
Once you have all these elements (using driver.find_elements_by_xpath), you can get the text from each of them.
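The same prefix match can be tried offline on a copy of the markup. This sketch swaps in BeautifulSoup and a CSS attribute-prefix selector in place of the Selenium XPath call, so it shows the matching idea rather than the WebDriver API; the markup is a shortened copy of the question's.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """<div data-reactroot="" id="TableStart_00023">
<ul>
<li class="FirstRow03">
<div class="innerCl">
<div class="Text1"></div>
<div class="Text3">Wanted data</div>
<div class="Text4"></div>
</div>
</li>
<li class="FirstRow02">
<div class="innerCl">
<div class="Text1"></div>
<div class="Text3">Wanted data 2</div>
<div class="Text4"></div>
</div>
</li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# [id^="TableStart"] matches any id that starts with TableStart, so the
# changing number after the prefix does not matter.
texts = [
    div.get_text(strip=True)
    for div in soup.select('div[id^="TableStart"] div.Text3')
]
print(texts)
```

In a live session the markup could come from driver.page_source, or the XPath above could be used directly with the driver.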
