How to get the exact text from elements that share a class name - python-3.x

I am new to Python.
I am trying to scrape data from a website, but I have failed to extract the data I need.
Here is my Python code:
import requests
from bs4 import BeautifulSoup
url = 'https://v2.sherpa.ac.uk/view/publisher_list/1.html'
r = requests.get(url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')
title = soup.title
print(soup.find_all("div", {"class" :["ep_view_page ep_view_page_view_publisher_list", "row"]}))
The problem is that I need the data inside the div with class row, but there are two levels of div with the class name row.
One more thing: what should I write to get the data from multiple URLs and pages? The divs with class col span-6 and col span-3 contain links (href), and clicking one of them opens a new page.
<div class="row">
<div class="col span-6">
'Grigore Antipa' National Museum of Natural History
</div>
<div class="col span-3">
<strong>Romania</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
And here is the structure of the page:
<div class="row">
<h1 class="h1_like_h2">Publishers</h1>
<div class="ep_view_page ep_view_page_view_publisher_list">
</p><div class="row">
<div class="col span-6">
'Grigore Antipa' National Museum of Natural History
</div>
<div class="col span-3">
<strong>Romania</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_=28"></a><h2>(</h2><p>
</p><div class="row">
<div class="col span-6">
(ISC)²
</div>
<div class="col span-3">
<strong>United States of America</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_1"></a><h2>1</h2><p>
</p><div class="row">
<div class="col span-6">
1066 Tidsskrift for historie
</div>
<div class="col span-3">
<strong>Denmark</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_A"></a><h2>A</h2><p>
and so on...

I'm not really sure if that's what you wanted, but I've had some fun getting this stuff.
Basically, the code below scrapes the entire page (4487 entries) for the following fields. The values shown are just a sample of the data:
The name of the entity - ANZAMEMS (Australian and New Zealand Association for Medieval and Early Modern Studies)
The URL to its sub-page - http://v2.sherpa.ac.uk/id/publisher/1853?template=romeo
The country - Australia
The view count - 1
The so called "publisher url" - https://v2.sherpa.ac.uk//view/publication_by_publisher/1853.html
and spits all of this out to a .csv file.
Here's the code:
import csv
import requests
from bs4 import BeautifulSoup


def make_soup():
    p = requests.get("https://v2.sherpa.ac.uk/view/publisher_list/1.html").text
    return BeautifulSoup(p, "html.parser")


main_soup = make_soup()
col_span_6_soup = main_soup.find_all("div", {"class": "col span-6"})
col_span_3_soup = main_soup.find_all("div", {"class": "col span-3"})


def get_names_and_urls():
    # find() returns None for cells without a link, so filter those out
    data = [a.find("a") for a in col_span_6_soup]
    return [[i.text, i.get("href")] for i in data if i is not None and "romeo" in i.get("href", "")]


def get_countries():
    # Country and publication-count cells alternate, so take every other one
    return [c.find("strong").text for c in col_span_3_soup[::2]]


def get_views_and_publisher():
    return [
        [
            i.find("strong").text.replace(" [view ]", ""),
            f"https://v2.sherpa.ac.uk/{i.find('a').get('href')}",
        ] for i in col_span_3_soup[1::2]
    ]


table = zip(get_names_and_urls(), get_countries(), get_views_and_publisher())
# newline="" stops the csv module from writing blank rows on Windows
with open("loads_of_data.csv", "w", newline="", encoding="utf-8") as output:
    w = csv.writer(output)
    w.writerow(["NAME", "URL", "COUNTRY", "VIEWS", "PUBLISHER_URL"])
    for col1, col2, col3 in table:
        w.writerow([*col1, col2, *col3])
print("You've got all the data!")
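If it helps, here is a minimal sketch of a row-by-row variant, which keeps the name, country and count of each publisher together instead of zipping three separate lists. It runs on a small inline sample rather than the live site. The sample markup (in particular the two a tags) is an assumption modelled on the snippets and URLs above, and SAMPLE_HTML, BASE_URL and parse_rows are names made up for this illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://v2.sherpa.ac.uk/view/publisher_list/1.html"

# Hypothetical sample modelled on the listing page; the <a> tags are assumed.
SAMPLE_HTML = """
<div class="row">
<h1 class="h1_like_h2">Publishers</h1>
<div class="ep_view_page ep_view_page_view_publisher_list">
<div class="row">
<div class="col span-6">
<a href="http://v2.sherpa.ac.uk/id/publisher/1853?template=romeo">ANZAMEMS</a>
</div>
<div class="col span-3">
<strong>Australia</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [<a href="/view/publication_by_publisher/1853.html">view</a> ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
</div>
</div>
"""


def parse_rows(html, base_url=BASE_URL):
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # Only rows nested inside the publisher list, not the outer page-level row.
    for row in soup.select("div.ep_view_page_view_publisher_list div.row"):
        name_cell = row.select_one("div.col.span-6")
        detail_cells = row.select("div.col.span-3")
        if name_cell is None or len(detail_cells) < 2:
            continue  # skip rows without the expected three cells
        name_link = name_cell.find("a")
        count_link = detail_cells[1].find("a")
        records.append({
            "name": name_cell.get_text(strip=True),
            "url": name_link.get("href") if name_link else None,
            "country": detail_cells[0].find("strong").get_text(strip=True),
            # "1 [view]" -> "1"
            "count": detail_cells[1].find("strong").get_text(strip=True).split()[0],
            # urljoin resolves the relative href against the page URL
            "publisher_url": urljoin(base_url, count_link["href"]) if count_link else None,
        })
    return records


records = parse_rows(SAMPLE_HTML)
print(records)
```

The selector div.col.span-6 matches any div carrying both classes, regardless of their order in the attribute. To follow the links, the publisher_url values could be passed to requests.get one at a time and parsed the same way.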

Related

How to extract text from a div and a p tag and append to a dataframe using beautifulsoup

<div class="col-xs-10 fullWidth">
<div class="col-xs-8ths halfWidth">
<div class="title">
10 YEAR
</div>
<div class="header noimage">
<p class="numbers">7.48%</p>
</div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">
7 YEAR
</div>
<div class="header noimage">
<p class="numbers">7.07%</p>
</div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">
5 YEAR
</div>
<div class="header noimage">
<p class="numbers">5.32%</p>
</div>
I am trying to extract the text of each <div class="title"> and each <p class="numbers"> into a dataframe.
I cannot get past the code below; for some reason I can't seem to chain two find_all searches in succession.
result = soup.find_all('div', attrs={'id':Fund_Code})
periods = result.find_all('div', attrs={'class': 'title'})
periods
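A possible explanation, offered as a sketch rather than a definitive fix: find_all returns a list-like ResultSet, which has no find_all method of its own, so the second search has to run on a single Tag. One way is to take the container with find and then loop over its blocks. The id FUND1 and the variable names below are assumptions, since the question does not show the container or the value of Fund_Code.

```python
from bs4 import BeautifulSoup

# Markup from the question, wrapped in a container with a hypothetical id.
html = """
<div id="FUND1" class="col-xs-10 fullWidth">
<div class="col-xs-8ths halfWidth">
<div class="title">10 YEAR</div>
<div class="header noimage"><p class="numbers">7.48%</p></div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">7 YEAR</div>
<div class="header noimage"><p class="numbers">7.07%</p></div>
</div>
<div class="col-xs-8ths halfWidth">
<div class="title">5 YEAR</div>
<div class="header noimage"><p class="numbers">5.32%</p></div>
</div>
</div>
"""

fund_code = "FUND1"  # assumed value
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None), which can be searched again.
container = soup.find("div", attrs={"id": fund_code})

rows = []
for block in container.find_all("div", class_="halfWidth"):
    title = block.find("div", class_="title")
    number = block.find("p", class_="numbers")
    if title is not None and number is not None:
        rows.append({
            "period": title.get_text(strip=True),
            "value": number.get_text(strip=True),
        })

print(rows)
```

If pandas is available, pandas.DataFrame(rows) turns the list of dicts into a dataframe with period and value columns.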

Find string with tag search inside a line using BeautifulSoup

I want to extract holy place from <p class="answer"> <i class="fa fa-circle" aria-hidden="true"></i> holy place</p>
and plays from
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> plays</p>
HTML Source Code
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">A pilgrim is a person who undertakes a journey to a --- <br>
</h1>
<div class="card-body">
<div class="row">
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> holy place</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a mosque</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a bazar</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a new country</p>
</div>
</div>
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">Shakespeare is known mostly for his--- <br>
</h1>
<div class="card-body">
<div class="row">
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> poetry</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> novels</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> autobiography</p>
</div>
<div class="col-sm-12">
<p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> plays</p>
</div>
</div>
My code
question_block = soup.find_all('div', attrs = {'class':'card card-custom custom-color'})
right_answer = question_block.find('p', attrs={'class':'answer','i':'fa fa-circle'}).get_text(strip=True)
Getting output: None
Thanks in advance and your answer will be highly appreciated.
Happy Coding :)
You want to call the appropriate css pattern on each question block. In this case .answer > .fa-circle will move you adjacent to the value you want, and next_sibling will then return the value you want:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
question_blocks = soup.find_all('div', attrs = {'class':'card card-custom custom-color'})
for q in question_blocks:
    # print(q)
    print(q.select_one('.card-header').text)
    print(q.select_one('.answer > .fa-circle').next_sibling.strip())
    print('*' * 50)
I have taken your data as html and used a CSS selector to locate the i tags, then looped over them to find the enclosing p tag, which contains the correct answer text:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"html.parser")
main_div=soup.select("p > i.fa.fa-circle")
for data in main_div:
    print(data.find_previous('p').text)
Output:
holy place
plays
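You can select the first p with the class name answer directly and extract the text inside it. Note that find returns only the first match, so this yields only the first answer, holy place: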
x = soup.find('p', class_="answer")
print(x.text)
This code will extract only holy place and plays from the p tags.
p = soup.find_all('p', class_='answer')
for i in p:
    if i.text.strip() in ('plays', 'holy place'):
        print(i.text.strip())
Output:
holy place
plays
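To tie each right answer to its question, the selector approach above can be sketched as a small dictionary builder. It relies on the fact that the class selector .fa-circle matches only the exact class token fa-circle, so icons with fa-circle-o are skipped. The sample below is a trimmed copy of the question's markup with the closing tags filled in, and the answers dict is a name chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, with closing tags added.
html = """
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">A pilgrim is a person who undertakes a journey to a ---</h1>
<div class="card-body"><div class="row">
<div class="col-sm-12"><p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> holy place</p></div>
<div class="col-sm-12"><p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> a mosque</p></div>
</div></div>
</div>
<div class="card card-custom custom-color">
<h1 class="card-header card-custom-font">Shakespeare is known mostly for his---</h1>
<div class="card-body"><div class="row">
<div class="col-sm-12"><p class="answer"><i class="fa fa-circle-o" aria-hidden="true"></i> poetry</p></div>
<div class="col-sm-12"><p class="answer"><i class="fa fa-circle" aria-hidden="true"></i> plays</p></div>
</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

answers = {}
for card in soup.select("div.card.card-custom"):
    question = card.select_one(".card-header").get_text(strip=True)
    # i.fa-circle does not match class="fa fa-circle-o"
    marker = card.select_one("p.answer > i.fa-circle")
    if marker is not None:
        answers[question] = marker.parent.get_text(strip=True)

print(answers)
```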

How to not include a particular element from soup.select()?

I use soup.select('.c-w a') to select elements. Inside c-w there is an element with class c-s, which I would like to exclude from this selection.
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<img class="soundpng" src="file://sound.png"/>
</div></div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    a['href'] = 'entry://'
and the result is
<div class="c-w">
<div class="c-s">
<img class="soundpng" src="file://sound.png"/>
</div></div>
My goal is to leave .c-s a out of this replacement. I mean that when the search meets c-s, it should ignore that element and continue with the other ones. Could you please elaborate on how to achieve this?
Based on your comments, you can use .find_parent() to determine if the <a> tag is inside tag with class="c-s":
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<img class="soundpng" src="file://sound.png"/>
</div>
<div>
...
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    if a.find_parent(class_='c-s'):
        continue
    a['href'] = 'entry://'
print(soup.prettify())
Prints:
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3">
<img class="soundpng" src="file://sound.png"/>
</a>
</div>
<div>
<a href="entry://">
...
</a>
</div>
</div>
EDIT: To exclude both .c-s and .c-v, you can do this:
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<img class="soundpng" src="file://sound.png"/>
</div>
<div class="c-v">
<img class="soundpng" src="file://sound.png"/>
</div>
<div>
...
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    if a.find_parent(class_=['c-s', 'c-v']):
        continue
    a['href'] = 'entry://'
print(soup.prettify())
Prints:
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3">
<img class="soundpng" src="file://sound.png"/>
</a>
</div>
<div class="c-v">
<a href="sound://english-french/sound/M000001099.mp3">
<img class="soundpng" src="file://sound.png"/>
</a>
</div>
<div>
<a href="entry://">
...
</a>
</div>
</div>
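As an alternative sketch, the exclusion can probably also be pushed into the selector itself with :not(). This assumes a BeautifulSoup version whose select is backed by the soupsieve package (4.7 or later), which accepts complex selectors inside :not(). The href values in the sample are placeholders made up for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the href values are placeholders.
txt = """
<div class="c-w">
<div class="c-s"><a href="sound://one.mp3"><img class="soundpng" src="file://sound.png"/></a></div>
<div class="c-v"><a href="sound://two.mp3"><img class="soundpng" src="file://sound.png"/></a></div>
<div><a href="old://link">entry</a></div>
</div>
"""

soup = BeautifulSoup(txt, "html.parser")

# Select links under .c-w, except those nested inside .c-s or .c-v.
for a in soup.select(".c-w a:not(.c-s a, .c-v a)"):
    a["href"] = "entry://"

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```

The find_parent loop is easier to debug; the selector version keeps the whole rule in one string.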

beautifulsoup4: how to parse a forum post?

I have multiple occurrences of the following (simplified) data structure, as found in forum software:
<li id="post12345" class="anchorFixedHeader" style="order: 1">
<div class="messagesidebar member" item-prop="author">
<div class="messageauthor">
<div class="messageauthorcontainer">
<a id="mac12">
<span class="username" itemprop="text">MostInnovativeUsernameEver</span>
</a>
</div>
</div>
</div>
<div class="messagecontent">
<div class="messagebody">
<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>
</div>
</div>
</li>
What I want to do for each occurrence is to extract the username and, for each username, the corresponding messagecontent. I could do that successfully if it weren't for one problem: the quote. When I print the extracted data in the console, the structure of the quote (naturally) gets messed up.
What I seem to need is the text before the quote, the quote itself and the text after the quote, so that I can deal with them separately. I tried a bunch of stuff but haven't quite found my way around BeautifulSoup just yet.
Ugh ... do you guys understand what I'm trying to do?
Well, if I understood your question, here is a way to solve:
import re
import bs4
html = """<li id="post12345" class="anchorFixedHeader" style="order: 1">
<div class="messagesidebar member" item-prop="author">
<div class="messageauthor">
<div class="messageauthorcontainer">
<a id="mac12">
<span class="username" itemprop="text">MostInnovativeUsernameEver</span>
</a>
</div>
</div>
</div>
<div class="messagecontent">
<div class="messagebody">
<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>
</div>
</div>
</li>"""
forum_posts = []
re_break_line = re.compile(r'[^\n].*')
soup = bs4.BeautifulSoup(html, features='html.parser')
posts = soup.find_all('li')
for post in posts:
    username = post.find('span', {'class': 'username'})
    content = post.find('div', {'class': 'messagecontent'})
    # Split the text into lines, trim each one and drop whitespace-only lines
    lines = [line.strip() for line in re_break_line.findall(content.text)]
    forum_post = {
        'username': username.text,
        'content': [line for line in lines if line]
    }
    forum_posts.append(forum_post)
print(forum_posts)
Output console:
[{'content': ['Text before the quote.', "Hello, I'm a quote.", 'Text after the class.'], 'username': 'MostInnovativeUsernameEver'}]
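If the three parts need to be handled separately rather than as one list of lines, a sketch along these lines might work: locate the blockquote and collect the bare text nodes on either side of it. The helper name split_around_quote is made up for this example, and it assumes at most one quote per message and ignores any other tags that sit next to the quote.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>"""


def split_around_quote(message):
    """Return (text before, quote text, text after) for one message Tag."""
    quote = message.find("blockquote", class_="quotebox")
    if quote is None:
        # No quote: everything counts as "before".
        return message.get_text(strip=True), None, ""
    # previous_siblings walks backwards, so restore document order first.
    before_nodes = reversed(list(quote.previous_siblings))
    before = "".join(
        str(n) for n in before_nodes if isinstance(n, NavigableString)
    ).strip()
    after = "".join(
        str(n) for n in quote.next_siblings if isinstance(n, NavigableString)
    ).strip()
    return before, quote.get_text(strip=True), after


soup = BeautifulSoup(html, "html.parser")
message = soup.find("div", class_="messagetext")
before, quote, after = split_around_quote(message)
print(before, quote, after, sep=" | ")
```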

How to get a number from an array

[<div class="ticket_type">‐ Help With Steam Workshop<a
href="javascript:
jsTicketsLast7Days.getOptions().appendValueToParam( 'requestid',
'29' ); jsTicketsLast7Days.getOptions().showSelectedRange( true
); $J('#TicketsLast7Days').get(0).scrollIntoView();"> + </a>
</div>,
<div class="ticket_last_24 report_table_right">
<span>15</span>
<span>(</span><span
class="change_increase">+36%</span><span>)</span>
</div>,
<div class="ticket_last_week report_table_right">
<span>271</span>
<span>(</span><span
class="change_increase">+632%</span><span>)</span>
</div>,
<div class="ticket_waiting_not_elevated
report_table_right">0</div>,
<div class="ticket_waiting_elevated
report_table_right">37</div>,
[]]
How can I get only the numbers 0 and 37 from the ticket_waiting_not_elevated report_table_right and ticket_waiting_elevated report_table_right divs?
Maybe this could help:
text ="""[<div class="ticket_type">‐ Help With Steam Workshop + </div>,
<div class="ticket_last_24 report_table_right"><span>15</span><span>(</span><span class="change_increase">+36%</span><span>)</span> </div>,
<div class="ticket_last_week report_table_right"> <span>271</span><span>(</span><span class="change_increase">+632%</span><span>)</span></div>,
<div class="ticket_waiting_not_elevated report_table_right">0</div>,
<div class="ticket_waiting_elevated report_table_right">37</div>,
[]]"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
for i in soup.find_all('div', attrs={'class': ['ticket_waiting_not_elevated report_table_right', 'ticket_waiting_elevated report_table_right']}):
    print(i.get('class')[0], ':', i.text)
# Output is: ticket_waiting_not_elevated : 0
# ticket_waiting_elevated : 37
You can use select() for getting your data:
data = """[<div class="ticket_type">‐ Help With Steam Workshop<a
href="javascript:
jsTicketsLast7Days.getOptions().appendValueToParam( 'requestid',
'29' ); jsTicketsLast7Days.getOptions().showSelectedRange( true
); $J('#TicketsLast7Days').get(0).scrollIntoView();"> + </a>
</div>,
<div class="ticket_last_24 report_table_right">
<span>15</span>
<span>(</span><span
class="change_increase">+36%</span><span>)</span>
</div>,
<div class="ticket_last_week report_table_right">
<span>271</span>
<span>(</span><span
class="change_increase">+632%</span><span>)</span>
</div>,
<div class="ticket_waiting_not_elevated
report_table_right">0</div>,
<div class="ticket_waiting_elevated
report_table_right">37</div>,
[]]"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
nums = soup.select('''div.ticket_waiting_not_elevated.report_table_right,
div.ticket_waiting_elevated.report_table_right''')
print([num.text for num in nums])
Prints:
['0', '37']
The soup.select('div.ticket_waiting_not_elevated.report_table_right, div.ticket_waiting_elevated.report_table_right') selects all divs with ticket_waiting_not_elevated report_table_right class or ticket_waiting_elevated report_table_right class.
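Building on the select() approach, here is a small sketch that stores the values as integers keyed by class name, which may be handier than a list of strings. The WANTED tuple and the counts dict are names invented for this example, and the sample is a shortened version of the data above.

```python
from bs4 import BeautifulSoup

data = """
<div class="ticket_last_24 report_table_right"><span>15</span></div>
<div class="ticket_waiting_not_elevated report_table_right">0</div>
<div class="ticket_waiting_elevated report_table_right">37</div>
"""

soup = BeautifulSoup(data, "html.parser")

WANTED = ("ticket_waiting_not_elevated", "ticket_waiting_elevated")

# Build "div.a.report_table_right, div.b.report_table_right"
selector = ", ".join(f"div.{name}.report_table_right" for name in WANTED)

# div["class"] is a list of class tokens; the first one names the column.
counts = {
    div["class"][0]: int(div.get_text(strip=True))
    for div in soup.select(selector)
}

print(counts)
```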

Resources