remove duplicated span tag content - python-3.x

<div class="ticket_last_24 report_table_right">
<span>13,978</span>
<span>(</span><span
class="change_increase">+2.3%
</span><span>)</span>
</div>
<div class="ticket_last_week report_table_right">
<span>99,585</span>
<span>(</span><span
class="change_increase">+0.6%
</span><span>)</span>
</div>
<div class="ticket_last_24 report_table_right">
<span>12121</span>
<span>(</span><span
class="change_increase">+2.3%
</span><span>)</span>
</div>
<div class="ticket_last_week report_table_right">
<span>99,222</span>
<span>(</span><span
class="change_increase">+0.6%
</span><span>)</span>
</div>
I tried the code below:
text = []
from bs4 import BeautifulSoup
TicketNumber = soup.find_all("div")
for div in TicketNumber:
    text.append(div.find("span"))
It prints out:
[
'13,978',
'13,978',
'99,585',
'12,121',
'12,121',
'99,222'
]
Not sure why the first number is printed twice. I only want the numbers ['13,978', '99,585', '12,121', '99,222']. There is no duplicate number in the same tag.

When I do this:
text = []
TicketNumber = soup.find_all("div")
for div in TicketNumber:
    text.append(div.find("span").get_text())
print(text)
I get this:
['13,978', '99,585', '12,121', '99,222']
Could you please give this a shot and confirm if this works?
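The post does not say why the first number came out twice. One plausible cause, assumed here rather than confirmed by the question, is that each pair of divs sits inside a wrapper div on the real page: find_all("div") then also returns the wrapper, and find("span") on the wrapper returns its first nested span a second time. A minimal sketch with a made-up wrapper (the class name "row" is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the wrapper <div class="row"> is an assumption,
# it does not appear in the question.
html = """
<div class="row">
<div class="ticket_last_24 report_table_right">
<span>13,978</span>
<span>(</span><span class="change_increase">+2.3%</span><span>)</span>
</div>
<div class="ticket_last_week report_table_right">
<span>99,585</span>
<span>(</span><span class="change_increase">+0.6%</span><span>)</span>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every div, including the wrapper, contributes its first descendant span.
naive = [div.find("span").get_text() for div in soup.find_all("div")]

# Restricting the search to the divs that hold the numbers avoids the repeat.
targeted = [
    div.find("span").get_text()
    for div in soup.select("div.report_table_right")
]

print(naive)     # ['13,978', '13,978', '99,585']
print(targeted)  # ['13,978', '99,585']
```

If this is what happens on the real page, selecting by the report_table_right class (or any class shared only by the inner divs) should be enough.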

Related

Press "Visit product" button only if an <article> has an <li> class of "availability"

I have the following source code:
<form method="POST" data-component="compareForm" action="#">
<div class="row tsp" data-component="list-page-product">
<article id="123">
<div id='product'>
</div>
<div class="stock">
<ul class="simple" data-product="availability">
<li class="available">
<i class="icon-tick"></i>
<span>Delivery available</span></li>
</ul>
</div>
<div data-component="CT">
<button class="TT" type="button">Visit product</button>
</div>
</article>
<article id="1234">
<div id='product'>
</div>
<div class="stock">
<ul class="simple" data-product="availability">
<li class="available">
<i class="icon-tick"></i>
<span>Delivery available</span></li>
</ul>
</div>
<div data-component="CT">
<button class="TT" type="button">Visit product</button>
</div>
</article>
</div>
</form>
I would like to press the "Visit product" button if I find a class name of "available". In this example only article id="123" should be a match.
My code is:
if self.driver.find_elements_by_xpath("//li[@class='available']"):
    self.driver.find_element_by_xpath('//*[@class="TT"]').click()
The first error is that it cannot locate an element using XPath. I don't know what to do next. Any input is much appreciated. Thank you!
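If I were you, I would search for the articles, iterate over them, and click the button only when the "available" class is found inside the current article.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
d = webdriver.Chrome()
d.get('URL')
articles = d.find_elements(By.XPATH, '//article')
for article in articles:
    try:
        # The leading dot keeps each search inside the current article.
        article.find_element(By.XPATH, './/li[@class="available"]')
        article.find_element(By.XPATH, './/button[@class="TT"]').click()
    except NoSuchElementException:
        # No "available" item or no button in this article; skip it.
        pass
If the button's class attribute is not exactly 'TT', or the li's class attribute is not exactly 'available', these XPath expressions will not match. In that case you could locate the elements by class name instead (By.CLASS_NAME).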
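The per-article logic can be checked offline, without a browser. The sketch below is not Selenium: it uses the standard library's xml.etree.ElementTree, which supports the small XPath subset needed here, on a trimmed copy of the markup. The second article's "unavailable" class is an assumption added so that the two articles differ.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup. The "unavailable" class on the
# second article is hypothetical, added only to show a non-match.
html = """<div class="row tsp">
<article id="123">
<div class="stock"><ul class="simple"><li class="available"><span>Delivery available</span></li></ul></div>
<div><button class="TT" type="button">Visit product</button></div>
</article>
<article id="1234">
<div class="stock"><ul class="simple"><li class="unavailable"><span>Out of stock</span></li></ul></div>
<div><button class="TT" type="button">Visit product</button></div>
</article>
</div>"""

root = ET.fromstring(html)

clickable = []
for article in root.iter("article"):
    # ".//" searches only below the current article.
    if article.find(".//li[@class='available']") is not None:
        button = article.find(".//button[@class='TT']")
        if button is not None:
            clickable.append(article.get("id"))

print(clickable)  # ['123']
```

The same relative expressions (starting with ".//") are what keep a Selenium search scoped to one article instead of the whole page.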

how to get the exact text that include in multiple class

I am new to Python.
I am trying to scrape data from a website,
but I failed to extract the data I need.
Here is my Python code:
import requests
from bs4 import BeautifulSoup
url = 'https://v2.sherpa.ac.uk/view/publisher_list/1.html'
r = requests.get(url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')
title = soup.title
print(soup.find_all("div", {"class" :["ep_view_page ep_view_page_view_publisher_list", "row"]}))
The problem is that I need the data inside the div with class "row", but there are two divs with the class "row".
One more thing: what should I write to get the data from multiple URLs and pages? The divs with class "col span-6" and "col span-3" contain href links, and clicking one of them opens a new page.
<div class="row">
<div class="col span-6">
'Grigore Antipa' National Museum of Natural History
</div>
<div class="col span-3">
<strong>Romania</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
and here I share the sitemap
<div class="row">
<h1 class="h1_like_h2">Publishers</h1>
<div class="ep_view_page ep_view_page_view_publisher_list">
</p><div class="row">
<div class="col span-6">
'Grigore Antipa' National Museum of Natural History
</div>
<div class="col span-3">
<strong>Romania</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_=28"></a><h2>(</h2><p>
</p><div class="row">
<div class="col span-6">
(ISC)²
</div>
<div class="col span-3">
<strong>United States of America</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_1"></a><h2>1</h2><p>
</p><div class="row">
<div class="col span-6">
1066 Tidsskrift for historie
</div>
<div class="col span-3">
<strong>Denmark</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_A"></a><h2>A</h2><p>
and so on...
I'm not really sure if that's what you wanted, but I've had some fun getting this stuff.
Basically, the code below scrapes the entire page (4487 entries) for the fields listed here. The values shown are just a sample of the data.
The name of the entity - ANZAMEMS (Australian and New Zealand Association for Medieval and Early Modern Studies)
The URL to its sub-page - http://v2.sherpa.ac.uk/id/publisher/1853?template=romeo
The country - Australia
The publication count (written to the VIEWS column) - 1
The so-called "publisher URL" - https://v2.sherpa.ac.uk//view/publication_by_publisher/1853.html
It then writes all of this out to a .csv file.
Here's the code:
import csv
import requests
from bs4 import BeautifulSoup
def make_soup():
    p = requests.get("https://v2.sherpa.ac.uk/view/publisher_list/1.html").text
    return BeautifulSoup(p, "html.parser")
main_soup = make_soup()
col_span_6_soup = main_soup.find_all("div", {"class": "col span-6"})
col_span_3_soup = main_soup.find_all("div", {"class": "col span-3"})
def get_names_and_urls():
    # Skip cells without a link so .get("href") is never called on None.
    data = [a.find("a") for a in col_span_6_soup]
    return [[i.text, i.get("href")] for i in data if i is not None and "romeo" in i.get("href", "")]
def get_countries():
    # The span-3 cells alternate: country, publication count, country, ...
    return [c.find("strong").text for c in col_span_3_soup[::2]]
def get_views_and_publisher():
    return [
        [
            i.find("strong").text.replace(" [view ]", ""),
            f"https://v2.sherpa.ac.uk/{i.find('a').get('href')}",
        ] for i in col_span_3_soup[1::2]
    ]
table = zip(get_names_and_urls(), get_countries(), get_views_and_publisher())
# newline="" stops the csv module from writing blank rows on Windows.
with open("loads_of_data.csv", "w", newline="") as output:
    w = csv.writer(output)
    w.writerow(["NAME", "URL", "COUNTRY", "VIEWS", "PUBLISHER_URL"])
    for col1, col2, col3 in table:
        w.writerow([*col1, col2, *col3])
print("You've got all the data!")
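On the original worry about two divs sharing the class "row": an alternative to zipping separate column lists is to walk the rows one at a time and read only each row's direct children. The following is a sketch under assumptions, run on a trimmed copy of the markup from the question (the links are missing from that copy, so no URLs are extracted). The names records, name_cell and strongs are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure shown in the question.
html = """
<div class="row">
<h1 class="h1_like_h2">Publishers</h1>
<div class="ep_view_page ep_view_page_view_publisher_list">
<div class="row">
<div class="col span-6">'Grigore Antipa' National Museum of Natural History</div>
<div class="col span-3"><strong>Romania</strong><span class="label">Country</span></div>
<div class="col span-3"><strong>1 [view ]</strong><span class="label">Publication Count</span></div>
</div>
<div class="row">
<div class="col span-6">1066 Tidsskrift for historie</div>
<div class="col span-3"><strong>Denmark</strong><span class="label">Country</span></div>
<div class="col span-3"><strong>1 [view ]</strong><span class="label">Publication Count</span></div>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("div", class_="row"):
    # recursive=False looks at direct children only, so the outer "row"
    # (which holds the heading and the list) is skipped.
    name_cell = row.find("div", class_="span-6", recursive=False)
    if name_cell is None:
        continue
    cells = row.find_all("div", class_="span-3", recursive=False)
    strongs = [cell.find("strong").get_text(strip=True) for cell in cells]
    country, count = strongs
    records.append(
        (name_cell.get_text(strip=True), country, count.replace(" [view ]", ""))
    )

print(records)
```

Reading each row as a unit keeps the name, country and count together even if one row on the live page lacks a link or a cell, which a zip over three independent lists would not detect.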

beautifulsoup4: how to parse a forum post?

I have multiple occurrences of the following (simplified) data structure, as found in a forum software:
<li id="post12345" class="anchorFixedHeader" style="order: 1">
<div class="messagesidebar member" item-prop="author">
<div class="messageauthor">
<div class="messageauthorcontainer">
<a id="mac12">
<span class="username" itemprop="text">MostInnovativeUsernameEver</span>
</a>
</div>
</div>
</div>
<div class="messagecontent">
<div class="messagebody">
<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>
</div>
</div>
</li>
What I want to do for each occurrence is extract the username and, for each username, the corresponding messagecontent. I could do that successfully if it weren't for a single problem: the quote. When I print the extracted data in the console, the data structure of the quote (naturally) gets messed up.
What I seem to need is the text before the quote, the quote itself, and the text after the quote, so that I can deal with them separately. I tried a bunch of things but haven't quite found my way around BeautifulSoup yet.
Ugh ... do you guys understand what I'm trying to do?
Well, if I understood your question, here is a way to solve it:
import re
import bs4
html = """<li id="post12345" class="anchorFixedHeader" style="order: 1">
<div class="messagesidebar member" item-prop="author">
<div class="messageauthor">
<div class="messageauthorcontainer">
<a id="mac12">
<span class="username" itemprop="text">MostInnovativeUsernameEver</span>
</a>
</div>
</div>
</div>
<div class="messagecontent">
<div class="messagebody">
<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>
</div>
</div>
</li>"""
forum_posts = []
# Match each line from its first non-whitespace character to the end of the
# line, which skips blank lines and leading indentation.
re_break_line = re.compile(r'\S.*')
soup = bs4.BeautifulSoup(html, features='html.parser')
posts = soup.find_all('li')
for post in posts:
    username = post.find('span', {'class': 'username'})
    content = post.find('div', {'class': 'messagecontent'})
    forum_post = {
        'username': username.text,
        'content': re_break_line.findall(content.text)
    }
    forum_posts.append(forum_post)
print(forum_posts)
Console output (Python 3.7+, where dicts keep insertion order):
[{'username': 'MostInnovativeUsernameEver', 'content': ['Text before the quote.', "Hello, I'm a quote.", 'Text after the class.']}]
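The list above flattens the quote into the same sequence as the surrounding text. If the three parts need to be told apart, as the question asks, one option is to walk the direct children of the messagetext div and label each piece. This is a minimal sketch, not the answerer's method; the labels "text" and "quote" and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<div class="messagetext" itemprop="text">
Text before the quote.
<blockquote class="quotebox">
<div class="quoteboxcontent">
<p>
Hello, I'm a quote.
</p>
</div>
</blockquote>
Text after the class.
</div>"""

soup = BeautifulSoup(html, "html.parser")
message = soup.find("div", class_="messagetext")

parts = []
for child in message.children:
    if isinstance(child, Tag) and child.name == "blockquote":
        # Everything inside the blockquote counts as the quote.
        parts.append(("quote", child.get_text(strip=True)))
    elif isinstance(child, NavigableString):
        # Bare strings are the author's own text; skip whitespace-only ones.
        text = child.strip()
        if text:
            parts.append(("text", text))

print(parts)
# [('text', 'Text before the quote.'), ('quote', "Hello, I'm a quote."),
#  ('text', 'Text after the class.')]
```

Other tags that might appear directly inside messagetext (links, bold text, and so on) are ignored by this sketch and would need their own branch.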

how to get a number from an array

[<div class="ticket_type">‐ Help With Steam Workshop<a
href="javascript:
jsTicketsLast7Days.getOptions().appendValueToParam( 'requestid',
'29' ); jsTicketsLast7Days.getOptions().showSelectedRange( true
); $J('#TicketsLast7Days').get(0).scrollIntoView();"> + </a>
</div>,
<div class="ticket_last_24 report_table_right">
<span>15</span>
<span>(</span><span
class="change_increase">+36%</span><span>)</span>
</div>,
<div class="ticket_last_week report_table_right">
<span>271</span>
<span>(</span><span
class="change_increase">+632%</span><span>)</span>
</div>,
<div class="ticket_waiting_not_elevated
report_table_right">0</div>,
<div class="ticket_waiting_elevated
report_table_right">37</div>,
[]]
How can I get only the numbers (0 and 37) from the "ticket_waiting_not_elevated report_table_right" and "ticket_waiting_elevated report_table_right" divs?
Maybe this could help:
from bs4 import BeautifulSoup
text = """[<div class="ticket_type">‐ Help With Steam Workshop + </div>,
<div class="ticket_last_24 report_table_right"><span>15</span><span>(</span><span class="change_increase">+36%</span><span>)</span> </div>,
<div class="ticket_last_week report_table_right"> <span>271</span><span>(</span><span class="change_increase">+632%</span><span>)</span></div>,
<div class="ticket_waiting_not_elevated report_table_right">0</div>,
<div class="ticket_waiting_elevated report_table_right">37</div>,
[]]"""
soup = BeautifulSoup(text, 'html.parser')
# Each string in the list is compared against the full class attribute value.
for i in soup.find_all('div', attrs={'class': ['ticket_waiting_not_elevated report_table_right', 'ticket_waiting_elevated report_table_right']}):
    print(i.get('class')[0], ':', i.text)
# Output is: ticket_waiting_not_elevated : 0
#            ticket_waiting_elevated : 37
You can use select() for getting your data:
data = """[<div class="ticket_type">‐ Help With Steam Workshop<a
href="javascript:
jsTicketsLast7Days.getOptions().appendValueToParam( 'requestid',
'29' ); jsTicketsLast7Days.getOptions().showSelectedRange( true
); $J('#TicketsLast7Days').get(0).scrollIntoView();"> + </a>
</div>,
<div class="ticket_last_24 report_table_right">
<span>15</span>
<span>(</span><span
class="change_increase">+36%</span><span>)</span>
</div>,
<div class="ticket_last_week report_table_right">
<span>271</span>
<span>(</span><span
class="change_increase">+632%</span><span>)</span>
</div>,
<div class="ticket_waiting_not_elevated
report_table_right">0</div>,
<div class="ticket_waiting_elevated
report_table_right">37</div>,
[]]"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
nums = soup.select('''div.ticket_waiting_not_elevated.report_table_right,
div.ticket_waiting_elevated.report_table_right''')
print([num.text for num in nums])
Prints:
['0', '37']
The selector passed to soup.select() matches every div that has both the ticket_waiting_not_elevated and report_table_right classes, or both the ticket_waiting_elevated and report_table_right classes.
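Since the question asks for numbers, the selected text can also be converted to integers and keyed by the first class name. A minimal sketch on a trimmed copy of the data; the dictionary name counts is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two divs of interest from the question.
data = """
<div class="ticket_waiting_not_elevated report_table_right">0</div>,
<div class="ticket_waiting_elevated report_table_right">37</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Map the first class name of each matching div to its numeric content.
counts = {
    div["class"][0]: int(div.get_text(strip=True))
    for div in soup.select(
        "div.ticket_waiting_not_elevated, div.ticket_waiting_elevated"
    )
}

print(counts)
# {'ticket_waiting_not_elevated': 0, 'ticket_waiting_elevated': 37}
```

int() will raise ValueError if a cell ever holds something other than digits (for example a value with a thousands separator), so real data may need cleaning first.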

Loop through same div' class and grab text using python webdriver

I have done everything with Selenium and WebDriver, but I am not sure how to get the text from ALL divs with class Text3. I also have a problem with div id="TableStart_00023": the number changes now and then (TableStart_00023, TableStart_0283, etc.).
Here is the HTML part of the code:
<div data-reactroot="" id="TableStart_00023">
<ul>
<li class="FirstRow03">
<a class="aClass">
<div class="innerCl">
<div class="Text1"></div>
<div class="Text2"></div>
<div class="Text3">Wanted data</div>
<div class="Text4"></div>
</div>
</a>
</li>
<li class="FirstRow02">
<a class="aClass">
<div class="innerCl">
<div class="Text1"></div>
<div class="Text2"></div>
<div class="Text3">Wanted data 2</div>
<div class="Text4"></div>
</div>
</a>
</li>
</ul>
</div>
Here is the Python part of the code that I wrote:
for content in driver.find_elements_by_id('TableStart_00023'):
    mytext = content.find_element_by_xpath('.//div[@class="Text3"]').text
    print(mytext)
How can I loop through all divs with class Text3 and get their text when the number in the TableStart ID changes? What am I doing wrong?
This XPath will return all div elements with class Text3 from your table:
//div[starts-with(@id,'TableStart')]//div[@class='Text3']
Once you have all these elements (using driver.find_elements_by_xpath), you can get the text from each of them.
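The same "ID starts with TableStart" idea can be written as a CSS attribute-prefix selector, which makes it easy to try out offline. The sketch below uses BeautifulSoup on the HTML from the question rather than Selenium, so it only checks the selector logic; the variable name texts is illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div data-reactroot="" id="TableStart_00023">
<ul>
<li class="FirstRow03"><a class="aClass"><div class="innerCl">
<div class="Text1"></div><div class="Text2"></div>
<div class="Text3">Wanted data</div>
<div class="Text4"></div>
</div></a></li>
<li class="FirstRow02"><a class="aClass"><div class="innerCl">
<div class="Text1"></div><div class="Text2"></div>
<div class="Text3">Wanted data 2</div>
<div class="Text4"></div>
</div></a></li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [id^="TableStart"] matches any id that begins with "TableStart",
# whatever digits follow it.
texts = [
    div.get_text(strip=True)
    for div in soup.select('div[id^="TableStart"] div.Text3')
]

print(texts)  # ['Wanted data', 'Wanted data 2']
```

In a browser-driven script the equivalent loop would collect the elements first and then read .text from each one in turn.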
