[<div class="ticket_type">‐ Help With Steam Workshop<a
href="javascript:
jsTicketsLast7Days.getOptions().appendValueToParam( 'requestid',
'29' ); jsTicketsLast7Days.getOptions().showSelectedRange( true
); $J('#TicketsLast7Days').get(0).scrollIntoView();"> + </a>
</div>,
<div class="ticket_last_24 report_table_right">
<span>15</span>
<span>(</span><span
class="change_increase">+36%</span><span>)</span>
</div>,
<div class="ticket_last_week report_table_right">
<span>271</span>
<span>(</span><span
class="change_increase">+632%</span><span>)</span>
</div>,
<div class="ticket_waiting_not_elevated
report_table_right">0</div>,
<div class="ticket_waiting_elevated
report_table_right">37</div>,
[]]
How can I get only the numbers 0 and 37 from the "ticket_waiting_not_elevated report_table_right" and "ticket_waiting_elevated report_table_right" divs?
Maybe this could help:
text ="""[<div class="ticket_type">‐ Help With Steam Workshop + </div>,
<div class="ticket_last_24 report_table_right"><span>15</span><span>(</span><span class="change_increase">+36%</span><span>)</span> </div>,
<div class="ticket_last_week report_table_right"> <span>271</span><span>(</span><span class="change_increase">+632%</span><span>)</span></div>,
<div class="ticket_waiting_not_elevated report_table_right">0</div>,
<div class="ticket_waiting_elevated report_table_right">37</div>,
[]]"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
# Match divs whose class attribute is exactly one of these two strings
for i in soup.find_all('div', attrs={'class': ['ticket_waiting_not_elevated report_table_right', 'ticket_waiting_elevated report_table_right']}):
    print(i.get('class')[0], ':', i.text)
# Output is: ticket_waiting_not_elevated : 0
# ticket_waiting_elevated : 37
You can use select() for getting your data:
data = """[<div class="ticket_type">‐ Help With Steam Workshop<a
href="javascript:
jsTicketsLast7Days.getOptions().appendValueToParam( 'requestid',
'29' ); jsTicketsLast7Days.getOptions().showSelectedRange( true
); $J('#TicketsLast7Days').get(0).scrollIntoView();"> + </a>
</div>,
<div class="ticket_last_24 report_table_right">
<span>15</span>
<span>(</span><span
class="change_increase">+36%</span><span>)</span>
</div>,
<div class="ticket_last_week report_table_right">
<span>271</span>
<span>(</span><span
class="change_increase">+632%</span><span>)</span>
</div>,
<div class="ticket_waiting_not_elevated
report_table_right">0</div>,
<div class="ticket_waiting_elevated
report_table_right">37</div>,
[]]"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
nums = soup.select('''div.ticket_waiting_not_elevated.report_table_right,
div.ticket_waiting_elevated.report_table_right''')
print([num.text for num in nums])
Prints:
['0', '37']
The soup.select('div.ticket_waiting_not_elevated.report_table_right, div.ticket_waiting_elevated.report_table_right') call selects every div that has both the ticket_waiting_not_elevated and report_table_right classes, or both the ticket_waiting_elevated and report_table_right classes. The comma separates the two alternatives.
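If the counts are needed as integers keyed by ticket state, the same selector can feed a small dictionary. This is only a sketch: it reuses the two divs from the question, and the names html and counts are made up for the illustration.

```python
from bs4 import BeautifulSoup

# The two divs from the question, including the line break inside the class attribute
html = """
<div class="ticket_waiting_not_elevated
report_table_right">0</div>
<div class="ticket_waiting_elevated
report_table_right">37</div>
"""

soup = BeautifulSoup(html, "html.parser")

counts = {}
for div in soup.select(
    "div.ticket_waiting_not_elevated.report_table_right, "
    "div.ticket_waiting_elevated.report_table_right"
):
    # div["class"] is a list of class names; the first one identifies the column
    counts[div["class"][0]] = int(div.get_text(strip=True))

print(counts)
```

Using the first class name as the key keeps the two values distinguishable without relying on their order in the page.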
I am new to Python.
I am trying to scrape data from a website, but I cannot extract the data I need.
Here is my Python code:
import requests
from bs4 import BeautifulSoup
url = 'https://v2.sherpa.ac.uk/view/publisher_list/1.html'
r = requests.get(url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')
title = soup.title
print(soup.find_all("div", {"class" :["ep_view_page ep_view_page_view_publisher_list", "row"]}))
The problem is that I need the data inside div class="row", but there are two kinds of div with the class row.
One more thing: what should I write to get the data from multiple URLs and pages? The divs with class col span-6 and col span-3 contain links (href), and clicking one of them opens a new page.
<div class="row">
<div class="col span-6">
'Grigore Antipa' National Museum of Natural History
</div>
<div class="col span-3">
<strong>Romania</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
And here is the structure of the page:
<div class="row">
<h1 class="h1_like_h2">Publishers</h1>
<div class="ep_view_page ep_view_page_view_publisher_list">
</p><div class="row">
<div class="col span-6">
'Grigore Antipa' National Museum of Natural History
</div>
<div class="col span-3">
<strong>Romania</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_=28"></a><h2>(</h2><p>
</p><div class="row">
<div class="col span-6">
(ISC)²
</div>
<div class="col span-3">
<strong>United States of America</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_1"></a><h2>1</h2><p>
</p><div class="row">
<div class="col span-6">
1066 Tidsskrift for historie
</div>
<div class="col span-3">
<strong>Denmark</strong>
<span class="label">Country</span>
</div>
<div class="col span-3">
<strong>1 [view ]</strong>
<span class="label">Publication Count</span>
</div>
</div>
<p></p><a name="group_A"></a><h2>A</h2><p>
and so on...
I'm not really sure if that's what you wanted, but I've had some fun getting this stuff.
Basically, the code below scrapes the entire page (4487 entries) for the following:
The name of the entity - ANZAMEMS (Australian and New Zealand Association for Medieval and Early Modern Studies)
The URL to its sub-page - http://v2.sherpa.ac.uk/id/publisher/1853?template=romeo
The country - Australia
The view count - 1
The so-called "publisher url" - https://v2.sherpa.ac.uk//view/publication_by_publisher/1853.html
Note: The values above are just a sample of the data.
It then writes all of this out to a .csv file.
Here's the code:
import csv
import requests
from bs4 import BeautifulSoup
def make_soup():
    p = requests.get("https://v2.sherpa.ac.uk/view/publisher_list/1.html").text
    return BeautifulSoup(p, "html.parser")
main_soup = make_soup()
col_span_6_soup = main_soup.find_all("div", {"class": "col span-6"})
col_span_3_soup = main_soup.find_all("div", {"class": "col span-3"})
def get_names_and_urls():
    # Keep only links to publisher sub-pages (their href contains "romeo")
    links = [div.find("a") for div in col_span_6_soup]
    return [[a.text, a.get("href")] for a in links if a is not None and "romeo" in a.get("href", "")]
def get_countries():
    # Country and publication-count cells alternate, so take every other one
    return [c.find("strong").text for c in col_span_3_soup[::2]]
def get_views_and_publisher():
    return [
        [
            i.find("strong").text.replace(" [view ]", ""),
            f"https://v2.sherpa.ac.uk/{i.find('a').get('href')}",
        ] for i in col_span_3_soup[1::2]
    ]
table = zip(get_names_and_urls(), get_countries(), get_views_and_publisher())
# newline="" stops the csv module from writing blank lines on Windows
with open("loads_of_data.csv", "w", newline="") as output:
    w = csv.writer(output)
    w.writerow(["NAME", "URL", "COUNTRY", "VIEWS", "PUBLISHER_URL"])
    for col1, col2, col3 in table:
        w.writerow([*col1, col2, *col3])
print("You've got all the data!")
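As for the two kinds of div with the class row, one option is to scope the selector to the rows nested inside the publisher list and read each row as a unit. The following is a sketch only, run against a trimmed copy of the HTML from the question rather than the live site; the names html and records are made up for the illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure shown in the question (links omitted)
html = """
<div class="row">
<h1 class="h1_like_h2">Publishers</h1>
<div class="ep_view_page ep_view_page_view_publisher_list">
<div class="row">
<div class="col span-6">'Grigore Antipa' National Museum of Natural History</div>
<div class="col span-3"><strong>Romania</strong><span class="label">Country</span></div>
<div class="col span-3"><strong>1 [view ]</strong><span class="label">Publication Count</span></div>
</div>
<div class="row">
<div class="col span-6">1066 Tidsskrift for historie</div>
<div class="col span-3"><strong>Denmark</strong><span class="label">Country</span></div>
<div class="col span-3"><strong>1 [view ]</strong><span class="label">Publication Count</span></div>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
# Only rows nested inside the publisher list match, so the outer "row" is skipped
for row in soup.select("div.ep_view_page_view_publisher_list div.row"):
    name_cell = row.select_one("div.col.span-6")
    info_cells = row.select("div.col.span-3")
    if name_cell is None or len(info_cells) < 2:
        continue  # not a publisher row
    records.append({
        "name": name_cell.get_text(strip=True),
        "country": info_cells[0].strong.get_text(strip=True),
        "count": info_cells[1].strong.get_text(strip=True).replace(" [view ]", ""),
    })

for record in records:
    print(record)
```

Reading name, country and count from the same row keeps them aligned even if some row lacks one of the cells, which three parallel lists joined with zip cannot guarantee.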
I use soup.select('.c-w a') to select elements. Inside c-w there is c-s, which I would like to exclude from this selection.
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3"><img class="soundpng" src="file://sound.png"/></a>
</div></div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    a['href'] = 'entry://'
and the result is
<div class="c-w">
<div class="c-s">
<a href="entry://"><img class="soundpng" src="file://sound.png"/></a>
</div></div>
My goal is to exclude .c-s a from this replacement. I mean that when the search meets c-s, it should ignore that element and keep searching in the other ones. Could you please explain how to achieve this?
Based on your comments, you can use .find_parent() to determine whether the <a> tag is inside a tag with class="c-s":
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3"><img class="soundpng" src="file://sound.png"/></a>
</div>
<div>
<a href="...">...</a>
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    if a.find_parent(class_='c-s'):
        continue
    a['href'] = 'entry://'
print(soup.prettify())
Prints:
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3">
<img class="soundpng" src="file://sound.png"/>
</a>
</div>
<div>
<a href="entry://">
...
</a>
</div>
</div>
EDIT: To exclude both .c-s and .c-v, you can do this:
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3"><img class="soundpng" src="file://sound.png"/></a>
</div>
<div class="c-v">
<a href="sound://english-french/sound/M000001099.mp3"><img class="soundpng" src="file://sound.png"/></a>
</div>
<div>
<a href="...">...</a>
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.c-w a'):
    if a.find_parent(class_=['c-s', 'c-v']):
        continue
    a['href'] = 'entry://'
print(soup.prettify())
Prints:
<div class="c-w">
<div class="c-s">
<a href="sound://english-french/sound/M000001099.mp3">
<img class="soundpng" src="file://sound.png"/>
</a>
</div>
<div class="c-v">
<a href="sound://english-french/sound/M000001099.mp3">
<img class="soundpng" src="file://sound.png"/>
</a>
</div>
<div>
<a href="entry://">
...
</a>
</div>
</div>
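The same exclusion can probably also be written inside the selector itself with :not(). The sketch below assumes a BeautifulSoup version whose select() is backed by Soup Sieve (4.7 or later), which accepts complex selectors inside :not(). The hrefs and link texts are hypothetical placeholders, not values from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link in .c-s, one in .c-v, one in a plain div
txt = '''
<div class="c-w">
<div class="c-s"><a href="sound://a.mp3">sound</a></div>
<div class="c-v"><a href="video://b.mp4">video</a></div>
<div><a href="old://entry">entry</a></div>
</div>
'''

soup = BeautifulSoup(txt, 'html.parser')

# Select links under .c-w, except those that sit inside .c-s or .c-v
for a in soup.select('.c-w a:not(.c-s a, .c-v a)'):
    a['href'] = 'entry://'

hrefs = [a['href'] for a in soup.find_all('a')]
print(hrefs)
```

This keeps the filtering in one place, while the .find_parent() version is easier to extend with conditions that CSS cannot express.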
I have an HTML document in which there are many elements <div class="ex_example"> .. </div> inside <div class="c-s">, which is in turn inside <div class="c-w">, i.e.
<div class="c-w">
<div class="c-s">
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
</div></div>
Could you please explain how to move all the <div class="ex_example"> .. </div> elements to right before <div class="c-w">? I mean
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="c-w">
<div class="c-s">
</div></div>
My code is
import requests
session = requests.Session()
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
r = session.get('https://dictionnaire.lerobert.com/definition/aimer', headers = headers)
soup = BeautifulSoup(r.content, 'html.parser')
Thank you so much for your help!
Update: I have a situation in which there is more than one <div class="c-w">, and it's possible that some of them do not contain <div class="c-s">.
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> aa </div>
<div class="ex_example"> aa </div>
</div>
</div>
<div class="audio">link</div>
<div class="c-w">
<div class="c-s">
<div class="ex_example"> xx </div>
<div class="ex_example"> yy </div>
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
I hope I understood your question well: You can use .insert_before() to insert tags/strings before some tag/string:
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> 1.. </div>
<div class="ex_example"> 2.. </div>
<div class="ex_example"> 3.. </div>
</div></div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for c in list(soup.select_one('div.c-s').contents):
    soup.select_one('div.c-w').insert_before(c)
print(soup)
Prints:
<div class="ex_example"> 1.. </div>
<div class="ex_example"> 2.. </div>
<div class="ex_example"> 3.. </div>
<div class="c-w">
<div class="c-s"></div></div>
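For the case in the update (several c-w blocks, some without a c-s), the same .insert_before() idea can be applied per block. This is a sketch under the assumption that each ex_example should land directly before its own c-w; the sample markup extends the one from the update with a c-w that has no c-s.

```python
from bs4 import BeautifulSoup

txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> aa </div>
<div class="ex_example"> bb </div>
</div>
</div>
<div class="audio">link</div>
<div class="c-w">no c-s here</div>
<div class="c-w">
<div class="c-s">
<div class="ex_example"> xx </div>
<div class="ex_example"> yy </div>
</div>
</div>
'''

soup = BeautifulSoup(txt, 'html.parser')

for c_w in soup.select('div.c-w'):
    c_s = c_w.select_one('div.c-s')
    if c_s is None:
        continue  # this c-w has no c-s, nothing to move
    # find_all returns a list built up front, so moving tags while looping is safe
    for example in c_s.find_all('div', class_='ex_example'):
        # insert_before detaches the tag from c-s and re-inserts it before c-w
        c_w.insert_before(example)

print(soup)
```

Unlike moving .contents, this moves only the ex_example divs, so whitespace and any other children stay inside c-s.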
I am looking to extract the following URL, "https://mania.bg/p/pulover-alexander-mcqueen-p409648", from HTML (a BeautifulSoup object) named urls that looks like:
[<a class="product sellout product-sellout float-left status-1" data-id="409648" data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" data-status="1" href="https://mania.bg/p/pulover-alexander-mcqueen-p409648"> <div class="product-hover clearfix prevent-flicker"><div class="module-icons favourite tooltip" data-id="409648" data-title=" Любима находка на 24 клиент/и. "> <img alt="" class="favourite-product like-product unactivated" data-id="409648" src="dist/assets/icon_favourite_off.png"/></div> <div class="campaign" style="color: #FFF;background-color: #000000;"> NIGHT </div> <div class="profit-icons-wrapper clearfix"> </div> <div class="product-basic-info"> <div class="image-wrapper"> <img alt="Пуловер Alexander McQueen" class="front" data-url="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-2.jpg" src="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-1.jpg" title="Пуловер Alexander McQueen - Mania"> </img></div> <div class="clearfix brand-line"> <div class="brand float-left text-uppercase">Alexander McQueen</div> <div class="size float-right">S</div> </div> </div> <div class="prices-section"> <div class="prices-inner-section"> <div class="price-wrapper clearfix"> <div class="price-title text-uppercase float-left"> Начална цена </div> <div class="price old"> <span>98.00</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left"> -40% </div> <div class="price old"> <span>58.80</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left" style="color: #FFF;background-color: #000000"> -40% </div> <div class="price"> <span>35.28</span> <span class="currency">лв.</span> </div> </div> </div> </div> </div> <div class="button button-auction buy-now text-center float-left tooltip prevent-popup-close" data-id="409648" 
data-title="Може да добавите този продукт към количката.">ДОБАВЯМ<img alt="" class="bag-icon" src="dist/assets/icon_bag_button.svg"> </img></div> </a>]
With the following code:
for num in range(len(urls)):
    url = urls[num - 1].a['href']
I also tried to use:
url = urls[num - 1].a['data-producturl']
I get "TypeError: 'NoneType' object is not subscriptable" because urls[num - 1].a is None.
import requests
import bs4
url = 'https://mania.bg/p/pulover-alexander-mcqueen-p409648'
data = requests.get(url)
soup = bs4.BeautifulSoup(data.text,'html.parser')
urls = soup.find_all('a', attrs={'class': 'product sellout product-sellout float-left status-1'})
for num in range(len(urls)):
    url = urls[num]['href']
    print(url)
Try this. Here are some examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
[<a class="product sellout product-sellout float-left status-1" data-id="409648" data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" data-status="1" href="https://mania.bg/p/pulover-alexander-mcqueen-p409648"> <div class="product-hover clearfix prevent-flicker"><div class="module-icons favourite tooltip" data-id="409648" data-title=" Любима находка на 24 клиент/и. "> <img alt="" class="favourite-product like-product unactivated" data-id="409648" src="dist/assets/icon_favourite_off.png"/></div> <div class="campaign" style="color: #FFF;background-color: #000000;"> NIGHT </div> <div class="profit-icons-wrapper clearfix"> </div> <div class="product-basic-info"> <div class="image-wrapper"> <img alt="Пуловер Alexander McQueen" class="front" data-url="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-2.jpg" src="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-1.jpg" title="Пуловер Alexander McQueen - Mania"> </img></div> <div class="clearfix brand-line"> <div class="brand float-left text-uppercase">Alexander McQueen</div> <div class="size float-right">S</div> </div> </div> <div class="prices-section"> <div class="prices-inner-section"> <div class="price-wrapper clearfix"> <div class="price-title text-uppercase float-left"> Начална цена </div> <div class="price old"> <span>98.00</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left"> -40% </div> <div class="price old"> <span>58.80</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left" style="color: #FFF;background-color: #000000"> -40% </div> <div class="price"> <span>35.28</span> <span class="currency">лв.</span> </div> </div> </div> </div> </div> <div class="button button-auction buy-now text-center float-left tooltip prevent-popup-close" data-id="409648" 
data-title="Може да добавите този продукт към количката.">ДОБАВЯМ<img alt="" class="bag-icon" src="dist/assets/icon_bag_button.svg"> </img></div> </a>]
'''
doc = SimplifiedDoc(html)
urls = doc.selects('a.product sellout product-sellout float-left status-1')
print ([(url.href,url['data-producturl']) for url in urls])
Result:
[('https://mania.bg/p/pulover-alexander-mcqueen-p409648', 'https://mania.bg/p/pulover-alexander-mcqueen-p409648')]
find_all already gives you the list of a elements; you just need to get the href from each.
from bs4 import BeautifulSoup
import requests
url = 'https://mania.bg/p/pulover-alexander-mcqueen-p409648'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
for a in soup.find_all(
        'a',
        attrs={'class':
               'product sellout product-sellout float-left status-1'}):
    print(a['data-producturl'])
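The TypeError from the question can be reproduced offline. In the sketch below, which uses a shortened, hand-trimmed version of the markup from the question, each item in the result list is already the <a> tag, so .a looks for another <a> nested inside it and returns None.

```python
from bs4 import BeautifulSoup

# Shortened version of the product link from the question
html = '''
<a class="product sellout product-sellout float-left status-1"
   data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648"
   href="https://mania.bg/p/pulover-alexander-mcqueen-p409648">
<div class="brand">Alexander McQueen</div>
</a>
'''

soup = BeautifulSoup(html, 'html.parser')
urls = soup.select('a.product.sellout')

# .a searches for an <a> inside the tag; there is none, so this is None
nested = urls[0].a
print(nested)

# Read the attributes from the tag itself; .get() returns None instead of raising
links = [a.get('href') for a in urls]
print(links)
```

Iterating over the tags directly also avoids the index arithmetic of range(len(urls)) and num - 1.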
<div class="ticket_last_24 report_table_right">
<span>13,978</span>
<span>(</span><span
class="change_increase">+2.3%
</span><span>)</span>
</div>
<div class="ticket_last_week report_table_right">
<span>99,585</span>
<span>(</span><span
class="change_increase">+0.6%
</span><span>)</span>
</div>
<div class="ticket_last_24 report_table_right">
<span>12121</span>
<span>(</span><span
class="change_increase">+2.3%
</span><span>)</span>
</div>
<div class="ticket_last_week report_table_right">
<span>99,222</span>
<span>(</span><span
class="change_increase">+0.6%
</span><span>)</span>
</div>
I tried the code below:
text=[]
from bs4 import BeautifulSoup
TicketNuber=soup.find_all("div")
for div in TicketNuber:
    text.append(div.find("span"))
It prints out:
[
'13,978',
'13,978',
'99,585',
'12,121',
'12,121',
'99,222'
]
I am not sure why the first number is printed twice. I only want the numbers ['13,978','99492','12,121','99,222']. There are no duplicate numbers in the same tag.
When I do this:
text = []
TicketNumber = soup.find_all("div")
for div in TicketNumber:
    text.append(div.find("span").get_text())
print(text)
I get this:
['13,978', '99,585', '12,121', '99,222']
Could you please give this a shot and confirm if this works?
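One possible cause of the duplicates is an outer wrapper div around each pair. That is an assumption, since the full page is not shown. find_all("div") would then return the wrapper too, and the wrapper's first span is the same one as its first child's. The sketch below uses a hypothetical wrapper class report_row to show the effect and a narrower selector that avoids it.

```python
from bs4 import BeautifulSoup

# "report_row" is a made-up wrapper; the inner divs follow the question
html = '''
<div class="report_row">
<div class="ticket_last_24 report_table_right">
<span>13,978</span>
<span>(</span><span class="change_increase">+2.3%</span><span>)</span>
</div>
<div class="ticket_last_week report_table_right">
<span>99,585</span>
<span>(</span><span class="change_increase">+0.6%</span><span>)</span>
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Every div, including the wrapper, contributes its first span
all_divs = [div.find('span').get_text() for div in soup.find_all('div')]
print(all_divs)

# Only the two ticket divs are selected, so each number appears once
targeted = [
    div.span.get_text(strip=True)
    for div in soup.select('div.ticket_last_24, div.ticket_last_week')
]
print(targeted)
```

Selecting by class rather than by tag name also protects against unrelated divs elsewhere on the page.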