Web Scraper for Data-Url in Div Python (BeautifulSoup) - python-3.x

I don't know why the program doesn't extract the links from inside the div.
I don't know whether the error is in how I defined the div class or in the step that extracts the data-url from the div.
Here is the current code:
import requests
from bs4 import BeautifulSoup

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}
url = requests.get("https://www.chosic.com/free-music/all/", headers=header)
soup = BeautifulSoup(url.content, 'lxml')
list = []
music = soup.find_all('div', {'class': 'track-audio'})
for i in music:
    i.findAll(['data-url'])
    print(i)
Output:
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="27306" data-url="https://www.chosic.com/wp-content/uploads/2021/02/happy-clappy-ukulele.mp3" id="waveform27306"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="25944" data-url="https://www.chosic.com/wp-content/uploads/2020/07/Art-Of-Silence_V2.mp3" id="waveform25944"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="26757" data-url="https://www.chosic.com/wp-content/uploads/2020/11/batchbug-sweet-dreams.mp3" id="waveform26757"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="27880" data-url="https://www.chosic.com/wp-content/uploads/2021/04/Luke-Bergs-Bliss.mp3" id="waveform27880"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="27281" data-url="https://www.chosic.com/wp-content/uploads/2021/02/Warm-Memories-Emotional-Inspiring-Piano.mp3" id="waveform27281"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="26021" data-url="https://www.chosic.com/wp-content/uploads/2020/08/fm-freemusic-give-me-a-smile.mp3" id="waveform26021"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="27247" data-url="https://www.chosic.com/wp-content/uploads/2021/02/Monkeys-Spinning-Monkeys.mp3" id="waveform27247"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="27248" data-url="https://www.chosic.com/wp-content/uploads/2021/02/Fluffing-a-Duck.mp3" id="waveform27248"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="27120" data-url="https://www.chosic.com/wp-content/uploads/2021/01/fm-freemusic-inspiring-optimistic-upbeat-energetic-guitar-rhythm.mp3" id="waveform27120"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="25860" data-url="https://www.chosic.com/wp-content/uploads/2020/07/alexander-nakarada-superepic.mp3" id="waveform25860"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="28703" data-url="https://www.chosic.com/wp-content/uploads/2021/08/An-Epic-Story.mp3" id="waveform28703"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="28923" data-url="https://www.chosic.com/wp-content/uploads/2021/08/scott-buckley-jul.mp3" id="waveform28923"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="24515" data-url="https://www.chosic.com/wp-content/uploads/2020/06/John_Bartmann_-_02_-_Happy_African_Village.mp3" id="waveform24515"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="27012" data-url="https://www.chosic.com/wp-content/uploads/2021/01/春のテーマ-Spring-field-.mp3" id="waveform27012"></div></div>
<div class="track-audio"><div class="waveform before" data-saved="yes" data-track="25897" data-url="https://www.chosic.com/wp-content/uploads/2020/07/Brandenburg-Concerto-no.-3-BWV-1048-Complete-Performance.mp3" id="waveform25897"></div></div>
But I want to extract only the data-url values from the divs.
Example:
https://www.chosic.com/wp-content/uploads/2021/04/Luke-Bergs-Bliss.mp3
https://www.chosic.com/wp-content/uploads/2021/02/Warm-Memories-Emotional-Inspiring-Piano.mp3
https://www.chosic.com/wp-content/uploads/2020/08/fm-freemusic-give-me-a-smile.mp3
https://www.chosic.com/wp-content/uploads/2021/02/Monkeys-Spinning-Monkeys.mp3
https://www.chosic.com/wp-content/uploads/2021/02/Fluffing-a-Duck.mp3
https://www.chosic.com/wp-content/uploads/2021/01/fm-freemusic-inspiring-optimistic-upbeat-energetic-guitar-rhythm.mp3
https://www.chosic.com/wp-content/uploads/2020/07/alexander-nakarada-superepic.mp3
https://www.chosic.com/wp-content/uploads/2021/08/An-Epic-Story.mp3
https://www.chosic.com/wp-content/uploads/2021/08/scott-buckley-jul.mp3
https://www.chosic.com/wp-content/uploads/2020/06/John_Bartmann_-_02_-_Happy_African_Village.mp3
https://www.chosic.com/wp-content/uploads/2021/01/春のテーマ-Spring-field-.mp3
https://www.chosic.com/wp-content/uploads/2020/07/Brandenburg-Concerto-no.-3-BWV-1048-Complete-Performance.mp3
Is there any possible solution?

.findAll doesn't accept CSS selectors: passing ['data-url'] makes it look for tags named data-url, not for tags that carry that attribute. Also, you aren't assigning the output of .findAll to anything. Try:
import requests
from bs4 import BeautifulSoup
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}
url = requests.get("https://www.chosic.com/free-music/all/", headers=header)
soup = BeautifulSoup(url.content, "lxml")
music = soup.find_all("div", {"class": "track-audio"})
for i in music:
    m = i.select_one("[data-url]")
    print(m["data-url"])
Prints:
https://www.chosic.com/wp-content/uploads/2021/02/happy-clappy-ukulele.mp3
https://www.chosic.com/wp-content/uploads/2020/07/Art-Of-Silence_V2.mp3
https://www.chosic.com/wp-content/uploads/2020/11/batchbug-sweet-dreams.mp3
https://www.chosic.com/wp-content/uploads/2021/04/Luke-Bergs-Bliss.mp3
https://www.chosic.com/wp-content/uploads/2021/02/Warm-Memories-Emotional-Inspiring-Piano.mp3
https://www.chosic.com/wp-content/uploads/2020/08/fm-freemusic-give-me-a-smile.mp3
https://www.chosic.com/wp-content/uploads/2021/02/Monkeys-Spinning-Monkeys.mp3
https://www.chosic.com/wp-content/uploads/2021/02/Fluffing-a-Duck.mp3
https://www.chosic.com/wp-content/uploads/2021/01/fm-freemusic-inspiring-optimistic-upbeat-energetic-guitar-rhythm.mp3
https://www.chosic.com/wp-content/uploads/2020/07/alexander-nakarada-superepic.mp3
https://www.chosic.com/wp-content/uploads/2021/08/An-Epic-Story.mp3
https://www.chosic.com/wp-content/uploads/2021/08/scott-buckley-jul.mp3
https://www.chosic.com/wp-content/uploads/2020/06/John_Bartmann_-_02_-_Happy_African_Village.mp3
https://www.chosic.com/wp-content/uploads/2021/01/春のテーマ-Spring-field-.mp3
https://www.chosic.com/wp-content/uploads/2020/07/Brandenburg-Concerto-no.-3-BWV-1048-Complete-Performance.mp3
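As a variation on the answer above, the attribute can also be matched in a single query. The following is a minimal sketch that runs on a shortened inline copy of two divs from the question rather than on a live request; the `html` string and the variable names are illustrative and not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of two divs from the question's output (no network request).
html = """
<div class="track-audio"><div class="waveform before" data-track="27306" data-url="https://www.chosic.com/wp-content/uploads/2021/02/happy-clappy-ukulele.mp3" id="waveform27306"></div></div>
<div class="track-audio"><div class="waveform before" data-track="25944" data-url="https://www.chosic.com/wp-content/uploads/2020/07/Art-Of-Silence_V2.mp3" id="waveform25944"></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# A list passed to find_all is treated as tag names, so this searches for
# <data-url> tags and finds nothing.
by_tag_name = soup.find_all(["data-url"])

# CSS selector: any descendant of div.track-audio that has a data-url attribute.
urls = [tag["data-url"] for tag in soup.select("div.track-audio [data-url]")]

# The same search with find_all and an attribute filter.
urls_via_find_all = [tag["data-url"] for tag in soup.find_all(attrs={"data-url": True})]

print(by_tag_name)
print(urls)
```

Either form returns the tags themselves, so the URL is read with `tag["data-url"]` (or `tag.get("data-url")` if the attribute might be missing).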

Related

Python: How to select form from URL response without 'name' tag

I am trying to select a form from a URL response with Python. This is the form, and it is the only form available in the response.
<form id="login_form" data-redirect='{"type":"refresh","link":"\/"}'>
<div class="unlogged-input-container">
<input class="unlogged-input" type="email" id="login_mail" data-check="email" required value="" />
<label class="unlogged-label" for="login_mail">E-Mail-Adress</label>
</div>
<div class="unlogged-input-container">
<input class="unlogged-input unlogged-input-pwd" type="password" id="login_password" required />
<label class="unlogged-label" for="login_password">Password</label>
</div>
<div class="recaptcha-wrapper">
<div class="recaptcha-container">
<div id="recaptcha_enterprise_container"></div>
</div>
</div>
<button id="login_form_submit" class="auth-cta gap-m-top" type="submit">
<span class="unlogged-btn-label">Logon</span>
</button>
<input type="hidden" id="login_method" name="login_method" value="email">
</form>
With
print(br.select_form(nr=0))
it returns
None
But
print([f.attrs['id'] for f in br.forms()])
returns
['login_form']
and
for x in br.forms(): print(x)
returns
<GET https://www.<target_domain>/en/login application/x-www-form-urlencoded
<TextControl(<None>=)>
<PasswordControl(<None>=)>
<SubmitButtonControl(<None>=) (readonly)>
<HiddenControl(login_method=email) (readonly)>>
This is the sample code for it:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_gzip(False)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_equiv(True)
br.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'),
    ('Accept-Encoding', 'deflate,br'),
    ('Content-Type', 'application/x-www-form-urlencoded'),
    ('Cache-Control', 'no-cache'),
    ('Connection', 'keep-alive')
]
url = 'https://<target_domain>/en/login'
br.open(url)
print([f.attrs['id'] for f in br.forms()])
print(br.select_form(nr=0))
for x in br.forms():
    print(x)
But how to select this form?

How to move sub-tags to right after a mother tag?

I have an HTML document in which there are many elements <div class="ex_example"> .. </div> inside <div class="c-s">, which is in turn inside <div class="c-w">, i.e.
<div class="c-w">
<div class="c-s">
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
</div></div>
Could you please elaborate on how to move all the <div class="ex_example"> .. </div> elements to right before <div class="c-w">? I mean
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="ex_example"> .. </div>
<div class="c-w">
<div class="c-s">
</div></div>
My code is
import requests
session = requests.Session()
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
r = session.get('https://dictionnaire.lerobert.com/definition/aimer', headers = headers)
soup = BeautifulSoup(r.content, 'html.parser')
Thank you so much for your help!
Update: In my situation there is more than one <div class="c-w">, and some of them may not contain a <div class="c-s">.
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> aa </div>
<div class="ex_example"> aa </div>
</div>
</div>
<div class="audio">link</div>
<div class="c-w">
<div class="c-s">
<div class="ex_example"> xx </div>
<div class="ex_example"> yy </div>
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
I hope I understood your question well: You can use .insert_before() to insert tags/strings before some tag/string:
from bs4 import BeautifulSoup
txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> 1.. </div>
<div class="ex_example"> 2.. </div>
<div class="ex_example"> 3.. </div>
</div></div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for c in list(soup.select_one('div.c-s').contents):
    soup.select_one('div.c-w').insert_before(c)
print(soup)
Prints:
<div class="ex_example"> 1.. </div>
<div class="ex_example"> 2.. </div>
<div class="ex_example"> 3.. </div>
<div class="c-w">
<div class="c-s"></div></div>
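For the case described in the update (several div.c-w wrappers, some without a div.c-s), the same .insert_before() idea can be applied per wrapper. This is only a sketch under that reading of the update; the sample markup, including the wrapper without a div.c-s, is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: three wrappers, the second one has no div.c-s.
txt = '''
<div class="c-w">
<div class="c-s">
<div class="ex_example"> aa </div>
<div class="ex_example"> bb </div>
</div>
</div>
<div class="audio">link</div>
<div class="c-w">
<p>no c-s here</p>
</div>
<div class="c-w">
<div class="c-s">
<div class="ex_example"> xx </div>
</div>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')

for wrapper in soup.select('div.c-w'):
    inner = wrapper.select_one('div.c-s')
    if inner is None:
        # Nothing to move for this wrapper.
        continue
    # Copy the child list first, because insert_before() removes each child
    # from inner.contents while we iterate.
    for child in list(inner.contents):
        wrapper.insert_before(child)

print(soup)
```

Each group of examples ends up directly before its own wrapper, and wrappers without a div.c-s are left untouched.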

Not able to extract urls from HTML BeautifulSoup object

I am looking to extract the following URL "https://mania.bg/p/pulover-alexander-mcqueen-p409648" from HTML (a BeautifulSoup object) named urls that looks like:
[<a class="product sellout product-sellout float-left status-1" data-id="409648" data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" data-status="1" href="https://mania.bg/p/pulover-alexander-mcqueen-p409648"> <div class="product-hover clearfix prevent-flicker"><div class="module-icons favourite tooltip" data-id="409648" data-title=" Любима находка на 24 клиент/и. "> <img alt="" class="favourite-product like-product unactivated" data-id="409648" src="dist/assets/icon_favourite_off.png"/></div> <div class="campaign" style="color: #FFF;background-color: #000000;"> NIGHT </div> <div class="profit-icons-wrapper clearfix"> </div> <div class="product-basic-info"> <div class="image-wrapper"> <img alt="Пуловер Alexander McQueen" class="front" data-url="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-2.jpg" src="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-1.jpg" title="Пуловер Alexander McQueen - Mania"> </img></div> <div class="clearfix brand-line"> <div class="brand float-left text-uppercase">Alexander McQueen</div> <div class="size float-right">S</div> </div> </div> <div class="prices-section"> <div class="prices-inner-section"> <div class="price-wrapper clearfix"> <div class="price-title text-uppercase float-left"> Начална цена </div> <div class="price old"> <span>98.00</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left"> -40% </div> <div class="price old"> <span>58.80</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left" style="color: #FFF;background-color: #000000"> -40% </div> <div class="price"> <span>35.28</span> <span class="currency">лв.</span> </div> </div> </div> </div> </div> <div class="button button-auction buy-now text-center float-left tooltip prevent-popup-close" data-id="409648" 
data-title="Може да добавите този продукт към количката.">ДОБАВЯМ<img alt="" class="bag-icon" src="dist/assets/icon_bag_button.svg"> </img></div> </a>]
With the following code:
for num in range(len(urls)):
    url = urls[num - 1].a['href']
I also tried to use:
url = urls[num - 1].a['data-producturl']
I get "TypeError: 'NoneType' object is not subscriptable" because urls[num - 1].a is None.
import requests
import bs4
url = 'https://mania.bg/p/pulover-alexander-mcqueen-p409648'
data = requests.get(url)
soup = bs4.BeautifulSoup(data.text,'html.parser')
urls = soup.find_all('a', attrs={'class': 'product sellout product-sellout float-left status-1'})
for num in range(len(urls)):
    url = urls[num]['href']
    print(url)
Try this. Here's an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
[<a class="product sellout product-sellout float-left status-1" data-id="409648" data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648" data-status="1" href="https://mania.bg/p/pulover-alexander-mcqueen-p409648"> <div class="product-hover clearfix prevent-flicker"><div class="module-icons favourite tooltip" data-id="409648" data-title=" Любима находка на 24 клиент/и. "> <img alt="" class="favourite-product like-product unactivated" data-id="409648" src="dist/assets/icon_favourite_off.png"/></div> <div class="campaign" style="color: #FFF;background-color: #000000;"> NIGHT </div> <div class="profit-icons-wrapper clearfix"> </div> <div class="product-basic-info"> <div class="image-wrapper"> <img alt="Пуловер Alexander McQueen" class="front" data-url="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-2.jpg" src="https://media.mania.bg/product/048/409648/300/pulover-alexander-mcqueen-1.jpg" title="Пуловер Alexander McQueen - Mania"> </img></div> <div class="clearfix brand-line"> <div class="brand float-left text-uppercase">Alexander McQueen</div> <div class="size float-right">S</div> </div> </div> <div class="prices-section"> <div class="prices-inner-section"> <div class="price-wrapper clearfix"> <div class="price-title text-uppercase float-left"> Начална цена </div> <div class="price old"> <span>98.00</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left"> -40% </div> <div class="price old"> <span>58.80</span> <span class="currency">лв.</span> </div> </div> <div class="price-wrapper clearfix"> <div class="discount price-title text-uppercase float-left" style="color: #FFF;background-color: #000000"> -40% </div> <div class="price"> <span>35.28</span> <span class="currency">лв.</span> </div> </div> </div> </div> </div> <div class="button button-auction buy-now text-center float-left tooltip prevent-popup-close" data-id="409648" 
data-title="Може да добавите този продукт към количката.">ДОБАВЯМ<img alt="" class="bag-icon" src="dist/assets/icon_bag_button.svg"> </img></div> </a>]
'''
doc = SimplifiedDoc(html)
urls = doc.selects('a.product sellout product-sellout float-left status-1')
print ([(url.href,url['data-producturl']) for url in urls])
Result:
[('https://mania.bg/p/pulover-alexander-mcqueen-p409648', 'https://mania.bg/p/pulover-alexander-mcqueen-p409648')]
find_all already gives you the list of a elements; you just need to get the href from each.
from bs4 import BeautifulSoup
import requests
url = 'https://mania.bg/p/pulover-alexander-mcqueen-p409648'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
for a in soup.find_all(
        'a',
        attrs={'class': 'product sellout product-sellout float-left status-1'}):
    print(a['data-producturl'])
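To make the cause of the TypeError concrete, here is a small sketch on a trimmed-down stand-in for the product markup (the `html` string is a shortened illustration, not the full page). It shows that `.a` on a tag that is already an a element looks for a nested a and returns None, while the attributes sit on the tag itself.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the product markup in the question.
html = '''
<a class="product sellout product-sellout float-left status-1"
   data-producturl="https://mania.bg/p/pulover-alexander-mcqueen-p409648"
   href="https://mania.bg/p/pulover-alexander-mcqueen-p409648">
  <div class="brand">Alexander McQueen</div>
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
urls = soup.select('a.product')

first = urls[0]
# .a searches for an <a> nested inside this <a>; there is none, so this is None.
nested = first.a
# The attributes live on the tag itself; .get() returns None instead of
# raising KeyError when an attribute is missing.
href = first.get('href')
product_url = first.get('data-producturl')

print(nested, href, product_url)
```

Iterating with `for a in urls:` also avoids the `num - 1` indexing, which starts at the last element of the list rather than the first.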

How to get a number from an array

[<div class="ticket_type">‐ Help With Steam Workshop<a
href="javascript:
jsTicketsLast7Days.getOptions().appendValueToParam( 'requestid',
'29' ); jsTicketsLast7Days.getOptions().showSelectedRange( true
); $J('#TicketsLast7Days').get(0).scrollIntoView();"> + </a>
</div>,
<div class="ticket_last_24 report_table_right">
<span>15</span>
<span>(</span><span
class="change_increase">+36%</span><span>)</span>
</div>,
<div class="ticket_last_week report_table_right">
<span>271</span>
<span>(</span><span
class="change_increase">+632%</span><span>)</span>
</div>,
<div class="ticket_waiting_not_elevated
report_table_right">0</div>,
<div class="ticket_waiting_elevated
report_table_right">37</div>,
[]]
How can I get only the numbers 0 and 37 from the "ticket_waiting_not_elevated report_table_right" and "ticket_waiting_elevated report_table_right" divs?
Maybe this could help:
text ="""[<div class="ticket_type">‐ Help With Steam Workshop + </div>,
<div class="ticket_last_24 report_table_right"><span>15</span><span>(</span><span class="change_increase">+36%</span><span>)</span> </div>,
<div class="ticket_last_week report_table_right"> <span>271</span><span>(</span><span class="change_increase">+632%</span><span>)</span></div>,
<div class="ticket_waiting_not_elevated report_table_right">0</div>,
<div class="ticket_waiting_elevated report_table_right">37</div>,
[]]"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
for i in soup.find_all('div', attrs={'class': ['ticket_waiting_not_elevated report_table_right', 'ticket_waiting_elevated report_table_right']}):
    print(i.get('class')[0], ':', i.text)
# Output is: ticket_waiting_not_elevated : 0
# ticket_waiting_elevated : 37
You can use select() for getting your data:
data = """[<div class="ticket_type">‐ Help With Steam Workshop<a
href="javascript:
jsTicketsLast7Days.getOptions().appendValueToParam( 'requestid',
'29' ); jsTicketsLast7Days.getOptions().showSelectedRange( true
); $J('#TicketsLast7Days').get(0).scrollIntoView();"> + </a>
</div>,
<div class="ticket_last_24 report_table_right">
<span>15</span>
<span>(</span><span
class="change_increase">+36%</span><span>)</span>
</div>,
<div class="ticket_last_week report_table_right">
<span>271</span>
<span>(</span><span
class="change_increase">+632%</span><span>)</span>
</div>,
<div class="ticket_waiting_not_elevated
report_table_right">0</div>,
<div class="ticket_waiting_elevated
report_table_right">37</div>,
[]]"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
nums = soup.select('''div.ticket_waiting_not_elevated.report_table_right,
div.ticket_waiting_elevated.report_table_right''')
print([num.text for num in nums])
Prints:
['0', '37']
The soup.select('div.ticket_waiting_not_elevated.report_table_right, div.ticket_waiting_elevated.report_table_right') selects all divs with ticket_waiting_not_elevated report_table_right class or ticket_waiting_elevated report_table_right class.
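If the values are needed as numbers rather than strings, the selected divs can be turned into a small mapping. This is a sketch on a reduced copy of the markup; keying by the first class name and converting with int() are choices made here for illustration, and they assume the div text is always a plain integer.

```python
from bs4 import BeautifulSoup

# Reduced copy of the markup from the question.
html = """
<div class="ticket_last_24 report_table_right"><span>15</span></div>
<div class="ticket_waiting_not_elevated report_table_right">0</div>
<div class="ticket_waiting_elevated report_table_right">37</div>
"""
soup = BeautifulSoup(html, "html.parser")

selector = "div.ticket_waiting_not_elevated, div.ticket_waiting_elevated"
# Key each value by the div's first class name and convert the text to int.
counts = {
    div["class"][0]: int(div.get_text(strip=True))
    for div in soup.select(selector)
}
print(counts)
```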

Logging in to this website using requests module in python3

Here's what I have done so far. I had to extract the CSRFToken manually (I don't know regex, so that part is messy). Is the CSRF token part of the cookies? My cookies only show two other ID-type params, so I dropped the cookie part and did it this way.
import requests
URL = r'http://login.cheezburger.com/'
client = requests.session()
login_page = client.get(URL)
index = login_page.text.find("CSRFToken")
token = login_page.text[index:index+90].split('"')[-2] # This works, I guarantee :)
#print(token) I checked it manually
login_data = {'rlm': 'Shopper',
'for': r'http://login.cheezburger.com/',
'username': 'myusername',
'password': 'mypassword',
'CSRFToken': token}
req = client.post(URL, data=login_data)
Now, there is no error per se, but I am not logging in to this site either. The text of the response shows that I am still stuck on the login page!
The parameters sent are (as shown in the dev tools of Firefox):
rlm: 'Shopper'
for: 'http://login.cheezburger.com/'
username: 'myusername',
password: 'mypassword',
CSRFToken: '8uhhbf67-1233-fff3-123g1-123123fsdfs22'
The website's source is as follows (the part that contains the form data):
<div class="contents-msl">
<h2>Client Login</h2>
<p>Enter username and password</p>
<div class="form-all-msl">
<form action="/login.action" id="loginForm" method="post"
enctype="application/x-www-form-urlencoded"><input type=hidden name=rlm
value="Shopper"><input
type=hidden
name=for
value="http%3a%2f%2flogin%cheezburger%2ecom%2f">
<ul class="form-section-msl">
<label class="form-label-left-msl" for="loginUserName">
Username<span class="form-required">*</span>
</label>
<div class="form-input-msl">
<input type="text" class="form-textbox-msl" id="loginUserName" name="username"
size="20">
</div>
<label class="form-label-left-msl" for="loginPwd">
Password<span class="form-required-msl">*</span>
</label>
<div class="form-input-msl">
<input type="password" class="form-textbox-msl" id="loginPwd" name="password"
size="20">
</div>
<div class="form-input-msl">
<div class="form-single-column-msl">
</div>
</div>
</ul>
</div>
<button class="members-btn-msl" type="submit">Login</button>
<input type="hidden" name="CSRFToken" class="CSRFToken" value="8uhhbf67-1233-fff3-123g1-123123fsdfs22" /> </form>
</div>
</div>
You may want to add some headers:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8,vi;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36',
}
Then pass the headers in your request:
req = client.post(URL, data=login_data, headers=headers)
Good luck!
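Separately from the headers, the form in the question declares action="/login.action", while the posted code sends the data to the page URL itself. Whether that is why the login fails is not confirmed anywhere in this thread, but the sketch below shows one way to read the token and the target URL from the form with BeautifulSoup instead of string slicing. It works on a trimmed inline copy of the form (attribute values quoted for brevity), makes no request, and the names `page_url`, `post_url`, and `form` are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "http://login.cheezburger.com/"
# Trimmed copy of the form from the question (no request is made here).
html = '''
<form action="/login.action" id="loginForm" method="post">
<input type="hidden" name="rlm" value="Shopper">
<input type="text" id="loginUserName" name="username" size="20">
<input type="password" id="loginPwd" name="password" size="20">
<input type="hidden" name="CSRFToken" class="CSRFToken" value="8uhhbf67-1233-fff3-123g1-123123fsdfs22" />
</form>
'''
soup = BeautifulSoup(html, "html.parser")
form = soup.find("form", id="loginForm")

# Resolve the form's action against the page URL; a browser would POST here.
post_url = urljoin(page_url, form["action"])

# Collect every named input (hidden fields included), then fill in credentials.
login_data = {
    inp["name"]: inp.get("value", "")
    for inp in form.find_all("input")
    if inp.get("name")
}
login_data["username"] = "myusername"
login_data["password"] = "mypassword"

print(post_url)
print(login_data["CSRFToken"])
```

With a live session, the page would be fetched with client.get(page_url) first, and the result could then be sent with client.post(post_url, data=login_data, headers=headers).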

Resources