How would I scrape these nested img tags? - python-3.x

I was scraping this site for titles and also trying to scrape the image that goes with each title. It turns out that when scraped, the following data was returned:
<div itemscope itemtype="https://schema.org/ItemList" class="group card-8-group-1 clearfix">
<meta itemprop="itemListOrder" content="https://schema.org/ItemListOrderDescending" />
<article itemprop="itemListElement" itemscope itemtype="https://schema.org/Article" class="card card-1 news-card-1 card-type-article type-article" data-sponsorship-type="card" data-sponsorship-article-id="1qo8sz0z1kaqb1dpj038v8658h" data-sponsorship-article-type="article" data-sponsorship-primary-tag="1pgecmpab62ei1akyb084izq3o" data-sponsorship-secondary-tag="22doj4sgsocqpxw45h607udje">
<a data-side="link" href="/en/news/spurs-investigation-aurier-appears-break-lockdown-protocols/1qo8sz0z1kaqb1dpj038v8658h" itemprop="url" data-sponsorship-slot="card" data-sponsorship-slot-id="front" class="type-article">
<div class="picture article-image" data-module="responsive-picture">
<img class="picture__image picture__image--lazyload" data-srcset="&quality=60&w=640 320w,&quality=60&w=560 480w,&quality=60&w=690 740w,&quality=60&w=800 980w,&quality=60&w=970 1580w" />
<noscript class="picture__polyfill"> <img src="https://images.daznservices.com/di/library/GOAL/5f/da/serge-aurier_191f5i34z69us1fausrs9k0mjk.jpg?t=1445827096&quality=60&h=170" alt="Serge Aurier" /> </noscript>
</div>
<div class="title">
<h3 title="Spurs launch investigation as Aurier appears to break lockdown protocols for a third time" itemprop="headline">Aurier appears to break lockdown protocols for a third time</h3>
<div class="image" data-sponsorship-slot="card" data-sponsorship-slot-id="image"></div>
</div>
It appears the page is using lazy loading. My question is: how can I extract each image at its full size?

To get the full-scale image, just replace w=55 with w=970 (or bigger) in the image URL.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.goal.com/en/premier-league/2kwbbcootiqqgmrzs6o5inle5'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for title, image in zip(soup.select('.card-type-article h3'),
                        soup.select('.card-type-article img')):
    title = title.get_text(strip=True)
    full_img_url = image['src'].replace('w=55', 'w=970')
    print('{:<70}{}'.format(title, full_img_url))
Prints:
Wenger calls for FFP reform amid Newcastle takeover talk https://images.daznservices.com/di/library/GOAL/63/cd/arsene-wenger-2019_13luew9ltpa2g1l1r6ziuxpwbw.jpg?t=1363081390&quality=60&w=970
'Special Havertz is half-Ozil, half-Ballack & would thrive in PL' https://images.daznservices.com/di/library/GOAL/cc/18/kai-havertz_7sugon9o7ljy1fg2xzkv1mqcm.jpg?t=-1186202400&quality=60&w=970
Solskjaer: I'd rather a hole in my squad than an asshole https://images.daznservices.com/di/library/GOAL/78/f2/ole-gunnar-solskjaer-manchester-united-2019-20_1vfk6liknrjlx1r8aumegh4cxe.jpg?t=-749345265&quality=60&w=970
Maguire praises Man Utd's 'safe' training return https://images.daznservices.com/di/library/GOAL/5d/e8/harry-maguire-man-utd_13ewrih27ahmb13i1zxfjrhrp8.jpg?t=-444094625&quality=60&w=970
Jorginho's agent opens door for Juve move https://images.daznservices.com/di/library/GOAL/69/da/jorginho-chelsea-2019-20_15zh5m3ojefx0zl1ei7qsyc14.jpg?t=-1675997073&quality=60&w=970
Premier League clubs near approval for contact training https://images.daznservices.com/di/library/GOAL/79/ce/mohamed-salah-dejan-lovren-liverpool-training_7zq70upa8l1618svdzls077xn.jpg?t=143669454&quality=60&w=970
Ceballos reiterates desire to succeed at Real Madrid https://images.daznservices.com/di/library/GOAL/97/c6/dani-ceballos-arsenal_1sywf8w828w4b193xoz5c82uuf.jpg?t=-1552361252&quality=60&w=970
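Note that in the lazy-loaded markup from the question, the live <img> only carries a data-srcset attribute, so image['src'] may be missing there; the <noscript> fallback does hold a plain src. A hedged sketch (assuming the markup shown in the question) that rewrites the fallback URL instead:
import requests
from bs4 import BeautifulSoup

url = 'https://www.goal.com/en/premier-league/2kwbbcootiqqgmrzs6o5inle5'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for noscript in soup.select('.card-type-article noscript.picture__polyfill'):
    # the <noscript> body is raw markup; re-parse it to reach the fallback <img>
    img = BeautifulSoup(noscript.decode_contents(), 'html.parser').img
    if img and img.get('src'):
        # bump the size parameter in the URL for a larger rendition
        print(img['src'].replace('h=170', 'h=970'))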

Related

How do I scrape the OHLC values from this website

Website in question. Right now I am only performing analysis on the last quarter. If I were to expand to the past 4-5 quarters, would there be a better way of automating this task than doing it manually by setting the time range again and again and then extracting the table values?
What I tried doing:
import bs4 as bs
import requests
import lxml
resp = requests.get("http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx?symbol=HBL")
soup = bs.BeautifulSoup(resp.text, "lxml")
mydivs = soup.findAll("div", {"class": "breadcrumbs"})
print(mydivs)
What I got:
[<div class="breadcrumbs">
<ul>
<li class="breadcrumbs-home">
<a href="#" title="Back To Home">
<i class="fa fa-home"></i>
</a>
</li>
<li>Snapshot / <span id="ContentPlaceHolder1_lbl_companyname">HBL - Habib Bank Ltd.</span> / Historical Prices
</li>
</ul>
</div>, <div class="breadcrumbs" style="background-color:transparent;border-color:transparent;margin-top:20px;">
<ul>
<div class="bootstrap-iso">
<div class="tp-banner-container">
<div class="table-responsive">
<div id="n1">
<table class="table table-bordered table-striped" id="list"><tr><td>Company Wise</td></tr></table>
<div id="pager"></div>
</div>
</div>
</div>
</div>
</ul>
</div>]
Inspecting the source, the table is in the div with class "breadcrumbs" (I found that through "inspect element"), but I don't see where all the values are defined/stored in the page's source. I'm kind of new to web scraping; where should I be looking to extract those values?
Also, there are a total of 7 pages and I'm currently only trying to scrape the table off the first page. How would I go about scraping all the pages of my results and then converting them to a pandas dataframe?
The page loads the data via JavaScript from an external source. By inspecting where the page is making requests, you can load the data with the json module.
You can tweak the parameters in the payload dict to get the data for the date range you want:
import json
import requests
url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'
payload = {"par":"HBL","date1":"07/13/2019","date2":"08/12/2019","rows":20,"page":1,"sidx":"trading_Date","sord":"desc"}
json_data = requests.post(url, json=payload).json()
print(json.dumps(json_data, indent=4))
Prints:
{
    "d": [
        {
            "trading_Date": "/Date(1565290800000)/",
            "trading_open": 111.5,
            "trading_high": 113.24,
            "trading_low": 105.5,
            "trading_close": 106.17,
            "trading_vol": 1349000,
            "trading_change": -4.71
        },
        {
            "trading_Date": "/Date(1565204400000)/",
            "trading_open": 113.94,
            "trading_high": 115.0,
            "trading_low": 110.0,
            "trading_close": 110.88,
            "trading_vol": 1122200,
            "trading_change": -3.48
        },
        ... and so on.
EDIT:
I found the URL from which the page loads the data by looking at the Network tab in Firefox developer tools. It shows the URL, the method the page uses to make the request (POST in this case), and the parameters needed. I copied this URL and the parameters and used them in requests.post() to obtain the JSON data.
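To answer the pagination and DataFrame part: here is a sketch, untested against the live endpoint, that assumes the endpoint honors the rows/page fields seen in the payload above (the question mentions 7 pages of results):
import requests
import pandas as pd

url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'
payload = {"par": "HBL", "date1": "07/13/2019", "date2": "08/12/2019",
           "rows": 20, "sidx": "trading_Date", "sord": "desc"}

records = []
for page in range(1, 8):  # the question mentions 7 pages for this range
    payload["page"] = page
    data = requests.post(url, json=payload).json()["d"]
    if not data:
        break  # no more rows
    records.extend(data)

df = pd.DataFrame(records)
# "/Date(1565290800000)/" is milliseconds since the epoch
df["trading_Date"] = pd.to_datetime(
    df["trading_Date"].str.extract(r"(\d+)")[0].astype("int64"), unit="ms")
print(df.head())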

Extract Text Data from a Div Tag but not a from a Child H3 Tag

I have an HTML snippet that I need to get data from using BeautifulSoup:
<!doctype html>
<html lang="en">
<body>
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
</body>
</html>
I need to get the .text value of the second div from the top "John Doe", but not the .text value inside the h3 tag in that div.
My challenge is that currently I get both text values as in this code snippet:
# Python 3.7, BeautifulSoup 4.7
# html variable is equal to the above HTML snippet
from bs4 import BeautifulSoup
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup4.find_all('div', {'class':'sidebar-box'})
school_head_teacher = school_head_teacher[1].text.strip()
print(school_head_teacher)
This outputs:
Teacher
John Doe
However, I only need the John Doe value.
I offer two solutions. The first is not the most elegant; just off the top of my head, you can split the text again and join together everything after 'Teacher'.
Option 1:
html = '''
<!doctype html>
<html lang="en">
<body>
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
</body>
</html>'''
from bs4 import BeautifulSoup
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup4.find_all('div', {'class':'sidebar-box'})
school_head_teacher = school_head_teacher[1].text.strip()
school_head_teacher = school_head_teacher.split()[1:]
school_head_teacher = ' '.join(school_head_teacher)
print(school_head_teacher)
Output:
John Doe
Option 2:
This one I think is a bit better. You find the tag that contains 'Teacher', then you get its parent tag, and since you want the second part, you use .next_sibling and then strip it:
soup4(text=re.compile('Teacher'))[0].parent.next_sibling.strip()
I had it in a for loop in case there are multiple teachers, but you can substitute the one-liner above for the for loop:
from bs4 import BeautifulSoup
import re
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
for elem in soup4(text=re.compile('Teacher')):
    print(elem.parent.next_sibling.strip())
Another option:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
teacher_name = soup.find_all('div', class_='sidebar-box')
print(teacher_name[1].contents[2].strip())
Output:
John Doe
Since, in
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
John Doe is the next sibling of <h3><i class="fa fa-male"></i> Teacher</h3>, we can use a combination of find_next() and next_sibling on <div class="sidebar-box">:
html = '''<!doctype html>
<html lang="en">
<body>
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
</body>
</html>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup.find_all('div', {'class':'sidebar-box'})
head_teacher = school_head_teacher[1].find_next().next_sibling
print(head_teacher)
This way you can loop over the other divs that follow the same pattern too:
for school_info in school_head_teacher:
    print(school_info.find_next().next_sibling)
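Extending that idea, a short sketch (using the same html string as above) that gathers every label/value pair into a dict in one pass, assuming each box follows the <h3> + loose text pattern:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

info = {}
for box in soup.find_all('div', class_='sidebar-box'):
    h3 = box.find('h3')
    # the value is the loose text node right after the <h3>
    info[h3.get_text(strip=True)] = h3.next_sibling.strip()

print(info['Teacher'])  # John Doe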

How to extract multiple text outside tags with BeautifulSoup?

I want to scrape a web page (German complaint website) using BeautifulSoup. Here is a good example (https://de.reclabox.com/beschwerde/44870-deutsche-bahn-berlin-erstattungsbetrag-sparpreisticket)
<div id="comments" class="kt">
<a name="comments"></a>
<span class="bb">Kommentare und Trackbacks (7)</span>
<br><br><br>
<a id="comment100264" name="comment100264"></a>
<div class="data">
19.12.2011 | 11:04
</div>
von Tom K.
<!--
-->
| <a class="flinko" href="/users/login?functionality_required=1">Regelverstoß melden</a>
<div class="linea"></div>
TEXT I AM INTERESTED IN<br><br>MORE TEXT I AM INTERESTED IN<br><br>MORE TEXT I AM INTERESTED IN
<br><br>
<a id="comment100265" name="comment100265"></a>
<div class="data">
19.12.2011 | 11:11
</div>
von Tom K.
<!--
-->
| <a class="flinko" href="/users/login?functionality_required=1">Regelverstoß melden</a>
<div class="linea"></div>
TEXT I AM INTERESTED IN<br><br>MORE TEXT I AM INTERESTED IN
<br><br>
<a id="comment101223" name="comment101223"></a>
<div class="commentbox comment-not-yet-solved">
<div class="data">
25.12.2011 | 10:14
</div>
von ReclaBoxler-4134668
<!--
--><img alt="noch nicht gelöste Beschwerde" src="https://a1.reclabox.com/assets/live_tracking/not_yet_solve-dbf4769c625b73b23618047471c72fa45bacfeb1cf9058655c4d75aecd6e0277.png" title="noch nicht gelöste Beschwerde">
| <a class="flinko" href="/users/login?functionality_required=1">Regelverstoß melden</a>
<div class="linea"></div>
TEXT I AM NOT INTERESTED IN <br><br>TEXT I AM NOT INTERESTED IN
</div>
<br><br>
<a id="comment101237" name="comment101237"></a>
<div class="data">
25.12.2011 | 11:01
</div>
von ReclaBoxler-3315297
<!--
-->
| <a class="flinko" href="/users/login?functionality_required=1">Regelverstoß melden</a>
<div class="linea"></div>
TEXT I AM INTERESTED IN
<br><br>
etc...
<br><br>
<br><br>
</div>
I was able to scrape most of the content I want (thanks to a lot of Q&As I read here :-)), except for the comments in <div id="comments" class="kt"> that are not inside a class="commentbox" element (I already got the commentboxes with another command). The comments outside the comment boxes don't seem to sit in a normal tag, which is why I did not manage to get them via soup.find(_all). I'd like to scrape these comments as well as the name of the person posting ("von ...") and the date and time (<div class="data">).
It would be absolutely fantastic if someone knows how to solve this one. Thanks in advance for your help!
A common way to extract all the text from a page is as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "xxxxxxxx"  # URL goes here
soup = BeautifulSoup(urlopen(url).read(), "html.parser")
print(soup.get_text())
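That dumps everything, though, including the commentbox comments you want to skip. A more targeted sketch (untested against the live page, assuming the markup shown above): each wanted comment is a loose text node sitting right after a <div class="linea"> separator, and the boxed comments keep their linea nested inside the commentbox div, so restricting the search to direct children of the comments container skips them. The date sits in the div.data just before each separator, and the "von ..." author text could be pulled the same way.
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(html, "html.parser")  # html = the fetched page source
comments = soup.find("div", id="comments")

# direct children only: the linea of a boxed comment is nested one level deeper
for linea in comments.find_all("div", class_="linea", recursive=False):
    date = linea.find_previous("div", class_="data").get_text(strip=True)
    parts = []
    for sib in linea.next_siblings:
        if isinstance(sib, NavigableString):
            text = sib.strip()
            if text:
                parts.append(text)
        elif sib.name == "a" and sib.get("id", "").startswith("comment"):
            break  # anchor of the next comment: stop collecting
    if parts:
        print(date, "|", " ".join(parts))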

How to create list of web elements?

I am trying to make a list of web elements, but Selenium cannot seem to find the elements on the web page, although it worked 3 days ago and I cannot find any changes in the web page.
This is the HTML:
<li id="wlg_41410" class="leagueWindow " dataid="41410">
<h5 style="cursor: pointer; cursor: hand;" onclick="TodaysEventsLeagueWindow.minimizeRestoreClick(41410)">Europa League</h5>
<div class="bet_type select" id="_bet_types"></div>
<div class="bet_type lastscore ">
<h6>1X2 FT </h6>
<div class="types_bg">
<!--[if IE]> <div id="IEroot"> <![endif]-->
<div class="first_buttons_line">
</div>
<!--[if IE]> </div> <![endif]-->
<div class="time"> 23/11 | 18:00 </div>
<div class="bets ml">
</div>
<div class="time"> 23/11 | 20:00 </div>
<div class="bets ml">
</div>
<div class="time"> 23/11 | 20:00 </div>
<div class="bets ml">
</div>
<div class="time"> 23/11 | 20:00 </div>
<div class="bets ml">
</div>
<div class="time"> 23/11 | 20:00 </div>
<div class="bets ml">
</div>
<div class="clr"></div>
</div>
</div> <span class="x" onclick="TodaysEventsLeagueWindow.closeLeagueWindow(41410)"></span>
</li>
I am trying to make a list from the <div class="bets ml"></div> elements,
but I keep getting selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document, as if Selenium can't find the web element.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException
import time

driver = webdriver.Chrome()  # driver setup not shown in the original question
driver.get("https://www.luckia.es/apuestas")
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it("sbtechBC"))
eventos_de_hoy = driver.find_element_by_id("today_event_btn")
eventos_de_hoy.click()
ligi_len = len(WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "leagueWindow "))))
print(ligi_len)

for index in range(ligi_len):
    item = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "leagueWindow ")))[index]
    driver.execute_script("arguments[0].scrollIntoView(true);", item)
    nume_liga = item.find_element_by_tag_name("h5").text
    time.sleep(3)
    print('try', nume_liga)
    meci = item.find_elements_by_xpath("//*[@class='bets ml']")
    print("there are", len(meci), "in one liga")
The reason for re-finding by index is that the iframe refreshes every 25 seconds.
I also tried meci = item.find_elements_by_css_selector('.bets.ml') and meci = item.find_elements_by_class_name('ml').
Why am I able to extract the <h5></h5> element but not the other elements?
From your code block, it's pretty clear you have just managed to cover up the real issue with time.sleep(3), as follows:
nume_liga = item.find_element_by_tag_name("h5").text
time.sleep(3)
print('try', nume_liga)
While invoking print() on a text value, I am not sure why time.sleep(3) was added; that is where the main issue got covered up. Because the list was already created, you are still able to print('try', nume_liga).
But next, when you do meci = item.find_elements_by_xpath("//*[@class='bets ml']"), you face a StaleElementReferenceException because the HTML DOM has changed.
A closer look at the <h5> tag reveals it has an onclick event:
<h5 style="cursor: pointer; cursor: hand;" onclick="TodaysEventsLeagueWindow.minimizeRestoreClick(41410)">Europa League</h5>
A wild guess: invoking .text on the <h5> tag triggers a change in the HTML DOM.
Solution:
A possible solution with your current code block is to use get_attribute("innerHTML") instead of .text. So your line of code will be:
nume_liga = item.find_element_by_tag_name("h5").get_attribute("innerHTML")
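Beyond that, since the iframe re-renders every ~25 seconds, any held WebElement can go stale between calls. A common pattern, sketched here in the question's Selenium 3 style rather than as a guaranteed fix, is to re-locate the element inside a small retry loop; note the leading dot in the XPath, which restricts the search to descendants of item instead of the whole DOM:
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def count_bets(driver, index, retries=3):
    for _ in range(retries):
        try:
            # re-locate the league window on every attempt
            items = WebDriverWait(driver, 10).until(
                EC.presence_of_all_elements_located((By.CLASS_NAME, "leagueWindow ")))
            item = items[index]
            # ".//" searches inside `item`; "//" would search the whole page
            return len(item.find_elements_by_xpath(".//*[@class='bets ml']"))
        except StaleElementReferenceException:
            continue  # the iframe re-rendered mid-read; try again
    raise StaleElementReferenceException("element kept going stale")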

How to extract value from href in python?

Hi developers. I am facing a problem extracting an href value in Python.
There is a button; after clicking "View Answers" it takes me to another link, and I want to extract the data present at that link.
<div class="col-md-11 col-xs-12">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<div class="hover-div">
<h2 itemprop="name">i need a good Orthopedic dr</h2>
</div>
</a>
<div class="thread-details">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<p class="pull-left"><span class="glyphicon glyphicon-comment"></span> View Answers (<span itemprop="answerCount">1</span>) </p>
</a>
</div>
</div>
I need to extract this href value.
You can use Beautiful Soup, a Python library for pulling data out of HTML and XML files:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen("Your URL WILL GO HERE").read()
soup = bs.BeautifulSoup(sauce,'html5lib')
print(soup)
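That prints the whole parsed page; to pull just the href values, you can filter the anchors. A short sketch on the soup from above (the div.thread-details scoping is an assumption taken from the question's markup):
# every link on the page
for a in soup.find_all('a', href=True):
    print(a['href'])

# or only the "View Answers" thread links
for a in soup.select('div.thread-details a[href]'):
    print(a['href'])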
