Website in question. Right now I am only performing analysis on the last quarter; if I were to expand to the past 4-5 quarters, would there be a better way of automating this task rather than doing it manually by setting the time range again and again and then extracting the table values?
What I tried doing:
import bs4 as bs
import requests
import lxml
resp = requests.get("http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx?symbol=HBL")
soup = bs.BeautifulSoup(resp.text, "lxml")
mydivs = soup.findAll("div", {"class": "breadcrumbs"})
print(mydivs)
What I got:
[<div class="breadcrumbs">
  <ul>
    <li class="breadcrumbs-home">
      <a href="#" title="Back To Home">
        <i class="fa fa-home"></i>
      </a>
    </li>
    <li>Snapshot / <span id="ContentPlaceHolder1_lbl_companyname">HBL - Habib Bank Ltd.</span> / Historical Prices
    </li>
  </ul>
</div>, <div class="breadcrumbs" style="background-color:transparent;border-color:transparent;margin-top:20px;">
  <ul>
    <div class="bootstrap-iso">
      <div class="tp-banner-container">
        <div class="table-responsive">
          <div id="n1">
            <table class="table table-bordered table-striped" id="list"><tr><td>Company Wise</td></tr></table>
            <div id="pager"></div>
          </div>
        </div>
      </div>
    </div>
  </ul>
</div>]
Inspecting the source, the table is in the div with class "breadcrumbs" (I found that through "inspect element"), but I don't see the place where all the values are defined/stored in the page's source. I'm kinda new to web scraping; where should I be looking to extract those values here?
Also, there are a total of 7 pages and I'm currently only trying to scrape the table off the first page. How would I go about scraping all 7 pages of my results and then converting them to a pandas DataFrame?
The page loads the data via JavaScript from an external source. By inspecting where the page is making requests, you can load the data with the json module.
You can tweak the parameters in the payload dict to get the data for the date range you want:
import json
import requests

# endpoint the page queries in the background for the table data
url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'
# date1/date2 set the date range, page selects the results page
payload = {"par": "HBL", "date1": "07/13/2019", "date2": "08/12/2019", "rows": 20, "page": 1, "sidx": "trading_Date", "sord": "desc"}

json_data = requests.post(url, json=payload).json()
print(json.dumps(json_data, indent=4))
Prints:
{
    "d": [
        {
            "trading_Date": "/Date(1565290800000)/",
            "trading_open": 111.5,
            "trading_high": 113.24,
            "trading_low": 105.5,
            "trading_close": 106.17,
            "trading_vol": 1349000,
            "trading_change": -4.71
        },
        {
            "trading_Date": "/Date(1565204400000)/",
            "trading_open": 113.94,
            "trading_high": 115.0,
            "trading_low": 110.0,
            "trading_close": 110.88,
            "trading_vol": 1122200,
            "trading_change": -3.48
        },
... and so on.
EDIT:
I found the URL the page loads its data from by looking at the Network tab in the Firefox developer tools. There you can see the URL, the method the page uses to make the request (POST in this case), and the parameters needed.
I copied this URL and the parameters and used them in the requests.post() method to obtain the JSON data.
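To cover all 7 pages (or several quarters at once) without resetting the time range by hand, you can widen date1/date2 and loop over the page parameter, collecting everything into a pandas DataFrame. A minimal sketch, assuming the endpoint returns an empty "d" list once you request a page past the last one:

import requests
import pandas as pd

url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'

rows = []
page = 1
while True:
    # same payload as above; date1/date2 widened to cover several quarters
    payload = {"par": "HBL", "date1": "07/13/2018", "date2": "08/12/2019",
               "rows": 20, "page": page, "sidx": "trading_Date", "sord": "desc"}
    data = requests.post(url, json=payload).json()['d']
    if not data:  # assumption: an empty "d" list means we are past the last page
        break
    rows.extend(data)
    page += 1

df = pd.DataFrame(rows)
# trading_Date comes back as "/Date(<milliseconds>)/"; convert to real timestamps
df['trading_Date'] = pd.to_datetime(
    df['trading_Date'].str.extract(r'(\d+)')[0].astype('int64'), unit='ms')
print(df.head())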
Playing with web scraping in Python 3. I'm trying to scrape the following HTML page using the BeautifulSoup library:
<div class="container">
<div item-id="1" class="container1 item details">
<div class="item-name">
<div class="item-business-name">
<h3>My Business #1</h3>
</div>
</div>
<div class="item-location">
<div class="item-address">
<p>My Address #1</p>
</div>
</div>
<div class="item-contact">
<div class="item-email">
<p>My Email #1</p>
</div>
</div>
</div>
<div item-id="2" class="container2 item details">
<div class="item-name">
<div class="item-business-name">
<h3>My Business #2</h3>
</div>
</div>
<div class="item-location">
<div class="item-address">
<p>My Address #2</p>
</div>
</div>
<div class="item-contact">
<div class="item-email">
<p>My Email #2</p>
</div>
</div>
</div>
</div>
As the naming pattern of the required container (e.g. div item-id="1" class="container1 item details") changes for each item, I would appreciate a piece of advice on how to scrape item-business-name, item-address, item-contact for all items. There are 100 items on the page; hence, the last container is div item-id="100" class="container100 item details".
What is the best way to get all the items in the list at the "container 1-100" level? After that, I know how to get each item :)
I was thinking of something like:
n = 0
while n < 101:
    item = soup.find_all(class_=f"container{n} item details")
    n = n + 1
print(item)
Unfortunately, it does not give me the required list.
You don't need the numbers in the id or class names at all. Just use the zip() function to tie together all of the information for each contact.
The variable txt contains the HTML string from your question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

out = []
for name, address, email in zip(soup.select('.item-business-name'),
                                soup.select('.item-address'),
                                soup.select('.item-email')):
    out.append([name.get_text(strip=True), address.get_text(strip=True), email.get_text(strip=True)])

# pretty print all rows:
for no, row in enumerate(out, 1):
    print('{}.'.format(no), ('{:<20}'*3).format(*row))
Prints:
1. My Business #1 My Address #1 My Email #1
2. My Business #2 My Address #2 My Email #2
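If you do want the elements at the container level themselves, you can match on the classes all of them share (item and details) and ignore the numbered containerN class entirely; a minimal sketch against the same txt:

from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

# every container shares the classes "item" and "details",
# so the numbered "containerN" class never has to be spelled out
for item in soup.select('div.item.details'):
    name = item.select_one('.item-business-name').get_text(strip=True)
    address = item.select_one('.item-address').get_text(strip=True)
    email = item.select_one('.item-email').get_text(strip=True)
    print(name, address, email)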
I was scraping this site for titles and also trying to scrape the image that goes with each title. It turns out that when scraped, the following data was returned:
<div itemscope itemtype="https://schema.org/ItemList" class="group card-8-group-1 clearfix">
  <meta itemprop="itemListOrder" content="https://schema.org/ItemListOrderDescending" />
  <article itemprop="itemListElement" itemscope itemtype="https://schema.org/Article" class="card card-1 news-card-1 card-type-article type-article" data-sponsorship-type="card" data-sponsorship-article-id="1qo8sz0z1kaqb1dpj038v8658h" data-sponsorship-article-type="article" data-sponsorship-primary-tag="1pgecmpab62ei1akyb084izq3o" data-sponsorship-secondary-tag="22doj4sgsocqpxw45h607udje">
    <a data-side="link" href="/en/news/spurs-investigation-aurier-appears-break-lockdown-protocols/1qo8sz0z1kaqb1dpj038v8658h" itemprop="url" data-sponsorship-slot="card" data-sponsorship-slot-id="front" class="type-article">
      <div class="picture article-image" data-module="responsive-picture">
        <img class="picture__image picture__image--lazyload" data-srcset="&quality=60&w=640 320w,&quality=60&w=560 480w,&quality=60&w=690 740w,&quality=60&w=800 980w,&quality=60&w=970 1580w" />
        <noscript class="picture__polyfill"> <img src="https://images.daznservices.com/di/library/GOAL/5f/da/serge-aurier_191f5i34z69us1fausrs9k0mjk.jpg?t=1445827096&quality=60&h=170" alt="Serge Aurier" /> </noscript>
      </div>
      <div class="title">
        <h3 title="Spurs launch investigation as Aurier appears to break lockdown protocols for a third time" itemprop="headline">Aurier appears to break lockdown protocols for a third time</h3>
        <div class="image" data-sponsorship-slot="card" data-sponsorship-slot-id="image"></div>
      </div>
It appears the page is using lazy loading. My question is: how can I extract the image at its full size?
To get the full-scale image, just replace w=55 with w=970 (or bigger) in the image URL.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.goal.com/en/premier-league/2kwbbcootiqqgmrzs6o5inle5'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for title, image in zip(soup.select('.card-type-article h3'),
                        soup.select('.card-type-article img')):
    title = title.get_text(strip=True)
    full_img_url = image['src'].replace('w=55', 'w=970')
    print('{:<70}{}'.format(title, full_img_url))
Prints:
Wenger calls for FFP reform amid Newcastle takeover talk https://images.daznservices.com/di/library/GOAL/63/cd/arsene-wenger-2019_13luew9ltpa2g1l1r6ziuxpwbw.jpg?t=1363081390&quality=60&w=970
'Special Havertz is half-Ozil, half-Ballack & would thrive in PL' https://images.daznservices.com/di/library/GOAL/cc/18/kai-havertz_7sugon9o7ljy1fg2xzkv1mqcm.jpg?t=-1186202400&quality=60&w=970
Solskjaer: I'd rather a hole in my squad than an asshole https://images.daznservices.com/di/library/GOAL/78/f2/ole-gunnar-solskjaer-manchester-united-2019-20_1vfk6liknrjlx1r8aumegh4cxe.jpg?t=-749345265&quality=60&w=970
Maguire praises Man Utd's 'safe' training return https://images.daznservices.com/di/library/GOAL/5d/e8/harry-maguire-man-utd_13ewrih27ahmb13i1zxfjrhrp8.jpg?t=-444094625&quality=60&w=970
Jorginho's agent opens door for Juve move https://images.daznservices.com/di/library/GOAL/69/da/jorginho-chelsea-2019-20_15zh5m3ojefx0zl1ei7qsyc14.jpg?t=-1675997073&quality=60&w=970
Premier League clubs near approval for contact training https://images.daznservices.com/di/library/GOAL/79/ce/mohamed-salah-dejan-lovren-liverpool-training_7zq70upa8l1618svdzls077xn.jpg?t=143669454&quality=60&w=970
Ceballos reiterates desire to succeed at Real Madrid https://images.daznservices.com/di/library/GOAL/97/c6/dani-ceballos-arsenal_1sywf8w828w4b193xoz5c82uuf.jpg?t=-1552361252&quality=60&w=970
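If you are parsing the raw (non-rendered) markup, where src is missing and only data-srcset is populated, you can pick the widest candidate from it instead. A sketch, assuming the attribute holds full "URL <width>w" pairs rather than the truncated query strings shown in the snippet above:

def widest_from_srcset(srcset):
    # srcset entries look like "<url> <width>w" and are comma-separated;
    # return the URL with the largest width descriptor
    candidates = []
    for entry in srcset.split(','):
        url, _, width = entry.strip().rpartition(' ')
        candidates.append((int(width.rstrip('w')), url))
    return max(candidates)[1]

# usage: widest_from_srcset(image['data-srcset'])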
I'm trying to get the class attribute of the parent element:
<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
print(span_autogoal.find_parent('div')['class'])
# print(span_autogoal.find_parent('div').get('class'))
Output:
<span class="note-name">(Autogoal)</span>
['detailMS__incidentRow', 'incidentRow--away', 'odd']
I know I can do something like this:
print(' '.join(span_autogoal.find_parent('div')['class']))
But I want to know why this is happening, and whether it is possible to do this more correctly.
The above answer is correct; however, if you want the multi-valued attribute returned as a single string, try using the xml parser after getting the parent element.
from bs4 import BeautifulSoup

data = '''<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>'''

soup = BeautifulSoup(data, 'lxml')
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)

# re-parse the parent div with the xml parser, which has no notion of
# multi-valued attributes and therefore keeps class as a plain string
parentdiv = span_autogoal.find_parent('div')
data = str(parentdiv)
soup = BeautifulSoup(data, 'xml')
print(soup.div['class'])
Output on console:
<span class="note-name">(Autogoal)</span>
detailMS__incidentRow incidentRow--away odd
According to the BeautifulSoup documentation:
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
So in your case, in <div class="detailMS__incidentRow incidentRow--away odd">, the class attribute is multi-valued.
That's why span_autogoal.find_parent('div')['class'] gives you list as an output.
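You can see the difference between the two parsers directly: html parsers treat class as multi-valued, while the xml parser does not. A quick demonstration:

from bs4 import BeautifulSoup

html = '<p class="body strikeout"></p>'
# html parsers split class into a list of values
print(BeautifulSoup(html, 'html.parser').p['class'])  # ['body', 'strikeout']
# the xml parser has no multi-valued attributes and keeps the raw string
print(BeautifulSoup(html, 'xml').p['class'])          # 'body strikeout'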
If I have a snippet of html like this:
<p><br><p>
<li>stuff</li>
<li>stuff</li>
Is there a way to clean this and add the missing ul/ol tags using beautiful soup, or another python library?
I tried soup.prettify() but it left it as is.
It doesn't seem like there's a built-in method which wraps groups of li elements into a ul. However, you can simply loop over the li elements, identify the first element of each li group, and wrap it in ul tags. The next elements in the group are appended to the previously created ul:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
ulgroup = 0
uls = []
for li in soup.findAll('li'):
    previous_element = li.findPrevious()
    # if <li> already wrapped in <ul>, do nothing
    if previous_element and previous_element.name == 'ul':
        continue
    # if <li> is the first element of a <li> group, wrap it in a new <ul>
    if not previous_element or previous_element.name != 'li':
        ulgroup += 1
        ul = soup.new_tag("ul")
        li.wrap(ul)
        uls.append(ul)
    # append rest of <li> group to previously created <ul>
    elif ulgroup > 0:
        uls[ulgroup-1].append(li)
print(soup.prettify())
For example, the following input:
html = '''
<p><br><p>
<li>stuff1</li>
<li>stuff2</li>
<div></div>
<li>stuff3</li>
<li>stuff4</li>
<li>stuff5</li>
'''
outputs:
<p>
 <br/>
 <p>
  <ul>
   <li>
    stuff1
   </li>
   <li>
    stuff2
   </li>
  </ul>
  <div>
  </div>
  <ul>
   <li>
    stuff3
   </li>
   <li>
    stuff4
   </li>
   <li>
    stuff5
   </li>
  </ul>
 </p>
</p>
Demo: https://repl.it/#glhr/55619920-fixing-uls
First, you have to decide which parser you are going to use. Different parsers treat malformed html differently.
The following BeautifulSoup methods will help you accomplish what you require
new_tag() - create a new ul tag
append() - To append the newly created ul tag somewhere in the soup tree.
extract() - To extract the li tags one by one (which we can append to the ul tag)
decompose() - To remove any unwanted tags from the tree. Which may be formed as a result of the parser's interpretation of the malformed html.
My Solution
Let's create a soup object using the html5lib parser and see what we get:
from bs4 import BeautifulSoup

html = """
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup = BeautifulSoup(html, 'html5lib')
print(soup)
Outputs:
<html><head></head><body><p><br/></p><p>
</p><li>stuff</li>
<li>stuff</li>
</body></html>
The next step may vary according to what you want to accomplish. I want to remove the second, empty p, add a new ul tag, and get all the li tags inside it.
from bs4 import BeautifulSoup

html = """
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup = BeautifulSoup(html, 'html5lib')

# remove the empty second <p> created by the parser
second_p = soup.find_all('p')[1]
second_p.decompose()

# create a <ul>, append it to <body>, then move each <li> into it
ul_tag = soup.new_tag('ul')
soup.find('body').append(ul_tag)
for li_tag in soup.find_all('li'):
    ul_tag.append(li_tag.extract())
print(soup.prettify())
Outputs:
<html>
 <head>
 </head>
 <body>
  <p>
   <br/>
  </p>
  <ul>
   <li>
    stuff
   </li>
   <li>
    stuff
   </li>
  </ul>
 </body>
</html>
Hi developers. I am facing a problem extracting an href value in Python.
There is a button; after clicking on "View Answers" it takes me to the next link, and I want to extract the data that is present at that link.
<div class="col-md-11 col-xs-12">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic- dr">
<div class="hover-div">
<h2 itemprop="name">i need a good Orthopedic dr</h2>
</div>
</a>
<div class="thread-details">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<p class="pull-left"><span class="glyphicon glyphicon-comment"></span> View Answers (<span itemprop="answerCount">1</span>) </p>
</a>
</div>
</div>
I need to extract this href value.
You can use data scraping in Python.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen("Your URL WILL GO HERE").read()
soup = bs.BeautifulSoup(sauce, 'html5lib')
print(soup)
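To actually pull the href out of markup like the snippet in the question, select the anchor inside the thread-details div and read its href attribute. A minimal sketch; the URL is a placeholder for whichever forum page you are scraping:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen("https://www.marham.pk/forum").read()  # placeholder URL
soup = bs.BeautifulSoup(sauce, 'html5lib')

# each "View Answers" link sits inside a div with class "thread-details"
for a in soup.select('div.thread-details a[href]'):
    print(a['href'])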