How to scrape a string from the div tag using Selenium and Python? - python-3.x

I have source code like the snippet below and I'm trying to scrape out the '11 tigers' string. I'm new to XPath; can anyone suggest how to get it using Selenium or Beautiful Soup? I'm thinking driver.find_element_by_xpath or soup.find_all.
source:
<div class="count-box fixed_when_handheld s-vgLeft0_5 s-vgPullBottom1 s-vgRight0_5 u-colorGray6 u-fontSize18 u-fontWeight200" style="display: block;">
<div class="label-container u-floatLeft">11 tigers</div>
<div class="u-floatRight">
<div class="hide_when_tablet hide_when_desktop s-vgLeft0_5 s-vgRight0_5 u-textAlignCenter">
<div class="js-show-handheld-filters c-button c-button--md c-button--blue s-vgRight1">
Filter
</div>
<div class="js-save-handheld-filters c-button c-button--md c-button--transparent">
Save
</div>
</div>
</div>
<div class="cb"></div>
</div>

You can use the same .count-box .label-container CSS selector for both Beautiful Soup and Selenium.
BS:
from bs4 import BeautifulSoup

page = BeautifulSoup(yourhtml, "html.parser")
# if you need the first one
label = page.select_one(".count-box .label-container").text
# if you need all of them
labels = page.select(".count-box .label-container")
for label in labels:
    print(label.text)
Selenium:
labels = driver.find_elements_by_css_selector(".count-box .label-container")
for label in labels:
    print(label.text)
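Note that the find_elements_by_* helpers were removed in Selenium 4. If you are on a newer release, an equivalent sketch (assuming driver has already been created) would be:
from selenium.webdriver.common.by import By

# Selenium 4 style: locate by CSS selector via the By class
labels = driver.find_elements(By.CSS_SELECTOR, ".count-box .label-container")
for label in labels:
    print(label.text)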

Variant of the answer given by Sers.
from bs4 import BeautifulSoup

page = BeautifulSoup(html_text, "lxml")
# first one
label = page.find('div', {'class': 'label-container'}).text
# for all
labels = page.find_all('div', {'class': 'label-container'})
for label in labels:
    print(label.text)
Use the lxml parser, as it's faster. You need to install it explicitly via pip install lxml.

To extract the text 11 tigers you can use either of the following solutions:
Using css_selector:
my_text = driver.find_element_by_css_selector("div.count-box>div.label-container.u-floatLeft").get_attribute("innerHTML")
Using xpath:
my_text = driver.find_element_by_xpath("//div[contains(@class, 'count-box')]/div[@class='label-container u-floatLeft']").get_attribute("innerHTML")
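If the count box is rendered dynamically, it can help to wait for the element before reading it. A minimal sketch using Selenium's explicit waits, assuming the same CSS selector as above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the label to become visible, then read its innerHTML
my_text = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located(
        (By.CSS_SELECTOR, "div.count-box > div.label-container.u-floatLeft"))
).get_attribute("innerHTML")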

Related

How do I scrape the OHLC values from this website

Website in question. Right now I am only performing analysis on the last quarter; if I were to expand to the past 4-5 quarters, would there be a better way of automating this task than doing it manually by setting the time range again and again and then extracting the table values?
What I tried doing:
import bs4 as bs
import requests
import lxml
resp = requests.get("http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx?symbol=HBL")
soup = bs.BeautifulSoup(resp.text, "lxml")
mydivs = soup.findAll("div", {"class": "breadcrumbs"})
print(mydivs)
What I got:
[<div class="breadcrumbs">
<ul>
<li class="breadcrumbs-home">
<a href="#" title="Back To Home">
<i class="fa fa-home"></i>
</a>
</li>
<li>Snapshot / <span id="ContentPlaceHolder1_lbl_companyname">HBL - Habib Bank Ltd.</span> / Historical Prices
</li>
</ul>
</div>, <div class="breadcrumbs" style="background-color:transparent;border-color:transparent;margin-top:20px;">
<ul>
<div class="bootstrap-iso">
<div class="tp-banner-container">
<div class="table-responsive">
<div id="n1">
<table class="table table-bordered table-striped" id="list"><tr><td>Company Wise</td></tr></table>
<div id="pager"></div>
</div>
</div>
</div>
</div>
</ul>
</div>]
Inspecting the source, the table is in the div with class "breadcrumbs" (I found that through the browser's "inspect element"), but I don't see where all the values are defined or stored in the page source. I'm new to web scraping; where should I be looking to extract those values?
Also, there are a total of 7 pages and I'm currently only trying to scrape the table from the first page. How would I go about scraping all of the pages of results and then converting them to a pandas DataFrame?
The page loads the data via JavaScript from an external source. By inspecting where the page makes its requests, you can load the data with the json module.
You can tweak the parameters in the payload dict to get the data for the date range you want:
import json
import requests
url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'
payload = {"par":"HBL","date1":"07/13/2019","date2":"08/12/2019","rows":20,"page":1,"sidx":"trading_Date","sord":"desc"}
json_data = requests.post(url, json=payload).json()
print(json.dumps(json_data, indent=4))
Prints:
{
"d": [
{
"trading_Date": "/Date(1565290800000)/",
"trading_open": 111.5,
"trading_high": 113.24,
"trading_low": 105.5,
"trading_close": 106.17,
"trading_vol": 1349000,
"trading_change": -4.71
},
{
"trading_Date": "/Date(1565204400000)/",
"trading_open": 113.94,
"trading_high": 115.0,
"trading_low": 110.0,
"trading_close": 110.88,
"trading_vol": 1122200,
"trading_change": -3.48
},
... and so on.
EDIT:
I found the URL the page loads its data from by looking at the Network tab in the Firefox developer tools. The Network tab shows the URL, the request method the page uses (POST in this case), and the parameters needed.
I copied this URL and the parameters and used them in the requests.post() method to obtain the JSON data.
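As for the rest of the question (all 7 pages and a pandas DataFrame): a minimal sketch, assuming the endpoint keeps accepting the same payload with a different "page" value and that the columns match the JSON shown above:
import pandas as pd
import requests

url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'
rows = []
for page in range(1, 8):  # pages 1 through 7
    payload = {"par": "HBL", "date1": "07/13/2019", "date2": "08/12/2019",
               "rows": 20, "page": page, "sidx": "trading_Date", "sord": "desc"}
    # each response carries the rows for that page under the "d" key
    rows.extend(requests.post(url, json=payload).json()["d"])

df = pd.DataFrame(rows)  # columns: trading_Date, trading_open, trading_high, ...
print(df.head())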

Why does attribute splitting happen in BeautifulSoup?

I'm trying to get the attribute of the parent element:
<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
print(span_autogoal.find_parent('div')['class'])
# print(span_autogoal.find_parent('div').get('class'))
Output:
<span class="note-name">(Autogoal)</span>
['detailMS__incidentRow', 'incidentRow--away', 'odd']
I know I can do something like this:
print(' '.join(span_autogoal.find_parent('div')['class']))
But I want to know why this happens and whether there is a more correct way to do it.
The above answer is correct; however, if you want a multi-valued attribute returned as a single string, try using the xml parser after getting the parent element.
from bs4 import BeautifulSoup

data = '''<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>'''

soup = BeautifulSoup(data, 'lxml')
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)

# re-parse the parent <div> with the xml parser, which keeps class as a single string
parentdiv = span_autogoal.find_parent('div')
soup = BeautifulSoup(str(parentdiv), 'xml')
print(soup.div['class'])
Output on console:
<span class="note-name">(Autogoal)</span>
detailMS__incidentRow incidentRow--away odd
According to the BeautifulSoup documentation:
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is class (that is, a tag can have more than one
CSS class). Others include rel, rev, accept-charset, headers, and
accesskey. Beautiful Soup presents the value(s) of a multi-valued
attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
So in your case, in <div class="detailMS__incidentRow incidentRow--away odd">, the class attribute is multi-valued.
That's why span_autogoal.find_parent('div')['class'] gives you a list as output.
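If you would rather get the raw attribute string in the first place, newer Beautiful Soup releases (4.8+, if I recall correctly) let you switch the splitting off by passing multi_valued_attributes=None to the constructor; a small sketch:
from bs4 import BeautifulSoup

html = '<div class="detailMS__incidentRow incidentRow--away odd"></div>'

# with multi-valued attribute handling disabled, class comes back as one string
soup = BeautifulSoup(html, 'html.parser', multi_valued_attributes=None)
print(soup.div['class'])
# detailMS__incidentRow incidentRow--away odd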

extract content wherever we have a div tag followed by a header tag using beautifulsoup

I am trying to extract div tags and header tags when they are together.
ex:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried the solution provided in the link below; there the header tag is inside the div tag, but my requirement is a div tag after the header tag.
Scraping text in h3 and div tags using beautifulSoup, Python
I also tried something like this, but it did not work:
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup.find_all({"h3", "div"})
I need the content from the h3 tag and all the content inside the div tag wherever these two occur together.
You could use the CSS selector h3:has(+div) - this will select all <h3> elements that have a <div> immediately after them:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Beacause this is P tag, not DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
    print('Header:')
    print(h3.text)
    print('Next <div>:')
    print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
Further reading:
CSS Selectors reference
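If your Beautiful Soup/soupsieve version does not support :has(), a plain alternative is to walk the <h3> tags and check what immediately follows each one; a sketch, reusing the data string from above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.find_all('h3'):
    nxt = h3.find_next_sibling()  # next element sibling; bare text nodes are not matched
    if nxt is not None and nxt.name == 'div':
        print('Header:', h3.text)
        print('Next <div>:', nxt.get_text(separator=",", strip=True))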

How to fix missing ul tags in html list snippet with Python and Beautiful Soup

If I have a snippet of html like this:
<p><br><p>
<li>stuff</li>
<li>stuff</li>
Is there a way to clean this and add the missing ul/ol tags using beautiful soup, or another python library?
I tried soup.prettify() but it left it as is.
It doesn't seem like there's a built-in method that wraps groups of li elements in a ul. However, you can simply loop over the li elements, identify the first element of each li group and wrap it in ul tags; the subsequent elements of the group are then appended to the previously created ul:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
ulgroup = 0
uls = []
for li in soup.findAll('li'):
    previous_element = li.findPrevious()
    # if <li> already wrapped in <ul>, do nothing
    if previous_element and previous_element.name == 'ul':
        continue
    # if <li> is the first element of a <li> group, wrap it in a new <ul>
    if not previous_element or previous_element.name != 'li':
        ulgroup += 1
        ul = soup.new_tag("ul")
        li.wrap(ul)
        uls.append(ul)
    # append rest of <li> group to previously created <ul>
    elif ulgroup > 0:
        uls[ulgroup-1].append(li)
print(soup.prettify())
For example, the following input:
html = '''
<p><br><p>
<li>stuff1</li>
<li>stuff2</li>
<div></div>
<li>stuff3</li>
<li>stuff4</li>
<li>stuff5</li>
'''
outputs:
<p>
<br/>
<p>
<ul>
<li>
stuff1
</li>
<li>
stuff2
</li>
</ul>
<div>
</div>
<ul>
<li>
stuff3
</li>
<li>
stuff4
</li>
<li>
stuff5
</li>
</ul>
</p>
</p>
Demo: https://repl.it/#glhr/55619920-fixing-uls
First, you have to decide which parser you are going to use. Different parsers treat malformed html differently.
The following BeautifulSoup methods will help you accomplish what you require:
new_tag() - create a new ul tag
append() - To append the newly created ul tag somewhere in the soup tree.
extract() - To extract the li tags one by one (which we can append to the ul tag)
decompose() - To remove any unwanted tags from the tree. Which may be formed as a result of the parser's interpretation of the malformed html.
My Solution
Let's create a soup object using the html5lib parser and see what we get:
from bs4 import BeautifulSoup

html = """
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup = BeautifulSoup(html, 'html5lib')
print(soup)
Outputs:
<html><head></head><body><p><br/></p><p>
</p><li>stuff</li>
<li>stuff</li>
</body></html>
The next step may vary according to what you want to accomplish. I want to remove the second, empty p, add a new ul tag, and move all the li tags into it.
from bs4 import BeautifulSoup

html = """
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup = BeautifulSoup(html, 'html5lib')
second_p = soup.find_all('p')[1]
second_p.decompose()
ul_tag = soup.new_tag('ul')
soup.find('body').append(ul_tag)
for li_tag in soup.find_all('li'):
    ul_tag.append(li_tag.extract())
print(soup.prettify())
Outputs:
<html>
<head>
</head>
<body>
<p>
<br/>
</p>
<ul>
<li>
stuff
</li>
<li>
stuff
</li>
</ul>
</body>
</html>

How to extract value from href in python?

I am facing a problem extracting an href value in Python.
There is a button; after clicking on "View Answers" it takes me to another link, and I want to extract the data that is present at that link.
<div class="col-md-11 col-xs-12">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic- dr">
<div class="hover-div">
<h2 itemprop="name">i need a good Orthopedic dr</h2>
</div>
</a>
<div class="thread-details">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<p class="pull-left"><span class="glyphicon glyphicon-comment"></span> View Answers (<span itemprop="answerCount">1</span>) </p>
</a>
</div>
</div>
I need to extract this href tag.
You can use data scraping in Python.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen("Your URL WILL GO HERE").read()
soup = bs.BeautifulSoup(sauce,'html5lib')
print(soup)
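That snippet only fetches and parses the page. To actually pull the href values out of markup like the one in the question, a minimal sketch (reusing the question's HTML fragment as a string) could look like this:
from bs4 import BeautifulSoup

html = '''<div class="thread-details">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<p class="pull-left">View Answers (<span itemprop="answerCount">1</span>)</p>
</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# every <a> tag that actually carries an href attribute
for a in soup.find_all('a', href=True):
    print(a['href'])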
