Scrapy: XPath to select all headers whose ancestor is not a footer


I am trying to get all headers which are not in the footer.
So the header <h3 class="ibm-bold">Discover</h3> should be excluded from the scrape.
<footer role="contentinfo" aria-label="IBM">
<div class="region region-footer">
<div id="ibm-footer-module">
<section role="region" aria-label="Resources">
<h3 class="ibm-bold">Discover</h3>
I have tried using this expression to select the headers which should be excluded, but it doesn't return the right nodes.
//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]/ancestor::footer/text()
The page I am scraping is this: https://www.ibm.com/products/informix/embedded-for-iot?mhq=iot&mhsrc=ibmsearch_a
Please help

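You almost had it. Your expression steps from each header up to its footer ancestor and returns that footer's text nodes. Instead, test for the ancestor inside the predicate and negate it: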
//*[
(self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6)
and not(ancestor::footer)
]/text()
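To check the expression outside of a spider, here is a minimal sketch using lxml, the library Scrapy's selectors are built on. The sample markup is a made-up reduction of the page (only the footer part comes from the question), and the names SAMPLE and HEADERS_OUTSIDE_FOOTER are illustrative.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the real page.
SAMPLE = """
<html><body>
  <h1>Informix Embedded</h1>
  <section><h2>Features</h2></section>
  <footer role="contentinfo" aria-label="IBM">
    <div class="region region-footer">
      <section role="region" aria-label="Resources">
        <h3 class="ibm-bold">Discover</h3>
      </section>
    </div>
  </footer>
</body></html>
"""

# Same expression as in the answer, written on one logical line.
HEADERS_OUTSIDE_FOOTER = (
    "//*[(self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6)"
    " and not(ancestor::footer)]/text()"
)

tree = html.fromstring(SAMPLE)
headers = tree.xpath(HEADERS_OUTSIDE_FOOTER)
print(headers)
```

In a Scrapy callback the same string should work with response.xpath(HEADERS_OUTSIDE_FOOTER).getall().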

Related

html multiline replacement Greasemonkey

I'm trying to replace parts of the HTML code in a page. Unfortunately, the parts span multiple lines and I can't use specific div ids/classes since they are used elsewhere on the same page (I can't change that; it's not my fault).
For example:
<div class="c2c" style="display: none;">
<div class="left" style="height:12px;">
<span >A test</span>
</div>
That is, with line breaks and spaces.
I have tried this:
document.body.innerHTML= document.body.innerHTML.replace(/<div class=\"c2c\" style=\"display: none;\">(\r\n|\r|\n)?\s*<div class=\"left\" style=\"height:12px;\">(\r\n|\r|\n)?\s*<span >A test<\/span>/g,"<div class=\"c2c\" style=\"display: block !important;\"><div class=\"left\" style=\"height:12px;\"><span >A test<\/span>");
and also some other variations, but with no result.
Any ideas?
Thank you

How do I scrape the OHLC values from this website

Website in question. Right now I am only analyzing the last quarter. If I were to expand to the past 4-5 quarters, would there be a better way to automate this task than manually setting the time range again and again and then extracting the table values?
What I tried doing:
import bs4 as bs
import requests
import lxml
resp = requests.get("http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx?symbol=HBL")
soup = bs.BeautifulSoup(resp.text, "lxml")
mydivs = soup.findAll("div", {"class": "breadcrumbs"})
print(mydivs)
What I got:
[<div class="breadcrumbs">
<ul>
<li class="breadcrumbs-home">
<a href="#" title="Back To Home">
<i class="fa fa-home"></i>
</a>
</li>
<li>Snapshot / <span id="ContentPlaceHolder1_lbl_companyname">HBL - Habib Bank Ltd.</span> / Historical Prices
</li>
</ul>
</div>, <div class="breadcrumbs" style="background-color:transparent;border-color:transparent;margin-top:20px;">
<ul>
<div class="bootstrap-iso">
<div class="tp-banner-container">
<div class="table-responsive">
<div id="n1">
<table class="table table-bordered table-striped" id="list"><tr><td>Company Wise</td></tr></table>
<div id="pager"></div>
</div>
</div>
</div>
</div>
</ul>
</div>]
Inspecting the source, the table is in the div with class "breadcrumbs" (I found that through "inspect element"), but I don't see where the values are defined or stored in the page's source. I'm fairly new to web scraping; where should I be looking to extract those values?
Also, there are a total of 7 pages and I'm currently only trying to scrape the table from the first page. How would I go about scraping all the pages of my results and then converting them to a pandas DataFrame?

The page loads the data via JavaScript from an external source. By inspecting where the page makes its requests, you can fetch the data directly with requests and parse it as JSON.
You can tweak the parameters in the payload dict to get the data for date range you want:
import json
import requests
url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotHP.aspx/chart'
payload = {"par":"HBL","date1":"07/13/2019","date2":"08/12/2019","rows":20,"page":1,"sidx":"trading_Date","sord":"desc"}
json_data = requests.post(url, json=payload).json()
print(json.dumps(json_data, indent=4))
Prints:
{
"d": [
{
"trading_Date": "/Date(1565290800000)/",
"trading_open": 111.5,
"trading_high": 113.24,
"trading_low": 105.5,
"trading_close": 106.17,
"trading_vol": 1349000,
"trading_change": -4.71
},
{
"trading_Date": "/Date(1565204400000)/",
"trading_open": 113.94,
"trading_high": 115.0,
"trading_low": 110.0,
"trading_close": 110.88,
"trading_vol": 1122200,
"trading_change": -3.48
},
... and so on.
EDIT:
I found the URL the page loads its data from by looking at the Network tab in the Firefox developer tools.
The tab shows the URL, the method the page uses for the request (POST in this case), and the parameters needed.
I copied this URL and the parameters and used them in requests.post() to obtain the JSON data.
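The "/Date(...)/" strings in the response are millisecond Unix timestamps. The following is a rough sketch of turning the rows into plain records with real dates, using two rows from the output above as sample data instead of a live request. The helper names parse_trading_date and to_records are made up for illustration, and the UTC+5 offset is an assumption: the sample timestamps happen to fall on midnight in Pakistan Standard Time.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are midnight in Pakistan Standard Time (UTC+5).
PKT = timezone(timedelta(hours=5))
_DATE_RE = re.compile(r"/Date\((-?\d+)\)/")


def parse_trading_date(raw):
    """Convert a '/Date(1565290800000)/' string into a date object."""
    match = _DATE_RE.fullmatch(raw)
    if match is None:
        raise ValueError(f"unexpected date format: {raw!r}")
    millis = int(match.group(1))
    return datetime.fromtimestamp(millis / 1000, tz=PKT).date()


def to_records(json_data):
    """Return the rows under the 'd' key with 'trading_Date' parsed."""
    return [
        {**row, "trading_Date": parse_trading_date(row["trading_Date"])}
        for row in json_data["d"]
    ]


# Two rows copied from the printed response above.
sample = {
    "d": [
        {"trading_Date": "/Date(1565290800000)/", "trading_open": 111.5,
         "trading_high": 113.24, "trading_low": 105.5, "trading_close": 106.17,
         "trading_vol": 1349000, "trading_change": -4.71},
        {"trading_Date": "/Date(1565204400000)/", "trading_open": 113.94,
         "trading_high": 115.0, "trading_low": 110.0, "trading_close": 110.88,
         "trading_vol": 1122200, "trading_change": -3.48},
    ]
}

records = to_records(sample)
for record in records:
    print(record["trading_Date"], record["trading_close"])
```

A list of dicts like records can be passed straight to pandas.DataFrame(records). For more than one page, one option is to repeat the POST while incrementing "page" in the payload until "d" comes back empty, assuming the endpoint honors that parameter the way the page's own pager does.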

Scraping multiple similar lines with python

Using a simple request, I'm trying to get some information stored in the "alt" attribute from this HTML page. The problem is that, within each instance, the information is spread over multiple lines that start with "img". When I try to access it, I can only read the first "img" and not the rest, and I'm not sure how to get them all. Here's the HTML:
<div class="archetype-tile-description-wrapper">
<div class="archetype-tile-description">
<h2>
<span class="deck-price-online">
Golgari Midrange
</span>
<span class="deck-price-paper">
Golgari Midrange
</span>
</h2>
<div class="manacost-container">
<span class="manacost">
<img alt="b" class="common-manaCost-manaSymbol sprite-mana_symbols_b" src="//assets1.mtggoldfish.com/assets/s-d69cbc552cfe8de4931deb191dd349a881ff4448ed3251571e0bacd0257519b1.gif" />
<img alt="g" class="common-manaCost-manaSymbol sprite-mana_symbols_g" src="//assets1.mtggoldfish.com/assets/s-d69cbc552cfe8de4931deb191dd349a881ff4448ed3251571e0bacd0257519b1.gif" />
</span>
</div>
<ul>
<li>Jadelight Ranger</li>
<li>Merfolk Branchwalker</li>
<li>Vraska's Contempt</li>
</ul>
</div>
</div>
Having said that, what I'm looking to get from this is both "b" and "g" and store them in a single variable.

You can probably grab those <img> elements with the class "common-manaCost-manaSymbol" like this:
imgs = soup.find_all("img",{"class":"common-manaCost-manaSymbol"})
and then you can iterate over each <img> and grab its alt attribute:
alts = []
for i in imgs:
    alts.append(i['alt'])
or with a list comprehension:
alts = [i['alt'] for i in imgs]
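Putting the pieces together, here is a small self-contained sketch that runs against a reduced copy of the markup from the question. The src values are shortened placeholders, the built-in "html.parser" backend is an arbitrary choice, and joining the symbols into one string (mana_cost) is just one possible reading of "store them in a single variable".

```python
from bs4 import BeautifulSoup

# Reduced copy of the question's markup; src values are placeholders.
html_doc = """
<div class="manacost-container">
  <span class="manacost">
    <img alt="b" class="common-manaCost-manaSymbol sprite-mana_symbols_b" src="b.gif" />
    <img alt="g" class="common-manaCost-manaSymbol sprite-mana_symbols_g" src="g.gif" />
  </span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching <img>, not only the first one.
imgs = soup.find_all("img", class_="common-manaCost-manaSymbol")
alts = [img["alt"] for img in imgs]

# One variable holding all symbols, here as a single string.
mana_cost = "".join(alts)
print(alts, mana_cost)
```

If only the first symbol shows up, the likely cause is calling find (or soup.img), which stops at the first match, instead of find_all.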

How to extract value from href in python?

Hi developers. I am having trouble extracting an href value in Python.
The page has a "View Answers" button; clicking it takes me to another link, and I want to extract the data found at that link.
<div class="col-md-11 col-xs-12">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<div class="hover-div">
<h2 itemprop="name">i need a good Orthopedic dr</h2>
</div>
</a>
<div class="thread-details">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<p class="pull-left"><span class="glyphicon glyphicon-comment"></span> View Answers (<span itemprop="answerCount">1</span>) </p>
</a>
</div>
</div>
I need to extract the value of this href attribute.

You can use web scraping in Python.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen("Your URL WILL GO HERE").read()
soup = bs.BeautifulSoup(sauce,'html5lib')
print(soup)
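The snippet above only prints the parsed page. To actually pull out the link, something along these lines may work; it is a sketch run against the markup from the question rather than the live site, and it uses the built-in "html.parser" backend instead of html5lib so that no extra package is needed. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Markup from the question, lightly condensed.
html_doc = """
<div class="col-md-11 col-xs-12">
  <a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
    <div class="hover-div"><h2 itemprop="name">i need a good Orthopedic dr</h2></div>
  </a>
  <div class="thread-details">
    <a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
      <p class="pull-left">View Answers (<span itemprop="answerCount">1</span>)</p>
    </a>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect the href of every <a> inside a "thread-details" block.
answer_links = []
for details in soup.find_all("div", class_="thread-details"):
    for anchor in details.find_all("a", href=True):
        answer_links.append(anchor["href"])

print(answer_links)
```

Each collected URL can then be fetched the same way as the first page (for example with urllib.request.urlopen) to read the answers behind it.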

How can I add multiple anchors to the same block?

I'm using AsciiDoctor to create an HTML manual. In order to keep existing links valid, I need multiple anchors at the same header.
Basically I want this output:
<a id="historic1"></a>
<a id="historic2"></a>
<h2 id="current">Caption</h2>
While it is possible to create multiple inline anchors like this
Inline [[historic1]] [[historic2]] [[current]] Anchor
Inline <a id="historic1"></a> <a id="historic2"></a> <a id="current"></a> Anchor
it looks like additional anchor macros in front of blocks are simply swallowed:
[[historic1]]
[[historic2]]
[[current]]
== Caption
<h2 id="current">Caption</h2>
So what are my options to have multiple anchors in front of a block?

You can also use the shorthand version of this solution.
[#current]
== [[historic1]][[historic2]]Caption
Now you get all three anchors on the same heading.

The best I could do (tested with Asciidoctor.js 1.5.4):
== anchor:historic1[historic1] anchor:historic2[historic2] anchor:current[current] Caption
Some text
Output:
<h2 id="__a_id_historic1_a_a_id_historic2_a_a_id_current_a_caption"><a id="historic1"></a> <a id="historic2"></a> <a id="current"></a> Caption</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Some text</p>
</div>
</div>
There are two issues:
#840
#1689
