Returning Certain 'a' class Href by Date - python-3.x

I have a number of div tags (as shown below), each containing an href that I'm looking for. I can return all of the hrefs and append them to a list, but what I need is to return only the hrefs where the date equals the newest date in <li class="last first date"></li>. Any help on how I could achieve this would be great.
<div class="span8 story index_story genre-letter">
<a class="gtm-event" data-evt-action="/opinion/letters/article 1 on /opinion/letters" data-evt-category="Section element" data-evt-label="Position 98 of 99" href="/opinion/letters/article 1">
<span class="h2">Article 1</span>
</a>
<div class="article_info">
<ul>
<li class="last first date">February 21, 2018</li>
</ul>
</div>
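One possible approach, as a minimal sketch: pair each link with the date that follows it, parse the dates, and keep only the hrefs whose date equals the maximum. The sketch uses only the standard library's html.parser; the class names come from the markup above, while StoryParser, newest_hrefs, SAMPLE, and the sample hrefs are hypothetical names made up for illustration. It assumes each story has one gtm-event link followed by one date item, and that month names are in English (strptime's %B depends on the locale).

```python
from datetime import datetime
from html.parser import HTMLParser

# Made-up sample shaped like the markup in the question.
SAMPLE = """
<div class="story"><a class="gtm-event" href="/opinion/letters/article-1">
<span class="h2">Article 1</span></a>
<ul><li class="last first date">February 21, 2018</li></ul></div>
<div class="story"><a class="gtm-event" href="/opinion/letters/article-2">
<span class="h2">Article 2</span></a>
<ul><li class="last first date">February 21, 2018</li></ul></div>
<div class="story"><a class="gtm-event" href="/opinion/letters/article-3">
<span class="h2">Article 3</span></a>
<ul><li class="last first date">February 19, 2018</li></ul></div>
"""


class StoryParser(HTMLParser):
    """Collect (date, href) pairs from markup shaped like the question's."""

    def __init__(self):
        super().__init__()
        self.stories = []      # list of (date, href) tuples
        self._href = None      # href of the most recent gtm-event link
        self._in_date = False  # True while inside the date <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "gtm-event" in classes:
            self._href = attrs.get("href")
        elif tag == "li" and "date" in classes:
            self._in_date = True

    def handle_data(self, data):
        text = data.strip()
        if self._in_date and self._href is not None and text:
            date = datetime.strptime(text, "%B %d, %Y").date()
            self.stories.append((date, self._href))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_date = False


def newest_hrefs(markup):
    """Return the hrefs whose date equals the newest date in the markup."""
    parser = StoryParser()
    parser.feed(markup)
    if not parser.stories:
        return []
    newest = max(date for date, _ in parser.stories)
    return [href for date, href in parser.stories if date == newest]


print(newest_hrefs(SAMPLE))
```

The same pairing and max-date logic should carry over to BeautifulSoup or lxml if one of those is already in use; only the extraction step changes.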

Related

Get first element Xpath

I have HTML like this:
<ol class="list">
<li class="list-item " id="37647629">
<!---->
<div>
<!---->
<div>
<!---->
<book class="book">
<div class="title">
someText
</div>
<div class="year">
2022
</div>
</book>
</div>
<!---->
</div>
<!---->
</li>
<li class="list-item " id="37647778">
<!---->
<div>
<!---->
<div>
<!---->
<book class="book">
<div class="title">
someOtherText
</div>
<div class="year">
2014
</div>
</book>
</div>
</div>
<!---->
</li>
</ol>
I want to get the first book's title and year directly, with two XPath expressions.
I tried:
$x('//book') => Ok, get the two books list
$x('//book[0]') => Empty list
$x('//book[0]/div[@class="title"]') => Nothing
It seems I have to do this:
$x('//book')[0]
and then process the title, but why can't I do this with XPath alone and access the first title directly with an XPath expression?
This will give you the first book title:
"(//book)[1]//div[@class='title']"
And this gives the first book year:
"(//book)[1]//div[@class='year']"
You're missing that XPath indexing starts at 1; JavaScript indexing starts at 0.
$x('//book') selects all book elements in the document.
$x('//book[0]') selects nothing because XPath indexing starts at 1. (Note also that a positional predicate written this way, such as //book[1], selects every book element that is the first book among its siblings, which is not necessarily the same as the first of all book elements in the document.)
$x('//book')[0] would select the first book element because JavaScript indexing starts at 0.
$x('(//book)[1]') would select the first book element because XPath indexing starts at 1.
To select the first div with class of 'title', all in XPath:
$x('(//div[@class="title"])[1]')
or, using JavaScript to index:
$x('(//div[@class="title"])')[0]
To return just the string value without the leading/trailing whitespace, wrap in normalize-space():
$x('normalize-space((//div[@class="title"])[1])')
Note that normalize-space() will also consolidate internal whitespace, but that is of no consequence with this example.
See also
How to select first element via XPath? (And be sure not to miss the explanation of the difference between //book[1] and (//book)[1] — they are not the same.)
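The difference between //book[1] and (//book)[1] can also be seen outside the browser console. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree on a trimmed, well-formed copy of the markup from the question (DOC is a hypothetical sample). ElementTree supports only a subset of XPath and cannot parse the parenthesised form, so the sketch emulates (//book)[1] by indexing the document-order result list in Python.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the question's markup (hypothetical sample).
DOC = """
<ol class="list">
  <li><div><div>
    <book class="book">
      <div class="title"> someText </div>
      <div class="year"> 2022 </div>
    </book>
  </div></div></li>
  <li><div><div>
    <book class="book">
      <div class="title"> someOtherText </div>
      <div class="year"> 2014 </div>
    </book>
  </div></div></li>
</ol>
"""

root = ET.fromstring(DOC)

# Equivalent of //book[1]: every book that is the first book child of its
# own parent. Each book here has a different parent, so both match.
first_among_siblings = root.findall(".//book[1]")

# Stand-in for (//book)[1]: findall returns matches in document order,
# so the first list item is the first book in the whole document.
first_in_document = root.findall(".//book")[0]

title = first_in_document.find("div[@class='title']").text.strip()
year = first_in_document.find("div[@class='year']").text.strip()

print(len(first_among_siblings), title, year)
```

With a full XPath 1.0 engine such as lxml, the parenthesised expression can be passed directly instead of indexing in Python.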

Need to get a specific class exists in HTML body

I am trying to check whether class="special-price" exists in the code below.
Here is the HTML:
<div class="product-shop">
<div class="f-fix">
<h2 class="product-name newname"> Xiaomi Mi Band 2 Strap (Black with White Border) </h2>
<!--product price-->
<div class="text-center ">
<div class="price-box">
<p class="old-price"> <span class="price-label">Regular Price:</span >
<span class = "price" id = "old-price-8846" > ৳200 </span>
</p >
<p class = "special-price" >
<span class = "price-label"> Special Price </span>
<span class="price" itemprop="price" content="149" id="product-price-8846"> ৳149 </span>
</p>
</div>
</div >
</div>
I am using Scrapy with Python. After checking whether the class is found, I need to collect the text of class="price".
Did you try something like:
if response.css('.special-price'):
    price = response.css('.price::text').get() # or do whatever you need
or for short:
price = response.css('.special-price .price::text').get()
It will give you None if there is no element with the special-price class.

Scraping multiple similar lines with python

Using a simple request, I'm trying to extract the information stored in the "alt" attributes on this HTML page. The problem is that, within each instance, the information is spread across multiple lines that start with "img". When I try to access it, I can only read the first "img" and not the rest, and I'm not sure how to get them all. Here's the HTML:
<div class="archetype-tile-description-wrapper">
<div class="archetype-tile-description">
<h2>
<span class="deck-price-online">
Golgari Midrange
</span>
<span class="deck-price-paper">
Golgari Midrange
</span>
</h2>
<div class="manacost-container">
<span class="manacost">
<img alt="b" class="common-manaCost-manaSymbol sprite-mana_symbols_b" src="//assets1.mtggoldfish.com/assets/s-d69cbc552cfe8de4931deb191dd349a881ff4448ed3251571e0bacd0257519b1.gif" />
<img alt="g" class="common-manaCost-manaSymbol sprite-mana_symbols_g" src="//assets1.mtggoldfish.com/assets/s-d69cbc552cfe8de4931deb191dd349a881ff4448ed3251571e0bacd0257519b1.gif" />
</span>
</div>
<ul>
<li>Jadelight Ranger</li>
<li>Merfolk Branchwalker</li>
<li>Vraska's Contempt</li>
</ul>
</div>
</div>
Having said that, what I'm looking to get from this is both "b" and "g" and store them in a single variable.
You can probably grab those <img> elements with the class "common-manaCost-manaSymbol" like this:
imgs = soup.find_all("img",{"class":"common-manaCost-manaSymbol"})
and then you can iterate over each <img> and grab its alt attribute:
alts = []
for i in imgs:
    alts.append(i['alt'])
or with a list comprehension:
alts = [i['alt'] for i in imgs]

How to select only divs with specific children span with xpath python

I am currently trying to scrape information from a particular e-commerce site, and I only want product information (name, price, color, and sizes) for products whose prices have been slashed.
I am currently using XPath.
This is my Python scraping code:
from lxml import html
import requests


class CategoryCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.items = set()

    def __str__(self):
        # __str__ must return a string, not a tuple
        return 'All Items: {}'.format(self.items)

    def crawl(self):
        self.get_item_from_link(self.starting_url)

    def get_item_from_link(self, link):
        start_page = requests.get(link)
        tree = html.fromstring(start_page.text)
        # Product names live in <span class="name" dir="ltr">
        names = tree.xpath('//span[@class="name"][@dir="ltr"]/text()')
        print(names)


# Note: this is not the original URL
crawler = CategoryCrawler('https://www.myfavoriteecommercesite.com/')
crawler.crawl()
When the program is run, this is the HTML content retrieved from the e-commerce site.
Div of products with a price slash:
<div class="products-info">
<h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>
<div class="price-container clearfix">
<span class="sale-flag-percent">-22%</span>
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="388990">388,990</span>
</span>
<span class="price -old ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="500000">500,000</span>
</span>
</span>
</div>
</div>
Div of products with no price slash:
<div class="products-info">
<h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>
<div class="price-container clearfix">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="388990">388,990</span>
</span>
</span>
</div>
</div>
Now this is my exact question:
I want to know how to select only the parent divs, i.e.
<div class="price-container clearfix">, that also contain either of these child span classes:
<span class="price -old "> or
<span class="sale-flag-percent">
Thank you all.
One solution would be to get all <div class="price-container clearfix"> elements and iterate over them, checking whether the string of each whole element contains your keywords.
But a better solution would be to use conditionals with XPath:
from lxml import html
htmlst = 'your html'
tree = html.fromstring(htmlst)
divs = tree.xpath('//div[@class="price-container clearfix" and .//span[@class = "price -old " or @class = "sale-flag-percent"] ]')
print(divs)
This gets all divs where class="price-container clearfix" and then checks whether each contains a span with one of the searched classes.
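The exact-match predicates above depend on the class attribute text matching character for character, including the trailing space in "price -old ". A more tolerant variant, sketched here under the assumption that lxml is installed, tests for individual class tokens instead. The has_class helper, the SAMPLE markup, and the catalog id are hypothetical and exist only for this illustration; they are not part of lxml or of the site's real markup.

```python
from lxml import html

# Made-up sample: one product with both markers, one with none,
# and one with only the sale flag.
SAMPLE = """
<div id="catalog">
  <div class="price-container clearfix">
    <span class="sale-flag-percent">-22%</span>
    <span class="price-box ri">
      <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
      <span class="price -old "><span dir="ltr" data-price="500000">500,000</span></span>
    </span>
  </div>
  <div class="price-container clearfix">
    <span class="price-box ri">
      <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
    </span>
  </div>
  <div class="clearfix price-container">
    <span class="sale-flag-percent">-10%</span>
  </div>
</div>
"""


def has_class(name):
    """Build an XPath test for one whitespace-separated class token."""
    return 'contains(concat(" ", normalize-space(@class), " "), " {} ")'.format(name)


# Price containers that hold an old price or a sale flag somewhere inside.
DISCOUNTED = '//div[{} and .//span[{} or {}]]'.format(
    has_class("price-container"),
    has_class("-old"),
    has_class("sale-flag-percent"),
)

tree = html.fromstring(SAMPLE)
divs = tree.xpath(DISCOUNTED)

# Old prices per matching container (empty when only the flag is present).
old_prices = [
    div.xpath('.//span[{}]//@data-price'.format(has_class("-old")))
    for div in divs
]

print(len(divs), old_prices)
```

Matching on tokens means the third container is still found even though its classes appear in a different order, which an exact comparison against "price-container clearfix" would miss.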

ExpressionEngine switch tag working inconsistently

In ExpressionEngine, I'm creating a list with conditionals that exhibits some strange behavior. The code below is part of a bigger set:
<li><h4>DERMATOLOGY</h4>
<ul>
{exp:channel:entries channel="specialist" dynamic="no" orderby="sp_order" sort="asc"}
{if sp_specialty == "sp_dermatology"}
<li>
<img src="{sp_headshot}" />
<p>{title}</p>
</li>
{/if}
{/exp:channel:entries}
</ul>
</li>
<li><h4>EMERGENCY AND CRITICAL CARE</h4>
<ul>
{exp:channel:entries channel="specialist" dynamic="no" orderby="sp_order" sort="asc"}
{if sp_specialty == "sp_emergency"}
<li class="{switch='one|two'}">
<img src="{sp_headshot}" />
<p>{title}</p>
</li>
{/if}
{/exp:channel:entries}
</ul>
</li>
What happens, in the case of EMERGENCY AND CRITICAL CARE, is that with the 5 entries I have under that, the classes are returned like this: two, one, one, one, two. Any suggestions on getting the behavior I need?
I see what you mean. The switch variable applies its logic to every entry returned by the entries loop, which is why you're seeing odd numbering in your rendered page: it alternates across all entries the loop returns, including the ones your conditional then filters out for grouping. You could use the search parameter to do that filtering for you, so each loop returns only the entries you're looking for. Like this:
<li><h4>DERMATOLOGY</h4>
<ul>
{exp:channel:entries channel="specialist" search:sp_specialty="=sp_dermatology" dynamic="no" orderby="sp_order" sort="asc"}
<li>
<img src="{sp_headshot}" />
<p>{title}</p>
</li>
{/exp:channel:entries}
</ul>
</li>
<li><h4>EMERGENCY AND CRITICAL CARE</h4>
<ul>
{exp:channel:entries channel="specialist" search:sp_specialty="=sp_emergency" dynamic="no" orderby="sp_order" sort="asc"}
<li class="{switch='one|two'}">
<img src="{sp_headshot}" />
<p>{title}</p>
</li>
{/exp:channel:entries}
</ul>
</li>
This way each loop returns ONLY the matching items you're looking for, eliminating the need for the conditional and allowing the switch variable to work as intended: alternating across every entry the loop returns.
