xpath: How do I extract text within the "strong" tag? - python-3.x

I'm using scrapy and need to extract "Gray / Gray" using xpath selectors.
Here's the html snippet:
<div class="Vehicle-Overview">
<div class="Txt-YMM">
2006 GMC Sierra 1500
</div>
<div class="Txt-Price">
Price : $8,499
</div>
<table width="100%" border="0" cellpadding="0" cellspacing="0"
class="Table-Specs">
<tr>
<td>
<strong>2006 GMC Sierra 1500 Crew Cab 143.5 WB 4WD
SLE</strong>
<strong class="text-right t-none"></strong>
</td>
</tr>
<tr>
<td>
<strong>Gray / Gray</strong><br />
<strong>209,123
Miles
/ VIN: XXXXXXXXXX
</td>
</tr>
</table>
I'm stuck trying to extract "Gray / Gray" within the "strong" tag. Any help is appreciated.

This XPath works in Scrapy and also in the Chrome/Firefox developer console:
//div[@class='Vehicle-Overview']/table[@class='Table-Specs']//tr[2]/td[1]/strong[1]/text()
You can use it in your spider like this:
color = response.xpath("//div[@class='Vehicle-Overview']/table[@class='Table-Specs']//tr[2]/td[1]/strong[1]/text()").extract_first()
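To sanity-check the path outside a spider, here is a minimal sketch that uses only Python's standard library. It assumes a tidied, well-formed copy of the question's snippet (the unclosed tags are closed by hand), and because ElementTree supports only a subset of XPath, it reads `.text` rather than using `text()`. `SNIPPET` and `extract_color` are illustrative names, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Assumption: a hand-tidied copy of the question's HTML, with the unclosed
# <strong> and outer <div> closed so that an XML parser accepts it.
SNIPPET = """
<div class="Vehicle-Overview">
  <div class="Txt-YMM">2006 GMC Sierra 1500</div>
  <div class="Txt-Price">Price : $8,499</div>
  <table class="Table-Specs">
    <tr>
      <td>
        <strong>2006 GMC Sierra 1500 Crew Cab 143.5 WB 4WD SLE</strong>
        <strong class="text-right t-none"></strong>
      </td>
    </tr>
    <tr>
      <td>
        <strong>Gray / Gray</strong><br />
        <strong>209,123 Miles / VIN: XXXXXXXXXX</strong>
      </td>
    </tr>
  </table>
</div>
"""


def extract_color(markup):
    """Return the text of the first <strong> in the second table row, or None."""
    root = ET.fromstring(markup)  # root is the Vehicle-Overview <div>
    # Same shape as the Scrapy path: table by class, second row,
    # first cell, first <strong>.
    node = root.find("./table[@class='Table-Specs']//tr[2]/td[1]/strong[1]")
    if node is None or node.text is None:
        return None
    return node.text.strip()


color = extract_color(SNIPPET)
print(color)
```

In Scrapy itself the one-liner above is enough, since its selectors tolerate the malformed HTML in the original page.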

You can use this XPath expression with your sample XML/HTML:
//div[@class='Vehicle-Overview']/table[@class='Table-Specs']/tr[2]/td[1]/strong[1]
Given the full file mentioned below and the namespace "http://www.w3.org/1999/xhtml", a full XPath can be
/html/body/div/div/div[@class='content-bg']/div/div/div[@class='Vehicle-Overview']/table[@class='Table-Specs']/tr[2]/td[1]/strong[1]

Related

I am working on a project to fill data in a pre existing web table

I am working on a project where I have to fill data into a pre-populated dynamic web table. I can find the element with XPath in Selenium VBA, and I can click it. But whenever I try to fill in data with the SendKeys command, I get the message "Run-time error 0: element not interactable".
When I use SendKeys with the element's id or a CSS selector, it is accepted, but when I use XPath I get the above message.
bot.FindElementByXPath("//td[contains(text(),'XYZ')]/following-sibling::td[5]").SendKeys "54"
<tr style="background-color:White;height:24px;">
<td class="gridtext" align="center">
<span class="checkboxclass"><input id="ctl00_ContentPlaceHolder1_grdUsers_ctl02_chkSelect" type="checkbox" name="ctl00$ContentPlaceHolder1$grdUsers$ctl02$chkSelect" onclick="javascript:setTimeout('__doPostBack(\'ctl00$ContentPlaceHolder1$grdUsers$ctl02$chkSelect\',\'\')', 0)"></span>
<input type="hidden" name="ctl00$ContentPlaceHolder1$grdUsers$ctl02$hdnUserId" id="ctl00_ContentPlaceHolder1_grdUsers_ctl02_hdnUserId" value="206451744">
</td><td align="center" style="width:2%;">
1
</td><td class="gridtext" align="center">
<span id="ctl00_ContentPlaceHolder1_grdUsers_ctl02_lblStudentId" style="display:inline-block;color:#000000;font-family:Calibri;font-size:12px;font-weight:normal;font-style:normal;width:120px;">091001118051500183</span>
</td><td class="gridtext" align="center">
</td><td class="gridtext" align="left" style="background-color:#FDE9D9;">SHIVAKSHI</td><td class="gridtext" align="left" style="background-color:#FDE9D9;">SANJAI KUMAR</td><td class="gridtext" align="left" style="background-color:#FDE9D9;">SANGEETA</td><td class="gridtext" align="left" style="background-color:#FDE9D9;">19/02/2010</td><td class="gridtext" align="center" style="background-color:#FDE9D9;">
</td><td class="gridtext" align="center">
<input name="ctl00$ContentPlaceHolder1$grdUsers$ctl02$txtNoOfDays" type="text" maxlength="3" id="ctl00_ContentPlaceHolder1_grdUsers_ctl02_txtNoOfDays" class="TextBox" onkeypress="return AllowNumeric_Browser(event, false);" onpaste="return false;" style="width:50px;">
</td><td class="gridtext" align="center">
You need to target the child input tag of the td as the td itself is not interactable. So, you need to add /input to your current path
bot.FindElementByXPath("//td[contains(text(),'XYZ')]/following-sibling::td[5]/input").SendKeys "54"
Here is a similar example where you select the child input of a td based on the parent tr having a td with specific text
bot.Get "https://datatables.net/examples/api/form.html"
bot.FindElementByXPath("//tr[contains(td/text(), 'Angelica Ramos')]/td[2]/input").SendKeys "test"
Or
//td[contains(text(),'Angelica Ramos')]/following-sibling::td[1]/input

Stemming a list of words in a dataframe

<html>
<body>
<table border=1>
<tr>
<th>label</th>
<th>rev</th>
</tr>
<tr>
<td>0</td>
<td>[ story man unnatural feelings pig...] </td>
</tr>
<tr>
<td>0</td>
<td>[ airport starts brand new luxury ...] </td></tr>
<tr>
<td>0</td>
<td>[ film lacked something couldnt pu...] </td></tr>
<tr>
<td>0</td>
<td>[ sorry everyone know supposed art...] </td></tr>
<tr>
<td>0</td>
<td>[ little parents took along theate..]</td></tr>
</table>
</body>
</html>
IMAGE-> [1]: https://i.stack.imgur.com/j2EAK.jpg
My dataframe looks like the one above. I tried the code below to stem it:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()
da.rev=[ps.stem(word) for word in da.loc[:,'rev']]
but it gave me the same dataframe back, and I can't work out what went wrong.
Any help will be greatly appreciated. Thank you for your time.
Hard to say without seeing your exact code, but if each item in the series is a list of strings, you could try the following (apply returns a new Series, so assign it back to the column):
da['rev'] = da['rev'].apply(lambda x: [ps.stem(word) for word in x])
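As a rough illustration of why the stemmer has to be applied word by word, here is a small standard-library sketch. `toy_stem` is a hypothetical stand-in for `ps.stem` (it only strips a few suffixes and is not the Porter algorithm), and `stem_row` is an illustrative helper that also covers the case where a cell holds one space-separated string rather than a list.

```python
def toy_stem(word):
    """Hypothetical stand-in for PorterStemmer().stem; strips a few suffixes."""
    for suffix in ("ings", "ing", "ed", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_row(row, stem=toy_stem):
    """Stem every word in one row; accepts a list of words or a single string."""
    words = row.split() if isinstance(row, str) else row
    return [stem(word) for word in words]


# Assumption: two sample rows, one already tokenized and one still a string.
rows = [
    ["story", "man", "unnatural", "feelings"],
    "airport starts brand new luxury",
]

stemmed = [stem_row(row) for row in rows]
print(stemmed)
```

With pandas and NLTK, the same helper could be used as `da['rev'] = da['rev'].apply(lambda x: stem_row(x, ps.stem))`, assuming `da` and `ps` are defined as in the question.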

VBA Excel get text inside HTMLObject

I know this is really easy for some of you out there, but I have searched the internet at length and cannot find an answer. I need to get the company name that is inside
<tbody> <tr> <td> <a> (eBay-Tradera.com)
and the value inside the last
<td class="bS aR"> (970,80)
of the same row.
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com
</a>
</td>
<td class="aR">
175</td>
<td class="bS aR">0</td><td class="bS aR">0</td><td class="bS aR">187</td>
<td class="aR">0,00%</td><td class="bS aR">124</td>
<td class="aR">0,00%</td>
<td class="bS aR">26</td>
<td class="aR">20,97%</td>
<td class="bS aR">32</td>
<td class="aR">60,80</td>
<td class="aR">25,81%</td>
<td class="bS aR">5 102,00</td>
<td class="bS aR">0,00</td>
<td class="aR">0,00</td>
<td class="bS aR">
970,80
</td>
</tr>
</tbody>
This is my code. To start off, I am only trying to get the a tag, but I can't get even that to work:
Set TDelements = document.getElementById("matrix1_group0").document.getElementsbytagname("a").innerHTML
r = 0
C = 0
For Each TDelement In TDelements
Blad1.Range("A1").Offset(r, C).Value = TDelement.innerText
r = r + 1
Next
Thanks in advance. I know this might be too simple, but I hope other people with the same issue will find it helpful as well. The reason for the "r = r + 1" is that there are many more companies on this list; I just wanted to keep the example as simple as I could. Thanks again!
You will need to specify the element's location in the table. eBay seems to be obfuscating the class names, so we cannot rely on those being consistent. I would not usually rely on an element's index within the table being consistent either, but I don't see any way around it here.
I am assuming that this is the HTML document you are searching
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com <!-- <=== You want this? -->
</a>
</td>
<!-- ... -->
</tr>
<!-- ... -->
</tbody>
We can ignore the rest of the document as the table element has an ID. In short, we assume that
.getElementById("matrix1_group0").getElementsByTagName("TR")
will return a collection of html row objects sorted by their appearance.
' Collections returned by getElementsByTagName are zero-based,
' so the first row is item 0 and the second cell is item 1.
Set matrix = document.getElementById("matrix1_group0")
Set firstRow = matrix.getElementsByTagName("TR")(0)
Set firstRowSecondCell = firstRow.getElementsByTagName("TD")(1)
traderaName = firstRowSecondCell.innerText
Of course you could inline this all as
document.getElementById("matrix1_group0").getElementsByTagName("TR")(0).getElementsByTagName("TD")(1).innerText
but that would make debugging harder. Also if the web-page is ever presented to you in a different format then this won't work. Ebay is deliberately making it hard for you to scrape data off of it for security.
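For comparison, picking cells by their position in each row can be sketched outside VBA as well. The following uses only Python's standard library; `RowParser` and `SAMPLE` are illustrative names, and `SAMPLE` is an abbreviated copy of the row from the question (an assumption, with most numeric cells left out).

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collects the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell texts per <tr>
        self._cell = None  # text fragments of the open <td>, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse whitespace so multi-line cells become one clean string.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


# Assumption: abbreviated copy of the question's markup.
SAMPLE = """
<tbody id="matrix1_group0">
<tr class="oR">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com
</a>
</td>
<td class="aR">
175</td>
<td class="bS aR">
970,80
</td>
</tr>
</tbody>
"""

parser = RowParser()
parser.feed(SAMPLE)
parser.close()

for row in parser.rows:
    # Second cell holds the company name, last cell holds the amount.
    name, amount = row[1], row[-1]
    print(name, amount)
```

Looping over every row this way mirrors the "r = r + 1" loop in the question, with one name and one amount per company.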
With only the HTML you have shown you can use CSS selectors to obtain these:
a[href*='aProgramInfoApplyRead.action?programId']
Which says a tag with attribute href that contains the string 'aProgramInfoApplyRead.action?programId'. This matches two elements but the first is the one you want.
VBA:
You can use .querySelector method of .document to retrieve the first match
Debug.Print ie.document.querySelector("a[href*='aProgramInfoApplyRead.action?programId']").innerText

expression engine search result related

I want to display up to 200 words of each related result on the line just below its title,
but I am not getting the text that {excerpt} should display.
My code is below:
{exp:search:search_results switch="resultRowOne|resultRowTwo"}
<table border="0" cellpadding="6" cellspacing="1" width="100%">
{exp:search:search_results switch="resultRowOne|resultRowTwo"}
<tr class="{switch}">
{if page_meta_title != ""} <td width="30%" valign="top"><b>{title}</b></td>{/if}
</tr>
<tr><td style="color:red!important">{excerpt}</td></tr>
{if count == total_results}
</table>
{/if}
{paginate}
<p>Page {current_page} of {total_pages} pages {pagination_links}</p>
{/paginate}
{/exp:search:search_results}
</table>
Maybe this was just a typo in your question, but it looks like you have the opening search tag listed twice.
{exp:search:search_results switch="resultRowOne|resultRowTwo"}
<table border="0" cellpadding="6" cellspacing="1" width="100%">
{exp:search:search_results switch="resultRowOne|resultRowTwo"}
Also, the excerpt tag by default shows only 50 words. You can also consider the character limiter plugin (http://devot-ee.com/add-ons/character-limiter), which is a free plugin from EllisLab. Once you have that set up, you would use it like so:
{exp:char_limit total="200" exact="no"}{your_text_field}{/exp:char_limit}

Formatting HTML Tables for Excel

I have an ASP page that generates an .xls table; however, when I open the table, all the cells are stuffed into the default column width, which I would like to set.
My table looks something like this:
<table>
<thead>
'A for loop makes a series of th
</thead>
'another loop pulls db values
<tr><td>value1</td><td>value2</td> 'etc </tr>
</table>
I have tried the following to set the width:
width="3.29in"
&nbsp; spam (barbaric but sometimes effective)
width="400px"
style="width:300px"
None of the above seems to work.
Additionally, here is my ASP header code in case it's relevant:
Response.Clear()
Response.Buffer = False
Response.ContentType = "application/vnd.ms-excel"
Response.AddHeader "Content-Disposition", "attachment; filename=blah.xls"
Also, on a side note, when I print a dollar value like this
<td>$<%=dbvalue%></td>
the cell ends up containing '$ followed by the dollar value, and I am not sure how to get rid of the single quote.
Do you need the thead tag? Something like this should work:
<html>
<body>
<h1>Report Title</h1>
<table >
<tr>
<th style="width : 300px">header1</th>
<th style="width : 100px">header2</th>
<th style="width : 200px">header..</th>
<th style="width : 300px">....</th>
</tr>
<tr class="row1">
<td >value1</td>
<td >value2</td>
<td >value..</td>
<td >....</td>
</tr>
....
Optionally you can put the table row with th tags inside a <thead> tag
