xpath join text from multiple elements python - python-3.x

Hello I have some html file from this website: https://www.oddsportal.com/soccer/argentina/superliga/results/
<td class="name table-participant">
<a href="/soccer/argentina/superliga/independiente-san-martin-tIuN5Umrd/">
<span class="bold">Independiente</span>
"- San Martin T."
</a>
</td>
<td class="name table-participant">
<a href="/soccer/argentina/superliga/lanus-huracan-xIDIe0Gr/">
"Lanus - "
<span class="bold">Huracan</span>
</a>
</td>
<td class="name table-participant">
Rosario Central - Colon Santa FE
</td>
I want to select and join a/text() and span/text() in order to look like this: "Independiente - San Martin T."
As you see the 'span' is not allways in the same place and some times is missing (see last 'td class')
I used this code:
('//td[#class="name table-participant"]/a/text() | span/text()').extract()
but it returns only the a/text().
Can you help me to make this work?
Thank you

You trying to search span/text() without a scope. Add // at the beginning of this part of query, in the totally:
('//td[#class="name table-participant"]/a/text() | //span/text()').extract()
But I'm strongly recommend use this decision:
('//td[#class="name table-participant"]//*[self::a/ancestor::td or self::span]/text()').extract
for get span only from your choiced td-scope.

I'm assuming that you're using Scrapy to scrape the HTML.
From the structure of your sample HTML, it looks like you want to obtain the text of the anchor element, so you need to iterate over those.
Only then you can strip and join the text child nodes of the anchor element to obtain properly formatted strings. There is additional complication by the inconsistent use of quotes, but the following should get you going.
from scrapy.selector import Selector
HTML="""
<td class="name table-participant">
<a href="/soccer/argentina/superliga/independiente-san-martin-tIuN5Umrd/">
<span class="bold">Independiente</span>
"- San Martin T."
</a>
</td>
<td class="name table-participant">
<a href="/soccer/argentina/superliga/lanus-huracan-xIDIe0Gr/">
"Lanus - "
<span class="bold">Huracan</span>
</a>
</td>
<td class="name table-participant">
Rosario Central - Colon Santa FE
</td>
"""
def strip_and_join(x):
l=[]
for s in x:
# strip whitespace and quotes
s = s.strip().strip('"').strip()
# drop now empty strings
if s:
l.append(s)
return " ".join(l)
for x in Selector(text=HTML).xpath('//td[#class="name table-participant"]/a'):
print strip_and_join(x.xpath('.//text()').extract())
Note that for the sake of clarity I didn't squeeze the code into a single list comprehension, although this would be possible of course.

Related

How to get all td[3] tags from the tr tags with selenium Xpath in python

I have a webpage HTML like this:
<table class="table_type1" id="sailing">
<tbody>
<tr>
<td class="multi_row"></td>
<td class="multi_row"></td>
<td class="multi_row">1</td>
<td class="multi_row"></td>
</tr>
<tr>
<td class="multi_row"></td>
<td class="multi_row"></td>
<td class="multi_row">1</td>
<td class="multi_row"></td>
</tr>
</tbody>
</table>
and tr tags are dynamic so i don't know how many of them exist, i need all td[3] of any tr tags in a list for some slicing stuff.it is much better iterate with built in tools if find_element(s)_by_xpath("") has iterating tools.
Try
cells = driver.find_elements_by_xpath("//table[#id='sailing']//tr/td[3]")
to get third cell of each row
Edit
For iterating just use a for loop:
print ([i.text for i in cells])
Try following code :
tdElements = driver.find_elements_by_xpath("//table[#id="sailing "]/tbody//td")
Edit : for 3rd element
tdElements = driver.find_elements_by_xpath("//table[#id="sailing "]/tbody/tr/td[3]")
To print the text e.g. 1 from each of the third <td> you can either use the get_attribute() method or text property and you can use either of the following solutions:
Using CssSelector and get_attribute():
print(driver.find_elements_by_css_selector("table.table_type1#sailing tr td:nth-child(3)").get_attribute("innerHTML"))
Using CssSelector and text property:
print(driver.find_elements_by_css_selector("table.table_type1#sailing tr td:nth-child(3)").text)
Using XPath and get_attribute():
print(driver.find_elements_by_xpath('//table[#class='table_type1' and #id="sailing"]//tr//following::td[3]').get_attribute("innerHTML"))
Using XPath and text property:
print(driver.find_elements_by_xpath('//table[#class='table_type1' and #id="sailing"]//tr//following::td[3]').text)
To get the 3 rd td of each row, you can try either with xpath
driver.find_elements_by_xpath('//table[#id="sailing"]/tbody//td[3]')
or you can try with css selector like
driver.find_elements_by_css_selector('table#sailing td:nth-child(3)')
As it is returning list you can iterate with for each,
elements=driver.find_elements_by_xpath('//table[#id="sailing"]/tbody//td[3]')
for element in elements:
print(element.text)

find_elements_by_xpath() not producing the desired output python selenium scraping

I'm trying to find a tr by its class of .tableOne. Here is my code:
browser = webdriver.Chrome(executable_path=path, options=options)
cells = browser.find_elements_by_xpath('//*[#class="tableone"]')
But the output of the cells variable is [], an empty array.
Here is the html of the page:
<tbody class="tableUpper">
<tr class="tableone">
<td><a class="studentName" href="//www.abc.com"> student one</a></td>
<td> <span class="id_one"></span> <span class="long">Place</span> <span class="short">Place</span></td>
<td class="hide-s">
<span class="state"></span> <span class="studentState">student_state</span>
</td>
</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
</tbody>
Please try this:
import re
cells = browser.find_elements_by_xpath("//*[contains(local-name(), 'tr') and contains(#class, 'tableone')]")
for (e in cells):
insides = e.find_elements_by_xpath("./td")
for (i in insides):
result = re.search('\">(.*)</', i.get_attribute("outerHTML"))
print result.group(1)
What this does is gets all the tr elements that have class tableone, then iterates through each element and lists all the tds. Then iterates through the outerHTML of each td and strips each string to get the text value.
It's quite unrefined and will return empty strings, I think. You might need to put some more work into the final product.

xpath: How do I extract text within the "strong" tag?

I'm using scrapy and need to extract "Gray / Gray" using xpath selectors.
Here's the html snippet:
<div class="Vehicle-Overview">
<div class="Txt-YMM">
2006 GMC Sierra 1500
</div>
<div class="Txt-Price">
Price : $8,499
</div>
<table width="100%" border="0" cellpadding="0" cellspacing="0"
class="Table-Specs">
<tr>
<td>
<strong>2006 GMC Sierra 1500 Crew Cab 143.5 WB 4WD
SLE</strong>
<strong class="text-right t-none"></strong>
</td>
</tr>
<tr>
<td>
<strong>Gray / Gray</strong><br />
<strong>209,123
Miles
/ VIN: XXXXXXXXXX
</td>
</tr>
</table>
I'm stuck trying to extract "Gray / Gray" within the "strong" tag. Any help is appreciated.
This XPath will work in Scrapy and also in Google/Firefox Developer's Console:
//div[#class='Vehicle-Overview']/table[#class='Table-Specs']//tr[2]/td[1]/strong[1]/text()
You can use this code in your spider:
color = response.xpath("//div[#class='Vehicle-Overview']/table[#class='Table-Specs']//tr[2]/td[1]/strong[1]/text()").extract_first()
You can use this XPath expression with your sample XML/HTML:
//div[#class='Vehicle-Overview']/table[#class='Table-Specs']/tr[2]/td[1]/strong[1]
A full XPath given the full file mentioned below with respect to a namespace "http://www.w3.org/1999/xhtml" can be
/html/body/div/div/div[#class='content-bg']/div/div/div[#class='Vehicle-Overview']/table[#class='Table-Specs']/tr[2]/td[1]/strong[1]

VBA Excel get text inside HTMLObject

I know this is really easy for some of you out there. But I have been going deep on the internet and I can not find an answer. I need to get the company name that is inside the
tbody tr td a eBay-tradera.com
and
td class="bS aR" 970,80
/td /tr /tbody
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com
</a>
</td>
<td class="aR">
175</td>
<td class="bS aR">0</td><td class="bS aR">0</td><td class="bS aR">187</td>
<td class="aR">0,00%</td><td class="bS aR">124</td>
<td class="aR">0,00%</td>
<td class="bS aR">26</td>
<td class="aR">20,97%</td>
<td class="bS aR">32</td>
<td class="aR">60,80</td>
<td class="aR">25,81%</td>
<td class="bS aR">5 102,00</td>
<td class="bS aR">0,00</td>
<td class="aR">0,00</td>
<td class="bS aR">
970,80
</td>
</tr>
</tbody>
This is my code, where I only try to get the a tag to start of with but I cant get that to work either
Set TDelements = document.getElementById("matrix1_group0").document.getElementsbytagname("a").innerHTML
r = 0
C = 0
For Each TDelement In TDelements
Blad1.Range("A1").Offset(r, C).Value = TDelement.innerText
r = r + 1
Next
Thanks on beforehand I know that this might be to simple. But I hope that other people might have the same issue and this will be helpful for them as well. The reason for the "r = r + 1" is because there are many more companies on this list. I just wanted to make it as easy as I could. Thanks again!
You will need to specify the element location in the table. Ebay seems to be obfuscating the class-names so we cannot rely on those being consistent. Nor would I usually rely on the elements by their table index being consistent but I don't see any way around this.
I am assuming that this is the HTML document you are searching
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com <!-- <=== You want this? -->
</a>
</td>
<!-- ... -->
</tr>
<!-- ... -->
</tbody>
We can ignore the rest of the document as the table element has an ID. In short, we assume that
.getElementById("matrix1_group0").getElementsByTagName("TR")
will return a collection of html row objects sorted by their appearance.
Set matrix = document.getElementById("matrix1_group0")
Set firstRow = matrix.getElementsByTagName("TR")(1)
Set firstRowSecondCell = firstRow.getElementsByTagName("TD")(2)
traderaName = firstRowSecondCell.innerText
Of course you could inline this all as
document.getElementById("matrix1_group0").getElementsByTagName("TR")(1).getElementsByTagName("TD")(2).innerText
but that would make debugging harder. Also if the web-page is ever presented to you in a different format then this won't work. Ebay is deliberately making it hard for you to scrape data off of it for security.
With only the HTML you have shown you can use CSS selectors to obtain these:
a[href*='aProgramInfoApplyRead.action?programId']
Which says a tag with attribute href that contains the string 'aProgramInfoApplyRead.action?programId'. This matches two elements but the first is the one you want.
CSS Selector:
VBA:
You can use .querySelector method of .document to retrieve the first match
Debug.Print ie.document.querySelector("a[href*='aProgramInfoApplyRead.action?programId']").innerText

expression engine search result related

I want to display up to 200 words of the related results just in the next line to title
But i am not getting the text that {excerpt} should Display
My code is written below
{exp:search:search_results switch="resultRowOne|resultRowTwo"}
<table border="0" cellpadding="6" cellspacing="1" width="100%">
{exp:search:search_results switch="resultRowOne|resultRowTwo"}
<tr class="{switch}">
{if page_meta_title != ""} <td width="30%" valign="top"><b>{title}</b></td>{/if}
</tr>
<tr><td style="color:red!important">{excerpt}</td></tr>
{if count == total_results}
</table>
{/if}
{paginate}
<p>Page {current_page} of {total_pages} pages {pagination_links}</p>
{/paginate}
{/exp:search:search_results}
</table>
Maybe this was just a typo in your question, but it looks like you have the opening search tag listed twice.
{exp:search:search_results switch="resultRowOne|resultRowTwo"}
<table border="0" cellpadding="6" cellspacing="1" width="100%">
{exp:search:search_results switch="resultRowOne|resultRowTwo"}
Also, the excerpt tag by default allows 50 characters. You can also consider the character limiter plugin (http://devot-ee.com/add-ons/character-limiter) which is a free plugin from Ellis Lab. Once you have that setup, you would use it like so....
{exp:char_limit total="200" exact="no"}{your_text_field}{/exp:char_limit}

Resources