Get href within a table - python-3.x

Sorry, this has most likely been asked before, but I can't find an answer on Stack Overflow or through a search engine.
I'm trying to scrape some data from a table, but it contains href links that I need to get. The HTML is as follows:
<table class="featprop results">
<tr>
**1)**<td class="propname" colspan="2"> West Drayton</td>
</tr>
<tr><td class="propimg" colspan="2">
<div class="imgcrop">
**2)**<img src="content/images/1/1/641/w296/858.jpg" alt=" Ashford" width="148"/>
<div class="let"> </div>
</div>
</td></tr>
<tr><td class="proprooms">
So far I have used the following:
for table in soup.findAll('table', {'class': 'featprop results'}):
    for tr in table.findAll('tr'):
        for a in tr.findAll('a'):
            print(a)
This returns both 1 and 2 in the above HTML. Could anyone help me strip out just the href link?

for table in soup.findAll('table', {'class': 'featprop results'}):
    for tr in table.findAll('tr'):
        for a in tr.findAll('a'):
            print(a['href'])
Output:
/lettings-search-results?task=View&itemid=136
/lettings-search-results?task=View&itemid=136
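See the Beautiful Soup documentation on tag attributes.
EDIT:
import re

links = set()  # a set removes the duplicates
# Runs inside the loop over tr shown above; the "?" is escaped so it matches literally
for a in tr.findAll('a', href=re.compile(r'^/lettings-search-results\?')):
    links.add(a['href'])
See the Beautiful Soup documentation on regular expression filters.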
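
A note on the pattern used for filtering: in a regular expression an unescaped `?` makes the preceding character optional, so the question mark in the URL has to be written as `\?` to be matched literally. The sketch below uses only the standard `re` module and a hand-written list of href strings; the last two entries are hypothetical values added for illustration and do not come from the page in the question. It also shows `dict.fromkeys` as one way to drop duplicates while keeping the original order, which a plain `set` does not guarantee.

```python
import re

# The first two hrefs are from the question; the last two are hypothetical.
hrefs = [
    "/lettings-search-results?task=View&itemid=136",
    "/lettings-search-results?task=View&itemid=136",
    "/lettings-search-result",   # hypothetical near-miss
    "/contact-us",               # hypothetical unrelated link
]

# Unescaped "?" means "the preceding s is optional".
loose = re.compile(r'^/lettings-search-results?')
# Escaped "\?" matches a literal question mark.
strict = re.compile(r'^/lettings-search-results\?')

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

# Remove duplicates but keep first-seen order.
unique_links = list(dict.fromkeys(strict_hits))

print(len(loose_hits), len(strict_hits))
print(unique_links)
```

The loose pattern also accepts the near-miss path, while the strict one keeps only links that really start with the search-results query.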

This gives you a list of the a tags under the element with the selected class name.
result = soup.select(".featprop a")
for a in result:
    print(a['href'])
It gives the following result:
/lettings-search-results?task=View&itemid=136
/lettings-search-results?task=View&itemid=136

Related

Node - Cheerio - Find element that contains specific text

I am trying to get "text that I want" from the site with this structure of code:
<td class="x">
<h3 class="x"> number </h3>
<p>
text that I want;
</p>
</td>
If there were only one td with class "x", I would do this:
$('td.x > p > a').text()
and get the text that I want. The problem is that this site has many td and h3 elements with the same class "x". The only difference is that the text in each h3 element is a different number, and I know which number is in the h3 element next to my link. For example:
<td class="x">
<h3 class="x"> **125** </h3>
<p>
text that I want;
</p>
</td>
The question is: is it possible to choose a selector based on the text inside an element? In my example I know that the code contains an h3 element with the text "125". Or is there a better way to get the text from the "a" element in my case?
:contains is the selector you're looking for:
$('h3:contains("125")')
This selects the h3 that contains the text you specified.
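
The selector above finds the heading, but the wanted text sits in a link next to it, so one more step is needed: go from the matching h3 to its td and read the link from there. The sketch below shows that logic in Python with the standard `xml.etree.ElementTree` module rather than Cheerio. The sample markup is an assumption: the `a` elements and the surrounding `table`/`tr` are added for illustration (the question's selector `td.x > p > a` implies the links exist), and `link_text_for_heading` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modelled on the question's structure.
HTML = """
<table>
  <tr>
    <td class="x"><h3 class="x"> 124 </h3><p><a href="#">other text</a></p></td>
    <td class="x"><h3 class="x"> 125 </h3><p><a href="#">text that I want</a></p></td>
  </tr>
</table>
"""

def link_text_for_heading(root, wanted):
    """Return the link text of the first td whose h3 contains `wanted`."""
    for td in root.iter("td"):
        heading = td.find("h3")
        # Substring test, similar in spirit to the :contains selector.
        if heading is not None and wanted in (heading.text or ""):
            link = td.find("p/a")
            if link is not None:
                return (link.text or "").strip()
    return None

root = ET.fromstring(HTML.strip())
found = link_text_for_heading(root, "125")
print(found)
```

Like `:contains`, this is a substring test, so a search for "12" would also match a heading that reads "125".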

find_elements_by_xpath() not producing the desired output python selenium scraping

I'm trying to find a tr by its class, tableone. Here is my code:
browser = webdriver.Chrome(executable_path=path, options=options)
cells = browser.find_elements_by_xpath('//*[@class="tableone"]')
But the value of the cells variable is [], an empty list.
Here is the HTML of the page:
<tbody class="tableUpper">
<tr class="tableone">
<td><a class="studentName" href="//www.abc.com"> student one</a></td>
<td> <span class="id_one"></span> <span class="long">Place</span> <span class="short">Place</span></td>
<td class="hide-s">
<span class="state"></span> <span class="studentState">student_state</span>
</td>
</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
</tbody>
Please try this:
import re

# All tr elements whose class attribute contains "tableone"
cells = browser.find_elements_by_xpath("//*[contains(local-name(), 'tr') and contains(@class, 'tableone')]")
for e in cells:
    insides = e.find_elements_by_xpath("./td")
    for i in insides:
        # Crude extraction of whatever follows the first '">' in the cell's markup
        result = re.search('\">(.*)</', i.get_attribute("outerHTML"))
        if result:  # skip cells where the pattern does not match
            print(result.group(1))
This gets all the tr elements that have the class tableone, iterates over them and lists each one's td children. It then runs a regular expression over the outerHTML of each td to pull out the text value.
It's quite unrefined and will probably return some empty strings, so you may need to put more work into the final product.
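
One possible refinement of the text-extraction step, sketched with the standard library only: instead of a regular expression, feed each cell's outerHTML string to `html.parser.HTMLParser` and collect the text nodes. The two sample strings are copied from the HTML in the question; `TextCollector` and `inner_text` are hypothetical names introduced here. In a live Selenium session the element's `text` property may already be enough, so treat this as an illustration rather than the recommended fix.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only nodes between tags
            self.chunks.append(text)

def inner_text(outer_html):
    """Return the text content of a fragment, joined by single spaces."""
    collector = TextCollector()
    collector.feed(outer_html)
    collector.close()
    return " ".join(collector.chunks)

# outerHTML strings as they might be returned for two of the td cells.
cells = [
    '<td><a class="studentName" href="//www.abc.com"> student one</a></td>',
    '<td> <span class="id_one"></span> <span class="long">Place</span> '
    '<span class="short">Place</span></td>',
]

texts = [inner_text(cell) for cell in cells]
for text in texts:
    print(text)
```

This avoids leftover tag fragments in the result, because the parser separates markup from text before anything is printed.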

xpath join text from multiple elements python

Hello, I have some HTML from this website: https://www.oddsportal.com/soccer/argentina/superliga/results/
<td class="name table-participant">
<a href="/soccer/argentina/superliga/independiente-san-martin-tIuN5Umrd/">
<span class="bold">Independiente</span>
"- San Martin T."
</a>
</td>
<td class="name table-participant">
<a href="/soccer/argentina/superliga/lanus-huracan-xIDIe0Gr/">
"Lanus - "
<span class="bold">Huracan</span>
</a>
</td>
<td class="name table-participant">
Rosario Central - Colon Santa FE
</td>
I want to select and join a/text() and span/text() so that the result looks like this: "Independiente - San Martin T."
As you can see, the span is not always in the same place and is sometimes missing (see the last td).
I used this code:
('//td[@class="name table-participant"]/a/text() | span/text()').extract()
but it returns only the a/text().
Can you help me to make this work?
Thank you
You are searching for span/text() without a scope. Add // at the beginning of that part of the query, so the whole expression becomes:
('//td[@class="name table-participant"]/a/text() | //span/text()').extract()
But I strongly recommend this solution instead:
('//td[@class="name table-participant"]//*[self::a/ancestor::td or self::span]/text()').extract()
so that span elements are taken only from the td scope you selected.
I'm assuming that you're using Scrapy to scrape the HTML.
From the structure of your sample HTML, it looks like you want to obtain the text of the anchor element, so you need to iterate over those.
Only then can you strip and join the text child nodes of the anchor element to obtain properly formatted strings. The inconsistent use of quotes is an additional complication, but the following should get you going.
from scrapy.selector import Selector
HTML="""
<td class="name table-participant">
<a href="/soccer/argentina/superliga/independiente-san-martin-tIuN5Umrd/">
<span class="bold">Independiente</span>
"- San Martin T."
</a>
</td>
<td class="name table-participant">
<a href="/soccer/argentina/superliga/lanus-huracan-xIDIe0Gr/">
"Lanus - "
<span class="bold">Huracan</span>
</a>
</td>
<td class="name table-participant">
Rosario Central - Colon Santa FE
</td>
"""
def strip_and_join(x):
    l = []
    for s in x:
        # strip whitespace and quotes
        s = s.strip().strip('"').strip()
        # drop now empty strings
        if s:
            l.append(s)
    return " ".join(l)

for x in Selector(text=HTML).xpath('//td[@class="name table-participant"]/a'):
    print(strip_and_join(x.xpath('.//text()').extract()))
Note that for the sake of clarity I didn't squeeze the code into a single list comprehension, although this would be possible of course.
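
For reference, a compact variant of the same helper might look like the sketch below. It needs no Scrapy: the input lists are hand-written stand-ins for what `.extract()` could return for the two anchors in the sample HTML, so their exact whitespace is an assumption.

```python
def strip_and_join(parts):
    """Strip whitespace and quotes from each part, drop empties, join with spaces."""
    cleaned = (s.strip().strip('"').strip() for s in parts)
    return " ".join(s for s in cleaned if s)

# Assumed shape of the text nodes under each anchor element.
first = ['\n', 'Independiente', '\n"- San Martin T."\n']
second = ['\n"Lanus - "\n', 'Huracan', '\n']

joined = [strip_and_join(first), strip_and_join(second)]
for line in joined:
    print(line)
```

The generator expression does the cleaning and the second one filters out strings that became empty, which is the same two-step logic as the loop version.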

Beautifulsoup placing </li> in the wrong place because there is no closing tag [duplicate]

I would like to scrape the table from html code using beautifulsoup. A snippet of the html is shown below. When using table.findAll('tr') I get the entire table and not only the rows. (probably because the closing tags are missing from the html code?)
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
As stated in the Beautiful Soup documentation, html5lib parses the document the way a web browser does (as lxml does in this case). It tries to repair your document tree by adding or closing tags where needed.
In your example I used lxml as the parser, and it gave the expected rows:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added html and body tags because they weren't present in the source (it tries to create a well-formed document, as stated previously).
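
To make the idea of implied closing tags concrete, here is a rough standard-library sketch that is not what lxml or html5lib do internally: it treats every opening `<tr>` as the start of a new row and every opening `<td>` as the start of a new cell, which is enough for markup like the snippet in the question. `ImpliedRowParser` and `parse_rows` are hypothetical names, and the sample data is a shortened copy of the question's table.

```python
from html.parser import HTMLParser

class ImpliedRowParser(HTMLParser):
    """Collect table cells when </tr> and </td> are omitted."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TR> and <tr> are treated alike.
        if tag == "tr":
            self.rows.append([])      # a new <tr> implicitly ends the previous row
            self._in_cell = False
        elif tag == "td" and self.rows:
            self.rows[-1].append("")  # a new <td> implicitly ends the previous cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

def parse_rows(markup):
    """Return the table as a list of rows, each a list of stripped cell texts."""
    parser = ImpliedRowParser()
    parser.feed(markup)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]

DATA = """<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
</TABLE>
"""

rows = parse_rows(DATA)
for row in rows:
    print(row)
```

For real pages a repairing parser such as lxml or html5lib is the more robust choice; the sketch only shows why rows can be recovered even though the closing tags are missing.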

Get the text of a link within a table cell

I have a table similar to this one:
<table id="space-list" class="aui list-container">
<tr class="space-list-item" data-spacekey="BLANKSPACEEXAMPLE">
<td class="entity-attribute space-name">
<a title="Blank Space Example" href="https://q-leap.atlassian.net/wiki/display/BLANKSPACEEXAMPLE/Blank+Space+Example+Home">
Blank Space Example
</a>
</td>
<td class="entity-attribute space-desc">
<span>
An example of a "Knowledge Base" type space, freely editable, accessible to everyone, may be deleted at any time.
</span>
</td>
</tr>
</table>
My PageObject code looks like this
class Space < PageObject::Elements::TableRow
  def name
    cell_element(index: 0).link_element(href: /q-leap/).text
  end

  def description
    cell_element(index: 1).text
  end
end

PageObject.register_widget :space, Space, :tr

class SpaceDirectoryPage
  include PageObject

  spaces(:space) do
    table_element(:id => 'space-list')
      .group_elements(:tag_name => 'tr')[1..-1]
  end
end
And now I am iterating over all the rows in the table to get the content of each cell:
while true
  on(SpaceDirectoryPage).space_elements.each_with_index do |space|
    puts space.name
    puts space.description
  end
end
Which is working fine for the description, but I have no clue how to access the text of the link within the first column; tried 100s of things, nothing worked.
Thanks in advance!
