Separate text from an <a href> in the same td with XPath in Python - python-3.x

I have an HTML webpage like this:
<tr><td style="text-align:center;">7</td><td class="multi_row" style="line-height:15px;">Loaded on 'NYK LEO 303W' at Port of Loading<br> NYK LEO 303W</td><td class="multi_row" style="line-height:15px;">VANCOUVER, BC ,CANADA<br> 3891 DELTAPORT GCT</td><td class="ico_e">2018-10-26 23:30</td></tr>
I want to store the text of the <a href> element in one variable and the plain text (like 'bla bla bla') in another variable.
This is what I have done so far:
event_path = driver.find_elements_by_xpath("//table[@id='detail']//tr/td[2]")
event = [cell.text for cell in event_path]
That is for the text part,
and this one is for the string in the <a href>:
vessel_path = driver.find_elements_by_xpath("//table[@id='detail']//tr/td[2]/a")
vessel = [cell.text.split(' ')[:2] for cell in vessel_path]
The split(' ')[:2] is there because the data looks like NYK LEO 303W and I need only the words, not the number (a regex would be more reliable).
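
The regex idea mentioned above could look roughly like the sketch below. It assumes, based only on the single sample "NYK LEO 303W", that the voyage number is a trailing token starting with a digit; the pattern and the helper name vessel_name are illustrative, not taken from the original code.

```python
import re

# Assumption: the voyage number is the last token and starts with a digit.
VOYAGE_SUFFIX = re.compile(r"\s+\d\w*$")


def vessel_name(raw):
    """Return the vessel name without a trailing voyage number."""
    return VOYAGE_SUFFIX.sub("", raw.strip())


names = [vessel_name(text) for text in ["NYK LEO 303W", "  NYK LEO 303W ", "NYK LEO"]]
print(names)
```

Unlike split(' ')[:2], this keeps names of any word count and leaves a string untouched when no voyage number is present.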

Try the code below to get only the first text node of each td:
event = [driver.execute_script('return arguments[0].firstChild.textContent;', cell).strip() for cell in event_path]

Please try the following code:
elements = driver.find_elements_by_class_name("multi_row")
for element in elements:
    print(element.text)

In your case, the vessel name you expect appears to be present already in the title attribute of the anchor.
If that holds for every row, you can read it directly from the attribute:
vessel_path = driver.find_elements_by_xpath("//table[@id='detail']//tr/td[2]/a")
vessel = [cell.get_attribute("title") for cell in vessel_path]
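
To see how the three pieces (the cell's own text, the link text, and the title attribute) relate, here is a small offline sketch that uses the standard library's ElementTree instead of Selenium. The cell markup is hypothetical: the question does not show the real <a> tag, so the tag layout and the title value are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified cell; the real page's anchor markup is not shown.
cell_html = """<td class="multi_row">Loaded on 'NYK LEO 303W' at Port of Loading<br/><a href="#" title="NYK LEO">NYK LEO 303W</a></td>"""

td = ET.fromstring(cell_html)
link = td.find("a")

event = (td.text or "").strip()          # text before the first child element
vessel_text = (link.text or "").strip()  # visible link text
vessel_title = link.get("title")         # attribute value, None if absent
print(event, vessel_text, vessel_title, sep=" | ")
```

Here td.text plays the same role as firstChild.textContent in the JavaScript answer: it stops at the first child element.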

Related

Using xpath to extract only the text being part of the parent node

How can I select and extract only the text that belongs directly to the parent node? Here is the HTML I am working on. I need to extract only the "$1,950" using XPath. When I select the parent node and extract its text content, I get the text content of its children as well, but I need the text content of the parent node only.
<span class="rentRollup">
<span class="longText">3 Bedrooms</span>
<span class="shortText">3 Beds</span>
$1,950
</span>
I have tried the XPath below, but it prints the child nodes' text along with the parent node's text.
import requests
from fake_useragent import UserAgent
from lxml import html

url = 'https://www.apartments.com/214-taylor-st-raleigh-nc/cr6tchd/'
# initialize the request headers
ua = UserAgent()
header = {'User-Agent': str(ua.chrome)}
response = requests.get(url, headers=header)
print(response)
byte_data = response.content
source_code = html.fromstring(byte_data)
# select the parent of the element whose text contains '3 Bedrooms'
name = source_code.xpath("//*[contains(text(), '3 Bedrooms')]/..")
name = name[0].text_content()
print(name)
Try it this way: after print(response), replace everything with:
tree = html.fromstring(response.content)
name=tree.xpath("//span[@class='rentRollup']/text()")
name[2].strip()
Output:
'$1,950'
The following XPath expression
//*[contains(*/text(), '3 Bedrooms')]/text()
will select just the text nodes which are direct children of the parent node of interest. But there is still whitespace-noise which you need to get rid of.
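
The same idea can be tried offline with the standard library. The sketch below is not the lxml code from the answers; it uses ElementTree, where a parent's own text is its .text plus the .tail of each child, and it strips the whitespace noise mentioned above. The snippet is the HTML from the question.

```python
import xml.etree.ElementTree as ET

snippet = """<span class="rentRollup">
<span class="longText">3 Bedrooms</span>
<span class="shortText">3 Beds</span>
$1,950
</span>"""

parent = ET.fromstring(snippet)

# Direct text of the parent: text before the first child, plus the text
# that follows each child element (its "tail").
pieces = [parent.text] + [child.tail for child in parent]

# Drop whitespace-only fragments and trim the rest.
own_text = [piece.strip() for piece in pieces if piece and piece.strip()]
print(own_text)
```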

How to remove <strong> tags from the result

I am trying to get the list of college names from an online dataset table (search result). The college names are between the <strong> and </strong> tags, and I am not sure how to remove those tags from the result.
geo_table = soup.find('table',{'id':'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
Colleges=geo_table.findAll('strong')
Colleges
I think the problem is that I am extracting the wrong part, because <strong> only makes the line bold. Where should I find the college name?
This is a sample output:
href="?s=IL+MA+PA&p=14.0802+14.0801+14.3901&l=91+92+93+94&id=211440"
To fetch the href value, find all <a> tags, iterate over them, and read each tag's href attribute. To fetch the college name, find the <strong> tag inside each link and read its text.
geo_table = soup.find('table', {'id': 'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
Colleges = geo_table.findAll('a')
for college in Colleges:
    print('href :' + college['href'])
    print('college Name : ' + college.find('strong').text)
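
For a version that runs without a live page, the sketch below applies the same pattern (take each link, then the text of the <strong> inside it) with the standard library's ElementTree rather than BeautifulSoup. The table markup, ids, and college names are made up for illustration, and it skips links that have no <strong> child, which the loop above would fail on.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the results table.
table_html = """<table id="results">
<tr><td><a href="?id=211440"><strong>Example College A</strong></a></td></tr>
<tr><td><a href="?id=000000"><strong>Example College B</strong></a></td></tr>
<tr><td><a href="?page=2">Next page</a></td></tr>
</table>"""

table = ET.fromstring(table_html)

colleges = []
for link in table.iter("a"):
    strong = link.find("strong")
    if strong is not None:  # skip links without a bold name
        colleges.append((strong.text, link.get("href")))

for name, href in colleges:
    print(name, href)
```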

obtain en-US title tag text

I'm trying to obtain the text of only the title@lang=en-US elements in an XML file.
This code obtains the title text for all languages.
entries = root.xpath('//prefix:new-item', namespaces={'prefix': 'http://mynamespace'})
for entry in entries:
    all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
    for title in all_titles:
        print(title.text)
I tried this code to get the title@lang=en-US text, but it does not work.
all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
for title in all_titles:
    test = title.xpath("@lang='en-US'")
    print(test)
How do I obtain the text for only the english language items?
The expression
//prefix:title[lang('en')]
will select all the English-language titles. Specifically:
title elements that have an xml:lang attribute identifying the title as English, for example <title xml:lang="en-US"> or <title xml:lang="en-GB">
title elements within some container that identifies all the contents as English, for example <section xml:lang="en-US"><title/></section>.
If you specifically want only US English titles, excluding other forms of English, then you can use the predicate [lang('en-US')].
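
As a rough offline illustration of the difference between the two predicates, the sketch below mimics the prefix matching of lang() with the standard library's ElementTree, which does not implement lang() itself. The sample document and the helper titles_in are assumptions, and unlike the real lang() function this helper only looks at xml:lang on the title element, not on its ancestors.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
NS = {"prefix": "http://mynamespace"}

# Hypothetical sample document.
doc = """<root xmlns="http://mynamespace">
  <new-item>
    <title xml:lang="en-US">Hello</title>
    <title xml:lang="fr-FR">Bonjour</title>
    <title xml:lang="en-GB">Hullo</title>
  </new-item>
</root>"""


def titles_in(entry, lang):
    """Texts of titles whose xml:lang equals lang or is a sublanguage of it."""
    wanted = lang.lower()
    result = []
    for title in entry.findall("prefix:title", NS):
        value = (title.get(XML_LANG) or "").lower()
        if value == wanted or value.startswith(wanted + "-"):
            result.append(title.text)
    return result


entry = ET.fromstring(doc).find("prefix:new-item", NS)
english = titles_in(entry, "en")     # any English variant
us_only = titles_in(entry, "en-US")  # US English only
print(english, us_only)
```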

How to scrape email address out of a string?

How do I get the email address from this html snippet?
There are thousands of leads like this on a certain web page, and the text within them is not always laid out as shown here.
The only thing they have in common is that the email address comes first.
How can I get the email address and ignore the rest?
These are the elements:
<div class="gm_popup"><div class="gm_name">Adel Outfitters</div><div class="gm_address">1221 W 4th St</div><div class="gm_location">Adel, Georgia 31620<div style="display:none" class="w3-address-country">United States</div></div><div class="gm_phone"><span class="gm_phone_label">P:</span> 229-896-7105</div><div class="gm_email">adeloutfitters@yahoo.com<div><div class="gm_website">https://www.facebook.com/pages/Adel-Outfitters/132735763434461</div><br><a target="_blank" class="directions-link" href="http://maps.google.com/?saddr=+&daddr=1221+W 4th St, Adel, Georgia, 31620">Directions<span class="w3-arrow">different stuffs</span></a></div></div></div>
What I tried:
Set post = html.getElementsByClassName("gm_email")(0)
MsgBox post.innerText
The result:
adeloutfitters@yahoo.com
https://www.facebook.com/pages/Adel-Outfitters/132735763434461
Directionsdifferent stuffs
Expected output:
adeloutfitters@yahoo.com
The closing </div> tag is further down, which is why you are getting the extra text. Can you chop off everything after the first newline? Or check each word in the string and keep the one with "@" in it? It is not an elegant approach, but it would probably work...
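
Both suggestions are easy to sketch. The question uses VBA, so the Python below only illustrates the string handling, applied to the innerText shown in the question. The regex is a deliberately loose assumption about what an address looks like (it expects the usual at sign), not a full email validator.

```python
import re

# The innerText returned for the gm_email element in the question.
inner_text = (
    "adeloutfitters@yahoo.com\n"
    "https://www.facebook.com/pages/Adel-Outfitters/132735763434461\n"
    "Directionsdifferent stuffs"
)

# Option 1: keep only the first line.
first_line = inner_text.splitlines()[0].strip()

# Option 2: search for the first token that looks like an email address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def first_email(text):
    """Return the first email-like token in text, or None if there is none."""
    match = EMAIL.search(text)
    return match.group(0) if match else None


email = first_email(inner_text)
print(first_line, email)
```

The regex option is the safer of the two if the address is not guaranteed to sit on its own line.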

How to use xpath to locate an acronym in parenthesis in a title element? For example (ANT)

In this xml example:
<concept><title>Another Neat Tool(ANT)</title></concept>
The context is: /concept/title/
I want to return any concept title element that contains any open parenthesis, with any text, and a close parenthesis. I want it to locate (ANT) or any other text inside ( ).
I was successful with /context/title/text(), "("
But it only located and highlighted ( not (ANT)
How can I modify the XPath 2.0 expression to locate any concept title element that contains an acronym in parentheses?
Try this expression (you may need to adapt it to include your root node):
/concept/title[contains(text(),"(") and contains(text(),")")]/substring-before(substring-after(text(),"("),")")
From this:
<concept>
<title>Automated Teller Machine (ATM)</title>
</concept>
<concept>
<title>Central Standard Time (CST)</title>
</concept>
<concept>
<title>OPEN AND CLOSE (ABC)</title>
</concept>
<concept>
<title>CLOSE ONLY DEF)</title>
</concept>
<concept>
<title>OPEN ONLY (GHI</title>
</concept>
<concept>
<title>OPEN AND CLOSE (TEST)</title>
</concept>
It returns this:
ATM
CST
ABC
TEST
The XPath expression
/concept/title[matches(., '\([A-Z]{2,}\)')]
will return a sequence of all /concept/title elements that contain an acronym, defined as two or more capital letters in parentheses.
If you want to return the acronym letters themselves, you can use
/concept/title[matches(., '\([A-Z]{2,}\)')]/
replace(., '^.*\(([A-Z]{2,})\).*$', '$1')
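
The regular expression itself can be checked outside an XPath 2.0 processor. The sketch below applies the same pattern with Python's re module to the sample titles listed earlier; it is only a way to test the pattern, not a replacement for the XPath expressions.

```python
import re

# Sample titles from the test document above.
titles = [
    "Automated Teller Machine (ATM)",
    "Central Standard Time (CST)",
    "OPEN AND CLOSE (ABC)",
    "CLOSE ONLY DEF)",
    "OPEN ONLY (GHI",
    "OPEN AND CLOSE (TEST)",
]

# Two or more capital letters enclosed in parentheses.
ACRONYM = re.compile(r"\(([A-Z]{2,})\)")

acronyms = []
for title in titles:
    match = ACRONYM.search(title)
    if match:  # titles with an unbalanced parenthesis are skipped
        acronyms.append(match.group(1))

print(acronyms)
```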
