Extract text using css selector

Extract text using css selector - excel

I am trying to extract specific text using a CSS selector. Here's a screenshot of the part that I would like to extract
I tried
div[id="Section3"]:first-child
but this doesn't return anything. I can't depend on locating the element by the text because I need to extract that text as shown.
This is the relevant HTML
<div class="ad24123fa4-c17c-4dc5-9aa5-ea007a8db30e-5" style="top:8px;left:218px;width:124px;height:31px;text-align:center;">
<table width="113px" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td>
<table width="100%" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td align="center">
<span class="fcb900b29f-64d7-453d-babf-192e86f17d6f-7">نظامي</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</div>
The full HTML is here.
This is my try
On Error Resume Next
Set ele = .FindElementByXPath("//span[text()='ãäÇÒá']")
If ele Is Nothing Then sStatus = "äÙÇãí" Else sStatus = "ãäÇÒá"
On Error GoTo 0
While inspecting the element I noticed that there is a hint of using $0 in the console .. Can this be useful?
As for the two possible texts "نظامي" and "منازل"

To use xpath with multiple possible search values use the following syntax:
//*[text()='نظامي' or text()='منازل']
CSS selectors (that work for me):
driver.findElementByCss("#ctl00_ContentPlaceHolder1_CrystalReportViewer1 div.ad071889d2-8e6f-4755-ad7d-c44ae0ea9fca-5 table span").text
which is an abbreviation of the full selector:
#ctl00_ContentPlaceHolder1_CrystalReportViewer1 > tbody > tr > td > div > div.crystalstyle > div.ad071889d2-8e6f-4755-ad7d-c44ae0ea9fca-5 > table > tbody > tr > td > table > tbody > tr > td > span
You can also index into table nodeList
Set matches = html.querySelectorAll("#ctl00_ContentPlaceHolder1_CrystalReportViewer1 div.crystalstyle table")
ActiveSheet.Cells(1, 1) = matches.item(80).innerText
Otherwise:
Reading in from html file I can take the last index of the matches based on class selector. For selenium you would switch to:
driver.FindElementsByCss(".fc180999a8-04b5-46bc-bf86-f601317d19c8-7").count
VBA:
Option Explicit
Public Sub test()
Dim html As HTMLDocument, matches As Object
Dim fStream As ADODB.Stream
Set html = New HTMLDocument
Set fStream = New ADODB.Stream
With fStream
.Charset = "UTF-8"
.Open
.LoadFromFile "C:\Users\User\Desktop\Output6.html"
html.body.innerHTML = .ReadText
.Close
End With
Set matches = html.querySelectorAll(".fc180999a8-04b5-46bc-bf86-f601317d19c8-7")
ActiveSheet.Cells(1, 1) = matches.item(matches.Length - 1).innerText
End Sub

Related

Web scraping in Investing.com portfolio with Excel vba

I am trying to scrape the cyrpto currency information in my portfolio (e.g. current worth, change% etc.). I tried to come up with a useful code in last 10 hours but couldn't do it. First I tried the very nice code in here: Web scraping in Investing.com with Excel vba
However, it is for getting defined table information and I am not familiar with webscraping too much, especially with XML method. So I couldn't make it work.
The page I am trying to scrape is only reachable via login; therefore, I will try to show the html via copying here and screenshots.
The page I am trying to scrape:
You may check the example html screenshot (1061477 is id of Dogecoin) and html code below:
<tbody id="tbody_overview_5563889" class="ui-sortable">
<tr id="sort_945629" rel="5563889_945629" data-pair-id="945629" data-pair-exchange-id="1014" data-is-open-by="exchange" data-is-pair-exchange-open="1">
<td class="left dragHandle"><span class="checkers"></span></td>
<td class="flag"><span title="" class="ceFlags bitcoin"> </span></td>
<td data-column-name="name" data-pair-id="945629" class="symbol plusIconTd left bold elp alert js-injected-user-alert-container">
<span class="aqPopupWrapper js-hover-me-wrapper"><a target="_blank" href="/crypto/bitcoin/btc-usd" title="BTC/USD - Bitcoin US Dollar" class="aqlink js-hover-me" hoverme="markets" data-pairid="945629">BTC/USD</a></span>
<span class="js-plus-icon alertBellGrayPlus genToolTip oneliner" data-tooltip="Create Alert" data-tooltip-alt="Alert is active"></span>
</td>
<td data-column-name="symbol" class="left bold "><a target="_blank" href="/crypto/bitcoin/btc-usd" title=""></a></td>
<td data-column-name="exchange" class="left displayNone" title="Bitfinex">Bitfinex</td>
<td data-column-name="last" class="pid-945629-last" id="5563889_last_945629">40,324.0</td>
<td data-column-name="bid" class="pid-945629-bid displayNone" id="5563889_bid_945629">40,322.0</td>
<td data-column-name="ask" class="pid-945629-ask displayNone" id="5563889_ask_945629">40,323.0</td>
<td data-column-name="extended_hours" class="js-extended-hours js-extended-last Font pidExt-945629-last displayNone">--</td>
<td data-column-name="extended_hours_percent" class="js-extended-hours js-extended-percent Font pidExt-945629-pcp displayNone">--</td>
<td data-column-name="open" class="">37,461.0</td>
<td data-column-name="prev" class="displayNone">37,461.0</td>
<td data-column-name="high" class="pid-945629-high " id="5563889_high_945629">40,380.0</td>
<td data-column-name="low" class="pid-945629-low " id="5563889_low_945629">37,233.0</td>
<td data-column-name="chg" class="bold pid-945629-pc greenFont" id="5563889_chg_945629">+2863.0</td>
<td data-column-name="chgpercent" class="bold pid-945629-pcp greenFont" id="5563889_p_chg_945629">+7.64%</td>
<td data-column-name="vol" class="pid-945629-turnover " data-value="8733">8.88K</td>
<td data-column-name="next_earning" class="left textNum displayNone" data-value="0">--</td>
<td data-column-name="time" class="pid-945629-time " id="5563889_time_945629" data-value="1612610025">06:13:45</td>
<td class="icon" id="5563889_isopen_945629"><span class="greenClockIcon middle isOpenExch-1014"></span></td>
<td class="icon"> </td>
</tr><tr id="sort_1061477" rel="5563889_1061477" data-pair-id="1061477" data-pair-exchange-id="1037" data-is-open-by="exchange" data-is-pair-exchange-open="1">
<td class="left dragHandle"><span class="checkers"></span></td>
<td class="flag"><span title="" class="ceFlags dogecoin"> </span></td>
<td data-column-name="name" data-pair-id="1061477" class="symbol plusIconTd left bold elp alert js-injected-user-alert-container">
<span class="aqPopupWrapper js-hover-me-wrapper"><a target="_blank" href="/indices/investing.com-doge-usd" title="Investing.com Dogecoin Index" class="aqlink js-hover-me" hoverme="markets" data-pairid="1061477">Dogecoin</a></span>
<span class="js-plus-icon alertBellGrayPlus genToolTip oneliner" data-tooltip="Create Alert" data-tooltip-alt="Alert is active"></span>
</td>
<td data-column-name="symbol" class="left bold "><a target="_blank" href="/indices/investing.com-doge-usd" title="DOGE/USD">DOGE/USD</a></td>
<td data-column-name="exchange" class="left displayNone" title="Investing.com">Investing.com</td>
<td data-column-name="last" class="pid-1061477-last" id="5563889_last_1061477">0.048506</td>
<td data-column-name="bid" class=" displayNone" id="5563889_bid_1061477">-</td>
<td data-column-name="ask" class=" displayNone" id="5563889_ask_1061477">-</td>
<td data-column-name="extended_hours" class="js-extended-hours js-extended-last Font pidExt-1061477-last displayNone">--</td>
<td data-column-name="extended_hours_percent" class="js-extended-hours js-extended-percent Font pidExt-1061477-pcp displayNone">--</td>
<td data-column-name="open" class="">0.043969</td>
<td data-column-name="prev" class="displayNone">0.043969</td>
<td data-column-name="high" class="pid-1061477-high " id="5563889_high_1061477">0.051038</td>
<td data-column-name="low" class="pid-1061477-low " id="5563889_low_1061477">0.044505</td>
<td data-column-name="chg" class="bold pid-1061477-pc greenFont" id="5563889_chg_1061477">+0.004537</td>
<td data-column-name="chgpercent" class="bold pid-1061477-pcp greenFont" id="5563889_p_chg_1061477">+10.32%</td>
<td data-column-name="vol" class="pid-1061477-turnover " data-value="21137638982">21.08B</td>
<td data-column-name="next_earning" class="left textNum displayNone" data-value="0">--</td>
<td data-column-name="time" class="pid-1061477-time " id="5563889_time_1061477" data-value="1612610031">06:13:51</td>
<td class="icon" id="5563889_isopen_1061477"><span class="greenClockIcon middle isOpenExch-1037"></span></td>
<td class="icon"> </td>
</tr><tr id="sort_1057392" rel="5563889_1057392" data-pair-id="1057392" data-pair-exchange-id="1037" data-is-open-by="exchange" data-is-pair-exchange-open="1">
<td class="left dragHandle"><span class="checkers"></span></td>
<td class="flag"><span title="" class="ceFlags ripple"> </span></td>
<td data-column-name="name" data-pair-id="1057392" class="symbol plusIconTd left bold elp alert js-injected-user-alert-container">
<span class="aqPopupWrapper js-hover-me-wrapper"><a target="_blank" href="/indices/investing.com-xrp-usd" title="Investing.com XRP Index" class="aqlink js-hover-me" hoverme="markets" data-pairid="1057392">XRP</a></span>
<span class="js-plus-icon alertBellGrayPlus genToolTip oneliner" data-tooltip="Create Alert" data-tooltip-alt="Alert is active"></span>
</td>
I highlighted the parts that I am trying to get.
Although it is too slow, I was able to scrape some of the data with below code (x=1061477). I am getting error on redfont ones since it becomes green when the currency is going up. I tried to use the ID, but couldn't get the data. Also it changes my computer's time somehow :)
Sub getprice()
Dim ws As Worksheet: Set ws = ThisWorkbook.Worksheets("Sheet1")
Dim text As String
Dim lastrow As Long
Dim sht As Worksheet
Set sht = ActiveSheet
lastrow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
For i = 2 To lastrow
x = Cells(i, 1).Value
Set IE = CreateObject("InternetExplorer.Application")
IE.navigate "https://www.investing.com/portfolio/?portfolioID=NTUwZjJiZjkzbT46NW8%3D"
Do While IE.Busy And IE.readyState <> 4: DoEvents: Loop
Sleep 500
Dim last As String
'Name = .document.getElementsByClassName("aqPopupWrapper js-hover-me-wrapper")(0).outerText
last = IE.document.getElementsByClassName("pid-" & x & "-last")(0).outerText
high = IE.document.getElementsByClassName("pid-" & x & "-high")(0).outerText
low = IE.document.getElementsByClassName("pid-" & x & "-low")(0).outerText
'Change = IE.document.getElementById("5563889_chg_1057392")(0).innerHTML
Change = IE.document.getElementsByClassName("bold pid-" & x & "-pc redFont")(0).outerText
change2 = IE.document.getElementsByClassName("bold pid-" & x & "-pcp redFont")(0).outerText
volume = IE.document.getElementsByClassName("pid-" & x & "-turnover")(0).outerText
Time = IE.document.getElementsByClassName("pid-" & x & "-time")(0).outerText
IE.Quit
' ws.Cells(2, 1).Value = Name
ws.Cells(i, 3).Value = last
ws.Cells(i, 4).Value = high
ws.Cells(i, 5).Value = low
ws.Cells(i, 6).Value = Change
ws.Cells(i, 7).Value = change2
ws.Cells(i, 8).Value = volume
ws.Cells(i, 9).Value = Time
Next i
End Sub
Any idea on how to scrape this data? Especially with XML method.
Thanks in advance for your help

Not sure what you mean in comments about whole table like in shared link but the whole table as per your image should be possible. You only show HTML from the tbody level (better would have been from table tag level); however, I reconstruct a table from that HTML, by matching on start substring of the id of the tbody, pulling out the outerHTML, adding wrapping table tags, and passing that html to the clipboard to then paste to sheet.
Technically, I could easily have generated a table object and grabbed .Rows(2).outHTML (assuming the "japanese dog" coin is in row 2) and wrapped in table tags instead, just to get the one row of interest.
NOTE: NOT TESTED
Dim s As String
s = "<table>" & ie.document.querySelector("[id^='tbody_overview]").outerHTML & "</table>"
' s = "<table>" & ie.document.querySelector("[id^='tbody_overview]").rows(2).outerHTML & "</table>"
Dim clipboard As Object
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText s
clipboard.PutInClipboard
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).PasteSpecial

First comment is that I think that authentication to investing.com may not even be needed.
investing.com provides a public page for each asset (stock or crypto-coin) that you can analyze. For example, to get the dogecoin info you could use the following url:
https://www.investing.com/crypto/dogecoin/doge-usd
Second comment is that there are ways to transform an html page to an Excel sheet without coding. I did something similar using coinmarketcap.com. See blog post here.
The same can be done for investing.com:
Create a local file called dogecoin-from-investing.com.iqy in a location you will remember. Type into the file the following:
WEB
1
https://www.investing.com/crypto/dogecoin/doge-usd
Selection=AllTables
Formatting=None
PreFormattedTextToColumns=True
ConsecutiveDelimitersAsOne=True
SingleBlockTextImport=False
DisableDateRecognition=False
DisableRedirections=False
Open Excel and create a new helper sheet in your worksheet and call it DogeCoin.
Navigate to Data → Get External Data → Run Web Query… and accept all defaults.
Magic! Excel did all the work for you.
You should now have a populated helper sheet with the up-to-date data about dogecoin.
You can use that data in any other sheet as needed. The data model does not change too often, at least until investing.com decides to change it.
To refresh the data, navigate to the Data menu and hit Refresh All.

WebScraping: Get nested element in HTML Table

Hi I am new to webscraping and got stuck on getting nested html element tag in a table, here is the html code I get from the url http://www.geonames.org/search.html?q=+Leisse&country=FR:
<table class="restable">
<tr>
<td colspan=6 style="text-align: right;"><small>1 records found for "col de la Leisse"</small></td>
</tr>
<tr>
<th></th>
<th>Name</th>
<th>Country</th>
<th>Feature class</th>
<th>Latitude</th>
<th>Longitude</th>
</tr>
<tr>
<td><small>1</small> <img src="/maps/markers/m10-ORANGE-T.png" border="0" alt="T"></td>
<td>Col de la Leisse<br><small></small><span class="geo" style="display:none;"><span class="latitude">45.42372</span><span class="longitude">6.906828</span></span></td>
<td>France, Auvergne-Rhône-Alpes<br><small>Savoy > Albertville > Tignes</small></td>
<td>pass</td>
<td nowrap>N 45° 25' 25''</td>
<td nowrap>E 6° 54' 24''</td>
</tr>
<tr class="tfooter">
<td colspan=6></td>
</tr>
</table>
This is the code for only one row to make things simple, but in my case I iterate over each row and check if the text of <td> element equal to a target value, if true I scrape the value of <span> element with class longitude and latitude. In my case I want to get the row with value Col de la Leisse
Here is my code: (not good)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.findAll('table')[1] # second table
rows = table.find_all('tr')
target = "Col de la Leisse"
longitude, latitude = 0
for row in rows:
cols=row.find_all('td')
# I am stuck here...
# if cols.text == target:
# ...
Result:
longitude = 6.906828
latitude = 45.42372

With bs4 4.7.1 you can use :has and :contains to ensure row has an a tag element with your target string in.
target = 'Col de la Leisse'
rows = soup.select('.restable tr:has(a:contains("' + target + '"))')
for row in rows:
print([item.text for item in row.select('.latitude, .longitude')])
You can of course separate out .latitude and .longitude if you think they will not both be present, or if can occur in different order

How to get all td[3] tags from the tr tags with selenium Xpath in python

I have a webpage HTML like this:
<table class="table_type1" id="sailing">
<tbody>
<tr>
<td class="multi_row"></td>
<td class="multi_row"></td>
<td class="multi_row">1</td>
<td class="multi_row"></td>
</tr>
<tr>
<td class="multi_row"></td>
<td class="multi_row"></td>
<td class="multi_row">1</td>
<td class="multi_row"></td>
</tr>
</tbody>
</table>
and tr tags are dynamic so i don't know how many of them exist, i need all td[3] of any tr tags in a list for some slicing stuff.it is much better iterate with built in tools if find_element(s)_by_xpath("") has iterating tools.

Try
cells = driver.find_elements_by_xpath("//table[#id='sailing']//tr/td[3]")
to get third cell of each row
Edit
For iterating just use a for loop:
print ([i.text for i in cells])

Try following code :
tdElements = driver.find_elements_by_xpath("//table[#id="sailing "]/tbody//td")
Edit : for 3rd element
tdElements = driver.find_elements_by_xpath("//table[#id="sailing "]/tbody/tr/td[3]")

To print the text e.g. 1 from each of the third <td> you can either use the get_attribute() method or text property and you can use either of the following solutions:
Using CssSelector and get_attribute():
print(driver.find_elements_by_css_selector("table.table_type1#sailing tr td:nth-child(3)").get_attribute("innerHTML"))
Using CssSelector and text property:
print(driver.find_elements_by_css_selector("table.table_type1#sailing tr td:nth-child(3)").text)
Using XPath and get_attribute():
print(driver.find_elements_by_xpath('//table[#class='table_type1' and #id="sailing"]//tr//following::td[3]').get_attribute("innerHTML"))
Using XPath and text property:
print(driver.find_elements_by_xpath('//table[#class='table_type1' and #id="sailing"]//tr//following::td[3]').text)

To get the 3 rd td of each row, you can try either with xpath
driver.find_elements_by_xpath('//table[#id="sailing"]/tbody//td[3]')
or you can try with css selector like
driver.find_elements_by_css_selector('table#sailing td:nth-child(3)')
As it is returning list you can iterate with for each,
elements=driver.find_elements_by_xpath('//table[#id="sailing"]/tbody//td[3]')
for element in elements:
print(element.text)

find_elements_by_xpath() not producing the desired output python selenium scraping

I'm trying to find a tr by its class of .tableOne. Here is my code:
browser = webdriver.Chrome(executable_path=path, options=options)
cells = browser.find_elements_by_xpath('//*[#class="tableone"]')
But the output of the cells variable is [], an empty array.
Here is the html of the page:
<tbody class="tableUpper">
<tr class="tableone">
<td><a class="studentName" href="//www.abc.com"> student one</a></td>
<td> <span class="id_one"></span> <span class="long">Place</span> <span class="short">Place</span></td>
<td class="hide-s">
<span class="state"></span> <span class="studentState">student_state</span>
</td>
</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
</tbody>

Please try this:
import re
cells = browser.find_elements_by_xpath("//*[contains(local-name(), 'tr') and contains(#class, 'tableone')]")
for (e in cells):
insides = e.find_elements_by_xpath("./td")
for (i in insides):
result = re.search('\">(.*)</', i.get_attribute("outerHTML"))
print result.group(1)
What this does is gets all the tr elements that have class tableone, then iterates through each element and lists all the tds. Then iterates through the outerHTML of each td and strips each string to get the text value.
It's quite unrefined and will return empty strings, I think. You might need to put some more work into the final product.

VBA Excel get text inside HTMLObject

I know this is really easy for some of you out there. But I have been going deep on the internet and I can not find an answer. I need to get the company name that is inside the
tbody tr td a eBay-tradera.com
and
td class="bS aR" 970,80
/td /tr /tbody
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com
</a>
</td>
<td class="aR">
175</td>
<td class="bS aR">0</td><td class="bS aR">0</td><td class="bS aR">187</td>
<td class="aR">0,00%</td><td class="bS aR">124</td>
<td class="aR">0,00%</td>
<td class="bS aR">26</td>
<td class="aR">20,97%</td>
<td class="bS aR">32</td>
<td class="aR">60,80</td>
<td class="aR">25,81%</td>
<td class="bS aR">5 102,00</td>
<td class="bS aR">0,00</td>
<td class="aR">0,00</td>
<td class="bS aR">
970,80
</td>
</tr>
</tbody>
This is my code, where I only try to get the a tag to start of with but I cant get that to work either
Set TDelements = document.getElementById("matrix1_group0").document.getElementsbytagname("a").innerHTML
r = 0
C = 0
For Each TDelement In TDelements
Blad1.Range("A1").Offset(r, C).Value = TDelement.innerText
r = r + 1
Next
Thanks on beforehand I know that this might be to simple. But I hope that other people might have the same issue and this will be helpful for them as well. The reason for the "r = r + 1" is because there are many more companies on this list. I just wanted to make it as easy as I could. Thanks again!

You will need to specify the element location in the table. Ebay seems to be obfuscating the class-names so we cannot rely on those being consistent. Nor would I usually rely on the elements by their table index being consistent but I don't see any way around this.
I am assuming that this is the HTML document you are searching
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com <!-- <=== You want this? -->
</a>
</td>
<!-- ... -->
</tr>
<!-- ... -->
</tbody>
We can ignore the rest of the document as the table element has an ID. In short, we assume that
.getElementById("matrix1_group0").getElementsByTagName("TR")
will return a collection of html row objects sorted by their appearance.
Set matrix = document.getElementById("matrix1_group0")
Set firstRow = matrix.getElementsByTagName("TR")(1)
Set firstRowSecondCell = firstRow.getElementsByTagName("TD")(2)
traderaName = firstRowSecondCell.innerText
Of course you could inline this all as
document.getElementById("matrix1_group0").getElementsByTagName("TR")(1).getElementsByTagName("TD")(2).innerText
but that would make debugging harder. Also if the web-page is ever presented to you in a different format then this won't work. Ebay is deliberately making it hard for you to scrape data off of it for security.

With only the HTML you have shown you can use CSS selectors to obtain these:
a[href*='aProgramInfoApplyRead.action?programId']
Which says a tag with attribute href that contains the string 'aProgramInfoApplyRead.action?programId'. This matches two elements but the first is the one you want.
CSS Selector:
VBA:
You can use .querySelector method of .document to retrieve the first match
Debug.Print ie.document.querySelector("a[href*='aProgramInfoApplyRead.action?programId']").innerText

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Extract text using css selector - excel

Related

Web scraping in Investing.com portfolio with Excel vba

WebScraping: Get nested element in HTML Table

How to get all td[3] tags from the tr tags with selenium Xpath in python

find_elements_by_xpath() not producing the desired output python selenium scraping

VBA Excel get text inside HTMLObject

Categories

Resources