I'm migrating some VBA code that scrapes data from Amazon from Internet Explorer to Selenium.
The code enters a search term and scrapes items such as ASIN, selling price, number of reviews, etc. from the search results. I can get all the items except the ratings.
The rating appears in two places in each product's hierarchy, both inside a span element:
<span aria-label="4.3 out of 5 stars">
<span class="a-icon-alt">4.3 out of 5 stars</span>
Sub AMZ_Scraper()
    SearchURL = "https://www.amazon.ca/s?k=BIKE+LED+LIGHTS+REAR+FRONT&ref=nb_sb_noss_2"

    'Start Edge and navigate to URL
    Dim Browser As New WebDriver
    Dim Keys As New Selenium.Keys
    Browser.Start "edge"
    Browser.Get SearchURL

    'Find top element
    Dim Elements As Selenium.WebElements
    Set Elements = Browser.FindElementsByCss(".s-result-item")

    For Each Element In Elements
        Asin = Element.Attribute("data-asin")
        ProductName = Element.FindElementByClass("a-size-base-plus").Text
        Reviews = Element.FindElementByClass("a-size-base").Text
        Rating1 = Element.FindElementByXPath("//div[@class = 'a-row a-size-small']").FindElementByCss("span").Attribute("aria-label")
        Rating2 = Element.FindElementByXPath("//div[@class = 'a-row a-size-small']").FindElementByXPath("//span[@class = 'a-icon-alt']").Text
    Next
End Sub
The Rating1 code works, but it extracts the same rating (4.4 out of 5 stars) for every product.
The Rating2 code does not raise an error, but it does not extract any data either.
How do I extract the value from the element that holds the rating?
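One possible explanation, offered as a sketch rather than a confirmed fix: an XPath that starts with `//` is evaluated from the document root even when it is called on a child element, so every iteration finds the first matching node on the page. Prefixing the expression with `.` makes it relative to the current result item. The fragment below assumes the `Elements` collection from the code above and the SeleniumBasic convention that the optional second and third arguments are a timeout and a raise-on-failure flag. Reading `textContent` instead of `.Text` is an assumption that the `a-icon-alt` span is visually hidden, in which case `.Text` may come back empty.

```vba
Dim Element As Selenium.WebElement
Dim RatingNode As Selenium.WebElement
Dim Rating As String

For Each Element In Elements
    ' The leading "." anchors the search at the current result item
    ' instead of the document root. Raise = False returns Nothing
    ' rather than raising an error when an item has no rating.
    Set RatingNode = Element.FindElementByXPath(".//span[@class='a-icon-alt']", 0, False)

    If RatingNode Is Nothing Then
        Rating = ""
    Else
        ' textContent is read as a property, so it also works for
        ' text that is not rendered on screen (assumption, see above).
        Rating = RatingNode.Attribute("textContent")
    End If

    Debug.Print Element.Attribute("data-asin"), Rating
Next
```

Result items such as ads or section headers may have no rating at all, which is why the sketch checks for `Nothing` before reading the value.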
Related
I'm trying to scrape data from a website via Excel VBA. I have a web page which shows different data depending on a button selection, but the button sits within a ul list. I can find the element by class using:
.FindElementByClass("shared-filter-button-list_navItem__ZiG2J")
But I can't seem to work out how to switch the focus between 'This season' and 'All time' to change the displayed data on the page. Any ideas would be gratefully received. The HTML is:
<ul class="shared-filter-button-list_navContainer__3hJmS"><li class="shared-filter-button-list_navItem__ZiG2J is-active"><button class="tag-button_btn__1B2dI tag-button__purple__3SyTF shared-filter-button_wrap__3OgbA is-active" value="This season" type="button">This season</button></li><li class="shared-filter-button-list_navItem__ZiG2J"><button class="tag-button_btn__1B2dI tag-button__purple__3SyTF shared-filter-button_wrap__3OgbA " value="All time" type="button">All time</button></li></ul>
It would help to see the page, but if you just want to click the "This season" or "All time" button, you can find the buttons inside the list you already have and click one.
Update
I misread the provided HTML (it's all on one line) and thought that shared-filter-button-list_navItem__ZiG2J was the container ul rather than the list items. I had also overlooked that Selenium collections in VBA use 1-based indexes, not 0-based.
The code below finds all buttons that match the query and prints their index and text to the debug window.
Private Driver As Selenium.ChromeDriver

Sub Main()
    Set Driver = New Selenium.ChromeDriver
    Driver.Get "https://www.euroleaguebasketball.net/eurocup/players/lukas-meisner/011187/"

    Dim List As Selenium.WebElements
    Dim Index As Long

    ' get a list of buttons that are children of li with the specified class name
    Set List = Driver.FindElementsByXPath("//li[contains(@class, 'shared-filter-button-list_navItem__ZiG2J')]/button", 0, 5000)
    If List Is Nothing Then
        Debug.Print "failed to get list"
        Exit Sub
    End If

    For Index = 1 To List.Count
        Debug.Print Index & " -> " & List(Index).Text
    Next Index
End Sub
Expected result from this code is:
1 -> This season
2 -> All time
3 -> Regular Season
If you wanted to click the "This season" button:
List(1).Click
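As an alternative to clicking by position, the button could be targeted by its value attribute, which is less fragile if the page adds or reorders buttons. This is only a sketch: it assumes the `Driver` variable from the code above, the HTML shown in the question, and the SeleniumBasic convention that the optional arguments are a timeout in milliseconds followed by a raise-on-failure flag.

```vba
Dim Btn As Selenium.WebElement

' Match the button by its value attribute; wait up to 5 seconds and
' return Nothing instead of raising an error if it never appears.
Set Btn = Driver.FindElementByXPath( _
    "//li[contains(@class, 'shared-filter-button-list_navItem')]/button[@value='All time']", _
    5000, False)

If Btn Is Nothing Then
    Debug.Print "button not found"
Else
    Btn.Click
End If
```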
I'm trying to scrape each of the symbol codes and names from here (about 1/4 of the way down the page): https://uk.finance.yahoo.com/quote/MSFT?p=MSFT&.tsrc=fin-srch
If I inspect the HTML of the first row, with the symbol AAPL, I am given the following:
<tr class="Va(t) Bdc($seperatorColor) TapHc(h) Fw(500) Bgc($hoverBgColor):h H(44px) BdT"></tr>
In my VBA I navigate to the web page by creating an InternetExplorer object, and the first piece of code that actually begins the scraping is the following:
Dim allRowOfData As Variant
Set allRowOfData = appIE.document.getElementsByClassName("Va(t)")
Dim myValue As String
myValue = allRowOfData.Cells(1).innerHTML
If I look in the immediate window I am then presented with so many HTMLElements (20) plus all of their children that I have no idea where to begin, to be able to get the data that I want.
Is there an easier way to do this?
Also, how do we know what to put in the getElementsByClassName? Initially I had the entire string after the <tr class= and this returned nothing at all.
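On the class-name part of the question: a class attribute is a space-separated list of individual class names, and getElementsByClassName takes one or more of those names. "Va(t)" is a single name that many elements on the page share, which would explain the large collection. A minimal sketch of one way to narrow it down follows. It assumes the `appIE` object from the code above, and the guess that the symbol and the name sit in the first two cells of each row is hypothetical and would need checking against the real page.

```vba
Dim candidates As Object
Dim el As Object

' Everything that carries the Va(t) class, table rows or not
Set candidates = appIE.document.getElementsByClassName("Va(t)")

For Each el In candidates
    ' Keep only table rows; other elements have no Cells collection
    If UCase$(el.tagName) = "TR" Then
        ' Cell positions are an assumption: 0 = symbol, 1 = name
        If el.Cells.Length >= 2 Then
            Debug.Print el.Cells(0).innerText, el.Cells(1).innerText
        End If
    End If
Next el
```

The two `If` statements are nested rather than joined with `And` because VBA evaluates both sides of `And`, and `.Cells` would fail on an element that is not a row.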
I am trying to extract the market cap from the website "https://www.bloomberg.com/quote/206:HK"
which is 1.059B in this case.
I would like to extract the market cap value into an Excel column for a list of Bloomberg tickers. I would like to do this in VBA and am unfortunately not sure where to start.
Basically, I have a column with all the links to Bloomberg. I would like to extract the market cap values into the column next to it.
You can do that with the code below. I use two steps to get the value. It would probably also work via the CSS class value__b93f12ea, but that class name includes a hex value, which often means the identifier is dynamically generated and may change.
Sub ScrapMarketCap()

    Dim browser As Object
    Dim url As String
    Dim nodeMarketCapAll As Object
    Dim nodeMarketCap As Object

    url = "https://www.bloomberg.com/quote/206:HK"

    'Initialize Internet Explorer, set visibility,
    'call URL and wait until page is fully loaded
    Set browser = CreateObject("internetexplorer.application")
    browser.Visible = True
    browser.navigate url
    Do Until browser.ReadyState = 4: DoEvents: Loop

    'Get all html elements with the css class "dataBox marketcap numeric"
    'in a node collection and get the first one by index (0)
    'There will be only one element with this class. But we still need to
    'specify the index, because we need the specific element from the node list
    '
    'We want this html in our dom object
    '<section class="dataBox marketcap numeric">
    '  <header class="title__49417cb9"><span>Market Cap</span></header>
    '  <div class="value__b93f12ea">1.074B</div>
    '</section>
    Set nodeMarketCapAll = browser.document.getElementsByClassName("dataBox marketcap numeric")(0)

    If Not nodeMarketCapAll Is Nothing Then
        'If we got the element
        'we take the value of the market cap from the first div tag
        Set nodeMarketCap = nodeMarketCapAll.getElementsByTagName("div")(0)
        If Not nodeMarketCap Is Nothing Then
            'If we got the div
            'we take the value from it
            MsgBox Trim(nodeMarketCap.innertext)
        End If
    End If
End Sub
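Since the question mentions a whole column of links, the same two-step lookup could be wrapped in a loop. The following is a sketch only: the sheet layout (URLs in column A from row 1, results written to column B of the active sheet) and the procedure name are assumptions, not something stated in the question.

```vba
Sub ScrapMarketCapList()

    Dim browser As Object
    Dim node As Object
    Dim ws As Worksheet
    Dim lastRow As Long
    Dim r As Long

    'Assumed layout: links in column A, market caps go to column B
    Set ws = ActiveSheet
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

    Set browser = CreateObject("internetexplorer.application")
    browser.Visible = True

    For r = 1 To lastRow
        browser.navigate CStr(ws.Cells(r, "A").Value)
        Do While browser.Busy Or browser.ReadyState <> 4: DoEvents: Loop

        'Same two steps as above: section first, then its first div
        Set node = browser.document.getElementsByClassName("dataBox marketcap numeric")(0)
        If Not node Is Nothing Then
            Set node = node.getElementsByTagName("div")(0)
            If Not node Is Nothing Then
                ws.Cells(r, "B").Value = Trim(node.innertext)
            End If
        End If
    Next r

    browser.Quit
End Sub
```

Rows where the section is not found are simply left blank, which makes failed lookups easy to spot afterwards.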
I have the following html part
<div class="description">
<span>Brand:</span>
Nikon<br/>
<span>Product Code:</span> 130342 <br/>
<span>Barcode</span> 18208948581 <br/>
<span>Availability:</span>Available</div>
I am trying to get the last span and the word Available using the following:
Set availability = ie.Document.getElementsByClassName(".description").getElementsByTagName("span")(2)
wks.Cells(i, "D").Value = availability.innerText
But it shows the text of all the spans.
What am I doing wrong here?
Use last-child css pseudo class in descendant combination with parent element class selector.
.description span:last-child
The :last-child CSS pseudo-class represents the last element among a
group of sibling elements.
Applying:
single match
Set availability = ie.document.querySelector(".description span:last-child")
Cells(1,1) = availability.innerText
all matches
Set availability = ie.document.querySelectorAll(".description span:last-child")
Cells(1,1) = availability.item(0).innerText
Otherwise, you can return the span collection from that parent class and index into it
Set availability = ie.document.querySelectorAll(".description span")
Cells(1,1) = availability.item(2).innerText '<==choose your index here
Or even chain:
Set availability = ie.document.querySelector(".description span + span + span") '<==expand as required. This uses the adjacent sibling combinator.
Sadly, the nth-of-type / nth-child pseudo-classes are not supported in this VBA implementation, though they are available in many other languages, e.g. Python.
---
If you are after just the word Available, you should be able to use .description as your selector to return all the text in the div. Then use Split on the .innerText with ":" as the delimiter and take the element at UBound (i.e. the last element of the generated array).
Set availability = ie.document.querySelector(".description")
Dim arr() As String
arr = Split(availability.innerText, ":")
Cells(1,1) = arr(UBound(arr))
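Another option, sketched here under the assumption that the HTML is exactly as posted, is to read the text node that follows the last span. In that markup the word Available is not inside any span; it is a bare text node that sits directly after the Availability: span.

```vba
Dim lastSpan As Object

'The last span inside the description div (the "Availability:" label)
Set lastSpan = ie.document.querySelector(".description span:last-child")

If Not lastSpan Is Nothing Then
    'The text right after the span is its next sibling node
    If Not lastSpan.NextSibling Is Nothing Then
        Cells(1, 1) = Trim$(lastSpan.NextSibling.NodeValue)
    End If
End If
```

This avoids depending on the ":" delimiter, at the cost of depending on the value following the label with nothing in between.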
As Zac pointed out in the comments, you shouldn't use a period (.) with the getElementsByClassName method.
ie.Document.getElementsByClassName returns a DispHTMLElementCollection of elements. You need to specify which element you want to reference:
Set availability = ie.Document.getElementsByClassName("description")(0).getElementsByTagName("span")(2)
A better way to write the code would be to reference the Microsoft HTML Object Library and create a variable to test each element returned. Unfortunately, there is a bug in the DispHTMLElementCollection implementation, so you will need to use Object instead of DispHTMLElementCollection.
Dim doc As HTMLDocument
Dim availability As Object
Set doc = ie.Document
Set availability = doc.getElementsByClassName("description")

Dim div As HTMLDivElement
Dim span As HTMLSpanElement
Dim spans As Object

For Each div In availability
    Set spans = div.getElementsByTagName("span")
    For Each span In spans
        Debug.Print span.innerText
    Next
Next
Good Morning,
I’m hoping that some kind soul out there can help me with a roadblock I’ve encountered in my quest to manipulate a website with VBA. I am using MS Excel 2010 and Internet Explorer 11.0.56.
I’m somewhat comfortable with VBA but have never used it to navigate to a website, enter information and click on buttons. I’ve managed to muddle through as follows:
In Column A of my Excel spreadsheet, I have a list of 10 digit case numbers.
The code below will open IE, navigate to the desired website, pause while I log in, then navigate to the search screen, enter in the first case number and press the SEARCH button (yes, I have the case number in this example hard coded in with no looping, but that stuff I can handle so please ignore):
Sub Button_Click()

    Dim objIE As Object
    Set objIE = New InternetExplorerMedium

    objIE.Top = 0
    objIE.Left = 0
    objIE.Width = 800
    objIE.Height = 600
    objIE.AddressBar = 0
    objIE.StatusBar = 0
    objIE.Toolbar = 0
    objIE.Visible = True

    objIE.Navigate ("https://somewebsite.com")
    MsgBox ("Please log in and then press OK")
    objIE.Navigate ("https://somewebsite.com/docs")

    Do
        DoEvents
    Loop Until objIE.ReadyState = 4

    objIE.Document.all("caseNumber").Value = "1234567890"
    objIE.Document.getElementById("SearchButton").Click
    Exit Sub

    Do
        DoEvents
    Loop Until objIE.ReadyState = 4

    MsgBox ("Done")
End Sub
That will bring me to this screen
The file number entered in the search field will return any number of files in a dynamic table with a checkbox to the left of each file.
For this example, let’s say I am ONLY concerned with the file called “CC8” under the “Type” column. There will only ever be one instance of “CC8” for a given file number.
What I need help with is, through VBA, how do I search through this table, find the “CC8” line, and then have the checkbox to the left automatically checked?
When I inspect the “CC8” element in IE, this is the HTML associated with it (highlighted in gray; the entire table is under class “listing list-view clearfix”)
see here
The HTML for the checkbox related to the “CC8” item is below:
HTML code here
The “id” for both has the same sequence of numbers, but one starts with “viewPages” and the other “doc”.
Can anyone help me out as to what I need to add to my code to get this checkbox checked? Thank you!
Note:
Please post the actual HTML using the snippet tool.
Generally:
Without HTML to properly test, I am assuming that the following 2 nodeLists are the same length, meaning that when the search text is found in aNodeList then the assumption is the same index can be used to target the corresponding checkbox in the bNodeList:
Dim aNodeList As Object, bNodeList As Object, i As Long

With objIE.document
    Set aNodeList = .querySelectorAll("a[target='_blank']")
    Set bNodeList = .querySelectorAll("[title='Search Result: Checkbox']")
End With

For i = 0 To aNodeList.Length - 1
    If aNodeList.item(i).innerText = "CC8" Then
        bNodeList.item(i).Click
        Exit For
    End If
Next
You could also potentially use the following instead as you say the viewPages prefixes each item:
Set aNodeList = .querySelectorAll("a[id^='viewPages']")
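If the two ids really do share the same number sequence, as the question describes, the checkbox id could be derived from the link id instead of relying on both node lists having the same length. This is a sketch under that assumption: the exact id layout (prefix followed directly by the number) is a guess, since the HTML was only posted as an image.

```vba
Dim links As Object
Dim checkBox As Object
Dim checkBoxId As String
Dim i As Long

Set links = objIE.document.querySelectorAll("a[id^='viewPages']")

For i = 0 To links.Length - 1
    If Trim$(links.item(i).innerText) = "CC8" Then
        'Swap the "viewPages" prefix for "doc", keep the trailing number
        checkBoxId = "doc" & Mid$(links.item(i).ID, Len("viewPages") + 1)
        Set checkBox = objIE.document.getElementById(checkBoxId)
        If Not checkBox Is Nothing Then checkBox.Click
        Exit For
    End If
Next i
```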
Other observations:
Traditional checkboxes would have a checked attribute and syntax of
bNodeList.item(i).Checked = True, but as I can't see that attribute in your element I am assuming a .Click suffices.