How to extract company address from website - excel

I have been browsing, trying and reading different code samples from here, but I can't get mine to work. What I am trying to do:
Browse a website which contains information about all of the country's companies.
Look up the company on the website (using the name in an Excel cell as the reference).
When the company page loads, extract "Domicilio" (address) from it and paste it into another Excel cell.
Information to take into account:
The website is: https://www.infocif.es/
The lookup form is: https://www.infocif.es/general/empresas-informacion-listado-empresas.asp?Buscar= , so to search for the company "014 media SL" it would be https://www.infocif.es/general/empresas-informacion-listado-empresas.asp?Buscar=014 media SL
If you load the previous link, which leads to the final link https://www.infocif.es/ficha-empresa/014-media-sl, you will see the needed "Domicilio" information in the "informacion de la compañia" div. In this case, I would like to extract: Calle Alcala 106 Primero 28009 - Madrid
What I have tried so far: the following code from some great people on this site.
clsHTTP (class module)
Option Explicit

Private http As Object

Private Sub Class_Initialize()
    Set http = CreateObject("MSXML2.XMLHTTP")
End Sub

Public Function GetHTML(ByVal URL As String) As String
    Dim sResponse As String
    With http
        .Open "GET", URL, False
        .send
        GetHTML = StrConv(.responseBody, vbUnicode)
    End With
End Function

Public Function GetDomicilio(ByVal html As HTMLDocument) As String
    On Error GoTo errhand:
    GetDomicilio = html.querySelector("div.row p:nth-child(6)").innerText
    Exit Function
errhand:
    GetDomicilio = "Not Found"
End Function
Standard module
Option Explicit

Public Sub GetInfo()
    Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
    Dim arr() As Variant                       'company names read from column A
    Set html = New HTMLDocument
    Set http = New clsHTTP
    Const BASE_URL As String = "https://www.infocif.es/general/empresas-informacion-listado-empresas.asp?Buscar="
    Application.ScreenUpdating = False
    With ThisWorkbook.Worksheets("Hoja1")
        lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
        Select Case lastRow
            Case 1
                Exit Sub
            Case 2
                ReDim arr(1, 1): arr(1, 1) = .Range("A2").Value
            Case Else
                arr = .Range("A2:A" & lastRow).Value
        End Select
        ReDim groupResults(1 To lastRow - 1)
        With http
            For i = LBound(arr, 1) To UBound(arr, 1)
                If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                    sResponse = .GetHTML(BASE_URL & arr(i, 1))
                    html.body.innerHTML = sResponse
                    groupResults(i) = .GetDomicilio(html)
                    sResponse = vbNullString: html.body.innerHTML = vbNullString
                End If
            Next
        End With
        For i = LBound(groupResults) To UBound(groupResults)
            .Cells(i + 1, "B") = groupResults(i)
        Next
    End With
    Application.ScreenUpdating = True
End Sub
I have been trying to figure it out, and I believe it is failing for one of two reasons:
the base URL is wrong, or
I can't properly define the querySelector for "Domicilio".
What am I getting when I run the code above?
It runs without errors, but it returns "Not Found" in column B. Please assume that all the company names are listed in column A (there are about 1,800 rows), starting at A2.
You can try the following companies:
014 MEDIA SL
100M MONTADITOS INTERNACIONAL SL
649298 CASTELLON SL
ABACO ASESORES PERICIALES SL
ABADA SA
ABELLCON 3 SL
ABEX-EXCELENCIA ROBOTICA SL
ABF IBERIA HOLDING SL
ABRAFACT SL
ACCENTURE HOLDINGS IBERIA SL
ACCESORIOS DE TUBERIA SA
ACCESORIOS FRIGORIFICOS SA
ACCIONA MOBILITY SA
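A minimal sanity-check sketch against the first company above. Two loud assumptions that nothing in this post confirms: that the search URL accepts the company name with spaces percent-encoded, and that the loaded page contains a paragraph whose text includes the literal label "Domicilio" (so the code scans paragraphs for that label instead of relying on a positional nth-child selector).

'Sketch only, not a confirmed fix. Assumes (1) percent-encoded spaces in the
'search URL and (2) a <p> on the resulting page whose text contains "Domicilio".
Public Sub TestDomicilioLookup()
    Dim http As Object, html As Object, paras As Object, i As Long
    Set http = CreateObject("MSXML2.XMLHTTP")
    Set html = CreateObject("htmlfile")
    With http
        .Open "GET", "https://www.infocif.es/general/empresas-informacion-listado-empresas.asp?Buscar=" & _
                     Replace("014 MEDIA SL", " ", "%20"), False
        .send
        html.body.innerHTML = .responseText
    End With
    'Scan paragraphs for the literal label instead of trusting p:nth-child(6),
    'which breaks as soon as the layout shifts by one element.
    Set paras = html.getElementsByTagName("p")
    For i = 0 To paras.Length - 1
        If InStr(1, paras(i).innerText, "Domicilio", vbTextCompare) > 0 Then
            Debug.Print Trim$(Replace(paras(i).innerText, "Domicilio", vbNullString))
            Exit For
        End If
    Next i
End Sub

If the Debug.Print shows the address, the selector rather than the URL was the culprit; if "Domicilio" never appears in the response at all, the listing page probably needs a second request to the final ficha-empresa URL.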

Related

VBA Code to Search for Element on Website and Repeat Search

I am fairly new to VBA and web scraping, but have been reading a lot trying to learn.
I have a long list of addresses in column A of an Excel file. I would like to search each of these addresses on the website http://www.parkopedia.com. This website finds the closest parking and its cost based on a given address.
Based on the results, I would like to pull the inner text of two Span tags. This inner text represents the cost of the closest parking and the distance to the parking.
<div class="LocationListItem__searchDetails">
<div class="LocationDetailsSearchDetails">
<div class="LocationDetailsSearchDetails__detail">
<span class="LocationDetailsSearchDetails__detail__value">Free</span>
<span class="LocationDetailsSearchDetails__detail__label">2 hours</span>
</div>
<div class="LocationDetailsSearchDetails__detail">
<span class="LocationDetailsSearchDetails__detail__value">20 </span>
In this case I would like to pull "Free" and "20" into columns B and C of my spreadsheet, but it appears they both sit inside divs with the same class name of "LocationDetailsSearchDetails__detail".
I have about 1500 results to search, so if possible using XMLHTTP would be preferred.
The first address that I am searching is "10 herrmann place new york 10710".
Below is the code that I have tried so far. For right now I have the URL entered directly into the code rather than pulled from the Excel file; I will figure out how to build the correct URL once I can get the right data scraped from the site. Cell B2 is currently blank so that only the base URL is requested, with nothing appended.
The issue that I seem to be having is finding a way to get the below line to work correctly.
GetParkingData = html.getFirstSpanWithClass('LocationDetailsSearchDetails__detail__value').innerText
Module
Option Explicit

Public Sub GetInfo()
    Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
    Dim arr() As Variant                       'addresses read from column B
    Set html = New HTMLDocument
    Set http = New clsHTTP
    Const BASE_URL As String = "https://en.parkopedia.com/parking/locations/10_herrmann_place_new_york_10710_united_states_1jegdr72zr6xf2yc83/?arriving=202107131800&leaving=2021071320000"
    Application.ScreenUpdating = False
    With ThisWorkbook.Worksheets("Sheet1")
        lastRow = .Cells(.Rows.Count, "B").End(xlUp).Row
        Select Case lastRow
            Case 1
                Exit Sub
            Case 2
                ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
            Case Else
                arr = .Range("B2:B" & lastRow).Value
        End Select
        ReDim groupResults(1 To lastRow - 1)
        With http
            For i = LBound(arr, 1) To UBound(arr, 1)
                If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                    sResponse = .GetHTML(BASE_URL & arr(i, 1))
                    html.body.innerHTML = sResponse
                    groupResults(i) = .GetParkingData(html)
                    sResponse = vbNullString: html.body.innerHTML = vbNullString
                End If
            Next
        End With
        For i = LBound(groupResults) To UBound(groupResults)
            .Cells(i + 1, "C") = groupResults(i)
        Next
    End With
    Application.ScreenUpdating = True
End Sub
Class module clsHTTP
Option Explicit

Private http As Object

Private Sub Class_Initialize()
    Set http = CreateObject("MSXML2.XMLHTTP")
End Sub

Public Function GetHTML(ByVal URL As String) As String
    Dim sResponse As String
    With http
        .Open "GET", URL, False
        .send
        GetHTML = StrConv(.responseBody, vbUnicode)
    End With
End Function

Public Function GetParkingData(ByVal html As HTMLDocument) As String
    On Error GoTo errhand:
    GetParkingData = html.getFirstSpanWithClass('LocationDetailsSearchDetails__detail__value').innerText
    Exit Function
errhand:
    GetParkingData = "Not Found"
End Function
Thank you all very much! Please forgive anything that seems novice, I am trying to teach myself VBA and web scraping.
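As a hedged illustration only (not a confirmed fix): since the two span classes __value and __label are distinct, querySelectorAll on each of them sidesteps the shared parent class. The sketch below reuses the hard-coded URL from the module above and assumes the spans are present in the raw HTML response, which may not hold if Parkopedia builds that block with JavaScript.

'Sketch only: assumes the __value / __label spans exist in the raw response.
'If Parkopedia renders them client-side, XMLHTTP will never see them and a
'browser-based approach (or the site's API) would be needed instead.
Public Sub TestParkingSelectors()
    Dim http As Object, html As HTMLDocument
    Dim values As Object, labels As Object
    Set html = New HTMLDocument                'requires a reference to Microsoft HTML Object Library
    Set http = CreateObject("MSXML2.XMLHTTP")
    With http
        .Open "GET", "https://en.parkopedia.com/parking/locations/10_herrmann_place_new_york_10710_united_states_1jegdr72zr6xf2yc83/?arriving=202107131800&leaving=2021071320000", False
        .send
        html.body.innerHTML = .responseText
    End With
    Set values = html.querySelectorAll("span.LocationDetailsSearchDetails__detail__value")
    Set labels = html.querySelectorAll("span.LocationDetailsSearchDetails__detail__label")
    Debug.Print values.Length & " value spans, " & labels.Length & " label spans found"
    If values.Length > 0 Then Debug.Print "First value (column B): " & values.item(0).innerText
    If values.Length > 1 Then Debug.Print "Second value (column C): " & values.item(1).innerText
End Sub

If both node lists come back empty, the data is almost certainly injected by JavaScript, and the XMLHTTP route will not work for this site.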

Getting elements using multiple tags

I wish to pull data into Excel from the following link: https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1 for the tox summaries for the inhalation, dermal, eye, etc. routes.
The code below partly achieves this:
Public Sub GetContents3()
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1", False
    XMLReq.send
    HTMLDoc.body.innerHTML = XMLReq.responseText
    Set SubSectList = HTMLDoc.getElementById("SectionContent")
    Debug.Print SubSectList.innerText
End Sub
Although the code above works okay, there are a few issues I'd like to address.
For some reason the first section, Workers - Hazard via inhalation route, is not being pulled.
I wish to use the tags dt and dd to refine how the information should be pulled. As an example of how I would like the data to be printed:
Workers - Hazard via inhalation route
Systemic effects
Long term exposure
Hazard assessment conclusion: DNEL (Derived No Effect Level)
Value: 1.41 mg/m³
... Etc (to include all data)
Workers - Hazard via dermal route
Systemic effects
Long term exposure
Hazard assessment conclusion: DNEL (Derived No Effect Level)
Value: 2.06 mg/kg bw/day
And So on for each Route.
To try and achieve this I have the following code:
Public Sub GetContents3()
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1", False
    XMLReq.send
    HTMLDoc.body.innerHTML = XMLReq.responseText
    Set SubSectList = HTMLDoc.getElementById("SectionContent")
    Set SubSects = SubSectList.getElementsByTagName("dt")
    For Each SubSect In SubSects
        Debug.Print SubSect.innerText & " " & SubSect.NextSibling.innerText
    Next SubSect
End Sub
This is better but now it doesn't pull the subheadings.
I need some way to pull the information from multiple tags without overlapping data; however, I am unsure how to do this and my attempts so far have just caused errors.
If anyone knows how I can modify the code to achieve the desired output above that would be great.
You can specify a CSS selector pattern to match the relevant tags, then, during a loop over the returned nodes, check the tagName; if it is DT or DD you need to combine the pair into a single line for output:
Option Explicit

Public Sub GetContents()
    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
    Set http = New MSXML2.XMLHTTP60: Set html = New MSHTML.HTMLDocument
    With http
        .Open "GET", "https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1", False
        .send
        html.body.innerHTML = .responseText
    End With
    Dim nodeList As MSHTML.IHTMLDOMChildrenCollection, i As Long, r As Long, concat As String
    Set nodeList = html.querySelectorAll("#SectionContent h3,#SectionContent h4,#SectionContent h5,#SectionContent h6, #SectionContent dt, #SectionContent dd")
    For i = 1 To nodeList.Length - 1
        Select Case nodeList.Item(i).tagName
            Case "DT"
                concat = nodeList.Item(i).innerText
            Case "DD"
                concat = concat & Chr$(32) & nodeList.Item(i).innerText
                r = r + 1
                ActiveSheet.Cells(r, 1) = concat
            Case Else
                concat = vbNullString
                r = r + 1
                ActiveSheet.Cells(r, 1) = nodeList.Item(i).innerText
        End Select
    Next
End Sub
Now, I would like something tidier than the requested list, so how about the following tabulation?
Option Explicit

Public Sub GetContents()
    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
    Set http = New MSXML2.XMLHTTP60: Set html = New MSHTML.HTMLDocument
    With http
        .Open "GET", "https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1", False
        .send
        html.body.innerHTML = .responseText
    End With
    Dim nodeList As MSHTML.IHTMLDOMChildrenCollection, i As Long
    Dim c As Long, r As Long, tag As String
    Set nodeList = html.querySelectorAll("#SectionContent h3,#SectionContent h4,#SectionContent h5,#SectionContent h6, #SectionContent dt, #SectionContent dd")
    r = 1
    For i = 1 To nodeList.Length - 1
        tag = nodeList.Item(i).tagName
        Select Case tag
            Case "DT"
                c = 5
            Case "DD"
                c = 6
                r = r + 1
            Case Else
                c = Right$(tag, 1) - 2          'h3 maps to column 1, h4 to 2, h5 to 3, h6 to 4
        End Select
        ActiveSheet.Cells(IIf(tag = "DD", r - 1, r), c) = nodeList.Item(i).innerText
    Next
End Sub

URL Issue retrieving data quotes in Yahoo finance

The URL from Yahoo is not working when I try to retrieve quotes for a particular stock. There are several discussions about it; however, it seems nothing covers doing this from a VBA macro.
Sub Get_Data()
    Dim URL As String
    Dim Ticker As String
    Dim http As New WinHttpRequest
    Dim sCotes As String
    Dim Lignes
    Dim Valeurs
    Dim i As Long
    Dim j As Long
    Dim sLigne As String
    Dim sValeur As String
    Ticker = Range("Ticker")
    URL = "https://query1.finance.yahoo.com/v7/finance/download/TECK?period1=1540456339&period2=1571992339&interval=1d&events=history&crumb=kjOZLFv6ch2"
    http.Send
    sCotes = http.ResponseText
    MsgBox sCotes
    Lignes = Split(sCotes, Chr(10))
    For i = 1 To UBound(Lignes) 'until the end of the Lignes variable
        sLigne = Lignes(i)
        Valeurs = Split(sLigne, ",")
        For j = 0 To UBound(Valeurs) - 1
            Select Case j
                Case 0
                    sValeur = DateSerial(CLng(Left(Valeurs(0), 4)), CLng(Mid(Valeurs(0), 6, 2)), CLng(Right(Valeurs(0), 2)))
                Case 5
                    sValeur = CLng(Valeurs(5))
                Case Else
                    sValeur = CDbl(Replace(Valeurs(j), ".", ","))
            End Select
            Range("A1").Offset(i, j) = sValeur
            Application.StatusBar = Format(Cells(i, 1), "Short Date")
        Next
    Next
    Application.StatusBar = False
End Sub
Execution error at the http.Send step: "This method cannot be called until the Open method has been called".
You would need to call the "Open" method before attempting to send, and GET is perfectly fine. However, a few things...
There is an easier way. The headers worth adding are the User-Agent and one to mitigate being served cached results. The following shows you how to get a JSON response from the server for a specified time period and write it to Excel. Note: you need to concatenate the ticker into the URL. You should probably also test the response code from the server to ensure the request was successful.
I use jsonconverter.bas as the JSON parser to handle the response. Download the raw code from here and add it to a standard module called JsonConverter. You then need to go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime. Remove the top Attribute line from the copied code.
The values for startDate and endDate need to be passed as Unix timestamps. @TimWilliams wrote a nice function, toUnix, for converting a Date to Unix time here, which I use. I have added my own function to handle the conversion in the opposite direction.
This method avoids using any session-based identifiers, so it sidesteps your issue with the invalid cookie crumb.
VBA:
Option Explicit

Public Sub GetYahooHistoricData()
    Dim ticker As String, ws As Worksheet, url As String, s As String
    Dim startDate As Long, endDate As Long
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    ticker = ws.Range("ticker") 'Range A1. Above write out range
    endDate = toUnix("2019-10-27")
    startDate = toUnix("2018-10-25")
    url = "https://query1.finance.yahoo.com/v8/finance/chart/" & ticker & "?region=US&lang=en-US&includePrePost=false&interval=1d&period1=" & startDate & "&period2=" & endDate & "&corsDomain=finance.yahoo.com&.tsrc=finance"
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        s = .responseText
    End With
    Dim json As Object
    Set json = JsonConverter.ParseJson(s)("chart")("result")
    Dim dates As Object, results(), rows As Object, adjClose As Object, r As Long, headers()
    headers = Array("date", "close", "volume", "open", "high", "low", "adjclose")
    Set dates = json(1)("timestamp")
    ReDim results(1 To dates.Count, 1 To UBound(headers) + 1)
    Set rows = json(1)("indicators")("quote")(1)
    Set adjClose = json(1)("indicators")("adjclose")(1)("adjclose")
    For r = 1 To dates.Count
        results(r, 1) = GetDate(dates(r))
        results(r, 2) = rows("close")(r)
        results(r, 3) = rows("volume")(r)
        results(r, 4) = rows("open")(r)
        results(r, 5) = rows("high")(r)
        results(r, 6) = rows("low")(r)
        results(r, 7) = adjClose(r)
    Next
    With ws
        .Cells(3, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(4, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetDate(ByVal t As Variant) As String
    GetDate = Format$(t / 86400 + DateValue("1970-01-01"), "yyyy-mm-dd")
End Function

Public Function toUnix(ByVal dt As Variant) As Long
    toUnix = DateDiff("s", "1/1/1970", dt)
End Function
Example top 10 rows:
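A quick, hedged check of the two helper functions above (the round-trip assumes dates are treated as timezone-naive midnight, which is how DateDiff and the arithmetic in GetDate treat them):

'Round-trip check for the helpers above; the printed values are illustrative.
Public Sub TestUnixHelpers()
    Dim ts As Long
    ts = toUnix("2019-10-27")        'seconds since 1970-01-01
    Debug.Print ts                   '1572134400
    Debug.Print GetDate(ts)          '2019-10-27
End Sub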
Try replacing this code
URL = "https://query1.finance.yahoo.com/v7/finance/download/TECK?period1=1540456339&period2=1571992339&interval=1d&events=history&crumb=kjOZLFv6ch2"
http.Send
with this code:
Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
URL = "https://query1.finance.yahoo.com/v7/finance/download/TECK?period1=1540456339&period2=1571992339&interval=1d&events=history&crumb=kjOZLFv6ch2"
http.Open "POST", URL, False
http.Send
The error is pretty clear: you need to call the Open method before the Send method. Also, this would be a POST request. You may also need to put these two lines after the Open method:
http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
http.setRequestHeader "Content-Length", 0
The question is about a 99% duplicate of this one - How can I send an HTTP POST request to a server from Excel using VBA?. Anyway, the mistake is obvious: the .Send() method is being called on a request object that has never been opened, so it simply sends a completely empty Dim http As New WinHttpRequest object.
To make the code work, copy the example from the duplicated question and print the http.ResponseText:
Sub TestMe()
    Dim http As Object
    Dim url As String
    Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
    url = "https://query1.finance.yahoo.com/v7/finance/download/TECK?period1=1540456339&period2=1571992339&interval=1d&events=history&crumb=kjOZLFv6ch2"
    http.Open "POST", url, False
    http.Send
    MsgBox http.responsetext
End Sub

Web Scraping ETFs Daily Data VBA

I'm trying to web scrape some daily info for different ETFs. I found that https://www.marketwatch.com/ has accurate info.
The most relevant info is the open price, outstanding shares, NAV and total assets of the ETF.
Here is the link for IVV US Equity: https://www.marketwatch.com/investing/fund/ivv
I have web scraped with VBA before, but the HTML of the pages I had used is different; I don't know if this is because some values of the ETFs (such as price and traded volume) change constantly.
The idea is to create code to extract the relevant info and build a database to analyze macroeconomic factors, using the ETFs as market indicators of flows between countries, regions, etc.
My first approach will be with VBA, but once I get more into the data I would like to try Python (after I get more confident with it) to automate the web-scraping process on a daily basis.
I am open to any suggestion or any other website that could be useful (I have tried Yahoo Finance and Morningstar and I get the same problem with the HTML code).
This is my poor code:
Sub Get_Data()
    Dim ticker As String, enlace As String
    ticker = ThisWorkbook.Worksheets("ETFs").Cells(2, 2).Value 'IVV
    'link = "https://www.morningstar.com/etfs/arcx/" & ticker & "/quote.html"
    'link = "https://finance.yahoo.com/quote/" & ticker & "?p=" & ticker
    link = "https://www.marketwatch.com/investing/fund/" & ticker
    Application.ScreenUpdating = False
    Dim x As Integer
    x = ThisWorkbook.Worksheets("ETFs").Cells(Rows.Count, 1).End(xlUp).Row
    'Dim i As Integer
    'For i = 2 To x
    Dim total_net_assets As Variant, open_price As Variant, NAV As Variant, shares_out
    Set ie = CreateObject("InternetExplorer.application")
    With ie
        .Visible = False
        .navigate link
        While .Busy Or .readyState < 4: DoEvents: Wend
        Do
            DoEvents
            On Error Resume Next
            ' Here is where I get the problem of not knowing how to reference the values I need because the class name appears repeatedly
            total_net_assets = .document.getElementsByClassName("").Value
            open_price = .document.getElementByClassName("price").Value
            NAV = .document.getElementByClassName("").Value
            shares_out = .document.getElementByClassName("kv__value kv__primary ").Value
            On Error GoTo 0
        Loop
    End With
    ThisWorkbook.Worksheets("ETFs").Cells(2, 13).Value = total_net_assets
    ThisWorkbook.Worksheets("ETFs").Cells(2, 14).Value = NAV
    ThisWorkbook.Worksheets("ETFs").Cells(2, 15).Value = open_price
    ThisWorkbook.Worksheets("ETFs").Cells(2, 16).Value = shares_out
    ie.Quit
    'Next i
    Application.ScreenUpdating = True
End Sub
Access method:
I use XMLHTTP requests as they are much faster than opening IE.
Code notes:
The following reads fund short codes from Sheet1 column A, starting in A2, into an array. You can easily extend this by adding more funds to column A.
This array is looped over, issuing an XMLHTTP request for each fund by concatenating the fund code onto the BASE_URL constant.
I use a class, clsHTTP, to hold the XMLHTTP object to be efficient - there is no need to keep creating and destroying the object.
I give this class two methods: one to retrieve the target page's innerHTML (GetString), and the other to extract the required info if available (GetInfo). I use a dictionary to test whether the searched-for labels are present. If present, I grab the associated value; if not, the dictionary keeps a placeholder vbNullString.
I add each scraped result to a collection called results. At the end I loop over this, writing out to the sheet. Keeping most of the work in memory makes for much faster scraping.
Retrieving info from HTML:
The labels (e.g. Open) and values come in pairs.
You can generate a nodeList (think of a collection, as with getElementsByClassName) by using the querySelectorAll method to apply a CSS class selector that gathers the label elements by their class name kv__label. The "." is the class selector.
Set labels = .querySelectorAll(".kv__label") '<== nodeList of labels
You do the same to get the associated values:
Set values = .querySelectorAll(".kv__value.kv__primary") '<== nodeList of associated values. Same length as labels nodeList so can use same index to retrieve associated label/value pairs from each nodeList.
In the clsHTTP method .GetInfo you loop over the labels, using the dictionary to see whether the searched-for labels are present. If a label is found, the associated value is retrieved from values using the same index at which the label was found in labels, and the dictionary's vbNullString placeholder for that label is updated with the retrieved value; otherwise it is left as vbNullString.
Sample results:
VBA:
Class module clsHTTP:
Option Explicit

Private http As Object

Private Sub Class_Initialize()
    Set http = CreateObject("MSXML2.XMLHTTP")
End Sub

Public Function GetString(ByVal url As String) As String
    Dim sResponse As String
    With http
        .Open "GET", url, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
        GetString = sResponse
    End With
End Function

Public Function GetInfo(ByVal html As HTMLDocument) As Object
    Dim dict As Object, i As Long
    Set dict = CreateObject("Scripting.Dictionary")
    dict.Add "Open", vbNullString
    dict.Add "Shares Outstanding", vbNullString
    dict.Add "Total Net Assets", vbNullString
    dict.Add "NAV", vbNullString
    Dim values As Object, labels As Object
    With html
        Set values = .querySelectorAll(".kv__value.kv__primary")
        Set labels = .querySelectorAll(".kv__label")
        For i = 0 To labels.Length - 1
            If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
        Next
    End With
    Set GetInfo = dict
End Function
Standard module 1:
Option Explicit

Public Sub GetFundInfo()
    Dim sResponse As String, html As HTMLDocument, http As clsHTTP, i As Long
    Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
    Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
    Application.ScreenUpdating = False
    headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
    Set results = New Collection
    Set http = New clsHTTP
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
    For i = LBound(funds) To UBound(funds)
        If Not IsEmpty(funds(i)) Then
            url = BASE_URL & funds(i)
            html.body.innerHTML = http.GetString(url)
            results.Add http.GetInfo(html).Items
        End If
    Next
    If results.Count > 0 Then
        Dim item As Variant, r As Long, c As Long
        r = 2: c = 2
        With ws
            .Cells(1, c).Resize(1, UBound(headers) + 1) = headers
            For Each item In results
                .Cells(r, c).Resize(1, UBound(item) + 1) = item
                r = r + 1
            Next
        End With
    End If
    Application.ScreenUpdating = True
End Sub
Set-up: the code above uses early binding for HTMLDocument, so it needs a reference to Microsoft HTML Object Library (VBE > Tools > References).
Without using a class:
Option Explicit

Public Sub GetFundInfo()
    Dim sResponse As String, html As HTMLDocument, i As Long
    Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
    Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
    Application.ScreenUpdating = False
    headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
    Set results = New Collection
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
    For i = LBound(funds) To UBound(funds)
        If Not IsEmpty(funds(i)) Then
            url = BASE_URL & funds(i)
            html.body.innerHTML = GetString(url)
            results.Add GetInfo(html).Items
        End If
    Next
    If results.Count > 0 Then
        Dim item As Variant, r As Long, c As Long
        r = 2: c = 2
        With ws
            .Cells(1, c).Resize(1, UBound(headers) + 1) = headers
            For Each item In results
                .Cells(r, c).Resize(1, UBound(item) + 1) = item
                r = r + 1
            Next
        End With
    End If
    Application.ScreenUpdating = True
End Sub

Public Function GetString(ByVal url As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    Dim sResponse As String
    With http
        .Open "GET", url, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
        GetString = sResponse
    End With
End Function

Public Function GetInfo(ByVal html As HTMLDocument) As Object
    Dim dict As Object, i As Long
    Set dict = CreateObject("Scripting.Dictionary")
    dict.Add "Open", vbNullString
    dict.Add "Shares Outstanding", vbNullString
    dict.Add "Total Net Assets", vbNullString
    dict.Add "NAV", vbNullString
    Dim values As Object, labels As Object
    With html
        Set values = .querySelectorAll(".kv__value.kv__primary")
        Set labels = .querySelectorAll(".kv__label")
        For i = 0 To labels.Length - 1
            If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
        Next
    End With
    Set GetInfo = dict
End Function
Okay, so you will need to create two loops. You can just keep reusing the elem0, elem1, and elemColl(1) variables for each price point you need - just make sure to reset bFoundIt to False for each new iteration so you do not exit the For Loops early.
For your total_net_assets variable, you will first loop over the elements with the class kv__item. You will then loop over the kv__label elements within each kv__item and stop when you match the innerText "Total Net Assets". Once you match this, you use the outer element elem0 to get the element with the kv__value kv__primary class name for it.
Dim IE As Object, elem0 As Object, elem1 As Object, i As Long, bFoundIt As Boolean
Set IE = CreateObject("InternetExplorer.application")
With IE
    .Visible = False
    .navigate link
    While .Busy Or .readyState < 4: DoEvents: Wend
    DoEvents
    bFoundIt = False
    For Each elem0 In .document.getElementsByClassName("kv__item")
        For Each elem1 In elem0.getElementsByClassName("kv__label")
            If elem1.innerText = "Total Net Assets" Then
                bFoundIt = True
                total_net_assets = elem0.getElementsByClassName("kv__value kv__primary ")(0).innerText
                Exit For
            End If
        Next elem1
        If bFoundIt Then Exit For
    Next elem0

web scraping using excel and VBA

I wrote my VBA code in an Excel sheet as below, but it does not scrape the data for me and I don't know why; please, can anyone help me? It only gives me "click here to read more" as the result. I want to scrape the entire data set, such as first name, last name, state, zip code and so on.
Sub extractTablesData()
    Dim IE As Object, obj As Object
    Dim myState As String
    Dim r As Integer, c As Integer, t As Integer
    Dim elemCollection As Object
    Set IE = CreateObject("InternetExplorer.Application")
    myState = InputBox("Enter the city where you wish to work")
    With IE
        .Visible = True
        .navigate ("http://www.funeralhomes.com/go/listing/Search?name=&city=&state=&country=USA&zip=&radius=")
        While IE.readyState <> 4
            DoEvents
        Wend
        For Each obj In IE.document.all.item("state").Options
            If obj.innerText = myState Then
                obj.Selected = True
            End If
        Next obj
        IE.document.getElementsByValue("Search").item.Click
        Do While IE.Busy: DoEvents: Loop
        ThisWorkbook.Sheets("Sheet1").Range("A1:K1500").ClearContents
        Set elemCollection = IE.document.getElementsByTagName("TABLE")
        For t = 0 To (elemCollection.Length - 1)
            For r = 0 To (elemCollection(t).Rows.Length - 1)
                For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
                    ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
                Next c
            Next r
        Next t
    End With
    Set IE = Nothing
End Sub
Using the same URL as the answer already given, you could alternatively select with CSS selectors to get the elements of interest, and use Split to get just the names and address parts from the text. We can also do away with the browser altogether, for much faster results from the first results page.
Business name:
You can get the name with the following selector (using paid listing example):
div.paid-listing .listing-title
This selects (sample view)
Address info:
The associated descriptive information can be retrieved with the selector:
div.paid-listing .address-summary
And then, using Split, we can parse this into just the address information.
Code:
Option Explicit

Public Sub GetTitleAndAddress()
    Dim oHtml As HTMLDocument, nodeList1 As Object, nodeList2 As Object, i As Long
    Const URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
    Set oHtml = New HTMLDocument
    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", URL, False
        .send
        oHtml.body.innerHTML = .responseText
    End With
    Set nodeList1 = oHtml.querySelectorAll("div.paid-listing .listing-title")
    Set nodeList2 = oHtml.querySelectorAll("div.paid-listing .address-summary")
    With Worksheets("Sheet3")
        .UsedRange.ClearContents
        For i = 0 To nodeList1.Length - 1
            .Range("A" & i + 1) = nodeList1.Item(i).innerText
            .Range("B" & i + 1) = Split(nodeList2.Item(i).innerText, Chr$(10))(0)
        Next i
    End With
End Sub
Example output:
Yeah, without an API, this can be very tricky at best, and very inconsistent at worst. For now, you can try the script below.
Sub DumpData()
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    URL = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
    'Wait for site to fully load
    IE.Navigate2 URL
    Do While IE.Busy = True
        DoEvents
    Loop
    RowCount = 1
    With Sheets("Sheet1")
        .Cells.ClearContents
        RowCount = 1
        For Each itm In IE.document.all
            If itm.classname Like "*free-listing*" Or itm.classname Like "*paid-listing*" Then
                .Range("A" & RowCount) = itm.classname
                .Range("B" & RowCount) = Left(itm.innertext, 1024)
                RowCount = RowCount + 1
            End If
        Next itm
    End With
End Sub
You probably want some kind of input box to capture the city, state and radius from the user, or capture those variables in cells in your worksheet; a rough sketch follows below.
Notice, the '%20' is a space character.
I got this idea from a friend of mine, Joel, a long time ago. That guy is great!
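As a rough illustration of that input-box idea only: the prompts are illustrative, it assumes the listing URL keeps the ShowListing/USA/<state>/<city> pattern used above, and the radius is left out because that pattern has no radius parameter.

'Sketch: build the listing URL from user input, assuming the URL pattern
'above (ShowListing/USA/<state>/<city>, with %20 standing in for spaces).
Public Sub BuildListingUrl()
    Dim city As String, state As String, url As String
    city = InputBox("Enter the city where you wish to work")
    state = InputBox("Enter the state")
    url = "http://www.funeralhomes.com/go/listing/ShowListing/USA/" & _
          Replace(state, " ", "%20") & "/" & Replace(city, " ", "%20")
    Debug.Print url   'feed this into the WinHTTP or IE code above
End Sub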
