Excel VBA source code for extracting data from a URL - excel

I want to extract the title of every news item displayed on "http://pib.nic.in/newsite/erelease.aspx?relid=58313" website using Excel VBA. I have written a code using getelementsbyclassname("contentdiv"). But the debugger is showing a error pertaining to that the object doesn't support...I want to extract the information items of every relid..which is there in the URL as well...

Cold scrapes like this are generally handled more efficiently with a XMLHTTP pull. This requires the addition of a few libraries to the VBE's Tools ► References. The code below needs Microsoft XML, v6.0, Microsoft HTML Object library and Microsoft Internet Controls. Might not need the last one but you probably will if you expand the code beyond what is supplied.
Public Const csURL As String = "http://pib.nic.in/newsite/erelease.aspx?relid=×ID×"
Sub scrape_PIBNIC()
Dim htmlBDY As HTMLDocument, xmlHTTP As MSXML2.ServerXMLHTTP60
Dim i As Long, u As String, iDIV As Long
On Error GoTo CleanUp
Set xmlHTTP = New MSXML2.ServerXMLHTTP60
Set htmlBDY = New HTMLDocument
For i = 58313 To 58313
htmlBDY.body.innerHTML = vbNullString
With xmlHTTP
u = Replace(csURL, "×ID×", i)
'Debug.Print u
.Open "GET", u, False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send
If .Status <> 200 Then GoTo CleanUp
htmlBDY.body.innerHTML = .responseText
For iDIV = 0 To (htmlBDY.getElementsByClassName("contentdiv").Length - 1)
If CBool(htmlBDY.getElementsByClassName("contentdiv")(iDIV).getElementsByTagName("span").Length) Then
Sheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0) = _
htmlBDY.getElementsByClassName("contentdiv")(iDIV).getElementsByTagName("span")(0).innerText
End If
Next iDIV
End With
Next i
CleanUp:
Set htmlBDY = Nothing
Set xmlHTTP = Nothing
End Sub
That should be enough to get you started. The site you are targeting requires that charset=UTF-8 be added to the request. I had no success without it. I strongly suspect that this may have been the source of your object doesn't support error.

Related

Data values of uploaded images

There are a large number of classes ("df-table") on that link homepage. The first class was able to be entered into the Excel sheet, but not the second class ("df-table"), the third class ("df-table"), etc.
Various internet resources say it can be solved by using nth-of-type or using xpath, but the error keeps occurring.
I want the data values of the uploaded images.
Public Sub Jaemu()
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
Dim d As WebDriver, ws As Worksheet, URL As String
Set d = New ChromeDriver
Set ws = ThisWorkbook.Worksheets("gemstone2")
Dim http As New WinHttpRequest
With d
'.AddArgument "--headless"
.Start "Chrome"
Dim html As HTMLDocument
Dim JsonObject As Object
Set html = New HTMLDocument
URL = "https://globalmonitor.einfomax.co.kr/infomax_ds.html#/USA/1/1"
.get URL, Raise:=False ' set raise to false to avoid a timeout error
d.FindElementByCss("[ng-click='openStockSearchPopup();']").Click
d.FindElementByCss("[ng-enter='searchStockSearchPopup(true);']").SendKeys "GOOGL"
d.FindElementByCss("[ng-click='searchStockSearchPopup(true);']").Click
d.FindElementByCss("[class='slick-cell l1 r1 text-center clickable']").Click
Cells(2, 1).Value = d.FindElementByCss("[class='df-table']").Text
Cells(3, 1).Value = d.FindElementByCss(".table-contents[ng-if='IS_RT_STATE_SUCCESS(requeststate.prospectData)'] > .df-table").Text
End With
End Sub
Original OP error:
runtime error 32.
Coding line.
Cells(3, 1).Value = d.FindElementByCss("[class='df-table' class:nth-of-type(2)]").Text
New error following initial suggestion to use different CSS selector:
Runtime error 7
Coding line.
Cells(3, 1).Value = d.FindElementByCss(".table-contents[ng-if='IS_RT_STATE_SUCCESS(requeststate.prospectData)'] > .df-table").Text
Initial error (RTE32):
The :nth-of-type() pseudo class selector would go outside of the attribute selector closing ] i.e. "[class='df-table']:nth-of-type(2)", however this does not provide a match.
You can use:
.table-contents[ng-if='IS_RT_STATE_SUCCESS(requeststate.prospectData)'] > .df-table
This returns a single match and is specific enough, and not reliant on a long selector list, that it is likely fairly stable. Based on experience.
If you had wanted to use :nth-of-type() you could have used it on a parent element and taken first match, but that is less stable and prone to breaking if html changes:
.contents-area:nth-of-type(5) .df-table
Follow-up error (RTE7):
The element needs to be scrolled into view.
A not ideal way is (your mileage may vary):
d.ExecuteScript "window.scrollTo(0, document.body.scrollHeight/3);"
Application.Wait Now + TimeSerial(0, 0, 2)
Activesheet.Cells(1,3) = d.FindElementByCss(".table-contents[ng-if='IS_RT_STATE_SUCCESS(requeststate.prospectData)'] > .df-table").Text
There are better ways with inbuilt functions and custom JS which you can likely pull from the examples.xlsm by the author on GitHub. As I can only test with Python this was a quick and dirty test that worked.
You could avoid overhead of browser and use XHR to get a JSON response to then use a JSON parser with, as follows:
Option Explicit
Public Sub GetData()
Dim s As String
With CreateObject("MSXML2.XMLHTTP")
.Open "POST", "https://globalmonitor.einfomax.co.kr/facset/getKeyData", False
.SetRequestHeader "User-Agent", "Mozilla/5.0"
.SetRequestHeader "content-type", "application/json;charset=UTF-8"
.send "{""param"":""NAS:AAPL""}"
s = .responsetext
End With
Debug.Print s
End Sub

Scraping current date from website using Excel VBA

Error
Librarys
I need the date of the current day. I do not want to place it inside a variable to be able to have it work, instead I would like that variable to be Date or in its default String.
Sub WEB()
Dim IE As Object
Dim allelements As Object
Application.ScreenUpdating = False
Set IE = CreateObject("InternetExplorer.Application")
IE.navigate "http://www.fechadehoy.com/venezuela"
Do Until IE.ReadyState = 4
DoEvents
Loop
Application.Wait (Now + TimeValue("0:00:01"))
IE.document.getElementById ("date")
IE.Visible = True
Set IE = Nothing
Application.ScreenUpdating = True
End Sub
The website is http://www.fechadehoy.com/venezuela
I only need the date of this page. I am not interested in any other element of the macro.
I just need to extract the current date and get it in a variable.
if you need Lunes, 19 de agosto de 2019 then use getElementById for fecha
Debug.Print IE.document.getElementById("fecha").Innerhtml
Why go for IE when xhr can do the trick? You can get the date with the blink of an eye if you opt for XMLHttpRequest.
Sub GetCurrentDate()
Dim S$
With New XMLHTTP
.Open "GET", "http://www.fechadehoy.com/venezuela", False
.send
S = .responseText
End With
With New HTMLDocument
.body.innerHTML = S
MsgBox .getElementById("fecha").innerText
End With
End Sub
Reference to add:
Microsoft XML, v6.0
Microsoft HTML Object Library
To get rid of that reference altogether:
Sub GetCurrentDate()
Dim S$
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://www.fechadehoy.com/venezuela", False
.send
S = .responseText
End With
With CreateObject("htmlfile")
.body.innerHTML = S
MsgBox .getElementById("fecha").innerText
End With
End Sub
Although the answer given by #Siddharth Rout is perfectly fine, it would require quite a bit of string manipulation to get the date in a usable form.
For the above reason I'm providing an alternative solution which gets the date in a directly usable format, ready to be manipulated and used in further calculations if necessary.
As a bonus I am demonstrating how to get the date using an HTTP request instead of using the Internet Explorer. This makes the code more efficient.
Option Explicit
Sub getDate()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim el As HTMLParaElement
Dim key As String
Dim url As String
Dim retrievedDate As Date
url = "http://www.fechadehoy.com/venezuela"
key = "Fecha actual: "
''''''''''Bonus: Use an HTTP request to get the date instead of opening IE'''''''''''
With req '
.Open "GET", url, False '
.send '
doc.body.innerHTML = .responseText '
'Debug.Print .responseText '
End With '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
For Each el In doc.getElementsByTagName("p")
If el.innerText Like "Fecha actual*" Then
retrievedDate = Mid(el.innerText, InStr(el.innerText, key) + Len(key), Len(el.innerText))
End If
Next el
End Sub
You will need to add a reference to Microsoft HTML Object Libraryand Microsoft WinHTTP Services version 5.1. To do that, go to VB editor>Tools>References.
Having the date in this format, means it can easily be manipulated. An example would be the use of functions like day(retrievedDate) , month(retrievedDate), year(retrievedDate) etc.

VBA scrape HTML from URL with JavaScript elements

Using the following code
Sub Test()
'Must have the Microsoft HTML Object Library reference enabled
Dim oHtml As HTMLDocument
Dim oElement As Object
Dim link As String
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://www.afklcargo.com/WW/en/local/app/index.jsp#/tntdetails/074-26126063", False
.Send
oHtml.Body.innerHTML = .responseText
End With
End Sub
I am unable to get the actual HTML, I believe it's because the website is using Javascript? How can I circumvent this so I can get my value?
It is dynamically added via another xhr call which you can find in the network tab of browser. So change your url to that and then use a json parser to parse the response.
Use a json parser, such as jsonconverter.bas to handle the response. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References > Add a reference to Microsoft Scripting Runtime.
You extract your value from the json which is returned as unix timestamp.
1561791600000 > 2019-06-29T07:00:00.000Z
The json response actually has all the info regarding the tracking. You can explore it here.
Option Explicit
Public Sub Test()
Dim json As Object
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://www.afklcargo.com/resources/tnt/singleAwbDetails?awbId=074-26126063", False
.send
Set json = JsonConverter.ParseJson(.responseText)
Debug.Print json("booking")(1)("arrivalDateLong")("local")
End With
End Sub
Two functions, by Schmidt and Navion for doing the conversion to datetime from stamp copied directly from here.
Function Epoch2Date(ByVal E As Currency, Optional msFrac) As Date
Const Estart As Double = #1/1/1970#
msFrac = 0
If E > 10000000000# Then E = E * 0.001: msFrac = E - Int(E)
Epoch2Date = Estart + (E - msFrac) / 86400
End Function
Function Epoch2DateS(ByVal epochstamp$) As Date
Epoch2DateS = CVDate(CDbl(DateSerial(1970, 1, 1)) + Int(Val(epochstamp$) / 1000#) / 86400)
End Function
N.B.
Using json parser is the reliable and recommended way, but you can also regex or split function to get the value.

VBA : how to connect MSXML2.XMLHTTP60 response to IHTMLDocument(iframe)

I am trying to use MSXML2 and IHTMLDocument
to deal with iframe part of HTML web page.
I want to use MSXML2 and save it to better "capture" the data, thinking it's faster than just using InternetExplorer or VBA selenium reference supported by VBA menu.
(I don't want to avoid using IE or selenium as much as possible)
But I couldn't find out how to save document as XML format(to take advantage of its speed) and at the same time click on the element in the document without the help of browser(ie or selenium).
And even after clicking some tab(id="cns_Tab21") on this web page, I have difficulty retrieving data.
So my question is..
1> Is it possible to minimize the use of browser for clicking?
2> Even after clicking(using Selenium), it throws an xpath related error in VBA editor.
Thank you for your answer in advance and the URL used for this is
http://bitly.kr/finance
and the iframe inside the link is http://bitly.kr/LT0aCb
'I declared objects
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim iframeDoc As IHTMLDocument
'and saved XML data to HTML format
HTMLDoc.body.innerHTML = XMLReq.responseText
'and trying to save this HTML to iframe...
Set iframeDoc = HTMLDoc.getElementById("coinfo_cp")
'I tried .contentDocument but it maybe HTMLdoc doesn't have this property.
and I don't know how to access information I saved to iframeDoc above.
'And after I use Selenium I can't figure out why it throw an error
For Each ele In selenium.FindElementsByTag("th")
If ele.Attribute("innerText") = "CAPEX" Then
Debug.Print ele.FindElementsByXPath("./../td").Attribute("innerText")
This post isn't a duplicate since I am trying to use XML to handle iframe element and without InternetExplorer reference in VBA Excel.(ie.document)
You can make replicate the xhr request the page makes when that tab (not iframe) is selected. I use clipboard to copy table to Excel. Note: url I am using is from our discussions. This info should be reflected in question.
Option Explicit
Public Sub GetTable()
'VBE > Tools > References > Microsoft HTML Object Library
Dim html As HTMLDocument, hTable As HTMLTable, clipboard As Object
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://navercomp.wisereport.co.kr/v2/company/ajax/cF1001.aspx?cmp_cd=005930&fin_typ=0&freq_typ=Y&encparam=ZXR1cWFjeGJnS1lWOHhCYmNScmJXUT09&id=bG05RlB6cn", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".hbG05RlB6cn + .gHead01")
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}") ' New DataObject
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).PasteSpecial
End Sub
You can find the params of the ajax url for the tab content update in the scripts of the page
Along with the target for the update:
This needs tidying up:
Option Explicit
Public Sub GetTable()
'https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930
'VBE > Tools > References > Microsoft HTML Object Library
Dim html As HTMLDocument, hTable As HTMLTable, clipboard As Object, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://navercomp.wisereport.co.kr/v2/company/ajax/cF1001.aspx?cmp_cd=005930&fin_typ=0&freq_typ=Y&encparam=ZXR1cWFjeGJnS1lWOHhCYmNScmJXUT09&id=bG05RlB6cn", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".hbG05RlB6cn + .gHead01") '2nd tab. CAPEX row
Dim html2 As HTMLDocument, i As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = hTable.outerHTML
Dim tableBodyRows As Object, tableBodyRowLength As Long, tableHeaderRowLength As Long, tableHeaderRows As Object, targetRow As Long
Set tableBodyRows = html2.querySelectorAll("tbody tr .bg")
tableBodyRowLength = tableBodyRows.Length
tableHeaderRowLength = html2.querySelectorAll("thead tr").Length + 2
For i = 0 To tableBodyRowLength - 1
If Trim$(tableBodyRows.item(i).innerText) = "CAPEX" Then
targetRow = i + tableHeaderRowLength + 1
Exit For
End If
Next
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}") ' New DataObject
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ws.Cells(1, 1).PasteSpecial
Dim unionRng As Range
For i = (tableHeaderRowLength + 1) To (tableBodyRowLength + tableHeaderRowLength)
If i <> targetRow Then
If Not unionRng Is Nothing Then
Set unionRng = Union(ws.rows(i), unionRng)
Else
Set unionRng = ws.rows(i)
End If
End If
Next
If Not unionRng Is Nothing Then unionRng.Delete
End Sub

Excel Web Query Object and Cookies: Is there a better way?

I have a HTML web page at work that I want to query data from tables into excel 2007. This web page requires I sign on with a password. I sign in with my normal IE7 browser, then I go to DATA -> connections -> my connections and edit the query. This reads the IE7 cookie cache and I re-POST the data to connect to the server's security by clicking "retry" when it says "the web query returned no data". After I do this, the data imports fine.
I can do this just fine and it only needs to be done once a day. Other users of my application find this difficult which leads to my question:
Is there a way to automatically POST this data back with VB? I'm thinking maybe I should use the cookie property of the IE.Document.cookie?
I'm calling the following login script, before I continue with the web query (set reference to XML library). Look around to find some instructions how you can find your POST parameters.
Sub XMLHttpLogin()
Dim i As Integer
Dim sExpr As String
Dim sPar As String, sURL as String
Dim sResp As String
Dim XMLHttp As MSXML2.XMLHTTP60
Set XMLHttp = New MSXML2.XMLHTTP60
sPar = "name=user1&pass=pass1&form_id=form1" 'The parameters to send.
sURL = "http://www.stackoverflow.com"
With XMLHttp
.Open "POST", sURL, True 'Needs asynchronous connection
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.send (sPar)
i = 0 'wait until data has been downloaded
Do While i = 0
If .readyState = 4 Then
If .Status = 200 Then Exit Do
End If
DoEvents
Loop
sResp = .responseText 'contains source code of website
sExpr = "not-logged-in" 'look for this string in source code
If InStr(1, sResp, sExpr, vbTextCompare) Then
MsgBox "Not logged in. Error in XMLHttpLogin"
End If
End With
End Sub

Resources