I have created an Excel macro to run an analysis for a larger dataset (~24,000 lines). The macro is working well for the first c. 2,000 requests/lines but thereafter generally crashes. The request/line number, at which the crash occurs, thereby varies.
The debugger, which is appearing at that point, highlights that the code line leading to the crash is:
Cells(i, 7).Value = var1.innerText
I have already checked other threads and solutions on this topic, including a suggested change from 'Int' to 'Long' etc - none of these worked however. Can anyone help why the macro is working well for smaller requests but failing after a certain number of lines?
The code I am using is pasted below. Many thanks in advance.
Sub Gethits()
Dim url As String, lastRow As Long
Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object
Dim start_time As Date
Dim end_time As Date
Dim var As String
Dim var1 As Object
lastRow = Range("A" & Rows.Count).End(xlUp).Row
Dim cookie As String
Dim result_cookie As String
start_time = Time
Debug.Print "start_time:" & start_time
For i = 1654 To lastRow
url = "https://www.google.com/search?q=" & Cells(i, 4) & "&source=lnt&tbs=cdr%3A1%2Ccd_min%3A" & Cells(i, 5) & "%2Ccd_max%3A" & Cells(i, 6) & "&tbm=nws"
Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
XMLHTTP.Open "GET", url, False
XMLHTTP.setRequestHeader "Content-Type", "text/xml"
XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
XMLHTTP.send
Set html = CreateObject("htmlfile")
html.body.innerHTML = XMLHTTP.responseText
Set objResultDiv = html.getElementById("rso")
Set var1 = html.getElementById("resultStats")
Cells(i, 7).Value = var1.innerText
DoEvents
Next
end_time = Time
Debug.Print "end_time:" & end_time
Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub
Thanks for your feedback. The restriction indeed seems to be driven by Google limiting the number of requests in a given time period. When visiting the Google webpage after the error occurs, I am asked to run trough a captcha process due to unsually high request activity.
The idea of submitting multiple requests at a time seems very interesting in any case, as it would save a lot of time collecting the data. Is there, however, a simplified approach to the example by TinMan (codereview.stackexchange.com/a/196922/171419)? I have checked for other threads but have not found anything helpful thus far.
Related
I am trying to connect 2 modules in vba such that the output of the first module (geturl) feeds into the other (getdata).
Get Url to look up the dossier URL online for substances entered in column A e.g. Acetone or alternatively the CAS number in column B can be used (see image below). Note: currently only looks up for substance info in A1 or B1.
Public Function GetUrl() As String
Const Url = "https://echa.europa.eu/information-on-chemicals/registered-substances?p_p_id=dissregisteredsubstances_WAR_dissregsubsportlet&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_dissregisteredsubstances_WAR_dissregsubsportlet_javax.portlet.action=dissRegisteredSubstancesAction"
Set oHtml = New HTMLDocument
Set oHttp = CreateObject("MSXML2.XMLHTTP")
Set MyDict = CreateObject("Scripting.Dictionary")
SubstanceName = Cells(1, 1)
CASNumber = Cells(1, 2)
MyDict("_dissregisteredsubstances_WAR_dissregsubsportlet_disreg_name") = SubstanceName
MyDict("_dissregisteredsubstances_WAR_dissregsubsportlet_disreg_cas-number") = CASNumber
MyDict("_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer") = "true"
MyDict("_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox") = "on"
payload = vbNullString
For Each DictKey In MyDict
payload = IIf(Len(DictKey) = 0, WorksheetFunction.EncodeURL(DictKey) & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey)), _
payload & "&" & WorksheetFunction.EncodeURL(DictKey) & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey)))
Next DictKey
With oHttp
.Open "POST", Url, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
.send (payload)
oHtml.body.innerHTML = .responseText
End With
GetUrl = oHtml.querySelector(".details").getAttribute("href")
Debug.Print oHtml.querySelector(".substanceNameLink ").innerText
Debug.Print oHtml.querySelector(".details").getAttribute("href")
End Function
If run this should return
Acetone https://echa.europa.eu/registration-dossier/-/registered-dossier/15460
Get Data uses the Url from geturl to return "DNEL" values:
Sub GetData()
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim Route(1 To 3) As String
Route(1) = "sGeneralPopulationHazardViaInhalationRoute"
Route(2) = "sGeneralPopulationHazardViaDermalRoute"
Route(3) = "sGeneralPopulationHazardViaOralRoute"
XMLReq.Open "Get", GetUrl & "/7/1", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
'Loops through each element
For c = 1 To UBound(Route, 1)
Set Info = HTMLDoc.getElementById(Route(c))
Debug.Print Info.innerText
Set Info = HTMLDoc.getElementById(Route(c)).NextSibling.NextSibling.NextSibling
Set Data = Info.getElementsByTagName("dd")(0)
Debug.Print Data.innerText
Set Data = Info.getElementsByTagName("dd")(1)
Debug.Print Data.innerText
'Cells(r, c + 2) = Data.innerText
Next c
End Sub
For Acetone in Cell(1,1) This should Return:
Acetone
https://echa.europa.eu/registration-dossier/-/registered-dossier/15460
General Population - Hazard via inhalation route
DNEL (Derived No Effect Level)
200 mg/m³
General Population - Hazard via dermal route
DNEL (Derived No Effect Level)
62 mg/kg bw/day
General Population - Hazard via oral route
DNEL (Derived No Effect Level)
62 mg/kg bw/day
Instead of just relying on Cell A1 however, I wish to have the entire code loop for each cell with a substance in columnA/ColumnB. So in this case the URL for Acetone is found and the corresponding data is then pulled then the same occurs for Oxydipropanol.
Note in this image Substances can be looked up online using either the substance name, CAS number in columnB, or a combination of both.
Trying to connect the two modules, zo far I have only been able to get the geturl module to cycle through for each substance. I have also tried to combine both into 1 module but cant figure out how to correctly nest the for loops.
A quick google search states that you cant nest functions in vba. This makes me wonder if what I'm doing is even the right way to approach this. But I've seen similar things achieved In the past so I'm sure it's possible.
Note: If testing please use the example substances for testing. Using a random chemical say Benzene may result in an error as the tox profile for this substance doesn't exist. I still need to implement handling errors but this can be ignored for now.
I Will update you here with any further progress made, Thanks.
This worked for me:
Sub PopulateExposures()
Dim url, rw As Range
Set rw = Sheets("data").Range("A1:E1") 'first row with inputs
Do While Application.CountA(rw) > 0
url = SubstanceUrl(rw.Cells(1).Value, rw.Cells(2).Value) 'get the URL
rw.Cells(3).Resize(1, 3).Value = ExposureData(url) 'get exposure data (as array) and add to row
Set rw = rw.Offset(1, 0) 'next substance
Loop
End Sub
Public Function SubstanceUrl(SubstanceName, CASNumber) As String
Const url = "https://echa.europa.eu/information-on-chemicals/registered-substances?" & _
"p_p_id=dissregisteredsubstances_WAR_dissregsubsportlet&p_p_lifecycle=1&" & _
"p_p_state=normal&p_p_mode=view&" & _
"__dissregisteredsubstances_WAR_dissregsubsportlet_javax.portlet.action=dissRegisteredSubstancesAction"
Dim oHTML, oHttp, MyDict, payload, DictKey, sep
Set oHTML = New HTMLDocument
Set oHttp = CreateObject("MSXML2.XMLHTTP")
Set MyDict = CreateObject("Scripting.Dictionary")
MyDict("_dissregisteredsubstances_WAR_dissregsubsportlet_disreg_name") = SubstanceName
MyDict("_dissregisteredsubstances_WAR_dissregsubsportlet_disreg_cas-number") = CASNumber
MyDict("_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer") = "true"
MyDict("_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox") = "on"
payload = ""
For Each DictKey In MyDict
payload = payload & sep & DictKey & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey))
sep = "&"
Next DictKey
With oHttp
.Open "POST", url, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
.send payload
oHTML.body.innerHTML = .responseText
End With
SubstanceUrl = oHTML.querySelector(".details").getAttribute("href")
End Function
Function ExposureData(urlToGet)
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As HTMLDocument, dds
Dim Route(1 To 3) As String, Results(1 To 3) As String, c, Info, Data
Route(1) = "sGeneralPopulationHazardViaInhalationRoute"
Route(2) = "sGeneralPopulationHazardViaDermalRoute"
Route(3) = "sGeneralPopulationHazardViaOralRoute"
XMLReq.Open "Get", urlToGet & "/7/1", False
XMLReq.send
If XMLReq.Status <> 200 Then
Results(1) = "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Else
Set HTMLDoc = New HTMLDocument
HTMLDoc.body.innerHTML = XMLReq.responseText
For c = 1 To UBound(Route, 1)
Set Info = HTMLDoc.getElementById(Route(c))
If Not Info Is Nothing Then
Set Info = Info.NextSibling.NextSibling.NextSibling
Set dds = Info.getElementsByTagName("dd")
If dds.Length > 1 Then
Results(c) = dds(1).innerText
Else
Results(c) = "hazard unknown"
End If
Else
Results(c) = "no info"
End If
Next c
End If
ExposureData = Results
End Function
I am using the following code in Excel to calculate the travel distance between various places. In this calculation it calculates the "driving" directions.
I am however also looking for the public transport directions. I've been puzzling how to change the code to change the travel mode (https://developers.google.com/maps/documentation/directions/overview#TravelModes), but I cant get it to work. Does anybody have an idea how to incorporate this?
Function TRAVELTIME(origin, destination, apikey)
Dim strUrl As String
strUrl = "https://maps.googleapis.com/maps/api/directions/json?origin=" & origin & "&destination=" & destination & "&key=" & apikey
Set httpReq = CreateObject("MSXML2.XMLHTTP")
With httpReq
.Open "GET", strUrl, False
.Send
End With
Dim response As String
response = httpReq.ResponseText
Dim parsed As Dictionary
Set parsed = JsonConverter.ParseJson(response)
Dim seconds As Integer
Dim leg As Dictionary
For Each leg In parsed("routes")(1)("legs")
seconds = seconds + leg("duration")("value")
Next leg
TRAVELTIME = seconds
EDIT:
I have tested the response text and I found out the reason behind the error is that the request gets redirected to the "I'm not a robot" page due to high number of search requests sent by excel. Is there anything that can be done to prevent this from happening?
Original:
I have this code (found it online and modified it) that works fine when i execute it few times, it then creates
Run-time error " 91 - Object Variable or with block variable not set"
Then after I wait for about an hour it works fine without me changing anything, then it creates the error again. and so on...
The code was originally to search Google for a list of strings and return how many results. I modified it to search for the exact string and just return if there are results or not.
Here is the code:
Sub SearchHits()
Dim url, name As String
Dim i, lastRow As Long
Dim XMLHTTP As Object
Dim html As Object
Application.ScreenUpdating = False
Application.DisplayStatusBar = False
Application.EnableEvents = False
ActiveSheet.DisplayPageBreaks = False
lastRow = Range("a" & Rows.Count).End(xlUp).Row
For i = 1 To lastRow
Name = """" & Cells(i, 1).Value & """"
url = "https://www.google.co.in/search?q=" & Name & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)
Set XMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.6.0")
XMLHTTP.Open "GET", url, False
XMLHTTP.setRequestHeader "Content-Type", "text/xml"
XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
XMLHTTP.send
Set html = CreateObject("htmlfile")
html.body.innerHTML = XMLHTTP.ResponseText
If html.getElementById("topstuff").innerText <> "" Then '<-----Highlighted on error
Cells(i, 9) = "–"
Else
Cells(i, 9) = "Results Found"
End If
Next
End Sub
I'm not sure what is the reason, I have few guesses but am hoping someone over here could explain to me the reason and the work around. I work in research and everyday I search a lot of names and it would be very helpful if I can know which has results and which doesn't.
It would be even more amazing if someone can help with the code to return the exact number of searches (the one found at the bottom of the page).
Best Regards!
For a university research project, I plan to run a larger number of data requests with a total of c. 3,000 different spreadsheets, each including about 800-1,000 unique data requests.
Aim of the code is to get the number of Google News results for certain search terms within specific time frames, as for instance the results for "Elon Musk" between 01/01/2015 and 12/31/2015.
I have so far constructed a code that works relatively well with a single-threaded approach but would, given the large number of data requests, take weeks for the 3,000 spreadsheets to be completed (Google is occasionally blocking the requests given the sheer number of search queries but this is something generally manageable).
On the other hand, I have read that there is not "native" option in VBA to go for a time saving multi-threading approach, while several suggestions exist to work around this lack of a real multi-threading function. None of them has however really worked for my case so far.
Is there any practicable option to have the code below send out several Google requests at a time? This would allow the data collection in a much shorter time span. As mentioned, I have already been through a number of complex "out of the box" multi-threading solutions, of which none really worked.
Option Explicit
Sub TermCheck()
Dim url As String, lastRow As Long
Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object
Dim start_time As Date
Dim end_time As Date
Dim var As String
Dim var1 As Object
lastRow = Range("A" & Rows.Count).End(xlUp).Row
Dim cookie As String
Dim result_cookie As String
start_time = Time
Debug.Print "start_time:" & start_time
For i = 62 To lastRow
url = "https://www.google.com/search?q=" & Cells(i, 3) & "&source=lnt&tbs=cdr%3A1%2Ccd_min%3A" & Cells(i, 4) & "%2Ccd_max%3A" & Cells(i, 5) & "&tbm=nws"
Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
XMLHTTP.Open "GET", url, False
XMLHTTP.setRequestHeader "Content-Type", "text/xml"
XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
XMLHTTP.send
Set html = CreateObject("htmlfile")
html.body.innerHTML = XMLHTTP.responseText
Set objResultDiv = html.getElementById("rso")
Set var1 = html.getElementById("resultStats")
If Not var1 Is Nothing Then
Cells(i, 6).Value = var1.innerText
End If
DoEvents
Next
end_time = Time
Debug.Print "end_time:" & end_time
Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub
I think you are on the right track by consolidating your data upfront, this is often a good idea for most projects. I'm not so sure you should give up on Async requests so quickly, have a go at this code below and see if that helps speed up your project.
I assumed that dates where in Columns D and E, so I mocked up my data to this idea. I also hard coded 'Elon Musk' in there to make things easier testing. You probably need to change this.
Option Explicit
Sub TermCheck(RunAsync As Boolean)
Const READYSTATE_COMPLETE As Long = 4
Dim url As String
Dim WebRequest As Object
Dim WebRequests As Object
Dim CellIndex As Variant
Dim Document As Object
Dim ResultStat As Object
Dim ws As Worksheet
Dim StartDate As Date
Dim EndDate As Date
Dim i As Long
StartDate = #1/1/2015#
EndDate = #1/2/2015#
Set ws = ThisWorkbook.Worksheets("Sheet3")
Set WebRequests = CreateObject("Scripting.Dictionary")
For i = 1 To 30
'Change URL here
url = "https://www.google.com/search?q=Elon%20Musk" & _
"&source=lnt&tbs=cdr%3A1%2Ccd_min%3A" & Replace(Format(ws.Cells(i, 4), "m/d/yyyy"), "/", "%2F") & _
"%2Ccd_max%3A" & Replace(Format(ws.Cells(i, 5), "m/d/yyyy"), "/", "%2F") & "&tbm=nws"
Set WebRequest = CreateObject("MSXML2.XMLHTTP")
With WebRequest
.Open "GET", url, RunAsync
.setRequestHeader "Content-Type", "text/xml"
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
.send
End With
StartDate = DateAdd("d", 1, StartDate)
EndDate = DateAdd("d", 1, EndDate)
WebRequests.Add i, WebRequest
Next
For Each CellIndex In WebRequests.Keys
Set WebRequest = WebRequests(CellIndex)
While WebRequest.Readystate <> READYSTATE_COMPLETE: DoEvents: Wend
If WebRequest.Status = 200 Then
Set Document = CreateObject("htmlfile")
Document.body.innerhtml = WebRequest.ResponseText
Set ResultStat = Document.getElementById("resultStats")
'Missing equals sign was here
If Not ResultStat Is Nothing Then ws.Cells(CellIndex, 6).Value2 = ResultStat.innertext
End If
Next
End Sub
Sub TestRunRequests()
'Run it Synchronous
Application.ScreenUpdating = False
Dim MyTimer As Double
MyTimer = Timer
TermCheck False
Debug.Print "Synchronous took: " & Timer - MyTimer
'Run it Asynchronous
MyTimer = Timer
TermCheck True
Debug.Print "Asynchronous took: " & Timer - MyTimer
Application.ScreenUpdating = True
End Sub
Here are my the timings (in seconds) I got when I did 100 Requests for each method (async and sync):
Synchronous took: 44.5625
Asynchronous took: 22.46875
This may rather be the "pedestrian way" regarding a time saving approach but I decided to paste all relevant spreadsheets into one major Excel file. When amending the code to cover all spreadsheets simultaneously (and not only the active one), the search processes can literally be conducted in a fraction of the time otherwise required.
In my day to day tasks I currently have to search a large number of products and gather information on these products. So my idea is to search the product on google and get the info from the first search result by extracting the data from the product title section and pretty much loop this for a number of products.
Here is my code below so far:
Sub SkuAutomation()
Dim ie As Object
'Navigates to google
Set ie = CreateObject("InternetExplorer.application")
ie.Visible = True
ie.Navigate "https://google.co.uk/search?q=" & Worksheets("sheet1").Cell(9, 4).Value & " " & Worksheets("sheet1").Cells(9, 2)
'Waits for page to load before next action
Do While ie.ReadyState <> READYSTATE_COMPLETE
Loop
End Sub
I just want to add a piece of code which either clicks on the first link that google returns or returns the link for me. My idea would then to be scrape the data from the product title section from that page! still very early stages though.
I am just a beginner so any type of help would be much appreciated! Many thanks in advance.
Your mileage will likely vary on this but for what you have provided you can use a CSS selector combination to target the first link by the page styling.
I use #search div.r [href*=http] but you could simplify to #search .r a. I am interested in knowing there is an http in the href though.
The # is an id selector, a space " " is a descendant selector (selects a child of the preceeding element and the [] is an attribute selector. A "." is a class selector i.e. selects an element by class name.
I am looking for the first element with an href attribute containing http in its value that has a parent element div element with class name r, whose parent has an id of search.
Option Explicit
Public Sub GetLink()
Dim ie As New InternetExplorer
With ie
.Visible = True
.navigate "https://google.co.uk/search?q=Currys+241825"
While .Busy Or .readyState < 4: DoEvents: Wend
Debug.Print .document.querySelector("#search div.r [href*=http]").href
.Quit
End With
End Sub
This is how I would do it. Put some search criteria in Cell A2, going down in ColumnA as far as you want to go. Then run the code below. The results will go into the adjacent cells in ColumnB
girafe
rhino
starbucks
Sub Gethits()
Dim url As String, lastRow As Long
Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object
Dim start_time As Date
Dim end_time As Date
Dim var As String
Dim var1 As Object
lastRow = Range("A" & Rows.Count).End(xlUp).Row
Dim cookie As String
Dim result_cookie As String
start_time = Time
Debug.Print "start_time:" & start_time
For i = 2 To lastRow
url = "https://www.google.com/search?q=" & Cells(i, 1) & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)
Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
XMLHTTP.Open "GET", url, False
XMLHTTP.setRequestHeader "Content-Type", "text/xml"
XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
XMLHTTP.send
Set html = CreateObject("htmlfile")
html.body.innerHTML = XMLHTTP.ResponseText
Set objResultDiv = html.getelementbyid("rso")
Set var1 = html.getelementbyid("resultStats")
Cells(i, 2).Value = var1.innerText
DoEvents
Next
end_time = Time
Debug.Print "end_time:" & end_time
Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub