scraping a webpage using vba - excel

I am trying to scrape information from multiple web sites.
<div class="detailSection">
<span>Officer/Director Detail</span>
<span><b>Name & Address</b></span>
<br/>
<br/>
<span>Title VD</span>
<br/>
<br/>
GUNN, BETTY <span>
<div>
6922 SOUTH LAGOON DR<br/>
PANAMA CITY BEACH, FL 32408<br/>
</div>
I am able to pull all of the information except for the name "GUNN, BETTY".
The web page is http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResultDetail?inquiryType=DocumentNumber&aggregateId=domnp-763425-68d63992-2677-4bd5-9e1e-3f63ef505809&directionType=Initial&searchNameOrder=AMBASSADORBEACHOWNERSASSOCIATI%207634250&searchTerm=763425
Officer_Director_Detail2 = Doc.getElementsByClassName("detailSection")(5).getElementsByTagName("span")(2).innerText copies "Title VD".
Officer_Director_Detail3 = Doc.getElementsByClassName("detailSection")(5).getElementsByTagName("span")(3).innerText copies "6922 SOUTH LAGOON DR PANAMA CITY BEACH, FL 32408".
I have tried using "br" and "div" but neither will copy the name. HELP!!!

Try this code and pick the fields (txt(i)) you are interested in; "GUNN, BETTY" is at txt(5).
txt = Split(doc.getElementsByClassName("detailSection")(5).innerText, vbCrLf)
For i = 0 To UBound(txt)
MsgBox i & ":" & txt(i)
Next i
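The position of the name in the split array can shift when innerText contains blank lines or mixes line-break styles. A minimal sketch, assuming a hypothetical helper NonBlankLines (not part of the original answer), that normalizes line breaks and keeps only the non-empty lines so the fields can be inspected by position:

```vba
Option Explicit

' Hypothetical helper: splits text on any line-break style and returns
' the non-blank lines (trimmed of leading/trailing spaces) as a Collection.
Public Function NonBlankLines(ByVal rawText As String) As Collection
    Dim result As New Collection
    Dim parts() As String
    Dim i As Long
    Dim lineText As String

    ' Normalize CRLF and bare CR to LF so a single Split handles all cases.
    rawText = Replace(rawText, vbCrLf, vbLf)
    rawText = Replace(rawText, vbCr, vbLf)
    parts = Split(rawText, vbLf)

    For i = LBound(parts) To UBound(parts)
        lineText = Trim$(parts(i))
        If Len(lineText) > 0 Then result.Add lineText
    Next i

    Set NonBlankLines = result
End Function
```

It could be called as Set fields = NonBlankLines(doc.getElementsByClassName("detailSection")(5).innerText), after which fields(1), fields(2), and so on hold the visible lines in order (a Collection is 1-based).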

Sadly, you can't return a text node directly with XPath in Selenium, but you can get just that string by selecting the parent element with XPath and applying Split to its text. This uses the Selenium Type Library reference, which is available after installing SeleniumBasic.
Option Explicit
Public Sub GetInfo()
Dim d As WebDriver, arr() As String
Set d = New ChromeDriver
Const URL = "http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResultDetail?inquiryType=DocumentNumber&aggregateId=domnp-763425-68d63992-2677-4bd5-9e1e-3f63ef505809&directionType=Initial&searchNameOrder=AMBASSADORBEACHOWNERSASSOCIATI%207634250&searchTerm=763425"
With d
.AddArgument "--headless"
.Start "Chrome"
.get URL
Debug.Print Split(.FindElementByXPath("//*[@id='maincontent']/div[2]/div[6]").Text, Chr$(10))(5)
.Quit
End With
End Sub


Web scraping DEEPL.com using VBA Excel and Selenium

I'm trying to code a function to translate sentences in Excel using DEEPL.com
My approach is to use Selenium to scrape the site with Chrome (as Internet Explorer is not supported by the site).
Public Function deepL(txt As String, inputLang As String, outputLang As String)
Dim url As String
Dim driver As New WebDriver
url = "https://www.deepl.com/translator#" & inputLang & "/" & outputLang & "/" & txt
driver.Start "Chrome"
driver.Timeouts.ImplicitWait = 5000
driver.Get url
deepL = driver.FindElementById("target-dummydiv").Text
driver.Close
End Function
----
Sub translating()
'test for word "probando" from "es" to "en"
'url: https://www.deepl.com/translator#es/en/probando
'it should return: "testing"
MsgBox (deepL("probando", "es", "en"))
End Sub
The problem comes when the page loads: the div containing the translation is empty on load, so reading it right after the GET returns an empty string.
But after about 1 second, the page refreshes with the correct result:
<div id="target-dummydiv" aria-hidden="true" class="lmt__textarea lmt__textarea_dummydiv" lang="en-US">testing</div>
I tried adding an implicit wait of 5 seconds to give the webpage time to load, but the result is the same.
What am I doing wrong?
EDIT: I found that the div with the translation has visibility: hidden. If I make it visible, the results are correct, but I don't know how to do that in my code.
OK, I found a solution:
Just select the textarea where the translation is located and get the translation with .Attribute("value") instead of .Text
deepL = driver.FindElementByCss("textarea.lmt__textarea.lmt__target_textarea.lmt__textarea_base_style").Attribute("value")
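An implicit wait only waits for an element to be present, not for its content to be filled in, which is why the 5 second setting did not help. A minimal sketch of an explicit polling loop, assuming SeleniumBasic's driver.Wait (pause in milliseconds) and a hypothetical helper name WaitForValue:

```vba
Option Explicit

' Hypothetical helper: polls an element's "value" until it is non-empty
' or the timeout elapses. Requires the "Selenium Type Library" reference.
Public Function WaitForValue(ByVal driver As WebDriver, ByVal cssSelector As String, _
                             Optional ByVal timeoutSeconds As Double = 10) As String
    Dim started As Single
    Dim currentValue As Variant

    started = Timer   ' note: Timer resets at midnight; acceptable for a short wait
    Do
        currentValue = driver.FindElementByCss(cssSelector).Attribute("value")
        ' & "" turns a possible Null into an empty string before measuring it
        If Len(currentValue & "") > 0 Then Exit Do
        driver.Wait 250   ' pause 250 ms between polls
    Loop While Timer - started < timeoutSeconds

    WaitForValue = currentValue & ""
End Function
```

Inside deepL it could replace the direct read, for example deepL = WaitForValue(driver, "textarea.lmt__textarea.lmt__target_textarea.lmt__textarea_base_style"). An empty return value then means the translation did not appear within the timeout.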

Entering a value into a readonly input field with Selenium (VBA)

I need code for this, and I would like to know which value format the Excel cell should use, with one example.
Below is my code; I have also provided an image that shows what the problem looks like. Plain .SendKeys does not work here because the field is a date-time picker, so please help me out.
If I remove the "readonly" attribute from the HTML it works fine, but that is not a real solution. Can you edit this code?
Sub google_search()
Dim row As Integer
row = 2
Dim bot As WebDriver
Set bot = New WebDriver
Dim GenderDD As Selenium.WebElement
bot.Start "chrome"
bot.Get "https://abcd.com/"
bot.FindElementbyName("sample_cdate").SendKeys "Value"
Stop
End Sub
Also giving Inspect of Targeted Site, for the reference
<input type="text" class="form-control datetimepicker" name="sample_cdate" id="sample_cdate" placeholder="Date and Time of Sample Collection" readonly="">
You should replace .SendKeys() method:
'bot.FindElementbyName("patient_id").SendKeys Sheet1.Cells(row, 3).Value
bot.ExecuteScript "arguments[0].setAttribute('value', arguments[1])", _
Array(bot.FindElementById("sample_cdate"), _
Format(Sheet1.Cells(row, 16).Value, "yyyy-mm-ddThh:mm:ss"))
Because the element is readonly, you cannot type into it with .SendKeys(), just as a user cannot type into it in the browser, but you can use JavaScript to set its value attribute programmatically.
As you show, your input id may be id="sample_rdate", not sample_cdate.
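On the question of which format the cell value needs: the format the site's date-time picker expects is not shown here, so the pattern below is an assumption. A minimal sketch, with a hypothetical helper ToIsoDateTime, that turns a real Excel date/time cell value into the string passed to the script. In VBA's Format, \T keeps the letter T literal and nn is the unambiguous placeholder for minutes:

```vba
Option Explicit

' Hypothetical helper: formats a date/time value for the sample_cdate field.
' The target pattern is an assumption; adjust it to what the page expects.
Public Function ToIsoDateTime(ByVal cellValue As Variant) As String
    If Not IsDate(cellValue) Then
        Err.Raise vbObjectError + 1, "ToIsoDateTime", _
                  "Cell does not contain a date/time value"
    End If
    ' \T keeps the T literal; nn always means minutes.
    ToIsoDateTime = Format$(CDate(cellValue), "yyyy-mm-dd\Thh:nn:ss")
End Function
```

For example, ToIsoDateTime(DateSerial(2021, 3, 5) + TimeSerial(14, 7, 9)) returns "2021-03-05T14:07:09". The result can be passed as the second item of the Array given to bot.ExecuteScript. If the page still shows the old value, setting the property with "arguments[0].value = arguments[1]" instead of setAttribute may be worth trying.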

How to find a table using selenium and vba on webpage that uses iframes?

The code below worked until a few days ago: it went to the url, found the table and imported its contents into Excel. I then did some other formatting to get the table into the appropriate rows and columns. But now this code cannot locate the table. I do not fully understand the "Set a = .FindElementsByTag("iframe")(2)" and the ".SwitchToFrame 1" lines. My general understanding is that this portion of the code switches to a different frame and extracts the internal url, which is then used to get the data from the table.
I need help identifying what to change in order to get the intended "url2", which is "https://docs.google.com/spreadsheets/d/e/2PACX-1vT__QigQ9cJV03ohUkeK5dgQjfAbJqxrc68bXh9Is1WFST8wjxMxDy7hYUCFHynqRvInsANUI22GdIM/pubhtml?gid=817544912&single=true&chrome=false&widget=false&headers=false" url. *note: I do not use this docs.google url because I do not know if this url will change periodically. I know the rosterresource.com/mlb-roster-grid url will stay consistent.
I have tried changing some of the integers for "Set a = .FindElementsByTag("iframe")(2)" and the ".SwitchToFrame 1", but I am doing that blindly since I am not familiar with this part of the code.
Sub GetRRgrid()
'"Selenium type library" is a reference used
Dim d As WebDriver, a As Object
Set d = New ChromeDriver
Const url = "https://www.rosterresource.com/mlb-roster-grid/"
With d
.Start "Chrome"
.Get url
Set a = .FindElementsByTag("iframe")(2)
.SwitchToFrame 1
url2 = .FindElementByCss("iframe").Attribute("src")
.Get url2
ele = .FindElementByTag("tbody").Attribute("innerText")
d.Close
End With
' other processes to format the data after it is imported
End Sub
Getting the iframe and switching to it:
You need to pass the iframe element (the identifier argument) to SwitchToFrame; you are then within that document and can interact with its contents. There is no need to .get its URL with Selenium. Call .SwitchToDefaultContent to go back to the parent document.
You can identify the iframe in question in a number of ways. Modern browsers are optimized for css selectors so I usually go with those. The css equivalent of
.FindElementByTag("iframe")
is
.FindElementByCss("iframe")
Your iframe is the first (and only) so I wouldn't bother gathering a set of webElements and indexing into it. Also, you want to try for a short selector of a single element where possible to be more efficient.
VBA:
Option Explicit
Public Sub Example()
Dim d As WebDriver
Const URL As String = "https://www.rosterresource.com/mlb-roster-grid/"
Set d = New ChromeDriver
With d
.Start "Chrome"
.get URL
.SwitchToFrame .FindElementByCss("iframe")
Stop
.Quit
End With
End Sub
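The round trip described above (into the frame, then back to the parent document) can be sketched as follows. This assumes the frame contains a tbody element, as in the question's code, and that SeleniumBasic's .Title property returns the title of the current top-level page; the procedure name FrameRoundTrip is hypothetical:

```vba
Option Explicit

' Sketch: read from inside the iframe, then return to the parent document.
' Requires the "Selenium Type Library" reference (SeleniumBasic).
Public Sub FrameRoundTrip()
    Dim d As WebDriver
    Const URL As String = "https://www.rosterresource.com/mlb-roster-grid/"
    Set d = New ChromeDriver
    With d
        .Start "Chrome"
        .Get URL
        .SwitchToFrame .FindElementByCss("iframe")     ' now inside the iframe document
        Debug.Print .FindElementByTag("tbody").Text    ' element that lives inside the frame
        .SwitchToDefaultContent                        ' back to the parent document
        Debug.Print .Title                             ' title of the parent page
        .Quit
    End With
End Sub
```

While the driver is switched into the frame, selectors only see elements in the frame's document, so anything on the outer page has to be located after .SwitchToDefaultContent.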
Writing to Excel (.AsTable.ToExcel) :
Something I only just discovered, haven't seen documented anywhere, and am excited by, is that there is a method to write the table direct to Excel:
Option Explicit
Public Sub Example()
Dim d As WebDriver
Const URL As String = "https://www.rosterresource.com/mlb-roster-grid/"
Set d = New ChromeDriver
With d
.Start "Chrome"
.get URL
.SwitchToFrame .FindElementByTag("iframe")
.FindElementByCss(".waffle").AsTable.ToExcel ThisWorkbook.Worksheets("Sheet1").Range("A1")
Stop
.Quit
End With
End Sub
Here is what I ended up doing for this question. Thanks to QHarr for the guidance.
Public Sub GetRRrostergrid()
Dim d As WebDriver
Const URL As String = "https://www.rosterresource.com/mlb-roster-grid/"
Dim URL2 As String
Set d = New ChromeDriver
Sheet20.Activate
With d
.Start "Chrome"
.Get URL
URL2 = .FindElementByClass("post_content").FindElementByTag("iframe").Attribute("src")
.Get URL2
.FindElementByCss(".waffle").AsTable.ToExcel ThisWorkbook.Worksheets("RRchart").Range("b1")
.Quit
End With
End Sub

IE script for manipulating forms based on Excel data

I'm attempting to:
open a specific URL & pass log-in information
grab data from Excel and search specified data
once search is complete, manipulate a data field to correlating Excel data and execute several commands within the application
close IE or loop search for next cell in data
I've attempted using VBA forms and modules.
I found this code online which seemed to have worked once to pass my credentials, but I can't get it to work again.
Would these objects, all.Email and all.Password, be found in the page's source code as the element IDs?
HTMLDoc.all.Email.Value = "email@example.com"
HTMLDoc.all.Password.Value = "ex5566"
Dim HTMLDoc As HTMLDocument
Dim oBrowser As InternetExplorer
Sub Login_2_Website()
Dim oHTML_Element As IHTMLElement
Dim sURL As String
On Error GoTo Err_Clear
sURL = "example.com"
Set oBrowser = New InternetExplorer
oBrowser.Silent = True
oBrowser.timeout = 60
oBrowser.navigate sURL
oBrowser.Visible = True
Do
DoEvents ' Wait until the browser is loaded, yielding so Excel stays responsive
Loop Until oBrowser.readyState = READYSTATE_COMPLETE
Set HTMLDoc = oBrowser.Document
HTMLDoc.all.Email.Value = "email@example.com"
HTMLDoc.all.Password.Value = "ex5566"
For Each oHTML_Element In HTMLDoc.getElementsByTagName("input")
If oHTML_Element.Type = "submit" Then oHTML_Element.Click: Exit For
Next
' oBrowser.Refresh ' Refresh If Needed
Err_Clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
End Sub
I think you can use the same code, which you use for finding the submit button, to find the e-mail and password elements. If you know which name or id these elements have (by checking the html code of the page), you can use for instance If oHTML_Element.Name = "password" then oHTML_Element.Value = "ex5566"
If the specific elements have an ID, you can also go directly to them by using Set oHTML_Element = HTMLDoc.getElementById("[id of element]")
oHTML_Element.Value = "password"
This can also be done if they have only a name and no id, but then you have to check whether the name is used multiple times.
The web developer can name their inputs, buttons, forms, and ids whatever they want. The email could be named Email, or ID, or Username, or XYZ, which is why you must inspect the elements in the website so you can build your code accordingly. Let's take twitter for example.
<input class="js-username-field email-input js-initial-focus" type="text" name="session[username_or_email]" autocomplete="on" value="" placeholder="Phone, email or username">
The tag is an input tag with a class name of js-username-field email-input js-initial-focus. There is no ID on it, so you cannot use HTMLDoc.getElementById; you have to use HTMLDoc.getElementsByClassName, or you could use HTMLDoc.getElementsByTagName, but if there is more than one input you have to loop over them and correctly detect the one you need.
It's easier than it sounds, but you have to have some basic knowledge of HTML. Continuing with twitter, the tag for the password is:
<input class="js-password-field" type="password" name="session[password]" placeholder="Password">
Different class and different name to differentiate between the two. And finally the login/submit button:
<button type="submit" class="submit EdgeButton EdgeButton--primary EdgeButtom--medium">Log in</button>
With these 3 portions of the HTML elements, you can log in the following way:
HTMLDoc.getElementsByClassName("js-username-field email-input js-initial-focus")(0).Value = "email@example.com"
HTMLDoc.getElementsByClassName("js-password-field")(0).Value = "ex5566"
HTMLDoc.getElementsByClassName("submit EdgeButton EdgeButton--primary EdgeButtom--medium")(0).Click
What does the (0) mean? In HTML many tags can share the same class name, and getElementsByClassName returns them all in a collection. Since the login page has only one tag with those class names, index 0 is the one you are looking for.
Again, the developer can name the class, the id, anything they want, therefore you want to inspect the website to properly code your script.
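Since both inputs in the twitter example carry a name attribute, matching on that attribute is an alternative to the long class strings. A minimal sketch, assuming the page renders in a document mode that supports querySelector and that the project references the Microsoft HTML Object Library; the procedure name FillLogin is hypothetical:

```vba
Option Explicit

' Sketch: fill a login form by matching the name attribute instead of class names.
' Requires a reference to "Microsoft HTML Object Library".
Public Sub FillLogin(ByVal HTMLDoc As HTMLDocument)
    Dim userBox As Object, passBox As Object, submitButton As Object

    ' Attribute selectors taken from the HTML snippets shown above.
    Set userBox = HTMLDoc.querySelector("input[name='session[username_or_email]']")
    Set passBox = HTMLDoc.querySelector("input[name='session[password]']")
    Set submitButton = HTMLDoc.querySelector("button[type='submit']")

    ' querySelector returns Nothing when no element matches.
    If userBox Is Nothing Or passBox Is Nothing Or submitButton Is Nothing Then
        MsgBox "Login elements not found; inspect the page and adjust the selectors."
        Exit Sub
    End If

    userBox.Value = "email@example.com"
    passBox.Value = "ex5566"
    submitButton.Click
End Sub
```

Checking for Nothing before writing avoids the run-time error 91 that occurs when a selector no longer matches after the site changes its markup.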

Web scraping worked fine in IE 9 but breaks in IE 11

I had a procedure that scraped information from a website in IE9; however, after updating to IE11 the procedure breaks when trying to enter a piece of data into an input box on the webpage. The code recognizes the field and it is listed as an object when I debug, but when I try to enter a value into the box using CUSIP.Value it does not enter anything on the webpage. I think it has something to do with the source being updated after the browser was updated. I could have sworn that the identifier for "txtCusipNo" in the HTML was listed as an ID instead of a Name. Any help is appreciated. Thanks.
HTML from website
<td class="tbl1">
<INPUT TYPE="TEXT" NAME="txtCusipNo" VALUE="" CLASS="input" SIZE="11" MAXLENGTH="9">
<img src="/RDPANN/pbs/images/lookup.gif" border="0" alt="Open Security Finder" align="absmiddle"> <IMG NAME="txtCusipIMG"SRC="/RDPANN/pbs/images/req.gif" ALIGN="ABSMIDDLE">
</td>
VBA code
Private Sub EnterCUSIP()
Retry:
Set CUSIP = Doc.getElementById("txtCusipNo")
Err.Clear
valA = ActiveSheet.Cells(row, 1)
On Error Resume Next
CUSIP.Value = ActiveSheet.Cells(row, 1) 'insert CUSIP
If Err.Number = 91 Then GoTo Retry
Set CurrentWindow = IE.document.parentWindow
Call CurrentWindow.execScript("javascript:processForm(document.forms.frmSearchEntry)") 'Search (hit enter)
If Err.Number = -2147352319 Then Exit Sub
On Error GoTo 0
Do While (IE.Busy Or IE.READYSTATE <> READYSTATE.READYSTATE_COMPLETE):DoEvents: Loop
End Sub
If you suspect that the HTML source has changed and may change again without notice, I would recommend switching to the ie.Document.All.Item property.
Doc.all.Item("txtCusipNo").Value = 123
The .Item identifier can be either an ID or a Name, there is no distinction between the two. However, I would be concerned that the identifying factor (e.g. txtCusipNo) may not be unique on that page. Yes, it is supposed to be but a growing number of HTML developers are using code like divs(0).getElementById("txtCusipNo") and divs(1).getElementById("txtCusipNo").
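The HTML in the question gives the input a NAME but no ID. Older IE document modes are known to let getElementById match a NAME as well, which is a likely reason the lookup behaved differently after the upgrade. A minimal sketch of looking the field up by name explicitly; the function name SetCusip and its parameters are hypothetical:

```vba
Option Explicit

' Sketch: locate the input by its NAME attribute, since the HTML shown has no ID.
' Doc is assumed to be the loaded HTML document of the page.
Public Function SetCusip(ByVal Doc As Object, ByVal cusipValue As String) As Boolean
    Dim matches As Object

    Set matches = Doc.getElementsByName("txtCusipNo")
    If matches.Length = 0 Then Exit Function   ' returns False: field not found (yet)

    matches.Item(0).Value = cusipValue         ' first element carrying that name
    SetCusip = True
End Function
```

The Boolean result lets the caller retry or stop deliberately, for example If Not SetCusip(IE.document, ActiveSheet.Cells(row, 1).Value) Then ..., instead of depending on error 91 inside an On Error Resume Next block.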
