Get hyperlinks from web - excel

I am trying to get the data with hyperlinks from the web. I copied the data from the web and pasted it in excel. Whole data has been pasted in single-cell and no hyperlink carried when I separated the data with text to columns.
Source link: https://www.sec.gov/cgi-bin/current?q1=3&q2=6&q3=
I also tried to dump the data in Excel using the "From Web" option. Unfortunately, no hyperlink carried. Could you help provide suggestions?
Thanks

The macro grabs only all links (second and third column) from the table (which is no table). It takes a moment. Wait till the IE will close. Read the comments in the code please:
Sub LinkList()
Dim url As String
Dim browser As Object
Dim nodeContainer As Object
Dim nodeAllLinks As Object
Dim nodeOneLink As Object
Dim currentRow As Long
Dim controlCounter As Long
ActiveSheet.Columns("B:B").NumberFormat = "#"
ActiveSheet.Columns("D:D").NumberFormat = "#"
currentRow = 2
url = "https://www.sec.gov/cgi-bin/current?q1=3&q2=6&q3="
'Initialize Internet Explorer, set visibility,
'call URL and wait until page is fully loaded
Set browser = CreateObject("internetexplorer.application")
browser.Visible = True 'You can set this to False to make the IE invisible
browser.navigate url
Do Until browser.ReadyState = 4: DoEvents: Loop
'Get the container with all links inside
Set nodeContainer = browser.document.getElementsByTagName("pre")(0)
'Get all links in a node collection
Set nodeAllLinks = nodeContainer.getElementsByTagName("a")
'Get each link
For Each nodeOneLink In nodeAllLinks
'Every second link should be in the same row than the first link of a HTML table row
If controlCounter Mod 2 = 0 Then
With ActiveSheet
'Set link as link
.Hyperlinks.Add Anchor:=.Cells(currentRow, 1), Address:=nodeOneLink.href, TextToDisplay:=nodeOneLink.href
'Write the text of the link from the page to the column afte the link in Excel
.Cells(currentRow, 2).Value = nodeOneLink.innertext
End With
Else
With ActiveSheet
.Hyperlinks.Add Anchor:=.Cells(currentRow, 3), Address:=nodeOneLink.href, TextToDisplay:=nodeOneLink.href
.Cells(currentRow, 4).Value = nodeOneLink.innertext
End With
currentRow = currentRow + 1
End If
'Increment the control variable to devide between first and second link
controlCounter = controlCounter + 1
Next nodeOneLink
'Clean up
browser.Quit
Set browser = Nothing
Set nodeContainer = Nothing
Set nodeAllLinks = Nothing
Set nodeOneLink = Nothing
ActiveSheet.Columns("A:D").EntireColumn.AutoFit
End Sub

Related

Web Scraping - Problems with tags

I am new scraping web data and also using For...Next. I am trying to get data (all pages) from a website but it seems the code is wrong, since I get error 91. This is the code:
Dim ie As Object
Sub connect()
Set ie = CreateObject("INTERNETEXPLORER.APPLICATION")
ie.NAVIGATE "https://www.worldathletics.org/world-rankings/100m/men"
ie.Visible = True
End Sub
Sub id_tr_td_for()
Range("a1:z10000").ClearContents
For i = 0 To 10
For j = 0 To 5
Cells(i + 1, j + 1) = ie.document.getElementById("toplists").getElementsByTagName("tr")(i).getElementsByTagName("td")(j).innerText
Next
Next
End Sub
Can somebody help me with it and also to let me know who can I list all pages?
Thank you.
I'm not sure where the error comes from, I got it too.
The following code should be helpful, it will print the contents of the table for the specified page(s) to the debug window.
The following code should copy all the data for selected pages to sheet1
You will need to Add a couple of references in the VBA Editor to be able to use it. (Tools Menu, References and then find and select them) Microsoft HTML Object Library and Microsoft Internet Controls
Const MaxPage = 2 ' set to 26 (or however many there are) - at 2 for testing purposes
Dim Browser As InternetExplorer
Sub Start()
Dim Page As Integer: Page = 1 ' start at page 1
Dim PageDocument As IHTMLDocument
Dim RecordRow As IHTMLElementCollection
Dim RecordItem As IHTMLElement
Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1") ' output sheet
If Browser Is Nothing Then
Set Browser = New InternetExplorer
End If
Dim oRow As Integer: oRow = 2 ' begin output at row 2 (account for header)
Dim Record As Integer
For Page = 1 To MaxPage
LoadPage Page
For Record = 0 To 99 ' zero index, 100 items (1-99)
Set PageDocument = Browser.Document
Set RecordRow = PageDocument.getElementById("toplists").getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")(Record).getElementsByTagName("td")
Sheet.Cells(oRow, 1).Value = Trim(RecordRow(0).innerText)
Sheet.Cells(oRow, 2).Value = Trim(RecordRow(1).innerText)
Sheet.Cells(oRow, 3).Value = Trim(RecordRow(2).innerText)
Sheet.Cells(oRow, 4).Value = Trim(RecordRow(3).innerText)
Sheet.Cells(oRow, 5).Value = Trim(RecordRow(4).innerText)
Sheet.Cells(oRow, 6).Value = Trim(RecordRow(5).innerText)
oRow = oRow + 1
Next Record
Next Page
Browser.Quit
End Sub
Sub LoadPage(ByVal PageNumber As Integer)
Debug.Print "Navigating to Page #" & CStr(PageNumber)
Browser.navigate "https://www.worldathletics.org/world-rankings/100m/men?page=" & CStr(PageNumber)
While Browser.readyState <> 4 Or Browser.Busy: DoEvents: Wend
Debug.Print "Navigation Complete"
End Sub
Updated Code
The Index Out-of-Bound error likely occurred due to the hard-coded indexes, if a page does not have 99 records it will fail, if a record doesn't have 5 fields, it will fail. The following code does away with indexes and just scrapes every row and cell it finds. You shouldn't get index errors but the output could be jagged.
Further Update
The 462 error was caused by the Browser.Quit. This closes the browser but does not set the reference to Nothing so when you run the code again it is trying to use a non-existent browser. Explicitly setting it to nothing at the end fixes this.
There is no link in the competitor column, the whole row has a data-url which is handled by something else. That URL can easily be accessed though.
Sub NewStart()
Dim PageDocument As IHTMLDocument
Dim Records As IHTMLElementCollection
Dim Record As IHTMLElement
Dim RecordItems As IHTMLElementCollection
Dim RecordItem As IHTMLElement
Dim OutputRow As Integer: OutputRow = 2
Dim OutputColumn As Integer
Dim Page As Integer
Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1")
If Browser Is Nothing Then
Set Browser = New InternetExplorer
Browser.Visible = True
End If
For Page = 1 To MaxPage
LoadPage Page
Set PageDocument = Browser.Document
Set Records = PageDocument.getElementById("toplists").getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
For Each Record In Records
Set RecordItems = Record.getElementsByTagName("td")
OutputColumn = 1
For Each RecordItem In RecordItems
Sheet.Cells(OutputRow, OutputColumn).Value = Trim(RecordItem.innerText)
OutputColumn = OutputColumn + 1
Next RecordItem
Sheet.Cells(OutputRow, OutputColumn).Value = "http://worldathletics.org/" & Record.getAttribute("data-athlete-url") ' This will add the link after the last column
OutputRow = OutputRow + 1
Next Record
Next Page
Browser.Quit
Set Browser = Nothing ' This will fix the 462 error
End Sub

Get Value and Position of Checkbox in Word Table to Copy Entire Table to Excel

I have a Word file with approximately 10 pages and 20 tables. Some of these tables have checkboxes. I want to copy these tables into an Excel file.
The following code copies all tables from my Word file into my Excel file:
Sub Import()
Option Explicit
Dim wb As Workbook
Dim sh As Worksheet
Dim sheet As Worksheet
Dim lzeile As Integer
Set wb = ActiveWorkbook
Set sh = wb.ActiveSheet
Set sheet = wb.Worksheets("Tabelle1")
Dim Btxt As Object
Set Btxt = CreateObject("Word.Application")
Btxt.Visible = True
Btxt.documents.Open "C:\Users\*.doc" '*=path
lzeile = 0
For i = 1 To 20
Btxt.ActiveDocument.Tables(i).Range.Copy
Application.Goto sheet.Cells(1 + lzeile, 1)
sheet.PasteSpecial Format:="HTML", Link:=False, DisplayAsIcon:=False
lzeile = sheet.Cells(Rows.Count, 1).End(xlUp).Row
lzeile = lzeile + 1
sheet.Cells(lzeile, 1) = "Tabelle" & i
Next i
Btxt.Quit
Set Btxt = Nothing
Set wb = Nothing
Set sh = Nothing
End Sub
It does not copy checkboxes or the value (0 = not checked / 1 = checked) of the checkbox.
I can write the value of a checkbox into a cell in my excel sheet with this line:
sheet.Cells(j, 10) = Btxt.ActiveDocument.Tables(i).FormFields.Item("Kontrollkästchen" & j).Result
With a loop j over all "Kontrollkästchen" (german translation of contentcontrol or formfield item) so basically the name of all formfield items in this Word file.
How can I get the position of these formfield items or identify which formfield item / ContentControl is in which table?
I tried to go through all rows and columns in each table because none of them are larger than 10x10. But I can´t find a way to check if a checkbox is maybe in table 3 on column 5 row 5 and then read the name of this checkbox to a safe the value (0 / 1) in the Excel cell on the same position in my copied table.
The solution depends on whether they're formfields or content controls.
Assuming they're formfields:
Sub Demo()
Dim i As Long, j As Long, Rng As Range
With ActiveDocument
For i = .FormFields.Count To 1 Step -1
With .FormFields(i)
If .Type = wdFieldFormCheckBox Then
j = Abs(.CheckBox.Value)
Set Rng = .Range
.Delete
Rng.Text = j
End If
End With
Next
End With
End Sub
Assuming they're content controls:
Sub Demo()
Dim i As Long, j As Long, Rng As Range
With ActiveDocument
For i = .ContentControls.Count To 1 Step -1
With .ContentControls(i)
If .Type = wdContentControlCheckBox Then
j = Abs(.Checked)
Set Rng = .Range
.Delete
Rng.Text = j
End If
End With
Next
End With
End Sub
For the sake of simplicity and clarity, the sample code below leaves out the parts having to do with Excel, as well as creating the instance of the Word Application. It shows only how to access the Word document's checkboxes and convert those to static values.
At the end, also, the document is closed without saving changes, which means forms protection and the checkboxes should be left intact - the macro will not have affected them.
Note: You should have Option Explicit at the top of the code page, not inside a "Sub".
How it works
The document to be processed is opened and at that moment set to an object (doc). Use this instead of ActiveDocument as it will be much clearer and, in case the user would try to do something, won't affect macro execution.
If the document has forms protection, this must be turned off in order to delete the checkboxes and insert static values.
Then all the form fields are looped. If they are checkboxes, the value is determined, the checkbox removed and the value assigned to the range the checkbox occupied.
After this has completed comes the code to transfer data to Excel. Then the document is closed without saving changes.
Sub ConvertCheckBoxesToValues()
Dim ff As Object ' Word.FormField
Dim doc As Object ' Word.Document
Dim cbValue As String
Dim rngFF As Object ' Word.Range
Set doc = Btxt.Documents.Open("C:\Users\*.doc") '*=path
If doc.ProtectionType <> -1 Then 'wdNoProtection
doc.Unprotect
End If
For Each ff In doc.FormFields
If ff.Type = 71 Then 'wdFieldFormCheckBox
If ff.CheckBox.value = True Then
cbValue = "1"
Else
cbValue = "0"
End If
Set rngFF = ff.Range
ff.Delete
rngFF = cbValue
End If
Next
'Transfer the information to Excel, then
doc.Close 0 'wdDoNotSaveChanges
End Sub

Error 1004 on Resizing range and Transposing array

I will not post full code since it's quite huge - I will focus on a part that is causing an error.
The macro is supposed to copy URL's generated in excel, open them in IE, copy source code to another sheet, look for something in this code, save results in specific cell, remove sheet and go to next URL. It works quite well, it copies the source codes for many URLs, but for some URLs it just fails. When I open the URLs manually - they work perfectly, but somehow Excel throws me an error for them.
Could you guys check the below could to help me better understand where is the problem?
Here are two samples links:
This one works good - link1
This one throws error 1004 - link2
And here is the code:
Sub CC_Check()
Dim ie As InternetExplorer
Dim html As HTMLDocument
Dim URL As Range
Dim Rng As Range
Dim ws1 As Worksheet
Set ws1 = Worksheets("One Code")
Set ie = New InternetExplorer
Set Rng = ws1.Range("A3:A18")
For Each URL In Rng
ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
ActiveSheet.Name = ws1.Cells(URL.Row, 2).Value & "_" & ws1.Cells(6, 7).Value
ie.Visible = False
ie.navigate URL.Value
Do While ie.readyState <> READYSTATE_COMPLETE
DoEvents
Loop
Set html = ie.document
Range("A1").Value = html.DocumentElement.outerHTML
Dim arr
arr = Split(html.DocumentElement.outerHTML, vbLf)
Range("A1").Resize(UBound(arr) + 1, 1).Value = Application.Transpose(arr) '<-- this line causing error 1004
The Application.Transpose has a number of problems. It fails when
The array has only one member (UBound(arr) = 1)
One of the strings has a length > 32K (but I have seen other cases where it failed already when a string had more that 255 chars)
The array size is larger than 64K (however, in Excel 2016 this will not cause a runtime error but a crippled array with less size
So, bet bet is to do the transform by hand which is rather easy. You should, by the way, use a Worksheet-variable for the sheets you add - never rely on Activesheet. The following code will create the new sheet only if it doesn't exist (else it will clear it's content so you can run the code several times
Set newWs = Nothing
On Error Resume Next
Set newWs = ThisWorkbook.Sheets(wsName)
On Error GoTo 0
If newWs Is Nothing Then
' Sheet doesn't exist, create a new one and name it
Set newWs = ThisWorkbook.Sheets.Add(After:=Sheets(Sheets.Count))
newWs.Name = ws1.Cells(URL.row, 2).Value & "x" & ws1.Cells(6, 7).Value
Else
' Sheet already there, clear its content
newWs.UsedRange.ClearContents
End If
(..Load HTML and split..)
' Do your own transpose into a 2nd array and dump that into sheet
Dim brr
ReDim brr(LBound(arr) To UBound(arr), 1 To 1) ' Make it 2-dimensional
Dim i As Long
For i = LBound(arr) To UBound(arr)
brr(i, 1) = arr(i)
Next i
Range("A1").Resize(UBound(arr) + 1, 1).Value = brr

eBay Product scraper

I am very limited on VBA,
The Code is in a Module, the code has a sub process as well, so sorry if I post the code wrong
A) open IE
B) Subprocess gets the data.
The code works fine on ebay.com but NOT for ebay.co.uk - can't work out why, also it converts urls to hyperlinks
It only does the first page, I need it to go through an X amount of pages - have a code but can't get it to work so have removed it.
Can the search query be run AFTER Ebay opens, so it opens, then search item is input to ebay and then code runs, or to run from a cell, IF its Cell A1 the data extracted needs to be pasted in A2 and below.
I have looked at elements for ebay.com and ebay.co.uk and they look the same to me, so can't work out why its not working as it works for 1 and not the other.
I did input the code for getting data from several pages it did not work. I know this code works as I have it for when I fetch urls from google
Public IE As New SHDocVw.InternetExplorer
Sub GetData()
Dim HTMLdoc As MSHTml.HTMLDocument
Dim othwb As Variant
Dim objShellWindows As New SHDocVw.ShellWindows
Set IE = CreateObject("internetexplorer.application")
With IE
.Visible = True
'.Navigate "https://www.ebay.co.uk/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=jackets&_sacat=0"
.Navigate "https://www.ebay.com/sch/i.html_from=R40&_nkw=ralph+lauren&_sacat=1059&LH_TitleDesc=0&_dmd=1&rt=nc"
While .Busy Or .readyState <> 4: DoEvents: Wend
Set HTMLdoc = IE.document
ProcessHTMLPage HTMLdoc
.Quit
End With
End Sub
code here
enter
'''''' THIS IS THE SUB PROCESS '''''
Sub ProcessHTMLPage(HTMLPage As MSHTml.HTMLDocument)
Dim HTMLItem As MSHTml.IHTMLElement
Dim HTMLItems As MSHTml.IHTMLElementCollection
Dim HTMLInput As MSHTml.IHTMLElement
Dim rownum As Long
rownum = 1
Set HTMLItems = HTMLPage.getElementsByClassName("s-item__title")
For Each HTMLItem In HTMLItems
Cells(rownum, 1).Value = HTMLItem.innerText
rownum = rownum + 1
Next HTMLItem
rownum = 1
Set HTMLItems = HTMLPage.getElementsByClassName("s-item__price")
For Each HTMLItem In HTMLItems
Cells(rownum, 2).Value = HTMLItem.innerText
rownum = rownum + 1
Next HTMLItem
rownum = 1
Set HTMLItems = HTMLPage.getElementsByClassName("s-item__link")
For Each HTMLItem In HTMLItems
Cells(rownum, 3).Value = HTMLItem.href
rownum = rownum + 1
Next HTMLItem
'Converts each text hyperlink selected into a working hyperlink from C1 to 25000 rows
Range("C1:C25000").Select
For Each xCell In Selection
ActiveSheet.Hyperlinks.Add Anchor:=xCell, Address:=xCell.Formula
Next xCell
Range("C1").Select
End Sub
Code for going to next page
pageNumber = 1
'i = 2
If pageNumber >= 6 Then Exit Do 'the first 6 pages
internetdata.getElementById("pnnext").click 'next web page
Do While internet.Busy Or internet.readyState <> 4
DoEvents
Loop
Set internetdata = internet.document
pageNumber = pageNumber + 1
Loop
Does not work on Ebay.co.uk - NO RESULTS ARE EXTRACTED - Works fine in ebay.com
Need it to get data from X amount of pages and NOT just 1 page
Can the search query be run AFTER Ebay opens, so it opens, then search item is input to ebay and then code runs, or to run from a cell, IF its Cell A1 the data extracted needs to be pasted in A2 and below.
This is my code for google search, I have got it working so the search comes from cell A1, I am look for something like this, I am going to see if I can use the ebay code with this. As this also does the first 25 pages in google search
enter Sub webpage()
Dim ie As Object
Dim htmlDoc As Object
Dim nextPageElement As Object
Dim div As Object
Dim link As Object
Dim url As String
Dim pageNumber As Long
Dim i As Long
' Takes seach from A1 and places it into google
url = "https://www.google.co.uk/search?q=" & Replace(Worksheets("Sheet1").Range("A1").Value, " ", "+")
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.navigate url
Do While .Busy Or .readyState <> 4
DoEvents
Loop
End With
Application.Wait Now + TimeSerial(0, 0, 5)
Set htmlDoc = ie.document
pageNumber = 1
i = 2
Do
For Each div In htmlDoc.getElementsByTagName("div")
If div.getAttribute("class") = "r" Then
Set link = div.getElementsByTagName("a")(0)
Cells(i, 2).Value = link.getAttribute("href")
i = i + 1
End If
Next div
If pageNumber >= 25 Then Exit Do 'the first 25 pages
Set nextPageElement = htmlDoc.getElementById("pnnext")
If nextPageElement Is Nothing Then Exit Do
' Clicks web next page
nextPageElement.Click 'next web page
Do While ie.Busy Or ie.readyState <> 4
DoEvents
Loop
Application.Wait Now + TimeSerial(0, 0, 5)
Set htmlDoc = ie.document
pageNumber = pageNumber + 1
Loop
MsgBox "All Done"
Set ie = Nothing
Set htmlDoc = Nothing
Set nextPageElement = Nothing
Set div = Nothing
Set link = Nothing
End Sub
code here
Question 1: Why does it work for one domain but not the other?
To answer question 1 (the other questions should be new posts) - the html is not the same at all. The classes which work for ebay.com are not found in ebay.co.uk; So, your loop over collections doesn't do anything because they are count 0 (or length 0 with nodeLists if using querySelectorAll). Instead, you need branched code. Set your selectors based on the url domain.
I have used css selectors as this is the easiest, and fastest way, to select the required elements whilst maintaining the flexibility of a code re-factor to reduce the lines of repeated code.
Side note:
If you are unsure about whether your selection method will work across different pages you can do at least two things:
Right click > inspect element > visually check the class names are the same for the elements you are attempting to compare. So, if you are looking at product names, are the class names in the html the same on both pages?
You can use the search facility of the browser > open element tab via F12 then press Ctrl+F to pull up search box > enter your class name from the first page into this box in the second page and hit enter. You can also enter css selectors here and some cases regex. You will get a hit count telling you how many matches found. You can keep pressing enter to cycle through matches and each match will be highlighted in the html above, so you can easily compare if matched results are what you expected.
click image to enlarge
img url: https://i.stack.imgur.com/MWkEx.png
VBA:
Option Explicit
Public Sub GetData()
Dim htmlDoc As MSHTML.HTMLDocument, ie As SHDocVw.InternetExplorer, ws As Worksheet
Set ie = New SHDocVw.InternetExplorer
Set htmlDoc = New MSHTML.HTMLDocument
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
'.Navigate2 "https://www.ebay.co.uk/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=jackets&_sacat=0"
.Navigate2 "https://www.ebay.com/sch/i.html_from=R40&_nkw=ralph+lauren&_sacat=1059&LH_TitleDesc=0&_dmd=1&rt=nc"
While .Busy Or .readyState <> 4: DoEvents: Wend
Dim index As Long, HTMLItems As Object, rowNum As Long, xCell As Range
Dim cssSelectors(), i As Long
Select Case True
Case InStr(.document.URL, "ebay.co.uk") > 0
cssSelectors = Array(".gvtitle a", ".amt", ".gvtitle a")
Case InStr(.document.URL, "ebay.com") > 0
cssSelectors = Array(".s-item__title", ".s-item__price", ".s-item__link")
End Select
With ws
For i = LBound(cssSelectors) To UBound(cssSelectors)
rowNum = 1
Set HTMLItems = ie.document.querySelectorAll(cssSelectors(i))
For index = 0 To HTMLItems.length - 1
.Cells(rowNum, i + 1).Value = IIf(i = 2, HTMLItems.item(index).getAttribute("href"), HTMLItems.item(index).innerText)
rowNum = rowNum + 1
Next
Next
For Each xCell In .Range("C1:C25000") '<= all these really?
.Hyperlinks.Add Anchor:=xCell, Address:=xCell.Formula
Next xCell
End With
.Quit
End With
End Sub
If this works on eBay then you need to find out yourself why it's not working on ebay.co.uk. My point is if the code itself works than there is nothing we can help you with here. You need to take some time to investigate ebay.co.uk and find the differences as I am sure it's something minor. I can't help u fix code that isn't actually broken. I wish you luck though.

Excel Pulling multiple Tables From a Website

I am working on a project to run some analytical models on NFL player stats. I have some code below that another user passed along to me. This code takes a list of links that I have on Sheet1, which is named "PlayerList", and creates a new tab for each player and pulls in their passing stats. All of the links are to Pro Football Reference. I am able to change this code to pull all necessary data for all positions other than quarterback. For the QBs I want to pull the passing stats table as well as the rushing and receiving stats table. Any help would be greatly appreciated. For reference here a few sample links:
https://www.pro-football-reference.com/players/R/RodgAa00.htm
https://www.pro-football-reference.com/players/B/BreeDr00.htm
Below is the code:
Option Explicit
Public Sub GetInfo()
Di If InStr(links(link, 1), "https://") > 0 Then
Set html = GetHTMLDoc(links(link, 1))
Set hTable = html.getElementById("passing")
If Not hTable Is Nothing Then
playerName = GetNameAbbr(links(link, 1))
Set ws = AddPlayerSheet(playerName)
WriteTableToSheet hTable, ws
FixTable ws
End If
End If
Next
Application.ScreenUpdating = True
End Sub
Public Function GetHTMLDoc(ByVal url As String) As HTMLDocument
Dim sResponse As String, html As New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", url, False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
html.body.innerHTML = sResponse
Set GetHTMLDoc = html
End Function
Public Sub WriteTableToSheet(ByVal hTable As HTMLTable, ByVal ws As Worksheet)
Dim x As Long, y As Long
With hTable
For x = 0 To .Rows.Length - 1
For y = 0 To .Rows(x).Cells.Length - 1
If y = 6 Or y = 7 Then
ws.Cells(x + 4, y + 1).Value = Chr$(39) & .Rows(x).Cells(y).innerText
Else
ws.Cells(x + 4, y + 1).Value = .Rows(x).Cells(y).innerText
End If
Next y
Next x
End With
End Sub
Public Function GetNameAbbr(ByVal url As String)
Dim tempArr() As String
tempArr = Split(url, "/")
GetNameAbbr = Left$(tempArr(UBound(tempArr)), 6)
End Function
Public Function AddPlayerSheet(ByVal playerName As String) As Worksheet
Dim ws As Worksheet
If SheetExists(playerName) Then
Application.DisplayAlerts = False
ThisWorkbook.Worksheets(playerName).Delete
Application.DisplayAlerts = True
End If
Set ws = ThisWorkbook.Worksheets.Add
ws.Name = playerName
Set AddPlayerSheet = ws
End Function
Public Function SheetExists(ByVal playerName As String) As Boolean
SheetExists = Evaluate("ISREF('" & playerName & "'!A1)")
End Function
Public Sub FixTable(ByVal ws As Worksheet)
Dim found As Range, numSummaryRows As Long
With ws
Set found = .Columns("A").Find("Career")
If found Is Nothing Then Exit Sub
numSummaryRows = .Cells(.Rows.Count, "A").End(xlUp).Row - found.Row
numSummaryRows = IIf(numSummaryRows = 0, 1, numSummaryRows + 1)
Debug.Print found.Offset(, 1).Resize(numSummaryRows, 30).Address, ws.Name
found.Offset(, 1).Resize(numSummaryRows, 30).Copy found.Offset(, 2)
found.Offset(, 1).Resize(numSummaryRows, 1).ClearContents
End With
End Subm html As New HTMLDocument, links(), link As Long, wsSourceSheet As Worksheet
Dim hTable As HTMLTable, ws As Worksheet, playerName As String
Set wsSourceSheet = ThisWorkbook.Worksheets("PlayerList")
Application.ScreenUpdating = False
With wsSourceSheet
links = .Range("C2:C" & .Cells(.Rows.Count, "C").End(xlUp).Row).Value
End With
For link = LBound(links, 1) To UBound(links, 1)
Is there a reason you need to do this with VBA? Excel is quite capable of importing well-organized data such as the [several] tables on that page.
Under the Data tab, click From Web and then enter the Website URL.
Click images to enlarge
Next you will choose the table(s) that you want. Don't go nuts - only get what you need, but you can choose more than one tables by enabling the checkbox.
It can take a few long minutes to parse and organize all the data on the page...
Once you're back at the worksheet you'll see the queries on the right side. Right-click a query and choose Load To..., then choose Table and a location for the table data. There are a ton of other properties that you can customize; there are tutorials describing what you can do.
More things to customize are hidden in two ribbon tabs that only appear when you click on a table, Design and Query.
I think there's also a way to just create a list of players and then to use the Advanced option when entering the URL to allow you to dynamically choose any player you want, while only adding the tables once... but I've never quite figured that part out yet.
I'm not a sports fan, but I assume the data will be changing throughout the season, and an advantage of using tables like this is that once you set up your worksheet how you want it, there are settings you can choose to auto-update every time you open the workbook, or on schedule, or manually, or never; whatever is appropriate.
Google "Excel web query" to find out more about the plethora of options available to you when using queries (aka: "Get & Transform") to extract and organize your data.
Perhaps this could be an alternative to consider instead of coding functionality that's already built-in to Excel.
Good luck, and "Go Sports!"
Yes there is a reason for doing this with VBA. In fact at least five.....
You don't manually have to set up it up for all the links, which if you have a very long list means you would end up having to turn to automation anyway;
On a related theme, powerquery has limitations on how many connections it can support and with NFL player lists you can easily go way beyond what is supported and end up, even when at the max number of connections allowed, with a workbook that crashes or grinds to a halt (I have been there!);
Both tables are not always present so the below has error handling to deal with that;
You get your player named sheets as before, and again error handling for if sheet already present;
Not all versions of powerquery have the nice interface which will allow you to select all the tables individually for these pages. My version of Excel 2016 basically offers only to select the entire page. In that case you have more data than you need and a slowed down process.
Whilst there may be ways to handle this with inbuilt tools, I love me a bit of powerquery, it is no longer "out of the box", but requires knowing how to code in M to some extent and/or reverting to using some VBA anyway.
If you tie this to a button on a sheet you can easily press to refresh when you want, link it to a workbook_open event to refresh on opening, even have windows scheduler open the workbook and refresh at certain times (just so you know VBA still got your back! Though maybe with a little help from my friends ♫ aka Windows).
It seems XHR is just a little too fast for the lower tables on each page, but do not despair, you could use Internet Explorer, with a short delay to ensure the Rushing & Receiving table is populated, or, as I have, use Selenium to automate the browser (I have used Chrome but Internet Explorer is possible). Although this is slower than XHR, we can be a little more efficient by running a headless browser instance.
Here you go with VBA which will give you the different tabs as you go and select only those tables required. Based on links in at C2 on sheet1.
Option Explicit
Public Sub GetInfo()
Dim d As New ChromeDriver
Dim html As New HTMLDocument, links(), link As Long, wsSourceSheet As Worksheet, clipboard As Object
Dim hTablePass As HTMLTable, hTableRushReceive As HTMLTable, ws As Worksheet, playerName As String
Set wsSourceSheet = ThisWorkbook.Worksheets("Sheet1") '<change to sheet containing links
Application.ScreenUpdating = False
With wsSourceSheet
If .Cells(.Rows.Count, "C").End(xlUp).Row = 2 Then
ReDim links(1 To 1, 1 To 1): links(1, 1) = .Range("C2")
Else
links = .Range("C2:C" & .Cells(.Rows.Count, "C").End(xlUp).Row).Value
End If
End With
For link = LBound(links, 1) To UBound(links, 1)
If InStr(links(link, 1), "https://") > 0 Then
With d
.AddArgument "--headless"
.get links(link, 1)
html.body.innerHTML = .PageSource
Set hTablePass = html.querySelector("#all_passing #passing")
Set hTableRushReceive = html.querySelector("#all_rushing_and_receiving #rushing_and_receiving")
playerName = GetNameAbbr(links(link, 1))
Set ws = AddPlayerSheet(playerName)
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
If Not hTablePass Is Nothing Then
clipboard.SetText Replace$(Replace$(hTablePass.outerHTML, "--></DIV>", vbNullString), "<!--", vbNullString)
clipboard.PutInClipboard
ws.Cells(GetLastRow(ws, 1), 1).PasteSpecial
End If
If Not hTableRushReceive Is Nothing Then
clipboard.SetText hTableRushReceive.outerHTML
clipboard.PutInClipboard
ws.Cells(GetLastRow(ws, 1) + 2, 1).PasteSpecial
End If
End With
End If
Next
Application.ScreenUpdating = True
End Sub
Public Function GetNameAbbr(ByVal url As String) As String
Dim tempArr() As String
tempArr = Split(url, "/")
GetNameAbbr = Left$(tempArr(UBound(tempArr)), 6)
End Function
Public Function AddPlayerSheet(ByVal playerName As String) As Worksheet
Dim ws As Worksheet
If SheetExists(playerName) Then
Application.DisplayAlerts = False
ThisWorkbook.Worksheets(playerName).Delete
Application.DisplayAlerts = True
End If
Set ws = ThisWorkbook.Worksheets.Add
ws.Name = playerName
Set AddPlayerSheet = ws
End Function
Public Function SheetExists(ByVal playerName As String) As Boolean '<== *#Rory
SheetExists = Evaluate("ISREF('" & playerName & "'!A1)")
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
End With
End Function
References:
Microsoft HTML Object Library
Selenium Type Library
Selenium basic download:
https://github.com/florentbr/SeleniumBasic
*Function adapted from #Rory

Resources