Inserting shapes gets progressively slower - excel

I make entomological specimen labels that come with an embedded QR code. Museum curators can scan the QR codes of a series of specimens in the same group and manipulate data.
The QR code images are inserted as "shapes" (I believe--they respond to shape commands in the macro), generated via VBA code by Jiri Gabriel, with editing by Jonas Heidelberg (https://github.com/JonasHeidelberg/barcode-vba-macro-only).
The macro takes data, populates cells with strings and values (i.e., what gets printed on the human-readable part of the individual labels). When all of the printed text is inserted, the macro iteratively generates one QR code image at a time and places each generated image next to the corresponding human-readable label.
The macro is quick to generate and insert the first few QR code images then gets progressively slower. I presume because Excel is not built to handle a large number of high-resolution images on the same spreadsheet. My sheet design accommodates 220 individual QR code images, but it takes nearly 10 minutes to populate the spreadsheet with 50 QR code images (it takes less than 30 seconds to populate 10 QR code images, so the slowdown is appreciable).
I have tried:
Disable screen updating - does not seem to improve the processing speed
Set calculation to manual - does not seem to improve the processing speed
After generating each QR code image, hide the image by using the following code, and then at the very end, turn all the images visible - seems to help a little bit but not nearly sufficient to make the macro usable at scale.
ActiveSheet.Shapes.SelectAll
Selection.ShapeRange.Visible = False
I looked for solutions to see if the QR code image shapes can be merged into one shape, because after all, wouldn't it be easier to manage a single shape than 200+ individual small shapes? There seems to be no functionality to combine all of the shapes into a single shape.
Another solution I thought about is simultaneously generating all of the QR codes, instead of iteratively, then perhaps it won't have the issue of the later-coming shapes being slow to render due to having to hold all of the previously rendered codes in its memory. I haven't found a way to write the code such that all QR code image shapes are generated in parallel, rather than in sequence.
Another solution I toyed with is to paste the shapes as PNG or some other image that could potentially be easier to deal with, but I get a lot of loss of quality, which seems strange because the QR code should be just a matrix of black and white cells, right? Why do they lose so much quality?

I would suggest an approach based on built in MS Word 2013+ feature (https://support.microsoft.com/en-us/office/field-codes-displaybarcode-6d81eade-762d-4b44-ae81-f9d3d9e07be3). Below is an example of generating 200 QR codes in 10.6 seconds:
Option Explicit
Sub MakeQRcodes()
Const QR_COUNT = 200
Dim fld As Field, tbl As Table, rng As Range
Dim Code As String, i As Integer, t As Single
t = Timer
ThisDocument.Range.Delete
Set tbl = ThisDocument.Tables.Add(Range:=Selection.Range, NumRows:=QR_COUNT, NumColumns:= _
2, DefaultTableBehavior:=wdWord9TableBehavior, AutoFitBehavior:= _
wdAutoFitFixed)
For i = 1 To QR_COUNT
Code = "Insect #" & i ' data can be obtained from Excel spreadsheet
tbl.Cell(i, 1).Range.Text = "QR code for [" & Code & "]:"
Set rng = tbl.Cell(i, 2).Range
rng.Collapse Direction:=wdCollapseStart
Set fld = ThisDocument.Fields.Add(Range:=rng, Type:=wdFieldEmpty, _
Text:="DisplayBarcode """ & Code & """ QR \q 3")
Next i
ThisDocument.ActiveWindow.View.ShowFieldCodes = False
With tbl.Range
.ParagraphFormat.Alignment = wdAlignParagraphCenter
.Cells.VerticalAlignment = wdCellAlignVerticalCenter
End With
Debug.Print "Done " & QR_COUNT & " items in " & Timer - t & " seconds"
End Sub
' Done 200 items in 10,62109 seconds
Result:
Edit2 (VBA Excel code)
Please note that in my experience the DisplayBarcode field works well only with the Latin alphabet. If you have other symbols, check the code on real lines.
Option Explicit
Sub makeQRs()
Dim arr
arr = ThisWorkbook.Sheets("Sheet1").ListObjects("Table1").DataBodyRange.Columns(3)
Call MakeQRcodes(arr)
End Sub
Sub MakeQRcodes(arr) ' arr(n,1)
'you need to add a reference to the "Microsoft Word Object Library" in the Tools-References VBE menu
Dim wd As New Word.Application, doc As Word.Document, fld As Word.Field, tbl As Word.Table, rng As Word.Range
Dim Code As String, i As Integer, QR_count As Integer, t As Single
QR_count = UBound(arr, 1)
t = Timer
wd.Visible = False ' hide the Word app
Set doc = wd.Documents.Add ' create a new Word document
Set tbl = doc.Tables.Add(Range:=doc.Range, NumRows:=QR_count, _
NumColumns:=2, DefaultTableBehavior:=1) 'wdWord9TableBehavior = 1
For i = 1 To QR_count
Code = arr(i, 1)
tbl.Cell(i, 1).Range.Text = "QR code for [" & Code & "]:"
Set rng = tbl.Cell(i, 2).Range
rng.Collapse Direction:=wdCollapseStart
Set fld = doc.Fields.Add(Range:=rng, Type:=-1, _
Text:="DisplayBarcode """ & Code & """ QR \q 3")
Next i
doc.ActiveWindow.View.ShowFieldCodes = False
With tbl.Range ' center text and QR-code in the table cells
.ParagraphFormat.Alignment = 1 'wdAlignParagraphCenter
.Cells.VerticalAlignment = 1 'wdCellAlignVerticalCenter
End With
Application.DisplayAlerts = 0 'wdAlertsNone
With doc
' save the Word doc as .pdf in the same folder as this Excel workbook
.SaveAs2 ThisWorkbook.Path & "\QR.pdf", 17 'wdFormatPDF
.Close False ' close Word document without saving
wd.Quit ' close Word app
End With
Application.DisplayAlerts = -1 'wdAlertsAll
MsgBox "Done " & QR_count & " QR-codes in " & Round(Timer - t, 1) & " seconds," & vbLf _
& "saved in " & ThisWorkbook.Path & "\QR.pdf"
End Sub
Data & result MsgBox
QR.pdf

Related

Insert graphs from Excel to Word

Hi this is a follow up question from my previous question: Insert text to Word from Excel looping through drop down list
In addition to inserting text, I need to insert graphs for every region and copy the graph under each text. The graphs changes with the data for every region and is located next to the data table.
So the result has to look like this:
Text 1
Graph 1
Text 2
Graph 2 etc.
The code that inserts text (see from the previous question):
Sub Export()
Dim reg As Variant, col As String, txt As String
With ThisWorkbook.Sheets("Sheet 1")
For Each reg In Array("Region1", "Region2", "Region3")
.Range("B3") = reg
.Calculate
col = IIf(.Range("D2").Value = 14, "C", "D") 'select column due to D2 value
' collect all texts in txt
txt = txt & vbTab & "For " & reg & ", on June, 21 the estimate was " & _
.Range(col & "6").Text & " and the volume was " & .Range(col & "7").Text & _
" and the variance was " & .Range(col & "8").Text & vbLf
Next
End With
With CreateObject("Word.Application").Documents.Add
.Range.Text = txt ' output all text to the document
.SaveAs "C:\temp\AllTheText.docx" ' your path and name
.Parent.Quit 'quit Word
End With
End Sub
Sorry this isn't a bonafide answer, I don't have the reputation to ask a follow up question...
Have you considered copying the chart from Excel into Word, then issuing a Refresh command on the chart in Word once data is updated in Excel? You could then create a "AlltheTextTemplate" Word file with the charts preset. You'd then just create a copy of the template when creating a new report, fill it in with the data, and create a small bit of code to refresh the chart object.
Here's a few lines you could add:
Set wordapp = CreateObject("Word.Application")
' Path listed here assumes the template is stored in the same DIR as the workbook
Set objDoc = wordapp.documents.Open(Application.ThisWorkbook.Path & "\Template.docx")
wordapp.Visible = True 'Remove if you don't need to see it
With ObjDoc
.Range.Text = txt ' output all text to the document
.InlineShapes(1).LinkFormat.Update 'update the shape ID as needed
.InlineShapes(1).LinkFormat.BreakLink 'If you want to break the data link
.SaveAs "C:\temp\AllTheText.docx" ' your path and name
.Parent.Quit 'quit Word
End With

Copy/Paste Excel cells to Word in loop not working/saving properly

Admittedly new to VBA. I have an excel file that consists of all our part numbers broken down by "-" marks that I wrote code to break down into more descriptive phrases for making labels. I am trying to write code here to loop through different types of part numbers, grab particular cells in that part #'s row and copy/pasting them into a word document and saving the word doc as the part #'s name. As is, it loops but grabs all the info from the different ranges instead of just the info from the same row as part.
The code works (besides saving) if I change the ranges to 1 single cell, but once I have multiple cells in the ranges, it begins copying everything in the range instead of just in the row of the part that it should be looping.
Sub exceltoword2()
Dim part As Range
Dim funct As Range
Dim finish As Range
Dim lever As Range
Dim backset As Range
Dim trim As Range
Set part = Range("A2:A5")
Set funct = Range("Q2:Q5")
Set finish = Range("R2:R5")
Set lever = Range("S2:S5")
Set backset = Range("T2:T5")
Set trim = Range("U2:U5")
Dim wdapp As Word.Application
Set wdapp = New Word.Application
Dim SaveName As String
Dim path As String
path = "C:\Users\bpickett\Desktop\Parts\"
For Each part In part 'Long list of part #'s that will be looped through with particular variables commented out as needed as I adjust range on part variable
With wdapp
.Visible = True
.Documents.Add
.Activate
part.Copy '********************************Part copied
.Selection.PasteSpecial
With .Selection '**********************Function copied
.Font.Name = "Calibri"
.Font.Size = 22
.TypeText "FUNCTION " '7 spaces
End With
funct.Copy
.Selection.PasteSpecial
With .Selection '**************************Finish
.Font.Name = "Calibri"
.Font.Size = 22
.TypeText "FINISH " '14 spaces
End With
finish.Copy
.Selection.PasteSpecial
With .Selection '***************************Backset
.Font.Name = "Calibri"
.Font.Size = 22
.TypeText "BACKSET " '10 spaces
End With
backset.Copy
.Selection.PasteSpecial
ActiveDocument.SaveAs2 path & part & ".docx"
End With
Next
End Sub
The code when ran has the 1st part # correct then, just copies the entire range of Backset/Function/Finish under each instead of just the single cell in the row of the part #.
Took advice from comments above and did a days more worth of research and made some changes that solved all problems.
For i = 1 to 1275
Set part = Range("A" & i)
Set funct = Range("Q" & i)
Set finish = Range("R" & i)
Set lever = Range("S" & i)
Set backset = Range("T" & i)
Set trim = Range("U" & i)
Qualifying the ranges(or at least to me it did) and adding the i array instead of for each part in part was huge for code to loop and grab only necessary information. But when running, it was crashing with error 4605 a lot. But a 7 year old question that was answered on here surrounded the copy/paste commands with labels and error handler
Pg1CopyAttempt:
DoEvents
part.Copy
On Error GoTo Pg1PasteFail
.selection.pastespecial
On Error goto 0 'disable the error handler
Pg1PasteFail:
If Err.Number = 4605 Then ' clipboard is empty or not valid.
DoEvents
Resume Pg1CopyAttempt
End If
Which worked FLAWLESSLY to go through strings of 100's or 1000's of loops(Files created). Just had to modify Pg1 to 2 to 3 and ect through code.

How to speed up a data file import in Excel VBA

Update as of June 11, 2019: I still haven’t figured out why practically all of my delay happens in those two lines, but current status is that I put up with the delay. So far, I have about 6000 rows of data in the master document, and an import process takes about 20 seconds regardless of how many rows I import.
—
I have a "master document" and I import data from lots and lots of little documents all day long. I admit I'm not a super-genius here, and a lot of my coding habits come from doing it "old school" so there may be "Excel ways" that I don't know (but want to learn!).
The issue I'm seeing is how much time a data file import can take.
When I started the tool out, data imports took only a few seconds.
Now that I have about 3500 rows of data, data imports take about 15-20 seconds. It doesn't matter if I am importing one row or a hundred rows. I expect this to keep going up. By the time I get to 7000 rows or 10,000 rows, I expect it to become unbearable.
By using message boxes (remember: "old school"), I've been able to narrow the speed bottleneck down to two lines of code. Between "Step 1" and "Step 2" is about 30% of my delay, and between "Step 2" and "Step 3" is about 70% of my delay.
I've included the whole sub below to make sure I'm not missing something obvious, but I made sure to UNINDENT my message boxes so you can go r-i-g-h-t to the code I suspect. Also, I included the entire sub because usually one of the first responses is “can you show the whole sub so I have better context?”
Thank you kindly for any thoughts or suggestions you might have. :)
Private Sub Btn_ImportDataFiles_Click()
' Search the current worksheet and assign the next TransactionID
Dim TransactionCounter As Integer
Dim TransactionID As Long ' This is the next available Transaction ID
TransactionID = Application.WorksheetFunction.Max(Range("a:a")) + 1
' open the file and import the data
Dim customerBook As Workbook
Dim filter As String
Dim caption As String
Dim customerFilename As String
Dim customerWorkbook As Workbook
Dim targetWorkbook As Workbook
' make weak assumption that active workbook is the target
Set targetWorkbook = Application.ActiveWorkbook
' get the customer workbook
filter = "Text files (*.xlsx),*.xlsx"
caption = "Please Select an input file "
customerFilename = Application.GetOpenFilename(filter, , caption)
If customerFilename <> "False" Then
' If they have uploaded the file before, let them know.
' If they want to keep uploading it, no harm done,
' but no need to stupidly add data that is already present.
' Select the archive sheet
Sheets("Upload_Archive").Select
Dim FileNameHunt As String
Dim cell As Range
Dim ContinueUpload As Boolean
ContinueUpload = True
FileNameHunt = Mid(customerFilename, InStrRev(customerFilename, "\") + 1)
Columns("A:A").Select
Set cell = Selection.Find(what:=FileNameHunt, after:=ActiveCell, LookIn:=xlFormulas, lookat:=xlWhole, searchorder:=xlByRows, searchdirection:=xlNext, MatchCase:=False, searchformat:=False)
If cell Is Nothing Then ' Add the new filename to the archive
Sheets("Upload_Archive").Select
Rows(1).Insert shift:=xlDown
Range("a1:a1").Value = FileNameHunt
Sheets("MasterSheet").Select
Application.Cells.Font.Name = "Calibri Light"
Application.Cells.Font.Size = "8"
Application.Cells.Font.Bold = False
Else
response = MsgBox("This data file has previously been uploaded. " & vbCrLf & "Do you want to cancel this upload?" & vbCrLf & vbCrLf & "Pressing [yes] will cancel the process." & vbCrLf & "Pressing [no] will continue with the file upload" & vbCrLf & "and add the data to the tracking sheet.", vbYesNo)
If response = vbYes Then
ContinueUpload = False
Sheets("MasterSheet").Select
Exit Sub
End If
End If ' If cell Is Nothing Then...
If ContinueUpload = True Then
' Continue with data upload procedure
Sheets("MasterSheet").Select
Set customerWorkbook = Application.Workbooks.Open(customerFilename)
' Copy data from customer to target workbook
Dim targetSheet As Worksheet
Set targetSheet = targetWorkbook.Worksheets(1)
Dim sourceSheet As Worksheet
Set sourceSheet = customerWorkbook.Worksheets(1)
Dim ImportRecordCount As Integer
ImportRecordCount = sourceSheet.Range("B1")
Dim ReconciliationID As String
ReconciliationID = ""
If sourceSheet.Range("E3") = "Removed from Depot" Then ReconciliationID = "1"
MsgBox ("Step 1")
targetSheet.Range("A1").EntireRow.Offset(1).Resize(ImportRecordCount).Insert shift:=xlDown ' Add the blank rows
MsgBox ("Step 2")
targetSheet.Range("B2:AB" & ImportRecordCount + 1).Value = sourceSheet.Range("A3:AA" & ImportRecordCount + 2).Value ' Bring in the big pile of data
MsgBox ("Step 3")
targetSheet.Range("AJ2:AJ" & ImportRecordCount + 1).Value = ReconciliationID ' To help with reconciling shipments
targetSheet.Range("AK2:AK" & ImportRecordCount + 1).Value = ReconciliationID ' To help with deployment timing
'targetSheet.Range("AI2:AI" & ImportRecordCount + 1).Value = "=COUNTIFS($D:$D, D2, $F:$F, F2)" ' This is the helper formula for identifying duplicates (deprecated, but I'm saving the code)
For TransactionCounter = 2 To ImportRecordCount + 1 ' Create and add the new Transaction ID values
targetSheet.Range("a" & TransactionCounter) = TransactionID + ImportRecordCount - TransactionCounter + 1
Next
' Close customer workbook
customerWorkbook.Close
' Format the sheet properly
Application.Cells.Font.Name = "Calibri Light"
Application.Cells.Font.Size = "8"
Application.Cells.Font.Bold = False
Application.Range("1:1").Font.Size = "10"
Application.Range("1:1").Font.Bold = True
' Query the User -- delete the file?
If MsgBox("Delete the local client-generated data file?" & vbCrLf & vbCrLf & "(this will NOT affect your email)", vbYesNo, "Confirm") = vbYes Then
Kill customerFilename
' MsgBox ("File: " & vbCrLf & customerFilename & vbCrLf & "has been deleted.")
End If
End If ' If ContinueUpload = True Then
End If ' If customerFilename <> "False" Then
End Sub
edit
I edited your original question to highlight things I found as suspect. These are things I felt are worth pointing out to you. I shaved everything else out as to focus on these particular issue. Review them and do soem research to see if you can find yourself in a better situation.
MsgBox ("Step 2")
'Ive never moved large amounts of data using this method. Ive always just used arrays. I have moved smaller bits of data though.
' I suspect that this might take a moment if the data set is large. Again use arrays to grab the data and move it.
' Edward says “This step takes about 70% of my delay — even if bringing in only a single line of data.”
targetSheet.Range("B2:AB" & ImportRecordCount + 1).Value = sourceSheet.Range("A3:AA" & ImportRecordCount + 2).Value ' Bring in the big pile of data
MsgBox ("Step 3")
' this loop is probably your main culprit of your performance issue.
' Edward says “Nope, this flies by. It is not the issue at all. I have verified this already.”
' Learn how to construct an array of data on the fly and then learn how to dump the entire array to
' sheet using a simple method.
For TransactionCounter = 2 To ImportRecordCount + 1 ' Create and add the new Transaction ID values
targetSheet.Range("a" & TransactionCounter) = TransactionID + ImportRecordCount - TransactionCounter + 1
Next
It looks like you have a lot of good things going here. A few things that I saw that could potentially be changed to improve your performance.
First, between "Step 1" and "Step 2": In my experience, adding rows takes longer than using rows that already exist. It looks like you are basically "pushing" everything down to make room for the new data, such that the newly entered data is at the top and the oldest data is at the bottom. (Correct me if I am wrong on any of this.) If you were to simply add the data to the end of the sheet, you would probably see some performance improvements, although I don't know how big of an improvement it would be.
Second, between "Step 2" and "Step 3": I have found that using .Value2 as opposed to .Value can give you some performance improvements, and the larger the data the bigger the improvement. This has a down side - Value2 does not retain any of the formatting that might be present, meaning that the number type (date, accounting, etc) does not pull over correctly. If this is something that you do not need, then you can use Value2.
Finally, other methods: When I run extensive macros, I always try to do everything I can to get a performance boost. You can get slight boosts across the board by using tricks like turning off screen updating (Application.ScreenUpdating = False), just be sure to turn it back on at the end of the macro.
I hope that this helps you figure it out! If all else fails, you can do it once or twice by hand to remember how much faster it is using the macro! Haha. Good Luck!
Have you tried using .value2? In some scenarios it might bring you better performance. Check some performance comparatives here: https://fastexcel.wordpress.com/2011/11/30/text-vs-value-vs-value2-slow-text-and-how-to-avoid-it/
It's difficult seeing where's the issue without having access to the original sheets. Maybe the issue is with the data itself instead of your VBA code and sometimes you might need to clean your source data of the heavy stuff and then add it again if needed.
You could also look into doing some parts with Python but I guess that's out of the question if you don't want to add additional software layers to your solution.
Try adding this at the beginning and end of your script. Just be sure to set everything back to TRUE!!
Application.ScreenUpdating = False
Application.DisplayAlerts = False
...CODE HERE...
Application.ScreenUpdating = True
Application.DisplayAlerts = True

Extract Data from PDF and Add to Worksheet

I am trying to extract the data from a PDF document into a worksheet. The PDFs show and text can be manually copied and pasted into the Excel document.
I am currently doing this through SendKeys and it is not working. I get an error when I try to paste the data from the PDF document. Why is my paste not working? If I paste after the macro has stopped running it pastes as normal.
Dim myPath As String, myExt As String
Dim ws As Worksheet
Dim openPDF As Object
'Dim pasteData As MSForms.DataObject
Dim fCell As Range
'Set pasteData = New MSForms.DataObject
Set ws = Sheets("DATA")
If ws.Cells(ws.Rows.Count, "A").End(xlUp).Row > 1 Then Range("A3:A" & ws.Cells(ws.Rows.Count, "A").End(xlUp).Row).ClearContents
myExt = "\*.pdf"
'When Scan Receipts Button Pressed Scan the selected folder/s for receipts
For Each fCell In Range(ws.Cells(1, 1), ws.Cells(1, ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column))
myPath = Dir(fCell.Value & myExt)
Do While myPath <> ""
myPath = fCell.Value & "\" & myPath
Set openPDF = CreateObject("Shell.Application")
openPDF.Open (myPath)
Application.Wait Now + TimeValue("00:00:2")
SendKeys "^a"
Application.Wait Now + TimeValue("00:00:2")
SendKeys "^c"
'Application.Wait Now + TimeValue("00:00:2")
ws.Select
ActiveSheet.Paste
'pasteData.GetFromClipboard
'ws.Cells(3, 1) = pasteData.GetText
Exit Sub
myPath = Dir
Loop
Next fCell
You can open the PDF file and extract its contents using the Adobe library (which I believe you can download from Adobe as part of the SDK, but it comes with certain versions of Acrobat as well)
Make sure to add the Library to your references too (On my machine it is the Adobe Acrobat 10.0 Type Library, but not sure if that is the newest version)
Even with the Adobe library it is not trivial (you'll need to add your own error-trapping etc):
Function getTextFromPDF(ByVal strFilename As String) As String
Dim objAVDoc As New AcroAVDoc
Dim objPDDoc As New AcroPDDoc
Dim objPage As AcroPDPage
Dim objSelection As AcroPDTextSelect
Dim objHighlight As AcroHiliteList
Dim pageNum As Long
Dim strText As String
strText = ""
If (objAvDoc.Open(strFilename, "") Then
Set objPDDoc = objAVDoc.GetPDDoc
For pageNum = 0 To objPDDoc.GetNumPages() - 1
Set objPage = objPDDoc.AcquirePage(pageNum)
Set objHighlight = New AcroHiliteList
objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page
Set objSelection = objPage.CreatePageHilite(objHighlight)
If Not objSelection Is Nothing Then
For tCount = 0 To objSelection.GetNumText - 1
strText = strText & objSelection.GetText(tCount)
Next tCount
End If
Next pageNum
objAVDoc.Close 1
End If
getTextFromPDF = strText
End Function
What this does is essentially the same thing you are trying to do - only using Adobe's own library. It's going through the PDF one page at a time, highlighting all of the text on the page, then dropping it (one text element at a time) into a string.
Keep in mind what you get from this could be full of all kinds of non-printing characters (line feeds, newlines, etc) that could even end up in the middle of what look like contiguous blocks of text, so you may need additional code to clean it up before you can use it.
Hope that helps!
I know this is an old issue but I just had to do this for a project at work, and I am very surprised that nobody has thought of this solution yet:
Just open the .pdf with Microsoft word.
The code is a lot easier to work with when you are trying to extract data from a .docx because it opens in Microsoft Word. Excel and Word play well together because they are both Microsoft programs. In my case, the file of question had to be a .pdf file. Here's the solution I came up with:
Choose the default program to open .pdf files to be Microsoft Word
The first time you open a .pdf file with word, a dialogue box pops up claiming word will need to convert the .pdf into a .docx file. Click the check box in the bottom left stating "do not show this message again" and then click OK.
Create a macro that extracts data from a .docx file. I used MikeD's Code as a resource for this.
Tinker around with the MoveDown, MoveRight, and Find.Execute methods to fit the need of your task.
Yes you could just convert the .pdf file to a .docx file but this is a much simpler solution in my opinion.
Over time, I have found that extracting text from PDFs in a structured format is tough business. However if you are looking for an easy solution, you might want to consider XPDF tool pdftotext.
Pseudocode to extract the text would include:
Using SHELL VBA statement to extract the text from PDF to a temporary file using XPDF
Using sequential file read statements to read the temporary file contents into a string
Pasting the string into Excel
Simplified example below:
Sub ReadIntoExcel(PDFName As String)
'Convert PDF to text
Shell "C:\Utils\pdftotext.exe -layout " & PDFName & " tempfile.txt"
'Read in the text file and write to Excel
Dim TextLine as String
Dim RowNumber as Integer
Dim F1 as Integer
RowNumber = 1
F1 = Freefile()
Open "tempfile.txt" for Input as #F1
While Not EOF(#F1)
Line Input #F1, TextLine
ThisWorkbook.WorkSheets(1).Cells(RowNumber, 1).Value = TextLine
RowNumber = RowNumber + 1
Wend
Close #F1
End Sub
Since I do not prefer to rely on external libraries and/or other programs, I have extended your solution so that it works.
The actual change here is using the GetFromClipboard function instead of Paste which is mainly used to paste a range of cells.
Of course, the downside is that the user must not change focus or intervene during the whole process.
Dim pathPDF As String, textPDF As String
Dim openPDF As Object
Dim objPDF As MsForms.DataObject
pathPDF = "C:\some\path\data.pdf"
Set openPDF = CreateObject("Shell.Application")
openPDF.Open (pathPDF)
'TIME TO WAIT BEFORE/AFTER COPY AND PASTE SENDKEYS
Application.Wait Now + TimeValue("00:00:2")
SendKeys "^a"
Application.Wait Now + TimeValue("00:00:2")
SendKeys "^c"
Application.Wait Now + TimeValue("00:00:1")
AppActivate ActiveWorkbook.Windows(1).Caption
objPDF.GetFromClipboard
textPDF = objPDF.GetText(1)
MsgBox textPDF
If you're interested see my project in github.
Copying and pasting by user interactions emulation could be not reliable (for example, popup appears and it switches the focus). You may be interested in trying the commercial ByteScout PDF Extractor SDK that is specifically designed to extract data from PDF and it works from VBA. It is also capable of extracting data from invoices and tables as CSV using VB code.
Here is the VBA code for Excel to extract text from given locations and save them into cells in the Sheet1:
Private Sub CommandButton1_Click()
' Create TextExtractor object
' Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
Dim extractor As New Bytescout_PDFExtractor.TextExtractor
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
' Load sample PDF document
extractor.LoadDocumentFromFile ("c:\sample1.pdf")
' Get page count
pageCount = extractor.GetPageCount()
Dim wb As Workbook
Dim ws As Worksheet
Dim TxtRng As Range
Set wb = ActiveWorkbook
Set ws = wb.Sheets("Sheet1")
For i = 0 To pageCount - 1
RectLeft = 10
RectTop = 10
RectWidth = 100
RectHeight = 100
' check the same text is extracted from returned coordinates
extractor.SetExtractionArea RectLeft, RectTop, RectWidth, RectHeight
' extract text from given area
extractedText = extractor.GetTextFromPage(i)
' insert rows
' Rows(1).Insert shift:=xlShiftDown
' write cell value
Set TxtRng = ws.Range("A" & CStr(i + 2))
TxtRng.Value = extractedText
Next
Set extractor = Nothing
End Sub
Disclosure: I am related to ByteScout
Using Bytescout PDF Extractor SDK is a good option. It is cheap and gives plenty of PDF related functionality. One of the answers above points to the dead page Bytescout on GitHub. I am providing a relevant working sample to extract table from PDF. You may use it to export in any format.
Set extractor = CreateObject("Bytescout.PDFExtractor.StructuredExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
' Load sample PDF document
extractor.LoadDocumentFromFile "../../sample3.pdf"
For ipage = 0 To extractor.GetPageCount() - 1
' starting extraction from page #"
extractor.PrepareStructure ipage
rowCount = extractor.GetRowCount(ipage)
For row = 0 To rowCount - 1
columnCount = extractor.GetColumnCount(ipage, row)
For col = 0 To columnCount-1
WScript.Echo "Cell at page #" +CStr(ipage) + ", row=" & CStr(row) & ", column=" & _
CStr(col) & vbCRLF & extractor.GetCellValue(ipage, row, col)
Next
Next
Next
Many more samples available here: https://github.com/bytescout/pdf-extractor-sdk-samples
To improve the solution of Slinky Sloth I had to add this beforere get from clipboard :
Set objPDF = New MSForms.DataObject
Sadly it didn't worked for a pdf of 10 pages.
This doesn't seem to work with the Adobe Type library. As soon as it gets to Open, I get a 429 error. Acrobat works fine though...

Generate Word Documents (in Excel VBA) from a series of Document Templates

Hey all. I'll try to make this brief and simple. :)
I have
40 or so boilerplate word documents with a series of fields (Name, address, etc) that need to be filled in. This is historically done manually, but it's repetitive and cumbersome.
A workbook where a user has filled a huge set of information about an individual.
I need
A way to programatically (from Excel VBA) open up these boilerplate documents, edit in the value of fields from various named ranges in the workbook, and save the filled in templates to a local folder.
If I were using VBA to programatically edit particular values in a set of spreadsheets, I would edit all those spreadsheets to contain a set of named ranges which could be used during the auto-fill process, but I'm not aware of any 'named field' feature in a Word document.
How could I edit the documents, and create a VBA routine, so that I can open each document, look for a set of fields which might need to be filled in, and substitute a value?
For instance, something that works like:
for each document in set_of_templates
if document.FieldExists("Name") then document.Field("Name").value = strName
if document.FieldExists("Address") then document.Field("Name").value = strAddress
...
document.saveAs( thisWorkbook.Path & "\GeneratedDocs\ " & document.Name )
next document
Things I've considered:
Mail merge - but this is insufficient because it requires opening each document manually and structuring the workbook as a data source, I kind of want the opposite. The templates are the data source and the workbook is iterating through them. Also, mail merge is for creating many identical documents using a table of different data. I have many documents all using the same data.
Using placeholder text such as "#NAME#" and opening each document for a search and replace. This is the solution I would resort to if nothing more elegant is proposed.
It's been a long time since I asked this question, and my solution has undergone more and more refinement. I've had to deal with all sorts of special cases, such as values that come directly from the workbook, sections that need to be specially generated based on lists, and the need to do replacements in headers and footers.
As it turns out, it did not suffice to use bookmarks, as it was possible for users to later edit documents to change, add, and remove placeholder values from the documents. The solution was in fact to use keywords such as this:
This is just a page from a sample document which uses some of the possible values that can get automatically inserted into a document. Over 50 documents exist with completely different structures and layouts, and using different parameters. The only common knowledge shared by the word documents and the excel spreadsheet is a knowledge of what these placeholder values are meant to represent. In excel, this is stored in a list of document generation keywords, which contain the keyword, followed by a reference to the range that actually contains this value:
These were the key two ingredients required. Now with some clever code, all I had to do was iterate over each document to be generated, and then iterate over the range of all known keywords, and do a search and replace for each keyword in each document.
First, I have the wrapper method, which takes care of maintaining an instance of microsoft word iterating over all documents selected for generation, numbering the documents, and doing the user interface stuff (like handling errors, displaying the folder to the user, etc.)
' Purpose: Iterates over and generates all documents in the list of forms to generate
' Improves speed by creating a persistant Word application used for all generated documents
Public Sub GeneratePolicy()
Dim oWrd As New Word.Application
Dim srcPath As String
Dim cel As Range
If ERROR_HANDLING Then On Error GoTo errmsg
If Forms.Cells(2, FormsToGenerateCol) = vbNullString Then _
Err.Raise 1, , "There are no forms selected for document generation."
'Get the path of the document repository where the forms will be found.
srcPath = FindConstant("Document Repository")
'Each form generated will be numbered sequentially by calling a static counter function. This resets it.
GetNextEndorsementNumber reset:=True
'Iterate over each form, calling a function to replace the keywords and save a copy to the output folder
For Each cel In Forms.Range(Forms.Cells(2, FormsToGenerateCol), Forms.Cells(1, FormsToGenerateCol).End(xlDown))
RunReplacements cel.value, CreateDocGenPath(cel.Offset(0, 1).value), oWrd
Next cel
oWrd.Quit
On Error Resume Next
'Display the folder containing the generated documents
Call Shell("explorer.exe " & CreateDocGenPath, vbNormalFocus)
oWrd.Quit False
Application.StatusBar = False
If MsgBox("Policy generation complete. The reserving information will now be recorded.", vbOKCancel, _
"Policy Generated. OK to store reserving info?") = vbOK Then Push_Reserving_Requirements
Exit Sub
errmsg:
MsgBox Err.Description, , "Error generating Policy Documents"
End Sub
That routine calls RunReplacements which takes care of opening the document, prepping the environment for a fast replacement, updating links once done, handling errors, etc:
' Purpose: Opens up a document and replaces all instances of special keywords with their respective values.
' Creates an instance of Word if an existing one is not passed as a parameter.
' Saves a document to the target path once the template has been filled in.
'
' Replacements are done using two helper functions, one for doing simple keyword replacements,
' and one for the more complex replacements like conditional statements and schedules.
Private Sub RunReplacements(ByVal DocumentPath As String, ByVal SaveAsPath As String, _
Optional ByRef oWrd As Word.Application = Nothing)
Dim oDoc As Word.Document
Dim oWrdGiven As Boolean
If oWrd Is Nothing Then Set oWrd = New Word.Application Else oWrdGiven = True
If ERROR_HANDLING Then On Error GoTo docGenError
oWrd.Visible = False
oWrd.DisplayAlerts = wdAlertsNone
Application.StatusBar = "Opening " & Mid(DocumentPath, InStrRev(DocumentPath, "\") + 1)
Set oDoc = oWrd.Documents.Open(Filename:=DocumentPath, Visible:=False)
RunAdvancedReplacements oDoc
RunSimpleReplacements oDoc
UpdateLinks oDoc 'Routine which will update calculated statements in Word (like current date)
Application.StatusBar = "Saving " & Mid(DocumentPath, InStrRev(DocumentPath, "\") + 1)
oDoc.SaveAs SaveAsPath
GoTo Finally
docGenError:
MsgBox "Un unknown error occurred while generating document: " & DocumentPath & vbNewLine _
& vbNewLine & Err.Description, vbCritical, "Document Generation"
Finally:
If Not oDoc Is Nothing Then oDoc.Close False: Set oDoc = Nothing
If Not oWrdGiven Then oWrd.Quit False
End Sub
That routine then invokes RunSimpleReplacements. and RunAdvancedReplacements. In the former, we iterate over the set of Document Generation Keywords and call WordDocReplace if the document contains our keyword. Note that it's much faster to try and Find a bunch of words to figure out that they don't exist, then to call replace indiscriminately, so we always check if a keyword exists before attempting to replace it.
' Purpose: While short, this short module does most of the work with the help of the generation keywords
' range on the lists sheet. It loops through every simple keyword that might appear in a document
' and calls a function to have it replaced with the corresponding data from pricing.
Private Sub RunSimpleReplacements(ByRef oDoc As Word.Document)
Dim DocGenKeys As Range, valueSrc As Range
Dim value As String
Dim i As Integer
Set DocGenKeys = Lists.Range("DocumentGenerationKeywords")
For i = 1 To DocGenKeys.Rows.Count
If WordDocContains(oDoc, "#" & DocGenKeys.Cells(i, 1).Text & "#") Then
'Find the text that we will be replacing the placeholder keyword with
Set valueSrc = Range(Mid(DocGenKeys.Cells(i, 2).Formula, 2))
If valueSrc.MergeCells Then value = valueSrc.MergeArea.Cells(1, 1).Text Else value = valueSrc.Text
'Perform the replacement
WordDocReplace oDoc, "#" & DocGenKeys.Cells(i, 1).Text & "#", value
End If
Next i
End Sub
This is the function used to detect whether a keyword exists in the document:
' Purpose: Function called for each replacement to first determine as quickly as possible whether
' the document contains the keyword, and thus whether replacement actions must be taken.
Public Function WordDocContains(ByRef oDoc As Word.Document, ByVal searchFor As String) As Boolean
Application.StatusBar = "Checking for keyword: " & searchFor
WordDocContains = False
Dim storyRange As Word.Range
For Each storyRange In oDoc.StoryRanges
With storyRange.Find
.Text = searchFor
WordDocContains = WordDocContains Or .Execute
End With
If WordDocContains Then Exit For
Next
End Function
And this is where the rubber meets the road - the code that executes the replacement. This routine got more complicated as I encountered difficulties. Here are the lessons you will only learn from experience:
You can set the replacement text directly, or you can use the clipboard. I found out the hard way that if you are doing a VBA replace in word using a string longer than 255 characters, the text will get truncated if you try to place it in the Find.Replacement.Text, but you can use "^c" as your replacement text, and it will get it directly from the clipboard. This was the workaround I got to use.
Simply calling replace will miss keywords in some text areas like headers and footers. Because of this, you actually need to iterate over the document.StoryRanges and run the search and replace on each one to ensure that you catch all instances of the word you want to replace.
If you're setting the Replacement.Text directly, you need to convert Excel line breaks (vbNewLine and Chr(10)) with a simple vbCr for them to appear properly in word. Otherwise, anywhere your replacement text has line breaks coming from an excel cell will end up inserting strange symbols into word. If you use the clipboard method however, you do not need to do this, as the line breaks get converted automatically when put in the clipboard.
That explains everything. Comments should be pretty clear too. Here's the golden routine that executes the magic:
' Purpose: This function actually performs replacements using the Microsoft Word API
Public Sub WordDocReplace(ByRef oDoc As Word.Document, ByVal replaceMe As String, ByVal replaceWith As String)
Dim clipBoard As New MSForms.DataObject
Dim storyRange As Word.Range
Dim tooLong As Boolean
Application.StatusBar = "Replacing instances of keyword: " & replaceMe
'We want to use regular search and replace if we can. It's faster and preserves the formatting that
'the keyword being replaced held (like bold). If the string is longer than 255 chars though, the
'standard replace method doesn't work, and so we must use the clipboard method (^c special character),
'which does not preserve formatting. This is alright for schedules though, which are always plain text.
If Len(replaceWith) > 255 Then tooLong = True
If tooLong Then
clipBoard.SetText IIf(replaceWith = vbNullString, "", replaceWith)
clipBoard.PutInClipboard
Else
'Convert excel in-cell line breaks to word line breaks. (Not necessary if using clipboard)
replaceWith = Replace(replaceWith, vbNewLine, vbCr)
replaceWith = Replace(replaceWith, Chr(10), vbCr)
End If
'Replacement must be done on multiple 'StoryRanges'. Unfortunately, simply calling replace will miss
'keywords in some text areas like headers and footers.
For Each storyRange In oDoc.StoryRanges
Do
With storyRange.Find
.MatchWildcards = True
.Text = replaceMe
.Replacement.Text = IIf(tooLong, "^c", replaceWith)
.Wrap = wdFindContinue
.Execute Replace:=wdReplaceAll
End With
On Error Resume Next
Set storyRange = storyRange.NextStoryRange
On Error GoTo 0
Loop While Not storyRange Is Nothing
Next
If tooLong Then clipBoard.SetText ""
If tooLong Then clipBoard.PutInClipboard
End Sub
When the dust settles, we're left with a beautiful version of the initial document with production values in place of those hash marked keywords. I'd love to show an example, but of course every filled in document contain all-proprietary information.
The only think left to mention I guess would be that RunAdvancedReplacements section. It does something extremely similar - it ends up calling the same WordDocReplace function, but what's special about the keywords used here is that they don't link to a single cell in the original workbook, they get generated in the code-behind from lists in the workbook. So for instance, one of the advanced replacements would look like this:
'Generate the schedule of vessels
If WordDocContains(oDoc, "#VESSELSCHEDULE#") Then _
WordDocReplace oDoc, "#VESSELSCHEDULE#", GenerateVesselSchedule()
And then there will be a corresponding routine which puts together a string containing all the vessel information as configured by the user:
' Purpose: Generates the list of vessels from the "Vessels" sheet based on the user's configuration
' in the booking tab. The user has the option to generate one or both of Owned Vessels
' and Chartered Vessels, as well as what fields to display. Uses a helper function.
Public Function GenerateVesselSchedule() As String
Dim value As String
Application.StatusBar = "Generating Schedule of Vessels."
If Booking.Range("ListVessels").value = "Yes" Then
Dim VesselCount As Long
If Booking.Range("ListVessels").Offset(1).value = "Yes" Then _
value = value & GenerateVesselScheduleHelper("Vessels", VesselCount)
If Booking.Range("ListVessels").Offset(1).value = "Yes" And _
Booking.Range("ListVessels").Offset(2).value = "Yes" Then _
value = value & "(Chartered Vessels)" & vbNewLine
If Booking.Range("ListVessels").Offset(2).value = "Yes" Then _
value = value & GenerateVesselScheduleHelper("CharteredVessels", VesselCount)
If Len(value) > 2 Then value = Left(value, Len(value) - 2) 'Remove the trailing line break
Else
GenerateVesselSchedule = Booking.Range("VesselSchedAlternateText").Text
End If
GenerateVesselSchedule = value
End Function
' Purpose: Helper function for the Vessel Schedule generation routine. Generates either the Owned or
' Chartered vessels based on the schedule parameter passed. The list is numbered and contains
' the information selected by the user on the Booking sheet.
' SENSITIVE: Note that this routine is sensitive to the layout of the Vessel Schedule tab and the
' parameters on the Configure Quotes tab. If either changes, it should be revisited.
Public Function GenerateVesselScheduleHelper(ByVal schedule As String, ByRef VesselCount As Long) As String
Dim value As String, nextline As String
Dim numInfo As Long, iRow As Long, iCol As Long
Dim Inclusions() As Boolean, Columns() As Long
'Gather info about vessel info to display in the schedule
With Booking.Range("VesselInfoToInclude")
numInfo = Booking.Range(.Cells(1, 1), .End(xlToRight)).Columns.Count - 1
ReDim Inclusions(1 To numInfo)
ReDim Columns(1 To numInfo)
On Error Resume Next 'Some columns won't be identified
For iCol = 1 To numInfo
Inclusions(iCol) = .Offset(0, iCol) = "Yes"
Columns(iCol) = sumSchedVessels.Range(schedule).Cells(1).EntireRow.Find(.Offset(-1, iCol)).Column
Next iCol
On Error GoTo 0
End With
'Build the schedule
With sumSchedVessels.Range(schedule)
For iRow = .row + 1 To .row + .Rows.Count - 1
If Len(sumSchedVessels.Cells(iRow, Columns(1)).value) > 0 Then
VesselCount = VesselCount + 1
value = value & VesselCount & "." & vbTab
nextline = vbNullString
'Add each property that was included to the description string
If Inclusions(1) Then nextline = nextline & sumSchedVessels.Cells(iRow, Columns(1)) & vbTab
If Inclusions(2) Then nextline = nextline & "Built: " & sumSchedVessels.Cells(iRow, Columns(2)) & vbTab
If Inclusions(3) Then nextline = nextline & "Length: " & _
Format(sumSchedVessels.Cells(iRow, Columns(3)), "#'") & vbTab
If Inclusions(4) Then nextline = nextline & "" & sumSchedVessels.Cells(iRow, Columns(4)) & vbTab
If Inclusions(5) Then nextline = nextline & "Hull Value: " & _
Format(sumSchedVessels.Cells(iRow, Columns(5)), "$#,##0") & vbTab
If Inclusions(6) Then nextline = nextline & "IV: " & _
Format(sumSchedVessels.Cells(iRow, Columns(6)), "$#,##0") & vbTab
If Inclusions(7) Then nextline = nextline & "TIV: " & _
Format(sumSchedVessels.Cells(iRow, Columns(7)), "$#,##0") & vbTab
If Inclusions(8) And schedule = "CharteredVessels" Then _
nextline = nextline & "Deductible: " & Format(bmCharterers.Range(schedule).Cells( _
iRow - .row, 9), "$#,##0") & vbTab
nextline = Left(nextline, Len(nextline) - 1) 'Remove the trailing tab
'If more than 4 properties were included insert a new line after the 4th one
Dim tabloc As Long: tabloc = 0
Dim counter As Long: counter = 0
Do
tabloc = tabloc + 1
tabloc = InStr(tabloc, nextline, vbTab)
If tabloc > 0 Then counter = counter + 1
Loop While tabloc > 0 And counter < 4
If counter = 4 Then nextline = Left(nextline, tabloc - 1) & vbNewLine & Mid(nextline, tabloc)
value = value & nextline & vbNewLine
End If
Next iRow
End With
GenerateVesselScheduleHelper = value
End Function
the resulting string can be used just like the contents of any excel cell, and passed to the replacement function, which will appropriately use the clipboard method if it exceeds 255 characters.
So this template:
Plus this spreadsheet data:
Becomes this document:
I sincerely hope that this helps someone out some day. It was definitely a huge undertaking and a complex wheel to have to re-invent. The application is huge, with over 50,000 lines of VBA code, so if I've referenced a crucial method in my code somewhere that someone needs, please leave a comment and I'll add it in here.
http://www.computorcompanion.com/LPMArticle.asp?ID=224 Describes the use of Word bookmarks
A section of text in a document can be bookmarked, and given a variable name. Using VBA, this variable can be accessed and the content in the document can be replaced with alternate content. This is a solution to having placeholders such as Name and Address in the document.
Furthermore, using bookmarks, documents can be modified to reference bookmarked text. If a name appears several times throughout a document, the first instance can be bookmarked, and additional instances can reference the bookmark. Now when the first instance is programatically changed, all other instances of the variable throughout the document are also automatically changed.
Now all that's needed is to update all the documents by bookmarking the placeholder text and using a consistent naming convention throughout the documents, then iterate through each documents replacing the bookmark if it exists:
document.Bookmarks("myBookmark").Range.Text = "Inserted Text"
I can probably solve the problem of variables that don't appear in a given document using the on error resume next clause before attempting each replacement.
Thanks to Doug Glancy for mentioning the existance of bookmarks in his comment. I had no knowledge of their existence beforehand. I will keep this topic posted on whether this solution suffices.
You might consider an XML based approach.
Word has a feature called Custom XML data-binding, or data-bound content controls. A content control is essentially a point in the document which can contain content. A "data-bound" content control gets its content from an XML document you include in the docx zip file. An XPath expression is used to say which bit of XML. So all you need to do is include your XML file, and Word will do the rest.
Excel has ways to get data out of it as XML, so the whole solution should work nicely.
There is plenty of information on content control data-binding on MSDN (some of which has been referenced in earlier SO questions) so I won't bother including them here.
But you do need a way of setting up the bindings. You can either use the Content Control Toolkit, or if you want to do it from within Word, my OpenDoPE add-in.
Having done a similar task I found that inserting values into tables was much quicker than searching for named tags - the data can then be inserted like this:
With oDoc.Tables(5)
For i = 0 To Data.InvoiceDictionary.Count - 1
If i > 0 Then
oDoc.Tables(5).rows.Add
End If
Set invoice = Data.InvoiceDictionary.Items(i)
.Cell(i + 2, 1).Range.Text = invoice.InvoiceCCNumber
.Cell(i + 2, 2).Range.Text = invoice.InvoiceDate
.Cell(i + 2, 3).Range.Text = invoice.TransactionType
.Cell(i + 2, 4).Range.Text = invoice.Description
.Cell(i + 2, 5).Range.Text = invoice.SumOfValue
Next i
.Cell(i + 1, 4).Range.Text = "Total:"
End With
in this case row 1 of the table was the headers; row 2 was empty and there were no further rows - thus the rows.add applies once more than one row was attached. The tables can be very detailed documents and by hiding the borders and cell borders can be made to look like ordinary text. Tables are numbered sequentially following the document flow. (i.e. Doc.Tables(1) is the first table...

Resources