Extract tables from PDF (to Excel), preferably with VBA

I am trying to extract tables from PDF files with VBA and export them to Excel. If everything works out, the whole process should run automatically. The problem is that the tables are not standardized.
This is what I have so far.
VBA (Excel) runs XPDF, and converts all .pdf files found in current folder to a text file.
VBA (Excel) reads through each text file line by line.
And the code:
With New Scripting.FileSystemObject
With .OpenTextFile(strFileName, 1, False, 0)
If Not .AtEndOfStream Then .SkipLine
Do Until .AtEndOfStream
strLine = .ReadLine
'do something with strLine
Loop
End With
End With
This all works great. But now I am getting to the issue of extracting the tables from the text files.
What I am trying to do is get VBA to find a string, e.g. "Year's Income", and then output the data that follows it into columns (until the table ends).
The first part (finding a certain string) is not very difficult, but how would I go about the second part? The text file will look like this Pastebin. The problem is that the text is not standardized. For example, some tables have three year columns (2010 2011 2012) and some only two (or one), some tables have more spaces between the columns, and some do not include certain rows (such as Capital Asset, net).
I was thinking about doing something like this but not sure how to go about it in VBA.
Find user defined string. eg. "Table 1: Years' Return."
a. Next line find years; if there are two we will need three columns in output (titles +, 2x year), if there are three we will need four (titles +, 3x year).. etc
b. Create title column + column for each year.
When reaching end of line, go to next line
a. Read text -> output to column 1.
b. Recognize spaces (Are spaces > 3?) as start of column 2. Read numbers -> output to column 2.
c. (if column = 3) Recognize spaces as start of column 3. Read numbers -> output to column 3.
d. (if column = 4) Recognize spaces as start of column 4. Read numbers -> output to column 4.
Each line, loop 4.
Next line does not include any numbers - End table. (Probably the easiest is just a user-defined threshold: after 15 characters no number? End table.)
I based my first version on Pdf to excel, but reading online people do not recommend OpenFile but rather FileSystemObject (even though it seems to be a lot slower).
Any pointers to get me started, mainly on step 2?

You have a number of ways to dissect a text file and depending on how complex it is might cause you to lean one way or another. I started this and it got a bit out of hand... enjoy.
Based on the sample you've provided and the additional comments, I noted the following. Some of these techniques work well for simple files but can get unwieldy with bigger, more complex files. There may also be slightly more efficient methods or tricks than the ones I have used here, but this will definitely get you going and achieve the desired outcome. Hopefully this makes sense in conjunction with the code provided:
You can use booleans to help you determine what 'section' of the text file you are in. I.e. use InStr on the current line to determine you are in a table by looking for the text 'Table', and once you know you are in the 'Table' section of the file, start looking for the 'Assets' section, etc.
You can use a few methods to determine the number of years (or columns) you have. The Split function along with a loop will do the job.
If your files always have constant formatting, even only in certain parts, you can take advantage of this. For example, if you know the numbers on a line will always have a dollar sign in front of them, then the dollar signs define the column positions and you can reuse those positions on subsequent lines of text.
The following code will extract the Assets details from the text file; you can modify it to extract other sections. It should handle multiple rows. Hopefully I've commented it sufficiently. Have a look and I'll edit if needed to help out further.
Sub ReadInTextFile()
Dim fs As Scripting.FileSystemObject, fsFile As Scripting.TextStream
Dim sFileName As String, sLine As String, vYears As Variant
Dim iNoColumns As Integer, ii As Integer, iCount As Integer
Dim bIsTable As Boolean, bIsAssets As Boolean, bIsLiabilities As Boolean, bIsNetAssets As Boolean
Set fs = CreateObject("Scripting.FileSystemObject")
sFileName = "G:\Sample.txt"
Set fsFile = fs.OpenTextFile(sFileName, 1, False)
'Loop through the file as you've already done
Do While fsFile.AtEndOfStream <> True
'Determine flag positions in text file
sLine = fsFile.Readline
Debug.Print VBA.Len(sLine)
'Always skip empty lines (including single spaceS)
If VBA.Len(sLine) > 1 Then
'We've found a new table so we can reset the booleans
If VBA.InStr(1, sLine, "Table") > 0 Then
bIsTable = True
bIsAssets = False
bIsNetAssets = False
bIsLiabilities = False
iNoColumns = 0
End If
'Perhaps you want to also have some sort of way to designate that a table has finished. Like so
If VBA.Instr(1, sLine, "Some text that designates the end of the table") Then
bIsTable = False
End If
'If we're in the table section then we want to read in the data
If bIsTable Then
'Check for your different sections. You could make this constant if your text file allowed it.
If VBA.InStr(1, sLine, "Assets") > 0 And VBA.InStr(1, sLine, "Net") = 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = False
If VBA.InStr(1, sLine, "Liabilities") > 0 Then bIsAssets = False: bIsLiabilities = True: bIsNetAssets = False
If VBA.InStr(1, sLine, "Net Assets") > 0 Then bIsAssets = False: bIsLiabilities = False: bIsNetAssets = True
'If we haven't triggered any of these booleans then we're at the column headings
If Not bIsAssets And Not bIsLiabilities And Not bIsNetAssets And VBA.InStr(1, sLine, "Table") = 0 Then
'Trim the current line to remove leading and trailing spaces then use the split function to determine the number of years
vYears = VBA.Split(VBA.Trim$(sLine), " ")
For ii = LBound(vYears) To UBound(vYears)
If VBA.Len(vYears(ii)) > 0 Then iNoColumns = iNoColumns + 1
Next ii
'Now we can redefine some variables to hold the information (you'll want to redim after you've collected the info)
ReDim sAssets(1 To iNoColumns + 1, 1 To 100) As String
ReDim iColumns(1 To iNoColumns) As Integer
Else
If bIsAssets Then
'Skip the heading line
If Not VBA.Trim$(sLine) = "Assets" Then
'Increment the counter
iCount = iCount + 1
'If iCount reaches its limit you'll have to ReDim Preserve your sAssets array (I'll leave this to you)
If iCount > 99 Then
'You'll find other posts on stackoverflow to do this
End If
'This will happen on the first row, it'll happen everytime you
'hit a $ sign but you could code to only do so the first time
If VBA.InStr(1, sLine, "$") > 0 Then
iColumns(1) = VBA.InStr(1, sLine, "$")
For ii = 2 To iNoColumns
'We need to start at the next character across
iColumns(ii) = VBA.InStr(iColumns(ii - 1) + 1, sLine, "$")
Next ii
End If
'The first part (the name) is simply up to the $ sign (trimmed of spaces)
sAssets(1, iCount) = VBA.Trim$(VBA.Mid$(sLine, 1, iColumns(1) - 1))
For ii = 2 To iNoColumns
'Then we can loop around for the rest
'The value for column ii sits between the previous $ sign and the current one
sAssets(ii, iCount) = VBA.Trim$(VBA.Mid$(sLine, iColumns(ii - 1) + 1, iColumns(ii) - iColumns(ii - 1) - 1))
Next ii
'Now do the last column
If VBA.Len(sLine) > iColumns(iNoColumns) Then
sAssets(iNoColumns + 1, iCount) = VBA.Trim$(VBA.Right$(sLine, VBA.Len(sLine) - iColumns(iNoColumns)))
End If
Else
'Reset the counter
iCount = 0
End If
End If
End If
End If
End If
Loop
'Clean up
fsFile.Close
Set fsFile = Nothing
Set fs = Nothing
End Sub
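If the numbers are not reliably preceded by a dollar sign, the "are there more than 3 spaces?" idea from the question can be used instead. The following is only a minimal sketch of that idea: the function name SplitOnGaps and the default gap of 3 spaces are assumptions for illustration, not something taken from the sample file.

```vba
' Hypothetical helper: splits one line of the converted text into fields
' wherever at least minGap consecutive spaces occur. Narrow gaps inside a
' label such as "Capital Assets, net" stay part of the field.
Public Function SplitOnGaps(ByVal sLine As String, Optional ByVal minGap As Long = 3) As Collection
    Dim result As New Collection
    Dim i As Long, spaceRun As Long
    Dim sField As String, ch As String

    For i = 1 To Len(sLine)
        ch = Mid$(sLine, i, 1)
        If ch = " " Then
            spaceRun = spaceRun + 1
        Else
            If spaceRun >= minGap Then
                ' A wide gap closes the current field (if there is one)
                If Len(sField) > 0 Then result.Add sField
                sField = ""
            ElseIf spaceRun > 0 And Len(sField) > 0 Then
                ' A narrow gap belongs to the field; keep it as a single space
                sField = sField & " "
            End If
            spaceRun = 0
            sField = sField & ch
        End If
    Next i

    ' Add the last field on the line
    If Len(sField) > 0 Then result.Add sField
    Set SplitOnGaps = result
End Function
```

The first item of the returned collection would be the row title and the remaining items the year values, so the item count also tells you how many columns a given line has.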

I cannot examine the sample data as the PasteBin has been removed. Based on what I can glean from the problem description, it seems to me that using Regular Expressions would make parsing the data much easier.
Add a reference to the Scripting Runtime scrrun.dll for the FileSystemObject.
Add a reference to the Microsoft VBScript Regular Expressions 5.5. library for the RegExp object.
Instantiate a RegEx object with
Dim objRE As New RegExp
Set the Pattern property to "(\b\d{4}\b){1,3}"
The above pattern should match on lines containing strings like:
2010
2010 2011
2010 2011 2012
The number of spaces between the year strings is irrelevant, as long as there is at least one (since we're not expecting to encounter strings like 201020112012 for example)
Set the Global property to True
The captured groups will be found in the individual Match objects from the MatchCollection returned by the Execute method of the RegEx object objRE. So declare the appropriate objects:
Dim objMatches as MatchCollection
Dim objMatch as Match
Dim intMatchCount 'tells you how many year strings were found, if any
Assuming you've set up a FileSystemObject object and are scanning the text file, reading each line into a variable strLine
First test to see if the current line contains the pattern sought:
If objRE.Test(strLine) Then
Set objMatches = objRE.Execute(strLine)
intMatchCount = objMatches.Count
For i = 0 To intMatchCount - 1
'processing code such as writing the years as column headings in Excel
Set objMatch = objMatches(i)
'e.g. ActiveCell.Value = objMatch.Value
Next i
'subsequent lines beneath the line containing the year strings should
'have the amounts, which may be captured in a similar fashion using an
'additional RegExp object and a Pattern such as "(\b\d+\b){1,3}" for
'whole numbers or "(\b\d+\.\d+\b){1,3}" for floats. For currency, you
'can use "(\$\d+\.\d{2}\b){1,3}"
Else
'skip over this line
End If
This is just a rough outline of how I would approach this challenge. I hope there is something in this code outline that will be of help to you.
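To make the outline concrete, here is a minimal sketch of the year-detection step as a self-contained function. The name ExtractYears is hypothetical, and I have swapped in the simpler pattern "\b\d{4}\b" so that, with Global set to True, each year comes back as its own match. It uses late binding, so it should run without the RegExp reference.

```vba
' Hypothetical helper: returns every standalone four-digit number on a line.
' Late binding (CreateObject) avoids needing the RegExp library reference.
Public Function ExtractYears(ByVal strLine As String) As Collection
    Dim objRE As Object, objMatches As Object, objMatch As Object
    Dim years As New Collection

    Set objRE = CreateObject("VBScript.RegExp")
    objRE.Pattern = "\b\d{4}\b"   ' one year per match
    objRE.Global = True           ' find all matches, not just the first

    Set objMatches = objRE.Execute(strLine)
    For Each objMatch In objMatches
        years.Add objMatch.Value
    Next objMatch

    Set ExtractYears = years
End Function
```

The Count of the returned collection gives the number of year columns. Note that a four-digit amount such as 1500 matches this pattern too, so it is probably best applied only to the line you expect to be the table header.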

Another way to do this, which I have had some success with, is to use VBA to convert the PDF to a .doc or .docx file and then search for and pull the tables from the Word file. They can be easily extracted into Excel sheets, and the conversion seems to handle tables nicely. Note, however, that it works on a page-by-page basis, so a table that extends over a page break ends up as separate tables in the Word doc.
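A rough sketch of what that could look like from Excel VBA follows. It assumes Word 2013 or later (which can open PDF files directly) and an empty target worksheet. The procedure name and its parameters are made up for illustration, and the conversion quality will vary from PDF to PDF.

```vba
' Sketch only: assumes Word 2013+ is installed and that ws is an empty sheet.
' pdfPath and ws are supplied by the caller.
Public Sub CopyPdfTablesViaWord(ByVal pdfPath As String, ByVal ws As Worksheet)
    Dim wdApp As Object, wdDoc As Object, tbl As Object
    Dim nextRow As Long

    Set wdApp = CreateObject("Word.Application")
    wdApp.DisplayAlerts = 0   ' 0 = wdAlertsNone; intended to keep prompts out of the way
    Set wdDoc = wdApp.Documents.Open(Filename:=pdfPath, ConfirmConversions:=False, ReadOnly:=True)

    nextRow = 1
    For Each tbl In wdDoc.Tables
        tbl.Range.Copy
        ws.Paste Destination:=ws.Cells(nextRow, 1)
        ' Continue below everything pasted so far, leaving one blank row
        nextRow = ws.UsedRange.Row + ws.UsedRange.Rows.Count + 1
    Next tbl

    wdDoc.Close SaveChanges:=False
    wdApp.Quit
End Sub
```

Because a table that spans a page break arrives as two Word tables, the pieces land in separate blocks on the sheet and may need to be stitched together afterwards.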

Related

Is there a way to read .xlsx files in Applescript using Microsoft Excel scripting without opening the file?

So I was wondering if I could use the Microsoft Excel scripting in Applescript without opening the actual file. So something like this:
tell application "Microsoft Excel" to return value of cell "E10" of front sheet of "/Path/to/my/file"
Reading .xlsx files without opening them is very hard or nigh impossible. However, .csv-files store their data much more readably. You could set up something like this:
set theText to (read POSIX file "/Path/to/my/file")
set theLines to paragraphs of theText
return word 6 of (item 10 of theLines)
You could also write a function to do the work for you so you don't have to write it out every time:
on cell(theColumn, theRow, theTable)
return word theColumn of item theRow of theTable
end cell
return cell(2, 2, theLines)
Here is a bit of a bodgy function I wrote that even takes Excel-style inputs ("E10"):
on cell(theLocation, theTable)
set theChars to every character of theLocation
set letterPosition to 0
repeat with i from 1 to count of theChars
try
set dummy to item i of theChars as number
exit repeat
on error
set letterPosition to letterPosition + 1
end try
end repeat
if letterPosition is equal to 0 then display dialog "Incorrect input"
-- the leading letters are the column, the trailing digits are the row
set theColumnLetters to items 1 thru letterPosition of theChars
set theRowNumber to ((items (letterPosition + 1) thru (count of theChars) of theChars) as string) as integer
set theAlphabet to every character of "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
set theColumnNumber to 0
set theMultiplier to 1
repeat with j from 1 to count of theColumnLetters
repeat with i from 1 to count of theAlphabet
if item i of theAlphabet is equal to (item ((count of theColumnLetters) - j + 1) of theColumnLetters) then
set theColumnNumber to theColumnNumber + i * theMultiplier
exit repeat
end if
end repeat
set theMultiplier to theMultiplier * (count of theAlphabet)
end repeat
return word theColumnNumber of (item theRowNumber of theTable)
end cell
return cell("AA6", theLines)
Yes, I am aware of the repeat with ... in reverse form, but it somehow didn't work with this function. I have no idea why.

Roman numeral page numbers for table of contents

I'm generating word docs entirely in VBA and am aiming to have roman numeral page numbers for my table of contents and numeric page numbers for the remainder of the document. My table of contents spans multiple pages and is variable in page size.
How would I achieve roman numeral page numbers for only a table of contents of variable page span?
Any help would be greatly appreciated.
If you don't know where to start, try this approach in Word:
Insert a section break after the table of contents pages.
Turn on the macro recorder.
Format the page numbers in the first section with Roman numerals.
Select the next section and unlink it from the previous section.
Format the page numbers in the second section with regular numbers.
Turn off the macro recorder.
Adjust the code as required.
I also find Word fiddly in this area, so here's some code to show one possible example. The code clears the content of the current document (so don't run it in an existing document!), then generates a few headings and a table of contents, with a section break separating the two. The section break allows different page number formatting (Roman numerals for the first section and Arabic for the second). Changing the upper bound of the For loop to 100 will demonstrate a table of contents that spans multiple pages. Might point you in the right direction. Cheers.
Option Explicit
Public Sub PageNumbers()
Dim myRange As Range
Dim Counter As Long
Dim myTOC As TableOfContents
' Delete word document content
ActiveDocument.StoryRanges(wdMainTextStory).Delete
' Add in some headings for testing
Set myRange = ActiveDocument.Range(0, 0)
For Counter = 1 To 10
myRange.InsertAfter "Heading " & Counter
myRange.Style = WdBuiltinStyle.wdStyleHeading1
myRange.InsertParagraphAfter
Next
' Add in a page number
With ActiveDocument.Sections(1)
.Footers(wdHeaderFooterPrimary).PageNumbers.Add _
PageNumberAlignment:=wdAlignPageNumberLeft, _
FirstPage:=True
End With
' Add in a section break at the start of the document
Set myRange = ActiveDocument.Range(0, 0)
myRange.InsertBreak Type:=wdSectionBreakNextPage
myRange.InsertParagraphAfter
' Insert a table of contents (into the first section)
Set myRange = ActiveDocument.Range(0, 0)
Set myTOC = ActiveDocument.TablesOfContents.Add(myRange, True, 1, 3, False)
' Format the page number of the first section to have roman numerals
With ActiveDocument.Sections.Item(1).Footers.Item(1).PageNumbers
.NumberStyle = wdPageNumberStyleLowercaseRoman
.HeadingLevelForChapter = 0
.IncludeChapterNumber = False
.ChapterPageSeparator = wdSeparatorHyphen
.RestartNumberingAtSection = False
.StartingNumber = 0
End With
' Format the page number of the second section to have arabic numerals
With ActiveDocument.Sections.Item(2).Footers.Item(1).PageNumbers
.NumberStyle = wdPageNumberStyleArabic
.HeadingLevelForChapter = 0
.IncludeChapterNumber = False
.ChapterPageSeparator = wdSeparatorHyphen
.RestartNumberingAtSection = True
.StartingNumber = 1
End With
End Sub

How can I pick specific string fragments out of an excel cell using a custom formula written in VBA

At work I am required to reformat incorrect Addresses on a weekly basis from records in our Salesforce instance. We gather the incorrectly formatted addresses using a Report and export them to an Excel file. My job is simply to manipulate the data in the file to format them properly then reinsert them into the database.
Typically the addresses are formatted as so:
5 Sesame Street, Anytown, Anyplace
Separating these can be done easily by hand, but I typically have to work with hundreds of addresses at a time, and using default excel formulas tends to require lots of wrangling multiple cells at once to break it up into fragments.
Thus I wrote a custom formula to run through the cell and return a specific fragment of the string based on the "Comma Number" given. So if I give a Comma Number of 1, I would get "5 Sesame Street", 2 would get me "Anytown", etc.
Here is my code so far:
Public Function fragmentAddress(address As String, numberofcommas As Integer) As String
seen = 1
lastComma = -1
Dim x As Long
Dim frag As Long
For x = 0 To Len(address)
If Mid(address, x, 1) = "," & numberofcommas = seen Then
Exit For
ElseIf Mid(address, x, 1) = "," & numberofcommas <> seen Then
seen = seen + 1
lastComma = x
End If
Next
frag = Mid(address, lastComma + 1, seen - lastComma)
fragmentAddress = frag
I have not implemented the ability to handle the final value yet. As it stands, the function returns no output at all: I only get a "#VALUE!" error when I give it the input
=fragmentAddress("3 Ashley Close, Charlton Kings",1)
I have some experience with programming, but this is my first time writing anything in VBA.
Any help would be appreciated, thank you.
Not exactly sure what your question is, but this is simpler:
Public Function GetAddressFragment(ByVal Address As String, ByVal Index As Integer) As String
Dim addr() As String
addr = Split(Address, ",")
On Error Resume Next
GetAddressFragment = Trim(addr(Index - 1))
End Function
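As for why the original returns #VALUE!, a few things in it look likely to raise a run-time error, and any unhandled error in a worksheet function shows up as #VALUE!. The loop starts at 0 but Mid positions start at 1, & is string concatenation rather than a logical And, and frag is declared As Long although it has to hold text. If you would rather keep the loop approach than use Split, a sketch with those points changed might look like this (the name FragmentAddressLoop is just for illustration):

```vba
' Sketch of the loop approach. Returns the Nth comma-separated fragment,
' including the final one, or "" if N is out of range.
Public Function FragmentAddressLoop(ByVal address As String, ByVal fragmentNumber As Long) As String
    Dim x As Long, seen As Long, lastComma As Long

    seen = 1        ' number of the fragment currently being scanned
    lastComma = 0   ' position of the comma before this fragment (0 = none yet)

    For x = 1 To Len(address)              ' Mid positions start at 1, not 0
        If Mid$(address, x, 1) = "," Then
            If seen = fragmentNumber Then
                FragmentAddressLoop = Trim$(Mid$(address, lastComma + 1, x - lastComma - 1))
                Exit Function
            End If
            seen = seen + 1
            lastComma = x
        End If
    Next x

    ' No closing comma was found: the requested fragment is the last one
    If seen = fragmentNumber Then
        FragmentAddressLoop = Trim$(Mid$(address, lastComma + 1))
    End If
End Function
```

With this version, =FragmentAddressLoop("3 Ashley Close, Charlton Kings",1) should give "3 Ashley Close" and a fragment number of 2 should give "Charlton Kings".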

Read range of data line by line which is increasing dynamically

I need to read or clear all the lines in a specific range that grows dynamically.
The problem is that the range of the "Selected File List" table is not fixed. Each new file chosen with Browse is added to the "Selected File List" dynamically, so the number of files can't be predicted.
What I've tried so far is to keep track of the Browse button clicks and add each file path to an ArrayList.
After that, I take row number 9 (the start line) plus the ArrayList size to get the row number of the end of the table.
But due to some requirements, I want to read the file list from the Excel sheet itself, something like: read until a blank line, a bottom border, or similar is found.
I'm new to VBA and I'm not so sure what I'm doing with VBA code, so please be kind :D
Any input would be greatly appreciated. Thanks for your precious time.
Provided there is no blank cell within the list of files, you could give the heading cell ("Selected File List") a name, e.g. "SelFiles" and do something like this ...
Sub Test()
' call the function
Debug.Print getSize(Range("SelFiles"), True)
End Sub
Function getSize(MyRange As Range, Optional HasHeader As Boolean = False)
Dim Idx As Integer
Idx = 2
Do While MyRange(Idx, 1) <> ""
Idx = Idx + 1
Loop
If HasHeader Then
getSize = Idx - 2
Else
getSize = Idx - 1
End If
End Function
The function uses a bit of abstraction, so you can use it for ranges with or without a header row. You may have other areas in your application where you need to count rows, so using a function is preferable to just adding the loop to the code of your main Sub().
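Since the question also asks about clearing the list, here is a small sketch that builds on getSize. It assumes the header cell is named "SelFiles" as suggested above and that getSize is in the same project. The Sub name is made up.

```vba
' Sketch: clears every file entry below the header cell named "SelFiles".
Sub ClearSelectedFiles()
    Dim n As Long

    n = getSize(Range("SelFiles"), True)   ' number of entries under the header
    If n > 0 Then
        ' Start one row below the header and cover n rows in the same column
        Range("SelFiles").Offset(1, 0).Resize(n, 1).ClearContents
    End If
End Sub
```

The same Offset/Resize expression can be used in a For Each loop if you want to read the entries instead of clearing them.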

Lookup customer type by the meaningful part of the customer name and set prioritize

Is there any way Excel 2010 can look up a customer type using the meaningful part of the customer name?
For example, the customer name is Littleton's Valley Market, but in the list where I am trying to look up the customer type, the names are formatted a little differently, such as <Littletons Valley MKT #2807 and/or Littleton Valley.
Some customers can be listed under multiple customer types. How can Excel tell me which customers these are, and can I set Excel to pull the primary or secondary type?
Re #1. The following fails on the leading < (if it belongs!) and on any other extraneous prefix, but this may be rare or non-existent, so:
=INDEX(G:G,MATCH(LEFT(A1,6)&"*",F:F,0))
or similar may catch enough to be useful. This looks at the first six characters but can be adjusted to suit, though unfortunately only once at a time. Assumes the mismatches are in ColumnA (eg A1 for the formula above) and that the correct names are in ColumnF with the required type in the corresponding row of ColumnG.
On a large scale Fuzzy Lookup may be helpful.
Since the question is tagged VBA, Soundex matching and Levenshtein distance may also be of interest.
Re #2 If secondary type is in ColumnH, again in matching row, then adjust G:G above to H:H.
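For the Levenshtein suggestion, a minimal sketch of the standard dynamic-programming version is below. The function name and the choice to compare case-insensitively are my own assumptions. A small result means the two names are close.

```vba
' Sketch: classic dynamic-programming Levenshtein distance (case-insensitive).
Public Function Levenshtein(ByVal s1 As String, ByVal s2 As String) As Long
    Dim i As Long, j As Long, cost As Long
    Dim d() As Long

    s1 = LCase$(s1)
    s2 = LCase$(s2)
    ReDim d(0 To Len(s1), 0 To Len(s2))

    ' Distance from the empty string is just the number of characters
    For i = 0 To Len(s1)
        d(i, 0) = i
    Next i
    For j = 0 To Len(s2)
        d(0, j) = j
    Next j

    For i = 1 To Len(s1)
        For j = 1 To Len(s2)
            If Mid$(s1, i, 1) = Mid$(s2, j, 1) Then cost = 0 Else cost = 1
            d(i, j) = d(i - 1, j) + 1                                                   ' deletion
            If d(i, j - 1) + 1 < d(i, j) Then d(i, j) = d(i, j - 1) + 1                 ' insertion
            If d(i - 1, j - 1) + cost < d(i, j) Then d(i, j) = d(i - 1, j - 1) + cost   ' substitution
        Next j
    Next i

    Levenshtein = d(Len(s1), Len(s2))
End Function
```

Used as a worksheet formula, =Levenshtein("Littleton's Valley","Littletons Valley") gives 1, so a cut-off of a few edits could flag likely matches. What threshold works will depend on your data.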
pnuts gives a good answer re: Fuzzy Lookup, Soundex matching, etc. Quick and dirty way I've handled this before:
Function isNameLike(nameSearch As String, nameMatch As String) As Boolean
On Error GoTo ErrorHandler
If InStr(1, invalidChars(nameSearch), invalidChars(nameMatch), vbTextCompare) > 0 Then isNameLike = True
Exit Function
ErrorHandler:
isNameLike = False
End Function
Function invalidChars(strIn As String) As String
Dim i As Long
Dim sIn As String
Dim sOut As String
sOut = ""
On Error GoTo ErrorHandler
For i = 1 To Len(strIn)
sIn = Mid(strIn, i, 1)
If InStr(1, " 1234567890~`!@#$%^&*()_-+={}|[]\:'<>?,./" & Chr(34), sIn, vbTextCompare) = 0 Then sOut = sOut & sIn
Next i
invalidChars = sOut
Exit Function
ErrorHandler:
invalidChars = strIn
End Function
Then I can call isNameLike from code, or use it as a formula in a worksheet. Note that you still have to supply the "significant" part of the customer name you're looking for.
