Get Excel cell properties that have been copied to the clipboard - excel

I'm trying to copy the values from multiple cells into one cell. If I only wanted to have the values of the cells combined, I would use something like
Dim str as string = My.Computer.ClipBoard.GetText
oxlapp.ActiveCell.Value = str
However, in this case I need to include HTML tagging to create a table, and I also want to include formatting such as bold, italic, and underline. Therefore, instead of just the text from the clipboard, I need to know some cell properties.
I know that they should be there, because you can copy/paste entire cells of course.
So far I tried getting the Excel cells by using
My.Computer.Clipboard.GetData(XlClipboardFormat.xlClipboardFormatTable)
and
My.Computer.Clipboard.GetData(XlClipboardFormat.xlClipboardFormatCSV)
but while debugging I noticed that both of them returned Nothing.
Does someone know how I can get all cell properties from the clipboard?
To make it more clear, I want this
To turn into this:
If there is any other way than using the clipboard, I would be happy to try.

You need to use another format from clipboard - XML Spreadsheet. The copied data is contained in special XML with its own structure and attributes. Let's take the following sheet's data:
As you see, every cell has some formatting. The XML for this data is the following:
<ss:Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<ss:Styles>
<ss:Style ss:ID="Default" ss:Name="Normal">
<ss:Alignment ss:Vertical="Bottom"/>
<ss:Borders/>
<ss:Font ss:FontName="Calibri" x:CharSet="204" x:Family="Swiss" ss:Size="11" ss:Color="#000000"/>
<ss:Interior/>
<ss:NumberFormat/>
<ss:Protection/>
</ss:Style>
<ss:Style ss:ID="s62">
<ss:Interior ss:Color="#FFFF00" ss:Pattern="Solid"/>
</ss:Style>
<ss:Style ss:ID="s63">
<ss:Font ss:FontName="Calibri" x:CharSet="204" x:Family="Swiss" ss:Size="11" ss:Color="#000000" ss:Bold="1"/>
</ss:Style>
<ss:Style ss:ID="s64">
<ss:Font ss:FontName="Calibri" x:CharSet="204" x:Family="Swiss" ss:Size="11" ss:Color="#0000FF"/>
</ss:Style>
<ss:Style ss:ID="s65">
<ss:Font ss:FontName="Calibri" x:CharSet="204" x:Family="Swiss" ss:Size="11" ss:Color="#000000" ss:Italic="1"/>
</ss:Style>
</ss:Styles>
<ss:Worksheet ss:Name="Sheet1">
<ss:Table ss:ExpandedColumnCount="2" ss:ExpandedRowCount="2" ss:DefaultRowHeight="15">
<ss:Row>
<ss:Cell ss:StyleID="s62">
<ss:Data ss:Type="String">A</ss:Data>
</ss:Cell>
<ss:Cell ss:StyleID="s63">
<ss:Data ss:Type="String">1</ss:Data>
</ss:Cell>
</ss:Row>
<ss:Row>
<ss:Cell ss:StyleID="s64">
<ss:Data ss:Type="String">B</ss:Data>
</ss:Cell>
<ss:Cell ss:StyleID="s65">
<ss:Data ss:Type="String">2</ss:Data>
</ss:Cell>
</ss:Row>
</ss:Table>
</ss:Worksheet>
</ss:Workbook>
As you can see, all the formatting information is in the corresponding styles. For instance, the value A in cell A1 has style s62 (the StyleID attribute); you can find the Style node with that ID in the Styles node. The structure of rows and columns in this XML is implicit, i.e. you won't see row and column indexes and need to calculate them yourself. For instance, the second Cell node in the first Row node is first row, second column.
The following code generates data in the picture above and retrieves appropriate elements to manipulate.
A word of caution 1. If you take a look closely, the Workbook node has two urn:schemas-microsoft-com:office:spreadsheet namespaces: first is default and second is with prefix ss. Just remember - you always need to use ss prefix!
A word of caution 2. This method has one drawback: it takes no account of whether rows and columns are hidden (manually or by AutoFilter), so hidden rows and columns are included in the XML too!
Imports System.IO
Imports System.Xml.Linq
Imports Excel = Microsoft.Office.Interop.Excel
Imports <xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
Imports <xmlns:x="urn:schemas-microsoft-com:office:excel">
Sub GetCellsWithFormat()
'// Create new Excel app
Dim xlApp = New Excel.Application With {.Visible = True}
Dim book = xlApp.Workbooks.Add()
Dim sheet = DirectCast(book.Sheets(1), Excel.Worksheet)
'// Apply some formatting
With sheet
.Range("A1").Interior.Color = Excel.XlRgbColor.rgbYellow
.Range("B1").Font.Bold = True
.Range("A2").Font.Color = Excel.XlRgbColor.rgbBlue
.Range("B2").Font.Italic = True
'// Add some values
Dim arr = Array.CreateInstance(GetType(String), {2, 2}, {1, 1})
arr(1, 1) = "A" : arr(1, 2) = "1"
arr(2, 1) = "B" : arr(2, 2) = "2"
With .Range("A1:B2")
.Value = arr
.Copy() '//Copy cells to clipboard
End With
Dim xml As XElement
Using xml_stream = DirectCast(Clipboard.GetData("XML Spreadsheet"), Stream)
'// Get rid of last character (new line) to avoid parsing error
xml_stream.SetLength(xml_stream.Length - 1)
xml = XElement.Load(xml_stream)
End Using
'// Get any element you need
Dim styles = xml.<ss:Styles>(0)
Dim table = xml.<ss:Worksheet>.<ss:Table>(0)
Dim rows = table.<ss:Row>
'// Do something with this data
End With
End Sub
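Since the original goal was an HTML table with bold/italic/underline tags, here is a rough sketch of how the XML could be turned into one. It is written in Python purely to keep it short and self-contained; it is not part of the answer's VB code. The function name to_html_table, the helper ss, and the trimmed SAMPLE are made up for illustration, and treating any ss:Underline value other than "None" as underlined is an assumption. Column positions are counted per Row, with the ss:Index attribute honoured when empty cells were skipped.

```python
import xml.etree.ElementTree as ET
from html import escape

SS = "urn:schemas-microsoft-com:office:spreadsheet"


def ss(name):
    """Return `name` qualified with the ss namespace, as ElementTree expects."""
    return "{%s}%s" % (SS, name)


def to_html_table(xml_text):
    root = ET.fromstring(xml_text)

    # Style ID -> list of HTML tags implied by the style's ss:Font node.
    tags_by_style = {}
    for style in root.iter(ss("Style")):
        font = style.find(ss("Font"))
        tags = []
        if font is not None:
            if font.get(ss("Bold")) == "1":
                tags.append("b")
            if font.get(ss("Italic")) == "1":
                tags.append("i")
            # Assumption: any value other than "None" means underlined.
            if font.get(ss("Underline"), "None") != "None":
                tags.append("u")
        tags_by_style[style.get(ss("ID"))] = tags

    lines = ["<table>"]
    for row in root.iter(ss("Row")):
        cells = []
        col = 0
        for cell in row.findall(ss("Cell")):
            # Column numbers are implicit: count the cells, unless ss:Index
            # says that empty cells were skipped before this one.
            target = int(cell.get(ss("Index"), col + 1))
            cells.extend("<td></td>" for _ in range(target - col - 1))
            col = target
            data = cell.find(ss("Data"))
            text = escape(data.text or "") if data is not None else ""
            for tag in reversed(tags_by_style.get(cell.get(ss("StyleID")), [])):
                text = "<%s>%s</%s>" % (tag, text, tag)
            cells.append("<td>%s</td>" % text)
        lines.append("<tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Trimmed version of the XML shown above (same style IDs and values).
SAMPLE = """<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="s62"><ss:Interior ss:Color="#FFFF00" ss:Pattern="Solid"/></ss:Style>
<ss:Style ss:ID="s63"><ss:Font ss:FontName="Calibri" ss:Bold="1"/></ss:Style>
<ss:Style ss:ID="s64"><ss:Font ss:FontName="Calibri" ss:Color="#0000FF"/></ss:Style>
<ss:Style ss:ID="s65"><ss:Font ss:FontName="Calibri" ss:Italic="1"/></ss:Style>
</ss:Styles>
<ss:Worksheet ss:Name="Sheet1"><ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="s62"><ss:Data ss:Type="String">A</ss:Data></ss:Cell>
<ss:Cell ss:StyleID="s63"><ss:Data ss:Type="String">1</ss:Data></ss:Cell>
</ss:Row>
<ss:Row>
<ss:Cell ss:StyleID="s64"><ss:Data ss:Type="String">B</ss:Data></ss:Cell>
<ss:Cell ss:StyleID="s65"><ss:Data ss:Type="String">2</ss:Data></ss:Cell>
</ss:Row>
</ss:Table></ss:Worksheet>
</ss:Workbook>"""

print(to_html_table(SAMPLE))
```

The same walk (build a style lookup first, then iterate Row and Cell nodes while counting positions) should carry over to the XElement code above; colours and fills could be added to the lookup in the same way.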
UPDATE
In fact, you don't need the clipboard to get this XML - you can pass xlRangeValueXMLSpreadsheet to the range's Value property:
With sheet
'// Same code...
With .Range("A1:B2")
.Value = arr
'.Copy() '//No need to copy!
End With
Dim xml_string = CStr(.Range("A1:B2").Value(Excel.XlRangeValueDataType.xlRangeValueXMLSpreadsheet))
'// Again, exclude last character
Dim xml = XElement.Parse(xml_string.Substring(0, xml_string.Length - 1))
'// Get any element you need
Dim styles = xml.<ss:Styles>(0)
Dim table = xml.<ss:Worksheet>.<ss:Table>(0)
Dim rows = table.<ss:Row>
'// Do something with this data
End With

When I want to allow data copied from Excel into a custom windows app that I wrote, I use GetText() instead of GetData().
Dim ClipboardText As String = Nothing
ClipboardText = My.Computer.Clipboard.GetText()
If multiple cells are copied to the clipboard from Excel, the values are separated by vbTab (between columns) and vbCrLf (between rows), depending on the selection.
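To show what that tab/line-break layout looks like once it is split up, here is a small sketch in Python (chosen only for brevity; the function name parse_clipboard_text is made up). It assumes no cell contains a tab or line break itself, since Excel quotes such cells and this sketch does not handle that. Note that this route gives you values only, not formatting.

```python
def parse_clipboard_text(text):
    """Split Excel's plain-text clipboard layout into a list of rows."""
    # Rows end with CR LF; strip the trailing terminator so it does not
    # produce an empty extra row.
    rows = text.replace("\r\n", "\n").rstrip("\n").split("\n")
    # Within a row, cells are separated by tabs.
    return [row.split("\t") for row in rows]


sample = "A\t1\r\nB\t2\r\n"
print(parse_clipboard_text(sample))  # [['A', '1'], ['B', '2']]
```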


Trouble with Excel VBA parsing of XML file for ISBN title lookup: Run-time error 91 Object variable or With block variable not set

Given a Column A in Excel with multiple cells containing ISBN (book id) values, I want my VBA macro to loop through each of them and, for each one, parse an online XML file that is unique to that ISBN and put the corresponding book title in Column B.
For example, if A1 contains 1931498717, I should parse this XML and grab the title "Don't think of an elephant! : know your values and frame the debate : the essential guide for progressives" and put that in B1.
Here is a sample of the XML file:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<classify xmlns="http://classify.oclc.org">
<response code="4"/>
<!--Classify is a product of OCLC Online Computer Library Center: http://classify.oclc.org-->
<workCount>2</workCount>
<start>0</start>
<maxRecs>25</maxRecs>
<orderBy>thold desc</orderBy>
<input type="isbn">1931498717</input>
<works>
<work author="Lakoff, George" editions="28" format="Book" holdings="1088" hyr="2014" itemtype="itemtype-book" lyr="2004" owi="796415685" schemes="DDC LCC" title="Don't think of an elephant! : know your values and frame the debate : the essential guide for progressives" wi="796415685"/>
<work author="Lakoff, George" editions="1" format="Musical score" holdings="1" hyr="2004" itemtype="itemtype-msscr" lyr="2004" owi="4735145535" schemes="DDC" title="Don't think of an elephant! : know your values and frame the debate : the essential guide for progressives" wi="4735145535"/>
</works>
</classify>
Notice there are two "work" elements. In this case, I am happy to just grab the title attribute from the first one. But even better would be to make sure it's the title of a book (format="Book") and not some other format.
Here is my macro code:
Sub ISBN()
Do
Dim xmlDoc As DOMDocument60
Set xmlDoc = New DOMDocument60
xmlDoc.async = False
xmlDoc.validateOnParse = False
r = CStr(ActiveCell.Value)
xmlDoc.Load ("http://classify.oclc.org/classify2/Classify?isbn=" + r + "&summary=true")
ActiveCell.Offset(0, 2).Value = xmlDoc.SelectSingleNode("/classify/works/work[1]").attributes.getNamedItem("title").text
ActiveCell.Offset(1, 0).Select
Loop Until IsEmpty(ActiveCell.Value)
End Sub
I get this error, "Run-time error 91 Object variable or With block variable not set," on the line that references "xmlDoc.SelectSingleNode("/classify/works/work[1]").attributes.getNamedItem("title").text"
I've tried numerous variations to try to isolate the title text but cannot get anything other than this error.
My Excel file is Microsoft Excel for Microsoft 365, on my laptop.
Help would be greatly appreciated. I am inexperienced in VBA programming and XML parsing and have been googling/reading on this for more time than I care to admit without making any progress. (There was a previous StackOverflow question on parsing ISBN XML files, but it was for a provider that no longer offers the XML files for free. That code inspired my code, but something was lost in the translation.)
Thanks tons for any help you can offer.
Your XML has a "default" namespace which applies to its contents, so when using xpath you need to create a dummy alias for that namespace and use it in your query.
See https://stackoverflow.com/a/72997440/478884
Eg:
Dim xmlDoc As DOMDocument60, col As Object, el As Object
Set xmlDoc = New DOMDocument60
xmlDoc.async = False
xmlDoc.validateOnParse = False
'set default namespace and alias "xx"
xmlDoc.SetProperty "SelectionNamespaces", _
"xmlns:xx='http://classify.oclc.org'"
xmlDoc.Load "http://classify.oclc.org/classify2/Classify?isbn=1931498717&summary=false"
Set col = xmlDoc.SelectNodes("/xx:classify/xx:works/xx:work")
'check each `work` for format and title
For Each el In col
Debug.Print "*****************"
Debug.Print el.Attributes.getNamedItem("format").Text
Debug.Print el.Attributes.getNamedItem("title").Text
Next el
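The same default-namespace pitfall exists in most XML libraries, so here is a minimal sketch of it in Python's ElementTree, including the format="Book" filter the question asked about. The alias xx mirrors the one above; the function name first_book_title is made up, and the sample is an abbreviated copy of the XML in the question (titles shortened).

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response shown in the question.
SAMPLE = """<classify xmlns="http://classify.oclc.org">
<response code="4"/>
<works>
<work author="Lakoff, George" format="Book" title="Don't think of an elephant!"/>
<work author="Lakoff, George" format="Musical score" title="Don't think of an elephant!"/>
</works>
</classify>"""

# Alias for the document's default namespace; the prefix itself is arbitrary.
NS = {"xx": "http://classify.oclc.org"}


def first_book_title(xml_text):
    root = ET.fromstring(xml_text)
    # root is already <classify>, so the path starts below it.
    work = root.find("xx:works/xx:work[@format='Book']", NS)
    return work.get("title") if work is not None else None


# Without the alias nothing matches, which is what leads to error 91 in VBA.
print(ET.fromstring(SAMPLE).find("works/work"))  # None
print(first_book_title(SAMPLE))                  # Don't think of an elephant!
```

In the VBA version the equivalent filter would be an XPath predicate on xx:work, e.g. /xx:classify/xx:works/xx:work[@format='Book'], and it is worth checking the result against Nothing before reading attributes.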

Update an Excel style in VBA

My Excel macro reads the answers to a survey from a set of Excel files. Each answer contains a score (from 1 to 4) and a description. The goal is to generate a matrix in which each cell has a color that represents the score. I would like the user to be able to modify the layout of these cells. To make this easy for the user, I created a template matrix and a button. The user should be able to modify the layout of the template cells and, on the click of a button, a set of styles (Score 1, Score 2, ...) should be generated. Once the matrix is created, the workbook should be able to function without the survey files.
I have tried a couple of things:
Try 1
ThisWorkbook.Styles.Add "Score 1", BasedOn:=cell1
This gives errors. I don't fully understand when they occur, but one of the causes is when the user modifies the cell layout by selecting another style.
Try 2
ThisWorkbook.Styles("Score 1").Delete
ThisWorkbook.Styles.Add "Score 1", BasedOn:=cell1
This is not a good idea: all cells lose their styling when it is executed a second time.
Try 3: Current
Copy the most frequently used properties of the cell's layout to the style. If the style has been deleted by the user, it is recreated. This procedure is not ideal, since most style properties won't be covered.
Is there a way to update a cell style that is more general? I would like there to be as little room as possible to make the workbook in an inconsistent and non-functional state.
I stuck with Try 3. Because it required a lot of code to cover all the properties that seemed likely to be edited, and because copying borders is tricky, I am posting the result.
'xR1_Template: the cell to base the style on
'nm_Style: the name of the style
Public Function Upsert_Style(xR1_Template As Excel.Range, nm_Style As String) As Excel.Style
Dim xStyle As Excel.Style
Set xStyle = Fn.TryGet(ThisWorkbook.Styles, nm_Style)
If Fn.IsNothing(xStyle) Then
Set xStyle = ThisWorkbook.Styles.Add(nm_Style)
End If
xStyle.Font.Color = xR1_Template.Font.Color
xStyle.Font.Bold = xR1_Template.Font.Bold
xStyle.Font.Name = xR1_Template.Font.Name
xStyle.Font.Italic = xR1_Template.Font.Italic
xStyle.Font.Size = xR1_Template.Font.Size
xStyle.Font.Strikethrough = xR1_Template.Font.Strikethrough
xStyle.Font.Subscript = xR1_Template.Font.Subscript
xStyle.Font.Superscript = xR1_Template.Font.Superscript
xStyle.Font.Underline = xR1_Template.Font.Underline
xStyle.Interior.Color = xR1_Template.Interior.Color
xStyle.Interior.Pattern = xR1_Template.Interior.Pattern
xStyle.Interior.PatternColor = xR1_Template.Interior.PatternColor
'NOTE: necessary to delete all borders first. There's no way to delete them one by one.
xStyle.Borders.LineStyle = xlNone
Dim iBorder As Long
For iBorder = 1 To xR1_Template.Borders.Count
Dim xBorder As Excel.Border
'NOTE: The Borders property claims to work with xlBordersIndex argument, but this is not true.
' Normal indexing is used.
Set xBorder = xR1_Template.Borders(iBorder)
'NOTE: "none-style" borders (=no border), should be skipped.
' Once they are retrieved using the Borders property, they are always visible.
' Setting them with xlLineStyle.xlLineStyleNone does not hide them.
If xBorder.LineStyle <> XlLineStyle.xlLineStyleNone Then
Dim xBorder_Style As Excel.Border
Set xBorder_Style = xStyle.Borders(iBorder)
xBorder_Style.Color = xBorder.Color
xBorder_Style.LineStyle = xBorder.LineStyle
xBorder_Style.Weight = xBorder.Weight
End If
Next iBorder
xStyle.AddIndent = xR1_Template.AddIndent
xStyle.FormulaHidden = xR1_Template.FormulaHidden
xStyle.HorizontalAlignment = xR1_Template.HorizontalAlignment
xStyle.IndentLevel = xR1_Template.IndentLevel
xStyle.NumberFormat = xR1_Template.NumberFormat
xStyle.NumberFormatLocal = xR1_Template.NumberFormatLocal
xStyle.Orientation = xR1_Template.Orientation
xStyle.ShrinkToFit = xR1_Template.ShrinkToFit
xStyle.VerticalAlignment = xR1_Template.VerticalAlignment
xStyle.WrapText = xR1_Template.WrapText
Set Upsert_Style = xStyle
End Function

Roman numeral page numbers for table of contents

I'm generating word docs entirely in VBA and am aiming to have roman numeral page numbers for my table of contents and numeric page numbers for the remainder of the document. My table of contents spans multiple pages and is variable in page size.
How would I achieve roman numeral page numbers for only a table of contents of variable page span?
Any help would be greatly appreciated.
If you don't know where to start, try this approach in Word:
Insert a section break after the table of contents pages.
Turn on the macro recorder
Format the page numbers in the first section with Roman numerals
select the next section and unlink it from the previous section
Format the page numbers in the second section with regular numbers
Turn off the macro recorder.
Adjust the code as required.
I also find Word fiddly in this area, so here's some code to show one possible approach. The code clears the content of the current document (so don't run it in an existing document!), then generates a few headings and a table of contents, which are then separated by a section break. The section break allows different formatting of the page number (roman numerals for the first section, and arabic for the second). Changing the For loop's upper limit to 100 will demonstrate a ToC that spans multiple pages. This might point you in the right direction. Cheers.
Option Explicit
Public Sub PageNumbers()
Dim myRange As Range
Dim Counter As Long
Dim myTOC As TableOfContents
' Delete word document content
ActiveDocument.StoryRanges(wdMainTextStory).Delete
' Add in some headings for testing
Set myRange = ActiveDocument.Range(0, 0)
For Counter = 1 To 10
myRange.InsertAfter "Heading " & Counter
myRange.Style = WdBuiltinStyle.wdStyleHeading1
myRange.InsertParagraphAfter
Next
' Add in a page number
With ActiveDocument.Sections(1)
.Footers(wdHeaderFooterPrimary).PageNumbers.Add _
PageNumberAlignment:=wdAlignPageNumberLeft, _
FirstPage:=True
End With
' Add in a section break at the start of the document
Set myRange = ActiveDocument.Range(0, 0)
myRange.InsertBreak Type:=wdSectionBreakNextPage
myRange.InsertParagraphAfter
' Insert a table of contents (into the first section)
Set myRange = ActiveDocument.Range(0, 0)
Set myTOC = ActiveDocument.TablesOfContents.Add(myRange, True, 1, 3, False)
' Format the page number of the first section to have roman numerals
With ActiveDocument.Sections.Item(1).Footers.Item(1).PageNumbers
.NumberStyle = wdPageNumberStyleLowercaseRoman
.HeadingLevelForChapter = 0
.IncludeChapterNumber = False
.ChapterPageSeparator = wdSeparatorHyphen
.RestartNumberingAtSection = False
.StartingNumber = 0
End With
' Format the page number of the second section to have arabic numerals
With ActiveDocument.Sections.Item(2).Footers.Item(1).PageNumbers
.NumberStyle = wdPageNumberStyleArabic
.HeadingLevelForChapter = 0
.IncludeChapterNumber = False
.ChapterPageSeparator = wdSeparatorHyphen
.RestartNumberingAtSection = True
.StartingNumber = 1
End With
End Sub

Find specific text in a string

=IFERROR(MID(I46,FIND("-",I46,8)-2,5),"")
I have a vast string of information copied into one cell from another program. There are several parts of this string that mean absolutely nothing for my current purpose. I want to extract the relevant information, which is always formatted the same way but is not always in the same position within the string. I am working with the formula above, which returns what I'm looking for but also returns other matches wherever "-" is found, which can be often.
The information I want to extract will always be LetterLetter-NumberNumber, ex: AA-01 AA-02 AB-01, and so on, always 5 characters.
How can I extract just this?
Then, in another row if need be, I want to remove duplicate instances (there will nearly always be duplicates).
What I get now is:
HA-03
HA-03
T - S
Y - R
HA-03
HA-03
HA-03
HA-03
Y - R
HA-06
HA-06
R - S
HA-07
HA-09
HA-09
HA-09
First, I'd like to just get:
HA-03
HA-03
HA-03
HA-03
HA-03
HA-03
HA-06
HA-06
HA-07
HA-09
HA-09
HA-09
Then convert that into:
HA-03
HA-06
HA-07
HA-09
If there is a way to skip the middle-man, I'm all ears =)
Thank You.
You could write a macro using Range.RemoveDuplicates to de-duplicate your list and then Like or regular expressions (see the code I posted in Parsing String Mixed with HTML, Words, Numbers, and Dates) in a loop over the remaining cells to check if cells meet your criteria and delete those that do not match.
De-duplicating then matching would probably be a bit more efficient, but you could do it either way around.
If you're new to VBA and want some code, add a comment and I'll post some.
EDIT:
You'll need to go to the Visual Basic editor (Alt-F11), select menu item Tools/References..., find and check "Microsoft VBScript Regular Expressions 5.5", then click OK. Then in the project explorer (Ctrl-R), right-click on the VBA Project for your workbook and choose Insert > Module.
Add the following code:
Public Function RegEx(strInput As String, strRegEx As String, Optional bIgnoreCase As Boolean = True, Optional bMultiLine As Boolean = False) As Boolean
Dim RegExp As VBScript_RegExp_55.RegExp
Set RegExp = New VBScript_RegExp_55.RegExp
With RegExp
.MultiLine = bMultiLine
.IgnoreCase = bIgnoreCase
.Pattern = strRegEx
End With
RegEx = RegExp.test(strInput)
Set RegExp = Nothing
End Function
(You can add the other Regular Expression code from Parsing String Mixed with HTML, Words, Numbers, and Dates if you think you might use it later)
Add the following code (Assuming that the data that you want to delete is in Column A):
Public Sub DedupeAndFilter()
Dim RCtr As Long
ActiveSheet.Range("A:A").RemoveDuplicates Columns:=1, Header:=xlNo
For RCtr = ActiveSheet.UsedRange.Rows.Count To 1 Step -1
If ActiveSheet.Range("A:A").Rows(RCtr).Text = "" Then
ActiveSheet.Range("A:A").Rows(RCtr).Delete xlShiftUp
ElseIf Not RegEx(ActiveSheet.Range("A:A").Rows(RCtr).Text, "[A-Z]{2}-\d\d", True) Then
ActiveSheet.Range("A:A").Rows(RCtr).Delete xlShiftUp
End If
Next
End Sub
Then, with the cursor in the DedupeAndFilter code block, press F5 or click the green Run ">" triangle. The code will then remove duplicates, blank cells and non-conforming cells from Column A.
If you want to change the column affected, change "A:A" in ActiveSheet.Range("A:A") to any other column reference, or substitute Selection and select the column you want before running the macro.
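If it helps to see the "skip the middle-man" idea without Excel in the way, here is a rough sketch of the same regex filter plus de-duplication in Python (not VBA). The name unique_codes is made up, and the pattern assumes the letters are always uppercase, as in the examples in the question.

```python
import re

# Two uppercase letters, a hyphen, two digits (e.g. HA-03).
CODE = re.compile(r"[A-Z]{2}-\d{2}")


def unique_codes(text):
    """Return every code found in text, first occurrence only, in order."""
    # dict.fromkeys keeps insertion order, so it de-duplicates without sorting.
    return list(dict.fromkeys(CODE.findall(text)))


sample = "xx HA-03 yy HA-03 T - S Y - R HA-06 zz HA-07 HA-09 HA-09"
print(unique_codes(sample))  # ['HA-03', 'HA-06', 'HA-07', 'HA-09']
```

Because the pattern requires a hyphen directly between the letters and digits, fragments such as "T - S" never match, which is the part the MID/FIND formula cannot express.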
If you want to avoid VBA, then try this:
Column A
asdfHA-03asdfasdf
HA-03sadfsa
asdfT - S
Y - Rasdfsad
asdfHA-03adf
asdHA-04
asdfsadf
Then use this formula:
=IF(ISERROR(FIND(" ",IFERROR(MID(A1,FIND("-",A1)-2,5),""))),IFERROR(MID(A1,FIND("-",A1)-2,5),""),"")
This should exclude the results that contain spaces; then you can copy and paste the results as values and use Remove Duplicates.

Extract tables from pdf (to excel), pref. w/ vba

I am trying to extract tables from PDF files with VBA and export them to Excel. If everything works out the way it should, the whole process should run automatically. The problem is that the tables are not standardized.
This is what I have so far.
VBA (Excel) runs XPDF, and converts all .pdf files found in current folder to a text file.
VBA (Excel) reads through each text file line by line.
And the code:
With New Scripting.FileSystemObject
With .OpenTextFile(strFileName, 1, False, 0)
If Not .AtEndOfStream Then .SkipLine
Do Until .AtEndOfStream
'do something with .ReadLine
Loop
End With
End With
This all works great. But now I am getting to the issue of extracting the tables from the text files.
What I am trying to do is VBA to find a string e.g. "Year's Income", and then output the data, after it, into columns. (Until the table ends.)
The first part (finding a certain string) is not very difficult, but how would I go about the second part? The text file will look like this Pastebin. The problem is that the text is not standardized: for example, some tables have three year columns (2010 2011 2012) and some only two (or one), some tables have more spaces between the columns, and some do not include certain rows (such as Capital Asset, net).
I was thinking about doing something like this but not sure how to go about it in VBA.
Find user defined string. eg. "Table 1: Years' Return."
a. Next line find years; if there are two we will need three columns in output (titles +, 2x year), if there are three we will need four (titles +, 3x year).. etc
b. Create title column + column for each year.
When reaching end of line, go to next line
a. Read text -> output to column 1.
b. Recognize spaces (Are spaces > 3?) as start of column 2. Read numbers -> output to column 2.
c. (if column = 3) Recognize spaces as start of column 3. Read numbers -> output to column 3.
d. (if column = 4) Recognize spaces as start of column 4. Read numbers -> output to column 4.
Each line, loop 4.
Next line does not include any numbers - End table. (Probably the easiest is just a user-defined number: after 15 characters no number? End table.)
I based my first version on Pdf to excel, but reading online people do not recommend OpenFile but rather FileSystemObject (even though it seems to be a lot slower).
Any pointers to get me started, mainly on step 2?
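The outline above (find the title, read the years, split each following line on runs of spaces, stop at the first line without numbers) can be sketched roughly as follows. This is Python rather than VBA, just to show the logic compactly; the function names, the three-space threshold and the sample lines are all made up, since the original Pastebin sample is not available.

```python
import re


def split_columns(line, min_gap=3):
    """Split a text line wherever min_gap or more consecutive spaces occur."""
    parts = re.split(r" {%d,}" % min_gap, line.strip())
    return [part for part in parts if part]


def parse_table(lines, title):
    """Return (years, rows) for the table that follows the line containing title."""
    it = iter(lines)
    for line in it:
        if title in line:
            break
    else:
        return [], []  # title not found

    years, rows = [], []
    for line in it:
        fields = split_columns(line)
        if not fields:
            continue  # skip blank lines
        if not years:
            years = re.findall(r"\b\d{4}\b", line)  # header line with the years
            continue
        has_number = any(ch.isdigit() for ch in "".join(fields[1:]))
        if len(fields) < 2 or not has_number:
            break  # no numeric columns: treat this as the end of the table
        rows.append(fields)
    return years, rows


# Made-up sample in the spirit of the description above.
sample = [
    "Table 1: Years' Return.",
    "                         2010      2011",
    "Cash                   $ 1,200   $ 1,350",
    "Capital Asset, net     $ 800     $ 750",
    "",
    "Notes to the table",
]
print(parse_table(sample, "Table 1"))
```

One caveat with this stopping rule: a section heading inside the table (a line with text but no numbers) would also end it, so real files may need a more specific end marker.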
You have a number of ways to dissect a text file and depending on how complex it is might cause you to lean one way or another. I started this and it got a bit out of hand... enjoy.
Based on the sample you've provided and the additional comments, I noted the following. Some of these may work well for simple files but can get unwieldy with bigger, more complex files. Furthermore, there may be slightly more efficient methods or tricks than what I have used here, but this will definitely get you going and achieve the desired outcome. Hopefully this makes sense in conjunction with the code provided:
You can use Booleans to help you determine which 'section' of the text file you are in, i.e. use InStr on the current line to determine that you are in a table by looking for the text 'Table', and then, once you know you are in the 'Table' section of the file, start looking for the 'Assets' section, etc.
You can use a few methods to determine the number of years (or columns) you have. The Split function along with a loop will do the job.
If your files always have constant formatting, even only in certain parts, you can take advantage of this. For example, if you know the amounts on a line will always have a dollar sign in front of them, then you know the dollar signs define the column widths, and you can use this on subsequent lines of text.
The following code will extract the Assets details from the text file; you can modify it to extract other sections. It should handle multiple rows. Hopefully I've commented it sufficiently. Have a look and I'll edit if needed to help out further.
Sub ReadInTextFile()
Dim fs As Scripting.FileSystemObject, fsFile As Scripting.TextStream
Dim sFileName As String, sLine As String, vYears As Variant
Dim iNoColumns As Integer, ii As Integer, iCount As Integer
Dim bIsTable As Boolean, bIsAssets As Boolean, bIsLiabilities As Boolean, bIsNetAssets As Boolean
Set fs = CreateObject("Scripting.FileSystemObject")
sFileName = "G:\Sample.txt"
Set fsFile = fs.OpenTextFile(sFileName, 1, False)
'Loop through the file as you've already done
Do While fsFile.AtEndOfStream <> True
'Determine flag positions in text file
sLine = fsFile.Readline
Debug.Print VBA.Len(sLine)
'Always skip empty lines (including single spaceS)
If VBA.Len(sLine) > 1 Then
'We've found a new table so we can reset the booleans
If VBA.InStr(1, sLine, "Table") > 0 Then
bIsTable = True
bIsAssets = False
bIsNetAssets = False
bIsLiabilities = False
iNoColumns = 0
End If
'Perhaps you want to also have some sort of way to designate that a table has finished. Like so
If VBA.Instr(1, sLine, "Some text that designates the end of the table") Then
bIsTable = False
End If
'If we're in the table section then we want to read in the data
If bIsTable Then
'Check for your different sections. You could make this constant if your text file allowed it.
If VBA.InStr(1, sLine, "Assets") > 0 And VBA.InStr(1, sLine, "Net") = 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = False
If VBA.InStr(1, sLine, "Liabilities") > 0 Then bIsAssets = False: bIsLiabilities = True: bIsNetAssets = False
If VBA.InStr(1, sLine, "Net Assets") > 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = True
'If we haven't triggered any of these booleans then we're at the column headings
If Not bIsAssets And Not bIsLiabilities And Not bIsNetAssets And VBA.InStr(1, sLine, "Table") = 0 Then
'Trim the current line to remove leading and trailing spaces then use the split function to determine the number of years
vYears = VBA.Split(VBA.Trim$(sLine), " ")
For ii = LBound(vYears) To UBound(vYears)
If VBA.Len(vYears(ii)) > 0 Then iNoColumns = iNoColumns + 1
Next ii
'Now we can redefine some variables to hold the information (you'll want to redim after you've collected the info)
ReDim sAssets(1 To iNoColumns + 1, 1 To 100) As String
ReDim iColumns(1 To iNoColumns) As Integer
Else
If bIsAssets Then
'Skip the heading line
If Not VBA.Trim$(sLine) = "Assets" Then
'Increment the counter
iCount = iCount + 1
'If iCount reaches its limit you'll have to ReDim Preserve your sAssets array (I'll leave this to you)
If iCount > 99 Then
'You'll find other posts on stackoverflow to do this
End If
'This will happen on the first row, it'll happen everytime you
'hit a $ sign but you could code to only do so the first time
If VBA.InStr(1, sLine, "$") > 0 Then
iColumns(1) = VBA.InStr(1, sLine, "$")
For ii = 2 To iNoColumns
'We need to start at the next character across
iColumns(ii) = VBA.InStr(iColumns(ii - 1) + 1, sLine, "$")
Next ii
End If
'The first part (the name) is simply up to the $ sign (trimmed of spaces)
sAssets(1, iCount) = VBA.Trim$(VBA.Mid$(sLine, 1, iColumns(1) - 1))
For ii = 2 To iNoColumns
'Then we can loop around for the rest
sAssets(ii, iCount) = VBA.Trim$(VBA.Mid$(sLine, iColumns(ii - 1) + 1, iColumns(ii) - iColumns(ii - 1) - 1))
Next ii
'Now do the last column
If VBA.Len(sLine) > iColumns(iNoColumns) Then
sAssets(iNoColumns + 1, iCount) = VBA.Trim$(VBA.Right$(sLine, VBA.Len(sLine) - iColumns(iNoColumns)))
End If
Else
'Reset the counter
iCount = 0
End If
End If
End If
End If
End If
Loop
'Clean up
fsFile.Close
Set fsFile = Nothing
Set fs = Nothing
End Sub
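The dollar-sign trick used in the routine above (remember where the $ signs sit on the first data row, then cut later rows at the same positions) is easier to see in isolation. Here is a minimal sketch in Python, not VBA; the function names and the two sample rows are made up, and it assumes the later rows are aligned to the same columns as the first one.

```python
def dollar_positions(line):
    """Return the 0-based index of every '$' in the line."""
    return [i for i, ch in enumerate(line) if ch == "$"]


def slice_by_positions(line, positions):
    """Cut a line into a label plus one value per column boundary."""
    bounds = positions + [len(line)]
    label = line[:positions[0]].strip()
    # Each value runs from just after one boundary up to the next boundary.
    values = [line[start + 1:end].strip() for start, end in zip(bounds, bounds[1:])]
    return [label] + values


# Made-up, column-aligned sample rows; only the first one carries $ signs.
row1 = "Cash".ljust(23) + "$ 1,200" + "   " + "$ 1,350"
row2 = "Inventory".ljust(27) + "300" + " " * 7 + "250"

positions = dollar_positions(row1)
print(positions)                            # [23, 33]
print(slice_by_positions(row1, positions))  # ['Cash', '1,200', '1,350']
print(slice_by_positions(row2, positions))  # ['Inventory', '300', '250']
```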
I cannot examine the sample data as the PasteBin has been removed. Based on what I can glean from the problem description, it seems to me that using Regular Expressions would make parsing the data much easier.
Add a reference to the Scripting Runtime scrrun.dll for the FileSystemObject.
Add a reference to the Microsoft VBScript Regular Expressions 5.5. library for the RegExp object.
Instantiate a RegEx object with
Dim objRE As New RegExp
Set the Pattern property to "(\b\d{4}\b){1,3}"
The above pattern should match on lines containing strings like:
2010
2010 2011
2010 2011 2012
The number of spaces between the year strings is irrelevant, as long as there is at least one (since we're not expecting to encounter strings like 201020112012 for example)
Set the Global property to True
The captured groups will be found in the individual Match objects from the MatchCollection returned by the Execute method of the RegEx object objRE. So declare the appropriate objects:
Dim objMatches as MatchCollection
Dim objMatch as Match
Dim intMatchCount 'tells you how many year strings were found, if any
Assuming you've set up a FileSystemObject object and are scanning the text file, reading each line into a variable strLine
First test to see if the current line contains the pattern sought:
If objRE.Test(strLine) Then
'do something
Else
'skip over this line
End If
Set objMatches = objRe.Execute(strLine)
intMatchCount = objMatches.Count
For i = 0 To intMatchCount - 1
'processing code such as writing the years as column headings in Excel
Set objMatch = objMatches(i)
ActiveCell.Value = objMatch.Value 'e.g. write the year to the active cell
'subsequent lines beneath the line containing the year strings should
'have the amounts, which may be captured in a similar fashion using an
'additional RegExp object and a Pattern such as "(\b\d+\b){1,3}" for
'whole numbers or "(\b\d+\.\d+\b){1,3}" for floats. For currency, you
'can use "(\$\d+\.\d{2}\b){1,3}" (no \b before \$, since a space followed by $ is not a word boundary)
Next i
This is just a rough outline of how I would approach this challenge. I hope there is something in this code outline that will be of help to you.
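For anyone who wants to try the patterns before wiring them into VBA, here is a quick sketch in Python. The year pattern is the same \b\d{4}\b idea as above; the AMOUNT pattern (optional $, thousands separators, optional decimals) and the function names are my own assumptions about what the amounts look like, not something taken from the sample file.

```python
import re

YEAR = re.compile(r"\b\d{4}\b")
# Assumed amount shape: optional $, digits with optional commas, optional decimals.
AMOUNT = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")


def header_years(line):
    """Return each standalone four-digit number on the line."""
    return YEAR.findall(line)


def amounts(line):
    """Return each number-like token on the line, keeping any leading $."""
    return AMOUNT.findall(line)


print(header_years("        2010   2011    2012"))  # ['2010', '2011', '2012']
print(header_years("201020112012"))                 # []
print(amounts("Cash   $1,200.50   $1,350.00"))      # ['$1,200.50', '$1,350.00']
```

The number of matches on the header line tells you how many value columns to expect on the lines beneath it.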
Another approach I have had some success with is to use VBA to convert the PDF to a .doc or .docx file and then search for and pull tables from the Word file. They can be easily extracted into Excel sheets. The conversion seems to handle tables nicely. Note, however, that it works on a page-by-page basis, so tables extending over a page break end up as separate tables in the Word document.
