word patterns within an excel column

I have 2 Excel data sets each comprising a column of word patterns and have been searching for a way to copy and group all instances of repetition within these columns into a new column.
This is the closest result I could find so far:
Sub Common5bis()
Dim Joined
Set d = CreateObject("Scripting.Dictionary") 'make dictionary
d.CompareMode = 1 'not case sensitive
a = Range("A1", Range("A" & Rows.Count).End(xlUp)).Value 'data to array
For i = 1 To UBound(a) 'loop through all records
If Len(a(i, 1)) >= 5 Then 'length at least 5
For l = 1 To Len(a(i, 1)) - 4 'all strings within record
s = Mid(a(i, 1), l, 5) 'that string
d(s) = d(s) + 1 'increment
Next
End If
Next
Joined = Application.Index(Array(d.Keys, d.items), 0, 0) 'join the keys and the items
With Range("D1").Resize(UBound(Joined, 2), 2) 'export range
.EntireColumn.ClearContents 'clear previous
.Value = Application.Transpose(Joined) 'write to sheet
.Sort .Range("B1"), xlDescending, Header:=xlNo 'sort descending
End With
End Sub
Which yielded this result for the particular question:
This example achieves 4 of the things I'm trying to achieve:
Identify repeating strings within a single column
Copies these strings into a separate column
Displays results in order of occurrence (in this case from least to most)
Displays the quantity of repetitions (including the first instance) in an adjacent column
However, although from reading the code there are basic things I've figured out that I can adapt to my purposes, it still fails to achieve these essential tasks which I'm still trying to figure out:
Identify individual words rather than single characters
I could possibly reduce the size from 5 to 3, but for the word strings I have (lists of pronouns from larger texts) that would catch "I I" repetitions but wouldn't work so well for "Your You" etc., whilst a length of 4 or 5 would miss anything starting with "I I"
Include an indefinite amount of values - looking at the code and the replies to the forum it comes from it looks like it's capped at 5, but I'm trying to find a way to identify all repetitions for all multiple word strings which could be something like "I I my you You Me I You my"
Is case sensitive - this is quite important as some words in the column have been capitalised to differentiate different uses
I'm still learning the basics of VBA but have manually typed out this example of what I'm trying to do with the code I've found above:
Intended outcome:
And so on
I'm a bit screwed at this point which is why I'm reaching out here (sorry if this is a stupid question, I'm brand new to VBA as my work almost never needs Excel, let alone macros) so will massively appreciate any constructive advice towards a solution!

Because I've been working with it recently, I note that you can obtain your desired output using Power Query, available in Windows Excel 2010+ and Office 365 Excel
Select some cell in your original table
Data => Get&Transform => From Table/Range or From within sheet
When the PQ UI opens, navigate to Home => Advanced Editor
Make note of the Table Name in Line 2 of the code.
Replace the existing code with the M-Code below
Change the table name in line 2 of the pasted code to your "real" table name
Examine any comments, and also the Applied Steps window, to better understand the algorithm and steps
First add a custom function:
New blank query
Rename per the code comment
Edits to make case-insensitive
Custom Function
//rename fnPatterns
//generate all possible patterns of two words or more
(String as text)=>
let
//split text string into individual words & get the count of words
#"Split Words" = List.Buffer(Text.Split(String," ")),
wordCount = List.Count(#"Split Words"),
//start position for each number of words
starts = List.Numbers(0, wordCount-1),
//number of words for each pattern (minimum of two (2) words in a pattern)
words = List.Reverse(List.Numbers(2, wordCount-1)),
//generate patterns as index into the List and number of words
// will be used in the List.Range function
patterns = List.Combine(List.Generate(
()=>[r={{0,wordCount}}, idx=0],
each [idx] < wordCount-1,
each [r=List.Transform({0..starts{[idx]+1}}, (li)=> {li, wordCount-[idx]-1}),
idx=[idx]+1],
each [r]
)),
//Generate a list of all the patterns by using the List.Range function
wordPatterns = List.Distinct(List.Accumulate(patterns, {}, (state, current)=>
state & {List.Range(#"Split Words", current{0}, current{1})}), Comparer.OrdinalIgnoreCase)
in
wordPatterns
Main Function
let
//change next line to reflect data source
//if data has a column name other than "Column1", that will need to be changed also wherever referenced
Source = Excel.CurrentWorkbook(){[Name="Table17"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
//Create a list of all the possible patterns for each string, added as a custom column
#"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "Patterns", each fnPatterns([Column1]), type list),
//removed unneeded original column of strings
#"Removed Columns" = Table.RemoveColumns(#"Invoked Custom Function",{"Column1"}),
//Expand the column of lists of lists into a column of lists
#"Expanded Patterns" = Table.ExpandListColumn(#"Removed Columns", "Patterns"),
//convert all lists to lower case for case-insensitive comparison
#"Added Custom" = Table.AddColumn(#"Expanded Patterns", "lower case patterns",
each List.Transform([Patterns], each Text.Lower(_))),
//Count number of matches for each pattern
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Count", each List.Count(List.Select(#"Added Custom"[lower case patterns], (li)=> li = [lower case patterns])), Int64.Type),
//Filter for matches of more than one (1)
// then remove duplicate patterns based on the "lower case pattern" column
#"Filtered Rows" = Table.SelectRows(#"Added Custom1", each ([Count] > 1)),
#"Removed Duplicates" = Table.Distinct(#"Filtered Rows", {"lower case patterns"}),
//Remove lower case pattern column and sort by count descending
#"Removed Columns1" = Table.RemoveColumns(#"Removed Duplicates",{"lower case patterns"}),
#"Sorted Rows" = Table.Sort(#"Removed Columns1",{{"Count", Order.Descending}}),
//Re-construct original patterns as text
#"Extracted Values" = Table.TransformColumns(#"Sorted Rows",
{"Patterns", each Text.Combine(List.Transform(_, Text.From), " "), type text})
in
#"Extracted Values"
Note that you could readily implement a similar algorithm using VBA, the VBA.Split function and a Dictionary
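To make the pattern-counting idea concrete, here is a minimal sketch in Python rather than VBA or M. It is only an illustration under assumptions, not the code above: the function names and the case_sensitive switch are hypothetical, and it counts every occurrence of a pattern (including repeats inside one cell), which may differ from how the M query counts.

```python
from collections import Counter


def word_patterns(text, min_words=2):
    """Yield every contiguous run of at least min_words words in text."""
    words = text.split()
    for size in range(min_words, len(words) + 1):
        for start in range(len(words) - size + 1):
            yield tuple(words[start:start + size])


def repeated_patterns(rows, case_sensitive=True):
    """Return (pattern, count) pairs for patterns seen more than once,
    sorted by count in descending order."""
    counts = Counter()
    for row in rows:
        for pattern in word_patterns(row):
            # Assumption: case-insensitive mode simply lower-cases each word.
            key = pattern if case_sensitive else tuple(w.lower() for w in pattern)
            counts[key] += 1
    repeated = [(" ".join(p), n) for p, n in counts.items() if n > 1]
    repeated.sort(key=lambda item: item[1], reverse=True)
    return repeated
```

Calling repeated_patterns(column_values) with the list of cell strings would give the repeated word patterns and their counts; passing case_sensitive=False treats "You" and "you" as the same word.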

Related

Excel Power Query: Keep only matched and row above

I wanted to know if Power Query in Excel can handle matching something from another worksheet and keeping only the matching row and the row above it all the while not sorting the list.
Above is the report I get sent daily. It contains orders going out. But we only give our customers their orders if they paid, which our system also catches as an "order". Our database is created that links these two orders together but it does it in a single column with the order in above the order out.
The above is the flat text file from the database that shows the OUT orders and the IN orders (i.e. payments). They are sorted by IN and linked OUT order. The numbers are randomly made by the system.
Can Power Query be used to import this flat text file from the database, match those OUT orders from "Today's OUTS" sheet and the OrdersINs which is always the single row above?
I want to just end up with a sheet that contains Today's OUTS and their linked Order INs.
Thank you.

Yes, it can.
Read in the two tables
Add an Index column to the "Links" table to be able to restore original order
Do Table.Join with JoinKind.FullOuter (all rows from both)
Sort according to the Index column
At this point one could either
add a custom column to reference the previous row if there is something in the OUTS column or,
my preference as it will often be faster: offset the Links column by one; then filter out the nulls
Please read the comments in the code and explore the Applied Steps to better understand the algorithm:
M Code
let
Source = Excel.CurrentWorkbook(){[Name="Outs"]}[Content],
Outs = Table.TransformColumnTypes(Source,{{"Today's OUTS", type text}}),
Source2 = Excel.CurrentWorkbook(){[Name="Links"]}[Content],
Links = Table.TransformColumnTypes(Source2,{{"Order Links", type text}}),
//Add index column to links to restore order after join
#"Added Index" = Table.AddIndexColumn(Links, "Index", 0, 1, Int64.Type),
Joined = Table.Join(Outs,"Today's OUTS", #"Added Index", "Order Links", JoinKind.FullOuter),
#"Sorted Rows" = Table.Sort(Joined,{{"Index", Order.Ascending}}),
#"Removed Columns" = Table.RemoveColumns(#"Sorted Rows",{"Index"}),
//offset Links by one row (usually faster than using Index to reference previous row)
prevRow = let
ShiftedList = {null} & List.RemoveLastN(Table.Column(#"Removed Columns", "Order Links"),1),
Custom1 = Table.ToColumns(#"Removed Columns") & {ShiftedList},
Custom2 = Table.FromColumns(Custom1, Table.ColumnNames(#"Removed Columns") & {"Order IN"})
in
Custom2,
#"Removed Columns1" = Table.RemoveColumns(prevRow,{"Order Links"}),
//Filter out the nulls
#"Filtered Rows" = Table.SelectRows(#"Removed Columns1", each ([#"Today's OUTS"] <> null))
in
#"Filtered Rows"
Edit: Outs without Links will show up in the Outs column with a blank in the In column. Not sure how you might want to handle this
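The "offset by one row" idea can be sketched outside Power Query as well. The following Python snippet is only an illustration with hypothetical names, assuming the links file is already loaded as an ordered list of order numbers. Unlike the full outer join above, it silently drops OUT orders that do not appear in the links list.

```python
def pair_outs_with_ins(order_links, todays_outs):
    """Pair each of today's OUT orders with the row directly above it
    in the (unsorted) links list, preserving the original order."""
    outs = set(todays_outs)
    # Shift the list down by one row so each entry lines up with its predecessor.
    shifted = [None] + order_links[:-1]
    return [(current, previous)
            for current, previous in zip(order_links, shifted)
            if current in outs]
```

For instance, with links ["IN1", "OUT1", "IN2", "OUT2"] and today's OUTS ["OUT2"], only the pair ("OUT2", "IN2") is kept.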

Retrieving several matches from a string excel

Sorry if this is a stupid question, but I've been racking my brain for a couple of days now and I can't seem to come up with a solution to this.
I have a list of phrases and a list of keywords that need to be searched, extracted and replaced.
For example, I have the following list of keywords in sheet 1 column A that need to be extracted and replaced with the keywords in column B.
red - orange
blue - violet
green - pink
yellow - brown
And in sheet 2 I have a list of phrases in column A.
The girl with blue eyes had a red scarf.
I saw a yellow flower.
My cousin has a red car with blue rims and green mirrors.
And I want to extract in column B the keywords that are matched for every phrase in the exact order that they appear like so:
COLUMN A                                                     COLUMN B
The girl with blue eyes had a red scarf.                     violet, orange
I saw a yellow flower.                                       brown
My cousin has a red car with blue rims and green mirrors.    orange, violet, pink
Is there any way this can be achieved either by formula or VBA? Also, this needs to be usable with Excel 2016, so I can't use fancy functions like "TEXTJOIN".
Thank you everyone in advance!
Cheers!
L.E.
I was able to find some code that almost does what I need it to do but it does not keep the correct order.
Is there anyway it could be modified to generate the desired results? Unfortunately I'm not that good with VBA. :(
Sub test()
Dim datacount As Long
Dim termcount As Long
datacount = Sheets("Sheet1").Cells(Rows.Count, "A").End(xlUp).Row
termcount = Sheets("Sheet2").Cells(Rows.Count, "A").End(xlUp).Row
For i = 1 To datacount
dataa = Sheets("Sheet1").Cells(i, "A").Text
result = ""
For j = 1 To termcount
terma = Sheets("Sheet2").Cells(j, "A").Text
termb = Sheets("Sheet2").Cells(j, "B").Text
If InStr(dataa, terma) > 0 Then
If result = "" Then
result = result & termb
Else
result = result & ", " & termb
End If
End If
Next j
Sheets("Sheet1").Cells(i, "B").Value = result
Next i
End Sub

You can do this with a User Defined Function making use of Regular Expressions.
The worksheet formula:
=matchWords(A2,$K$2:$L$5)
where A2 contains the sentence, and the second argument points to the translation table (which could be on another worksheet).
The code
Option Explicit
Function matchWords(ByVal s As String, translTbl As Range) As String
Dim RE As Object, MC As Object, M As Object
Dim AL As Object 'collect the replaced words
Dim TT As Variant
Dim I As Long
Dim vS As Variant
'create array
TT = translTbl
'initiate array for output
Set AL = CreateObject("system.collections.arraylist")
'initiate regular expression engine
Set RE = CreateObject("vbscript.regexp")
With RE
.Global = True
.ignorecase = True 'could change this if you want
.Pattern = "\w+" 'can change this if need to include some non letter/digit items
'split the sentence, excluding punctuation
If .test(s) Then
Set MC = .Execute(s)
For Each M In MC
For I = 1 To UBound(TT, 1)
If M = TT(I, 1) Then AL.Add TT(I, 2)
Next I
Next M
End If
End With
matchWords = Join(AL.toarray, ", ")
End Function
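The same word-by-word lookup can be sketched in Python for comparison. This is an illustrative sketch only, not a translation of the UDF: the function name and the dictionary argument are hypothetical, and it assumes the lookup should ignore case.

```python
import re


def match_words(sentence, translation):
    """Return the replacement words, in the order their keywords appear
    in the sentence, joined with ", "."""
    # Assumption: keywords are matched case-insensitively.
    lookup = {old.lower(): new for old, new in translation.items()}
    # \w+ splits the sentence into words and drops punctuation.
    words = re.findall(r"\w+", sentence)
    found = [lookup[w.lower()] for w in words if w.lower() in lookup]
    return ", ".join(found)
```

Because the sentence is scanned word by word rather than keyword by keyword, the output order follows the sentence, which is what the original macro got wrong.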

I would suggest you use Power Query, which has been built into Excel since Excel 2016 (and is available as an add-in for Excel 2010 and 2013).
Suppose the colour text strings on your Sheet1 are in a Table named Tbl_LookUp
Suppose the phrases on your Sheet2 are in another Table named Tbl_Phrases
Go to the Data tab of your Excel and load both tables to the Power Query Editor (you can google how to load data from a table to the PQ Editor in Excel 2016). Please note the screenshot is from Excel 365.
Once loaded, go to the Tbl_Phrases query, and action the following steps:
Add an indexed column starting from 1
Split the Phrases column by delimiter, use space as the delimiter and choose to put the outcome into rows
Merge the current query with the Tbl_LookUp query, use the Phrases column to match the Old Text column
Expand the new column to show contents from New Text column
Group the New Text column by the Index column, you can choose to sum the values in the New Text column, and it will come up as an error after the grouping. Go to the formula field and replace this part of the formula List.Sum([New Text]) with Text.Combine([New Text],", "). Hit enter and the error will be corrected to the desired text string.
The following is the full M Code for the above query. You can copy and paste it in the Advanced Editor without manually going through each step:
let
Source = Excel.CurrentWorkbook(){[Name="Tbl_Phrases"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Phrases", type text}}),
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1, Int64.Type),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Phrases", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Phrases"),
#"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Phrases", type text}}),
#"Merged Queries" = Table.NestedJoin(#"Changed Type1", {"Phrases"}, Tbl_LookUp, {"Old Text"}, "Tbl_Replace", JoinKind.LeftOuter),
#"Expanded Tbl_Replace" = Table.ExpandTableColumn(#"Merged Queries", "Tbl_Replace", {"New Text"}, {"New Text"}),
#"Grouped Rows" = Table.Group(#"Expanded Tbl_Replace", {"Index"}, {{"Look up color", each Text.Combine([New Text],", "), type nullable text}})
in
#"Grouped Rows"
When you finish adding an index column in the Tbl_Phrases query, which is Step 1 from the above, you can make a copy of the query (simply right click the original query and select "duplicate"); you will then have a second query called Tbl_Phrases (2). There is no need to work on this query until you have finished editing the original query so that it ends with the desired text strings.
Then you can merge the Tbl_Phrases (2) query with the Tbl_Phrases query using the index column. Expand the new column to show the content from the look up colour column. Lastly, merge the Phrases column with the look up color column with delimiter (space)-(space), and remove the index column, then you should have the desired text string.
Here is the M Code for the Tbl_Phrases (2) query. Just a reminder, you must finish with the Tbl_Phrases query first otherwise the merging query step will lead to an error:
let
Source = Excel.CurrentWorkbook(){[Name="Tbl_Phrases"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Phrases", type text}}),
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1, Int64.Type),
#"Merged Queries" = Table.NestedJoin(#"Added Index", {"Index"}, Tbl_Phrases, {"Index"}, "Tbl_Phrases", JoinKind.LeftOuter),
#"Expanded Tbl_Phrases" = Table.ExpandTableColumn(#"Merged Queries", "Tbl_Phrases", {"Look up color"}, {"Look up color"}),
#"Merged Columns" = Table.CombineColumns(#"Expanded Tbl_Phrases",{"Phrases", "Look up color"},Combiner.CombineTextByDelimiter(" - ", QuoteStyle.None),"Merged"),
#"Removed Columns" = Table.RemoveColumns(#"Merged Columns",{"Index"})
in
#"Removed Columns"
You can then load the Tbl_Phrases (2) query to the desired worksheet within the same workbook (or to somewhere on Sheet2).
Let me know if you have any questions.

Excel - How to count pairs in two columns containing lists

I have a number of farmers registered on my database. Each farmer grows a few fruits and sells to a few counties.
For every fruit / county pair (e.g. apple, Warwickshire), how do I count the number of farmers that can supply that combo?
I have over 100 farmers registered on my database.
So my database has a row for each farmer, a column for fruits and a column for the counties they cover. The fruits and counties that each farmer covers are recorded as comma separated lists in the two cells on that farmer's row.
I want to create a matrix with fruits on the horizontal and counties on the vertical to count how many farmers cover that particular combo.
For the example in the screenshot, I've tried:
=COUNTIF(A2:B4,AND(ISNUMBER(SEARCH(G11,A2,1)),ISNUMBER(SEARCH(A13,B2,1)))="TRUE")
but with no luck.

If you have Excel 2010+, you can do this with Power Query (aka Get & Transform in Excel 2016+).
Using Power Query allows you to update the table easily whenever any new products (or counties) are added. You just re-run the query after you add rows to the data table (or add a product or county to a given row).
Except for removing the extra spaces (Trim after splitting the columns), all can be done via the GUI. But you can just paste the M-Code into the Advanced Editor and then explore the GUI to study the individual steps.
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Products", type text}, {"Counties", type text}}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Changed Type", {{"Counties", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Counties"),
#"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Counties", type text}}),
#"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Changed Type1", {{"Products", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Products"),
#"Changed Type2" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Products", type text}}),
#"Added Custom" = Table.AddColumn(#"Changed Type2", "Prod", each Text.Trim([Products])),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "County", each Text.Trim([Counties])),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom1",{"Products", "Counties"}),
#"Grouped Rows" = Table.Group(#"Removed Columns", {"County", "Prod"}, {{"grouped", each _, type table [Prod=text, County=text]}, {"counts", each Table.RowCount(_), type number}}),
#"Removed Columns1" = Table.RemoveColumns(#"Grouped Rows",{"grouped"}),
#"Pivoted Column" = Table.Pivot(#"Removed Columns1", List.Distinct(#"Removed Columns1"[Prod]), "Prod", "counts", List.Sum)
in
#"Pivoted Column"
Original Data
Results
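The split, trim, group and count steps can be sketched in Python as follows. This is a rough illustration with hypothetical names, assuming each farmer is a (products, counties) pair of comma-separated strings. It counts each farmer at most once per pair, whereas the query above counts expanded rows.

```python
from collections import Counter


def count_pairs(farmers):
    """Count how many farmers cover each (county, product) combination."""
    counts = Counter()
    for products, counties in farmers:
        # Split the comma-separated lists and trim stray spaces.
        prods = {p.strip() for p in products.split(",") if p.strip()}
        cnts = {c.strip() for c in counties.split(",") if c.strip()}
        for county in cnts:
            for prod in prods:
                counts[(county, prod)] += 1
    return counts
```

Looking up counts[("Warwickshire", "apple")] then gives the number of farmers for that cell of the matrix; combinations nobody covers return 0.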

Just for "fun" I built a solution with AND() and FIND() :
IFERROR(AND(FIND(B$7,$A$2,1),FIND($A8,$B$2,1)),0)+IFERROR(AND(FIND(B$7,$A$3,1),FIND($A8,$B$3,1)),0)+IFERROR(AND(FIND(B$7,$A$4,1),FIND($A8,$B$4,1)),0)
You could wrap this in an IF() so you only show results greater than 1 which may make it easier to spot the ones wanted.
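IF(IFERROR(AND(FIND(B$7,$A$2,1),FIND($A8,$B$2,1)),0)+IFERROR(AND(FIND(B$7,$A$3,1),FIND($A8,$B$3,1)),0)+IFERROR(AND(FIND(B$7,$A$4,1),FIND($A8,$B$4,1)),0)<2,"",IFERROR(AND(FIND(B$7,$A$2,1),FIND($A8,$B$2,1)),0)+IFERROR(AND(FIND(B$7,$A$3,1),FIND($A8,$B$3,1)),0)+IFERROR(AND(FIND(B$7,$A$4,1),FIND($A8,$B$4,1)),0))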

Oh, I get it. It's the way the data has been tabulated by the form-filler(s), or the way it was given to the asker (I'm explaining it to others here so they get where I'm coming from). He probably wants to change the way these forms are filled in to make them easier to read and follow, and to get a more efficient, logical process going forward.
He has received confusing, badly compiled information/tables and wants to make them more straightforward and logical.
I think I got it working according to the way I understood the question, what I know how to do in Excel, and the information given. The way I did it works like "count the number of occurrences of a specific word in a string".
version 1:
=(LEN(lookupall("*"&B$8&"*",$A$2:$B$4,2))-LEN(SUBSTITUTE(lookupall("*"&B$8&"*",$A$2:$B$4,2),$A9,"")))/LEN(B$8)
and drag across and down.
or better:
version 2:
=(LEN(lookupall("*"&B$7&"*",$A$2:$B$4,2))-LEN(SUBSTITUTE(lookupall("*"&B$7&"*",$A$2:$B$4,2),$A8,"")))/LEN($A8)
[now 14:00 edited above - 5am written and it was off by some cells]
my results are :
version 1 results table:
&
version 2 results table: I think this is exactly what you wanted.
Notes: yes, in both source tables (A2:B4) I've abbreviated the names, but the data is the same: war = Warwickshire, app = apples, etc.
Which one does what you're seeking most?
lookupall is a UDF you can find on the net if you search around. It returns all VLOOKUP results, including duplicate lookups, concatenated together. It occurred to me that you can then count the number of times your values in column A (counties) appear in each of the results (fruit lookups), and then divide by the number of letters in a word (LEN(B$8) in version 1; LEN($A8), the substituted word itself, in version 2) to get the precise number.
In version 1, I think you have to round the results down/up (because when I remove the glos (Gloucestershire) entry in B2, for instance, the result in B12 becomes 0, which would be precise given those numbers). But version 2 is better: more accurate.
Is this kinda going in the right direction for you? Might be worth more tweaking, but I think given the approximate nature of question (the way I read it), its the best I can do. It would have been more accurate to tie ...
Though I am sure there are better, more versatile, generic-scientifically accurate, other-similar-table-applicable and precise formulas out there which would do it better in just 1 formula or 1 single UDF.
The lookupall UDF I use:
Function LookupAll(vVal, rTable As Range, ColumnI As Long) As Variant
Dim rFound As Range, lLoop As Long
Dim strResults As String
With rTable.Columns(1)
Set rFound = .Cells(1, 1)
For lLoop = 1 To WorksheetFunction.CountIf(.Cells, vVal)
Set rFound = .Find(what:=vVal, After:=rFound, LookIn:=xlFormulas, lookAt _
:=xlWhole, SearchOrder:=xlByRows, SearchDirection:=xlNext, MatchCase:= _
False, SearchFormat:=False)
strResults = strResults & ";" & rFound(1, ColumnI)
Next lLoop
End With
LookupAll = Trim(Right(strResults, Len(strResults) - 1))
End Function
This actually does this and (many other) jobs actually and has been a life-saver for much of my own work. (p.s. nobody asks me questions in the office & nobody gave me anything ! its all found, researched and discovered or made by me to survive!).
my Correct results table I am pleased to say is exactly the same as Solar Mikes! So Version 2 is correct
with
=(LEN(lookupall("*"&E$7&"*",$A$2:$B$4,2))-LEN(SUBSTITUTE(lookupall("*"&E$7&"*",$A$2:$B$4,2),$A11,"")))/LEN($A11)
in cell B8 and dragged down&across

How do you remove duplicate values from a single Excel cell

How do you remove duplicate values from a single excel cell (A1) using power query
For example:
Anish,Anish,Prakash,Prakash,Prakash,Anish~,Anish~
Result wanted as like:
Anish,Prakash,Anish~

Using Power Query, you can refer to a single cell in the current workbook if it is a named range. You could then use something like this, to list the distinct values:
let
Source = Excel.CurrentWorkbook(){[Name="MyCell"]}[Content],
#"Split List" = Text.Split(Source{0}[Column1],","),
#"Removed Duplicates" = List.Distinct(#"Split List"),
#"Combine Values" = Text.Combine(#"Removed Duplicates",",")
in
#"Combine Values"
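The split, de-duplicate and re-join logic is small enough to sketch in a few lines of Python. This is only an illustration of the same three steps, with a hypothetical function name; it keeps the first occurrence of each value and preserves the original order.

```python
def dedupe_cell(value, sep=","):
    """Remove duplicate items from a delimited string, keeping first occurrences."""
    items = value.split(sep)
    # dict.fromkeys keeps insertion order, so the first occurrence of each item wins.
    return sep.join(dict.fromkeys(items))
```

Note that "Anish" and "Anish~" are treated as different values, matching the result asked for in the question.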

I am new to M code. However, for others who might have a similar level of experience to mine, I studied it a little and I think the following might be easier to understand:
#"Added Custom1" = Table.AddColumn(#"Extracted Values1", "Split1", each Text.Split([#"Cust"],",")),
#"Added Custom2" = Table.AddColumn(#"Added Custom1", "RemoveDuplicate1", each List.Distinct([#"Split1"])),
#"Added Custom3" = Table.AddColumn(#"Added Custom2", "CombineValue1", each Text.Combine([#"RemoveDuplicate1"],",")),
Simply copy the above code into the Advanced Editor and change the column names accordingly. In my case, the column names are Cust, Split1, RemoveDuplicate1 and CombineValue1. Of course, the names of the added columns can also be different.
Basically, the 3 rows mean that we need to create 3 columns; if we create the 3 columns manually, we just need to copy and paste the code after "each" from each of the rows above.

Power Query Skewed data

I have a problem in power query where my data is coming from a report that is split into pages and some of the pages skew the data to different columns. I think there may be an error based solution, but I would like it to be more redundant and not rely on text vs. number error correction. Mainly because sometimes the data that could be alphabetic in some instances, can be numeric in others. I've prepared a data set that has randomly generated replacements for names and codes. I also had to butcher the data a little to give examples of the different shifts, and to account for records split from different pages.
https://drive.google.com/file/d/0B2qUbAWJXgfyNlByV2RHODJzQjA/view?usp=sharing
There are 12 records in the data set that will eventually contain one row per record.
1st page is the Raw data stripped from the source document. These are Check History records (masked) that need to be moved to a single row per record with separate columns for four specific areas:
[Names, Dates, Check numbers, etc][Earnings][Deductions][Taxes]
Record info, including names, dates, record ID numbers, and amounts, is the first thing extracted and formatted from the raw data. The steps I applied in NameData and CheckData show how those records are extracted and formatted; some of the skewed data in this section was also easy to reconcile with merge functions and conditional columns.
Each individual Pay Item (an earning code, deduction code, or tax code) is formatted, then pivoted to its own column. You can see an example of this maneuver in the Earnings query. The PayItemReference query contains some basic filters I use as a starting point for my Pay Items. You can see in that query that the codes shift from column to column, with text and numbers mixed. There can be spaces between the codes and their values, or no space at all, and the codes can also shift columns completely.
I am working on consolidating codes and their values to regular columns, then I can merge, unpivot, pivot etc to get to the final formatting. I have tried using conditional columns and errors, but there are always small issues with either on the original data set. I just need some fresh eyes and new approaches to the data.

This was a challenging task.
First, it is a good idea to split the table back into pages, since the column structure of each page is probably unique. So I form a list of tables, one table per page. Then each page has to be processed: extract the column names, add summary information to each row, filter out the rows that are not needed, and set the column names. This is done for each table in the list by the custom function ConvertTable. Afterwards you just combine the resulting tables.
Here:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
AddRowNum = Table.AddColumn(Table.AddIndexColumn(Source, "Index", 1, 1), "RowNum", each Number.Mod([Index]-1, 52)+1, type number),
CountTables = {1..(Number.RoundUp(Table.RowCount(AddRowNum)/52, 0))},
ListTables = List.Transform(CountTables, (ListItem)=>Table.SelectRows(AddRowNum, each [Index] > 52 * (ListItem - 1) and [Index] <= 52 * ListItem)),
ConvertTable = (tbl as table) as table =>
let
hdr1 = Table.Transpose(Table.FillDown(Table.Transpose(Table.FromRecords({tbl{6}})), {"Column1"})),
hdr2 = Table.FromRecords({tbl{7}}),
ColNames = Table.Transpose(Table.SelectColumns(Table.FirstN(Table.AddColumn(Table.Transpose(Table.Combine({hdr1, hdr2})), "ColumnName", each [Column1] & ": " & [Column2]), 19), {"ColumnName"})),
AddPayDate = Table.AddColumn(tbl, "Pay Date", each if [RowNum] > 8 and Text.Trim(tbl{[RowNum]-2}[Column9]) = "Pay Date" then [Column9] else null, type date),
AddPeriodEndDate = Table.AddColumn(AddPayDate, "Period End Date", each if [RowNum] > 8 and Text.Trim(tbl{[RowNum]-2}[Column12]) = "Period End Date" then [Column12] else null, type date),
AddJobCode = Table.AddColumn(AddPeriodEndDate, "Job Code", each if [RowNum] > 8 and Text.Trim(tbl{[RowNum]-2}[Column14]) = "Job Code" then [Column14] else null, Int64.Type),
AddCheckInfo = Table.AddColumn(AddJobCode, "Check Info", each if [RowNum] > 8 and Text.Trim([Column1]) = "Check Printed:" then Table.Transpose(Table.SelectRows(Table.Transpose(Table.FromRecords({_})), each [Column1] <> null)) else null),
ExpandedCheckInfo = Table.ExpandTableColumn(AddCheckInfo, "Check Info", {"Column4", "Column6", "Column8"}, {"Check Amount", "Direct Deposit", "Net"}),
FillUp = Table.FillUp(ExpandedCheckInfo, {"Column3", "Check Amount", "Direct Deposit", "Net"}),
FillDown = Table.FillDown(FillUp, {"Column1", "Column5", "Pay Date", "Period End Date", "Job Code"}),
AddCheckEEIDfixed = Table.AddColumn(FillDown, "Check:EEID.fixed", each Text.From([Column5]) & ":" & Text.From([Column3]), type text),
FilteredExtraRows = Table.SelectRows(AddCheckEEIDfixed, each [RowNum] > 8 and Text.Trim([Column1]) <> "Check Printed:" and Text.Trim([Column7]) <> "PerControl" and Text.Trim(tbl{[RowNum]-2}[Column7]) <> "PerControl" and [#"Check:EEID.fixed"] <> null),
DemotedHeaders = Table.DemoteHeaders(FilteredExtraRows),
GetColumnNames1 = Table.Combine({Table.FromRecords({DemotedHeaders{0}}), ColNames}),
GetColumnNames2 = Table.PromoteHeaders(Table.FillDown(GetColumnNames1, Table.ColumnNames(GetColumnNames1))),
SetColumnNames = Table.PromoteHeaders(Table.Combine({GetColumnNames2, FilteredExtraRows}))
in
SetColumnNames,
ConvertedList = List.Transform(ListTables, (t) => ConvertTable(t)),
GetWholeTable = Table.Combine(ConvertedList)
in
GetWholeTable
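The overall shape of this approach (split into fixed-size pages, convert each page, combine the results) can be sketched in Python. This is only a structural sketch under assumptions: the function names are hypothetical, convert_page stands in for ConvertTable, and the page size of 52 rows is taken from the query above.

```python
def split_into_pages(rows, page_size=52):
    """Split the flat list of rows back into consecutive pages."""
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]


def process_report(rows, convert_page, page_size=52):
    """Convert each page separately, then combine the converted rows."""
    combined = []
    for page in split_into_pages(rows, page_size):
        # convert_page is a placeholder for the per-page clean-up logic.
        combined.extend(convert_page(page))
    return combined
```

Keeping the per-page logic in its own function means each page can work out its own column layout before the results are appended together.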
