How to turn comma separated list into Venn Diagram Format in Power Query - pivot

My goal is to create a Venn diagram out of database data. I am happy to achieve this in either of two ways:
Using Flurish.com, which requires me to solve this Power Query problem.
Being suggested other software that can automatically create Venn diagrams.
The database table looks like this:
Name | Shared Interests
Person 1 | Camping, Road Trips, Acro, Firetwirling
Person 2 | Camping
Person 3 | Road Trips
Person 4 | Road Trips
Person 5 | Acro
Person 6 | Firetwirling
Person 7 | Camping, Road Trips
etc |
Flurish.com requires me to get it into this format:
Name | Value
Camping | 3
Road Trips | 4
Acro | 2
Firetwirling | 2
Camping-Road Trips | 2
Camping-Acro | 1
Camping-Firetwirling | 1
Camping-Road Trips-Acro | 1
Camping-Road Trips-Firetwirling | 1
Camping-Road Trips-Firetwirling-Acro | 1
... (etc. with every possible combination separated by a -)
How can this be achieved?
I have attempted to create a unique list of all 'Shared Interests', but I don't know how to add new rows for every unique combination separated by a dash (-). I also don't know how to calculate the Value column for these overlapping groups. I suspect pivot functions are the way to go here, but so far I have not had any luck.

In Power Query, try:
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // Combo algorithm adapted from Bill Szysz, 2017
    process = (Items as list) as list =>
        let
            AddIndex = Table.AddIndexColumn(Table.FromList(List.Sort(Items)), "Index", 0, 1),
            ReverseIndeks = Table.AddIndexColumn(AddIndex, "RevIdx", Table.RowCount(AddIndex), -1),
            Lists = Table.AddColumn(ReverseIndeks, "lists", each List.Repeat(List.Combine({List.Repeat({[Column1]}, Number.Power(2, [RevIdx] - 1)), List.Repeat({null}, Number.Power(2, [RevIdx] - 1))}), Number.Power(2, [Index]))),
            ResultTable = Table.FromColumns(Lists[lists]),
            AllCombos = List.Sort(List.RemoveLastN(Table.AddColumn(ResultTable, "Custom", each Text.Combine(List.RemoveNulls(Record.ToList(_)), "-"))[Custom], 1))
        in
            AllCombos,
    #"Added Custom" = Table.AddColumn(Source, "combos", each process(Text.Split([Shared Interests], ", "))),
    #"Expanded combos" = Table.ExpandListColumn(#"Added Custom", "combos"),
    #"Grouped Rows" = Table.Group(#"Expanded combos", {"combos"}, {{"Count", each Table.RowCount(_), Int64.Type}})
in
    #"Grouped Rows"

How to speed up dynamic columns with formulas in Power Query

The Question (How do I make it faster)
I have been playing around with Power Query in Excel for over a year now but for the first time, I have a query that takes 20+ minutes to run.
I am sure there is something here I can learn!
While it does currently work, I believe that if it were well written it would run much faster.
Data Structure
There are two databases here
Database of Company (Aka attendees) - About 400 rows
Company Title
Rita Book
Paige Turner
Dee End
etc
Database of Events - About 500 rows
An Event can have many Company (Attendees). The database exports this as a comma-separated list in the column [#"Export CSV - Company"]
Event Title | Export CSV - Company | Date | Year
Event 1 | Rita Book, Dee End | 1/1/2015 | 2015
Event 2 | Paige Turner | 2/1/2015 | 2015
Event 3 | Dee End | 3/1/2015 | 2015
Event 4 | Rita Book, Paige Turner, Dee End | 1/1/2016 | 2016
etc | ... | ... | ...
Note that I also have a separate query called #"Company Event Count - 1 Years List" which is a list of all years that events have been run.
The Goal
For a visualization, I need to get the data into the following structure:
Company Title | 2015 | 2016 | etc
John Smith | 10 | 20 | ...
Jane Doe | 5 | 14 | ...
etc | ... | ... | ...
The Code
I have done my best to comment on my code below. Feel free to ask any questions.
let
    // This is a function. It was the only way I could figure out how to use [Company Title] from #"Keep only names column" and "currentColumnTitleYearStr" from the dynamically created columns in the same scope
    count_table_year_company = (myTbl, yearStr, companyStr) =>
        Table.RowCount(
            Table.SelectRows(
                myTbl,
                each Text.Contains([#"Export CSV - Company"], companyStr)
            )
        ),
    Source = #"Company 1 - Loaded CSV From Folder", // Grab a list of all Company
    #"Keep only names column" = Table.SelectColumns(Source, {"Company Title"}), // Keep only the [Company Title] field
    // Dynamically create columns for each year. Example columns: [Company Title], [2015], [2016], [2017], etc
    #"Add Columns for each year" =
        List.Accumulate(
            #"Company Event Count - 1 Years List", // Get a list of all years
            #"Keep only names column",
            (state, currentColumnTitleYearStr) => Table.AddColumn(
                state,
                currentColumnTitleYearStr, // The Year becomes the column title and is also used in filters
                let // I hoped that filtering the table by Year at this point would mean it only has to do it once per column, instead of once per cell.
                    eventsThisYearTbl = Table.SelectRows(
                        #"Event 1 - Loaded CSV From Folder",
                        each ([Year] = Number.FromText(currentColumnTitleYearStr))
                    )
                in (
                    // Finally, for each cell, calculate the count of events. E.g. how many events did 'John Smith' attend in 2015?
                    each count_table_year_company(eventsThisYearTbl, currentColumnTitleYearStr, [Company Title]) // CompanyTitleVar
                )
            )
        ),
    FinalStep = #"Add Columns for each year"
in
    FinalStep
My Theories
I believe one of a few things may be making it slow:
I am using List.Accumulate to dynamically create a column for each year. While this does work, I think it may be the wrong function for the job, especially because the state value, which carries the growing table from step to step, must become huge.
I worry that I have an 'each' where I don't need it, but I can't seem to remove any. It's my understanding that every 'each' is effectively a nested loop, so removing one may have a dramatic impact on performance.
In Conclusion
While it does currently work, I know there is something for me to learn here.
Thank you so much for any guidance or suggested readings you can provide :)
Does this do what you want? It converts from left to right. If not, please explain more clearly.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    SplitNames = Table.TransformColumns(Source, {{"Names", each Text.Split(_, ", ")}}),
    #"Expanded Names" = Table.ExpandListColumn(SplitNames, "Names"),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Names", {"Event Title", "Date"}),
    #"Added Custom" = Table.AddColumn(#"Removed Columns", "Count", each 1),
    #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Added Custom", {{"Year", type text}}, "en-US"), List.Distinct(Table.TransformColumnTypes(#"Added Custom", {{"Year", type text}}, "en-US")[Year]), "Year", "Count", List.Sum)
in
    #"Pivoted Column"

Having trouble using power query to compare two excel csv's and showing differences (additions/deletions) between two sheets

I have a daily task where I need to compare two .csv files that have data on users and which keysets they are assigned. On a daily basis I receive the new .csv via email and check to see which users were added or removed and which users had their keysets changed. Each file has around 1000 users on it. I have it set up so that the two files are in a folder that the Power Query runs off of. This has been my first project with it, so I'm seeing if it can be used for this.
I played with Power Query and was able to show the differences between the users on the lists. I imported and transformed the data, removed unnecessary columns, and grouped the user column to count how many times each name appeared; names found twice were deselected. This showed me the differences but lacked the comparison I am attempting to reach.
I plan on attempting other methods but am not experienced with all the things Power Query can do. I have made a test dataset to see if anyone has a good idea on how to have this report created.
Day 1 | Keys
Dave 1 | Key 1/ Key 2/ Key 3
Dave 2 | Key 4/ Key 5
Dave 3 | Key 1
Dave 4 | Key 3/ Key 5

Day 2 | Keys
Dave 2 | Key 1/ Key 5
Dave 3 | Key 1
Dave 4 | Key 3/ Key 5
Dave 5 | Key 1
The results should show that Dave 1 was removed, Dave 5 was added, and Dave 2 had a key change, with the changed keys shown.
If anyone has an idea on how to create this please let me know or point me in a direction on where to find a result. I only have access to excel on my work computer so I am trying to find a way to use the software available rather than convincing the bosses to purchase anything new.
Tutorialized Answer
This approach collects data from a folder called Key Files that contains a set of daily csv files. It summarizes every file in the folder into a normalized table of change events from one day to the next:
Deleted - name has been deleted
Added - name has been added
Key Added - name has a key added
Key Deleted - name has a key removed
The output table will contain four fields:
Date - taken from the CSV
Name - related to the event
Event - calculated as defined above
Key - if a key is added or removed, otherwise null
Step 1 - Get files from folder
Because you also asked how to get the data into Power Query, here is an illustration. You can research more techniques like this once you see how it is done and then go from there. I put the csv files into a folder called Key Files within my Documents folder for this illustration, and made three files so that the example is clear.
For reference, the files contain the following data:
keyfile0.csv

15/10/2021 | Keys
Dave 0 | Key 2/ Key 3
Dave 1 | Key 1/ Key 2/ Key 3
Dave 2 | Key 4/ Key 5
Dave 3 | Key 5
Dave 4 | Key 3/ Key 5

keyfile1.csv

16/10/2021 | Keys
Dave 1 | Key 1/ Key 2/ Key 3
Dave 2 | Key 4/ Key 5
Dave 3 | Key 1
Dave 4 | Key 3/ Key 5
Dave 6 | Key 2/ Key 3

keyfile2.csv

17/10/2021 | Keys
Dave 2 | Key 1/ Key 5
Dave 3 | Key 1
Dave 4 | Key 3/ Key 5
Dave 5 | Key 1
Dave 6 | Key 3/ Key 5
To get these files, use Get Data >> From Folder from the Data tab as follows:
My test folder path for this example is: C:\Users\Admin\Documents\Key Files
You can also get files via From Text/CSV, but if your two CSV files have constantly changing names, you would need to modify your Power Query script each time you run it. From the description in your comments, I think it would be easier to put all csv's into a folder and let the script adapt.
You will be given a window that looks like this:
You will want to choose Combine & Transform Data. After that, it will bring up a table based on the first file it sees and you can click OK. Now this needs an explanation - PQ created a script and a function to read all of the files in that folder and append them into a single table. This approach lets you ingest all of the files at once and never worry about their names. The price you pay for this convenience is that you have to split them back into logical day values, as you can see from this screenshot of the Key Files table:
Somehow, row 8 has to be matched against row 3 to see if Dave 1 changed keys and then you need to be able to detect that on 17 Oct, Dave 1 was deleted. At the same time, Dave 2 in row 14 needs to be compared to Dave 2 in row 9 and NOT in row 4. So you need some way of knowing the sequence of days. Either:
the filenames must be serialized in some order OR
your header Day 1, Day 2, etc. must have a serialized value like the actual dates.
I have chosen 2 because I cannot guess at how your filenames will be structured and 2 is harder to implement, so it is better for a tutorialized answer. I will do that in the next step, so let's stop here and show the script that creates the Key Files table above:
let
    Source = Folder.Files("C:\Users\Admin\Documents\Key Files"),
    #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
    #"Invoke Custom Function1" = Table.AddColumn(#"Filtered Hidden Files1", "Transform File", each #"Transform File"([Content])),
    #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "Transform File"}),
    #"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File", Table.ColumnNames(#"Transform File"(#"Sample File"))),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1", {{"Source.Name", type text}, {"Column1", type text}, {"Column2", type text}})
in
    #"Changed Type"
Step 2 - Transform the Key File Table
As shown above, this step is required simply because of the choice to read from a folder. If I were doing this in practice, instead of as a tutorial, I would streamline it, but here I will do it in separable steps. For illustration, I am going to create two tables from the Key Files table:
filedates contains the file name and an associated date
nTable is the normalized table that will be used in the final step to deliver the calculated outcomes.
To create these, right-click the Key Files table and select Reference.
Do this twice. It will create two tables called Key Files (2) and Key Files (3).
Rename these to filedates and nTable. Each of these must be transformed. I won't go into the details, as that would make this a much longer post, but here is the M script for each.
For filedates you just want a table of filenames and their dates. There are many ways to do that, but I just filtered for the word "Keys" because it computes fast:
let
    Source = #"Key Files",
    #"Filtered Rows" = Table.SelectRows(Source, each ([Column2] = "Keys"))
in
    #"Filtered Rows"
For nTable you want to remove the "Keys" headers and then merge the result back with filedates so that you can have a serialized reference. As stated above, I chose to use the Date as the serial reference. I then split the Keys by row and renamed/removed columns.
let
    Source = #"Key Files",
    #"Filtered Rows" = Table.SelectRows(Source, each ([Column2] <> "Keys")),
    #"Merged Queries" = Table.NestedJoin(#"Filtered Rows", {"Source.Name"}, filedates, {"Source.Name"}, "filedates", JoinKind.LeftOuter),
    #"Expanded filedates" = Table.ExpandTableColumn(#"Merged Queries", "filedates", {"Column1"}, {"filedates.Column1"}),
    #"Renamed Columns" = Table.RenameColumns(#"Expanded filedates", {{"filedates.Column1", "Date"}, {"Column1", "Name"}, {"Column2", "Keys"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns", {{"Date", type date}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type", {"Source.Name"}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Columns", {{"Keys", Splitter.SplitTextByDelimiter("/ ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Keys"),
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter", {{"Keys", type text}}),
    #"Trimmed Text" = Table.TransformColumns(#"Changed Type1", {{"Keys", Text.Trim, type text}})
in
    #"Trimmed Text"
The result of nTable looks like this:
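(The original screenshot is not reproduced here; based on the sample files above, the first rows would be approximately:)

Name | Keys | Date
Dave 0 | Key 2 | 15/10/2021
Dave 0 | Key 3 | 15/10/2021
Dave 1 | Key 1 | 15/10/2021
Dave 1 | Key 2 | 15/10/2021
Dave 1 | Key 3 | 15/10/2021
... (one row per name/key/date combination)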
Step 3 - Calculating the Result
So this is the answer to your question. Again, I will produce it as a separate step for illustration and modularity.
NB: this is where Ron Rosenfeld said you could simply push this out and then do all the processing in Excel. The remaining steps are complex. I did not create PQ Functions because it would be harder to show and understand. This is more of a tutorial to see how you could do things. With more Power Query knowledge, you can modify this to suit your needs.
The following is the M script that takes in the nTable and produces a table I named output. You can go into Advanced Editor and paste this as a new source. Afterwards you can Close & Load To your Excel sheet to see the table results.
let
    Source = nTable,
    SortedNTable = Table.Sort(Source, {{"Name", Order.Ascending}, {"Keys", Order.Ascending}, {"Date", Order.Ascending}}),
    UniqueNameDates = Table.Distinct(SortedNTable, {"Name", "Date"}),
    CalculatedLatest = List.Max(SortedNTable[Date]),
    CalculatedEarliest = List.Min(SortedNTable[Date]),
    NamesFirstSeen = Table.Group(SortedNTable, {"Name"}, {{"Date", each List.Min([Date]), type nullable date}}),
    NamesAdded = Table.AddColumn(NamesFirstSeen, "Event", each "Added"),
    NamesLastSeen = Table.Group(SortedNTable, {"Name"}, {{"LSDate", each List.Max([Date]), type nullable date}}),
    NamesDeleted = Table.AddColumn(NamesLastSeen, "Event", each "Deleted"),
    AdjNamesDeleted = Table.AddColumn(NamesDeleted, "Date", each Date.AddDays([LSDate], 1)), // names are deleted on the day after last seen
    NameKeysFirstSeen = Table.Group(SortedNTable, {"Name", "Keys"}, {{"Date", each List.Min([Date]), type nullable date}}),
    KeysAdded = Table.AddColumn(NameKeysFirstSeen, "Event", each "Key Added"),
    NameKeysLastSeen = Table.Group(SortedNTable, {"Name", "Keys"}, {{"LSDate", each List.Max([Date]), type nullable date}}),
    KeysDeleted = Table.AddColumn(NameKeysLastSeen, "Event", each "Key Deleted"),
    AdjKeysDeleted = Table.AddColumn(KeysDeleted, "Date", each Date.AddDays([LSDate], 1)), // keys are deleted on the day after last seen
    // bring it all together
    #"Appended Query" = Table.Combine({NamesAdded, AdjNamesDeleted, KeysAdded, AdjKeysDeleted}),
    #"Removed Columns" = Table.RemoveColumns(#"Appended Query", {"LSDate"}),
    // filter out first-day adds and last-day deletes
    #"Filtered Rows" = Table.SelectRows(#"Removed Columns", each [Date] <> CalculatedEarliest or not Text.Contains([Event], "Added")),
    #"Filtered Rows1" = Table.SelectRows(#"Filtered Rows", each [Date] <> Date.AddDays(CalculatedLatest, 1) or not Text.Contains([Event], "Deleted")),
    #"Changed Type" = Table.TransformColumnTypes(#"Filtered Rows1", {{"Name", type text}, {"Date", type date}, {"Event", type text}, {"Keys", type text}}),
    #"Sorted Rows" = Table.Sort(#"Changed Type", {{"Name", Order.Ascending}, {"Date", Order.Ascending}})
in
    #"Sorted Rows"
The script above uses descriptive variable names to make the steps clear, and it has some limited // comments inside. Pasting it into the Advanced Editor (after you have done Steps 1 & 2) will let you see it more clearly and examine the output table at each step.
Summary
Here is the result based on the fake data I made above:
Name | Date | Event | Keys
Dave 0 | 16/10/2021 | Deleted |
Dave 0 | 16/10/2021 | Key Deleted | Key 3
Dave 0 | 16/10/2021 | Key Deleted | Key 2
Dave 1 | 17/10/2021 | Deleted |
Dave 1 | 17/10/2021 | Key Deleted | Key 3
Dave 1 | 17/10/2021 | Key Deleted | Key 2
Dave 1 | 17/10/2021 | Key Deleted | Key 1
Dave 2 | 17/10/2021 | Key Added | Key 1
Dave 2 | 17/10/2021 | Key Deleted | Key 4
Dave 3 | 16/10/2021 | Key Added | Key 1
Dave 3 | 16/10/2021 | Key Deleted | Key 5
Dave 5 | 17/10/2021 | Added |
Dave 5 | 17/10/2021 | Key Added | Key 1
Dave 6 | 16/10/2021 | Added |
Dave 6 | 16/10/2021 | Key Added | Key 2
Dave 6 | 16/10/2021 | Key Added | Key 3
Dave 6 | 17/10/2021 | Key Deleted | Key 2
Dave 6 | 17/10/2021 | Key Added | Key 5
So, as Ron pointed out in the comments, that last step is very involved. It delivers the events in a way that matches your criteria, but it may still not be exactly what you are looking for. In any case, this 3 step approach allows you to dump all of the CSV files that you want to process into a single folder and then process all of them, no matter how large or how many there are.
Here is the previous answer, which is quite simplistic but shows the basic idea.
Basic Answer
To produce the M code in Power Query that mimics your situation, I am using Table1 as the Day1 table and Table2 as the Day2 table. Assuming that you have ingested these into Power Query, the script could be:
let
    Source = Table.NestedJoin(Table2, {"Day 2"}, Table1, {"Day 1"}, "Table1", JoinKind.FullOuter),
    #"Expanded Table1" = Table.ExpandTableColumn(Source, "Table1", {"Day 1", "Keys"}, {"Table1.Day 1", "Table1.Keys"}),
    #"Added Conditional Column" = Table.AddColumn(#"Expanded Table1", "Status", each if [Day 2] = null then "Deleted" else if [Table1.Day 1] = null then "New" else if [Keys] <> [Table1.Keys] then "Changed Keys" else null),
    #"Filtered Rows" = Table.SelectRows(#"Added Conditional Column", each ([Status] <> null)),
    #"Added Conditional Column1" = Table.AddColumn(#"Filtered Rows", "Name", each if [Status] = "New" then [Day 2] else null),
    #"Merged Columns" = Table.CombineColumns(#"Added Conditional Column1", {"Name", "Table1.Day 1"}, Combiner.CombineTextByDelimiter("", QuoteStyle.None), "Name"),
    #"Removed Other Columns" = Table.SelectColumns(#"Merged Columns", {"Name", "Status"})
in
    #"Removed Other Columns"
This does a full outer join of the two tables, then uses a conditional column to determine whether each row is New, Deleted, Changed Keys, or no change (null). It then filters out the no-change rows and applies another conditional column to identify the New instances, copying the name from Day 2. It merges that conditional column with the Table1.Day 1 names to make a consolidated list of names, and removes the unnecessary columns, leaving you with a normalized table of names and statuses. I don't know how you want it presented, but with such a normalized table, you can further shape it in either Power Query or Excel.
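The basic answer only flags "Changed Keys". If you also want to see which keys changed, here is a hedged sketch (not part of the original answer) of two extra steps you could insert before #"Removed Other Columns", assuming the same "Key 1/ Key 2" delimiter format as the sample data:

// Keys present on Day 1 but not Day 2 (the if-guard keeps null keysets from erroring)
#"Added Keys Removed" = Table.AddColumn(#"Merged Columns", "Keys Removed", each
    if [Status] = "Changed Keys"
    then Text.Combine(List.Difference(Text.Split([Table1.Keys], "/ "), Text.Split([Keys], "/ ")), "/ ")
    else null),
// Keys present on Day 2 but not Day 1
#"Added Keys Added" = Table.AddColumn(#"Added Keys Removed", "Keys Added", each
    if [Status] = "Changed Keys"
    then Text.Combine(List.Difference(Text.Split([Keys], "/ "), Text.Split([Table1.Keys], "/ ")), "/ ")
    else null),

You would then include "Keys Removed" and "Keys Added" in the final Table.SelectColumns list. With the sample data, Dave 2's row would show Key 4 removed and Key 1 added.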

Find and append dates in multiple columns based on a date range

I'm looking for a way to find dates in multiple columns that fall within the past week and output those dates. The goal is to make it expandable through many more columns.
We will assume the week is 1-Sep through 7-Sep
EX:
Name | Box 1 | Box 2
Bill | 1-Sep-21 | 3-Sep-21
Bob | 30-Aug-21 | 3-Sep-21
Jeff | 31-Aug-21 | 4-Sep-21
Sam | 31-Aug-21 | 29-Aug-21
Output

Name | Item Sold | Date
Bill | Box 1 | 1-Sep-21
Bill | Box 2 | 3-Sep-21
Bob | Box 2 | 3-Sep-21
Jeff | Box 2 | 4-Sep-21
My first thought would be to unpivot the date columns using Power Query and then apply a date filter to the result.
The code in the Advanced Editor would look something like:
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(Source, {"Name"}, "Attribute", "Value"),
    #"Filtered Rows" = Table.SelectRows(#"Unpivoted Other Columns", each [Value] >= #datetime(2021, 9, 1, 0, 0, 0) and [Value] <= #datetime(2021, 9, 7, 0, 0, 0))
in
    #"Filtered Rows"
By using unpivot other columns, you'd be able to add as many date columns to the original table as you like.
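If "the past week" should track today's date instead of being hard-coded, here is a hedged variation (assuming, as above, that the cells come through as datetimes; the "Item Sold"/"Date" column names just match the desired output):

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // Unpivot every column except Name, naming the new columns to match the desired output
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(Source, {"Name"}, "Item Sold", "Date"),
    // Compute the window relative to the refresh date
    Today = DateTime.Date(DateTime.LocalNow()),
    WeekAgo = Date.AddDays(Today, -7),
    #"Filtered Rows" = Table.SelectRows(#"Unpivoted Other Columns",
        each DateTime.Date([Date]) >= WeekAgo and DateTime.Date([Date]) <= Today)
in
    #"Filtered Rows"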

VERY messy data: How do I clean up horizontal data that is all inconsistent? [VBA] [closed]

I keep trying ways to fix a dataset, but keep running into problems because of how inconsistent it is.
Here's what the data looks like:
Entry1
Age | 45
Occupation | Scientist
Phone Number | 408-283-3721
User I.D. | 390842
Housing Type | Condo
Square Footage | 1073.29
Floors | 2
Bathrooms | 2.5
Budget Max | $289,287
Household Size | 3
Pets? | Yes

Entry2
Floors | 2
Square Footage | 1974.19
User I.D. | 379733
Phone Number | 312-246-9121
Pets? | No
Budget Max | $481,621
Household Size | 4
Bathrooms | 3
Housing Type | Apartment
Occupation | Pilot
Age | 32

Entry3
User I.D. | 379621
Floors | 1
Square Footage | 1223.12
Pets? | No
Occupation | Managing Director
Budget Max | $402,342
Phone Number | 714-343-1358
Household Size | 2
Age | 31
Bathrooms | 2
Housing Type | House
I want to create a new, cleaned dataset with headers along the top (e.g. "Age", "Occupation", etc.) and the associated values (found to the right of each variable-name cell) filled in as rows underneath each column.
The variable names are all mixed up, not always in the same column or relative row, so it's not just a matter of transposing into a clean new dataset but of finding the appropriate value depending on where each variable is (I'm thinking something like .Cells.Find(What:="the variable name") for each one, somehow returning the value next to it in a loop). Then there's the issue that some entries have 3 rows and 8 columns and others 4 rows and 6 columns (and not all rows are full). I also struggle with placing the values under the appropriate column header without replacing the previous value (i.e. not just changing one cell but adding to the one below it, and so on).
There are over 400 records like this, so doing it manually would be super tedious. I'm fairly certain these are all the variations, though.
Loop through the data row by row.
If only the first column has data, it is the header of an entry. Write that to column A of a new workbook:

Entry Name
Entry1

Then go to the next row. If more than 2 columns have data, it is a data row belonging to the previous entry. Data rows contain data in blocks of 2 cells, where the first cell of a block is the data description and the second is the data value.
So you need to loop through the columns of the data rows in blocks of 2:
Take the first block which is Age | 45
Check if the column Age exists. Here it does not, so we name the next free column Age and fill in the value for the last entry:

Entry Name | Age
Entry1 | 45
Then we move on to the next block, Occupation | Scientist, and do the same. Does a column Occupation exist? No, so insert it in the next free column:

Entry Name | Age | Occupation
Entry1 | 45 | Scientist
We do this until the entire row is done, then we move over to the next one and if this is a data row too, we keep going until we find a new entry header.
So after the first entry your data would look like this:
Entry Name | Age | Occupation | Phone Number | User I.D. | Housing Type | Square Footage | Floors | Bathrooms | Budget Max | Pets?
Entry1 | 45 | Scientist | 408-283-3721 | 390842 | Condo | 1073.29 | 2 | 2.5 | $289,287 | Yes
Then you move over to the next entry
Entry Name | Age | Occupation | Phone Number | User I.D. | Housing Type | Square Footage | Floors | Bathrooms | Budget Max | Pets?
Entry1 | 45 | Scientist | 408-283-3721 | 390842 | Condo | 1073.29 | 2 | 2.5 | $289,287 | Yes
Entry2 | | | | | | | | | |
The first data block here is Floors | 2, so you search the first row for Floors; it is found in column 8. So we write 2 into column 8:
Entry Name | Age | Occupation | Phone Number | User I.D. | Housing Type | Square Footage | Floors | Bathrooms | Budget Max | Pets?
Entry1 | 45 | Scientist | 408-283-3721 | 390842 | Condo | 1073.29 | 2 | 2.5 | $289,287 | Yes
Entry2 | | | | | | | 2 | | |
If you keep that going you have cleaned up data in the end.
If your real data corresponds to your example, where all the parameters are spelled identically, you can do this using Power Query.
If there are variations in your data that this table doesn't show, examples of these variations would be needed to craft a better solution.
Select some cell in your Data Table
Data => Get&Transform => from Table/Range
Select My data does NOT have headers
When the PQ Editor opens: Home => Advanced Editor
Make note of the Table name in Line 2
Paste the M Code below in place of what you see
Change the Table name in line 2 back to what was generated originally.
Read the comments and explore the Applied Steps to understand the algorithm
M Code (Modified to deal with missing Parameter Values)
let
    Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],
    //Add grouping column Entry and unpivot
    #"Added Custom" = Table.AddColumn(Source, "Entry", each
        if Text.StartsWith([Column1], "entry", Comparer.OrdinalIgnoreCase) then [Column1] else null),
    #"Filled Down" = Table.FillDown(#"Added Custom", {"Entry"}),
    //Remove extra entry rows
    remRows = Table.SelectRows(#"Filled Down", each [Entry] <> [Column1]),
    //Replace nulls with a space so we don't lose one item of a "pair"
    #"Replaced Value" = Table.ReplaceValue(remRows, null, " ", Replacer.ReplaceValue, Table.ColumnNames(remRows)),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Replaced Value", {"Entry"}, "Attribute", "Value"),
    #"Removed Columns" = Table.RemoveColumns(#"Unpivoted Other Columns", {"Attribute"}),
    #"Added Index" = Table.AddIndexColumn(#"Removed Columns", "Index", 0, 1, Int64.Type),
    #"Inserted Integer-Division" = Table.AddColumn(#"Added Index", "Integer-Division", each Number.IntegerDivide([Index], 2), Int64.Type),
    #"Removed Columns1" = Table.RemoveColumns(#"Inserted Integer-Division", {"Index"}),
    //Group in pairs
    //Mark blank subtables
    //Extract Entry, Parameter and Value
    #"Grouped Rows" = Table.Group(#"Removed Columns1", {"Integer-Division"}, {
        {"Empties", each List.NonNullCount(List.ReplaceValue(_[Value], " ", null, Replacer.ReplaceValue))},
        {"Entry", each _[Entry]{0}},
        {"Parameter", each _[Value]{0}},
        {"Value", each _[Value]{1}}
    }),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Empties] <> 0)),
    #"Removed Columns2" = Table.RemoveColumns(#"Filtered Rows", {"Integer-Division", "Empties"}),
    //Group by Entry, pivot and expand
    #"Grouped Rows1" = Table.Group(#"Removed Columns2", {"Entry"}, {
        {"Pivot", each Table.Pivot(_, _[Parameter], "Parameter", "Value")}
    }),
    //Column name list for the pivoted table.
    //Could sort or order them any way you want
    colNames = List.Distinct(#"Removed Columns2"[Parameter]),
    #"Expanded Pivot" = Table.ExpandTableColumn(#"Grouped Rows1", "Pivot", colNames, colNames)
in
    #"Expanded Pivot"
Original Data
Transformed

Calculate Overlap in Excel Dataset

I have an Excel file with ~500,000+ rows of data, each of which has (amongst other things) an ID and a certain value I'd like to manipulate. I'll use an example consisting of names and foods. The data looks something like this:
Name | Food
Alex | Melon
Alex | Burger
Bruce | Apple
Charlie | Water
Alice | Apple
Bruce | Melon
Bruce | Apple
Bruce | Plum
I'd like to find the overlap in foods between any pair of names, giving me a result that would tell me (for example) that for the pairing of Bruce vs Alex, 2/3 of Bruce's data is unique and 1/3 is the same as Alex's list, whilst for Alex, 1/2 of his data is unique and 1/2 is the same as Bruce's.
There is no consistency in the amount of foods a person can have listed alongside their name. And its entirely possible for some people to have foods not found alongside anyone else.
I thought to present it through something like this, where each percentage sign is the overlap for that pairing (read by row, so C2 would be the proportion of Alex's data also found in Alice's, whilst B3 would be the proportion of Alice's data also found in Alex's):
        | Alex | Alice | Bruce | Charlie
Alex    | -    | %     | %     | %
Alice   | %    | -     | %     | %
Bruce   | %    | %     | -     | %
Charlie | %    | %     | %     | -
I've been struggling to think of or find a formula or VBA script that would achieve this and calculate the overlap. I've considered creating (i) a helper column that concatenates the name and food, and (ii) a new de-duplicated unique list of foods mapped against the helper column. However, as far as I can tell, while that will help me summarise which foods go with which person, it won't help me find the overlap between each person's list of foods.
Any help would be greatly appreciated!
This was fun!
You can use Power Query for this.
Highlight your data and insert a table, making sure you say the first row is column headers.
Go to the Table ribbon and change the table's name to "PersonTag"
Go to the Data ribbon and in the 'Get & Transform' section, click "From Table"
This opens up Power Query, with a new query called "PersonTag"
Highlight both columns, then on the "Home" ribbon, choose "Remove Rows - Remove Duplicates"
In the "Home" ribbon, click "Manage - Reference"
You've just created a new query that refers to the PersonTag query! Rename it to PersonCount.
Highlight the "Name" column, and in the "Transform" ribbon, click "Group By" and group by name, creating a new column called "PersonCount" that is the count of the rows.
Go back to editing the PersonTag query
Create a new query (or copy an existing one, it doesn't matter how), name it "PersonTagPersonTag", and then go to the "Home" ribbon, click "Advanced Editor" and replace whatever's there with the following.
let
    Source = PersonTag,
    // Now each row contains its own copy of the "PersonTag" table (effectively a cross join)!
    #"Added Custom" = Table.AddColumn(Source, "2nd", each PersonTag),
    #"Expanded 2nd" = Table.ExpandTableColumn(#"Added Custom", "2nd", {"Name", "Food"}, {"2nd.Name", "2nd.Food"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded 2nd", {{"2nd.Name", type text}, {"2nd.Food", type text}}),
    // We only want rows where the foods match and the names don't
    #"Filtered Rows" = Table.SelectRows(#"Changed Type", each ([Name] <> [2nd.Name] and [Food] = [2nd.Food]))
in
    #"Filtered Rows"
Now we're going to group by Name and 2nd.Name to get the row count of matches in a "PersonPersonCount" column, bring in the PersonCount query we created earlier to get the total foods for each Name, then create a PercentMatch column by dividing PersonPersonCount by PersonCount. Then we can get rid of the count columns, because we don't need them, and pivot by 2nd.Name. Create another query (I named mine "PersonvPerson"):
let
    Source = PersonTagPersonTag,
    #"Grouped Rows" = Table.Group(Source, {"Name", "2nd.Name"}, {{"PersonPersonCount", each Table.RowCount(_), type number}}),
    // Bring in PersonCount query
    #"Merged Queries" = Table.NestedJoin(#"Grouped Rows", {"Name"}, PersonCount, {"Name"}, "PersonCount", JoinKind.LeftOuter),
    // If you click the column type icon in the column title in the previous step, you get the dialog box you can fill out that does this step for you
    #"Expanded PersonCount" = Table.ExpandTableColumn(#"Merged Queries", "PersonCount", {"PersonCount"}, {"PersonCount"}),
    #"Added PercentMatch" = Table.AddColumn(#"Expanded PersonCount", "PercentMatch", each [PersonPersonCount] / [PersonCount]),
    #"Changed Type" = Table.TransformColumnTypes(#"Added PercentMatch", {{"PercentMatch", Percentage.Type}}),
    #"Removed Other Columns" = Table.SelectColumns(#"Changed Type", {"Name", "2nd.Name", "PercentMatch"}),
    #"Sorted Rows" = Table.Sort(#"Removed Other Columns", {{"2nd.Name", Order.Ascending}}),
    // I did this by highlighting the "2nd.Name" column, going to the "Transform" ribbon, and clicking "Pivot Column"
    #"Pivoted Column" = Table.Pivot(#"Sorted Rows", List.Distinct(#"Sorted Rows"[#"2nd.Name"]), "2nd.Name", "PercentMatch", List.Sum)
in
    #"Pivoted Column"
Exit the Power Query window and keep your changes. By default, each new query automatically creates a new tab on the worksheet containing its data. Delete the tabs you don't want to keep.
If you're anything like I was a few months ago, your jaw is dropping at what you can do with Power Query. I gave you the code because I was too lazy to spell out every little click that creates it, but don't be overwhelmed!!! I just clicked around to create each next step and it generated the code for me! They made it easy to click around and try/undo things.
