The Question (How do I make it faster)
I have been playing around with Power Query in Excel for over a year now but for the first time, I have a query that takes 20+ minutes to run.
I am sure there is something here I can learn!
While it does currently work, I believe that if it were well written it would run much faster.
Data Structure
There are two databases here
Database of Company (Aka attendees) - About 400 rows
Company Title
Rita Book
Paige Turner
Dee End
etc
Database of Events - About 500 rows
An Event can have many Company (Attendees). The database exports this as a comma-separated list in the column [#"Export CSV - Company"]
Event Title | Export CSV - Company             | Date     | Year
Event 1     | Rita Book, Dee End               | 1/1/2015 | 2015
Event 2     | Paige Turner                     | 2/1/2015 | 2015
Event 3     | Dee End                          | 3/1/2015 | 2015
Event 4     | Rita Book, Paige Turner, Dee End | 1/1/2016 | 2016
etc         | ...                              | ...      | ...
Note that I also have a separate query called #"Company Event Count - 1 Years List" which is a list of all years that events have been run.
The Goal
For a visualization, I need to get the data into the following structure:
Company Title | 2015 | 2016 | etc
John Smith    | 10   | 20   | ...
Jane Doe      | 5    | 14   | ...
etc           | ...  | ...  | ...
The Code
I have done my best to comment on my code below. Feel free to ask any questions.
let
    // This is a function. It was the only way I could figure out how to use [Company Title] from #"Keep only names column" and "currentColumnTitleYearStr" from the dynamically created columns in the same scope
    count_table_year_company = (myTbl, yearStr, companyStr) =>
        Table.RowCount(
            Table.SelectRows(
                myTbl,
                each Text.Contains([#"Export CSV - Company"], companyStr)
            )
        ),
    Source = #"Company 1 - Loaded CSV From Folder", // Grab a list of all Company
    #"Keep only names column" = Table.SelectColumns(Source,{"Company Title"}), // Keep only the [Company Title] field
    // Dynamically create columns for each year. Example Columns: [Company Title], [2015], [2016], [2017], etc
    #"Add Columns for each year" =
        List.Accumulate(
            #"Company Event Count - 1 Years List", // The list of all years; one column is added per year
            #"Keep only names column",
            (state, currentColumnTitleYearStr) => Table.AddColumn(
                state,
                currentColumnTitleYearStr, // The Year becomes the column title and is also used in filters
                let // I hoped that filtering the table by Year at this point would mean it only has to do it once per column, instead of once per cell.
                    eventsThisYearTbl = Table.SelectRows(
                        #"Event 1 - Loaded CSV From Folder",
                        each ([Year] = Number.FromText(currentColumnTitleYearStr))
                    )
                in (
                    // Finally for each cell, calculate the count of events. E.g. how many events did 'John Smith' attend in 2015
                    each count_table_year_company(eventsThisYearTbl, currentColumnTitleYearStr, [Company Title]) //CompanyTitleVar
                )
            )
        ),
    FinalStep = #"Add Columns for each year"
in
    FinalStep
My Theories
I believe one of a few things may be making it slow
I am using List.Accumulate to dynamically create a column for each year. While this does work, I think it may be the wrong function for the job, especially because the state value, which is like a running total of each cell, must become very large.
I worry that I have an 'each' where I don't need it, but I can't seem to remove any. It's my understanding that every 'each' is effectively a nested loop, so removing one may have a dramatic impact on performance.
In Conclusion
While it does currently work I know there is something for me to learn here.
Thank you so much for any guidance or suggested readings you can provide :)
Does this do what you want? Converts from left to right. If not please explain more clearly
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
SplitNames = Table.TransformColumns(Source,{{"Names", each Text.Split(_,", ")}}),
#"Expanded Names" = Table.ExpandListColumn(SplitNames, "Names"),
#"Removed Columns" = Table.RemoveColumns(#"Expanded Names",{"Event Title", "Date"}),
#"Added Custom" = Table.AddColumn(#"Removed Columns", "Count", each 1),
#"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Added Custom", {{"Year", type text}}, "en-US"), List.Distinct(Table.TransformColumnTypes(#"Added Custom", {{"Year", type text}}, "en-US")[Year]), "Year", "Count", List.Sum)
in #"Pivoted Column"
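The same split, expand and pivot idea can be sketched against the query and column names used in the question. This is a sketch under assumptions, not a tested drop-in: it assumes #"Event 1 - Loaded CSV From Folder" returns the events table with the [Export CSV - Company] and [Year] columns shown above, and that no attendee cell is null. The step names (Kept, Split, Counted and so on) are made up for illustration. The likely reason the original is slow is that every company/year cell rescans the events table with Text.Contains; here the events table is processed once.

```
let
    // Assumption: the events query from the question is available under this name
    Source = #"Event 1 - Loaded CSV From Folder",
    // Keep only the two columns needed for the count
    Kept = Table.SelectColumns(Source, {"Export CSV - Company", "Year"}),
    // Turn "Rita Book, Dee End" into the list {"Rita Book", "Dee End"}, trimming stray spaces
    Split = Table.TransformColumns(
        Kept,
        {{"Export CSV - Company", each List.Transform(Text.Split(_, ","), Text.Trim)}}
    ),
    // One row per (attendee, event)
    Expanded = Table.ExpandListColumn(Split, "Export CSV - Company"),
    Renamed = Table.RenameColumns(Expanded, {{"Export CSV - Company", "Company Title"}}),
    // Count events per company and year
    Counted = Table.Group(
        Renamed,
        {"Company Title", "Year"},
        {{"Count", each Table.RowCount(_), Int64.Type}}
    ),
    // Pivoted column headers have to be text
    YearAsText = Table.TransformColumnTypes(Counted, {{"Year", type text}}),
    Pivoted = Table.Pivot(YearAsText, List.Distinct(YearAsText[Year]), "Year", "Count", List.Sum)
in
    Pivoted
```

Two differences from the original query are worth knowing about. Companies that attended no events do not appear in the result (a left outer merge from the company list would bring them back), and names are matched exactly after splitting, whereas Text.Contains also counts a company whose name is a substring of another.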
Related
My goal is to create a Venn diagram out of database data. I am happy to achieve this in one of two ways:
Using Flurish.com which requires me to solve this power query problem.
Being suggested other software that can automatically create Venn diagrams
The database table looks like this:
Name     | Shared Interests
Person 1 | Camping, Road Trips, Acro, Firetwirling
Person 2 | Camping
Person 3 | Road Trips
Person 4 | Road Trips
Person 5 | Acro
Person 6 | Firetwirling
Person 7 | Camping, Road Trips
etc
Flurish.com requires me to get it in the format
Name                                 | Value
Camping                              | 3
Road Trips                           | 4
Acro                                 | 2
Firetwirling                         | 2
Camping-Road Trips                   | 2
Camping-Acro                         | 1
Camping-Firetwirling                 | 1
Camping-Road Trips-Acro              | 1
Camping-Road Trips-Firetwirling      | 1
Camping-Road Trips-Firetwirling-Acro | 1
... (etc with every possible combination separated by a -)
How can this be achieved?
I have attempted to create a unique list of all 'Shared Interests' but I don't know how to add new rows of every unique combination separated by a dash -
I then don't know how to calculate the value column of these overlapping groups. I suspect using pivot functions is the way to go here, but so far I have not had any luck.
In powerquery try
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
// Combo algo adapted from Bill Szysz 2017
process=(Items as list) as list =>
let AddIndex = Table.AddIndexColumn(Table.FromList(List.Sort(Items)), "Index", 0, 1),
ReverseIndeks = Table.AddIndexColumn(AddIndex, "RevIdx", Table.RowCount(AddIndex), -1),
Lists = Table.AddColumn(ReverseIndeks, "lists", each List.Repeat(List.Combine({List.Repeat({[Column1]}, Number.Power(2,[RevIdx]-1)),List.Repeat( {null}, Number.Power(2,[RevIdx]-1))}), Number.Power(2, [Index]))),
ResultTable = Table.FromColumns(Lists[lists]),
AllCombos = List.Sort(List.RemoveLastN(Table.AddColumn(ResultTable, "Custom", each Text.Combine(List.RemoveNulls(Record.ToList(_)),"-"))[Custom],1))
in AllCombos,
#"Added Custom" = Table.AddColumn(Source, "combos", each process(Text.Split([Shared Interests],", "))),
#"Expanded combos" = Table.ExpandListColumn(#"Added Custom", "combos"),
#"Grouped Rows" = Table.Group(#"Expanded combos", {"combos"}, {{"Count", each Table.RowCount(_), Int64.Type}})
in #"Grouped Rows"
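The process function above builds every non-empty combination of one person's interests. For readers who find the index arithmetic hard to follow, the same set of combinations can be sketched with List.Accumulate. This is a different technique from the one in the answer, and AllCombos is a hypothetical name introduced only for this illustration.

```
let
    // Hypothetical helper: every non-empty subset of items, joined with "-"
    AllCombos = (items as list) as list =>
        let
            sorted = List.Sort(items),
            // Start from the empty subset; for each item keep every existing subset
            // and add a copy of each with the item appended
            subsets = List.Accumulate(
                sorted,
                {{}},
                (state, item) => state & List.Transform(state, each _ & {item})
            ),
            nonEmpty = List.Select(subsets, each List.Count(_) > 0),
            joined = List.Transform(nonEmpty, each Text.Combine(_, "-"))
        in
            List.Sort(joined),
    // Returns {"Camping", "Camping-Road Trips", "Road Trips"}
    Example = AllCombos({"Road Trips", "Camping"})
in
    Example
```

Because the items are sorted first, each combination is always written in the same order, so rows from different people group together correctly. The number of combinations doubles with every extra interest, so either version gets expensive for people with long lists.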
Does anybody know how I can use Power Query to group sensor events by the gap between the previous datetime reading and the current one? Apologies, I wasn't sure how to phrase this question, which is probably why I'm having no luck with Google.
What I want to do is group some data by start and end datetimes based on the gap between the current reading and the previous reading.
If you consider the data below:
It's easy to group it like so:
But this would actually be two discrete events, not one - if you notice the gap between 10:00 and 13:00, I'm assuming that the events need to be grouped within two hours of each other, like:
Edit - So, to be clear, if a reading happens more than two hours after the last then it can be considered a new event.
Is an efficient solution to this possible within Power Query? As you can probably imagine when it comes to datetime readings and sensors, there are a tonne of data.
I could and would (and did) write code to do this in other situations, but this report already groups by the min and max datetime via Power Query, and so it would be much less work if I can just add the additional time cut-off group logic there.
Thanks for your help!
Phil.
Edit - I've seen a solution where you add an index and subtract the date to create island groups for dates (not datetimes), but I don't think (although I could be wrong) that this would work here as the readings don't come through at exactly regular intervals (they might be every 1 hr 2 mins or every 58 mins etc.).
See if this helps. Yes, I am creating island groups using an index
Merge row above each row into adjacent column
Take the duration between the two columns
Add index
If duration > 2 then put in the index
Fill down
Group on index and take min/max
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
// stolen from Imke Feldman via Ron Rosenfeld, great way to combine list next to itself, offset
shiftedList = {null} & List.RemoveLastN(Source[Reading Datetime],1),
custom1 = Table.ToColumns(Source) & {shiftedList},
custom2 = Table.FromColumns(custom1,Table.ColumnNames(Source) & {"Previous Row"}),
#"Added Index" = Table.AddIndexColumn(custom2, "Index", 0, 1),
// copy over index when duration > 2 hours between columns
#"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each if [Previous Row]=null then [Index] else if Number.From([Reading Datetime]-[Previous Row])*24 >2 then [Index] else null),
#"Filled Down" = Table.FillDown(#"Added Custom",{"Custom"}),
#"Grouped Rows" = Table.Group(#"Filled Down", {"Custom"}, {{"Reading Start Datetime", each List.Min([Reading Datetime]), type datetime}, {"Reading End Datetime", each List.Max([Reading Datetime]), type datetime}}),
#"Removed Columns" = Table.RemoveColumns(#"Grouped Rows",{"Custom"})
in #"Removed Columns"
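To try the approach without a workbook table, the steps above can be sketched on inline data. The sample readings are invented for illustration, and two small changes are assumptions of this sketch rather than part of the answer: the table is sorted first, since the previous-row comparison only makes sense on time-ordered data, and the gap is measured with Duration.TotalHours instead of Number.From(...)*24.

```
let
    // Hypothetical sample readings, invented for illustration only
    Source = #table(
        type table [#"Reading Datetime" = datetime],
        {
            {#datetime(2021, 1, 1, 8, 0, 0)},
            {#datetime(2021, 1, 1, 9, 2, 0)},
            {#datetime(2021, 1, 1, 10, 0, 0)},
            {#datetime(2021, 1, 1, 13, 0, 0)},
            {#datetime(2021, 1, 1, 13, 58, 0)}
        }
    ),
    // The previous-row comparison assumes the readings are in time order
    Sorted = Table.Sort(Source, {{"Reading Datetime", Order.Ascending}}),
    // Same column shifted down by one row
    shiftedList = {null} & List.RemoveLastN(Sorted[#"Reading Datetime"], 1),
    WithPrevious = Table.FromColumns(
        Table.ToColumns(Sorted) & {shiftedList},
        Table.ColumnNames(Sorted) & {"Previous Row"}
    ),
    Indexed = Table.AddIndexColumn(WithPrevious, "Index", 0, 1),
    // Mark the first row of each event with its index; all other rows stay null
    Marked = Table.AddColumn(Indexed, "Group", each
        if [Previous Row] = null
            or Duration.TotalHours([#"Reading Datetime"] - [Previous Row]) > 2
        then [Index] else null),
    Filled = Table.FillDown(Marked, {"Group"}),
    Grouped = Table.Group(Filled, {"Group"}, {
        {"Reading Start Datetime", each List.Min([#"Reading Datetime"]), type datetime},
        {"Reading End Datetime", each List.Max([#"Reading Datetime"]), type datetime}
    })
in
    Grouped
```

With this sample the three-hour gap between 10:00 and 13:00 starts a new group, so the result has two rows: 08:00 to 10:00 and 13:00 to 13:58.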
I have an Excel file with ~500,000+ rows of data, each of which has (amongst other things) an ID and a certain value I'd like to manipulate. I'll use an example consisting of names and foods. The data looks something like this:
Name Food
Alex Melon
Alex Burger
Bruce Apple
Charlie Water
Alice Apple
Bruce Melon
Bruce Apple
Bruce Plum
I'd like to find the overlap in foods between any pair of names, giving me a result that would tell me (for example) that for the pairing of Bruce vs Alex, 2/3 of Bruce's data is unique and 1/3 is the same as Alex's list, whilst for Alex 1/2 of his data is unique and 1/2 is the same as Bruce's.
There is no consistency in the number of foods a person can have listed alongside their name. And it's entirely possible for some people to have foods not found alongside anyone else.
I thought to present it through something like this, where each percentage sign is the overlap for that pairing (read by row, so C2 would be the proportion of Alex's data also found in Alice's, whilst B3 would be the proportion of Alice's data also found in Alex's):
Alex Alice Bruce Charlie
Alex - % % %
Alice % - % %
Bruce % % - %
Charlie % % % -
I've been struggling to think of or find a formula or VBA script that would achieve this and calculate the overlap. I've considered creating (i) a helper column that concatenates the name and food, and (ii) a new de-duplicated list of foods that maps against the helper column. However, as far as I can tell, whilst that will help me summarise which foods go with which person, it won't help me find out the overlap between each person's list of foods.
Any help would be greatly appreciated!
This was fun!
You can use Power Query for this.
Highlight your data and insert a table, making sure you say the first row is column headers.
Go to the Table ribbon and change the table's name to "PersonTag"
Go to the Data ribbon and in the 'Get & Transform' section, click "From Table"
This opens up Power Query, with a new query called "PersonTag"
Highlight both columns, then on the "Home" ribbon, choose "Remove Rows - Remove Duplicates"
In the "Home" ribbon, click "Manage - Reference"
You've just created a new query that refers to the PersonTag query! Rename it to PersonCount.
Highlight the "Name" column, and in the "Transform" ribbon, click "Group By" and group by name, creating a new column called "PersonCount" that is the count of the rows.
Go back to editing the PersonTag query
Create a new query (or copy an existing one, it doesn't matter how), name it "PersonTagPersonTag", and then go to the "Home" ribbon, click "Advanced Editor" and replace whatever's there with the following.
let
Source = PersonTag,
// Cross join! Now each row contains its own copy of the "PersonTag" table!
#"Added Custom" = Table.AddColumn(Source, "2nd", each PersonTag),
#"Expanded 2nd" = Table.ExpandTableColumn(#"Added Custom", "2nd", {"Name", "Food"}, {"2nd.Name", "2nd.Food"}),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded 2nd",{{"2nd.Name", type text}, {"2nd.Food", type text}}),
// We only want rows where the foods match and the names don't
#"Filtered Rows" = Table.SelectRows(#"Changed Type", each ([Name] <> [2nd.Name] and [Food] = [2nd.Food]))
in
#"Filtered Rows"
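The query above pairs every row with every other row before filtering, which grows with the square of the row count and may be heavy at the roughly 500,000 rows mentioned in the question. A possible alternative, sketched here as an assumption rather than as part of the original answer, is to merge PersonTag with itself on Food so that only rows sharing a food are ever paired. The step names Second, Joined, Expanded and Filtered are made up for this illustration; the output columns are named the same as above so the later steps should still apply.

```
let
    Source = PersonTag,
    // Rename the right-hand copy up front so the expanded columns get the "2nd." names
    Second = Table.RenameColumns(PersonTag, {{"Name", "2nd.Name"}, {"Food", "2nd.Food"}}),
    // Inner join on Food: only pairs of rows that share a food are produced
    Joined = Table.NestedJoin(Source, {"Food"}, Second, {"2nd.Food"}, "2nd", JoinKind.Inner),
    Expanded = Table.ExpandTableColumn(Joined, "2nd", {"2nd.Name", "2nd.Food"}),
    // Drop the rows where a person was matched with themselves
    Filtered = Table.SelectRows(Expanded, each [Name] <> [#"2nd.Name"])
in
    Filtered
```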
Now we're going to group by Name and 2nd.Name to get the row count of matches in a "PersonPersonCount" column, bring in the PersonCount query we created earlier to get the total number of foods for each Name, then create a PercentMatch column by dividing PersonPersonCount by PersonCount. Then we can get rid of the Count columns because we don't need them, and pivot by 2nd.Name! Create another query (I named mine "PersonvPerson").
let
Source = PersonTagPersonTag,
#"Grouped Rows" = Table.Group(Source, {"Name", "2nd.Name"}, {{"PersonPersonCount", each Table.RowCount(_), type number}}),
// Bring in PersonCount query
#"Merged Queries" = Table.NestedJoin(#"Grouped Rows",{"Name"},PersonCount,{"Name"},"PersonCount",JoinKind.LeftOuter),
// If you click the column type icon in the column title in the previous step, you get the dialog box you can fill out that does this step for you
#"Expanded PersonCount" = Table.ExpandTableColumn(#"Merged Queries", "PersonCount", {"PersonCount"}, {"PersonCount"}),
#"Added PercentMatch" = Table.AddColumn(#"Expanded PersonCount", "PercentMatch", each [PersonPersonCount] / [PersonCount]),
#"Changed Type" = Table.TransformColumnTypes(#"Added PercentMatch",{{"PercentMatch", Percentage.Type}}),
#"Removed Other Columns" = Table.SelectColumns(#"Changed Type",{"Name", "2nd.Name", "PercentMatch"}),
#"Sorted Rows" = Table.Sort(#"Removed Other Columns",{{"2nd.Name", Order.Ascending}}),
// I did this by highlighting the "2nd.Name" column, going to the "Transform" ribbon, and clicking "Pivot Column"
#"Pivoted Column" = Table.Pivot(#"Sorted Rows", List.Distinct(#"Sorted Rows"[#"2nd.Name"]), "2nd.Name", "PercentMatch", List.Sum)
in
#"Pivoted Column"
Exit the Power Query window and keep your changes. By default, creating new queries automatically create new tabs on the worksheet that contain the data. Delete the tabs you don't want to keep.
If you're anything like I was a few months ago, your jaw is dropping at what you can do with power queries. I gave you the code because I was too lazy to tell you every little click to create the code, but don't be overwhelmed!!! I just clicked around to create each next step and it created the code for me! They made it easy to click around and try/undo things.
I am looking to insert a remove column step which removes any column where the header (which is a date) is before a certain date (older than X years prior to the current date). I receive a large data dump which is just a list of client names and fees they pay each month from 2012 to today, headed by the month they pay each fee, but as time goes on I don't need the oldest of the data.
So far I have tried producing a list from the headers (based on a previous response from another board member - thank you #horseyride!) and then removing the columns which don't meet the criteria FROM that list. However, it keeps breaking.
This is the latest line in the advanced Editor
#"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Removed Columns", {{"Calendar Period", type text}}, "en-GB"), List.Distinct(Table.TransformColumnTypes(#"Removed Columns", {{"Calendar Period", type text}}, "en-GB")[#"Calendar Period"]), "Calendar Period", "Approved Invoice Amount", List.Sum)
These are the lines I am attempting to create:
"ColumnList" = List.Select(Table.ColumnNames(#"Pivoted Column"), each Text.Contains(_, " ")),,
"Delete Columns"= Table.Transform(#"Pivoted Column", Table.RemoveColumns(#"ColumnList", each {})as table)
in
#"Delete Columns"
The second bit of code I can't seem to get right - that is what I believe it should look like for now. But essentially I want the table to remove any columns whose header (a date) is more than X years before today's date.
EDIT - Screenshot of before and after IF the desired cut off was Dec 2012:
Example Data
Thank you in advance
Just use following code. For static date:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
final = Table.SelectColumns(Source, List.Select(Table.ColumnNames(Source),
each try Date.From(_) >= #date(2012,12,1) otherwise true))
in
final
For dynamic date (older than 3 years prior to the current date):
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
final = Table.SelectColumns(Source, List.Select(Table.ColumnNames(Source),
each try Date.From(_) >= Date.AddYears(Date.From(DateTime.FixedLocalNow()),-3)
otherwise true))
in
final
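Both versions rely on try ... otherwise true to keep any column whose header is not a date, such as the client name column. A small sketch on made-up data shows that behaviour in isolation. The sample table, the Cutoff step and the KeepColumn helper are hypothetical, and the headers are assumed to be ISO-formatted text (yyyy-MM-dd) so that parsing does not depend on the machine's locale.

```
let
    // Hypothetical sample: one text column plus month columns named with ISO dates
    Source = #table(
        {"Client", "2012-11-01", "2012-12-01", "2013-01-01"},
        {{"Acme", 10, 20, 30}}
    ),
    Cutoff = #date(2012, 12, 1),
    // Keep a column when its name is not a date at all,
    // or when it is a date on or after the cutoff
    KeepColumn = (name as text) as logical =>
        let
            parsed = try Date.FromText(name)
        in
            parsed[HasError] or parsed[Value] >= Cutoff,
    final = Table.SelectColumns(Source, List.Select(Table.ColumnNames(Source), KeepColumn))
in
    final
```

With this sample the result keeps Client, 2012-12-01 and 2013-01-01 and drops 2012-11-01. If the real headers are in another format, such as "Dec 2012", how they parse depends on the culture in use, so it is worth checking a few headers by hand first.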
I am trying to calculate the number of days that fall in each year, based on a calendar table that I have created.
For Example: I have 3 columns.
Event, number of days and Date when this event started
Event DaysLost
Injury 30 25/12/2016
Injury 588 06/08/2012
Days in 2016 - 6
Days in 2017 - 24
For the second case:
Days in 2012 - 146
Days in 2013 - 365
Days in 2014 - 77
In the first case, only 6 days need to be counted in 2016 and the rest of the days should automatically be counted in 2017. But I cannot figure out how to do it.
In my output I would like to put the years in one column and the days lost for each year next to that particular year.
I have a calendar table and I want the sum of days to populate for a particular year.
I tried calculating it by getting the end date (adding the number of days to the first start date) and then, if the days were more than the days remaining in that year, subtracting the remaining days from the total so that the rest moves to the next year. But I cannot figure out how to keep adding days to the following years if the period extends over many years, and how to list them afterwards.
Sept 4, 2017
Please see the excel solution below
Excel solution of the problem
0) Importing the data from your Excel screenshot into Power BI results in this.
1) Create a new column in that table using the following formula for end date (to help with future formulas).
EndDate = Injuries[First Start Date] + Injuries[Days]
You stated that you have a calendar table, so you can skip to step 3
2) Create a new table by clicking on Modeling -> New Table and entering the following formula. This gives a single column table with a list of years.
Years = GENERATESERIES(2000, 2020, 1)
3) Create another new table using the following formula. This gives a table with all of the fields from the initial data table crossjoined with the Year table that was just created. The formula also filters the resulting table to only return rows where the value in the Year column is between the year of the First Start Date and the year of the EndDate (First Start Date plus Days). To learn more about the CROSSJOIN function, check out the documentation.
InjuriesByYear = FILTER(
CROSSJOIN(Years, Injuries),
Years[Year] >= Injuries[First Start Date].[Year] &&
Years[Year] <= Injuries[EndDate].[Year]
)
4) Create relationships from the InjuriesByYear table back to the initial data table and the Year table. This will help facilitate nicer reporting efforts.
5) In the InjuriesByYear table, create a new column by clicking on Modeling -> New Column and entering the following formula. The first IF checks if all of the days lost are in a single year. The second IF handles when the days are spread across multiple years, with the True clause handling the first year, and the False clause handling all other years.
DayPerYear = IF(
InjuriesByYear[Year] = InjuriesByYear[First Start Date].[Year] && InjuriesByYear[Year] = InjuriesByYear[EndDate].[Year], InjuriesByYear[Days],
IF(
InjuriesByYear[Year] = InjuriesByYear[First Start Date].[Year], DATEDIFF(InjuriesByYear[First Start Date], DATE(InjuriesByYear[First Start Date].[Year], 12, 31), DAY),
DATEDIFF(DATE(InjuriesByYear[Year], 1, 1), MIN(InjuriesByYear[EndDate], DATE(InjuriesByYear[Year], 12, 31)), DAY) + 1
)
)
6) To test it all out, create a pivot table as configured below. Following these steps, the pivot table should match your Excel solution.
This is a Power Query based approach...
I started with this:
Then I added a custom column by clicking the Add Column tab and Custom Column button and completing the pop-up window like this:
...and clicking OK.
Then I changed the type for that new column by selecting it and then clicking the Transform tab and then Data Type and Date.
Then I added another custom column, completing the pop-up like this:
Then I added another custom column, completing the pop-up like this:
Then I added yet another custom column, completing the pop-up like this:
Then I expanded that last column I added by clicking on the expand icon at the top of the column and choosing Expand to New Rows.
Then I added a final custom column, completing the pop-up like this:
Finally, I grouped by the Event, DaysLost, Started, and Year columns and summed the DaysLostForYear column by clicking the Transform tab and Group By button and completing the pop-up like this:
I end up with this:
You might want a different grouping, but this should get you close. It shows how many days were lost in the years associated with each instance of an injury's total days lost. For instance, the first injury, which was 30 days in duration, started on 12/25/2016: 7 of those days occurred in 2016 and 23 in 2017. The second injury was 588 days, started on 8/6/2012: 148 days were in 2012, 365 in 2013, and 75 in 2014.
Note that I count the started date as a lost day.
Note also that I account for leap years.
I hope this helps.
Here's the query code:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Event", type text}, {"DaysLost", Int64.Type}, {"Started", type date}}),
#"Added Custom" = Table.AddColumn(#"Changed Type", "Ended", each Date.AddDays([Started],[DaysLost]-1)),
#"Changed Type1" = Table.TransformColumnTypes(#"Added Custom",{{"Ended", type date}}),
#"Added Custom3" = Table.AddColumn(#"Changed Type1", "DaysYearStarted", each Number.From(Date.From(Text.From(Date.Year([Started]))&"/12/31")-[Started])+1),
#"Added Custom4" = Table.AddColumn(#"Added Custom3", "DaysYearEnded", each Number.From([Ended]-Date.From(Text.From(Date.Year([Ended])-1)&"/12/31"))),
#"Added Custom5" = Table.AddColumn(#"Added Custom4", "Year", each List.Numbers(Date.Year([Started]),Date.Year([Ended])-Date.Year([Started])+1)),
#"Expanded Custom" = Table.ExpandListColumn(#"Added Custom5", "Year"),
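#"Added Custom1" = Table.AddColumn(#"Expanded Custom", "DaysLostForYear", each
// When the whole period falls within one year, all lost days belong to that year
if Date.Year([Started])=Date.Year([Ended]) then [DaysLost] else
if [Year]=Date.Year([Started]) then [DaysYearStarted] else
if [Year]=Date.Year([Ended]) then [DaysYearEnded] else
// [Year] is a number, so build a date from it before testing for a leap year
if Date.IsLeapYear(#date([Year],1,1)) then 366 else 365),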
#"Grouped Rows" = Table.Group(#"Added Custom1", {"Event", "DaysLost", "Started", "Year"}, {{"DaysLostForYear", each List.Sum([DaysLostForYear]), type number}})
in
#"Grouped Rows"
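A shorter way to get the same per-year totals, offered as a sketch rather than as the answer's method, is to list every lost day with List.Dates and count the days per year. It uses the two rows from the question as inline sample data and, like the query above, counts the start date as the first lost day. The step names are made up for illustration.

```
let
    // Sample rows taken from the question
    Source = #table(
        type table [Event = text, DaysLost = Int64.Type, Started = date],
        {
            {"Injury", 30, #date(2016, 12, 25)},
            {"Injury", 588, #date(2012, 8, 6)}
        }
    ),
    // One list entry per lost day, starting on the start date
    WithDates = Table.AddColumn(
        Source,
        "LostDate",
        each List.Dates([Started], [DaysLost], #duration(1, 0, 0, 0))
    ),
    Expanded = Table.ExpandListColumn(WithDates, "LostDate"),
    WithYear = Table.AddColumn(Expanded, "Year", each Date.Year([LostDate]), Int64.Type),
    // Count the lost days that fall in each year of each injury
    Grouped = Table.Group(
        WithYear,
        {"Event", "DaysLost", "Started", "Year"},
        {{"DaysLostForYear", each Table.RowCount(_), Int64.Type}}
    )
in
    Grouped
```

For the sample rows this gives 7 days in 2016 and 23 in 2017 for the first injury, and 148, 365 and 75 days for 2012 to 2014 for the second, matching the figures quoted above. Leap years need no special handling because real calendar dates are being counted, at the cost of creating one row per lost day before grouping.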