Power Query: Include all unique columns found in files from folder

I've recently run into an issue which hopefully is solvable.
Currently, I have power query pointing at a folder containing several CSV files. This is normally no issue, however, in this instance not all of the files have the same columns.
Is there a way to have power query return every unique column found in the folder populating empty data observations with null values?
Assume that my folder has csv files similar to the following (note that the rows are indexed using letters for easy reference):
I would like my final table to look something like:
This seems like it should be pretty simple, but I can't figure it out for the life of me! Any help would be greatly appreciated!

Assuming you're using Folder.Files, I think you can:
Grab the Content column of the table returned by Folder.Files -- which should give you a list of binary values.
Parse each item in the list as a CSV document using List.Transform and Csv.Document -- which should give you a list of tables.
Then merge your list of tables with Table.Combine -- which should give you one single table. Table.Combine should take care of the details (like aligning column names).
You've not provided any code in your question, so it's hard to give a relevant example, but I think the code below gives me your expected output.
I've turned the row indexes into an ID column, just to make the final table easier to verify/follow.
let
firstCsv =
"ID,one,two,three
A,1,4,7
B,2,5,8
C,3,6,9",
secondCsv =
"ID,one,two,three,four
D,1,6,11,16
E,2,7,12,17
F,3,8,13,18
G,4,9,14,19
H,5,10,15,20",
thirdCsv =
"ID,one,two,yes,no,maybe
I,1,1,1,1,1
J,2,2,2,2,2
K,3,3,3,3,3
L,4,4,4,4,4
M,5,5,5,5,5",
// For example's sake, let's suppose that the contrived table below was
// returned by calling Folder.Files
filesInFolder = Table.FromColumns({
List.Transform({firstCsv, secondCsv, thirdCsv}, Text.ToBinary),
List.Transform({"1".."3"}, each "CSV file " & _ & ".csv"),
List.Repeat({"someFolderPath"}, 3)
}, type table [Content = binary, Name = text, Folder = text]),
parsed = List.Transform(filesInFolder[Content], each
let
csv = Csv.Document(_, [Delimiter = ",", QuoteStyle = QuoteStyle.Csv]),
promoted = Table.PromoteHeaders(csv, [PromoteAllScalars = true])
in promoted
),
// The step below should match the expected output in your question.
combined = Table.Combine(parsed)
in
combined
Obviously, you'll need to adjust for your own folder path and actually call Folder.Files as you presumably already are in your own code.
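For completeness, the same steps against a real folder might look roughly like the sketch below. The folder path is a placeholder, and the extension filter is an extra step I'm assuming you'd want so that non-CSV files are skipped.

```powerquery
let
    // Placeholder path (an assumption): replace with your own folder
    files = Folder.Files("C:\path\to\csv\folder"),
    // Keep only CSV files, ignoring the case of the extension
    csvOnly = Table.SelectRows(files, each Text.Lower([Extension]) = ".csv"),
    // Parse each binary as CSV and promote the first row to headers
    parsed = List.Transform(csvOnly[Content], each
        Table.PromoteHeaders(
            Csv.Document(_, [Delimiter = ",", QuoteStyle = QuoteStyle.Csv]),
            [PromoteAllScalars = true]
        )
    ),
    // Table.Combine aligns columns by name; missing columns are filled with null
    combined = Table.Combine(parsed)
in
    combined
```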

I've always used something like this
//read all files in specified directory you fill in here
let Source = Folder.Files("C:\directory\subdirectory"),
//filter only csv files
#"Filtered Rows" = Table.SelectRows(Source, each ([Extension] = ".csv")),
//Pull contents of each file into table with an index
#"Added Custom1" = Table.AddColumn(#"Filtered Rows", "Custom", each Table.AddIndexColumn(Csv.Document(File.Contents([Folder Path]&"\"&[Name]),[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None]),"Index")),
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom1", "Custom", {"Column1", "Index"}, {"Column1", "Index"}),
#"Removed Other Columns" = Table.SelectColumns(#"Expanded Custom",{"Column1", "Index", "Name"}),
#"Pivoted Column" = Table.Pivot(#"Removed Other Columns", List.Distinct(#"Removed Other Columns"[Name]), "Name", "Column1"),
#"Removed Columns" = Table.RemoveColumns(#"Pivoted Column",{"Index"})
in #"Removed Columns"

Related

extract data from another excel in power query advanced editor

I'm trying to extract data from another Excel file, and I want to make the import dynamic.
Context: I collect survey data for different cities. The questions are the same for every city, but the responses differ. I want to build visualisations for each city just by replacing the data file. The data is exported as a file we can call "results-city.xlsx", and my goal is to replace this file with another one that has the same name and columns but different responses in each column. I'm trying to do this with Power Query and the Advanced Editor. This is my formula, but it isn't working. The path will also be dynamic, which is why I included a folder in the formula. Please help me achieve this.
let
Source = Excel.Workbook(File.Contents("C:\Users\iotal\OneDrive\Desktop\stack\folder\results-city.xlsx"), null, true),
Sheet1_Sheet = Source{[Item="Sheet1",Kind="Sheet"]}[Data],
#"Changed Type" = Table.TransformColumnTypes(Sheet1_Sheet,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}})
in
#"Changed Type"
Update 2:
Example of data:
File with data city 1
File with data city 2
Here is an example with the data from two cities. When I tried to use the formula, Power Query didn't import the data into Excel.
My desired output is to have the data imported into an Excel sheet, regardless of which of the two files is in the folder, and to be able to refresh it.
This is my desired output with images:
Image 1 is the first data set that I will import.
The second image shows how it should look when I swap in the data for city 2, just by replacing the file with another one of the same name.
What is not working is the formula, which doesn't import the data into a sheet as a table.
When I replace the file for city 1 with the one for city 2, I get this error:

In powerquery, something like this will combine all tabs in all xlsx files in a specified hardcoded directory
let Source = Folder.Files("C:\subdirectory\directory"),
#"Filtered Rows" = Table.SelectRows(Source, each ([Extension] = ".xlsx")),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Name", "Content"}),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns", "GetFileData", each Excel.Workbook([Content],true)),
#"Expanded GetFileData" = Table.ExpandTableColumn(#"Added Custom", "GetFileData", {"Data", "Hidden", "Item", "Kind", "Name"}, {"Data", "Hidden", "Item", "Kind", "Sheet"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded GetFileData",{"Content", "Hidden", "Item", "Kind"}),
List = List.Union(List.Transform(#"Removed Columns"[Data], each Table.ColumnNames(_))),
#"Expanded Data" = Table.ExpandTableColumn(#"Removed Columns", "Data", List,List)
in #"Expanded Data"
Alternatively, give a cell a range name, like path in excel and put your filepath in that cell, like C:\temp\a.xlsx
Then in powerquery, use that range name in place of hardcoding the file name like
let location= Excel.CurrentWorkbook(){[Name="path"]}[Content]{0}[Column1],
Source = Excel.Workbook(File.Contents(location), null, true),
...
This second alternative assumes that the tab name is constant. Otherwise, I recommend the first method.
You can combine the two alternatives if you want.
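Combining the two could look something like the sketch below. It assumes the named range "path" holds a folder (for example C:\temp) rather than a single file; the remaining steps would then follow the first method.

```powerquery
let
    // Assumption: "path" is a single-cell named range containing a folder path
    folder = Excel.CurrentWorkbook(){[Name="path"]}[Content]{0}[Column1],
    // List the files in that folder instead of a hard-coded directory
    Source = Folder.Files(folder),
    // Keep only the workbooks; continue with the steps from the first method
    #"Filtered Rows" = Table.SelectRows(Source, each [Extension] = ".xlsx")
in
    #"Filtered Rows"
```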

To make the file location dynamic with respect to a cell with the named range "file_path" (cell B2 in this example), modify an ordinary Power Query data import for one of the files/locations (here, "C:\temp\Folder1\File1.txt") as follows:
ordinary power query:
let
Source = Csv.Document(File.Contents("C:\temp\Folder1\File1.txt"),[Delimiter=" ", Columns=3, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true])
in
#"Promoted Headers"
updated power query:
let
MyFolder = Excel.CurrentWorkbook(){[Name="file_path"]}[Content][Column1]{0},
Source = Csv.Document(File.Contents(MyFolder),[Delimiter=" ", Columns=3, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true])
in
#"Promoted Headers"
i.e. the hard-coded file path inside File.Contents is replaced with the variable MyFolder, defined as follows:
MyFolder = Excel.CurrentWorkbook(){[Name="file_path"]}[Content][Column1]{0}
(optional)
Include VBA code to refresh the query every time the drop-down selection changes in the cell with the named range "file_path" (cell B2 in this example), as follows:
(You don't have to do this. You could just as easily make the query refresh periodically via the properties UI associated with the query, for instance.)
Private Sub Worksheet_Change(ByVal Target As Range)
If Not Intersect(Target, Range("file_path")) Is Nothing Then
Application.ScreenUpdating = False
Calculate
ActiveWorkbook.Connections("Query - dynamic_file").Refresh
Application.ScreenUpdating = True
Calculate
End If
End Sub
Kudos to Cristiano Galvão for more details on dynamic links as described above.
ta

Having issues with List.Contains - Not loading

Oi,
I'm having issues with the List.Contains(x, x) = false pattern.
Context
I have 32 Excel files (each about 2 MB) that I retrieve data from in a first query. This data is filtered down to only the columns I need, and the query is then converted into a list (ListOfJustifWBS) so that I only have the WBSs from that particular query.
I also have another query, where I import a huge Excel data file that includes WBSs, actuals, best estimates, and so on.
What I want to do is keep only the WBSs from the second query that are not included in the first query.
The code i'm using is = Table.SelectRows(#"Changed Type", each (List.Contains(ListOfJustifWBS,[WBS])=false))
Whenever I run the query in the editor, the data gets processed. However, when I track the progress in the bottom-right corner, I see all 32 Excel files being processed, and Excel sometimes reports about 20 MB of data retrieved for a file even though the file itself is only 2 MB.
Whenever I try to load the query into an Excel sheet table, Excel goes "Not responding".
Any idea how to fix this?

If you replace
= Table.SelectRows(#"Changed Type", each (List.Contains(ListOfJustifWBS,[WBS])=false))
with
#"Merged Queries" = Table.NestedJoin(#"Changed Type", {"WBS"}, Table.FromList(ListOfJustifWBS ), {"Column1"}, "ListOfJustifWBS", JoinKind.LeftOuter),
#"Expanded ListOfJustifWBS" = Table.ExpandTableColumn(#"Merged Queries", "ListOfJustifWBS", {"Column1"}, {"Column1.1"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded ListOfJustifWBS", each ([Column1.1] = null))
is this any faster ?
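Two possible variations on the same idea, offered as a sketch rather than a measured fix: buffer the list with List.Buffer so it is held in memory instead of potentially being re-evaluated, or use JoinKind.LeftAnti so the merge returns only the unmatched rows directly. The inline table and list below are made-up stand-ins for #"Changed Type" and ListOfJustifWBS.

```powerquery
let
    // Hypothetical stand-ins for the two queries in the question
    ChangedType = #table({"WBS", "Actuals"}, {{"A1", 10}, {"B2", 20}, {"C3", 30}}),
    ListOfJustifWBS = {"A1", "C3"},

    // Option 1: buffer the list in memory, then filter as before
    Buffered = List.Buffer(ListOfJustifWBS),
    FilteredByList = Table.SelectRows(ChangedType, each not List.Contains(Buffered, [WBS])),

    // Option 2: a left anti join keeps only rows with no match in the list
    JustifTable = Table.FromColumns({ListOfJustifWBS}, {"JustifWBS"}),
    AntiJoined = Table.NestedJoin(ChangedType, {"WBS"}, JustifTable, {"JustifWBS"}, "Matches", JoinKind.LeftAnti),
    FilteredByJoin = Table.RemoveColumns(AntiJoined, {"Matches"})
in
    FilteredByJoin
```

With the sample values above, both FilteredByList and FilteredByJoin keep only the row for WBS "B2". Whether either option is actually faster on the real data would need testing.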

Getting latest updated file in Power BI Folder Source

I have a folder with 3 different subfolders where I store some daily exports. I want to get only the most recently updated file from each folder path, as shown in the picture.
I need to keep the binary and attributes columns of the file with the latest "Date modified" for each folder path. The latest modification date is different for each subfolder. My subfolders refer to the months of the year (their names are May, Jun and July, and August will be added soon).
This way, since I have 3 different values in my Folder Path column, I will only have 3 CSV files to expand.
I have tried grouping by folder path and taking the maximum Date modified, but I lose the other fields. Adding the maximum of attributes, or "All Rows", does not solve my problem either.
I also tried this solution: https://community.powerbi.com/t5/Desktop/Keep-only-the-latest-date-for-duplicate-entries/td-p/638447
But I get stuck in an error: Function type value cannot be converted to Table type. Details: Value = [Function], Type = [Type].
M Query code:
let
Source = AzureStorage.DataLake(".../usersDailyData"),
#"Filtered Hidden Files1" = Table.SelectRows(#"Sorted Rows", each [Attributes]?[Hidden]? <> true),
#"Invoke Custom Function1" = Table.AddColumn(#"Filtered Hidden Files1", "Transformar archivo", each #"Transformar archivo"([Content])),
#"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "Transformar archivo"}),
#"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transformar archivo", Table.ColumnNames(#"Transformar archivo"(#"Archivo de ejemplo"))),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1",{ Columns name here })
in
#"Changed Type"
Any ideas to solve it?
Thank you very much.

I'd suggest adding a column that holds the max datetime for each Folder Path, then comparing each row's datetime to that max and filtering out the non-matches, along the lines of the code below. I think it goes after the Invoke Custom Function1 step, but you can move it where needed.
...
#"Added Custom" = Table.AddColumn(#"Invoke Custom Function1","MaxDate",(i)=>List.Max(Table.SelectRows( #"Invoke Custom Function1", each [Folder Path]=i[Folder Path]) [Date modified]), type datetime ),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom", each if [Date modified]=[MaxDate] then "keep" else null),
#"Filtered Rows" = Table.SelectRows(#"Added Custom1", each ([Custom] = "keep")),
...
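An alternative sketch, in case grouping is preferred: group by Folder Path and use Table.Max to pick the whole row with the latest Date modified, so the other columns survive the grouping. The Files table below is invented sample data standing in for the real file listing.

```powerquery
let
    // Hypothetical stand-in for the file listing returned by the source step
    Files = #table(
        type table [Name = text, #"Folder Path" = text, #"Date modified" = datetime],
        {
            {"a.csv", "data/May/", #datetime(2021, 5, 30, 8, 0, 0)},
            {"b.csv", "data/May/", #datetime(2021, 5, 31, 8, 0, 0)},
            {"c.csv", "data/Jun/", #datetime(2021, 6, 29, 8, 0, 0)},
            {"d.csv", "data/Jun/", #datetime(2021, 6, 30, 8, 0, 0)}
        }
    ),
    // For each folder, keep the full row (as a record) with the latest modification time
    Grouped = Table.Group(Files, {"Folder Path"}, {{"Latest", each Table.Max(_, "Date modified"), type record}}),
    // Turn the list of records back into a table so every original column is kept
    LatestFiles = Table.FromRecords(Grouped[Latest])
in
    LatestFiles
```

With this sample data the result contains b.csv and d.csv. If two files in a folder share the same latest timestamp, Table.Max returns only one of them, unlike the filter approach above, which keeps both.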

How to get Column Names dynamically in Excel Power Query

I am using Power Query in Excel to read JSON files. I have a sample working script, as follows:
let
Source = Json.Document(File.Contents("E:\laureates.json")),
#"Converted to Table" = Record.ToTable(Source),
#"Expanded Value" = Table.ExpandListColumn(#"Converted to Table", "Value"),
#"Expanded Value1" = Table.ExpandRecordColumn(#"Expanded Value", "Value", {"id", "firstname", "surname", "born", "died", "bornCountry", "bornCountryCode", "bornCity", "diedCountry", "diedCountryCode", "diedCity", "gender", "prizes"}, {"Value.id", "Value.firstname", "Value.surname", "Value.born", "Value.died", "Value.bornCountry", "Value.bornCountryCode", "Value.bornCity", "Value.diedCountry", "Value.diedCountryCode", "Value.diedCity", "Value.gender", "Value.prizes"})
in
#"Expanded Value1"
The second and third arguments of the expression for #"Expanded Value1" have the column names hard-coded; this code was generated via the user interface.
I would like to reuse the script. The problem is that whenever the source file changes - has different column names or new column names - I encounter an error. A workaround is to regenerate the script.
I can avoid the issue if I can specify the second and third arguments as expressions that will be evaluated dynamically at runtime.
So far, my attempts have failed: I would appreciate any hints regarding how I can replace the second and third arguments as code or expressions.

I think you can use Record.FieldNames to generate that list dynamically.
Something like this:
Table.ExpandRecordColumn(
#"Expanded Value",
"Value",
Record.FieldNames([Value]),
List.Transform(Record.FieldNames([Value]), each "Value." & _)
)
Edit: As AAsk points out, the above syntax is incorrect since it's attempting to pull row context on a table-level operation. Instead of [Value] for each row, we need to use a representative one to be applied to the whole column and picking #"Expanded Value"{0}[Value], the record from the first row, should work.
Table.ExpandRecordColumn(
#"Expanded Value",
"Value",
Record.FieldNames(#"Expanded Value"{0}[Value]),
List.Transform(Record.FieldNames(#"Expanded Value"{0}[Value]), each "Value." & _)
)
The List.Transform is there to prepend "Value." to the start of each column name but it works just fine to use Record.FieldNames(#"Expanded Value"{0}[Value]) twice instead.

Record.FieldNames in your (Alexis Olson's) answer made me realise that I can get the column names from a single record (the logic being that all records should have the same column names). Therefore, the code that gets the column names dynamically is:
let
Source = Json.Document(File.Contents("E:\laureates.json")),
#"Converted to Table" = Record.ToTable(Source),
#"Expanded Value" = Table.ExpandListColumn(#"Converted to Table", "Value"),
#"Expanded Value1" = Table.ExpandRecordColumn(#"Expanded Value", "Value",
Record.FieldNames(#"Expanded Value"{0}[Value]),
Record.FieldNames(#"Expanded Value"{0}[Value])
)
in
#"Expanded Value1"
Now I can change the file name, and files with different content are displayed correctly.
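If the records do not all share the same fields, taking the names from the first record would miss some columns. A minimal sketch of one way around that, assuming every value in the column is a record (a null would make Record.FieldNames fail), is to take the union of the field names across all rows. The sample table is invented for illustration.

```powerquery
let
    // Hypothetical records with differing fields, standing in for the [Value] column
    Expanded = #table({"Name", "Value"}, {
        {"laureates", [id = "1", firstname = "A"]},
        {"laureates", [id = "2", firstname = "B", gender = "x"]}
    }),
    // Collect every field name that appears in any record
    AllFields = List.Union(List.Transform(Expanded[Value], Record.FieldNames)),
    // Fields missing from a given record are returned as null
    Result = Table.ExpandRecordColumn(Expanded, "Value", AllFields)
in
    Result
```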

Convert column to cell string Power Query

I need to fit all the values of a column in Power Query into a 1-cell string separated by commas, as the example below:
To do this, I have the following piece of code:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Transposed Table" = Table.Transpose(Source),
#"Merged Columns" = Table.CombineColumns(#"Transposed Table",{"Column1", "Column2", "Column3"},Combiner.CombineTextByDelimiter(",", QuoteStyle.None),"Merged"),
#"KeepString" = #"Merged Columns"[Merged]{0}
in
#"KeepString"
The problem with this code is that it assumes there will always be 3 columns, which is not always the case. How can I merge all columns (regardless of how many there are) into one?

You can do this with List.Accumulate:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
KeepString = List.Accumulate(Source[User], "", (state, current) => if state = "" then current else state & "," & current)
in
KeepString
You can also use Table.ColumnNames to get the list of all the column names. You can pass this into Table.CombineColumns, so your modified solution would be:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Transposed Table" = Table.Transpose(Source),
#"Merged Columns" = Table.CombineColumns(#"Transposed Table", Table.ColumnNames(#"Transposed Table"),Combiner.CombineTextByDelimiter(",", QuoteStyle.None),"Merged"),
#"KeepString" = #"Merged Columns"[Merged]{0}
in
#"KeepString"

You can also use a shorter code, like this:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
Result = Text.Combine(Source[User], ",")
in
Result
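Text.Combine expects text values, so if the column might contain numbers or blanks, a variation like the sketch below may be safer. The inline table is a made-up stand-in for Table1, and the column name User is taken from the answers above.

```powerquery
let
    // Hypothetical stand-in for Table1 with mixed values
    Source = #table({"User"}, {{"alice"}, {null}, {42}}),
    // Drop nulls, convert the remaining values to text, then join with commas
    Result = Text.Combine(List.Transform(List.RemoveNulls(Source[User]), Text.From), ",")
in
    Result  // "alice,42"
```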
