PowerQuery duplicate rows from external source - excel

I have prepared a master Excel file which pulls data by means of a Power Query from several smaller Excel worksheets, all containing the same set of data (same columns) - one per employee.
Today I noticed that for some employees, some of the data is duplicated in the master table, even though said duplicates do not exist in their separate worksheets.
The master query is made up of separate "Connection Only" queries, pointing to each individual file. Regardless of how many times I click Refresh All, Manage Data Model, the duplicates still stay there.
Has anyone encountered anything similar or would you have any ideas what could be the reason behind this and how to get it sorted out?
Thank you!

You haven't really provided enough info about your design, but I'm guessing you are using Merge Queries steps to combine the "smaller Excel worksheets"? If so, the typical issue is that you have not specified the correct columns to match on in the Merge Queries step definition.
If the combination of columns you have chosen is not unique on at least one side of the Merge, then duplicated rows will appear in the subsequent Expand step.
The way to find these is to start a new Query against each source table in turn, select the columns you are matching on and use Keep Rows / Keep Duplicates. You should see no rows resulting - any rows that do appear are the source of your duplicates.
I usually save such queries and include them in the Refresh as an automated test going forwards. I put them in a separate Query Group e.g. "Tests - should return 0 rows".
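To make the mechanism concrete, here is a minimal Python sketch of the same idea outside Excel. It is not Power Query code: the function names, column names and sample rows are all hypothetical. It shows a "keep duplicates" check on the match columns and how a non-unique key multiplies rows after a merge.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def left_merge(left, right, key_columns):
    """Left outer join: one output row per matching (left, right) pair."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[c] for c in key_columns), []).append(row)
    merged = []
    for row in left:
        # A left row with no match is kept once, with no right-hand columns.
        matches = index.get(tuple(row[c] for c in key_columns), [None])
        for match in matches:
            merged.append({**row, **(match or {})})
    return merged


# Hypothetical sample data: (Employee, Date) is not unique in `details`.
timesheet = [{"Employee": "Ann", "Date": "2024-01-02", "Hours": 8}]
details = [
    {"Employee": "Ann", "Date": "2024-01-02", "Task": "Audit"},
    {"Employee": "Ann", "Date": "2024-01-02", "Task": "Review"},
]

bad_keys = duplicate_keys(details, ["Employee", "Date"])
merged = left_merge(timesheet, details, ["Employee", "Date"])
print(bad_keys)     # keys that would fail a "should return 0 rows" test
print(len(merged))  # one timesheet row has become two rows
```

The single timesheet row comes out twice because its key matches two rows on the other side, which is the same effect the Expand step produces.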

Related

Excel: find and order matches by column

I'm currently working with a huge epidemiological dataset spread over several Excel files. The files contain pathology and clinical reports for almost 30k patients. Each patient can have several pathology and clinical reports. Each patient is assigned a unique ID.
I want to combine all files into one so that the ID for patient X001 would contain all the information from all the files. I cannot just copy/paste because the number of rows (IDs) in the files varies.
Here is an example of what I want to accomplish.
I want to combine two lists as follows.
As you can see, List1 and List2 differ in their number of rows. There are also IDs in List1 that are not found in List2 and vice versa.
I want to merge them so that they align and match, see image below. Can someone provide code for this? I cannot do this manually since I have 100k rows in List1 and 30k rows in List2; that would take several weeks, with a risk of errors.
You can merge the tables using Excel's built-in Power Query, which can be found under the Data tab.
Note: Photos are taken from Excel 2016
The first step is to create the queries:
Within the Get & Transform section under the Data tab, click New Query -> From File -> From Workbook and select the workbook that contains the table you want to merge
Select the appropriate sheets in which your tables are found, and confirm that they are displaying properly
If you notice that the table is not correct, you can make changes to it via the Edit button below.
For example, if you notice that your Column headers are being treated as a normal value, you can click Use First Row as Headers under the Power Query Editor Home -> Transform
I would also recommend changing the name of the query so it makes more sense down the line
Once you are happy with the way the query is looking, click on the Close and Load Dropdown menu under the Power Query Editor Home and select Close and Load To...
Select Only Create Connection to add it into your Workbook Queries without duplicating the table.
Repeat the above steps for each table you want to merge.
Once you have all of your tables linked via Queries, you can now move on to merging them:
Under the same section of New Query select Combine Queries -> Merge
Select the two queries you are looking to merge in each of the respective boxes
Confirm that they are correct via the preview window (don't worry if not all rows show)
Rule of thumb would also be to select your largest query first, and the smaller second
Next, highlight the columns you want to match on. For your example it would be the ID. This is done simply by clicking on the column within the preview
Finally change the Join Kind to Full Outer and click OK
From here you should be back in the Power Query Editor
The final steps are modifying this merged query to your desired output
You should notice that there is a new column added next to your first original table with the name of the query at the top, next to the name is a button that allows you to expand out this query.
Select the appropriate columns you would like to merge into the other table and click OK
If at any point you make a mistake, you can retrace your changes under Applied Steps within the Query Settings Pane
Once you are happy with the way your newly merged query looks, go ahead and click on Close and Load
You should now have access to your new merged query, which will update based on changes made to the original connected files
If you want to make any additional changes going forward from this point just click anywhere inside of the table and you should see both the Table Tools and Query Tools tabs appear at the top
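For readers who want to see what the Full Outer join kind does to the two lists, the following is a minimal Python sketch, not Power Query code. The function name, the Pathology and Clinical column names and the sample values are assumptions made for illustration; only the ID column and the X001 style of ID come from the question.

```python
def full_outer_join(list1, list2, key="ID"):
    """Align two lists of records on `key`, keeping unmatched rows from both."""
    right_by_key = {}
    for row in list2:
        right_by_key.setdefault(row[key], []).append(row)

    result, seen = [], set()
    for row in list1:
        seen.add(row[key])
        # No match on the right: keep the left row on its own.
        for match in right_by_key.get(row[key], [{}]):
            result.append({**row, **match})
    # Rows that exist only in list2 are appended at the end.
    for row in list2:
        if row[key] not in seen:
            result.append(dict(row))
    return result


# Hypothetical sample data.
list1 = [
    {"ID": "X001", "Pathology": "P1"},
    {"ID": "X002", "Pathology": "P2"},
]
list2 = [
    {"ID": "X002", "Clinical": "C2"},
    {"ID": "X003", "Clinical": "C3"},
]

aligned = full_outer_join(list1, list2)
for row in aligned:
    print(row)
```

Note that an ID occurring several times on both sides yields one output row per pair of matching rows, so a patient with several reports in each file will produce more rows than either input has for that patient.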

Table doesn't expand when adding new data (from .csv files in my case)

Let's say I have some external .csv files that get updated, and I just need to hit the Refresh button in Power Query to make some magic. That works fine, BUT there are some columns that hold information about some parts, and I need to look up values for them in another .csv file. What I did here is this: I didn't convert all 4 columns into one table, but separated them, so each column has its own table name. I did that because I had some issues with refreshing from Power Query, and it seemed easier to do the calculation first and then convert to a table. Maybe that was not smart, though?
My actual issue is that I am not getting new rows with new data beneath my "tables"; I must drag them down to populate. Why does that happen?
These are the functions I used, starting with the first column:
=INDEX(Matrix[[#All];[_]];ROW())
Then others are just lookup ones depending which info I am looking for:
=INDEX(variantendb[Vartext];MATCH(C2;variantendb[Variante];0))
And the last column is a calculation that concatenates the info name and code together:
='0528 - info'!$D2 & " "& "("&'0528 - info'!$C2&")"
I made all of them as 5 SEPARATE tables, not as one table. Maybe I should use one table, then do the calculations, and then it will be dynamically updated?
It is automatically updated only when I add new data somewhere in the middle of the .csv, but not when the new data is in the last row; then the table does not expand!
Well, I solved it. How? Using Power Query at its best. I played around and it actually gave me a completely different approach to my problem, using the Merge function and a bit of formatting. It works flawlessly, with minimal functions afterwards. What is important is that it refreshes in a millisecond - PROPERLY!!!
I am amazed by PQ and its functionality.

How do I lock an additional column to rows imported from Power Query in Excel 2016 without a unique key column?

I am using Power Query in Excel 2016 to combine data from 12 different workbooks within the same folder system into one table, and need to add an additional column in the master table that tracks the status of each row. However, when I refresh the data, the Status column does not follow the rows to which it is initially applied.
I have already looked at [ Inserting text manually in a custom column and should be visible on refresh of the report ] but this solution only works with a unique ID column. Because each of the 12 workbooks is edited separately and because there is no single column that can be guaranteed to have unique values between all of the different spreadsheets, I don't have a key to join the data to the additional column.
I believe there is always a way of finding a Unique ID. If you can get your head around this, it is not that difficult to solve your problem.
See my example below: I used three sample workbooks saved in a Test folder. The details depend on the way you add them to the query editor; in my example I used From Folder, followed the prompts without making any changes, and combined the tables automatically. Once combined, a Source.Name column is added automatically. I suggest leaving this column in your output table, as it can form part of the Unique ID if your data is highly identical across the workbooks.
An optional step (not in my screenshot) is to add an Index column and concatenate the index number with a product/task name so that it makes each specific line of data entry even more unique.
Once you have added the Status column, with data entered manually in the master table, load the master table back into the query editor.
Then go back to the original query (Test (Input) in my example) and merge it with the reloaded output query. See my screenshot for how to 'uniquely' merge the two tables.
The rest is self-explanatory. I think the key is finding the elements of the Unique ID and incorporating them in the merge.
Let me know if you have any questions. Cheers :)
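As a rough illustration of building such a Unique ID, here is a minimal Python sketch, not Power Query code. It swaps the plain Index column for a per-file occurrence counter, which is my own variation; the function name, the RowKey and Task column names and the workbook names are hypothetical. Only Source.Name comes from the answer above.

```python
from collections import Counter


def add_row_keys(rows):
    """Attach a composite key: source file name, task name, occurrence number."""
    seen = Counter()
    keyed = []
    for row in rows:
        base = (row["Source.Name"], row["Task"])
        seen[base] += 1  # 1 for the first occurrence, 2 for the second, ...
        keyed.append({**row, "RowKey": f"{base[0]}|{base[1]}|{seen[base]}"})
    return keyed


# Hypothetical combined table: identical task names within and across files.
combined = [
    {"Source.Name": "Book1.xlsx", "Task": "Paint"},
    {"Source.Name": "Book1.xlsx", "Task": "Paint"},
    {"Source.Name": "Book2.xlsx", "Task": "Paint"},
]

keyed = add_row_keys(combined)
keys = [row["RowKey"] for row in keyed]
print(keys)
```

A counter-based key like this only stays attached to the same row while the order of identical rows within each workbook does not change, so it is a fallback for when no natural key exists rather than a guarantee.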

How to make a file in Excel, that refreshes from Query Editor and work on it

I am using Power Query Editor to create a working file, using multiple tables from several sources.
After I combine these and create my working file, I use it to do some work in columns that I add to the working file later.
I have noticed that the values I enter in the working file are not bound to the main key (let's assume the first column); they are just independent values in a column.
The result is that if one table changes, for example one line is deleted or I change the sorting of the Query, my working file is wrong, since the data changed but the added columns remain as they were.
Is there a way to have the added columns to be bound with a value, as it is for example with VLOOKUP?
How can I make a file that will update from different sources but that I can still work on without the risk of misplacing the work I do?
I hope I am clear.
Thank you in advance!
This is fairly simple if each line in your table is unique (in your example you say the first column can serve as a key). Set up your working columns on the table and then load the table into PQ (as a connection only). Then go to your original query that is combining your data and add a merge at the end, where you merge against the table you just loaded into PQ and match on your key. Then expand only your working columns from the merge.
This way, whenever you refresh your table, it will match lines against its existing output before updating, so data in your work columns will be maintained. However, note that this is only going to retain values, not any formulas you may be using in your work columns.
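The logic of this self-referencing merge can be sketched in a few lines of Python. This is a model of the idea, not Power Query code; the function name, the ID, Item and Status columns and the sample rows are hypothetical.

```python
def refresh_with_notes(fresh_rows, previous_output, key, work_columns):
    """Re-attach manually entered columns to freshly loaded rows by key."""
    saved = {row[key]: row for row in previous_output}
    refreshed = []
    for row in fresh_rows:
        old = saved.get(row[key], {})
        # Only the working columns are carried over; source data wins otherwise.
        carried = {col: old.get(col) for col in work_columns}
        refreshed.append({**row, **carried})
    return refreshed


# Hypothetical previous output of the query, including a manual Status column.
previous = [
    {"ID": 1, "Item": "Bolt", "Status": "Checked"},
    {"ID": 2, "Item": "Nut", "Status": "Pending"},
]
# Hypothetical fresh source data: row 1 was deleted, row 3 is new.
fresh = [
    {"ID": 2, "Item": "Nut M8"},
    {"ID": 3, "Item": "Washer"},
]

result = refresh_with_notes(fresh, previous, "ID", ["Status"])
for row in result:
    print(row)
```

The manual value follows its key even though the row moved, the deleted row's value is dropped, and the new row starts with an empty working column.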

Cannot delete a column that contains multiple tables

Whenever I have two tables in the same column, I get this error.
Create a table in columns (i.e. B1:C3)
Create another table below that table (i.e. B5:C7)
Right-click on column B
Is the "Delete" option grayed out (unavailable)?
Convert the second table (B5:C7) back to a normal area
Right-click on column B
Is the "Delete" option active (black) now?
It is for me.
I don't understand why it happens but I'd really appreciate if someone could confirm that I'm not alone on this one. This actually seems like a bug.
Unfortunately this is 'behavior by design'. A ListObject (aka structured table) has many internal mechanisms. The Delete (column) command is not designed to enumerate through all of the ListObjects on the worksheet to see if any intersect with the column being deleted and then spawn subprocesses that deal with deleting table columns specifically while simultaneously keeping in mind how that will affect other ListObject tables. Instead, it simply does not allow the Delete command when more than a single ListObject table is involved.
This may not be allowed because deleting a column would shift cells. Why don't you try deleting by selecting one column of a table, like this:
See the screenshot: you can do it if you select one column of a table at a time.
Thanks
Try organizing your data in a different way, so these problems don't occur.
There is no compelling reason to have several tables on ONE sheet. If table placement presents a problem with row/column management, consider moving tables to separate sheets.
Tables can be referenced in formulas by the table name. Ditto for table columns, so there really is no reason to keep several tables on one sheet if you need flexibility with row and column management.
Edit after comment: The fact that users are working with several tables and cannot be expected to change sheets to maintain data on different sheets can be addressed in different ways:
Educate your user. I'm a big fan of teaching people how to use software. If they understand what they are doing, they feel positive. If you keep them dumb and tell them to "just click there and shut up" they may feel negative.
You may want to re-consider your data architecture. Provide your users with an interface to add/edit/delete records that is independent of where the data is stored. This is 2016. Data input and data storage are not married to the same page.
You are posting your question in a site for enthusiast programmers. A little bit of VBA will separate your data entry/data storage issues, if you are interested to work it out.

Resources