I'm currently working with a huge epidemiological dataset spread over several Excel files. The files contain pathology and clinical reports for almost 30k patients. Each patient can have several pathology and clinical reports. Each patient is assigned a unique ID.
I want to combine all the files into one so that the ID for patient X001 would contain all the information from all the files. I cannot just copy/paste because the number of rows (IDs) varies between files.
Here is an example of what I want to accomplish.
I want to combine two lists as follows.
As you can see, List1 and List2 have different numbers of rows. Also, there are IDs in List1 that are not found in List2 and vice versa.
I want to merge them so that they align and match, see image below. Can someone provide code for this? I cannot do this manually since I have 100k rows in List1 and 30k rows in List2; that would take several weeks, with a risk of errors.
You can merge the tables using Excel's built-in Power Query, which can be found under the Data tab.
Note: Photos are taken from Excel 2016
The first step is to create the queries:
Within the Get & Transform section under the Data tab, click on New Query -> From File -> From Workbook and select the workbook that has the table you want to merge.
Select the appropriate sheets in which your tables are found, and confirm that they are displaying properly
If you notice that the table is not correct, you can make changes to it via the Edit button below.
For example, if you notice that your Column headers are being treated as a normal value, you can click Use First Row as Headers under the Power Query Editor Home -> Transform
I would also recommend changing the name of the query so it makes more sense down the line
Once you are happy with the way the query is looking, click on the Close and Load Dropdown menu under the Power Query Editor Home and select Close and Load To...
Select Only Create Connection to add it into your Workbook Queries without duplicating the table.
Repeat the above steps for each table you want to merge.
Once you have all of your tables linked via Queries, you can now move on to merging them:
Under the same section of New Query select Combine Queries -> Merge
Select the two queries you are looking to merge in each of the respective boxes
Confirm that they are correct via the preview window (don't worry if not all rows show)
As a rule of thumb, select your largest query first and the smaller one second
Next, highlight the column(s) you want to match on. For your example it would be the ID. This is done simply by clicking on the column within the preview of each table
Finally change the Join Kind to Full Outer and click OK
From here you should be back in the Power Query Editor
The final steps are modifying this merged query to your desired output
You should notice that a new column has been added next to your first original table, with the name of the second query at the top. Next to the name is a button that allows you to expand this query.
Select the appropriate columns you would like to merge into the other table and click OK
If at any point you make a mistake, you can retrace your changes under Applied Steps within the Query Settings Pane
Once you are happy with the way your newly merged query looks, go ahead and click on Close and Load
You should now have access to your new merged query, which will update based on changes made to the original connected files
If you want to make any additional changes going forward from this point just click anywhere inside of the table and you should see both the Table Tools and Query Tools tabs appear at the top
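For readers who want to see what the Full Outer join above does to the rows, here is a minimal sketch of the same logic in plain Python. It is not part of the Power Query workflow; the function name `full_outer_join` and the sample columns `Pathology` and `Clinical` are made up for illustration, and it assumes the two lists share no column names other than the ID.

```python
from collections import defaultdict

def full_outer_join(left, right, key="ID"):
    """Full outer join of two lists of dicts on `key` (hypothetical helper)."""
    # Index the right-hand rows by key so each lookup is cheap.
    right_by_key = defaultdict(list)
    for row in right:
        right_by_key[row[key]].append(row)

    # Column names on each side (excluding the key), used to pad missing values.
    left_cols = list(dict.fromkeys(c for r in left for c in r if c != key))
    right_cols = list(dict.fromkeys(c for r in right for c in r if c != key))

    merged, matched = [], set()
    for lrow in left:
        partners = right_by_key.get(lrow[key])
        if partners:
            matched.add(lrow[key])
            # One output row per matching pair (handles several reports per ID).
            for rrow in partners:
                merged.append({**lrow, **rrow})
        else:
            # ID only in the left list: right-hand columns stay empty.
            merged.append({**lrow, **dict.fromkeys(right_cols)})

    # IDs only in the right list: left-hand columns stay empty.
    for rrow in right:
        if rrow[key] not in matched:
            merged.append({key: rrow[key], **dict.fromkeys(left_cols), **rrow})
    return merged

# Illustrative data only.
list1 = [
    {"ID": "X001", "Pathology": "P1"},
    {"ID": "X001", "Pathology": "P2"},
    {"ID": "X002", "Pathology": "P3"},
]
list2 = [
    {"ID": "X001", "Clinical": "C1"},
    {"ID": "X003", "Clinical": "C2"},
]
result = full_outer_join(list1, list2)
```

IDs that appear on only one side are kept with the other side's columns left empty, which is the behaviour that distinguishes Full Outer from the other join kinds in the Merge dialog.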
Related
I am looking for assistance on this project of mine. I want to be able to create a dynamic table that can be updated based on the number of inputs.
The goal is to create a fully built-out table that follows the tier structure. I have attached an example of what I am trying to do, with the inputs above and the intended output below. I want the table to be updated if I add a new selection to any tier, e.g. D to tier 1 or 3 to tier 2.
Example of input and output desired
I have had to map this manually previously which is time consuming and error prone so I am looking for a way to do this automatically. Thank you in advance for any help provided :)
Separate your tier values into three separate tables. In the image below I have created three tables - tier1, tier2 and tier3. In each table, there is one column called "values" containing the values for that tier.
For each table, create a query using Data>Get & Transform Data>From Table/Range.
You should then have three queries in the Power Query Editor, like this:
In the Power Query Editor, select the "tier1" query, then use Add Column>Custom Column and configure it like this:
When you hit OK, you will see this:
Hit the double-headed arrow at the top of the new column then hit OK on the dialog to expand the column. You'll see this at the end:
Repeat the above steps for adding a column for tier3, so at the end you have this:
You can now right-click any of the columns and use 'Rename' to rename them as you want.
Finally, click 'Close & Load' to put the result back to the workbook.
Now, you only need to put your tier values into the three tables, then right-click the final query and select 'Refresh' to run the steps again.
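As a rough illustration of what the queries above produce, here is a small Python sketch. It assumes the intended output is every combination of one value from each tier (a cross join); the tier values and the helper name `build_table` are invented for the example.

```python
from itertools import product

# Illustrative tier values, one list per tier table.
tier1 = ["A", "B", "C"]
tier2 = [1, 2]
tier3 = ["x", "y"]

def build_table(*tiers):
    """Return one row per combination of values, tier1 varying slowest."""
    return [list(combo) for combo in product(*tiers)]

table = build_table(tier1, tier2, tier3)

# Adding a value to a tier and rebuilding mirrors hitting 'Refresh'.
bigger = build_table(tier1 + ["D"], tier2, tier3)
```

With 3, 2 and 2 values the table has 3 x 2 x 2 = 12 rows; adding D to tier 1 grows it to 16 without any manual mapping.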
I have two sets of data; the first (Wind Claims) contains a StartDate, EndDate, and Zip Code field. The second (PLRB Wind) contains a Date, Zip Code, and Wind Speed field.
My goal is to get the Wind Speed from the PLRB Wind tab to the Wind Claims tab if the Date from the PLRB Wind tab is between the StartDate and EndDate on the wind Claims tab AND the Zip Code from the PLRB Wind tab matches the Zip Code on the Wind Claims tab. The point is to identify the wind speed where damage was reported.
I have tried a couple of formulas. With this one I actually got results, but only 1227 out of 16822. I wouldn't expect a 100% match, but definitely many more than I am getting. I think the reason is that this formula looks for the specific date rather than the date range:
=XLOOKUP(Z2&N2,'PLRB Wind'!$I$2:$I$78525&'PLRB Wind'!$D$2:$D$78525,'PLRB Wind'!$M$2:$M$78525,"")
I also tried an Index Match (this is just the Match piece of the formula)
=MATCH(1,IF('PLRB Wind'!D2>=$B$2:$B$16823,IF('PLRB Wind'!D2<='Wind Claims'!$C$2:$C$16823,IF('PLRB Wind'!I2='Wind Claims'!$Z$2:$Z$16823,1))),0)
Thank you in advance for looking at this. I appreciate any help you might be able to provide!
I'd use power query for this. Do you know what power queries are? I was upset when I found out because of all the useful ways I could have been using it before.
You might feel differently, though. Create a new copy of your workbook for this just in case you hate it. :-)
In the "Data" ribbon of Excel, in the Get & Transform section, there's a "From Table" button. Highlight your PLRB table (including the column titles) and click that "From Table" button to create a new query from it. It will create the table and the query.
A power query editor window will pop up, presenting your query as two steps, listed in the middle of the right sidebar. The first step is to get the information from your worksheet. The second step changes the data types. Click the icon to the left of each date column's title to change the type from datetime to date because why not. On the right sidebar, change the query name to PLRB.
Now click "Close & Load" on the home ribbon. It will create a new tab with the results of your table. Leave it for now. You can delete that tab later and it won't delete the query.
So, back to your worksheet, highlight the column-title row and data rows for the first three columns of the wind claims table. Create another query from table. Call it WindClaimsInput. Again, correct the datetime columns to date columns.
OKAY, so now you have two queries. They both read from your workbook but they could have been from another file or text file, etc. If you like this solution then your final form might be a workbook that doesn't actually have any source data in it, just queries that get the raw data from elsewhere and a tab that presents the third query we're about to make.
Now for the fun part.
While still in the power query editor editing your WindClaimsInput query, near the left edge of the "Home" ribbon there's a button named "manage...". Click it, then click "Reference" to create a third query that starts with the old one. Remember, queries are only instructions. We aren't copying data until we run the queries.
Now, find the button to add a column. It should open a dialog box asking the column name and formula. Name it "PLRB" and use this formula:
Table.SelectRows(PLRB, (r) => (r[Date] >= [CATFromDt] and r[Date] <= [CATThruDt] and r[ZipCode] = [ClaimZip]))
Table.SelectRows is a power query function that takes two arguments:
The table (or query that returns a table), and,
A function to run on each record (aka row) of the table and return true/false. In this case, we created a function that takes one argument (r) and returns true or false.
So the above formula says "Give me a table of all rows in PLRB for the given ClaimZip zip code that also have a Date between CATFromDt and CATThruDt." Since it's a column formula, it runs once per row in Wind Claims.
Now you have a table where the last column is another table! Specifically, the rows from PLRB that are relevant for that Wind claims row. You can single-click on any of those cells in that last column to see the subtable.
To the right of the last column's title will be a little "expand" icon. Click it and choose to aggregate by max wind speed. (The right edge of the "wind speed" choice will let you change it to maximum, or average, or whatever you like.) Uncheck "Use original column name as prefix". Click okay. Don't worry, you can delete this new step and try again if I didn't describe it well.
Hit "Close and Load" to see it in your workbook. If it looks right, great! Otherwise, feel free to go back and edit some more.
And now you're done! Unlike formulas it doesn't automatically refresh but when you want to refresh your output based on your input tables you can refresh that query or, in the "Data" ribbon you can click "refresh all".
In the "Data" ribbon of Excel, in the "Get & Transform" section, there's a "Show Queries" button that toggles a sidebar displaying the queries you've made. You probably only want to keep loading your third query, so you can change the "Load to..." of the other two queries to "Connection Only".
Sorry I can't do screenshots right now.
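In lieu of screenshots, here is a rough Python sketch of what the custom column plus the "aggregate by max" expand step compute for each claim row. It reuses the field names from the formula above (Date, ZipCode, CATFromDt, CATThruDt, ClaimZip); `WindSpeed`, the function name and the sample rows are assumptions made for the example.

```python
from datetime import date

def max_wind_speed(claim, plrb_rows):
    """Highest wind speed in the claim's zip code within its date range.

    Returns None when no PLRB row qualifies (an empty cell in Excel).
    """
    speeds = [
        r["WindSpeed"]
        for r in plrb_rows
        if r["ZipCode"] == claim["ClaimZip"]
        and claim["CATFromDt"] <= r["Date"] <= claim["CATThruDt"]
    ]
    return max(speeds) if speeds else None

# Illustrative data only.
plrb = [
    {"Date": date(2020, 5, 1), "ZipCode": "30301", "WindSpeed": 40},
    {"Date": date(2020, 5, 2), "ZipCode": "30301", "WindSpeed": 55},
    {"Date": date(2020, 5, 2), "ZipCode": "30302", "WindSpeed": 70},
    {"Date": date(2020, 5, 9), "ZipCode": "30301", "WindSpeed": 80},
]
claims = [
    {"CATFromDt": date(2020, 5, 1), "CATThruDt": date(2020, 5, 3), "ClaimZip": "30301"},
    {"CATFromDt": date(2020, 6, 1), "CATThruDt": date(2020, 6, 2), "ClaimZip": "30301"},
]
results = [max_wind_speed(c, plrb) for c in claims]
```

The first claim picks up 55 (the 70 is in another zip code and the 80 is outside the date range); the second claim has no qualifying rows and stays empty. Both bounds are inclusive, matching the >= and <= in the formula.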
Above is a picture of my Excel sheet. I have 2 columns of data that have multiple data points in them (separated by commas). This is how my data is spit out after running an online psychology experiment. I'm hesitant to split text to columns because some lines only have 3 values and other lines have 20+. Essentially, I need to match values in one column to values in the second column. For example, the first value in column G needs to match with the first value in column H. The second value needs to match with the second value, etc. I don't need to match up every value in both columns, however. I only need a (defined) subset of values.
I'm not sure if this is possible to do in Excel (or any Excel add-on) without separating the values into separate columns, but any help is appreciated!
I've seen this before in survey data: the output uses "packed data" where each cell contains many values. You will need Excel 2010+ for Windows (or Excel 365) for this solution. Otherwise, there is a solution that is also Mac compatible and does not involve VBA, but it takes time to construct. This approach should take you 10 mins to do. It is a lot of steps, but it is just clicking.
Let's say that these are your data in two columns in a table.
Click anywhere inside the table. Open the Data tab and click on From Table/Range:
This will convert your data into an Excel Table and ask you if your table has headers - yes it does. Click OK.
This will open the Power Query (PQ) editor (congratulations, you are now a step closer to data scientist, so take a selfie with this screen in the back and share on social media).
You will see in the Applied Steps on the right hand side that PQ has helpfully detected the data type in a step called Changed Type. You need to undo that because it will likely think that your comma separated numbers are just one giant number. So click the X on the left side of that step.
On the right side, you can expand out Queries as shown above. Right click on your table and select Duplicate.
NB: This is not the most efficient way to do this, but I think this is something you just want to do one time and you probably don't want to go hacking through the Advanced Editor.
So now you have two tables:
Rename Table1 (2) to Output in the box on the right hand side just to create some clarity.
Right Click on the Response RT column in Output and Remove it. Click on Table1 and do the same thing to the Response column. So now you have Table1 with only the Response RT and Output with only the Responses. Now we will parse these into rows of cleaned data.
Parse Table1
First, in Table1, click on the Response RT column and in the Home tab you will see Split Column. 1) Click on that and choose By Delimiter.
2) It will default to Comma, but you need to click on Advanced options and choose the Rows radio button.
Click OK and it should turn your data into rows of separated numbers and change the type (this time helpfully) to decimal.
Now you need to add an index. 3) Go to the Add Column tab and click on Add Index, starting from 1.
Parse Output Table
Now go back to Output and repeat steps 1), 2) and 3) for it as well. Then you will have to take an extra step to clean up your text column. Right-Click on the Response column and choose Transform > Trim on the data.
That will get rid of those spurious spaces.
Merge Them Back Together
While you still have the Output table selected, go to the Home tab and choose Merge Queries.
It will bring up this window:
Choose Table1 from the bottom dropdown. Click on Index on both tables and click OK. You will get something like this:
Click on the button on the top right of the Table1 column and then unselect Index and Use original column name as prefix.
Click OK. Right click the Index column and Remove it. You now have your answer, but you still need to bring it back to Excel.
Putting it back in Excel
Click on Close & Load To on the left-hand side of the Home tab. To keep things simple, just click OK.
It will put both Output and Table1 into your workbook as worksheets. (This is where I said it is not the most efficient approach. You can always delete the Table1 worksheet; Excel will complain when you do, but you can ignore it.) Output is your answer.
Congratulations, you just did an ETL (extract, transform and load) operation in data analytics. Take another selfie with the answer and share on social media.
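The split, trim and merge-on-index idea can also be sketched in a few lines of Python, which may help in checking the Excel result. This is only an illustration: the helper names are made up, it pairs values row by row rather than through a global index, and it assumes the response times are plain decimal numbers.

```python
def split_to_rows(cell, cast=str):
    """Split a comma-packed cell into trimmed values, skipping empty pieces."""
    return [cast(part.strip()) for part in cell.split(",") if part.strip()]

def pair_by_position(response_cell, rt_cell):
    """Match the 1st response with the 1st RT, the 2nd with the 2nd, and so on.

    zip() stops at the shorter list, so a row with unequal counts is
    silently truncated; check the lengths first if that matters.
    """
    responses = split_to_rows(response_cell)
    times = split_to_rows(rt_cell, float)
    return list(zip(responses, times))

# Illustrative cells, standing in for one row of "Response" and "Response RT".
pairs = pair_by_position("left, right , left", "0.51, 0.62,0.48")
```

Filtering `pairs` down to a defined subset afterwards is then an ordinary list comprehension on the response value or its position.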
I am using Power Query in Excel 2016 to combine data from 12 different workbooks within the same folder system into one table, and need to add an additional column in the master table that tracks the status of each row. However, when I refresh the data, the Status column does not follow the rows to which it is initially applied.
I have already looked at [ Inserting text manually in a custom column and should be visible on refresh of the report ] but this solution only works with a unique ID column. Because each of the 12 workbooks is edited separately and because there is no single column that can be guaranteed to have unique values between all of the different spreadsheets, I don't have a key to join the data to the additional column.
I believe there is always a way of finding a Unique ID. If you can get your head around this, it is not that difficult to solve your problem.
See my example below. I used three sample workbooks saved in a Test folder. The details depend on how you add them to the query editor; in my example I used From Folder, followed the prompts without making any changes, and let it combine the tables automatically. Once combined, a Source.Name column is added automatically. I suggest leaving this column in your output table, as it can form part of the Unique ID if your data is nearly identical across the workbooks.
An optional step (not in my screenshot) is to add an Index column and concatenate the index number with a product/task name so it can make that specific line of data entry even more unique.
Once you have added the Status column, with data entered manually on the master table, load the master table back into the query editor.
Then go back to the original query (Test (Input) in my example) and merge it with the reloaded output query. See my screen-shot for how to 'uniquely' merge the two tables.
The rest is self-explanatory. I think the key is finding the elements of the Unique ID and incorporating them in the merge step.
Let me know if you have any questions. Cheers :)
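To make the composite-key idea concrete, here is a small Python sketch of the merge logic. It assumes the key is Source.Name plus a per-file Index column, as suggested above; the function names, the `Task` column and the sample rows are invented for the example.

```python
def make_key(row):
    """Composite key: which workbook the row came from plus its index there."""
    return (row["Source.Name"], row["Index"])

def reattach_status(refreshed_rows, previous_output):
    """Left-join the manually entered Status back onto freshly loaded rows."""
    status_by_key = {make_key(r): r.get("Status") for r in previous_output}
    return [{**r, "Status": status_by_key.get(make_key(r))} for r in refreshed_rows]

# Illustrative data: the master table as last saved, with Status typed in.
previous_output = [
    {"Source.Name": "A.xlsx", "Index": 1, "Task": "T1", "Status": "Done"},
    {"Source.Name": "B.xlsx", "Index": 1, "Task": "T1", "Status": "Open"},
]
# The same sources after a refresh: different order plus one new row.
refreshed = [
    {"Source.Name": "B.xlsx", "Index": 1, "Task": "T1"},
    {"Source.Name": "A.xlsx", "Index": 1, "Task": "T1"},
    {"Source.Name": "A.xlsx", "Index": 2, "Task": "T2"},
]
result = reattach_status(refreshed, previous_output)
```

Each Status follows its row even though the refresh order changed, and the new row comes back with an empty Status. One caveat with an index-based key: if rows are inserted or reordered inside a source workbook, the index shifts, which is why combining it with a product/task name makes the key more robust.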
I am trying to do the following and am stuck.
I have an excel spreadsheet with 3 tabs.
1. One tab is an input file
2. Second tab is a set of data
3. Third tab is a set of data
For #1, the first tab contains a list of file names and where they are located.
I then use power query to combine those two columns, FileNames and QuickCheck, to produce the table that I want to run "Quick Checks" against:
For #2 and #3, those tabs contain customer data
Basically, with power query, how do I run a search where, if the Custom column in 1 matches the quick check column in #1, it pulls that row of data and outputs it to another tab? My desired output file needs to look like this:
You can merge queries #1 and #2 using the Merge Queries button and then expand the new column it generates by clicking on the button with the two arrows in the column header. You can do the same merge with queries #1 and #3, and then append those two merged queries together using the Append Queries button.
If you want to keep queries 1-3 unmodified, you can Duplicate or Reference the query you're going to use as the left table by expanding the queries pane (next to the table preview in the editor), right-clicking on the query name in the queries pane and selecting the relevant context menu item. You can then do the merge step on that without modifying the original query.
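The merge-then-append flow can be sketched in Python as follows. This is an illustration only: it assumes the match is between a QuickCheck column in query #1 and a Custom column in queries #2 and #3, keeps only matching rows (an inner join; the Merge dialog lets you pick another join kind), and uses invented sample data and an invented `Customer` column.

```python
def inner_merge(left, right, left_key, right_key):
    """Keep only left rows that have a match on the right, combined with it."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in by_key.get(l[left_key], [])]

# Illustrative data only.
checks = [{"QuickCheck": "fileA.csv"}, {"QuickCheck": "fileC.csv"}]  # query #1
tab2 = [
    {"Custom": "fileA.csv", "Customer": "Acme"},
    {"Custom": "fileB.csv", "Customer": "Bolt"},
]  # query #2
tab3 = [{"Custom": "fileC.csv", "Customer": "Core"}]  # query #3

# Merge #1 with #2, merge #1 with #3, then append the two results.
output = (
    inner_merge(checks, tab2, "QuickCheck", "Custom")
    + inner_merge(checks, tab3, "QuickCheck", "Custom")
)
```

The Bolt row is dropped because fileB.csv is not in the quick-check list, and the other two rows end up stacked in one table.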