Merge data from excel files - excel

I have about 70,000 Excel files, each about 300 KB in size. The first column is date and time and the remaining columns are all doubles.
How do I merge them into a single CSV file, or bring them all together into one sheet of an Excel workbook? I was thinking about using MATLAB, but it runs out of memory.

You could try RDBMerge. It's a free add-in for Excel built for this sort of work.
Alternatively, you may find the following info useful:
http://ask.metafilter.com/106144/Combining-a-ton-of-Excel-files-into-one-Excel-file
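If a scripted route is an option, the out-of-memory problem can usually be avoided by streaming: read one file at a time and append its rows straight to the CSV, so only one row is held in memory at once. Below is a minimal Python sketch of that idea, not a tested solution for your data. It assumes the files are .xlsx, that each has exactly one header row, and that the third-party openpyxl package is installed; the function names and the `data/*.xlsx` path are made up for illustration.

```python
import csv
from typing import Iterable, Iterator, Sequence


def iter_xlsx_rows(path: str) -> Iterator[tuple]:
    """Yield rows from the active sheet of one workbook, one row at a time.

    Assumption: openpyxl is installed and the file is .xlsx (not legacy .xls).
    """
    from openpyxl import load_workbook  # third-party, imported only when needed

    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        for row in wb.active.iter_rows(values_only=True):
            yield row
    finally:
        wb.close()


def merge_to_csv(sources: Iterable[Iterable[Sequence]], out_file,
                 header_rows: int = 1) -> int:
    """Append every row of every source to out_file as CSV.

    The header is kept from the first source only. Returns the number of
    rows written.
    """
    writer = csv.writer(out_file)
    written = 0
    for index, rows in enumerate(sources):
        for row_number, row in enumerate(rows):
            if index > 0 and row_number < header_rows:
                continue  # skip repeated header rows
            writer.writerow(row)
            written += 1
    return written


# Hypothetical usage (paths are placeholders):
#   import glob
#   with open("merged.csv", "w", newline="") as out:
#       paths = sorted(glob.glob("data/*.xlsx"))
#       merge_to_csv((iter_xlsx_rows(p) for p in paths), out)
```

Note that 70,000 files of a few thousand rows each will likely exceed Excel's row limit for a single sheet, so a CSV is probably the more realistic target.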

If it's a one-off, this is what I would do:
Copy all details from both sheets to a single sheet in a new workbook
Sort descending by your date and time column
Duplicate the date and time column on the end (for the vlookup; if you are comfortable using index you can avoid this)
Duplicate the sheet
Delete the data in the date and time column
Data / Remove duplicates to get a distinct set
Use a vlookup on some referential column to get the date and time (Make the last argument in the vlookup 0)
This will pull in the first instance it finds for the data (ie the highest date / time).
Should take about 2 mins all up.
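For comparison, the same "sort descending, then keep the first hit per key" logic can be sketched in a few lines of Python. This is only an illustration of the steps above; the row layout (timestamp in position 0, the referential column in position 1) is an assumption.

```python
from datetime import datetime


def latest_per_key(rows, key_index=1, time_index=0):
    """Return one row per key: the row with the highest timestamp.

    Mirrors the manual steps: sort descending by date/time, then keep
    only the first row seen for each key.
    """
    ordered = sorted(rows, key=lambda r: r[time_index], reverse=True)
    seen = set()
    result = []
    for row in ordered:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Made-up sample rows: (timestamp, key, value)
sample = [
    (datetime(2020, 1, 1, 8), "A", 1.0),
    (datetime(2020, 1, 2, 8), "A", 2.0),
    (datetime(2020, 1, 1, 9), "B", 3.0),
]
newest = latest_per_key(sample)
```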

Related

How do I sort a spreadsheet by dates from different columns?

I have a spreadsheet that contains a number of deadlines, as well as columns indicating whether each activity has been completed. I want the sheet to sort itself by "the next due date", which will come from a different column for each row.
For example, row 3 might need to be sorted by the date in column H, whereas row 2 might need to be sorted by the date in column J (since 2's colH activity is completed).
More specifically, a row might need to be sorted by its "45N" due date; once the 45N is finished, I'd want it to be sorted by the "45R" due date.
One option I read about is to create a 'dummy' column after all the data and populate it with the "sortBy" date for each row, and then simply sort by this column (and maybe delete it afterward). I think this would be fairly straightforward, and I have the conditional logic I'd need.
Is there a way to do this in VBA without populating a column? Just sort of saving a variable FOR EACH ROW in VBA, rather than making it a temporary entry in the spreadsheet itself, and then performing a sort on that?
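To make the "sort key per row, without a helper column" idea concrete, here is a rough sketch in Python rather than VBA. It only illustrates the logic: the key is computed per row in memory and used for the sort, and nothing is written back. The field names (`45N_done`, `45N_due`, and so on) are invented for the example.

```python
from datetime import date

# Assumed order in which the activities become due (hypothetical field names).
ACTIVITIES = (("45N_done", "45N_due"), ("45R_done", "45R_due"))


def next_due(row):
    """Return the due date of the first activity that is not yet completed."""
    for done_field, due_field in ACTIVITIES:
        if not row[done_field]:
            return row[due_field]
    return date.max  # everything completed: sort to the bottom


rows = [
    {"name": "row2", "45N_done": True, "45N_due": date(2024, 1, 5),
     "45R_done": False, "45R_due": date(2024, 3, 1)},
    {"name": "row3", "45N_done": False, "45N_due": date(2024, 2, 1),
     "45R_done": False, "45R_due": date(2024, 4, 1)},
]

# The key lives only in memory; no extra column is added to the data.
by_next_due = sorted(rows, key=next_due)
```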

Remove Duplicates Excel - From Each Day

I have a spreadsheet with columns of the following records:
Caller ID (Customer's number each call came from)
Date (the dates are when the calls came in since January 1 2017)
How can I efficiently remove duplicate Caller IDs within each day?
If I simply use the Remove Duplicates tool, it will remove duplicates across the entire year so far.
So I pretty much want to remove instances where a customer called more than once in a day.
Here is an example of the data.
How can I make it so only the first record from each day shows?
My actual sheet has over 100k rows
Why delete the duplicates? Just create a pivot table; it can show the unique values and a count of the duplicates.
The Remove Duplicates tool should work. Are you sure you have only those 2 columns (number and date) ticked in the Remove Duplicates dialog?
If that still doesn't work, check the date format.
A quick and dirty solution:
Assuming the data is sorted (ascending) by date to begin with, you could do a primary sort on date and a secondary sort on phone number. In an empty column, enter the formula =(b2-b1)+(c2-c1) (where phone is column B, date is column C, and row 2 is your first row of data), then copy and paste it down to the last row of data. Now filter that column for 0 only and hide or delete those rows. Be aware that the two differences can cancel each other out (for example, the date goes up by 1 while the phone number goes down by 1), and that the formula returns an error in row 2 if row 1 holds text headers, so =AND(B2=B1,C2=C1) filtered for TRUE is a stricter test.
Now if the data wasn't already sorted by date and you need it restored to the original order, you could first add a column for a numeric order (i.e. 1, 2, 3) with formulas. The first row of data (row 2) is a 1, the next row is a formula (=a2+1), and you copy and paste that down to the last row. When you run the above process, you would delete the flagged rows, then run a final sort on column A to get back to the original order.
Hope that helps. If this is a one-time thing that should suffice. If not, a macro could do the same thing and shouldn't be too hard to write. Or you could leave the extra columns in place or just hide them, for future use.
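If the data ever leaves Excel, the same rule ("keep only the first call per caller per day") is short to express in code. The sketch below is Python and purely illustrative; it assumes each record is a (caller ID, timestamp) pair and keeps the original row order, so no sorting or helper column is needed.

```python
from datetime import datetime


def first_call_per_day(records):
    """Keep the first record for each (caller ID, calendar day) pair.

    records: iterable of (caller_id, timestamp) tuples in their original order.
    """
    seen = set()
    kept = []
    for caller_id, when in records:
        key = (caller_id, when.date())  # same caller on the same day
        if key not in seen:
            seen.add(key)
            kept.append((caller_id, when))
    return kept


# Made-up sample data
calls = [
    ("555-0100", datetime(2017, 1, 1, 9, 0)),
    ("555-0100", datetime(2017, 1, 1, 15, 30)),  # second call, same day
    ("555-0100", datetime(2017, 1, 2, 10, 0)),   # next day, so it is kept
    ("555-0199", datetime(2017, 1, 1, 11, 0)),
]
unique_calls = first_call_per_day(calls)
```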

Calculate the average of columns, then remove the columns using Google Spreadsheet

I have a data set in a Google Spreadsheet with each column representing a year. The data I want from this spreadsheet is the average over five years, so I create a column for each five year interval containing the average (calculated with =AVERAGE(B2:F2) for the second row and first interval). Now I do not want the columns with specific years when I create a .csv file of the data and use it in my program, but only the columns containing the average. If I remove those columns that I don't want I will get an error saying that elements are missing.
Is there a way, in Google Spreadsheet, to "save" the calculated average (or any function for that matter) in the column so that it only contains a single numeric value and has no dependencies?
If I'm understanding your question correctly, you could highlight all the cells with the averages, copy them, and use Paste special > Values. That removes the formulas in those cells but keeps the values, so they survive when you delete the data columns.

To match time stamp in excel

In spreadsheet A, I have hourly data corresponding to a particular set of sample numbers (e.g. 1-10). Then I have 7 other data spreadsheets with 7 different time stamps, all in mm/dd/yyyy hh:mm:ss format. Out of the 7, six spreadsheets have data every 2 minutes and the seventh has data every hour.
My objective is to match corresponding data values of 7 spreadsheets (1-7) with spreadsheet A and calculate the mean value of the data for each set of samples.
I have only basic knowledge of working with excel, so I first created a grand time stamp (called it as GTS, which has data entries every second) and then converted each entry of GTS and other time stamp entries into "serial date" format. I am trying to match these entries using functions like "IF", "Match" etc. but haven't found an appropriate method to execute it correctly.
Any advice would be appreciated. Thanks!
OK, I'm not exactly sure what you're trying to say, but this is what I imagine:
Sheet 1 is your summary sheet, with column A holding the times and the data to the right, and sheets 2-7 hold the source data in a similar setup.
First question: do you need all the data in 1 column, or can you have 7 columns with the last one being the mean? If the latter, then
=INDEX(Sheet2!B:B,MATCH(A2,Sheet2!A:A,0),0)
and use Sheet3, or whatever the sheet name is, for the next column.
Or you can get creative and put the sheet names in the summary sheet as column headings; then you can just do
=INDEX(INDIRECT(B$1&"!B:B"),MATCH($A2,INDIRECT(B$1&"!A:A"),0),0)
Also, if the data isn't available for every time stamp, I'd wrap this around the formula to return 0s:
=IFERROR(......,0)
If you need to do all 6 source sheets in 1 cell, then you've got to do some nesting.
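As a rough illustration of what those formulas compute, here is the same exact-match lookup plus mean written as a small Python sketch. It assumes each source sheet has been loaded into a dict mapping time stamp to value (how you load them is up to you), that at least one source is given, and that missing time stamps count as 0, which is what the IFERROR wrapper does.

```python
def mean_by_timestamp(summary_times, sources, missing=0.0):
    """For each summary time stamp, look up the exact match in every source
    and return (time stamp, mean of the looked-up values).

    sources: non-empty list of dicts mapping time stamp -> value.
    missing: value used when a source has no entry (the IFERROR(..., 0) case).
    """
    result = []
    for ts in summary_times:
        values = [source.get(ts, missing) for source in sources]
        result.append((ts, sum(values) / len(values)))
    return result


# Made-up example with two sources; the second has no 02:00 entry.
sheet2 = {"01/01/2020 01:00:00": 2.0, "01/01/2020 02:00:00": 2.0}
sheet3 = {"01/01/2020 01:00:00": 4.0}
means = mean_by_timestamp(
    ["01/01/2020 01:00:00", "01/01/2020 02:00:00"], [sheet2, sheet3]
)
```

Treating a missing reading as 0 pulls the mean down, so you may prefer to skip missing values instead; that is a design choice the original formulas leave open.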

how to optimize speed excel 2007 (±20,000 rows)

I'm in the process of working with an Excel file that contains two columns (old URL and new URL). But it contains about 20,000 rows.
And I have another file containing about 400 old/new URL that needs to be imported in the big ±20,000 rows file.
I have to do all kinds of processing, like:
- Find all duplicate rows (the same two columns appearing more than once). That functionality would live in a column, and it would be good to run that check each time I add a row, to see if that URL combination already exists in the file
Note that I already turned the sheet into a table.
2 questions now:
1) Should I do some kind of vlookup between the ±20,000 rows sheet and the ±400 rows sheet, or use VBA? I don't know what would be the best way to do this (i.e. if a row from the ±400 rows sheet is not in the ±20,000 rows sheet, add it). Should I use vlookups or populate arrays in VBA (speed-wise)? If I use vlookup, is it true that it is possible to put the vlookup function in a sheet and refer to it in every row instead of putting a vlookup function directly in every row?
2) How can I optimize the 20,000 rows sheet? Right now, each time I want to sort or filter, it takes an eternity to redraw and it freezes my PC for that time!
Thanks for your help.
First, to omit the dupes from the 400-ish row sheet that needs to be added in, use a COUNTIFS formula against the big sheet, then sort by this value and only copy in rows where the value is < 1 (or an error).
Second, I would probably do the same thing in the big sheet but referencing itself; anything with a value above 1 is a dupe.
Lastly, are there formulas in the 20,000 row sheet? I could set up a 20,000 row sheet with just a "1" in range A1:A20000 and doing anything on it would be super quick. It all comes down to what data you have in there and what you can do to reduce its load on the system (i.e. convert formulas to values if they no longer need to be calculated).
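For what it's worth, the COUNTIFS logic boils down to "is this old/new pair already present, and how many times". A small Python sketch of both checks is below. It is only meant to show the idea, with made-up function names, and assumes each row is an (old URL, new URL) tuple.

```python
from collections import Counter


def add_missing_pairs(big, small):
    """Return the big list plus every pair from small that it does not
    already contain (the "COUNTIFS < 1" case)."""
    existing = set(big)
    merged = list(big)
    for pair in small:
        if pair not in existing:
            existing.add(pair)
            merged.append(pair)
    return merged


def duplicate_pairs(rows):
    """Return {pair: count} for every pair that occurs more than once
    (the "COUNTIFS > 1" case)."""
    counts = Counter(rows)
    return {pair: n for pair, n in counts.items() if n > 1}


# Made-up sample data
big_sheet = [("/old-a", "/new-a"), ("/old-b", "/new-b"), ("/old-a", "/new-a")]
small_sheet = [("/old-b", "/new-b"), ("/old-c", "/new-c")]

merged = add_missing_pairs(big_sheet, small_sheet)
dupes = duplicate_pairs(big_sheet)
```

Because membership tests on a set take roughly constant time, this stays fast at 20,000 rows, which is the same reason populating a lookup structure once tends to beat a per-row lookup formula.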
Excel 2007 has a built-in feature for this, also available from VBA: Range.RemoveDuplicates, or Data tab -> Data Tools group -> Remove Duplicates.
For example data:
Click the Remove Duplicates button:
And you are done!
The VBA equivalent is:
ActiveSheet.Range("$A$1:$B$10").RemoveDuplicates Columns:=Array(1, 2), Header:=xlYes
Note that the 1 and 2 do not mean columns A and B of the worksheet. They refer to the first and second columns of the selected Range.
If your worksheet only contains 2 columns, you could use UsedRange instead.
