Excel: Remove Duplicates based on time condition - excel

I'm looking to remove duplicates from a 250,000 row excel sheet based on a 3 month rolling time condition.
We have a lot of user IDs and the dates on which they visited, but a lot of these visits are very far apart (sometimes over a year) and a lot of them are within the same day or a couple of days of each other.
The best way to explain what I want to do is with an example. So if they first visited on 1st Jan, 1st Jan, 3rd Jan, 8th Feb, 4th June, 5th June, 1st Dec, 1st Dec, 2nd Dec, I would want to grab that first date of 1st Jan, 4th June and 1st Dec.
If they visited 1st Jan, 1st Jan, 3rd Jan, 8th Feb, 9th Apr then 1st August, 1st Sept, I would want 1st Jan and 1st August.
So we want to grab the first date, then see how often they visit within 3 months of each visit and if they leave for more than a 3 month period, grab the first date that they return. Sometimes they come back 4 or 5 times after 3 months and the data can span several years.
Is there a way for me to achieve this? It would be great to get some help as this is driving me mad.
Cheers

If the UserID is in column A and the VisitDate is in B with the headings in row 1 and then a blank row in 2 and the data starting in row 3 then try this (explanation below):
Array Formula version:
sort the rows ascending by VisitDate
in B2 put 1/1/1900 so it won't match anything (but it has to be a date)
in C3 put this array formula (press control-shift-enter instead of just enter):
=SUM((B$2:B2<DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)))*(A$2:A2=A3))=SUM((A$2:A2=A3)*1)
Copy the formula in C3 down to every row of data
Filter on Unique = TRUE
if you want to re-sort the rows afterwards, first copy column C and paste it back as values
New non-array formula version:
sort the rows ascending by VisitDate
in B2 put 1/1/1900 so it won't match anything (but it has to be a date)
in C3 put this normal formula (just press enter):
=COUNTIFS(B$2:B2,"<"&DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)),A$2:A2,A3)=COUNTIF(A$2:A2,A3)
Copy the formula in C3 down to every row of data
Filter on Unique = TRUE
if you want to re-sort the rows afterwards, first copy column C and paste it back as values
This produces the following with my sample data (array formulas may take a very long time to calculate for lots of rows):
| A | B | C
---+--------+------------+--------
1 | UserID | VisitDate | Unique
2 | | 1/01/1900 |
3 | a | 1/01/2017 | TRUE
4 | a | 1/01/2017 | FALSE
5 | b | 2/01/2017 | TRUE
6 | b | 2/01/2017 | FALSE
7 | a | 3/01/2017 | FALSE
8 | c | 3/01/2017 | TRUE
9 | c | 3/01/2017 | FALSE
10 | b | 4/01/2017 | FALSE
11 | c | 5/01/2017 | FALSE
12 | a | 8/02/2017 | FALSE
13 | b | 9/02/2017 | FALSE
14 | c | 10/02/2017 | FALSE
15 | a | 4/06/2017 | TRUE
16 | a | 5/06/2017 | FALSE
17 | b | 5/06/2017 | TRUE
18 | b | 6/06/2017 | FALSE
19 | c | 6/06/2017 | TRUE
20 | c | 7/06/2017 | FALSE
21 | a | 1/12/2017 | TRUE
22 | a | 1/12/2017 | FALSE
23 | a | 2/12/2017 | FALSE
24 | b | 2/12/2017 | TRUE
25 | b | 2/12/2017 | FALSE
26 | b | 3/12/2017 | FALSE
27 | c | 3/12/2017 | TRUE
28 | c | 3/12/2017 | FALSE
29 | c | 4/12/2017 | FALSE
Because the formula compares the current row with all the rows above looking for rows with dates in the past the data needs to be sorted with the oldest dates first.
How the array formula works:
=SUM((B$2:B2<DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)))*(A$2:A2=A3))=SUM((A$2:A2=A3)*1)
DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)) is 3 months ago (even if it is 92 days)
(B$2:B2<DATE(YEAR(B3),MONTH(B3)-3,DAY(B3))) is an array of TRUE/FALSE values which has a TRUE for every row above that is older than 3 months ago
(A$2:A2=A3) is an array of TRUE/FALSE values which has a TRUE for every row above that matches the user ID
(B$2:B2<DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)))*(A$2:A2=A3) does an AND of the arrays so 1 is returned (TRUE*TRUE=1) for each row above that has the same name and a date that is older than 3 months ago
SUM((B$2:B2<DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)))*(A$2:A2=A3)) adds all the TRUE rows above that have the same name and a date that is older than 3 months ago
SUM((A$2:A2=A3)*1) adds the number of rows above that have the same name (TRUE*1=1)
=SUM((B$2:B2<DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)))*(A$2:A2=A3))=SUM((A$2:A2=A3)*1) compares the two sums and returns TRUE if all the rows above that have the same name are all older than 3 months ago
Methodology:
I originally just played with a column of dates, with no userID. I wanted to find a way to know if the date on a particular row was more than 3 months after all the dates before it (I implicitly assumed that the dates were sorted). I reasoned that if a count of the dates before the current row matched a count of the dates before the current row that were older than 3 months in the past, then I would have the answer I wanted. So I originally put this formula in C3 and copied it down:
=COUNTIF(B$2:B2,"<"&(B3-90))=COUNTA(B$2:B2)
Then change it to 3 months instead of 90 days:
=COUNTIF(B$2:B2,"<"&DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)))=COUNTA(B$2:B2)
And then to add the userID we need a way to compare multiple criteria - this is where COUNTIFS comes in (if you have Excel 2007 or better):
=COUNTIFS(B$2:B2,"<"&DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)),A$2:A2,A3)=COUNTIF(A$2:A2,A3)
And then I converted it to this array formula:
=SUM((B$2:B2<DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)))*(A$2:A2=A3))=SUM((A$2:A2=A3)*1)
In retrospect I don't know if giving the array formula was a good idea or not: I don't know whether the array formula would be better/faster than COUNTIFS or not. So use whichever you prefer.
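For readers who want to check the logic outside Excel, here is a minimal Python sketch of the same rule. It is not the answerer's method, only an illustration: the names months_before and flag_first_visits are made up for this example, and months_before is written on the assumption that it should mimic how DATE(YEAR(B3),MONTH(B3)-3,DAY(B3)) rolls over when the day does not exist in the target month.

```python
from datetime import date, timedelta

def months_before(d, months=3):
    """Mimic Excel's DATE(YEAR(d), MONTH(d)-months, DAY(d)), including day overflow."""
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    # Start at the 1st of the target month and add the remaining days,
    # so 31 May minus 3 months lands on 3 March, as it does in Excel.
    return date(year, month0 + 1, 1) + timedelta(days=d.day - 1)

def flag_first_visits(visits):
    """Return (user, day, unique) tuples, oldest first (hypothetical helper)."""
    last_seen = {}  # most recent earlier visit per user
    flagged = []
    for user, day in sorted(visits, key=lambda v: v[1]):
        previous = last_seen.get(user)
        # With dates sorted, "every earlier visit is older than the cutoff"
        # is the same as "the latest earlier visit is older than the cutoff".
        unique = previous is None or previous < months_before(day)
        flagged.append((user, day, unique))
        last_seen[user] = day
    return flagged

# User "a" from the sample data in the answer
visits = [("a", date(2017, m, d)) for m, d in
          [(1, 1), (1, 1), (1, 3), (2, 8), (6, 4), (6, 5), (12, 1), (12, 1), (12, 2)]]
kept = [day for _, day, unique in flag_first_visits(visits) if unique]
print(kept)
```

For user a this keeps 1 Jan, 4 June and 1 Dec 2017, matching the TRUE rows in the table.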

Related

Sum only values which fall Monday to Friday

I receive a statement (as a .xls) each month which lists a bunch of billable items with an associated date. I want to create a formula (using either =sum() or =sumifs()) to total the billable items, but only those which fall Monday to Friday (i.e., not weekends). Is that possible?
A B
------+--------------+-------------
1 | 05/12/2016 | $10.00
2 | 06/12/2016 | $10.00
3 | 07/12/2016 | $10.00
4 | 08/12/2016 | $10.00 dates are formatted as
5 | 09/12/2016 | $10.00 dd/mm/yyyy
6 | 10/12/2016 | $10.00
7 | 11/12/2016 | $10.00
8 | 12/12/2016 | $10.00
------+--------------+-------------
| Sum | $80.00
------+--------------+-------------
| Sum |
| (no weekends)| $60.00
------+--------------+-------------
EDIT:
I've just looked closer at the excel doc, and it's actually a datetime field, e.g. 31/10/2016 12:44:00 pm (displayed as 31/10/16 12:44).
I'm also not looking for a formula which works line by line, I'd like something which I can just copy and paste into a single cell at the bottom of the doc each month which examines A:A.
You need to use this formula:
=SUMPRODUCT(B1:B8,--(WEEKDAY(A1:A8,2)<6))
This is a hack which behaves like SUMIF but lets you use a function in your criteria. Otherwise, you would need to create an auxiliary column with WEEKDAY (in C for example) and then use =SUMIF(C1:C8,"<6",B1:B8).
WEEKDAY by default returns 1-7 for SUN-SAT. As this doesn't help, you can change the return type to type 2 with the optional second parameter to make the function return 1-7 for MON-SUN, which lets you do the easy <6 comparison. You can also use type 3, which returns 0-6 for MON-SUN, and then obviously use <5 instead.
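About the -- hack: the double unary minus coerces the TRUE/FALSE results of the comparison to 1/0 so that SUMPRODUCT can multiply them with the amounts.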
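The same filter-then-sum idea can be sketched in Python to sanity-check the expected total. This is only an illustration under assumptions: sum_weekdays is a made-up name, and the 12:44 time is borrowed from the question's edit to show that a datetime value works the same way as a plain date.

```python
from datetime import datetime

def sum_weekdays(items):
    """Total the amounts whose timestamp falls on Monday to Friday."""
    # isoweekday() returns 1-7 for Mon-Sun, like WEEKDAY(date, 2) in Excel.
    return sum(amount for stamp, amount in items if stamp.isoweekday() < 6)

# 05/12/2016 to 12/12/2016, $10.00 each, as in the question's table
items = [(datetime(2016, 12, day, 12, 44), 10.0) for day in range(5, 13)]
total = sum_weekdays(items)
print(total)
```

With the question's eight rows, the two weekend dates (10th and 11th) drop out and the total is 60.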

How to select the last cell in a range in Excel

I've got a spreadsheet that updates throughout the day with data, I need to be able to grab the last cell in a column but for certain date ranges, not just the last cell in the column.
Column C contains the data I need, column A and B update with the date and time, (some cells in column A could be blank too). Column D I can change to make column E display the latest data for the selected date.
Here's what I've got so far to put in column E:
VLOOKUP(D1, $A:$C,3,FALSE)
I've managed to get data from my formula but only the first entry. For example if I enter the date 17/05/2016 it will return '5'. Whereas I need the more recent data '28'.
Example sheet:
A | B | C | D | E
16/05/2016 | 08:00:00 | 3 | date | data
16/05/2016 | 12:00:00 | 7
16/05/2016 | 18:00:00 | 15
16/05/2016 | 22:00:00 | 27
17/05/2016 | 08:00:00 | 5
17/05/2016 | 12:00:00 | 11
17/05/2016 | 18:00:00 | 21
17/05/2016 | 22:00:00 | 28
18/05/2016 | 08:00:00 | 4
18/05/2016 | 12:00:00 | 13
18/05/2016 | 18:00:00 | 19
18/05/2016 | 22:00:00 | 30
I've only just started getting my head around excel formulas so any help would be greatly appreciated!
=INDEX(C2:C13,MATCH(D3,A2:A13,1))
INDEX/MATCH is a very powerful combination. It can perform the same job as VLOOKUP and then a bit more. VLOOKUP is restricted to searching the first column and returning information to the right. With MATCH you can search any column, and you can return information from any column (even to the left, which VLOOKUP can't do).
Start reading with the MATCH function: it searches for the value in D3 within the range A2:A13 and returns an integer representing the row (within that range) where the value was found. The 1 at the end tells MATCH to return the last entry that is less than or equal to D3. This means that column A needs to be sorted in ASCENDING order.
INDEX uses the integer from MATCH and goes down that many rows in the specified range, so if MATCH returned 1, it would read C2.
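As a rough analogy (not part of the original answer), the approximate-match lookup behaves like a binary search for the last entry that does not exceed the target. The sketch below uses Python's bisect module; latest_value is a hypothetical helper name, and the data is the example sheet from the question.

```python
from bisect import bisect_right
from datetime import date

def latest_value(dates, values, target):
    """Return the value on the last row whose date is <= target, or None."""
    # dates must be sorted ascending, just like column A for MATCH(..., 1).
    pos = bisect_right(dates, target)
    if pos == 0:
        return None  # target is earlier than every date (MATCH would give #N/A)
    return values[pos - 1]

dates = [date(2016, 5, d) for d in (16, 17, 18) for _ in range(4)]
values = [3, 7, 15, 27, 5, 11, 21, 28, 4, 13, 19, 30]
latest = latest_value(dates, values, date(2016, 5, 17))
print(latest)
```

For 17/05/2016 this picks the last of the four matching rows, i.e. 28 rather than 5.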

3 condition IF formula Excel

I'm looking to create an excel formula with 3 conditions.
Here's what I'm looking for:
D11 has a number (it is the number of working hours). If the number is 4 or less (i.e. <=4), then I want it to show the value in cell B5,
If the number is between 4 and 8 (i.e. >4 and <=8), then I want it to show a value in cell B6.
If the number is over 8, then I want it to show a value in cell B7.
The cells B5, B6 and B7 contain the relevant remuneration for a 4-hour shift, an 8-hour shift and overtime.
This is what I have made:
IF(D11<4,"$B$5",IF(AND(D11>=4,E9<=8),"$B$6","$B$7")).
The formula always gives a message:
"The formula you typed contains an error: - for information about fixing....;-to get assistance.....; - if you are not trying......."
Please advise!
I have tested this and the following is working for me:
=IF(B2<=4,$E$2,IF(AND(B2>4,B2<=8),$F$2,$G$2))
Here is the example of the data I was working with (replacing FORMULA with the above):
+-----+----------+----------+------------+-------------+-------------+-----------+--------+
| | A | B | C | D | E | F | G |
+-----+-----------------------------------------------------------------------------------+
| 1 | name | Overtime | Due | | | | |
+-----+-----------------------------------------------------------------------------------+
| 2 | bob | 4 | FORMULA | | 10 | 20 | 30 |
+-----+-----------------------------------------------------------------------------------+
Effectively if B2 is 4 then C2 should show 10.
Thank all of you. The problem was with the local settings that expect ; rather than , in Excel formulas.
Still, I have a problem with the formula, because I found out that I should include one more condition: the case when the person is not working (D11=0), because then he/she should receive 0, or the cell should show the text "free day".
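The tiers described in the question, plus the zero-hours case raised in the last comment, can be written out as a small Python sketch to make the boundaries explicit. This is only one possible reading of the requirements: shift_pay and its parameter names are invented for illustration, and returning the text "free day" for zero hours is an assumption based on that comment.

```python
def shift_pay(hours, rate_4h, rate_8h, rate_overtime):
    """Pick the remuneration tier for a number of working hours."""
    if hours == 0:
        return "free day"      # assumption: text rather than 0 for a day off
    if hours <= 4:
        return rate_4h         # up to and including 4 hours
    if hours <= 8:
        return rate_8h         # more than 4, up to and including 8
    return rate_overtime       # more than 8 hours

print(shift_pay(4, 10, 20, 30))
```

Checking the zero case first keeps it from falling into the "4 or less" tier; in a nested IF the same ordering would apply.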

How to automatically delete rows in Excel

Consider the following (partial) Excel worksheet:
A | B | C | D
---+-------------+-------+-------
id | date | var_a | var_b
1 | 2011-03-12 | 200 | 34.22
1 | 2011-03-13 | 203 | 35.13
1 | 2011-03-14 | 205 | 34.14
1 | 2011-03-15 | 207 | 54.88
1 | 2011-03-16 | 208 | 12.01
1 | 2011-03-18 | 203 | 76.10
1 | 2011-03-19 | 210 | 14.86
1 | 2011-03-20 | 200 | 25.45
. | . | . | .
. | . | . | .
2 | 2011-03-12 | 200 | 34.22
2 | 2011-03-13 | 203 | 35.13
2 | 2011-03-14 | 205 | 34.14
2 | 2011-03-15 | 207 | 54.88
2 | 2011-03-16 | 208 | 12.01
2 | 2011-03-18 | 203 | 76.10
2 | 2011-03-19 | 210 | 14.86
2 | 2011-03-20 | 200 | 25.45
. | . | . | .
. | . | . | .
In reality, there are over 5,000 rows. I need to delete all rows whose date falls on a Saturday or Sunday. In the example, March 12 and 13 (2011-03-12/13) and March 19 and 20 are Saturdays and Sundays. I cannot just delete every nth row, since there might be days missing in the list (as is the case here with 2011-03-17).
Is this possible to do with either a formula or VBScript? I have never written a VBScript macro before (I have never had a use for it) so I would appreciate some help.
If you only need to do this once, this is what I would do. This should preserve the order, but if you're really worried about it, read very end of the post:
Add a new column, call it "Is Weekend". In it, put =if(WEEKDAY(B2, 2) > 5, 1, 0). Drag that formula down for the entire table.
Filter the columns. To do that, select the entire table (click on any table cell then hit Ctrl-A), then
On Excel 2007+, go to Data-> click "Filter"
On Excel 2003, go to Data->Filter->Auto Filter.
Sort everything by last column (Is Weekend) in descending order. This should put all weekend rows together without altering the order among the other rows.
Delete all rows with 1 in the "Is Weekend" column. Delete that column.
If you're really worried about preserving order, before you do the above, you can do the following:
Add a new column called "Position". Put 1 in the first row, 2 in the second row, select them and drag it down to the bottom so every row has its own position number in increasing order.
Perform the filtering as above.
After you're done, sort everything in ascending order by "Position" column.
The trick is that you don't need to delete those rows, you need to replace their values for C and D with 0. This is easiest done with IF() and WEEKDAY() within two new columns C' and D' referencing C and D. Feel free to then just delete C and D.
You can do this in one go using an array formula. In cell E2, enter the following formula (on one line), and confirm with Ctrl-Shift-Enter (as opposed to the regular Enter)
=INDEX($A$2:$D$5000, SMALL(IF(WEEKDAY($B$2:$B$5000,2)>5, "",
ROW($B$2:$B$5000)-MIN(ROW($B$2:$B$5000))+1), ROW(A1)),COLUMN(A1))
5000 indicates the number of rows in your spreadsheet. After this, the formula should have curly braces around it to indicate it is an array formula. E2 should have the value 1. Then select cell E2 and drag the lower-right corner of the cell to the right until 4 cells are covered. Then drag the lower-right corner of the 4-cell-selection all the way down. At the bottom you will see rows containing #NUM!, one for each deleted row. You can delete those in the regular way.
Instead of starting off in cell E2, you could start off in cell A2 of a new sheet. In that case, you need to prepend the original sheet name to each reference in the formula, as in OriginalSheet!$A$2
This formula is an adaption from the one given in Excel: Remove blank cells
In case you decide to delete the rows, please make sure to run the VBA code from the last row to the first row. Here is a piece of code just written from memory to show you the idea of running from bottom to the top.
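Sub DeleteWeekendRows()
    Dim i As Long
    ' Loop bottom-up so deleting a row does not shift the rows still to be checked
    For i = Selection.Rows.Count To 1 Step -1
        ' Assumes the selection starts in column A, so its 2nd column holds the date
        If Weekday(Selection.Cells(i, 2).Value, vbMonday) > 5 Then
            Selection.Rows(i).EntireRow.Delete
        End If
    Next i
End Sub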
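The reason for walking from the last row to the first can be seen in a small Python sketch, offered only as an illustration (delete_weekend_rows is a made-up name, and the rows are the id 1 dates from the question). Deleting an element shifts everything after it, so iterating downwards means the indices still to be visited are never affected.

```python
from datetime import date

def delete_weekend_rows(rows):
    """Remove (id, date) rows that fall on Saturday or Sunday, in place."""
    # Walk from the last index to the first so a deletion never shifts
    # a row that has not been examined yet.
    for i in range(len(rows) - 1, -1, -1):
        if rows[i][1].isoweekday() > 5:  # 6 = Saturday, 7 = Sunday
            del rows[i]
    return rows

rows = [(1, date(2011, 3, d)) for d in (12, 13, 14, 15, 16, 18, 19, 20)]
remaining_days = [day.day for _, day in delete_weekend_rows(rows)]
print(remaining_days)
```

Only the 14th, 15th, 16th and 18th remain, in their original order.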

Excel formula to get ranking position

I have a table of people with points. The more points, the higher your position. If you have the same points you are equal first, second etc.
| A | B | C
1 | name | position | points
2 | person1 | 1 | 10
3 | person2 | 2 | 9
4 | person3 | 2 | 9
5 | person4 | 2 | 9
6 | person5 | 5 | 8
7 | person6 | 6 | 7
Using an Excel formula, how can I automatically determine the position? I'm currently using an IF statement that works fine for 5 or 6 matching positions, but I can't add 30+ if statements because there's a limit to the formula.
=IF(C7=C2,B2,IF(C7=C3,B2+5,IF(C7=C4,B3+4,....
So if the points column is the same as the position above then it's the same position value. If the points are less than above then it drops a position so the previous row position +1. But if the row above that is the same then it's the previous position +2 and so on.
You could also use the RANK function
=RANK(C2,$C$2:$C$7,0)
It would return data like your example:
| A | B | C
1 | name | position | points
2 | person1 | 1 | 10
3 | person2 | 2 | 9
4 | person3 | 2 | 9
5 | person4 | 2 | 9
6 | person5 | 5 | 8
7 | person6 | 6 | 7
The 'Points' column needs to be sorted into descending order.
Type this to B3, and then pull it to the rest of the rows:
=IF(C3=C2,B2,B2+COUNTIF($C$1:$C3,C2))
What it does is:
If my points equals the previous points, I have the same position.
Otherwise count the players with the same score as the previous one, and add their number to the previous player's position.
You can use the RANK function in Excel without necessarily sorting the data. Type =RANK(C2,$C$2:$C$7) in the position cell for row 2. Excel will find the relative position of the value in C2 and display the answer. Copy the formula down to row 7 by dragging the small node at the bottom-right corner of the cell cursor.
Try this in your fourth column
=COUNTIF(B:B; ">" & B2) + 1
Replace B2 with B3 for next row and so on.
What this does is count how many records have more points than the current one, and then add 1 to get the current record's position.
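The counting idea in this answer (position = 1 + number of records with strictly more points) can be checked with a short Python sketch. It is an illustration only; competition_ranks is a hypothetical name, and the points are taken from the question's table.

```python
def competition_ranks(points):
    """Rank each score as 1 + the number of strictly higher scores."""
    # Ties share a position and the next position is skipped accordingly,
    # and the input does not need to be sorted.
    return [1 + sum(1 for other in points if other > p) for p in points]

ranks = competition_ranks([10, 9, 9, 9, 8, 7])
print(ranks)
```

This gives 1, 2, 2, 2, 5, 6 for the question's data, the same positions as in the example table.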
If your C-column is sorted, you can check whether the current row is equal to your last row. If not, use the current row number as the ranking-position, otherwise use the value from above (value for b3):
=IF(C3=C2, B2, ROW()-1)
You can use the LARGE function to get the n-th highest value in case your C-column is not sorted:
=LARGE(C2:C7,3)
The way I've done this, which is a bit convoluted, is as follows:
Sort rows by the points in descending order
Create an additional column (D) starting at D2 with numbers 1,2,3,... total number of positions
In the cell for the actual positions (B2) use the formula =IF(C2=C1, B1, D2). This checks whether the points in this row are the same as the points in the previous row. If they are, it gives you the position of the previous row; otherwise it uses the value from column D, and thus handles people with equal positions.
Copy this formula down the entire column
Copy the positions column (B), then paste special >> values to overwrite the formulas with the positions
Re-sort the rows to their original order
That's worked for me! If there's a better way I'd love to know it!

Resources