Counting unique occurrences - excel

I am trying the count the number of unique sensors (column 1) that are present by visit duration (column 2). Here is a small portion of the data:
Sensor ID Implant duration
13113 1
13113 1
13113 1
13144 1
13144 1
13144 2
13144 2
13144 2
13144 2
13144 2
14018 1
12184 2
13052 1
13052 1
12155 2
12155 3
12155 3
13069 2
13069 2
13018 1
13018 1
13019 1
13019 1
13049 1
13054 3
13060 3
13108 2
13108 2
So the count for:
Visit 1 should be 6 (13113, 14018, 13052, 13018, 13019, 13049),
Visit 2 should be 5 (13144, 12184, 12155, 13069, 13108), and
Visit 3 should be 3 (12155, 13054, 13060).
I tried DCOUNTA but it doesn't return the count for the first occurrence, just the total number of entries with an implant duration of 1, 2, or 3. So for example it returns 13 for Visit 1, 11 for Visit 2, and 4 for Visit 3.
I have a lot of data that needs to be preserved and counted so I don't want to apply a filter or remove duplicates.

I can do it in a step-wise way .. not sure if this helps:
I put your data in columns A and B.
In Column C:
=CONCATENATE(A2,"#",B2)
Column D:
=IF(C2=C1,"",A2)
Column E:
=IF(C2=C1,"",B2)
Column F:
=IF(D2="","",COUNTIF(D:D,D2))
Column G:
=IF(E2="","",COUNTIF(E:E,E2))
At that point, you have the data "flagged" as you need it .. just need to extract it.
If I understand the issue, the results should be:
Visit 1: 7 (13113,13144,14018,13052,13018,13019,13049)
Visit 2: 5 (13144,12184,12155,13069,13108)
Visit 3: 3 (12155,13054,13060)

There would appear to be 7 unique Sensors in your sample data (13113, 13144, 14018, 13052, 13018, 13019, 13049) for Visit ID=1, not 6.
=SUMPRODUCT((B2:B29=1)/(COUNTIFS(B2:B29, 1, A2:A29, A2:A29&"")+(B2:B29<>1)))
=SUMPRODUCT((B2:B29=2)/(COUNTIFS(B2:B29, 2, A2:A29, A2:A29&"")+(B2:B29<>2)))
=SUMPRODUCT((B2:B29=3)/(COUNTIFS(B2:B29, 3, A2:A29, A2:A29&"")+(B2:B29<>3)))
It would probably be best to put the Visit ID into a cell and reference the cell in all three places.

You might use a PivotTable with Sensor ID for ROWS and VALUES (Count of) and Implant Duration for COLUMNS then apply =COUNT() on the columns. Shows which sensor (in order), which duration and the instances of the combinations:

Related

How to write function to extract n+1 column in pandas

I have a excel file with 200 columns. The first column is no. of visits, and other columns are the data with number of people for that number of visits
Visits A B C D
2 10 0 30 40
3 5 6 0 1
4 2 3 1 0
I want to write a function so that I have multiple dataframes with Visit column and A; visit column and B, and so on (I want to write a function, as the number of columns will increase in the future and I want to automatize the process). Also, I want to remove the data with 0.
Desired output:
dataframe 1:
visits A
dataframe 2:
Visits B
3 6
4 3
This is my first question. So sorry, if it is not properly framed. Thank you for your help.
Use DataFrame.items:
for i,col in df.set_index('visits').items():
print(col[col.ne(0)].to_frame(i).reset_index())
You can create a dict to save by the name of columns
dfs={i:col[col.ne(0)].to_frame(i).reset_index() for i,col in df.set_index('visits').items()}

FCR calculations in excel

I am trying to work the (First Contact Resolution) FCR in excel however so success.
Does anyone know an automated formula used in excel?
We have been trying to use formulas to check how many times a customer with the same customer ID calls us in 7 days.
Assuming your requirement
ID Date Calls Count in last 7 days from today
1 13-Mar-17 2
2 14-Mar-17 0
5 15-Mar-17 1
1 16-Mar-17 2
3 17-Mar-17 1
5 18-Mar-17 1
7 19-Mar-17 1
9 20-Mar-17 1
1 21-Mar-17 2
13 22-Mar-17 1
2 23-Mar-17 0
17 24-Mar-17 0
5 25-Mar-17 1
21 26-Mar-17 0
Putting ID in Column A and Dates in Column B and records starts from row 2
put this formula in Cell "C2":
=COUNTIFS(A:A,A2,B:B,">="&TODAY()-6,B:B,"<="&TODAY())
and drag it all way down to see the number of calls
or put this formula if you want to calculate last 7 days from the date in sheet:
=COUNTIFS(A:A,A2,B:B,">="&B2-6,B:B,"<="&B2)
Please let me know if you meant something else as I tried my best to assume the senario.

Compare multiple data from rows

I'm looking for a way to compare multiple rows with data to each other, trying to find the best possible match. Each number in every column must be an approximately match the other numbers in the same column.
Example:
Customer #1: 1 5 10 9 7 7 8 2 3
Customer #2: 10 5 9 3 5 7 4 3 2
Customer #3: 1 4 10 9 8 7 6 2 2
Customer #4: 9 5 6 7 2 1 10 5 6
In this example customer #1 and #3 is quite similar, and I need to find a way to highlight or sort the rows so I can easily find the best match.
I've tried using conditional formatting to highlight the numbers that are the similar, but that is quite confusing, because the amount of data is quite big.
Any ideas of how I could solve this?
Thanks!
The following formula entered in (say) L1 and pulled down gives the best match with the current row based on the sum of the absolute differences between corresponding cells:-
=MIN(IF(ROW($C$1:$K$4)<>ROW(),(MMULT(ABS($C1:$K1-$C$1:$K$4),TRANSPOSE(COLUMN($C$1:$K$4))^0))))
It is an array formula and must be entered with CtrlShiftEnter.
You can then sort on column L to bring the customers with lowest similarity scores to the top or use conditional formatting to highlight rows with a certain similarity value.
EDIT
If you wanted to penalise large differences in individual columns more heavily than small differences to try and avoid pairs of customers which are fairly similar except for having some columns very different, you could try something like the square of the differences:-
=MIN(IF(ROW($C$1:$K$4)<>ROW(),(MMULT(($C1:$K1-$C$1:$K$4)^2,TRANSPOSE(COLUMN($C$1:$K$4))^0))))
then the scores for your test data would come out as 7,127,7,127.
I'm assuming you want to compare customers 2-4 with customer 1 and that you are comparing only within each column. In this case, you could implement a 'scoring system' using multiple IFs. For example,:
A B C D E
1 Customer 1 1 1 2
2 Customer 2 1 2 2
3 Customer 3 0 1 0
you could use in E2
=if(B2=$B$1,1,0)+if(C2=$C$1,1,0)+if(D2=$D$1,1,0)
This will return a 'score' of 1 when you have a match and a 'score' of 0 when you don't. It then adds up the scores and your highest value will be your best match. Copying down would then give
A B C D E
1 Customer 1 1 1 2
2 Customer 2 1 2 2 2
3 Customer 3 0 1 0 1
so customer 2 is the best match.

Find summation and count only if they are EQUAL in Excel

In EXCEL sheet I have 1728 rows and 2 columns (L and O). I am doing addition of these 2 columns in column P. Further I want to count the occurrence in this column if addition is EQUAL to 2 or 4 or 6 or 8 BUT condition here is that The COUNT should be such that BOTH the columns L and O are EQUAL and Their addition is either 2 or 4 or 6 or 8.
This means that only the columns in L and O with values "1+1" , "2+2", "3+3", "4+4" should be counted. The addition of "1+3", "4+2" should not be counted.
=COUNTIF(P:P,4)
does not work.
L O P M
===========================
1 1 2 1 (NO OF 2'S)
2 2 4 1 (NO OF 4'S)
3 3 6 1 (NO OF 6'S)
1 3 4* NO TO BE COUNTED
4 4 8 1 (NO OF 8'S)
2 4 6* NOT TO BE COUNTED
4 2 6*
AS SEEN ABOVE RESULT OF COUNTING IS STORED IN M. Let me know the formula
=IF(L29=M29,SUMPRODUCT(--($L$29:$L$35=$M$29:$M$35)*(L29=$L$29:$L$35)),"Not Counted")
My data started in row 29 so you will need to adjust the references. It counts the entire table in 1 shot. So if you added a row to the bottom that had 1 and 1 and 2, the results in column M in your first row would become 2 and the same for the row you just added.
Will this formula help...?
=IF(AND(A1=B1,OR(SUM(A1,B1)=2,SUM(A1,B1)=4,SUM(A1,B1)=6,SUM(A1,B1)=8)),SUM(A1,B1),"NOT TO BE COUNTED")
Just drag the formula till you have data. You will need to adjust the references.
Here is the reference data.

Pandas Link Rows to Rows Based on Multiple Criteria

This is the ultimate Pandas challenge, in my mind, though it may be elementary to some of you out there...
I am trying to link a particular job position with the survey items to which it corresponds. For example, the president of site A would be attributed to results from a survey item for which respondents from site A provide feedback (i.e "To what degree do you agree with the following statement?: "I think the quality of site A is sufficient overall"").
Each site has 5 stations (0 through 4). Each job position is assigned to one or more station(s) at one or more site(s).
For example, the president of a site works at all stations within that site, while a contractor might only work at a couple of stations, maybe at 2 different sites.
Survey data was collected on the quality of each station within each site.
Some survey items pertain to certain stations within one or more sites.
For example, the "Positions" table looks like this:
import pandas as pd
import numpy as np
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
'Position':['Contractor','President'],
'Site(s)':['A,B','A'],
'Item(s)':['1','1,2']
})
pos[['Position','Site(s)','Station(s)','Item(s)']]
Position Site(s) Station(s) Item(s)
0 Contractor A,B ,1,2,, 1
1 President A 0,1,2,3,4 1,2
The survey data table looks like this:
sd = pd.DataFrame({'Site(s)':['A','B','B','C','A','A'],
'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
sd[['Site','Station(s)','Item 1','Item 2']]
Site Station(s) Item 1 Item 2
0 A ,1,2,, 1 1
1 B ,1,2,, 1 0
2 B ,,,, 0 0
3 C ,1,2,, 0 1
4 A 0,1,2,, 1 0
5 A ,,2,, NaN 1
2 side notes:
The item data has been coded to 1 and 0 for unimportant reasons.
The comma separated responses are actually condensed from columns (one column per station and item). I only mention that because if it is better to not condense them, that can be done (or not).
So here's what I need:
This (I think):
Contractor President Site(s) Station(s) Item 1 Item 2
0 1 1 A ,1,2,, 1 1
1 1 0 B ,1,2,, 1 0
2 0 0 B ,,,, 0 0
3 0 0 C ,1,2,, 0 1
4 0 1 A 0,1,2,, 1 0
5 1 1 A ,,2,, NaN 1
Logic:
The contractor works at site A and B and should only be associated with respondents who work either of those sites.
Within those respondents, he should only associated with those who work at stations 1 or 2 but none that also work at
any others (i.e. station 0).
Therefore, the contractor's rows of interest in df2 are indices 0, 1, and 5.
The president's rows of interest are from indices 0, 4, and 5.
...and, ultimately, this:
Position Overall%
0 Contractor 100
1 President 80
Logic:
Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2.
The sum across items is 4 and the count across items is 5 (again, do not count 'NaN'), which gives 80%.
Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%
Thanks in advance!
Update
I know this works (top answer just under question), but how can that be applied to this situation? I tried this (just to try the first part of the logic):
for i in pos['Position']:
sd[i]=[x for x in pos.loc['Site(s)'] if x in sd['Site']]
...but it threw this error:
KeyError: 'the label [Site(s)] is not in the [index]'
...so I'm still wrestling with it.
If I understand correctly, you want to add one column to sd for each job position in pos. (This is the first task.)
So, for each row index i in pos (we iterate over rows in pos), we can create a unique boolean column:
# PSEUDOCODE:
sd[position_name_i] = (sd['Site'] IS_CONTAINED_IN pos.loc[i,'Site(s)']) and (sd['Station(s)'] IS_CONTAINED_IN pos.loc[i,'Station(s)'])
I hope the logic here is clear and consistent with your goal.
The expression X IS_CONTAINED_IN Y may be implemented in many different ways. (I can think of X and Y being sets and then it's X.subset(Y). Or in terms of bitmasks X, Y and bitwise_xor.)
The name of the column, position_name_i may be simply an integer i. (Or something more meaningful as pos.loc[i,'Position'] if this column consists of unique values.)
If this is done, we can do the other task. Now, df[df[position_name_i]] will return only the rows of df for which position_name_i is True.
We iterate over all positions (i.e. rows in pos). For each position:
# number of non-nan entries in 'Item 1' and 'Item 2' relevant for the position:
total = df.loc[df['position_name_i'], ['Item 1', 'Item 2']].count().sum()
# number of 1's among these entries:
partial = df.loc[df['position_name_i'], ['Item 1', 'Item 2']].sum().sum()
The final Overall% for the given position is 100*partial/total.

Resources