Counting data in Excel to calculate probabilities of intersections - excel

I have the following problem and hope someone can give me a hint: I have an Excel sheet with three columns. In the first column I have a country code, in the second column I have a sector code (~50 sector codes per country and more than 30 countries). The third column includes a 0/1-Dummy. I would like to know the probability that the Dummy is one for sector 1 AND sector 2 (intersection). For that I need to know how often a 1 occurs in sector 1 and in sector 2.
The final output should be a conditional probability, and I think calculating it with the well-known formula P(A|B)=P(intersection A and B)/P(B) is the easiest way. However, if there are easier ways to calculate the conditional probability, I would be very grateful for those as well.
In a simplified version the problem looks as follows, where I would like to know the probability that a AND b are 1:
screenshot of simplified table
Thanks in advance!

Just to get things started, I suggest you pivot the data first, then divide the number of rows with a=1 and b=1 by the number of rows (countries) in the table using
=COUNTIFS(G3:G5,1,H3:H5,1)/COUNT(G3:G5)
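Note that the formula above gives the joint probability P(A and B). For the conditional probability P(A|B) the question asks for, the denominator would be the number of rows with b=1 instead of all rows, e.g. something like =COUNTIFS(G3:G5,1,H3:H5,1)/COUNTIF(H3:H5,1). A minimal Python sketch of the same counting logic follows; the country codes and the column names a and b are made up for illustration and do not come from the screenshot.

```python
# Hypothetical pivoted data: one row per country, one 0/1 dummy per sector.
rows = [
    {"country": "AT", "a": 1, "b": 1},
    {"country": "BE", "a": 0, "b": 1},
    {"country": "CH", "a": 1, "b": 0},
    {"country": "DE", "a": 1, "b": 1},
]

n = len(rows)
# Rows where both dummies are 1 (the COUNTIFS part of the formula).
both = sum(1 for r in rows if r["a"] == 1 and r["b"] == 1)
# Rows where b is 1 (a COUNTIF on the b column).
b_count = sum(1 for r in rows if r["b"] == 1)

p_a_and_b = both / n            # joint probability P(A and B)
p_b = b_count / n               # marginal probability P(B)
p_a_given_b = p_a_and_b / p_b   # conditional P(A|B), same as both / b_count

print(p_a_and_b, p_a_given_b)
```

The row count n cancels out, so the conditional probability reduces to one count divided by another.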

Related

Excel - What is the easiest way to calculate incidence plus prevalence over time?

Say I have the dataset below, what is the most efficient formula to fill the cells in column D, where the number of patients alive are calculated?
Example data set in excel
The way it should calculate is:
month 1: 8*100% = 8
Month 2: 8*80%+6*100% = 12.4
Month 3: 8*75%+6*80%+9*100% = 19.8
...
Month 10: etc.
The problem that I have is that with each row, the formula becomes longer. It is feasible to just manually enter the formulas for small datasets, but as datasets become larger, this task becomes unfeasible.
I have been able to use VBA to code the survival of the number of new patients column (C). But then I would have to rerun the VBA code as soon as I change a single value in that column.
I have a feeling it should be possible with some combination of the INDEX function in excel, I just haven't been able to figure it out.
Who can help me out here?
Kind regards,
Sander
If moving the data a bit is allowed at least for the calculation, you could do something like this:
=SUMPRODUCT($F$11:$F$20,B2:B11)
It uses a reversed list of your current list of new patients. That list is created with (formula obtained from this site):
=INDEX($C$11:$C$20,COUNTA($C$11:$C$20)+ROW($C$11:$C$20)-ROW())
Result:
The added space is necessary for the formula to work (so that it gets 0% for patients not present yet).
Or one where you don't have to leave spaces (everything from above is reversed however):
=SUMPRODUCT($C$2:$C$11,G11:G20)
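What the SUMPRODUCT against a reversed list computes is a discrete convolution: each month's total is the sum over all earlier cohorts of cohort size times the survival fraction at that cohort's age. A rough Python sketch of that calculation, using the three months given in the question (the function name and list layout are my own assumptions):

```python
def patients_alive(new_patients, survival):
    """Return the number of patients alive in each month.

    alive[m] = sum over cohorts k <= m of new_patients[k] * survival[m - k],
    where survival[age] is the fraction still alive `age` months after entry.
    """
    alive = []
    for m in range(len(new_patients)):
        total = 0.0
        for k in range(m + 1):
            age = m - k  # months since cohort k entered
            if age < len(survival):
                total += new_patients[k] * survival[age]
        alive.append(total)
    return alive


# Values taken from the worked example in the question.
new_patients = [8, 6, 9]
survival = [1.00, 0.80, 0.75]

print(patients_alive(new_patients, survival))
```

For these inputs the three results come out at roughly 8, 12.4 and 19.8, matching the month 1 to month 3 figures in the question.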

Excel Rolling Mean of 3 Similar Consecutive Observations

I'm trying to find the rolling mean of time series while ignoring values that do not follow the trend.
x
869
1570
946
0
1136
So, what I would want the result to look like is...
x | y
869 | 0
1570 | 0
946 | 1128.33
3 | 0
1136 | 1217.33 ([1136+1570+946]/3)
900 | 2982 ([946+1136+900]/3)
860 | 2896
The tough part here is that if the row I'm on is a trending value, I want to take the 3 previous trending values and find the mean of them, but if it's a non-trending value I want it to just zero out. Sometimes I might have to skip 2 or 3 previous lines to get 3 trending values to take the average as well.
So far I've been using array, RC formulas in a VBA macro form, but I'm not sure I could use RC here or if it has to be something else completely. Any help would be greatly appreciated.
I believe I can help you with your problem. First three notes:
1) It appears to me that you are trying to do DCA on smoothed production profiles, ignoring months without a complete record or with no data. I'm making this assumption since you mentioned this was time series data but didn't give a sample rate.
2) I've added some extra 'data' for the sake of demonstration.
3) In the example you shared, it looks like the last two values in your 'Y' column were summed but not divided by 3.
The solution I came up with has three parts: 1) create a metric to identify 'outliers'; 2) flag 'outliers'; 3) smooth non-flagged data. Let's establish some worksheet infrastructure and say that your production values are in column B and the associated time is in column A as follows:
Part 1) In column 'C', estimate a rough data value based on a trend approximated from two points on either side of your current time step. Subtract the actual value from this approximation. The result will always be positive and quite large for a timestep with little or no production.
=(INTERCEPT(B1:B6,A1:A6)+(A4*SLOPE(B1:B6,A1:A6)))-B4
Part 2) In column 'D', add a condition for when the value computed above is larger than the actual data point. Have it use '0' to identify a point that shouldn't be included in your average. Copy this down to the end of your data as well.
=IF(C4>B4,0,1)
Our sheet now looks like this:
Part 3) Your three-element average can now be computed. In the last cell of column 'E', enter the following array formula. You have to accept this formula by pressing ctrl + shift + enter. Once that is done, fill the column from bottom to top:
=IFERROR(IF(D17=1,AVERAGE(INDEX(B12:B17,MATCH(2,1/(FIND(1,D12:D17)))),INDEX(B12:B16,MATCH(2,1/(FIND(1,D12:D16)))-COUNTIF(D17,"=0")),INDEX(B12:B15,MATCH(2,1/(FIND(1,D12:D15)))-COUNTIF(D16:D17,"=0"))),0),"")
This averages the most recent three values and allows for a skip of up to three time steps of outlier data, per your problem statement. For an idea of how the completed sheet looks:
This was a fun challenge, I have some ideas for a more efficient formula but this should get the job done. Please let me know how this works for you!
Cheers
[EDIT]
An alternative approach, which allows the user to specify the number of previous entries to include, is detailed below. This is a more general (and preferred) alternative that replaces step 3 as described above.
3Alt) In cell G2 enter a number of previous values to average, for this example I am sticking with 3. In cell E4 enter the following array expression (ctrl+shift+enter) and drag to the end of column E:
=IFERROR(IF(D4=1,SUM(INDEX(D:D,LARGE(($D$4:D4=1)*ROW($D$4:D4),$G$2)):D4 * INDEX(B:B,LARGE(($D$4:D4=1)*ROW($D$4:D4),$G$2)):B4)/$G$2,0),"")
This uses the LARGE function to find the 'nth' largest value, where n is the number of preceding values from the current time-step to average. Then it builds a range that extends from the found cell to the current time step. Then it multiplies the flags (0's and 1's) by each month's production value, sums them and divides by n. In this way months flagged as bad are set to 0 and not included in the sum.
This is a much cleaner way to achieve the desired result and has the flexibility to average different periods of time. See example of the final value below.
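For anyone who wants to check the sheet against an independent calculation, here is a small Python sketch of the same idea: average the last n values whose flag is 1, and return 0 for flagged-out rows. The flags are passed in directly rather than derived from the trend residual in column C, and returning 0 when fewer than n good values exist follows the question's desired output rather than the IFERROR blank above; both are assumptions on my part.

```python
def rolling_mean_flagged(values, flags, n=3):
    """Mean of the n most recent values whose flag is 1.

    Rows with flag 0 get 0.0, as do rows that do not yet have
    n flagged values available (assumption, see note above).
    """
    out = []
    kept = []  # flagged-good values seen so far, in order
    for value, flag in zip(values, flags):
        if flag != 1:
            out.append(0.0)
            continue
        kept.append(value)
        if len(kept) < n:
            out.append(0.0)
        else:
            out.append(sum(kept[-n:]) / n)
    return out


# Values from the question; the flags mark the 0 reading as an outlier.
values = [869, 1570, 946, 0, 1136, 900, 860]
flags = [1, 1, 1, 0, 1, 1, 1]

for value, mean in zip(values, rolling_mean_flagged(values, flags)):
    print(value, round(mean, 2))
```

With the sums divided by 3, the last two rows come out as 994 and about 965.33 rather than the 2982 and 2896 shown in the question.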

Excel: Dividing numbers and using the remainders

Little issue I'm having that I'm hoping someone can help me with please?
So I have 3 columns in Excel. Each Column (A/B/C) contains either "high" / "Medium" / "Low" scored issues. However, if you have 3 Low issues, this is grouped together, and this becomes 1 Medium Issue for example.
The difficulty I'm having is writing a formula that will do this for me. Obviously I could just divide the number of Low issues I have by 3, but in the case where I have 7 Low issues, It should result with 2 Mediums and 1 remaining Low. I've tried using the "Mod" function, but that only returns the remainder.
What I need is a formula that will say "If you have 7 Low issues (3 Low = 1 Medium), then you have 2 Medium and 1 Low". The Medium issues would then be added to the Medium column (Col B), and the remaining Low issue is counted in the Low issue column (Col C).
I hope this explanation makes sense, fingers crossed one of you might be able to help me! Thank you in advance
As requested, a screenshot!
If I understand you correctly, I think you should be able to adapt the following formulas to meet your needs.
To get the number of occurrences of the word "Low" in column A:
=COUNTIF(A:A, "=Low")
To get the number of "Mediums" from 3 occurrences of "Low" in column A, round down the above number divided by 3:
=FLOOR(COUNTIF(A:A, "=Low")/3,1)
To get the remaining "Lows" after groupings of 3 into "Mediums", use MOD:
=MOD(COUNTIF(A:A, "=Low"),3)
Putting this into a worksheet:
Values
Formulas
Finally, if you wanted one "Mediums" count, i.e. adding the remaining "Mediums" which aren't grouped into "Highs", you would use a combination of the above formulas for what is left after grouping to "Highs" with what is gained from grouping of "Lows".
Edit:
Now you've included an image, I can show how these formulas are directly applicable...
Values
Formulas
Sounds like you were already nearly there with using =MOD() just needed a little tweak:
For the high column:
=COUNTA(A2:A8)+FLOOR(COUNTA(B2:B8)/3,1)
For the medium column:
=FLOOR(COUNTA(C2:C8)/3,1)+MOD(COUNTA(B2:B8),3)
For the low column:
=MOD(COUNTA(C2:C8),3)
It's exactly like the long addition you do at school, where each column carries over to the one to the left of it (except in base 3 instead of base 10). I'm not sure that the existing answers cover the case where a carry from one column causes a further carry from the next column, so here is another answer.
In the totals row (e.g. for the medium column) in (say) C12
=COUNTA(C2:C10)+INT(D12/3)
Then use mod as before
=MOD(C12,3)
except that in the high column you don't want to use MOD so it's just
=B12
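The carry logic can be checked outside Excel with a short Python sketch. The function name and argument order are illustrative only; divmod returns the quotient (the carry) and the remainder in one step, which corresponds to INT(x/3) and MOD(x,3) in the formulas above.

```python
def carry_issues(high, medium, low, group=3):
    """Roll up issue counts: every `group` Lows become one Medium,
    and every `group` Mediums (including carried ones) become one High."""
    carry_to_medium, low_left = divmod(low, group)
    carry_to_high, medium_left = divmod(medium + carry_to_medium, group)
    return high + carry_to_high, medium_left, low_left


# 7 Lows on their own: 2 Mediums and 1 Low remain.
print(carry_issues(0, 0, 7))

# A cascading carry: 7 Lows add 2 Mediums, giving 4 Mediums,
# which in turn carry 1 into High.
print(carry_issues(1, 2, 7))
```

The High column takes the carry but is never reduced, which is why no MOD is applied to it.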

Excel AVERAGEIFS else statement

I'm trying to perform an AVERAGEIFS formula on some data, but there are 2 possible results and as far as I can tell AVERAGEIFS doesn't deal with that situation.
I basically want to have an ELSE inside it.
At the moment I have 2 ranges of data:
The first column only contains values 'M-T' and 'F' (Mon-Thurs and Fri).
The second column contains a time.
The times on the rows with an 'F' value in column 1 are an hour behind the rest.
I want to take an average of all the times, adjusting for the hour delay on Fridays.
So for example I want it to take an average of all the times, but subtract 1 hour from the values which are in a row with an 'F' value in it.
The way I've been doing it so far is by having 2 separate results for each day, then averaging them again for a final one:
=AVERAGEIFS(G3:G172, B3:B172, "M-T")
=AVERAGEIFS(G3:G172, B3:B172, "F")
I want to combine this into just one result.
The closest I can get is the following:
=AVERAGE(IF(B3:B172="M-T",G3:G172,((G3:G172)-1/24)))
But this doesn't produce the correct result.
Any advice?
Try this
=(SUMPRODUCT(G3:G172)-(COUNTIF(B3:B172,"=F")/24))/COUNTIF(B3:B172,"<>""""")
EDIT
Explaining various steps in the formula as per sample data in the snapshot.
SUMPRODUCT(G3:G17) sums up all the values from G3 to G17. It gives a value of 4.635416667, which after formatting as [h]:mm shows as 111:15.
OP wants the Friday times to be one hour less, so I have entered the Friday times reduced by one hour in column H of the sample data. A similar SUMPRODUCT on H3:H17 gives a value of 4.510416667, which after formatting as [h]:mm shows as 108:15. That is exactly three hours less, for the three occurrences of Friday in the sample data.
=COUNTIF(B3:B17,"=F") counts the occurrences of Fridays in the B3:B17 range, which is 3. Hence 3 hours have to be subtracted. Because times are stored as fractions of a 24-hour day, the COUNTIF() value is divided by 24, which gives 0.125. This matches the difference between 4.635416667 and 4.510416667, i.e. 0.125.
Demonstration column H is for illustrative purposes only. In fact, the Friday-adjusted total (108:15 in the sample data) has to be divided by the number of data points to get the AVERAGE. The number of data points is calculated by =COUNTIF(B3:B17,"<>""""") as a check for empty cells.
Thus 108:15 divided by 15 data points gives 7:13 in the answer.
Revised EDIT based upon suggestions by @Tom Sharpe
@TomSharpe has been kind enough to point out the shortcomings of the method I proposed. COUNTIF(B3:B172,"<>""""") counts too many cells and is not advised; COUNTA(B3:B172) or COUNT(G3:G172) is preferable. The better formula for the AVERAGE, as per his suggestion, gives accurate results and is revised to:
=AVERAGE(IF(B3:B172="M-T",G3:G172,((G3:G172)-1/24)))
This is an array formula. It has to be entered with Ctrl+Shift+Enter (CSE), and the cell has to be formatted as a time.
If your column of M-T and F is named Day and your column of times is named TIME then:
=SUMPRODUCT(((Day="M-T")*TIME + (Day="F")*(TIME-1/24)))/COUNT(TIME)
One simple solution would be to create a separate column that maps the time column and performs the adjustment there. Then average this new column.
Is that an option?
Ended up just combining the two averageifs. No idea why I didn't just do that from the start:
=((AVERAGEIFS(G$3:G171, $B$3:$B171, "F")-1/24)+AVERAGEIFS(G$3:G171, $B$3:$B171, "M-T"))/2
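One thing worth checking with this last approach: averaging the two group averages equals the overall adjusted average only when there are as many 'F' rows as 'M-T' rows. The Python sketch below compares the two calculations on made-up times (four M-T rows at 8:00 and one F row at 10:00); the data and function names are hypothetical.

```python
HOUR = 1 / 24  # Excel stores times as fractions of a day


def adjusted_mean(days, times):
    """Mean of all times, with one hour subtracted from every 'F' row."""
    adjusted = [t - HOUR if d == "F" else t for d, t in zip(days, times)]
    return sum(adjusted) / len(adjusted)


def mean_of_group_means(days, times):
    """Average of the two AVERAGEIFS-style group means (Friday adjusted)."""
    fri = [t - HOUR for d, t in zip(days, times) if d == "F"]
    other = [t for d, t in zip(days, times) if d == "M-T"]
    return (sum(fri) / len(fri) + sum(other) / len(other)) / 2


# Made-up sample: four M-T rows at 8:00 and one F row at 10:00.
days = ["M-T", "M-T", "M-T", "M-T", "F"]
times = [8 / 24, 8 / 24, 8 / 24, 8 / 24, 10 / 24]

print(adjusted_mean(days, times) * 24)        # hours, weighted by row count
print(mean_of_group_means(days, times) * 24)  # hours, groups weighted equally
```

On this sample the first result is about 8.2 hours and the second about 8.5 hours, so the SUMPRODUCT or array-formula versions are the safer choice if each row should count equally.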

AverageIf and Multiple data strings

I'm involved with a youth football tournament on the referee side, with assessing/coaching the referees. I've just taken over doing the data entry for the referees assessment scores which we then use to determine who gets finals etc and am looking to extract more usable information from the data to help us identify trends.
I've got (up to) 200 referees, each receiving from none to two assessment scores each day for 5 days. The scores are entered as both the raw mark and the weighted mark based on match difficulty (along with a host of other data about the match that isn't relevant to this issue).
I can extract the average mark (raw and weighted) across all referees without issues and have done so using the below formula, which is the raw average mark:
=AVERAGE(Working!AK4:AK200,Working!BK4:BK200,Working!CL4:CL200,Working!DL4:DL200,Working!EM4:EM200,Working!FM4:FM200,Working!GN4:GN200,Working!HN4:HN200,Working!IO4:IO200,Working!JO4:JO200)
But I also want to extract the average mark (raw and weighted) across two subsets - Academy and non academy referees, to help plot trends and determine where resources need to be utilised.
I've attempted to use an AVERAGEIF formula, but am getting a #VALUE! return. This is the formula that I've attempted to use to return the average raw mark for those referees in the academy:
=AVERAGEIF(Working!G4:G200,Working!G4:G200="Yes",(Working!AK4:AK200,Working!BK4:BK200,Working!CL4:CL200,Working!DL4:DL200,Working!EM4:EM200,Working!FM4:FM200,Working!GN4:GN200,Working!HN4:HN200,Working!IO4:IO200,Working!JO4:JO200))
If I use the same formula as above, but without the brackets around the [average_range], I get a "you've used too many arguments" error, and it highlights BK200.
From what I've been able to find so far online, it seems that the formula I'm trying to use would only work if ALL the cells in (Working!G4:G200) returned "Yes". However if there are only 50 academy referees as indicated by "Yes" in G column, then I want those specific scores to be averaged, and the inverse for the non-academy referees.
I thought about having another sheet, which would simply be populated from Column G (a simple =G4, filled down to =G200, next to all of the scores), with the raw marks consolidated into a block of columns under Assessment 1, 2, 3, 4... and then the same for all of the weighted marks, each populated from the equivalent cell on the working sheet. But there's a lot of filtering and re-sorting that goes on in the working sheet, and I'm not 100% certain that wouldn't cause issues.
Any feedback on how to work through this problem, so that I can display the overall average mark for academy and non-academy referees in both raw and weighted form would be much appreciated, and I apologize if this post is rather convoluted.
I don't think there is a neat solution if the scores are in several columns which are not consecutive.
My suggestion is:-
(1) Work out the sum for each column separately and total them up
(2) Work out the count for each column separately and total them up
(3) Divide Sum by Count to get Average.
In my small example below with 3 referees and 3 columns:-
(1) In K2:-
=SUMIF(H2:H4,"Yes",B2:B4)+SUMIF(H2:H4,"Yes",D2:D4)+SUMIF(H2:H4,"Yes",F2:F4)
(2) In K3:-
=COUNTIFS(B2:B4,">=0",H2:H4,"Yes")+COUNTIFS(D2:D4,">=0",H2:H4,"Yes")+COUNTIFS(F2:F4,">=0",H2:H4,"Yes")
(3) In K4:
=K2/K3
This would include any zero scores (if this is possible) but exclude any blanks.
You can then scale it up to your data.
Beyond this, you would have to change the data structure either
(1) Add a row to label the columns that you want to average e.g.
Score 1 Score 2 Score 3
3 0 3
so you could pick up only the columns labelled 3 say
Here's how it would be in my small example:-
In K3:-
=SUM((B$2:F$2=3)*($H3:$H5="Yes")*B3:F5)
Which is an array formula and must be entered with Ctrl-Shift-Enter
In K4:-
=SUM((B$2:F$2=3)*($H3:$H5="Yes")*(B3:F5<>""))
another array formula
In K5:-
=K3/K4
This is how the columns you want are labelled with a 3 in row 2, so it ignores the other columns:-
(2) Consolidate them into another sheet as you suggest.
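The sum-divided-by-count approach from this answer can be sketched in Python to verify the numbers a sheet produces. The row layout, the key names (academy, s1, s2, s3) and the scores are all invented for illustration; blank cells are represented as None and skipped, while zero scores are kept, as in the COUNTIFS version.

```python
def conditional_average(rows, score_keys, flag_key="academy", wanted="Yes"):
    """Average every non-blank score, across several score columns,
    for rows whose flag column equals `wanted`."""
    total = 0.0
    count = 0
    for row in rows:
        if row.get(flag_key) != wanted:
            continue
        for key in score_keys:
            score = row.get(key)
            if score is None:  # blank cell: not counted
                continue
            total += score
            count += 1
    return total / count if count else None


# Hypothetical data: three referees, three score columns, some blanks.
rows = [
    {"academy": "Yes", "s1": 7.0, "s2": None, "s3": 8.0},
    {"academy": "No", "s1": 6.0, "s2": 6.5, "s3": None},
    {"academy": "Yes", "s1": 0.0, "s2": 9.0, "s3": None},
]
keys = ["s1", "s2", "s3"]

print(conditional_average(rows, keys, wanted="Yes"))
print(conditional_average(rows, keys, wanted="No"))
```

Summing and counting separately before dividing means each score counts equally, regardless of which column it sits in.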

Resources