I need to find a threshold that selects the 20-25% least frequent values of my data. The data is a signal's energy, and I'm trying to apply the threshold to it.
I have tried the (mean + 3·sigma) method and the IQR upper fence (Q3 + 1.5·(Q3 − Q1)). Neither is compatible with my data and my use case, as they select only a very small portion of the data.
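For reference, this is roughly how I compute the two thresholds (NumPy; the gamma-distributed sample is just a stand-in for my energy values):

import numpy as np

# Skewed stand-in for the signal-energy values.
energy = np.random.default_rng(0).gamma(shape=2.0, scale=1.0, size=10_000)

# mean + 3*sigma threshold
sigma_thr = energy.mean() + 3 * energy.std()

# IQR upper fence: Q3 + 1.5 * (Q3 - Q1)
q1, q3 = np.percentile(energy, [25, 75])
fence_thr = q3 + 1.5 * (q3 - q1)

# Fraction of samples above each threshold -- far below the 20-25% I need.
print((energy > sigma_thr).mean(), (energy > fence_thr).mean())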
What other approaches are there that I can try?
I have the loan dataset below:

Sector        Total Units   Bad Units   Bad Rate
Retail Trade  16            5           31%
Construction  500           1100        20%
Healthcare    165           55          33%
Mining        3             2           67%
Utilities     56            19          34%
Other         300           44          15%
How can I create a ranking function to sort this data based on the bad rate while also accounting for the number of units?
e.g. this is the result when I sort in descending order based on bad_rate:
Sector        Total Units   Bad Units   Bad Rate
Mining        3             2           67%
Utilities     56            19          34%
Healthcare    165           55          33%
Retail Trade  16            5           31%
Construction  500           1100        20%
Other         300           44          15%
Here, Mining shows up first, but I don't really care about this sector as it only has a total of 3 units. I would like Construction, Other and Healthcare to show up at the top, as they have more total units as well as more bad units.
STEP 1) is easy...
Use SORT(array, sort_index, sort_order).
Just put it in the top-left cell of where you want your sorted data:
=SORT(B3:E8,4,-1)
STEP 2)
Here's the tricky part... you need to decide how to weight the bad rate.
Here, I found that multiplying the Rate% by each sector's rank by Total Units gives a sensible ordering (see the sketch below):
I think this approach gives pretty good results... you just need to play with the formula!
Please let me know what formula you eventually use!
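If it helps to prototype the weighting outside Excel first, here is a rough pandas sketch of the same idea (the column names are my own):

import pandas as pd

df = pd.DataFrame({
    "sector": ["Retail Trade", "Construction", "Healthcare", "Mining", "Utilities", "Other"],
    "total_units": [16, 500, 165, 3, 56, 300],
    "bad_rate": [0.31, 0.20, 0.33, 0.67, 0.34, 0.15],
})

# Rank sectors by total units (1 = fewest) and weight the rate by that rank,
# so a high rate in a tiny sector gets pushed down the list.
df["units_rank"] = df["total_units"].rank(method="min")
df["score"] = df["bad_rate"] * df["units_rank"]
print(df.sort_values("score", ascending=False))

With this data, Healthcare and Construction come out on top and Mining drops toward the bottom.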
You would need to define sorting criteria, since you don't have a priority based on a single column, but on a combination instead. I would suggest defining a function that weights both columns: Total Units and Bad Rate. Using a weight function is a good idea, but first we need to normalize both columns, for example putting the data into a 0-100 range, so each column is weighted on a similar scale. Once you have the data normalized, you can use a criterion like this:
w_1 * x + w_2 * y
This is the main idea. Now to put this logic into Excel: we create an additional temporary variable with the previous calculation and name it crit. We define a user LAMBDA function SORT_BY for calculating crit as follows:
LAMBDA(a,b, wu*a + wbr*b)
and we use MAP to calculate it with the normalized data. For convenience we define another user LAMBDA function, NORM, to normalize the data:
LAMBDA(x, 100*(x-MIN(x))/(MAX(x) - MIN(x)))
Note: The above formula ensures a 0-100 range, but because we are going to use weights it may be better to use a 1-100 range, so the weight takes effect for the minimum value too (this variant maps the minimum to 1 instead of 0). In that case it can be defined as follows:
LAMBDA(x, ( 100*(x-MIN(x)) + (MAX(x)-x) )/(MAX(x)-MIN(x)))
Here is the formula normalizing for 0-100 range:
=LET(wu, 0.6, wbr, 0.8, u, B2:B7, br, D2:D7, SORT_BY, LAMBDA(a,b, wu*a + wbr*b),
NORM, LAMBDA(x, 100*(x-MIN(x))/(MAX(x) - MIN(x))),
crit, MAP(NORM(u), NORM(br), LAMBDA(a,b, SORT_BY(a,b))),
DROP(SORT(HSTACK(A2:D7, crit),5,-1),,-1))
You can customize how each column is weighted (via wu for Total Units and wbr for Bad Rate). Finally, we present the result with the sorting criterion (crit) removed via the DROP function. If you want to show it, remove that step.
If you put the formula in F2 this would be the output:
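For what it's worth, the same normalize-and-weight logic sketched in pandas (weights as in the formula above):

import pandas as pd

def norm(s):
    # Min-max scaling to 0-100, mirroring the NORM lambda.
    return 100 * (s - s.min()) / (s.max() - s.min())

df = pd.DataFrame({
    "sector": ["Retail Trade", "Construction", "Healthcare", "Mining", "Utilities", "Other"],
    "total_units": [16, 500, 165, 3, 56, 300],
    "bad_rate": [0.31, 0.20, 0.33, 0.67, 0.34, 0.15],
})

wu, wbr = 0.6, 0.8  # weights for Total Units and Bad Rate
df["crit"] = wu * norm(df["total_units"]) + wbr * norm(df["bad_rate"])
print(df.sort_values("crit", ascending=False))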
I'm fairly new to Python. I have a column with 10000 unique values and I would like to retain as many of those values as possible. My other fields are ratios between 0 and 1, but I don't know what filter thresholds would let me reduce the number of records while still retaining most of my unique values.
x y z
a05 0.9 0.5
a06 0.5 0.4
a05 0.6 0.1
I have multiple duplicate records for each value of x. I would like my output to be a threshold for y and z (like y = 0.6 and z = 0.1). I'm trying to reduce the number of duplicates, not necessarily keep just one row for each unique value of x. It's more important that the filters retain as many unique values of x as possible. Is there a good solution to this problem?
I think you can use pandas drop_duplicates to remove duplicate rows based on a subset of columns.
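For example, a minimal sketch with the sample data above, keeping the first row for each unique x:

import pandas as pd

df = pd.DataFrame({
    "x": ["a05", "a06", "a05"],
    "y": [0.9, 0.5, 0.6],
    "z": [0.5, 0.4, 0.1],
})

# Drop rows that duplicate an earlier value of x, keeping its first occurrence.
one_per_x = df.drop_duplicates(subset=["x"], keep="first")
print(one_per_x)

If you only want to drop exact duplicates across all columns, call df.drop_duplicates() with no subset.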
I'm attempting to use pivot tables to interpolate a value from a large, multi-variable table in Excel. Is there any way to interpolate on a range within a Calculated Field? (Or any other good way to achieve this?)
What I'm trying to do -
I have the following table, and I'm trying to interpolate the Delta for each combination of Alpha and Beta at a Gamma of 5. With this simplified data, the answer would be 15 for each Alpha/Beta pair (see the worked example after the table). In my real data, the Gammas and Deltas are not always nice, evenly spaced arrays, though they are sorted properly.
Alpha Beta Gamma Delta
A X 0 10
A X 10 20
A Y 0 10
A Y 10 20
B X 0 10
B X 10 20
B Y 0 10
B Y 10 20
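For example, for Alpha = A and Beta = X, linear interpolation at Gamma = 5 gives Delta = 10 + (5 - 0) / (10 - 0) * (20 - 10) = 15.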
What I've tried
Formula -> Calculated Field -> Formula: = INTERP(5,Gamma,Delta)
*Note that INTERP is a custom linear interpolator: yv = INTERP(xv, x, y).
Excel won't let me do that, so I was wondering if there is a better way? I could obviously just manually pull the interpolation arrays, but my data has ~50 Alphas and ~50 Betas, so that would take a very long time. I've tried automatically pulling the arrays with a combination of INDEX and MATCH (using array formulas), and it works. The only problem is that it becomes so slow that every recalculation ties up the program for ~30 seconds or so.
Any ideas would be appreciated. I'm thinking of using pivot tables, but a regular formula solution or VBA solution would work too. Thanks!
Having this:
Class Min Max
Alfa 0 16.5
Beta 16.5 18.5
Charlie 18.5 25
Delta 25 30
And this:
Value X
35.52600894
26.27816853
29.53159178
29.84528548
26.77130341
25.07792506
19.2850645
42.77156244
29.11485934
29.5010482
19.30982162
I want a cell to have something like an IF statement (my real table has a few more values in it; it has 8 classes). An IF statement this long would probably not work (nested IF limit of 7) and is an ugly way of doing it. I was thinking of using HLOOKUP, but I'm not sure if that's the best bet.
I can also swap the columns within the table, so I could have "Min | Max | Class".
X values are in a column.
Basically: =IF(AND(X>=0, X<16.5), Alpha, IF(AND(X>=16.5, X<18.5), Beta, IF(...
I think you mean VLOOKUP, and that would be a much better way to go.
Make a Ranges sheet like this
Min Class
0 Alfa
16.5 Beta
18.5 Charlie
25 Delta
30 Unidentified
In your detail sheet use the formula =VLOOKUP(A2,Ranges!A:B,2,TRUE) [the TRUE is important: it requests an approximate match against the sorted Min column]
And you get
Value X Class
35.52600894 Unidentified
26.27816853 Delta
29.53159178 Delta
29.84528548 Delta
26.77130341 Delta
25.07792506 Delta
19.2850645 Charlie
42.77156244 Unidentified
29.11485934 Delta
29.5010482 Delta
19.30982162 Charlie
With your Max range named MaxVal and your Class range named Class, please try:
=IF(A2>30,"",INDEX(Class,MATCH(A2,MaxVal)))
(adjust references to suit).
=MATCH() here is using the match_type parameter of 1: “The MATCH function will find the largest value that is less than or equal to value. You should be sure to sort your array in ascending order. If the match_type parameter is omitted, the MATCH function assumes a match_type of 1.”
Any X value greater than 30 returns a blank ("") but text may be inserted to suit (eg "Unidentified" instead of "").
The formula could be simplified by removing the error trap if a row were inserted immediately under the labels, with Alpha under Class and 0 under Max, and similarly by removing the condition.
It is not necessary to specify both bounds of each range.
INDEX/MATCH was chosen rather than, say, VLOOKUP for the reasons given here.
PS In the Greek alphabet, α is usually transliterated Alpha rather than Alfa.
Edit re clarification
The easiest fix, so that 25 is classed as Delta rather than Charlie, may be to deduct a small amount from each Max value, eg change 25 to =25-1E-9.
Suppose the NDCG score for my retrieval system is 0.8. How do I interpret this score? How do I tell the reader that this score is significant?
To understand this, let's walk through an example of Normalized Discounted Cumulative Gain (nDCG).
For nDCG we need the DCG and the Ideal DCG (IDCG).
Let's understand what Cumulative Gain (CG) is first.
Example: Suppose we have [Doc_1, Doc_2, Doc_3, Doc_4, Doc_5]
Doc_1 is 100% relevant
Doc_2 is 70% relevant
Doc_3 is 95% relevant
Doc_4 is 20% relevant
Doc_5 is 100% relevant
So our Cumulative Gain (CG) is
CG = 100 + 70 + 95 + 20 + 100    (the position of each doc doesn't matter)
= 385
and
Discounted Cumulative Gain (DCG) discounts each document's relevance by its rank:
DCG = SUM( relevanceAt(i) / log2(i + 1) )    for i = 1..5
Doc_1 is 100 / log2(2) = 100.00
Doc_2 is  70 / log2(3) =  44.17
Doc_3 is  95 / log2(4) =  47.50
Doc_4 is  20 / log2(5) =   8.61
Doc_5 is 100 / log2(6) =  38.69
DCG = 100 + 44.17 + 47.5 + 8.61 + 38.69
DCG = 238.97
and the Ideal DCG uses the best possible ordering:
IDCG order = Doc_1, Doc_5, Doc_3, Doc_2, Doc_4
Doc_1 is 100 / log2(2) = 100.00
Doc_5 is 100 / log2(3) =  63.09
Doc_3 is  95 / log2(4) =  47.50
Doc_2 is  70 / log2(5) =  30.15
Doc_4 is  20 / log2(6) =   7.74
IDCG = 100 + 63.09 + 47.5 + 30.15 + 7.74
IDCG = 248.48
nDCG(5) = DCG / IDCG
        = 238.97 / 248.48
        = 0.96
Conclusion:
In the given example the nDCG was 0.96. This is not a prediction accuracy; it measures how effective the ranking is: the gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
Wiki reference
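For anyone who wants to reproduce the numbers, a minimal Python sketch of the computation above (relevance scores from the example):

import math

def dcg(relevances):
    # Each gain is discounted by log2(rank + 1), with 1-based ranks.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

rels = [100, 70, 95, 20, 100]        # relevance in the ranked order returned
ideal = sorted(rels, reverse=True)   # the ideal ordering

print(round(dcg(rels) / dcg(ideal), 2))  # nDCG ~= 0.96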
NDCG is a ranking metric. In the information retrieval field you predict a sorted list of documents and then compare it with a list of relevant documents. Imagine you predicted a sorted list of 1000 documents and there are 100 relevant documents: NDCG equals 1 when those 100 relevant docs occupy the 100 highest ranks in the list.
So an NDCG of 0.8 means your ranking achieves 80% of the best possible ranking.
This is an intuitive explanation; the real math includes some logarithms, but it is not so far from this.
If you have a relatively big sample, you can use bootstrap resampling to compute confidence intervals, which will show you whether your NDCG score is significantly better than zero.
Additionally, you can use paired bootstrap resampling to test whether your NDCG score differs significantly from another system's NDCG score.
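As a sketch of the per-query bootstrap (the per-query scores here are made-up placeholders for your system's per-query NDCG values):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-query NDCG scores from your system.
ndcg_per_query = np.array([0.81, 0.76, 0.90, 0.65, 0.88, 0.72, 0.84])

# Resample queries with replacement many times and recompute the mean.
boot_means = [
    rng.choice(ndcg_per_query, size=ndcg_per_query.size, replace=True).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for mean NDCG: [{lo:.3f}, {hi:.3f}]")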