Suppose the NDCG score for my retrieval system is 0.8. How do I interpret this score? How do I tell the reader that this score is significant?
To understand this, let's work through an example of Normalized Discounted Cumulative Gain (nDCG).
For nDCG we need the DCG and the Ideal DCG (IDCG).
Let's first understand Cumulative Gain (CG).
Example: Suppose we have [Doc_1, Doc_2, Doc_3, Doc_4, Doc_5]
Doc_1 is 100% relevant
Doc_2 is 70% relevant
Doc_3 is 95% relevant
Doc_4 is 20% relevant
Doc_5 is 100% relevant
So our Cumulative Gain (CG) is
CG = 100 + 70 + 95 + 20 + 100   (the index of the doc doesn't matter)
   = 385
and
Discounted Cumulative Gain (DCG) is
DCG = SUM( relevanceAt(index) / log2(index + 1) )   where index = 1 -> 5
Doc_1 is 100 / log2(2) = 100.00
Doc_2 is  70 / log2(3) =  44.17
Doc_3 is  95 / log2(4) =  47.50
Doc_4 is  20 / log2(5) =   8.61
Doc_5 is 100 / log2(6) =  38.69
DCG = 100 + 44.17 + 47.5 + 8.61 + 38.69
DCG = 238.97
and the Ideal DCG (the documents re-sorted by relevance) is
IDCG order = Doc_1, Doc_5, Doc_3, Doc_2, Doc_4
Doc_1 is 100 / log2(2) = 100.00
Doc_5 is 100 / log2(3) =  63.09
Doc_3 is  95 / log2(4) =  47.50
Doc_2 is  70 / log2(5) =  30.15
Doc_4 is  20 / log2(6) =   7.74
IDCG = 100 + 63.09 + 47.5 + 30.15 + 7.74
IDCG = 248.48
nDCG(5) = DCG / IDCG
        = 238.97 / 248.48
        = 0.96
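For reference, here is a minimal Python sketch of the same calculation (numpy is assumed; the relevance scores are the ones from the example above):

import numpy as np

def ndcg(relevances):
    # nDCG for a ranked list of graded relevances
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(rel) + 2))   # log2(rank + 1), ranks start at 1
    dcg = np.sum(rel / discounts)
    idcg = np.sum(np.sort(rel)[::-1] / discounts)     # ideal ordering: sorted by relevance
    return dcg / idcg

print(round(ndcg([100, 70, 95, 20, 100]), 2))   # ~0.96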
Conclusion:
In the given example nDCG was 0.96. That 0.96 is not a prediction accuracy; it measures how effective the ranking is: the gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
Wiki reference
The NDCG is a ranking metric. In the information retrieval field you predict a sorted list of documents and then compare it with the list of relevant documents. Imagine that you predicted a sorted list of 1000 documents and there are 100 relevant documents; NDCG equals 1 when the 100 relevant docs have the 100 highest ranks in the list.
So an NDCG of 0.8 means your ranking achieves 80% of the best possible ranking.
This is an intuitive explanation; the real math includes some logarithms, but it is not so far from this.
If you have a relatively big sample, you can use bootstrap resampling to compute confidence intervals, which will show you whether your NDCG score is significantly better than zero.
Additionally, you can use paired bootstrap resampling to test whether your NDCG score differs significantly from another system's NDCG score.
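For example, a percentile bootstrap over per-query nDCG scores might look like this (a sketch only; the per-query scores below are hypothetical):

import numpy as np

def bootstrap_ci(per_query_ndcg, n_boot=10000, alpha=0.05, seed=0):
    # percentile bootstrap confidence interval for the mean per-query nDCG
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_query_ndcg, dtype=float)
    means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

low, high = bootstrap_ci([0.82, 0.75, 0.91, 0.68, 0.88, 0.79])   # hypothetical scores
print(low, high)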
I have the loan dataset below:

Sector         Total Units   Bad Units   Bad Rate
Retail Trade            16           5        31%
Construction           500        1100        20%
Healthcare             165          55        33%
Mining                   3           2        67%
Utilities               56          19        34%
Other                  300          44        15%
How can I create a ranking function to sort this data based on the bad rate while also accounting for the number of units?
e.g. This is the result when I sort in descending order based on bad_rate:
Sector         Total Units   Bad Units   Bad Rate
Mining                   3           2        67%
Utilities               56          19        34%
Healthcare             165          55        33%
Retail Trade            16           5        31%
Construction           500        1100        20%
Other                  300          44        15%
Here, Mining shows up first, but I don't really care about this sector as it only has a total of 3 units. I would like Construction, Other and Healthcare to show up at the top, as they have more total units as well as bad units.
STEP 1) is easy...
Use SORT("Range","ByColNumber","Order")
Just put it in the top left cell of where you want your sorted data:
=SORT(B3:E8,4,-1)
STEP 2)
Here's the tricky part... you need to decide how to weight the bad rate.
Here, I found multiplying the Rate% by the Total Units rank works reasonably well:
I think this approach gives pretty good results... you just need to play with the formula!
Please let me know what formula you eventually use!
You would need to define sorting criteria, since you don't have a priority based on a single column, but on a combination instead. I would suggest defining a function that weights both columns: Total Units and Bad Rate. Using a weight function is a good idea, but first we need to normalize both columns, for example putting the data in a 0-100 range, so that each column is weighted on a similar scale. Once you have the data normalized, you can use a criterion like this:
w_1 * x + w_2 * y
This is the main idea. Now let's put this logic into Excel. We create an additional temporary variable with the previous calculation and name it crit. We define a user LAMBDA function SORT_BY for calculating crit as follows:
LAMBDA(a,b, wu*a + wbr*b)
and we use MAP to calculate it with the normalized data. For convenience we define another user LAMBDA function to normalize the data: NORM as follows:
LAMBDA(x, 100*(x-MIN(x))/(MAX(x) - MIN(x)))
Note: The above formula ensures a 0-100 range, but because we are going to use weights it may be better to use a 1-100 range, so the weight takes effect for the minimum value too. In that case it can be defined as follows:
LAMBDA(x, ( 100*(x-MIN(x)) + (MAX(x)-x) )/(MAX(x)-MIN(x)))
Here is the formula normalizing for 0-100 range:
=LET(wu, 0.6, wbr, 0.8, u, B2:B7, br, D2:D7, SORT_BY, LAMBDA(a,b, wu*a + wbr*b),
NORM, LAMBDA(x, 100*(x-MIN(x))/(MAX(x) - MIN(x))),
crit, MAP(NORM(u), NORM(br), LAMBDA(a,b, SORT_BY(a,b))),
DROP(SORT(HSTACK(A2:D7, crit),5,-1),,-1))
You can customize how to weight each column (via wu for the Total Units column and wbr for the Bad Rate column). Finally, we present the result with the sorting criterion (crit) removed via the DROP function. If you want to show it, then remove this step.
If you put the formula in F2 this would be the output:
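If you'd rather prototype the same normalize-and-weight idea outside Excel, here is a rough pandas sketch of the criterion above (the weights are the same illustrative values as in the formula; column names are assumptions):

import pandas as pd

df = pd.DataFrame({
    "Sector": ["Retail Trade", "Construction", "Healthcare", "Mining", "Utilities", "Other"],
    "Total Units": [16, 500, 165, 3, 56, 300],
    "Bad Rate": [0.31, 0.20, 0.33, 0.67, 0.34, 0.15],
})

def norm(s):
    # scale a column to 0-100, like the NORM lambda above
    return 100 * (s - s.min()) / (s.max() - s.min())

wu, wbr = 0.6, 0.8   # weights for Total Units and Bad Rate
df["crit"] = wu * norm(df["Total Units"]) + wbr * norm(df["Bad Rate"])
print(df.sort_values("crit", ascending=False).drop(columns="crit"))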
Say I have a df looking like this:
   price  quantity
0    100        20
1    102        31
2    105        25
3     99        40
4    104        10
5    103        20
6    101        55
There are no time intervals here. I need to calculate a Volume Weighted Average Price for every 50 items in quantity. Every row (index) in the output would represent 50 units (as opposed to, say, 5-min intervals), and the output column would be the volume weighted price.
Any neat way to do this using pandas, or numpy for that matter? I tried using a loop that splits every row into single-unit prices and then groups them like this:
import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
But it takes forever and I run out of memory. The df is a few million rows.
EDIT:
The output I want to see based on the above is:
vwap
0 101.20
1 102.12
2 103.36
3 101.00
Each 50 items gets a new average price.
I struck out on my first at-bat facing this problem. Here's my next plate appearance. Hopefully I can put the ball in play and score a run.
First, let's address some of the comments related to the expected outcome of this effort. The OP posted what he thought the results should be using the small sample data he provided. However, @user7138814 and I both came up with the same outcome, which differed from the OP's. Let me explain how I believe the weighted average of exactly 50 units should be calculated using the OP's example. I'll use this worksheet as an illustration.
The first 2 columns (A and B) are the original values given by the OP. Given those values, the goal is to calculate a weighted average for each block of exactly 50 units. Unfortunately, the quantities are not evenly divisible by 50. Columns C and D represent how to create even blocks of 50 units by subdividing the original quantities as needed. The yellow shaded areas show how the original quantity was subdivided, and each of the green bounded cells sums to exactly 50 units. Once 50 units are determined, the weighted average can be calculated in column E. As you can see, the values in E match what @user7138814 posted in his comment, so I think we agree on the methodology.
After much trial and error, the final solution is a function that operates on the numpy arrays of the underlying price and quantity series. The function is further optimized with the Numba decorator to jit-compile the Python code into machine-level code. On my laptop, it processed 3 million row arrays in a second.
Here's the function.
import numba
import numpy as np

@numba.jit
def vwap50_jit(price_col, quantity_col):
    n_rows = len(price_col)
    assert len(price_col) == len(quantity_col)
    qty_cumdif = 50   # cum difference of quantity to track when 50 units are reached
    pq = 0.0          # cumsum of price * quantity
    vwap50 = []       # list of weighted averages

    for i in range(n_rows):
        price, qty = price_col[i], quantity_col[i]

        # if current qty will cause more than 50 units,
        # divide the units
        if qty_cumdif < qty:
            pq += qty_cumdif * price
            # at this point, 50 units accumulated. calculate average.
            vwap50.append(pq / 50)
            qty -= qty_cumdif

            # continue dividing
            while qty >= 50:
                qty -= 50
                vwap50.append(price)

            # remaining qty and pq become starting
            # values for next group of 50
            qty_cumdif = 50 - qty
            pq = qty * price

        # process price, qty pair as-is
        else:
            qty_cumdif -= qty
            pq += qty * price

    return np.array(vwap50)
Results of processing the OP's sample data.
Out[6]:
   price  quantity
0    100        20
1    102        31
2    105        25
3     99        40
4    104        10
5    103        20
6    101        55
vwap50_jit(df.price.values, df.quantity.values)
Out[7]: array([101.2 , 102.06, 101.76, 101. ])
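As a brute-force sanity check on the small sample (far too slow and memory-hungry for millions of rows), the same blocks can be reproduced by expanding each row into individual units:

import numpy as np

price = np.array([100, 102, 105, 99, 104, 103, 101])
qty = np.array([20, 31, 25, 40, 10, 20, 55])

# expand to one row per unit, then average each complete block of 50 units
units = np.repeat(price, qty)
n_blocks = len(units) // 50
print(units[:n_blocks * 50].reshape(n_blocks, 50).mean(axis=1))
# [101.2  102.06 101.76 101.  ]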
Notice that I use the .values attribute to pass the numpy arrays of the pandas series. That's one of the requirements of using numba. Numba is numpy-aware and doesn't work on pandas objects.
It performs pretty well on 3 million row arrays, creating an output array of 2.25 million weighted averages.
df = pd.DataFrame({'price': np.random.randint(95, 150, 3000000),
                   'quantity': np.random.randint(1, 75, 3000000)})
%timeit vwap50_jit(df.price.values, df.quantity.values)
154 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vwap = vwap50_jit(df.price.values, df.quantity.values)
vwap.shape
Out[11]: (2250037,)
I have a calculation which I am unable to crack.
Let's say my Cost Price is 300. I want to sell the item at no profit or no loss. My total commission/expenses will be 30%, so I figured I need to sell the item at 390.
But if I do 390 - 30% I get 273.
How can I price the item so that if I subtract 30% from it, my revenue will still be 300?
The formula you want is
=300/0.7
or
=300/(1-30%)
Basically it is 300 = x*(1-0.30), where (1-0.30) is the fraction that you keep after the 30% commission. Solving for x gives the formula above.
You want Sell Price - 30% Sell Price = Cost Price.
Combining the left two, you have 70% Sell Price = Cost Price.
Divide both sides by 70% and you get Sell Price = (1/0.7) Cost Price.
The fraction 1/0.7 is approximately 1.42857.
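A quick numeric check of this (a minimal Python sketch, assuming the commission is 30% of the sell price):

cost_price = 300
commission_rate = 0.30

# sell price such that revenue after the 30% commission equals the cost price
sell_price = cost_price / (1 - commission_rate)
print(round(sell_price, 2))                                  # 428.57
print(round(sell_price - commission_rate * sell_price, 2))   # 300.0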
I was looking for a similar equation/formula, but then I built it in Excel.
So if you have, say, an item with
CostPrice = 200
Then you add a ProfitMargin = 20% so the
SellPrice = 200 + 20% = 240
Then the reverse equation for this will be
CostPrice = ( SellingPrice * 100 ) / ( ProfitMargin + 100 )
// In your case it should be
CostPrice = ( 240 * 100 ) / ( 20 + 100 )
= 200
Or simply do the following:
CostPrice = SellingPrice / 1.2 = 240 / 1.2 = 200 <-- this removes the added 20%
I was looking for the same thing. None of the people here seemed to understand exactly what you were going for.
Here is the formula I used to get 300 as the answer.
=390/(1+30%)
We ran a logistic regression model with passing the certification exam (0 or 1) as the outcome. We found that one of the strongest predictors is the student's program GPA: the higher the program GPA, the higher the odds of passing the certification exam.
Standardized GPA, p-value < .0001, B estimate = 1.7154, odds ratio = 5.559
I interpret this as: with every 0.33 unit (one standard deviation) increase in GPA, the odds of succeeding in the certification exam are multiplied by 5.559.
However, clients want to understand this in terms of probability. I calculated probability by:
(5.559 - 1) x 100 = 455.9 percent
I'm having trouble explaining this percentage to our client. I thought probability of success is only supposed to range from 0 to 1. So confused! Help please!
Your math is correct; you just need to work on the interpretation.
I suppose the client wants to know "What is the probability of passing the exam if we increase the GPA by 1 unit?"
Using your output, we know that the odds ratio (OR) is 5.559. As you said, this means that the odds in favor of passing the exam increases by 5.559 times for every unit increase in GPA. So what's the increase in probability?
odds(Y=1|X_GPA + 1) = 5.559 = p(Y=1|X_GPA + 1) / (1 - p(Y=1|X_GPA + 1))
Solving for p(Y=1|X_GPA + 1), we get:
p(Y=1|X_GPA + 1) = odds(Y=1|X_GPA + 1) / (1 + odds(Y=1|X_GPA + 1) ) = 5.559 / 6.559 = 0.847.
Note that another way to do this is to make use of the formula for logit:
logit(p) = B_0 + B_1*X_1 +...+ B_GPA*X_GPA therefore
p = 1 / ( 1 + e^-(B_0 + B_1*X_1 +...+ B_GPA*X_GPA) )
Since we know B_GPA = 1.7154, we can calculate that p = 1 / ( 1 + e^-1.7154 ) = 0.847
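Here is the same conversion as a short Python sketch (assuming, as above, that the other terms in the linear predictor are zero):

import math

odds_ratio = 5.559   # from the fitted model
b_gpa = 1.7154       # coefficient for standardized GPA

# probability implied by the odds
p = odds_ratio / (1 + odds_ratio)
print(round(p, 4))    # 0.8475

# equivalently, via the logistic function applied to the coefficient
p2 = 1 / (1 + math.exp(-b_gpa))
print(round(p2, 4))   # 0.8475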
The change in probability of the target (the risk ratio, i.e. p2/p1) depends on the baseline probability (p1), and as such isn't a single value for a given odds ratio.
It can be calculated using the following formula:
RR = OR / (1 - p + (p x OR))
where p is the baseline probability.
Eg.
Odds Ratio   0.1   0.2   0.3   0.4   0.5   0.6
RR(p=0.1)   0.11  0.22  0.32  0.43  0.53  0.63
RR(p=0.2)   0.12  0.24  0.35  0.45  0.56  0.65
RR(p=0.3)   0.14  0.26  0.38  0.49  0.59  0.68
This link elaborates on the formula.
https://www.r-bloggers.com/how-to-convert-odds-ratios-to-relative-risks/
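A minimal Python sketch of this conversion (the function name is just for illustration):

def odds_ratio_to_rr(odds_ratio, baseline_p):
    # convert an odds ratio to a relative risk, given the baseline probability
    return odds_ratio / (1 - baseline_p + baseline_p * odds_ratio)

print(round(odds_ratio_to_rr(0.3, 0.1), 2))   # 0.32, matching the table above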
My business is in wine reselling, and we have a problem I've been trying to solve. We have 50-70 types of wine to be stored at any time, and around 500 tanks of various capacities. Each tank can only hold one type of wine. My job is to determine the minimum number of tanks to hold the maximum number of types of wine, each filled as close to its maximum capacity as possible, i.e. 100l of wine should not be stored in a 200l tank if two tanks of 60l and 40l also exist.
I've been doing the job by hand in Excel and want to try to automate the process, but macros and array formulas quickly get out of hand. I can write a simple program in C or Swift, but I'm stuck at finding a general algorithm. Any pointer on where I can start is much appreciated. A full solution and I will send you a bottle ;)
Edit: for clarification, I do know how many types of wine I have and their total quantities, i.e. Pinot at 700l, Merlot at 2000l, etc. These change every week. The tanks, however, come in many different capacities (40, 60, 80, 100, 200 liters, etc.) and change at irregular intervals, since they have to be taken out for cleaning and replaced. Simply using 70 tanks to hold 70 types is not possible.
Also, the total quantity of wine never matches the total tank capacity, and I need to use the minimum number of tanks to hold the maximum amount of wine. In case of insufficient capacity, the amount of wine left over must be as small as possible (it will spoil quickly). If there is leftover, the amount left over of each type must be proportional to its quantity.
A simplified example of the problem is this:
Wine:
----------
Merlot 100
Pinot 120
Tocai 230
Chardonay 400
Total: 850L
Tanks:
----------
T1 10
T2 20
T3 60
T4 150
T5 80
T6 80
T7 90
T8 80
T9 50
T10 110
T11 50
T12 50
Total: 830L
This greedy-DP algorithm attempts to perform a proportional split: for example, if you have 700l Pinot, 2000l Merlot and tank capacities 40, 60, 80, 100, 200, that means a total capacity of 480.
700 / (700 + 2000) = 0.26
2000 / (700 + 2000) = 0.74
0.26 * 480 = 125
0.74 * 480 = 355
So we will attempt to store 125l of the Pinot and 355l of the Merlot, to make the storage proportional to the amounts we have.
Obviously this isn't fully possible, because you cannot mix wines, but we should be able to get close enough.
To store the Pinot, the closest would be to use tanks 1 (40l) and 3 (80l), then use the rest for the Merlot.
This can be implemented as a subset sum problem:

d[i] = true if we can make sum i and false otherwise
d[0] = true, false otherwise

sum_of_tanks = 0
for each tank i:
    sum_of_tanks += tank_capacities[i]
    for s = sum_of_tanks down to tank_capacities[i]:
        d[s] = d[s] OR d[s - tank_capacities[i]]
Compute the proportions, then run this for each type of wine you have (removing the tanks already chosen, which you can find by using the d array; I can detail this if you want). Look around d[computed_proportion] to find the closest sum achievable for each wine type.
This should be fast enough for a few hundred tanks, which I'm guessing don't have capacities larger than a few thousands.
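For illustration, here is a small Python sketch of the subset-sum step using the example tank capacities from above (it only reports the closest achievable sum, not which tanks to pick; recovering the chosen tanks would need an extra backtracking array):

def closest_reachable_sum(tank_capacities, target):
    # classic 0/1 subset-sum DP over the total capacity
    total = sum(tank_capacities)
    d = [False] * (total + 1)
    d[0] = True
    for cap in tank_capacities:
        for s in range(total, cap - 1, -1):
            d[s] = d[s] or d[s - cap]
    reachable = [s for s in range(total + 1) if d[s]]
    return min(reachable, key=lambda s: abs(s - target))

print(closest_reachable_sum([40, 60, 80, 100, 200], 125))   # 120 (closest sum to the 125l Pinot target)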