Acceptable ranges for Solr bias values - search

I've been doing a lot of work recently with applying bias in Solr when searching in order to get more relevant results, and one thing I'm curious about is the acceptable range of bias values. For instance, in one Solr implementation I've seen, the applicable bias values range from 0.1 to 21.0, with intermediate values of 0.2, 0.3, 0.5, 0.8, 1.0, 2.0, 3.0, 5.0, 8.0, and 13.0. In another place, I've seen a max value of 100. In everything I've read, I've never seen a definition of acceptable value ranges. Is there such a thing? I'm guessing that there are some complex mathematical concepts behind biasing, so I'm also wondering what the best practices are when it comes to defining bias value ranges.
Another question along these lines: does the ratio between bias values come into play? For instance, if I have two fields, title and body, and in my qf param I add
title^8 body^2
does that mean that the title field has 4x more weight than the body field, or would adding
title^3 body^2
have the same effect?

You can append debugQuery=true to any query to see exactly how each field contributes to the calculated score.
The weights given in qf are multiplied with the score calculated for the match, so title^8 will contribute more to the final score from the title field than title^3 does.
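For example, combining the two (a minimal sketch only - the edismax handler and the query term are assumptions here, the field names are taken from your example):
q=ipod&defType=edismax&qf=title^8 body^2&debugQuery=true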
This can be quickly tested. With ^2.0:
(MATCH) max of:
  0.13514908 = (MATCH) weight(field:term^2.0 in 36)
With ^4.0:
(MATCH) max of:
  0.27026632 = (MATCH) weight(field:term^4.0 in 36)
.. which is almost exactly twice as much.
So ^8 vs ^2 would mean that the first field is weighted four times as heavily as the second field.
Be aware that this comparison works here because the same query normalization is used for both queries (which would not be the case if there were a far larger difference between the boost values - scores across queries aren't really comparable).
Acceptable values are anything within the range of a double, and "best practice" is to experiment until you get the matching profile you're looking for. There is no hard science to this; you'll have to tweak the values (and there are machine learning options for this if you have enough signals) to get the result list you want.

Related

Calculate risk using Cox model coefficients and mean values

I'm trying to understand the example presented in Appendix C here
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6481149/
Equation C1 is clear to me.
But in Equation C2 they use the mean values.
Such mean values are clear to me in the case of categorical variables; for example, 1.548 is the mean value of the Sex variable (as shown in Table 3). Please correct me if I'm wrong.
But for numerical variables I don't understand which mean values they are using. For example, for the Age variable they use 3.768. If I understand correctly, that value is the log of the mean age, which should be log(44.15) = 1.64. Instead, the value used is 3.768.
Could anybody please clarify where this value comes from?
In statistics log often means the natural logarithm, sometimes denoted ln. The four values they take the logarithms of are:
Variable     Reported Mean   ln(Mean)   Reported value (used in Eq. C2)
Age          44.15           3.788      3.768
BMI          25.61           3.243      3.230
BP Syst      138.6           4.932      4.913
Pulse Rate   75.61           4.326      4.311
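As a quick check of the arithmetic in Excel (just a sketch of the calculation, not anything taken from the paper):
=LN(44.15) returns about 3.788
=LOG10(44.15) returns about 1.645
so the reported 3.768 is much closer to the natural logarithm than to the base-10 value you expected.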
The calculated values are not exactly equal to the reported values. But it looks close enough that this is probably the calculation they used. Without the data and/or code they used it's hard to say why the results are different. The study mentions excluding 130 participants because of ethics protections. So, perhaps one table was calculated using a slightly different group of participants than the other table?

Estimating percentiles in a skewed distribution (doesn't need to be exact)

This may be more of a statistics question, and I'd like to find a solution with Excel. I'd rather use simple VBA if any coding is necessary.
Is there a way to estimate the percentile of a specific data point in a skewed distribution? I don't need exact percentiles and only need a reasonable estimate. I work on analyses that rely on weighted average benchmarks reported by multiple sources. All of my sources report the 25th, 50th, 75th, and 90th percentiles as well as the mean and standard deviation. We use these benchmarks to set a target range, and our goal is for our results from a specific analysis to land somewhere within the published percentiles. I'm often asked to indicate what percentile our specific result is at, and all I can provide is broad ranges like 25th-50th, etc. So, I'm then asked to use simple extrapolation to determine the specific percentile of the specific result, and I know that using this method is inaccurate.
Mean and median differ in 99% of cases in my data set, but the % difference between mean and median is on average only 6%. Only about 10% of cases have a greater-than-10% difference between mean and median.
For the 90% of cases with a relatively low % difference between mean and median, can I assume a normal distribution?
For the cases with a higher % difference between mean and median, can I make an assumption that will help me estimate more accurately? For these cases I could just use the normal distribution and send my percentile estimate along with a note indicating that the estimate is likely off in one direction or the other, but I'd rather give a better estimate.
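For what it's worth, the normal-distribution estimate I have in mind would be a one-line formula (a sketch only, assuming the published mean is in B1, the standard deviation in B2, and our result in B3):
=NORM.DIST(B3,B1,B2,TRUE)
This returns the estimated percentile as a fraction between 0 and 1, and for the skewed cases it would be biased in the direction of the skew.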
Responding to cybernetic.nomad:
First, thanks for commenting! Second, it doesn't seem to work. I think I don't have enough data. The attached image shows an example. The first 5 rows show one set of my weighted average benchmarks for a single case. Below that, I added two rows - one with my "target" amount. This could be any number but, to test out the formula you suggested, I entered my 50th percentile weighted average. The row below that has the result of the formula =PERCENTRANK.EXC(25th:90th,target). The result should be 0.5 but it's not, so I don't think this works.

Excel: Add number before multiplying with PRODUCT(...)

I am calculating the geometric mean of a row in MS Excel by using the GEOMEAN(...) command.
What is the geometric mean: The row could be A1:A10. The geometric mean computed with
GEOMEAN(A1:A10)
is the 10th root of the product of all 10 cell values (mathematically: (A_1 x A_2 x ... x A_n)^(1/n)).
The issue: The command GEOMEAN(A1:A10) works fine as long as no cells contain negative values (actually, as long as the product ends up positive). If one cell has a negative value, then taking the root is mathematically an invalid operation and Excel gives an error.
The solution: I can work around this by adding a large enough number, such as +1000000, to each value before doing GEOMEAN(A1:A10) and afterwards subtracting 1000000 from the result. This is a mathematical approximation to the pure geometric mean.
The question: But how do I add +1000000 to each value in Excel? A solution would be to create a whole new extra row where the number is added, and then doing GEOMEAN on this row and subtracting the number from the result. But I would really like to avoid creating a new row, since I have many long data sets to perform this command on.
Is there a way to add the number inside the command itself? To add it onto each value before it is multiplied? Something along the lines of:
GEOMEAN(A1:A10+1000000)-1000000
Solution to avoid the work-around
Based on the answer from, and discussion with, @ImaginaryHuman072889
It turns out that a working command that avoids any work-around is:
IFERROR(GEOMEAN(A1:A10);-GEOMEAN(ABS(A1:A10)))
If an error is caught by the IFERROR, we know that a negative result would have appeared, so it is constructed manually in that case.
BUT: This still does not handle the case mentioned by @ImaginaryHuman072889, because Excel seems to forbid any negative input values, not just cases where the inner product is negative. For example, both GEOMEAN(-2,-2) and GEOMEAN(-2,-2,-2) give errors in Excel, even though they should be mathematically valid, giving the results 2 and -2, respectively. To overcome this Excel issue, we can simply write out the same calculation manually:
IFERROR(PRODUCT(A1:A10)^(1/COUNTA(A1:A10));-(PRODUCT(ABS(A1:A10))^(1/COUNTA(A1:A10))))
I add this solution to aid anyone who has the same issue. This works mathematically, but the fact that -2 and -2 have the geometric mean 2 does seem a bit odd and not at all like a useful value of a "mean". It is still mathematically legal as far as I can find (WolframAlpha has no issue with it and the Wikipedia article never mentions a sign).
Your "workaround" of doing this:
GEOMEAN(A1:A10+1000000)-1000000
Is completely wrong. This is absolutely not equal to GEOMEAN(A1:A10).
Simple counter-example:
GEOMEAN({2,8}) returns the value of 4, which is the geometric mean of 2 and 8.
GEOMEAN({2,8}+1)-1 is equal to GEOMEAN({3,9})-1 which is approximately 4.196.
A valid workaround is to multiply each value inside GEOMEAN by a constant, then divide the result by that constant.
Simple example:
GEOMEAN({2,8}*3)/3 is equal to GEOMEAN({6,24})/3 which is 4.
However, this method of multiplying by a constant does not help your situation, since this won't get rid of negative values.
Mathematically speaking, the geometric mean of a positive number and a negative number is an imaginary number, which is presumably why Excel cannot handle it.
Example:
2*-8 = -16
sqrt(-16) = 4i
Therefore, 4i is the geometric mean of 2 and -8. Notice how it has the same magnitude as GEOMEAN({2,8}), just that it is an imaginary number.
All that said... here is what I recommend you do:
I suggest you return two results, one result is the magnitude of the geometric mean and the other is the phase of the geometric mean.
Formula for magnitude:
= GEOMEAN(ABS(A1:A10))
(Note, this is an array formula, so you'd have to press Ctrl+Shift+Enter instead of just Enter after typing this formula.) The use of ABS converts all negative numbers to positive before the GEOMEAN calculation, guaranteeing a positive geometric mean.
Formula for phase, I would just do something like this:
= IF(PRODUCT(A1:A10)>=0,"Real","Imaginary")
Which obviously returns Real if the geometric mean is a real number and returns Imaginary if the geometric mean is an imaginary number.
EDIT
Technically speaking, some of what I said wasn't completely precise, although the magnitude formula above still stands.
Some things I want to clarify:
If PRODUCT(data) is positive (or zero), then the geometric mean of data is positive (or zero).
If PRODUCT(data) is negative and if the number of entries in data is odd, then the geometric mean of data is negative (but still real).
If PRODUCT(data) is negative and if the number of entries in data is even, then the geometric mean of data is imaginary.
That said... if you want these formulas to be a bit more technically accurate, I would modify to this:
Adjusted formula for magnitude:
= GEOMEAN(ABS(A1:A10))*IF(AND(PRODUCT(A1:A10)<0,MOD(COUNT(A1:A10),2)=1),-1,1)
Adjusted formula for phase:
= IF(AND(PRODUCT(A1:A10)<0,MOD(COUNT(A1:A10),2)=0),"Imaginary","Real")
If the geometric mean is real, it returns the precise geometric mean (whether it is positive or negative), and if the geometric mean is imaginary, it returns a positive real value with the correct magnitude.
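As a quick sanity check with small, hand-calculable inputs: for {-2,-2,-2}, PRODUCT is -8 and COUNT is 3 (odd), so the adjusted magnitude formula returns -2 and the phase formula returns Real; for {2,-8}, PRODUCT is -16 and COUNT is 2 (even), so the magnitude formula returns 4 and the phase formula returns Imaginary, which matches the 4i example above.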
So, I just found the answer - although I have no idea why this works.
Doing GEOMEAN(A1:A10+1000000)-1000000 is actually possible. But if you just press Enter, a #VALUE! error is displayed. You must press Ctrl+Shift+Enter to have the actual result displayed.
According to this: https://www.mrexcel.com/forum/excel-questions/264366-calculating-geometric-mean-some-negative-values.html
If anyone has an explanation for this, I am very interested.

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms with Excel. Can anyone suggest a function?
As a result, it would be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or a 95% confidence interval)".
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task but I failed to find a scientific measurement function.
Any advice on a built-in Excel function or function snippets would be appreciated.
Thanks.
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are simply real numbers between 1 and 100 (they are percentage values). As each row represents a different parameter, the values in a row represent an algorithm's results for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced is 10% higher than Algorithm B's. But I don't know if this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B while Algorithm B has higher scores for the rest, and just because of this one result the difference in averages is 10%.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want to do an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function, TTEST, which is what you need.
For your example you should probably use two tails and type 2.
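A minimal sketch, assuming (purely for illustration) that Algorithm A's results are in A2:A31 and Algorithm B's results are in B2:B31:
=TTEST(A2:A31,B2:B31,2,2)
In Excel 2010 and later the same function is also available under the name T.TEST.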
The formula will output a probability value known as probability of alpha error. This is the error which you would make if you assumed the two datasets are different but they aren't. The lower the alpha error probability the higher the chance your sets are different.
You should only accept the difference between the two datasets if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per dataset to be reliable enough, and that the type 2 test assumes equal variances of the two datasets. If equal variances are not given, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm

Statistically removing erroneous values

We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used for producing charts of how the price has moved... Every now and then a user enters a price wrongly (e.g. puts in one zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bone. Say the prices are share prices (they are not but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices and sometimes one or two are way wrong. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
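To illustrate the idea in spreadsheet terms (a sketch only - the range A2:A151 for one day's prices and the 3-standard-deviation threshold are just assumptions to make it concrete):
=IF(ABS(A2-AVERAGE($A$2:$A$151))>3*STDEV($A$2:$A$151),"suspect","ok")
The same mean/standard-deviation test is straightforward to express in whatever language your application is written in.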
That's a great question, but it may lead to quite a bit of discussion, as the answers could be very varied. It depends on:
how much effort you are willing to put into this;
whether some values could genuinely differ by +/-20% (or whatever test you invent), so that there will always be a need for some human intervention;
and, to invent a relevant test, I'd need to know far more about the subject matter.
That being said the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values of the last 3 months); a normal or Gaussian distribution would enable you to give each value a degree of certainty as to whether it is a mistake or accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function - there are adequate links from these pages to help in programming this; also, depending on the language you're using, there are likely to be functions and/or plugins available to help.
A more advanced method could be to have some sort of learning algorithm that takes other parameters into account (on top of the last x values); for instance, it could take the product type or manufacturer into account, or even monitor the time of day or the user who entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code it and also to train the learning algorithm.
I think the second option is the right one for you. Using the standard deviation (a lot of languages contain a function for this) may be a simpler alternative; it is simply a measure of how far a value has deviated from the mean of the previous x values. I'd put the standard deviation option somewhere between options 1 and 2.
You could measure the standard deviation in your existing population and exclude those that are greater than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...
Or graph a moving average of prices instead of the actual prices.
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to deal with the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (a data point is x% more than the next point, or than the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to view what happens after the suspect point: if it is an erroneous entry, it will show a sharp upturn followed by a sharp downturn.
If, however, you care about the overall trend, plotting the daily trimmed mean, median, and 5th and 95th percentiles will portray the history well.
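In spreadsheet terms the trimmed-mean and percentile versions are one-liners (a sketch only, assuming the day's ~150 prices are in A2:A151):
=TRIMMEAN(A2:A151,0.04) averages the prices after discarding 4% of the points, split between the highest and lowest values
=PERCENTILE(A2:A151,0.95) gives the upper end of the middle 90% band (use 0.05 for the lower end)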
Choose your display methods, and how much outlier detection you need to do, based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.
