How do I normalise a Solr/Lucene score?

I am trying to work out how to improve the scoring of Solr search results. My application needs to take the score from the Solr results and display a number of “stars” depending on how well each result matches the query: 5 stars for an almost exact match, down to 0 stars for a poor match (e.g. only one element hits). However, I am getting scores ranging from 1.4 down to 0.8660254, and both extremes return results that I would give 5 stars to. What I need to do is somehow turn these scores into a percentage so that I can mark the results with the correct number of stars.
The query that I run that gives me the 1.4 score is:
euallowed:true AND (grade:"2:1")
The query that gives me the 0.8660254 score is:
euallowed:true AND (grade:"2:1" OR grade:"1st")
I've already updated the Similarity so that tf and idf return 1.0, as I am only interested in whether a document contains a term, not how many times it occurs. This is what my Similarity code looks like:
import org.apache.lucene.search.Similarity;

public class StudentSearchSimilarity extends Similarity {

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    @Override
    public float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    @Override
    public float tf(float freq) {
        return 1.0f;
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        //return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        return 1.0f;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
So I suppose my questions are:
What is the best way of normalising the score so that I can work out how many “stars” to give?
Is there another way of scoring the results?
Thanks
Grant

To quote http://wiki.apache.org/lucene-java/ScoresAsPercentages:
People frequently want to compute a "Percentage" from Lucene scores to determine what is a "100% perfect" match vs a "50%" match. This is also sometimes called a "normalized score".
Don't do this.
Seriously. Stop trying to think about your problem this way, it's not going to end well.
That page does give an example of how you could in theory do this, but it's very hard.

It's called normalized score (Scores As Percentages).
You can use the following parameters to achieve that:
ns = {!func}product(scale(product(query({!type=edismax v=$q}),1),0,1),100)
fq = {!frange l=20}$ns
Where 20 is your 20% threshold.
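For reference, a minimal SolrJ sketch of how those parameters might be attached to a query (SolrQuery and its methods are standard SolrJ; the wrapper class name and the idea of building it this way are mine, not from the answer):

import org.apache.solr.client.solrj.SolrQuery;

public class ScaledScoreQueryExample {

    // Builds a query whose "ns" parameter rescales the raw score into 0-100
    // and whose filter query drops anything below the 20% threshold.
    public static SolrQuery build(String userQuery) {
        SolrQuery query = new SolrQuery();
        query.setQuery(userQuery);
        query.set("ns", "{!func}product(scale(product(query({!type=edismax v=$q}),1),0,1),100)");
        query.addFilterQuery("{!frange l=20}$ns");
        return query;
    }
}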
See also:
Remove results below a certain score threshold in Solr/Lucene?
http://article.gmane.org/gmane.comp.jakarta.lucene.user/12076
http://article.gmane.org/gmane.comp.jakarta.lucene.user/10810

I've never had to do anything this complicated in Solr, so there may be a way to hook this in as a plugin - but you could handle it in the client when a result set is returned. If you've sorted by relevance this should be straightforward: get the relevance of the first result (max) and the last (min). Then for each result with relevance x, you can calculate
normalisedValue = (x - min) / (max - min)
which will give you a value between 0 and 1. Multiply by 5 and round to get the number of stars.
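For what it's worth, here is a minimal Java sketch of that client-side calculation (the array layout is an assumption; it only needs the page of scores sorted by relevance):

/** Min-max normalise a page of relevance scores and map them to 0-5 stars.
 *  Assumes scores[] is sorted by relevance: scores[0] = max, last = min. */
public class StarRating {

    public static int[] toStars(float[] scores) {
        float max = scores[0];
        float min = scores[scores.length - 1];
        float range = max - min;
        int[] stars = new int[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Guard against a zero range when all scores are identical.
            double normalised = range == 0 ? 1.0 : (scores[i] - min) / range;
            stars[i] = (int) Math.round(normalised * 5);
        }
        return stars;
    }
}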

Related

Measuring F1-score for NER

I am trying to evaluate a model of artificial intelligence for NER (Named Entity Recognition).
In order to compare with other benchmarks, I need to calculate the model's F1-score. However, I am unsure how to code this.
My idea was:
True-positives: equal tokens and equal tags, true-positive for the tag
False-negative: equal tokens and unequal tags or token did not appear in the prediction, false-negative for the tag
False-positive: token does not exist but has been assigned to a tag, example:
Phrase: "This is a test"
Predicted: {token: This is, tag: WHO}
True pairs: {token: This, tag: WHO} {token: a test, tag: what}
In this case, {token: This is, tag: WHO} is considered as a false positive of WHO.
The code:
for val predicted tokens (pseudo-code) {
    // val = struct { tokens, tags } from a phrase
    for (auto const &j : val.tags) {
        if (j.first == current_tokens) {
            if (j.second == tag) {
                true_positives[tag_id]++;
            } else {
                false_negatives[tag_id]++;
            }
            current_token_exists = true;
        }
    }
    if (!current_token_exists) {
        false_positives[tag_id]++;
    }
}
for (auto const &i : val.tags) {
    bool find = 0;
    for (auto const &j : listed_tokens) {
        if (i.first == j) { find = 1; break; }
    }
    if (!find) {
        false_negatives[str2tag_id[i.second]]++;
    }
}
After this, calculate the F1:
float precision_total, recall_total, f_1_total;
precision_total = total_true_positives / (total_true_positives + total_false_positives);
recall_total = total_true_positives / (total_true_positives + total_false_negatives);
f_1_total = (2 * precision_total * recall_total) / (precision_total + recall_total);
However, I believe that I am wrong in some concept. Does anyone have an opinion?
This is not a complete answer.
Taking a look here
we can see that there are many possible ways of defining an F1 score for NER. At least six cases are considered besides plain TP, TN, FN, and FP, since a tag can correspond to more than one token, and therefore we may want to count partial matches.
If you take a look, there are different ways of defining the F1 score, some of them defining the TP as a weighted average of strict positives and partial positives, for example.
CoNLL, which is one of the most famous benchmarks for NER, looks like it uses a strict definition of recall and precision, which is enough to define the F1 score:
precision is the percentage of named entities found by the learning
system that are correct. Recall is the percentage of named entities
present in the corpus that are found by the system. A named entity is
correct only if it is an exact match of the corresponding entity in
the data file.
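As a rough illustration of that strict definition, here is a minimal Java sketch (the "start:end:TAG" span encoding is a made-up convention for the example, not part of CoNLL) that counts exact matches and derives precision, recall and F1:

import java.util.List;
import java.util.Set;

/** Strict (exact-match) NER evaluation: an entity counts as correct only if
 *  its span and tag both match a gold entity exactly. */
public class StrictNerF1 {

    /** goldPerSentence and predPerSentence hold entities per sentence,
     *  encoded as "start:end:TAG" strings. Returns {precision, recall, f1}. */
    public static double[] evaluate(List<Set<String>> goldPerSentence,
                                    List<Set<String>> predPerSentence) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < goldPerSentence.size(); i++) {
            Set<String> gold = goldPerSentence.get(i);
            Set<String> pred = predPerSentence.get(i);
            for (String entity : pred) {
                if (gold.contains(entity)) tp++;  // exact span and tag match
                else fp++;                        // predicted, but not in gold
            }
            for (String entity : gold) {
                if (!pred.contains(entity)) fn++; // gold entity that was missed
            }
        }
        double precision = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        double f1 = precision + recall == 0 ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }
}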

Why does this programmatically generated musical chord not sound correct?

I have the following class which generates a buffer containing sound data:
package musicbox.example;

import javax.sound.sampled.LineUnavailableException;
import musicbox.engine.SoundPlayer;

public class CChordTest {

    private static final int SAMPLE_RATE = 1024 * 64;
    private static final double PI2 = 2 * Math.PI;

    /*
     * Note frequencies in Hz.
     */
    private static final double C4 = 261.626;
    private static final double E4 = 329.628;
    private static final double G4 = 391.995;

    /**
     * Returns buffer containing audio information representing the C chord
     * played for the specified duration.
     *
     * @param duration The duration in milliseconds.
     * @return Array of bytes representing the audio information.
     */
    private static byte[] generateSoundBuffer(int duration) {
        double durationInSeconds = duration / 1000.0;
        int samples = (int) durationInSeconds * SAMPLE_RATE;
        byte[] out = new byte[samples];

        for (int i = 0; i < samples; i++) {
            double value = 0.0;
            double t = (i * durationInSeconds) / samples;
            value += Math.sin(t * C4 * PI2); // C note
            value += Math.sin(t * E4 * PI2); // E note
            value += Math.sin(t * G4 * PI2); // G note
            out[i] = (byte) (value * Byte.MAX_VALUE);
        }
        return out;
    }

    public static void main(String... args) throws LineUnavailableException {
        SoundPlayer player = new SoundPlayer(SAMPLE_RATE);
        player.play(generateSoundBuffer(1000));
    }
}
Perhaps I'm misunderstanding some physics or math here, but it seems like each sinusoid ought to represent the sound of each note (C, E, and G), and by summing the three sinusoids, I should hear something similar to when I play those three notes simultaneously on the keyboard. What I'm hearing, however, is not even close to that.
For what it's worth, if I comment out any two of the sinusoids and keep the third, I do hear the (correct) note corresponding to that sinusoid.
Can somebody spot what I'm doing wrong?
To combine audio signals you need to average their samples, not sum them.
Divide the value by 3 before converting to byte.
You don't say in what way it sounds incorrect, but adding three sin values like that gives you a signal that ranges from -3.0 to 3.0, so it is going to clip when you multiply by Byte.MAX_VALUE. This is why averaging probably worked for you. Adding is correct; you just need to scale the result afterwards to prevent clipping, and dividing by the number of sine waves is the easiest way to do this. But if you start changing the number of sine waves dynamically and try to use the same strategy, you won't get the result you expect; you have to scale the signal for when your signal is at its loudest. Remember that real audio is not going to be at maximum amplitude, so you don't have to worry about it too much if your synthesised audio isn't. Also, the way we perceive sound volume is logarithmic, so a signal at half amplitude is a difference of -3dB, which is pretty close to the smallest change in amplitude we can hear.
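To make the scaling suggestion concrete, here is a minimal sketch of the buffer-generation loop with the sum divided by the number of sine waves before converting to a byte (the method and parameter names are mine, not from the original class):

/** Sum one sine wave per frequency, then divide by the number of waves so
 *  the combined signal stays within [-1, 1] before scaling to a byte. */
private static byte[] generateChordBuffer(double[] frequencies,
                                          int sampleRate, int durationMs) {
    int samples = (int) (sampleRate * (durationMs / 1000.0));
    byte[] out = new byte[samples];
    double pi2 = 2 * Math.PI;
    for (int i = 0; i < samples; i++) {
        double t = (double) i / sampleRate; // time in seconds
        double value = 0.0;
        for (double f : frequencies) {
            value += Math.sin(pi2 * f * t);
        }
        value /= frequencies.length;        // scale to prevent clipping
        out[i] = (byte) (value * Byte.MAX_VALUE);
    }
    return out;
}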

Find points within a distance using CQL3

I have a Cassandra table with user name, latitude and longitude. I would like to get a list of users who are inside the circle with a given latitude, longitude and distance.
For example: my input is Lat = 78.3232, Long = 65.3234 and distance = 30 miles.
I would like to get a list of users who are within a 30-mile distance from the point 78.3232, 65.3234. Is it possible to solve this with a single CQL3 query? Or can anyone give me a hint to start solving this query?
There was no geospatial support in Cassandra, so I implemented it mathematically: generate bounding-box coordinates around the point (which was good enough for my purposes) and then query for coordinates within that boundary.
I'll post the code for others reference.
public class GeoOperations {

    public static final int UPPER_LATITUDE = 0;
    public static final int LOWER_LATITUDE = 1;
    public static final int UPPER_LONGITUDE = 2;
    public static final int LOWER_LONGITUDE = 3;

    private static final double KM_TO_MILES = 0.621371;

    private final double Latitude;
    private final double Longitude;
    double Boundary[];

    public GeoOperations(double init_latitude, double init_longitude) {
        Latitude = init_latitude;
        Longitude = init_longitude;
        Boundary = new double[4];
    }

    // Distance is given in kilometres and converted to miles to match the
    // 69 miles-per-degree and 3960-mile Earth radius constants below.
    public void GenerateBoxCoordinates(double Distance) {
        Distance = Distance * KM_TO_MILES;
        double Lat_Factor = Distance / 69;
        Boundary[UPPER_LATITUDE] = Latitude + Lat_Factor;
        Boundary[LOWER_LATITUDE] = Latitude - Lat_Factor;
        // Math.cos expects radians, so convert the latitude from degrees first.
        double Long_Factor = Distance / (3960 * 2 * Math.PI / 360 * Math.cos(Math.toRadians(Latitude)));
        Boundary[UPPER_LONGITUDE] = Longitude + Long_Factor;
        Boundary[LOWER_LONGITUDE] = Longitude - Long_Factor;
        for (double x : Boundary) {
            System.out.println(x);
        }
    }
}
And then I used simple CQL to find coordinates within the ranges.
If the values are like this:
UPPER_LATITUDE = 60
LOWER_LATITUDE = 40
UPPER_LONGITUDE = 10
LOWER_LONGITUDE = 5
the query will be something like this (actually I used Kundera with Hibernate and a JPA query, so I haven't tested it, but it should work):
SELECT * FROM Points_Table
WHERE LATITUDE > 40
AND LATITUDE < 60
AND LONGITUDE > 5
AND LONGITUDE < 10;
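For what it's worth, here is a hypothetical usage sketch tying the two halves together: it builds the box for the question's 30-mile radius (converted to kilometres, since GenerateBoxCoordinates expects km) and assembles the CQL string. It assumes the Boundary field is accessible (same package, or add a getter), the table and column names are made up, and depending on your schema the range filter may need a secondary index or ALLOW FILTERING:

public class GeoQueryExample {

    public static void main(String[] args) {
        double lat = 78.3232, lon = 65.3234;
        double distanceKm = 30 / 0.621371; // 30 miles expressed in kilometres

        GeoOperations geo = new GeoOperations(lat, lon);
        geo.GenerateBoxCoordinates(distanceKm);

        // Assemble the bounding-box query (table/column names are assumptions).
        String cql = "SELECT * FROM points_table"
                + " WHERE latitude > " + geo.Boundary[GeoOperations.LOWER_LATITUDE]
                + " AND latitude < " + geo.Boundary[GeoOperations.UPPER_LATITUDE]
                + " AND longitude > " + geo.Boundary[GeoOperations.LOWER_LONGITUDE]
                + " AND longitude < " + geo.Boundary[GeoOperations.UPPER_LONGITUDE]
                + " ALLOW FILTERING;";
        System.out.println(cql);
    }
}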
If you're using DataStax Enterprise, you get geospatial search out of the box. Check out Patrick's demo:
https://github.com/PatrickCallaghan/datastax-geospatial-demo

Bayes' formula for updating probabilistic map

I'm trying to get a mobile robot to map an arena based on what it can see from a camera. I've created a map and managed to get the robot to identify items placed in the arena and give an estimated location; however, as I'm only using an RGB camera, the resulting numbers can vary slightly every frame due to noise, changes in lighting, etc. What I am now trying to do is create a probability map using Bayes' formula to give a better map of the arena.
Bayes' Formula
P(i | x) = p(i) p(x|i) / sum_j( p(j) p(x|j) )
This is what I've got so far. All points on the map are initialised to 0.5.
// Gets the likelihood of the event being correct
// Para 1 = is the object likely to be at that location
// Para 2 = is the sensor saying it's at that location
private double getProbabilityNum(bool world, bool sensor)
{
    if (world && sensor)
    {
        // number to test the function works
        return 0.6;
    }
    else if (world && !sensor)
    {
        // number to test the function works
        return 0.4;
    }
    else if (!world && sensor)
    {
        // number to test the function works
        return 0.2;
    }
    else //if (!world && !sensor)
    {
        // number to test the function works
        return 0.8;
    }
}

// A function to update the map's probability of an object being at location (x,y)
// Para 3 = does the sensor pick up an object at (x,y)
public double probabilisticMap(int x, int y, bool sensor)
{
    // gets the current likelihood from the map (prior probability)
    double mapProb = get(x, y);
    // decide if an object is at location (x,y)
    bool world = (mapProb < threshold);
    // Bayes' formula to update the probability
    double newProb =
        (getProbabilityNum(world, sensor) * mapProb) /
        ((getProbabilityNum(world, sensor) * mapProb) + (getProbabilityNum(!world, sensor) * (1 - mapProb)));
    // update the location on the map
    set(x, y, newProb);
    // return the probability as well
    return newProb;
}
It does work, but the numbers seem to jump rapidly and then flicker when they are at the top, and it also errors if the numbers drop too near to zero. Does anyone have any idea why this might be happening? I think it's something to do with the way the equation is coded, but I'm not too sure. (I found this, but I don't quite understand it, so I'm not sure of its relevance, but it seems to be talking about the same thing.)
Thanks in advance.
Use log-likelihoods when doing numerical computations involving probabilities.
Consider
P(i | x) = p(i) p(x|i) / sum_j( p(j) p(x|j) ).
Because x is fixed, the denominator, p(x), is a constant. Thus
P(i | x) ~ p(i)p(x|i)
where ~ denotes "is proportional to."
The log-likelihood function is just the log of this. That is,
L(i | x) = log(p(i)) + log(p(x|i)).
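A minimal sketch of how that advice is commonly applied to an occupancy-style map cell: store the log-odds of each cell and add the log of the sensor likelihood ratio on every observation. The sensor probabilities below are just the placeholder values from the question's getProbabilityNum; it's written in Java here rather than C#, but the idea carries over directly.

/** One map cell storing log(p / (1 - p)); updates are additions, so the
 *  value never collapses to exactly 0 or 1. */
public class LogOddsCell {

    private double logOdds = 0.0; // prior of 0.5 has log-odds 0

    // log of P(hit | object) / P(hit | no object) = log(0.6 / 0.2)
    private static final double L_HIT = Math.log(0.6 / 0.2);
    // log of P(miss | object) / P(miss | no object) = log(0.4 / 0.8)
    private static final double L_MISS = Math.log(0.4 / 0.8);

    public void update(boolean sensorHit) {
        logOdds += sensorHit ? L_HIT : L_MISS;
    }

    /** Convert back to a probability only when it needs to be displayed. */
    public double probability() {
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }
}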

Divide up money evenly in C# using a functional approach

I have these 2 values:
decimal totalAmountDue = 1332.29m;
short installmentCount = 3;
I want to create 3 installments with an even amount based on totalAmountDue (extra pennies are applied starting with the lowest installment number and going to the highest) using this class:
public class Installment
{
    public Installment( short installmentNumber, decimal amount )
    {
        InstallmentNumber = installmentNumber;
        Amount = amount;
    }

    public short InstallmentNumber { get; private set; }
    public decimal Amount { get; private set; }
}
The installments should be as follows:
{ InstallmentNumber = 1, Amount = 444.10m }
{ InstallmentNumber = 2, Amount = 444.10m }
{ InstallmentNumber = 3, Amount = 444.09m }
I am looking for an interesting way to create my 3 installments. Using a simple LINQ to objects method would be nice. I have been trying to understand more about functional programming lately and this seems like it could be a fairly good exercise in recursion. The only decent way I can think of doing this is with a traditional while or for loop at the moment...
There's not a whole lot here that is "functional". I would approach the problem like this:
var pennies = (totalAmountDue * 100) % installmentCount;
var monthlyPayment = totalAmountDue / installmentCount;
var installments = from installment in Enumerable.Range(1, installmentCount)
                   let amount = monthlyPayment + (Math.Max(pennies--, 0m) / 100)
                   select new Installment(installment, amount);
You might be able to work something out where you constantly subtract the previous payment from the total amount and do the division rounding up to the nearest penny. In F# (C# is too wordy for this) it might be something like:
let calculatePayments totalAmountDue installmentCount =
    let rec getPayments l (amountLeft:decimal) = function
        | 0 -> l
        | count ->
            let paymentAmount =
                (truncate (amountLeft / (decimal)count * 100m)) / 100m
            getPayments (new Installment(count, paymentAmount)::l)
                        (amountLeft - paymentAmount)
                        (count - 1)
    getPayments [] totalAmountDue installmentCount
For those unfamiliar with F#, what that code is doing is setting up a recursive function (getPayments) and bootstrapping it with some initial values (empty list, starting values). Using match expressions it sets up a terminator (if installmentCount is 0) returning the list so far. Otherwise it calculates the payment amount and calls the recursive method adding the new installment to the front of the list, subtracting the payment amount from the amount left, and subtracting the count.
This is actually building the list in reverse (adding on to the front each time), so we throw away the extra pennies (the truncate) and eventually it catches up with us so the penny rounding works as expected. This is obviously more math intensive than the add/subtract code above since we divide and multiply in every iteration. But it is fully recursive and takes advantage of tail recursion so we'll never run out of stack.
The trouble with C# here is that you want a sequence of installments and recursion and there's no idiomatic built-in structure for doing that in C#. Here I'm using F#'s list which is immutable and O(1) operation to prepend.
You could possibly build something using the Scan() method in the Reactive Extensions to pass state from one instance to another.
Talljoe,
I think you are pushing me in the right direction. The code below seems to work. I had to switch out how the penny math was working, but this looks pretty good (I think):
decimal totalAmountDue = 1332.29m;
short installmentCount = 8;

var pennies = (totalAmountDue * 100) % installmentCount;
var monthlyPayment = Math.Floor(totalAmountDue / installmentCount * 100);
var installments = from installmentNumber in Enumerable.Range(1, installmentCount)
                   let extraPenny = pennies-- > 0 ? 1 : 0
                   let amount = (monthlyPayment + extraPenny) / 100
                   select new Installment(installmentNumber, amount);
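For comparison, here is a minimal sketch of the same idea outside LINQ/F# (not from the thread), working in integer cents so no decimal rounding is needed and handing the leftover pennies to the lowest installment numbers first:

import java.util.ArrayList;
import java.util.List;

public class EvenSplit {

    /** Returns {installmentNumber, amountInCents} pairs. */
    public static List<long[]> split(long totalCents, int installmentCount) {
        long base = totalCents / installmentCount;
        long extraPennies = totalCents % installmentCount;
        List<long[]> installments = new ArrayList<>();
        for (int number = 1; number <= installmentCount; number++) {
            // The first 'extraPennies' installments each absorb one extra cent.
            long amount = base + (number <= extraPennies ? 1 : 0);
            installments.add(new long[] { number, amount });
        }
        return installments;
    }

    public static void main(String[] args) {
        // 1332.29 split three ways -> 444.10, 444.10, 444.09
        for (long[] i : split(133229L, 3)) {
            System.out.printf("Installment %d: %d.%02d%n", i[0], i[1] / 100, i[1] % 100);
        }
    }
}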
