SQLite full-text search relevance ranking - search

I am using the fts4 extension of sqlite3 to enable full-text indexing and searching of text data. This is working great, but I've noticed that the results are not relevance-ranked at all. I guess I am too used to Lucene. I've seen some brief suggestions to write a custom rank method using the matchinfo() results, but it's not clear to me how this is done, or whether there are any sophisticated examples out there. How have others dealt with this?

There's a complete example in the documentation; look at the end of Appendix A. You'll need to do slightly more work to get a good relevance ranking, as the function provided is only good for getting started. For example, matchinfo(table, 'pcnalx') returns enough information to implement Okapi BM25.
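To make the 'pcnalx' suggestion concrete, here is a minimal Python sketch of how the blob could be decoded and scored. The function name bm25_from_matchinfo and the defaults k1 = 1.2 and b = 0.75 are my own choices, and clamping a negative IDF to zero is an assumption rather than anything the SQLite documentation prescribes. It relies on the documented layout: p, c, n, then c values for 'a', c values for 'l', then three values per phrase and column for 'x'.

```python
import math
import struct


def bm25_from_matchinfo(blob, k1=1.2, b=0.75):
    """Score one row from the blob returned by matchinfo(tbl, 'pcnalx')."""
    # matchinfo() yields an array of unsigned 32-bit integers in native byte order.
    ints = struct.unpack(f"={len(blob) // 4}I", blob)
    phrases, cols, total_docs = ints[0], ints[1], ints[2]
    avg_len = ints[3:3 + cols]             # 'a': average token count per column
    doc_len = ints[3 + cols:3 + 2 * cols]  # 'l': token count of this row per column
    x = ints[3 + 2 * cols:]                # 'x': 3 values per (phrase, column)

    score = 0.0
    for p in range(phrases):
        for c in range(cols):
            base = 3 * (c + p * cols)
            tf = x[base]       # hits for this phrase in this row and column
            df = x[base + 2]   # number of rows that contain the phrase in this column
            if tf == 0:
                continue
            idf = math.log((total_docs - df + 0.5) / (df + 0.5))
            norm = 1 - b + b * doc_len[c] / max(avg_len[c], 1)
            # Assumption: very common phrases (negative IDF) contribute nothing.
            score += max(idf, 0.0) * tf * (k1 + 1) / (tf + k1 * norm)
    return score
```

With Python's sqlite3 module this could be registered through conn.create_function("rank", 1, bm25_from_matchinfo) and used as ORDER BY rank(matchinfo(tbl, 'pcnalx')) DESC, since higher scores are better here.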

There seems to be a distinct lack of documentation on how to implement Okapi BM25 in C and it seems it is an unspoken thing that the implementation is left as an exercise for the user.
Well I found the bro of a programmer "Radford 'rads' Smith" who chucked this up on GitHub
https://github.com/rads/sqlite-okapi-bm25
It only implements BM25, although I'm looking into BM25F tweaks now... and here it is:
https://github.com/neozenith/sqlite-okapi-bm25

For FTS5, according to SQLite FTS5 Extension,
FTS5 has no matchinfo().
FTS5 supports ORDER BY rank
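So, very simply, something like
SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;
works without DESC, because the built-in rank is a bm25() score in which better matches have numerically lower values.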
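A small end-to-end sketch of that query, run from Python's sqlite3 module. It assumes the SQLite library behind your Python build was compiled with FTS5; the table layout and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: FTS5 is compiled into the linked SQLite library.
conn.execute("CREATE VIRTUAL TABLE email USING fts5(subject, body)")
conn.executemany(
    "INSERT INTO email(subject, body) VALUES (?, ?)",
    [
        ("weekly report", "nothing about search in this message"),
        ("fts5 notes", "fts5 ranks fts5 matches with bm25"),
        ("misc", "a long message that mentions fts5 only once among many other unrelated words"),
        ("lunch", "pizza on friday"),
        ("travel", "tickets are booked"),
    ],
)

# Ascending order on rank puts the best match first.
rows = conn.execute(
    "SELECT subject, rank FROM email WHERE email MATCH 'fts5' ORDER BY rank"
).fetchall()
subjects = [subject for subject, _ in rows]
print(subjects)
```

The short row that mentions the term three times should come out ahead of the long row that mentions it once.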

Here is an implementation of Okapi BM25. Using this in combination with the suggestions at SQLite.org will help you generate a relevance-ranked MATCH query. It is written entirely in VB.NET, and the constructor expects the blob returned by matchinfo(table, 'pcnalsx'). The custom SQLiteFunction at the end can be called from SQL without issue, as long as the query is executed through System.Data.SQLite.
Public Class MatchInfo
Property matchablePhrases As Integer
Property userDefinedColumns As Integer
Property totalDocuments As Integer
Private _int32HitData As List(Of Integer)
Private _longestSubsequencePhraseMatches As New List(Of Integer)
Private _tokensInDocument As New List(Of Integer)
Private _averageTokensInDocument As New List(Of Integer)
Private _max_hits_this_row As Integer?
Public ReadOnly Property max_hits_this_row As Integer
Get
If _max_hits_this_row Is Nothing Then
_max_hits_this_row = 0
For p = 0 To matchablePhrases - 1
For c = 0 To userDefinedColumns - 1
Dim myHitsThisRow As Integer = hits_this_row(p, c)
If myHitsThisRow > _max_hits_this_row Then
_max_hits_this_row = myHitsThisRow
End If
Next
Next
End If
Return _max_hits_this_row
End Get
End Property
Private _max_hits_all_rows As Integer?
Public ReadOnly Property max_hits_all_rows As Integer
Get
If _max_hits_all_rows Is Nothing Then
_max_hits_all_rows = 0
For p = 0 To matchablePhrases - 1
For c = 0 To userDefinedColumns - 1
Dim myHitsAllRows As Integer = hits_all_rows(p, c)
If myHitsAllRows > _max_hits_all_rows Then
_max_hits_all_rows = myHitsAllRows
End If
Next
Next
End If
Return _max_hits_all_rows
End Get
End Property
Private _max_docs_with_hits As Integer?
Public ReadOnly Property max_docs_with_hits As Integer
Get
If _max_docs_with_hits Is Nothing Then
_max_docs_with_hits = 0
For p = 0 To matchablePhrases - 1
For c = 0 To userDefinedColumns - 1
Dim myDocsWithHits As Integer = docs_with_hits(p, c)
If myDocsWithHits > _max_docs_with_hits Then
_max_docs_with_hits = myDocsWithHits
End If
Next
Next
End If
Return _max_docs_with_hits
End Get
End Property
Private _BM25Rank As Double?
Public ReadOnly Property BM25Rank As Double
Get
If _BM25Rank Is Nothing Then
_BM25Rank = 0
'calculate BM25 Rank
'http://en.wikipedia.org/wiki/Okapi_BM25
'k1 calibrates the document term frequency scaling. Having k1 as 0 corresponds to a binary model (no term frequency). Increasing k1 lets repeated occurrences of a term keep raising the score for longer before it saturates.
'b, calibrates the scaling by document length, and can take values from 0 to 1, where having 0 means no length normalization and having 1 corresponds to fully scaling the term weight by the document length.
Dim k1 As Double = 1.2
Dim b As Double = 0.75
For column = 0 To userDefinedColumns - 1
For phrase = 0 To matchablePhrases - 1
'IDF is based on the number of documents that contain the phrase, not on the total hit count
Dim IDF As Double = Math.Log((totalDocuments - docs_with_hits(phrase, column) + 0.5) / (docs_with_hits(phrase, column) + 0.5))
Dim score As Double = (IDF * ((hits_this_row(phrase, column) * (k1 + 1)) / (hits_this_row(phrase, column) + k1 * (1 - b + b * _tokensInDocument(column) / _averageTokensInDocument(column)))))
If score < 0 Then
score = 0
End If
_BM25Rank += score
Next
Next
End If
Return _BM25Rank
End Get
End Property
Public Sub New(raw_pcnalsx_MatchInfo As Byte())
Dim int32_pcsx_MatchInfo As New List(Of Integer)
For i = 0 To raw_pcnalsx_MatchInfo.Length - 1 Step 4
int32_pcsx_MatchInfo.Add(BitConverter.ToInt32(raw_pcnalsx_MatchInfo, i))
Next
'take the raw data and parse it out
Me.matchablePhrases = int32_pcsx_MatchInfo(0)
int32_pcsx_MatchInfo.RemoveAt(0)
Me.userDefinedColumns = int32_pcsx_MatchInfo(0)
int32_pcsx_MatchInfo.RemoveAt(0)
Me.totalDocuments = int32_pcsx_MatchInfo(0)
int32_pcsx_MatchInfo.RemoveAt(0)
'remember that the columns are 0-based
For i = 0 To userDefinedColumns - 1
_averageTokensInDocument.Add(int32_pcsx_MatchInfo(0))
int32_pcsx_MatchInfo.RemoveAt(0)
Next
For i = 0 To userDefinedColumns - 1
_tokensInDocument.Add(int32_pcsx_MatchInfo(0))
int32_pcsx_MatchInfo.RemoveAt(0)
Next
For i = 0 To userDefinedColumns - 1
_longestSubsequencePhraseMatches.Add(int32_pcsx_MatchInfo(0))
int32_pcsx_MatchInfo.RemoveAt(0)
Next
_int32HitData = New List(Of Integer)(int32_pcsx_MatchInfo)
End Sub
Public Function hits_this_row(phrase As Integer, column As Integer) As Integer
Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 0)
End Function
Public Function hits_all_rows(phrase As Integer, column As Integer) As Integer
Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 1)
End Function
Public Function docs_with_hits(phrase As Integer, column As Integer) As Integer
Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 2)
End Function
End Class
<SQLiteFunction("Rank", 1, FunctionType.Scalar)>
Public Class Rank
Inherits SQLiteFunction
Public Overrides Function Invoke(args() As Object) As Object
Return New MatchInfo(DirectCast(args(0), Byte())).BM25Rank
End Function
End Class

Related

Optimal means of obtaining cell address column letter from column index and column index from column letter

Typically the accepted approach is to do the following
Number to Letter
public function numberToLetter(ByVal i as long) as string
Dim s as string: s = cells(1,i).address(false,false)
numberToLetter = left(s,len(s)-1)
end function
Letter to Number
Public Function letterToNumber(ByVal s As String) As Long
letterToNumber = Range(s & 1).Column
End Function
However, neither of these is particularly optimal, as in each case we are creating an object and then calling a property accessor on it. Is there a faster approach?
Summary
The core thing to realise is that the lettering system used in Excel is also known as Base26. NumberToLetter is encoding to Base26 from decimal, and LetterToNumber is decoding from Base26 to decimal.
Base conversion can be done with simple loops:
Function base26Encode(ByVal iDecimal As Long) As String
If iDecimal <= 0 Then Call Err.Raise(5, "base26Encode", "Argument must be greater than 0")
If iDecimal > 16384 Then Call Err.Raise(5, "base26Encode", "There are only 16384 columns in a spreadsheet, thus this function is limited to this number.")
Dim s As String: s = ""
Do
Dim v As Long
v = (iDecimal - 1) Mod 26 + 1
iDecimal = (iDecimal - v) / 26
s = Chr(v + 64) & s
Loop Until iDecimal = 0
base26Encode = s
End Function
Function base26Decode(ByVal sBase26 As String) As Long
sBase26 = UCase(sBase26)
Dim sum As Long: sum = 0
Dim iRefLen As Long: iRefLen = Len(sBase26)
For i = iRefLen To 1 Step -1
sum = sum + (Asc((Mid(sBase26, i))) - 64) * 26 ^ (iRefLen - i)
Next
base26Decode = sum
End Function
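For readers who want to check the arithmetic outside Excel, here is the same conversion as a short Python sketch. The names base26_encode and base26_decode are mine. Note that the scheme has no zero digit (A = 1 through Z = 26), which is why the encoder subtracts 1 before each division.

```python
def base26_encode(n):
    """Convert a 1-based column index to its letters, e.g. 27 -> 'AA'."""
    if not 1 <= n <= 16384:
        raise ValueError("column index must be between 1 and 16384")
    letters = ""
    while n > 0:
        # Shift to 0-based before dividing because there is no zero digit.
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def base26_decode(s):
    """Convert column letters back to the 1-based column index."""
    total = 0
    for ch in s.upper():
        total = total * 26 + (ord(ch) - ord("A") + 1)
    return total


print(base26_encode(16384), base26_decode("XFD"))
```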
Performance
I tested the performance of these functions against the original functions. To do this I used the stdPerformance class of stdVBA.
The code used for testing is as follows:
Sub testPerf()
Dim cMax As Long: cMax = 16384
With stdPerformance.Measure("Encode Original")
For i = 1 To cMax
Call numberToLetter(i)
Next
End With
With stdPerformance.Measure("Encode Optimal")
For i = 1 To cMax
Call base26Encode(i)
Next
End With
With stdPerformance.Measure("Decode Original")
For i = 1 To cMax
Call letterToNumber(base26Encode(i))
Next
End With
With stdPerformance.Measure("Decode Optimal")
For i = 1 To cMax
Call base26Decode(base26Encode(i))
Next
End With
End Sub
The results for which are as follows:
Encode Original: 78 ms
Encode Optimal: 31 ms
Decode Original: 172 ms
Decode Optimal: 63 ms
As shown this is a slightly faster approach (2-3x faster). I am fairly surprised that object creation and property access performed so well however.

VBA Excel define Collection to be used in Function

In Python I would do the following
jkk = {'Cadmium': 0.5, "Bly": 40}
def JKKCadmium(result):
    return result / jkk["Cadmium"] - 1
def JKKBly(result):
    return result / jkk["Bly"] - 1
I tried to do something similar in VBA
Option Explicit
Private JKK As New Collection
JKK.Add 0.5, "Cadmium"
JKK.Add 40, "Bly"
Function JKKCadmium(result As Double) As Double
JKKCadmium = result / JKK("Cadmium") - 1
End Function
Function JKKBly(result As Double) As Double
JKKBly = result / JKK("Bly") - 1
End Function
But that throws a compile error: "Invalid outside procedure".
How can I make the VBA code work?
EDIT:
Using dictionary instead still throws same error
Option Explicit
Private JKK As New Scripting.Dictionary
JKK.Add "Cadmium", 0.5
JKK.Add "Bly", 40
Function JKKCadmium(result As Double) As Double
JKKCadmium = result / JKK("Cadmium") - 1
End Function
Function JKKBly(result As Double) As Double
JKKBly = result / JKK("Bly") - 1
End Function
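Executable statements such as JKK.Add must be inside a procedure; only declarations are allowed at module level, which is why you get "Invalid outside procedure". It is probably better to use a class (and populate the dictionaries in the class initializer). However, you can also declare the dictionary as a public module-level variable and populate it in an init procedure, which you must call before using the functions: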
Option Explicit
Public jkk As Scripting.Dictionary
Public Sub init()
Set jkk = New Scripting.Dictionary
jkk.Add "Cadmium", 0.5
jkk.Add "Bly", 40
End Sub
Public Sub test()
init
Debug.Print JKKCadmium(2#)
End Sub
Public Function JKKCadmium(ByVal result As Double) As Double
JKKCadmium = result / jkk("Cadmium") - 1
End Function

Fastest way to conditionally strip off the right part of a string

I need to remove the numeric part at the end of a string. Here are some examples:
"abcd1234" -> "abcd"
"a3bc45" -> "a3bc"
"kj3ih5" -> "kj3ih"
You get the idea.
I implemented a function which works well for this purpose.
Function VarStamm(name As String) As String
Dim i, a As Integer
a = 0
For i = Len(name) To 1 Step -1
If IsNumeric(Mid(name, i, 1)) = False Then
i = i + 1
Exit For
End If
Next i
If i <= Len(name) Then
VarStamm = name.Substring(0, i - 1)
Else
VarStamm = name
End If
End Function
The question is: is there any faster (more efficient in speed) way to do this? The problem is, I call this function within a loop with 3 million iterations and it would be nice to have it be more efficient.
I know about the String.LastIndexOf method, but I don't know how to use it when I need the index of the last connected number within a string.
You can use Array.FindLastIndex and then Substring:
Dim lastNonDigitIndex = Array.FindLastIndex(text.ToCharArray(), Function(c) Not char.IsDigit(c))
If lastNonDigitIndex >= 0
lastNonDigitIndex += 1
Dim part1 = text.Substring(0, lastNonDigitIndex)
Dim part2 = text.Substring(lastNonDigitIndex)
End If
I was skeptical that the Array.FindLastIndex method was actually faster, so I tested it myself. I borrowed the testing code posted by Amessihel, but added a third method:
Function VarStamm3(name As String) As String
Dim i As Integer
For i = name.Length - 1 To 0 Step -1
If Not Char.IsDigit(name(i)) Then
Exit For
End If
Next i
Return name.Substring(0, i + 1)
End Function
It uses your original algorithm, but swaps out the old VB6-style string functions for their newer .NET equivalents. Here are the results on my machine:
RunTime :
- VarStamm : 00:00:07.92
- VarStamm2 : 00:00:00.60
- VarStamm3 : 00:00:00.23
As you can see, your original algorithm was already quite well tuned. The problem wasn't the loop; it was Mid, IsNumeric, and Len. Since Tim's method didn't use those, it was much faster. But if you stick with a manual For loop, it's more than twice as fast as using Array.FindLastIndex, all else being equal.
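The backward scan itself is language-independent. As a rough illustration only (not a benchmark, and not part of the timings above), the same loop in Python might look like this; the function name is mine.

```python
def strip_trailing_digits(name):
    """Return name without the run of digits at its end."""
    i = len(name) - 1
    # Walk backwards until the first non-digit character (or the start).
    while i >= 0 and name[i].isdigit():
        i -= 1
    return name[: i + 1]


for sample in ("abcd1234", "a3bc45", "kj3ih5"):
    print(sample, "->", strip_trailing_digits(sample))
```

Like VarStamm3, this returns an empty string when the input consists only of digits. For ASCII digits only, name.rstrip("0123456789") gives the same result in a single call.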
Given your function VarStamm and Tim Schmelter's, named VarStamm2, here is a small performance test I wrote. I used an arbitrary long String with a long numeric right part, and ran each function one million times.
Module StackOverlow
Sub Main()
Dim testStr = "azekzoerjezoriezltjreoitueriou7657678678797897898997897978897898797989797"
Console.WriteLine("RunTime :" + vbNewLine +
" - VarStamm : " + getTimeSpent(AddressOf VarStamm, testStr) + vbNewLine +
" - VarStamm2 : " + getTimeSpent(AddressOf VarStamm2, testStr))
End Sub
Function getTimeSpent(f As Action(Of String), str As String) As String
Dim sw As Stopwatch = New Stopwatch()
Dim ts As TimeSpan
sw.Start()
For i = 1 To 1000000
f(str)
Next
sw.Stop()
ts = sw.Elapsed
Return String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
ts.Hours, ts.Minutes, ts.Seconds,
ts.Milliseconds / 10)
End Function
Function VarStamm(name As String) As String
Dim i, a As Integer
a = 0
For i = Len(name) To 1 Step -1
If IsNumeric(Mid(name, i, 1)) = False Then
i = i + 1
Exit For
End If
Next i
If i <= Len(name) Then
VarStamm = name.Substring(0, i - 1)
Else
VarStamm = name
End If
End Function
Function VarStamm2(name As String) As String
Dim lastNonDigitIndex = Array.FindLastIndex(name.ToCharArray(), Function(c) Not Char.IsDigit(c))
If lastNonDigitIndex >= 0 Then
lastNonDigitIndex += 1
Return name.Substring(0, lastNonDigitIndex)
End If
Return name
End Function
End Module
Here is the output I got:
RunTime :
- VarStamm : 00:00:38.33
- VarStamm2 : 00:00:02.72
So yes, you should choose his answer: his code is both pretty and efficient.

Truncating 2 Strings and combined into smaller String [closed]

So, I have two strings that are each a max length of 100.
Dim a as String ''has a max length of 100
Dim b as String ''has a max length of 100
These two strings need to be truncated and combined into a new string.
Dim c as String 'has a max length of 100
I need to be able to truncate each string appropriately so that string c ends up as close to 100 characters as possible. I was going to write a bunch of conditions in steps of 25 to truncate each one.
if a.length = 100 and b.length =0 then
return a
else if a.length = 100 andalso b.length <= 25 then
return a.truncate(75) & b
else if a.length = 100 andalso b.length <= 50 then
return a.truncate(50) & b
else if....
and so on to hit all the scenarios...
I feel like there is a better and more efficient way to do this, so that I don't hit scenarios like a.length = 100 and b.length = 51, where I would be truncating more characters than needed.
Any suggestions? Please critique me as needed.
EDIT: This is VB.NET, not C# (I'm between projects). Sorry!
The reason I do not want to just add them together and truncate the result is that if both strings are 100 in length, the second string will be cut off completely. If they are both 100, then I would want to truncate string a to 50 and string b to 50, so that when they are combined they are 100 in total. In other words, I need some text from both strings.
If the total length of the strings is greater than the limit then you could take a fraction of each in proportion to their lengths:
Module Module1
Function CombineWithLengthConstraint(a As String, b As String, totalLength As Integer) As String
' trivial case 1:
If totalLength < 1 Then
Return String.Empty
End If
Dim aLen = Len(a)
Dim bLen = Len(b)
' trivial case 2:
If aLen + bLen <= totalLength Then
Return a & b
End If
' impossible-to-satisfy-equably case:
If totalLength = 1 Then
If aLen > 0 Then
Return a.Substring(0, 1)
ElseIf bLen > 0 Then
Return b.Substring(0, 1)
Else
Return String.Empty
End If
End If
' apportion the lengths of the strings to be taken in the ratio of their lengths:
Dim aFrac = CInt(Math.Round(aLen / (aLen + bLen) * totalLength, MidpointRounding.AwayFromZero))
' give b the remainder so that the two parts always add up to totalLength:
Dim bFrac = totalLength - aFrac
' ensure there is at least one character from each non-empty string...
If aFrac = 0 AndAlso aLen > 0 Then
aFrac = 1
bFrac -= 1
End If
If bFrac = 0 AndAlso bLen > 0 Then
bFrac = 1
aFrac -= 1
End If
Dim aPart = a.Substring(0, aFrac)
Dim bPart = b.Substring(0, bFrac)
Return aPart & bPart
End Function
Sub Main()
Dim a = New String("A"c, 10)
Dim b = New String("b"c, 40)
Dim c = CombineWithLengthConstraint(a, b, 10)
Console.WriteLine(c)
Console.WriteLine(Len(c))
Console.ReadLine()
End Sub
End Module
Outputs:
AAbbbbbbbb
10
As you can see, the first string, which was 1/5 of the total number of characters, ended up contributing 1/5 of the result.
The VB.NET Len function gives 0 if its argument is Nothing.
I tested it as working with all lengths from 0 to 100 of both strings being combined into one string of length 100 just in case I had made a mistake with the rounding or anything.
Of course, you could return, say, the ending part of string b instead of the starting part if that made sense in the particular application.
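As a cross-check of the proportional idea, here is a compact Python sketch. The name combine_with_limit is mine. It differs slightly from the VB.NET code above in that it rounds with integer arithmetic and always gives the second string whatever the first one leaves over, so the two parts cannot add up to more than the limit.

```python
def combine_with_limit(a, b, limit):
    """Join a and b, trimming each in proportion to its length."""
    if limit < 1:
        return ""
    total = len(a) + len(b)
    if total <= limit:
        return a + b
    # Round len(a) * limit / total half up, using integer arithmetic only.
    a_take = (2 * len(a) * limit + total) // (2 * total)
    b_take = limit - a_take  # b gets the remainder, so the parts sum to limit
    if limit >= 2:
        # Keep at least one character from each non-empty string.
        if a_take == 0 and a:
            a_take, b_take = 1, limit - 1
        elif b_take == 0 and b:
            a_take, b_take = limit - 1, 1
    return a[:a_take] + b[:b_take]


print(combine_with_limit("A" * 10, "b" * 40, 10))
```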
Although not exactly what you asked for, here's another option...
Public Function WeirdConcatinate(a As String, b As String) As String
Dim totalLen = a.Length + b.Length
If totalLen > 100 Then
Dim aLen = 100 * a.Length \ totalLen
Dim bLen = 100 - aLen
Return a.Substring(0, aLen) & b.Substring(0, bLen)
Else
Return a & b
End If
End Function
This will give you a number of characters from each string (approximately) proportional to how long they are compared to each other. If both strings are the same length, you'll get 50 from each. If a.Length = 100 and b.Length = 50, you'll end up with 66 from a and 34 from b.
Truncate them after concatenating them, then:
Dim c = a & b
If c.Length > 100 Then c = c.Remove(100)
If you want to preserve as much as possible of the start of each string:
Dim c = ""
If a.Length > 50 AndAlso b.Length <= 50 Then
    ' b fits entirely; a gets whatever room is left
    c = a.Substring(0, Math.Min(a.Length, 100 - b.Length)) & b
ElseIf a.Length > 50 AndAlso b.Length > 50 Then
    c = a.Substring(0, 50) & b.Substring(0, 50)
Else
    c = a & b
End If
If c.Length > 100 Then c = c.Substring(0, 100)
As with some other answers, the algorithm is open to interpretation. My method takes from each string until 100 total characters are taken or the string runs out of characters.
Private Function concat(a As String, b As String, length As Integer) As String
Dim ca As New System.Text.StringBuilder()
Dim cb As New System.Text.StringBuilder()
For i As Integer = 0 To length - 1
ca.Append(If(i >= a.Length, "", a(i)))
cb.Append(If(i >= b.Length, "", b(i)))
If ca.Length + cb.Length >= length Then Exit For
Next
Return (ca.ToString() & cb.ToString() & New String(" "c, length)).Substring(0, length)
End Function
Sub Main()
Dim a As String = New String("a"c, 0)
Dim b As String = New String("b"c, 5)
Dim c As String = concat(a, b, 100)
Console.WriteLine($"'{c}'")
End Sub
'bbbbb '
(padded to 100 characters, doesn't render in block quote)
Dim a As String = New String("a"c, 30)
Dim b As String = New String("b"c, 90)
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'
Dim a As String = New String("a"c, 72)
Dim b As String = New String("b"c, 64)
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'
(your example in a comment. 72 >> 50, 64 >> 50)

VBA : For loop exiting without returning the value

I have the following piece of code to simulate stock prices using a stochastic process
Function varswap1(s0, r0, sigma0, t) As Double
Rnd (-10)
Randomize (999)
Dim i As Integer, j As Integer, r As Double
Dim stock() As Double, dt As Double
Dim per As Integer
per = WorksheetFunction.Round(t * 252, 0)
ReDim stock(per)
stock(1) = s0
dt = 1 / 252
For i = 1 To per
stock(i + 1) = stock(i) * Exp((r0 - 0.5 * sigma0 ^ 2) * dt + sigma0 * Sqr(dt) * WorksheetFunction.NormSInv(Rnd()))
Next
varswap1 = WorksheetFunction.Average(stock)
End Function
In this code, I ran debugging by placing a break point at Next and the entire For loop is working absolutely fine. The problem is after completing the loop the function exits and #VALUE! error is displayed in the cell.
I am not able to figure out what is wrong with this code.
Will be thankful if anyone can help me with it.
Try this:
Const n As Integer = 252
Function varswap1(s0, r0, sigma0, t) As Double
    Rnd (-10)
    Randomize (999)
    Dim i As Integer, j As Integer, r As Double
    Dim stock() As Double, dt As Double
    Dim per As Integer
    per = WorksheetFunction.Round(t * n, 0)
    ReDim stock(per)
    stock(0) = s0 ' First item in the array has index 0
    dt = 1# / n ' 1# makes the Double division explicit
    For i = 1 To per
        'Each stock depends on the previous stock value:
        stock(i) = stock(i - 1) * Exp((r0 - 0.5 * sigma0 ^ 2) * dt + sigma0 * Sqr(dt) * WorksheetFunction.NormSInv(Rnd()))
    Next
    varswap1 = WorksheetFunction.Average(stock)
End Function
I found one bug and made two small cleanups.
The bug: ReDim stock(per) creates indices 0..per (0..252 for t = 1), but the original loop assigns stock(1)..stock(per + 1). The last assignment is out of range, so the function aborts and the cell shows #VALUE!.
The cleanups: I wrote the division as 1# / n to make the Double arithmetic explicit (VBA's / operator already returns a floating-point result, so 1 / 252 was not actually being truncated to 0), and I moved the magic number 252 into a constant.
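The indexing is the part that is easy to get wrong, so here is the same recurrence as a Python sketch with the bounds spelled out. This is an illustration under assumptions: simulate_path and its parameters are my own names, and random.gauss stands in for NormSInv(Rnd()), so the random draws will not match Excel's.

```python
import math
import random


def simulate_path(s0, r0, sigma0, t, steps_per_year=252, seed=999):
    """Simulate one geometric Brownian motion path with per + 1 points."""
    rng = random.Random(seed)
    per = round(t * steps_per_year)
    dt = 1.0 / steps_per_year
    stock = [0.0] * (per + 1)  # valid indices are 0..per
    stock[0] = s0
    for i in range(1, per + 1):  # the last index written is per, not per + 1
        z = rng.gauss(0.0, 1.0)  # assumption: replaces NormSInv(Rnd())
        drift = (r0 - 0.5 * sigma0 ** 2) * dt
        stock[i] = stock[i - 1] * math.exp(drift + sigma0 * math.sqrt(dt) * z)
    return stock


path = simulate_path(100.0, 0.05, 0.2, 1.0)
print(len(path), sum(path) / len(path))
```

With sigma0 = 0 the path is deterministic, which makes a convenient sanity check: the last value should equal s0 * exp(r0 * t).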
