I've found Lucene to be fantastic so far, but I'm having a few issues with duplicating a LIKE equivalent search.
In an application I'm working on I need the option of a "simplified" (LIKE) search and an advanced (full-text) search. The data is user based (name, location, etc.), so not huge reams of text.
In the past I'd simply create a SQL query that concatenated the db field names and surrounded the terms with wildcards. I could do that in my application, bypassing Lucene for simple searches of the user data, but it would be nice to use Lucene.
I've tried regex searches
var query = QueryParser.Escape(_query);
var search = new RegexQuery(new Term("name", string.Concat(".*", query, ".*")));
but they only work on one column.
One idea I had was to tokenise each field to produce something similar to a full-text search, e.g.:
name: Paul
so I create the following name fields...
Paul
Pau
Pa
aul
ul
au
Would this defeat the point of using Lucene over a LIKE SQL search? Would it actually produce the results I want?
What would be the best way to solve this issue?
Edit:
Slightly modifying the code in this question:
Elegant way to split string into 2 strings on word boundaries to minimize length difference
to produce this tokeniser:
private IEnumerable<string> Tokeniser(string _item)
{
    string s = _item;
    const int maxPrefixLength = 10;
    const int maxSuffixLength = 10;
    const int minStemLength = 1;
    var tokens = new List<string>();
    for (int prefixLength = 0; (prefixLength + minStemLength <= s.Length) && (prefixLength <= maxPrefixLength); prefixLength++)
        for (int suffixLength = 0; (suffixLength + prefixLength + minStemLength <= s.Length) && (suffixLength <= maxSuffixLength); suffixLength++)
        {
            string prefix = s.Substring(0, prefixLength);
            string suffix = s.Substring(s.Length - suffixLength);
            string stem = s.Substring(prefixLength, s.Length - suffixLength - prefixLength);
            if (prefix.Length > 1 && !tokens.Contains(prefix))
                tokens.Add(prefix);
            if (suffix.Length > 1 && !tokens.Contains(suffix))
                tokens.Add(suffix);
            if (stem.Length > 1 && !tokens.Contains(stem))
                tokens.Add(stem);
        }
    return tokens;
}
The search results do give the equivalent of a LIKE search. My "user" table will only ever be 9000 entities in size, so for me at least this might fit my needs.
Are there any downsides to doing this (except for a much larger Lucene index)?
Character-based n-grams (NGramTokenizer, NGramTokenFilter, EdgeNGramTokenizer and EdgeNGramTokenFilter) should provide the functionality you need.
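For example, something along these lines. This is a sketch only: it assumes the Lucene.Net 4.8 analysis API (Analyzer.NewAnonymous, KeywordTokenizer, LowerCaseFilter and NGramTokenFilter in the namespaces shown); older 3.x releases ship the n-gram filters in contrib with slightly different signatures. The 2-10 gram range just mirrors the hand-rolled tokeniser above:

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Util;

// Sketch: an analyzer that indexes every 2-10 character fragment of a field value.
var version = LuceneVersion.LUCENE_48;

Analyzer nGramAnalyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
    // Treat the whole field value ("Paul", "London", ...) as one token,
    // lower-case it, then emit every 2..10 character substring as an index term.
    Tokenizer source = new KeywordTokenizer(reader);
    TokenStream stream = new LowerCaseFilter(version, source);
    stream = new NGramTokenFilter(version, stream, 2, 10);
    return new TokenStreamComponents(source, stream);
});

With the name/location fields indexed through an analyzer like this, a lower-cased TermQuery per field (OR-ed together in a BooleanQuery) behaves like LIKE '%fragment%' without the cost of wildcard or regex scans.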
Related
Let's say I have a dictionary whose keys are strings of 0s and 1s, with '*' as a wildcard.
For example, my dictionary is structured as such:
{'010*10000':'foo', '100*1*000':'bar'......}
Each dictionary key has a fixed string length; however, there are wildcards within the string, represented as '*' characters. Thus, lookups for '010110000' or '010010000' should both return 'foo'.
The problem lies in the size of my dictionary. The dictionary I am working with has over 500,000 entries, so when I try to iterate over each key in the dict to find whether a matching key exists, it takes far too long with O(n) complexity.
Ideally, I would like to find a way to just check if a value such as '010110000' is in the dictionary, similar to the .get() function for regular python dictionaries without wildcards.
I've already tried iterating over my dictionary with fnmatch, as in Wildcard in dictionary key:
for k in my_dict.keys():
    if fnmatch.fnmatch(string_of_1s_and_0s, k):
        # Do some operation here if we have found the matching key pair... and then break.
        print(my_dict[k])
        break
However, it's just too slow with O(n) complexity. Is there any way to implement get() but with wildcards?
dicts are hash code based; the hash code, if implemented correctly, will differ wildly for a difference of just one character. There is no way to make a dict do what you want, but what you're doing is probably best done with something other than a dict in the first place. Have you considered a relational database, where the LIKE operator could do something like this? It might still have to scan a large part of the DB, but ideally it could use anchors at one end or the other to at least narrow the search to matching prefixes/suffixes.
Rotate the original pattern left (by taking characters from the start and putting them at the end) while keeping track of the rotate count; like this:
'010*10000' -> '*10000010', rotate_count = 3
'100*1*000' -> '*1*000100', rotate_count = 3
Then split it into a "complex part" and a "simple part", and determine the length of the simple part, like this:
'010*10000' -> '*10000010', rotate_count = 3
complex = '*', simple = '10000010', simple_length = 8
'100*1*000' -> '*1*000100', rotate_count = 3
complex = '*1*', simple = '000100', simple_length = 6
If the fixed length of the strings is 16, then there will be 16 possible values of rotate_count, and for each one there will be 16 - rotate_count possible values of simple_length. This can be described as a nested loop:
for(rotate_count = 0; rotate_count < 16; rotate_count++) {
    for(simple_length = 0; simple_length <= 16 - rotate_count; simple_length++) {
    }
}
You can associate an "array of entries" with this, like:
entry_number = 0;
for(rotate_count = 0; rotate_count < 16; rotate_count++) {
    for(simple_length = 0; simple_length <= 16 - rotate_count; simple_length++) {
        entry_number++;
    }
}
Then you can use the entry number to find a hash table, like:
entry_number = 0;
for(rotate_count = 0; rotate_count < 16; rotate_count++) {
    for(simple_length = 0; simple_length <= 16 - rotate_count; simple_length++) {
        hash_table = array_of_hash_tables[entry_number];
        entry_number++;
    }
}
You can also rotate the string you're looking for by the rotate_count and extract simple_length characters from that, convert those characters into a hash, and use it to find a list of entries from the hash table, like:
entry_number = 0;
for(rotate_count = 0; rotate_count < 16; rotate_count++) {
    rotated_string = rotate_string(original_string, rotate_count);
    for(simple_length = 0; simple_length <= 16 - rotate_count; simple_length++) {
        hash_table = array_of_hash_tables[entry_number];
        if(hash_table != NULL) {
            hash = get_simple_hash(rotated_string, simple_length);
            list = hash_table[hash];
            // Use "list" and "original_string" to do the hard stuff here...
        }
        entry_number++;
    }
}
This will quickly eliminate lots of entries (where the start and end don't match) and give you a list of "potential matches" where you'd have to check the part containing wild cards against the original string to determine if there is/isn't an actual match.
Note that if the characters are "ones and zeros" this can be improved by converting "strings containing binary digits" into integers.
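To illustrate that last point with a sketch (in C#, though the idea is language-agnostic; the names here are made up for the example): each pattern compiles to a mask of its fixed bit positions plus the value of those bits, and a candidate string, parsed as an integer, matches when (candidate & mask) == value.

using System;

class WildcardBits
{
    // Compile a pattern of '0'/'1'/'*' into (mask, value):
    // mask has a 1 bit wherever the pattern is fixed; value holds those fixed bits.
    static (long mask, long value) Compile(string pattern)
    {
        long mask = 0, value = 0;
        foreach (char c in pattern)
        {
            mask <<= 1;
            value <<= 1;
            if (c != '*')
            {
                mask |= 1;
                value |= c - '0';
            }
        }
        return (mask, value);
    }

    // A candidate matches when all of the pattern's fixed bits agree.
    static bool Matches(long candidate, (long mask, long value) p)
        => (candidate & p.mask) == p.value;

    static void Main()
    {
        var p = Compile("010*10000");
        long candidate = Convert.ToInt64("010110000", 2);
        Console.WriteLine(Matches(candidate, p));   // True
    }
}

The rotation/hash bucketing described above would then decide which compiled patterns are worth running this final check against.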
I have a string.
string str = "TTFTTFFTTTTF";
How can I break this string into groups of three characters separated by ","?
The result should be: TTF,TTF,FTT,TTF
You could use String.Join after you've grouped the characters into chunks of three:
var groups = str.Select((c, ix) => new { Char = c, Index = ix })
                .GroupBy(x => x.Index / 3)
                .Select(g => String.Concat(g.Select(x => x.Char)));
string result = string.Join(",", groups);
Since you're new to programming: this is a LINQ query, so you need to add using System.Linq; to the top of your code file.
The Select extension method creates an anonymous type containing the char and the index of each char.
GroupBy groups them by the result of index / 3 which is an integer division that truncates decimal places. That's why you create groups of three.
String.Concat creates a string from the 3 characters.
String.Join concatenates them and inserts a comma delimiter between each.
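Putting those steps together, a quick check with the string from the question (a minimal console program, assuming nothing beyond the snippet above):

using System;
using System.Linq;

class Program
{
    static void Main()
    {
        string str = "TTFTTFFTTTTF";

        // Pair each char with its index, bucket indexes 0-2, 3-5, 6-8, 9-11
        // via integer division, rebuild each bucket as a string, then join.
        var groups = str.Select((c, ix) => new { Char = c, Index = ix })
                        .GroupBy(x => x.Index / 3)
                        .Select(g => string.Concat(g.Select(x => x.Char)));

        Console.WriteLine(string.Join(",", groups));   // TTF,TTF,FTT,TTF
    }
}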
Here is a really simple solution using StringBuilder
var stringBuilder = new StringBuilder();
for (int i = 0; i < str.Length; i += 3)
{
    stringBuilder.AppendFormat("{0},", str.Substring(i, 3));
}
stringBuilder.Length -= 1;
str = stringBuilder.ToString();
I'm not sure if the following is better.
stringBuilder.Append(str.Substring(i, 3)).Append(',');
I would suggest avoiding LINQ in this case, as it performs a lot more operations and this is a fairly simple task.
You can use Insert.
Insert places one string into another, forming a new string. You can use the string Insert method to place one string in the middle of another, or at any other position.
Tip 1:
We can insert one string at any index into another. IndexOf can return a suitable index.
Tip 2:
Insert can be used to concatenate strings, but this is less efficient; Concat, as with +, is faster.
for (int i = 3; i <= str.Length - 1; i += 4)
{
    // step by 4 because each inserted "," shifts the next group one position to the right
    str = str.Insert(i, ",");
}
I'm making an AIR dictionary and I have a(nother) problem. The main app is ready to go and works perfectly, but when I tested it I noticed that it could be better. A bit of context: the language I'm translating from (ancient Egyptian) does not use punctuation, so a phrase canlooklikethis. Add to that the sheer complexity of the glyph system (6000+ glyphs).
Right now my app works like this:
the user chooses the glyphs composing his/her word.
the app transforms those glyphs into alphanumerical values (A1 - D36 - X1A, etc).
the code compares the resulting code (say: A5AD36) to a list of XML values.
if the word is found (A5AD36 = priestess of Bast), the user gets the translation. If not, s/he gets all the possible words corresponding to the two glyphs (A5A & D36).
If the user knows the string is a word, no problem. But if s/he enters a few words, s/he'll have a few more choices than hoped (example: query = A1A5AD36 gets A1 - A5A - D36 - A5AD36).
What I would like to do is this:
query = A1A5AD36 //word/phrase to be translated;
varArray = [A1, A5A, D36] //variables containing the value of the glyphs.
Corresponding possible words from the xml : A1, A5A, D36, A5AD36.
Possible phrases: A1 A5A D36 / A1 A5AD36 / A1A5A D36 / A1A5AD36.
Possible phrases with only legal words: A1 A5A D36 / A1 A5AD36.
I'm not sure I'm being really clear, but to keep things simple: I'd like to get all the possible phrases containing only legal words and filter out the other ones.
(Example with English: TOBREAKFAST. Legal = "to break fast" / "to breakfast". Illegal = "tobreak fast".)
I've managed to get all the possible words, but not the rest. Right now, when I run my app, I have an array containing A1 - A5A - D36 - A5AD36. But I'm stuck going forward.
Does anyone have an idea? Thank you :)
function fnSearch(e: Event): void {
    var val: int = sp.length; // sp is an array filled with variables containing the code for each used glyph.
    for (var i: int = 0; i < val; i++) { // repeat for every glyph used.
        var X: String = ""; // variable created to compare with the XML dictionary.
        for (var i2: int = 0; i2 < val - i; i2++) { // if it's the first time, use the first glyph-code, else append the one after the last used.
            if (X == "") {
                X = sp[i];
            } else {
                X = X + sp[i2 + i];
            }
            xmlresult = myXML.mot.cd; // xmlresult = alphanumerical codes corresponding to words from the XMLList already imported.
            trad = myXML.mot.td; // same with translations.
            for (var i3: int = 0; i3 < xmlresult.length(); i3++) { // check if element X is in the dictionary.
                var codeElement: XML = xmlresult[i3]; // variable to compare with X.
                var tradElement: XML = trad[i3]; // variable corresponding to codeElement.
                if (X == codeElement.toString()) { // if codeElement[i3] is legal, add it to the array of legal words.
                    checkArray.push(codeElement); // checkArray is an array filled with legal words.
                }
            }
        }
    }
    var iT2: int = 500; // iT2 set to an unreachable value for the next lines.
    for (var iT: int = 0; iT < checkArray.length; iT++) { // check if the word searched by the user is in the results.
        if (checkArray[iT] == query) {
            iT2 = iT;
        }
    }
    if (iT2 != 500) { // if the complete query is found, put it on top of the array so it appears on top of the results.
        var oldFirst: String = checkArray[0];
        checkArray[0] = checkArray[iT2];
        checkArray[iT2] = oldFirst;
    }
    results.visible = true; // make the result list visible
    loadingResults.visible = false; // loading screen
    fnPossibleResults(null); // update the result list
}
I end up with an array of variables containing the glyph codes (sp) and another with all the possible legal words (checkArray). What I don't know how to do is combine those two to build the legal phrases.
If there were only three glyphs, I could probably find a way, but a user can enter up to 60 glyphs.
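For what it's worth, "all the phrases made only of legal words" is the classic word-break recursion: at each position, try every legal word that starts there and recurse on the rest. Below is a minimal sketch (written in C# for brevity; the recursion itself ports to ActionScript, and Phrases, glyphCodes and legalWords are hypothetical names standing in for sp and the dictionary codes):

using System;
using System.Collections.Generic;
using System.Linq;

class Segmenter
{
    // All ways of splitting the glyph sequence (from 'start') into legal words.
    static IEnumerable<List<string>> Phrases(string[] glyphCodes, ISet<string> legalWords, int start = 0)
    {
        if (start == glyphCodes.Length)
        {
            yield return new List<string>();   // reached the end: one empty phrase
            yield break;
        }
        for (int len = 1; start + len <= glyphCodes.Length; len++)
        {
            // Candidate word = next 'len' glyph codes joined together, e.g. "A5A" + "D36".
            string word = string.Concat(glyphCodes.Skip(start).Take(len));
            if (!legalWords.Contains(word)) continue;
            foreach (var rest in Phrases(glyphCodes, legalWords, start + len))
            {
                var phrase = new List<string> { word };
                phrase.AddRange(rest);
                yield return phrase;
            }
        }
    }

    static void Main()
    {
        var glyphs = new[] { "A1", "A5A", "D36" };
        var legal = new HashSet<string> { "A1", "A5A", "D36", "A5AD36" };
        foreach (var phrase in Phrases(glyphs, legal))
            Console.WriteLine(string.Join(" ", phrase));
        // Prints "A1 A5A D36" and "A1 A5AD36" -- the two legal phrases from the example above.
    }
}

With up to 60 glyphs the number of segmentations can explode, so in practice you would memoize on the start index (or stop after the first few phrases) rather than enumerate everything.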
I have strings in the format name:key:dataLength:data, and these strings can often be chained together. For example, "aNum:n:4:9879aBool:b:1:taString:s:2:Hi" would map to an object something like:
{
aNum: 9879,
aBool: true,
aString: "Hi"
}
I have a method for parsing a string in this format, but I'm not sure whether its use of substring is the most efficient way of processing the string. Is there a more efficient way of processing strings in this fashion (repeatedly chopping off the front section)?
Map<String, dynamic> fromString(String s) {
  Map<String, dynamic> _internal = new Map();
  int start = 0;
  while (start < s.length) {
    int end;
    List<String> parts = new List<String>(4); // 0 is name, 1 is key, 2 is data length, 3 is data
    for (var i = 0; i < 4; i++) {
      end = i < 3 ? s.indexOf(':', start) : start + int.parse(parts[2]);
      parts[i] = s.substring(start, end);
      start = i < 3 ? end + 1 : end;
    }
    var tranType = _tranTypesByKey[parts[1]]; // a map to an object which has a function that can convert the data section of the string into an object
    _internal[parts[0]] = tranType._fromStr(parts[3]);
  }
  return _internal;
}
I would try s.split(':') and process the resulting list.
If you do a lot of such operations you should consider creating benchmark tests, trying different techniques and comparing them.
If you still need this line
s = i < 3 ? s.substring(idx + 1) : s.substring(idx);
I would avoid creating a new substring in each iteration and instead just keep track of the next position.
You have to decide how important performance is relative to readability and maintainability of the code.
That said, you should not be cutting off the head of the string repeatedly. That is guaranteed to be inefficient - it'll take time that is quadratic in the number of records in your string, just creating those tail strings.
For parsing each field, you can avoid doing substrings on the length and type fields. For the length field, you can build the number yourself:
const int CHAR_COLON = 0x3A; // ':'

int index = ...;
// index points to first digit of length.
int length = 0;
int charCode = source.codeUnitAt(index++);
while (charCode != CHAR_COLON) {
  length = 10 * length + charCode - 0x30; // 0x30 is '0'
  charCode = source.codeUnitAt(index++);
}
// index points to the first character of content.
Since lengths are usually small integers (less than 2<<31), this is likely to be more efficient than creating a substring and calling int.parse.
The type field is a single ASCII character, so you could use codeUnitAt to get its ASCII value instead of creating a single-character string (and then your content interpretation lookup will need to switch on character code instead of character string).
For parsing content, you could pass the source string, start index and length instead of creating a substring. Then the boolean parser can also just read the code unit instead of the singleton character string, the string parser can just make the substring, and the number parser will likely have to make a substring too and call double.parse.
It would be convenient if Dart had a double.parseSubstring(source, [int from = 0, int to]) that could parse a substring as a double without creating the substring.
What is the most efficient way to find the common prefix of many strings?
For example:
For this set of strings
/home/texai/www/app/application/cron/logCron.log
/home/texai/www/app/application/jobs/logCron.log
/home/texai/www/app/var/log/application.log
/home/texai/www/app/public/imagick.log
/home/texai/www/app/public/status.log
I want to get /home/texai/www/app/
I want to avoid char-by-char comparisons.
You cannot avoid going through at least the common parts to find the common prefix.
I don't think this needs any fancy algorithm. Just keep track of the current common prefix, then shorten the prefix by comparing the current prefix with the next string.
Since this is common prefix of all strings, you may end up with empty string (no common prefix).
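A minimal sketch of that approach (snippet only, assuming the usual System.Collections.Generic using; CommonPrefix is just an illustrative name): start with the first string as the candidate prefix and shorten it against each subsequent string.

static string CommonPrefix(IList<string> strings)
{
    if (strings.Count == 0) return "";

    string prefix = strings[0]; // current candidate prefix
    for (int i = 1; i < strings.Count && prefix.Length > 0; i++)
    {
        int len = 0;
        // Keep only the leading part the candidate shares with the next string.
        while (len < prefix.Length && len < strings[i].Length && prefix[len] == strings[i][len])
            len++;
        prefix = prefix.Substring(0, len);
    }
    return prefix; // "" if the strings share no common prefix
}

For the paths in the question, CommonPrefix returns /home/texai/www/app/.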
I'm not sure what you mean by avoiding char-by-char comparison, but you at least need to read the common prefix from each of the strings, so the following algorithm is about the best you can do (just iterate over the strings until they deviate or until the current longest prefix length is reached):
List<string> list = new List<string>()
{
"/home/texai/www/app/application/cron/logCron.log",
"/home/texai/www/app/application/jobs/logCron.log",
"/home/texai/www/app/var/log/application.log",
"/home/texai/www/app/public/imagick.log",
"/home/texai/www/app/public/status.log"
};
int maxPrefix = list[0].Length;
for (int i = 1; i < list.Count; i++)
{
    int pos = 0;
    for (; pos < maxPrefix && pos < list[i].Length && list[0][pos] == list[i][pos]; pos++) ;
    maxPrefix = pos;
}
//this is the common prefix
string prefix = list[0].Substring(0, maxPrefix);