I am currently working on a people search tool that uses Solr for indexing and fuzzy search across multiple fields (with edismax), with various filters such as SynonymFilterFactory and WordDelimiterFilterFactory, and with TF-IDF disabled.
This works very well, except for a few cases where a search term is matched multiple times. For example, searching for "Martin XXXX" returns "Marvin Martin" as the highest result because it matches Martin against both "Marvin" and "Martin".
Matching a search term against multiple words in a document, in general, makes a lot of sense. However, in the case of people search, I'd like it to only add the maximum score for each search term (i.e., map each search term to only one word in the document (person's name / information)).
Is there a mechanism in SOLR/Lucene which would allow me to force a one-to-one mapping between search term and matched term?
You can see the issue below in the debug for the query:
0.27641854 = (MATCH) sum of:
  0.27641854 = (MATCH) sum of:
    0.15077375 = (MATCH) weight(FullName:martin in 118169) [NoTFIDFSimilarityClass], result of:
      0.15077375 = score(doc=118169,freq=1.0 = termFreq=1.0
      ), product of:
        0.15077375 = queryWeight, product of:
          1.0 = idf(docFreq=1619, maxDocs=328317)
          0.15077375 = queryNorm
        1.0 = fieldWeight in 118169, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = idf(docFreq=1619, maxDocs=328317)
          1.0 = fieldNorm(doc=118169)
    0.12564479 = (MATCH) weight(FullName:marvin^0.8333333 in 118169) [NoTFIDFSimilarityClass], result of:
      0.12564479 = score(doc=118169,freq=1.0 = termFreq=1.0
      ), product of:
        0.12564479 = queryWeight, product of:
          0.8333333 = boost
          1.0 = idf(docFreq=105, maxDocs=328317)
          0.15077375 = queryNorm
        1.0 = fieldWeight in 118169, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = idf(docFreq=105, maxDocs=328317)
          1.0 = fieldNorm(doc=118169)
An example query:
http://domain/solr/peoplefinder/select?q=Martin~&wt=json&indent=true&defType=edismax&qf=FullName&stopwords=true&lowercaseOperators=true&debug=true
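To make the difference concrete, here is a minimal sketch in plain Java that contrasts summing the scores of every fuzzy expansion of one search term (what the explain output shows) with keeping only the best one (what I am after). It is not a Solr or Lucene API and does not show how to configure Solr; the class and method names are made up for illustration, and the two scores are copied from the debug output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
class PerTermMaxScore {

    // Current behaviour: every fuzzy expansion that matches adds to the score.
    static double sumOfMatches(List<Double> expansionScores) {
        double total = 0.0;
        for (double score : expansionScores) {
            total += score;
        }
        return total;
    }

    // Desired behaviour: one search term contributes only its best match.
    // Assumes scores are non-negative.
    static double bestMatch(List<Double> expansionScores) {
        double best = 0.0;
        for (double score : expansionScores) {
            best = Math.max(best, score);
        }
        return best;
    }

    public static void main(String[] args) {
        // Scores of FullName:martin and FullName:marvin^0.8333333 from the debug output.
        List<Double> martinExpansions = Arrays.asList(0.15077375, 0.12564479);
        System.out.println("sum of matches: " + sumOfMatches(martinExpansions));
        System.out.println("best match:     " + bestMatch(martinExpansions));
    }
}
```

With the best-match rule, "Marvin Martin" would get no extra credit for also matching "Marvin".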
The Solr "qf" parameter works as follows:
Let's say I have: query = "sid" and qf = [field1, field1_edge, field2, field2_edge].
The Solr score is calculated as follows:
max(f1, f1_e, f2, f2_e) + tie * (sum of other 3 fields) where: "tie" lies in [0,1]
Let's call: winner1 = field with max(f1, f1_e) and
winner2 = field with max(f2, f2_e)
I would like to score a given query in Solr as follows:
score1 = winner1_score + tie_1 * loser1_score
score2 = winner2_score + tie_1 * loser2_score
final score = score1 + tie_2 * score2
Effectively, I want to apply qf in two layers (taking tie_1 = 0 and tie_2 = 1). What are my options for implementing this idea of relevance? I think neither the "qf" parameter nor function boosts support this.
Thanks!
It seems to me that the way to do it is to use the query function, which allows you to apply functions to queries.
You combine this with nested query parsers, which allow you to run multiple dismax queries.
You can do something like this (where you set tie1 and tie2 according to what you want):
q=_val_:"add(query($qq1),product(query($qq2),${tie2}))"
qq1={!edismax qf='field1 field1_edge' v='sid' tie=${tie1}}
qq2={!edismax qf='field2 field2_edge' v='sid' tie=${tie1}}
tie1=0.5
tie2=0.3
If you use Solr 7.2 (or higher), you also need to set uf=_query_ * in order for the _val_ hook to work.
P.S: it should be possible (though I haven't tested it) to move the content of q into the qf parameter and that way you don't have to use the _val_ hook:
qf=add(query($qq1),product(query($qq2),${tie2}))
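As a sanity check on the arithmetic, here is a small plain-Java sketch of what the combined expression add(query($qq1),product(query($qq2),${tie2})) is meant to compute, given the per-field scores. It only mirrors the formulas from the question; it does not call Solr, the class and method names are hypothetical, and it assumes non-negative field scores.

```java
// Hypothetical illustration of the two-layer scoring; not a Solr API.
class TwoLayerDismax {

    // dismax-style combination: best score plus tie times the sum of the rest.
    static double dismax(double tie, double... scores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    // Layer 1: dismax inside each field group with tie1.
    // Layer 2: score1 + tie2 * score2, as in the question.
    static double twoLayer(double tie1, double tie2,
                           double f1, double f1Edge,
                           double f2, double f2Edge) {
        double score1 = dismax(tie1, f1, f1Edge);
        double score2 = dismax(tie1, f2, f2Edge);
        return score1 + tie2 * score2;
    }

    public static void main(String[] args) {
        // tie_1 = 0 and tie_2 = 1, the setting mentioned in the question.
        System.out.println(twoLayer(0.0, 1.0, 0.4, 0.2, 0.3, 0.1));
    }
}
```

With tie1 = 0 and tie2 = 1 this reduces to winner1_score + winner2_score, which is the behaviour the question asks for.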
After searching my index with Lucene's IndexSearcher, how can I print the terms that appear next to the search term?
QueryParser qp = new QueryParser("body", new StandardAnalyzer());
String queryStr = "search term";
Query q1 = qp.parse(queryStr);
TopDocs hits = searcher.search(q1, 1);
System.out.println(hits.totalHits + " docs found for the query \"" + q1.toString() + "\"");
The above code just prints the search term if it exists, but I wish to print the terms next to the search term instead.
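To show the output I am after, here is a minimal plain-Java sketch. It assumes the text of a hit is already available as a string (for example from a stored field) and it does not use the Lucene index at all; the regex split is only a stand-in for whatever analyzer indexed the field, it handles a single-word term, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not a Lucene API.
class NeighborTerms {

    // Returns the tokens directly before and after each occurrence of `term`.
    static List<String> neighbors(String text, String term) {
        // Crude tokenizer: lowercase, split on runs of non-word characters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(needle)) {
                continue;
            }
            // split() can yield a leading empty token; skip it.
            if (i > 0 && !tokens[i - 1].isEmpty()) {
                result.add(tokens[i - 1]);
            }
            if (i + 1 < tokens.length) {
                result.add(tokens[i + 1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(neighbors("The quick brown fox", "brown"));
    }
}
```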
I have a query to get photos according to values in a pivot table, that stores the relation of "pics" and "tags":
#photos
$q = PicTag::select(DB::raw('distinct(pics.id)'),
'pics.url',
'pics.titel',
'pics.hits',
'pics.created_at',
'users.username',
'users.displayname')
->leftJoin('pics', 'pics.id', '=', 'pic_tag.pic_id')
->leftJoin('users','users.id','=','pics.user_id')
->whereIn('pic_tag.tag_id', $tagids);
if($cat)
$q->where('typ',$cat);
if($year)
$q->where('jahrprod',$year);
$pics = $q->orderBy('pics.id','desc')
->paginate(30);
The problem arises when a photo has several tags that differ only in case, such as "Tag", "tag" and "tAG": the same photo would then be shown 3 times in my gallery. That is why I use distinct in the query.
With that the gallery is fine, but $pics->total() reports, for example, "90 photos" instead of "87 photos", because the distinct is not applied to the pagination count. In Laravel 4 I used groupBy('pics.id'), but that did not seem to be the fastest query, and with Laravel 5 it gives me a total() of 1.
How could I get the right total() value?
I know this is an old question, but this may help other people.
I faced the same problem, and the only good solution (low memory cost) I found was to run the query in two steps:
$ids = DB::table('foo')
->selectRaw('foo.id')
->distinct()
->pluck('foo.id');
$results = DB::table('foo')
->selectRaw('foo.id')
->whereIn('foo.id', $ids)
->paginate();
I tried this with 100k results, and had no problem at all.
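The idea behind the two steps, independent of Laravel, can be sketched in plain Java: first collapse the joined rows to distinct ids, then paginate over those ids so the total and the page contents agree. This is only an in-memory illustration with hypothetical names, not what the query builder does internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical in-memory illustration of "distinct ids first, then paginate".
class DistinctPagination {

    // Step 1: collapse the joined rows to distinct ids, keeping first-seen order.
    static List<Integer> distinctIds(List<Integer> joinedRowIds) {
        return new ArrayList<>(new LinkedHashSet<>(joinedRowIds));
    }

    // Step 2: slice one page (1-based page number) out of the distinct ids.
    static List<Integer> page(List<Integer> ids, int page, int perPage) {
        int from = Math.min(Math.max(0, (page - 1) * perPage), ids.size());
        int to = Math.min(from + perPage, ids.size());
        return new ArrayList<>(ids.subList(from, to));
    }

    public static void main(String[] args) {
        // Ids as a join would return them: duplicates where several tags matched.
        List<Integer> joined = List.of(5, 5, 4, 3, 3, 3, 2);
        List<Integer> ids = distinctIds(joined);
        System.out.println("total: " + ids.size());
        System.out.println("page 2: " + page(ids, 2, 3));
    }
}
```

Because the total is taken from the distinct ids rather than from the joined rows, duplicates no longer inflate the count.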
Laravel has trouble paginating complex queries, so you should handle them manually. In Laravel 5 I did it in two steps:
Step 1: repository method:
public function getByPage($page = 1, $limit = 10 , $provinceId , $cityId , $expertiseId)
{
$array = ['users.deleted' => false];
$array["users.approved"] = true;
$array["users.is_confirmed"] = true;
if($provinceId)
$array["users.province_FK"] = $provinceId;
if($cityId)
$array["users.city_FK"] = $cityId;
if($expertiseId)
$array["ONJNCT_USERS_EXPERTISE.expertise_FK"] = $expertiseId;
$results = new \stdClass();
$results->page = $page;
$results->limit = $limit;
$results->totalItems = 0;
$results->items = array();
$users= DB::table('users')
->distinct()
->select('users.*','ONDEGREES.name as drgree_name')
->join('ONJNCT_USERS_EXPERTISE', 'users.id', '=', 'ONJNCT_USERS_EXPERTISE.user_FK')
->join('ONDEGREES', 'users.degree_FK', '=', 'ONDEGREES.id')
->where($array)
->skip($limit * ($page - 1))->take($limit)->get();
//$users = $usersQuery>skip($limit * ($page - 1))->take($limit)->get();
$usersCount= DB::table('users')
->select('users.*','ONDEGREES.name as drgree_name')
->join('ONJNCT_USERS_EXPERTISE', 'users.id', '=', 'ONJNCT_USERS_EXPERTISE.user_FK')
->join('ONDEGREES', 'users.degree_FK', '=', 'ONDEGREES.id')
->where($array)
->count(DB::raw('DISTINCT users.id'));
$results->totalItems = $usersCount;
$results->items = $users;
return $results;
}
Step 2:
In my search controller:
function search($provinceId , $cityId , $expertiseId){
$page = Input::get('page', 1);
$data = $this->userService->getByPage($page, 1 , $provinceId ,$cityId , $expertiseId);
$users = new LengthAwarePaginator($data->items, $data->totalItems, 1 , Paginator::resolveCurrentPage(),['path' => Paginator::resolveCurrentPath()]);
return View::make('guest.search.searchResult')->with('users' ,$users);
}
It worked well for me!
Recently I've started using Apache CMIS and have read the official documentation and examples, but I haven't found anything about paging query results.
There is an example showing how to list folder items, setting maxItemsPerPage through an operationContext, but it seems the operationContext can only be used with the getChildren method:
int maxItemsPerPage = 5;
int skipCount = 10;
CmisObject object = session.getObject(session.createObjectId(folderId));
Folder folder = (Folder) object;
OperationContext operationContext = session.createOperationContext();
operationContext.setMaxItemsPerPage(maxItemsPerPage);
ItemIterable<CmisObject> children = folder.getChildren(operationContext);
ItemIterable<CmisObject> page = children.skipTo(skipCount).getPage();
This is fine when it comes to listing a folder, but my case is about getting results from a custom search query. The basic approach is:
String myType = "my:documentType";
ObjectType type = session.getTypeDefinition(myType);
PropertyDefinition<?> objectIdPropDef = type.getPropertyDefinitions().get(PropertyIds.OBJECT_ID);
String objectIdQueryName = objectIdPropDef.getQueryName();
String queryString = "SELECT " + objectIdQueryName + " FROM " + type.getQueryName();
ItemIterable<QueryResult> results = session.query(queryString, false);
for (QueryResult qResult : results) {
String objectId = qResult.getPropertyValueByQueryName(objectIdQueryName);
Document doc = (Document) session.getObject(session.createObjectId(objectId));
}
This approach will retrieve all documents in a queryResult, but I would like to include startIndex and limit. The idea would be to type something like this:
ItemIterable<QueryResult> results = session.query(queryString, false).skipTo(startIndex).getPage(limit);
I'm not sure about this part: getPage(limit). Is this the right approach for paging? I would also like to retrieve the total number of items, so that I know how to set the maximum number of items in the grid where they will be shown. There is a method for this, but the docs say something odd: that the repository sometimes cannot know the total number of items. This is the method:
results.getTotalNumItems();
I have tried something like:
SELECT COUNT(*)...
but that didn't do the trick :)
Please, could you give me some advice on how to page a query result properly?
Thanks in advance.
Query returns the same ItemIterable that getChildren returns, so you can page a result set returned by a query just like you can page a result set returned by getChildren.
Suppose you have a result page that shows 20 items on the page. Consider this snippet which I am running in the Groovy Console in the OpenCMIS Workbench against a folder with 149 files named testN.txt:
int PAGE_NUM = 1
int PAGE_SIZE = 20
String queryString = "SELECT cmis:name FROM cmis:document where cmis:name like 'test%.txt'"
ItemIterable<QueryResult> results = session.query(queryString, false, operationContext).skipTo(PAGE_NUM * PAGE_SIZE).getPage(PAGE_SIZE)
println "Total items:" + results.getTotalNumItems()
for (QueryResult result : results) {
println result.getPropertyValueByQueryName("cmis:name")
}
println results.getHasMoreItems()
When you run it with PAGE_NUM = 1, you'll get 20 results and the last println statement will print true. Also note that the first println will print 149, the total number of documents that match the search query, but as you point out, not all servers know how to return that.
If you re-run this with PAGE_NUM = 7, you'll get 9 results and the last println prints false because you are at the end of the list.
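The numbers above follow from simple arithmetic on skipTo(PAGE_NUM * PAGE_SIZE) and getPage(PAGE_SIZE). Here is a plain-Java sketch of that arithmetic, assuming the total is known; it does not call OpenCMIS, and the class name is hypothetical. Note that PAGE_NUM is zero-based in the snippet, so PAGE_NUM = 0 is the first page.

```java
// Hypothetical illustration of the paging arithmetic; not an OpenCMIS class.
class PageWindow {
    final int skip;        // items skipped before the page starts
    final int count;       // items actually on the page
    final boolean hasMore; // whether items remain after this page

    private PageWindow(int skip, int count, boolean hasMore) {
        this.skip = skip;
        this.count = count;
        this.hasMore = hasMore;
    }

    // pageNum is zero-based, matching skipTo(PAGE_NUM * PAGE_SIZE).
    static PageWindow of(int totalItems, int pageNum, int pageSize) {
        int skip = Math.min(pageNum * pageSize, totalItems);
        int count = Math.min(pageSize, totalItems - skip);
        boolean hasMore = skip + count < totalItems;
        return new PageWindow(skip, count, hasMore);
    }

    public static void main(String[] args) {
        PageWindow last = PageWindow.of(149, 7, 20);
        System.out.println(last.skip + " skipped, " + last.count
                + " on page, more: " + last.hasMore);
    }
}
```

For 149 items and a page size of 20, page 7 skips 140 items and leaves 9, which is why getHasMoreItems() is false there.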
If you want to see a working search page that leverages OpenCMIS and plain servlets and JSP pages, take a look at the SearchServlet class in The Blend, a sample web app that comes with the book CMIS & Apache Chemistry in Action.
I'm trying out geo search with the latest version of Lucene (4.0.0). The requirement is simple: get the points inside a circle (the center and radius are passed in as the query condition). I cannot find an API that outputs the distance from each result to the center, so I have to calculate the distance myself after reading the latitude and longitude of each result. Can anyone help? The code is listed below:
SpatialContext sc = SpatialContext.GEO;
SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects,
sc.makeCircle(lo, la, DistanceUtils.dist2Degrees(dist, DistanceUtils.EARTH_MEAN_RADIUS_KM)));
Filter geo_filter = strategy.makeFilter(args);
try {
Sort chainedSort = new Sort(sfArray).rewrite(searcher);
TopDocs docs = searcher.search(new MatchAllDocsQuery(), geo_filter, 10000, chainedSort);
logger.debug("search finished, num: " + docs.totalHits);
for (ScoreDoc scoreDoc : docs.scoreDocs){
Document doc = searcher.doc(scoreDoc.doc);
double la1 = Double.parseDouble(doc.get("la"));
double lo1 = Double.parseDouble(doc.get("lo"));
double distance = getDistance(la1, lo1, la, lo); // have to calc distance by myself here, not cool
}
} catch (IOException e) {
logger.error("fail to get the search result!", e);
}
It's easy to get the distance with Lucene 3.x. Is anyone familiar with geo (spatial) search in Lucene 4.0.0?
You have the lat & lon from the field; now you need to calculate the distance from the center point of the query circle. In your code, this would look like:
double distDEG = sc.getDistCalc().distance(args.getShape().getCenter(), lo1, la1);
double distKM = DistanceUtils.degrees2Dist(distDEG, DistanceUtils.EARTH_MEAN_RADIUS_KM);
Not bad; ehh?
(p.s. I wrote much of Lucene 4 spatial)
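For readers wondering what the degrees and kilometres in those two lines mean, here is a plain-Java sketch of the underlying math: a great-circle (haversine) distance expressed as an angle in degrees, and the conversion between that angle and a distance on a sphere. It is not the Spatial4j implementation; the class name is hypothetical and the radius constant is an assumed mean Earth radius.

```java
// Hypothetical illustration of the distance math; not Spatial4j code.
class GeoDistance {
    // Assumed mean Earth radius in kilometres.
    static final double EARTH_MEAN_RADIUS_KM = 6371.0087714;

    // Angle in degrees -> arc length in km on the sphere.
    static double degreesToKm(double degrees) {
        return Math.toRadians(degrees) * EARTH_MEAN_RADIUS_KM;
    }

    // Arc length in km -> angle in degrees.
    static double kmToDegrees(double km) {
        return Math.toDegrees(km / EARTH_MEAN_RADIUS_KM);
    }

    // Great-circle distance between two lat/lon points, as an angle in degrees.
    static double haversineDegrees(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 to guard against rounding just above the asin domain.
        return Math.toDegrees(2 * Math.asin(Math.min(1.0, Math.sqrt(a))));
    }

    public static void main(String[] args) {
        double deg = haversineDegrees(48.8566, 2.3522, 51.5074, -0.1278);
        System.out.println(degreesToKm(deg) + " km");
    }
}
```

The two-step form (distance in degrees, then degrees to km) mirrors the shape of the two lines in the answer.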