Search in new Sitecore ContentSearch using whole words - search

I am using new Sitecore search, and the issue I ran into is having results come for words that have nothing to do with my search term.
For example, searching for "lies" will find "applies". And this is not what I am looking for.
This is an example of search I am doing (simplified). It is a direct LINQ check for "Contains" on the "Content" property of the SearchResultItem, and most likely not what I supposed to do. It is just happen to be that samples I find online are practically doing so.
Example of my code (simplified). In here I break down the search sentence to separate keywords. By the way, I am looking for a way to show full sentence match first.
using (var context = ContentSearchManager.GetIndex("sitecore_web_index").CreateSearchContext())
{
var results = context.GetQueryable<SearchResultItem>()
.Filter(i => i.Path.StartsWith(Home.Paths.FullPath))
.Filter(GetTermPredicate(Term));
// use results here
}
protected Expression<Func<SearchResultItem, bool>> GetTermPredicate(string term)
{
var predicate = PredicateBuilder.True<SearchResultItem>();
foreach (var tempTerm in term.Split(' '))
{
predicate = predicate.And(p => p.Content.Contains(tempTerm));
}
return predicate;
}
Thank you in advance!!

All right. I got help from Sitecore Support.
In my version of Sitecore I can use the following to acheive search for a whole word instead of partial:
instead of:
predicate = predicate.And(p => p.Content.Contains(tempTerm));
use
predicate = predicate.And(p => p.Content.Equals(tempTerm));
Issue solved.

Replace Filter in your code by Where, it should be fine,
here is an example :
var currentIndex = ContentSearchManager.GetIndex("sitecore_web_index");
using (var context = currentIndex.CreateSearchContext())
{
var predicate = PredicateBuilder.True();
foreach (var currentWord in term.Split(‘ ‘))
{
predicate = predicate.Or(x => x.Content.Contains(currentWord ));
}
var results = context.GetQueryable().Where(predicate).GetResults();
}

As Ahmed notes, you should use Where instead of Filter, since Filter has no effect on search rank. The classic use case for filters is to apply a facet chosen by the user without distorting the ordering of results, as would happen if you used a Where clause.
Filtering is similar to using Where to restrict the result list. Both methods will affect the result in the same
result list, but when you use a Filter the scoring/ranking of the search hits is not affected by the filters.
Developer's Guide to Item Buckets and Search
There's a good dicussion of when to use Filter on the Sitecore 7 team blog:
Making Good Part 4: Filters, Paging and I'm feeling lucky.

If you only want to search for whole words, you could prefix and postfix the searchterm with a space. Allthough this doesn't catch all situations.

Related

Lucene Index Search without stoppers

I am doing some queries over a Lucene index, right now I'm looking for latin phrases over this queries. The problem is that some of this phrases include words that i think are consider like stoppers. For example if my search term is "a contrario sensu" the result is zero but if I only search for "contrario sensu" i have over 100 coincidences.
The question is how can i do a search without this stoppers?
My code looks like this
public IEnumerable<TesisIndx> Search(string searchTerm)
{
List<TesisIndx> results = new List<TesisIndx>();
IndexSearcher searcher = new IndexSearcher(FSDirectory.GetDirectory(indexPath));
QueryParser parser = new QueryParser("Rubro", analyzer);
PhraseQuery q = new PhraseQuery();
String[] words = searchTerm.Split(' ');
foreach (string word in words)
{
q.Add(new Term("Rubro", word));
}
//Query query = parser.Parse(searchTerm);
Hits hitsFound = searcher.Search(q);
TesisIndx sampleDataFileRow = null;
for (int i = 0; i < hitsFound.Length(); i++)
{
sampleDataFileRow = new TesisIndx();
Document doc = hitsFound.Doc(i);
sampleDataFileRow.Ius = int.Parse(doc.Get("Ius"));
sampleDataFileRow.Rubro = doc.Get("Rubro");
sampleDataFileRow.Texto = doc.Get("Texto");
results.Add(sampleDataFileRow);
}
}
I use a StandardAnalyzer to build the index and perform the search
It is a stop word. However, when it comes to phrase queries, that doesn't mean it isn't considered at all. If you try printing your query after parsing, you should see something like:
Rubro:"? contrario sensu"
That question mark represents a position increment, in this case a removed stop word. So it's looking for the phrase with a gap where a stop word has been removed at the beginning.
You can disable position increments in the query parser with QueryParser.setEnablePositionIncrements(false), though you should be aware this could cause problems for you if you still have position increments in the index, and run into a stop word in the middle of a phrase.
The StandardAnalyzer will exclude a set of stop words including "a" (see the end of https://github.com/apache/lucenenet/blob/3.0.3-2/src/core/Analysis/StopAnalyzer.cs for a full list)
It is important that the analysis style when querying is compatible with the style used when indexing. This is why your PhraseQuery only works without the "a" because the indexing step removed it.
You can use the StandardAnalyzer ctor that takes ISet<string> stopWords and pass in new HashSet<string>() Something like:
new StandardAnalyzer(Version.LUCENE_30, new HashSet<string>())
This means that all words will be included in the token stream for the field.
Use this analyzer when indexing and querying and you will get better results.
However, you should note that the StandardAnalyzer also fiddles with the words somewhat. It is designed to be "a good tokenizer for most European-language documents". See the comments at the beginning of https://github.com/apache/lucenenet/blob/3.0.3-2/src/core/Analysis/Standard/StandardTokenizer.cs for some more info and check if it's compatible with your use case.
It may be worth your time to investigate different analyzers for the kind of text you are indexing.

Lucene wild card search

How can I perform a wildcard search in Lucene ?
I have the text: "1997_titanic"
If I search like "1997_titanic", it is returning a result, but I am not able to do below two searches:
1) If I search with only 1997 it is not returning any results.
2) Also if there is a space, such as in "spider man", that is not finding any results.
I retrieve all movie information from a DB and store it in Lucene Documents:
public Document createMovieDoc(Movie m){
document.add(new StoredField("moviename", m.getName()));
TextField field = new TextField("movienameSearch", m.getName().toLowerCase(), Store.NO);
field.setBoost(5.0f);
document.add(field);
}
And to search, I have this method:
public List searh(String txt){
PhraseQuery phQuery= new PhraseQuery();
Term term = new Term("movienameSearch", txt.toLowerCase());
BooleanQuery b = new BooleanQuery();
b.add(phQuery, Occur.SHOULD);
TopFieldDocs tp= searcher.search(b, 20, ..);
for(int i=0;i<tp.length;i++)
{
int mId = tp[i].doc;
Document d = searcher.doc(mId);
String moviename = d.get("moviename");
list.add(moviename);
}
return list;
}
I'm not sure what analyzer you are using to index. Sounds like maybe WhitespaceAnalyzer? It sounds like, when indexing "1997_titanic" remains a single token, while "spider man" is split into the token "spider" and "man".
Could also be SimpleAnalyzer which uses a LetterTokenizer. This would make it impossible to search for "1997", since that tokenizer will eliminate all numbers for the indexed representation of the text.
Your search method doesn't look right. You aren't adding any terms to your PhraseQuery, so I wouldn't expect it to find anything. You must add some terms in order for anything to be found. You create a Term in what you've provided, but nothing is ever done with that Term. Maybe this has something to do with how you've pick your excerpts, or something? Not sure, I'm a bit confused by that.
In order to manually construct a PhraseQuery you must add each term individually, so to search for "spider man", you would do something like:
PhraseQuery phQuery= new PhraseQuery();
phQuery.add(new Term("movienameSearch", "spider"));
phQuery.add(new Term("movienameSearch", "man"));
This requires you to know what the analyzer was doing at index time, and tokenize the input yourself to suit. The simpler solution is to just use the QueryParser:
//With whatever analyzer you like to use.
QueryParser parser = new QueryParser(Version.LUCENE_46, "defaultField", analyzer);
Query query = parser.parse("movienameSearch:\"" + txt.toLowerCase() + "\"");
TopFieldDocs tp= searcher.search(query, 20);
This allows you to rely on the same analyzer to index and query, so you don't have to know how to tokenize your phrases to suit.
As far as finding "1997" and "titanic" individually, I would recommend just using StandardAnalyzer. It will tokenize those into discrete tokens, allowing them to be searched very easily, with a simple query like: movienameSearch:1997.

How to search and sort with CouchDB in one map function

I'm stumbling a bit with my CouchDB knowledge.
I have a database of content that is tagged with an array of tags and has a created date.
I want to create a view that pulls a limited number of newest stories tagged with a specific tag.
For example, the newest 6 stories tagged "Business."
Ran across this question, which seems to get me almost to where I need to go, but I'm missing one key element, which I think is how to craft the query string to sort by one key while searching by the other.
Here's my map function.
function(doc) {
if (doc.published == "yes" && doc.type == "news") {
for (var i = 0; i < doc.tags.length; i++) {
if (doc.tags[i]) {
emit([doc.created, doc.tags[i]], doc);
}
}
}
}
So how do I query that view for a all documents tagged "Business" that are the newest documents based on created.
The created attribute is a date sortable format.
First, I would switch the order of your emit:
emit([doc.tags[i], doc.created]);
(leave out doc as well, you can just add include_docs=true to get the entire document, and your view won't take up so much disk-space in the process)
Now you can query for the all the stories tagged as "Business" by using the following querystring:
startkey=["Business"]&endkey=["Business",{}]
You'll get all the documents with the tag business, and they'll be sorted by date.
This takes advantage of view collation, which basically is the rules governing how indexes are sorted/queried. For complex keys like this, the sorting is done for each item of the array separately. (ie. the first key is sorted first, the second key is sorted second, etc) This is why the order matters, as you must always move from left to right when querying a view index.
If you want the 6 most recent, your querystring will need to change:
descending=true&limit=6&endkey=["Business"]&startkey=["Business",{}]
NOTICE You need to swap the startkey/endkey values, due to how the descending parameter works. See the View reference page on the wiki for further explanation.
OK, I think I figured this out, but I'm not quite certain I fully understand it.
I found this story about complex keys and searching and sorting.
My map function looks like this:
function(doc) {
if (doc.published == "yes" && doc.type == "news") {
for (var i = 0; i < doc.tags.length; i++) {
if (doc.tags[i]) {
emit([doc.tags[i], doc.created], doc);
}
}
}
}
And to query and sort using it, the query looks like this.
http://localhost:5984/database/_design/story/_view/tagged?limit=10&startkey=["Business"]&endkey=["Business",{}]&descending=false
I'm getting the results I want, but I'm not entirely certain I understand it all.

Configuring Solr for Suggestive/Predictive Auto Complete Search

We are working on integrating Solr 3.6 to an eCommerce site. We have indexed data & search is performing really good.
We have some difficulties figuring how to use Predictive Search / Auto Complete Search Suggestion. Also interested to learn the best practices for implementing this feature.
Our goal is to offer predictive search similar to http://www.amazon.com/, but don't know how to implement it with Solr. More specifically I want to understand how to build those terms from Solr, or is it managed by something else external to solr? How the dictionary should be built for offering these kind of suggestions? Moreover, for some field, search should offer to search in category. Try typing "xper" into Amazon search box, and you will note that apart from xperia, xperia s, xperia p, it also list xperia s in Cell phones & accessories, which is a category.
Using a custom dictionary this would be difficult to manage. Or may be we don't know how to do it correctly. Looking to you to guide us on how best utilize solr to achieve this kind of suggestive search.
I would suggest you a couple of blogpost:
This one which shows you a really nice complete solution which works well but requires some additional work to be made, and uses a specific lucene index (solr core) for that specific purpose
I used the Highlight approach because the facet.prefix one is too heavy for big index, and the other ones had few or unclear documentation (i'm a stupid programmer)
So let's suppose the user has just typed "aaa bbb ccc"
Our autocomplete function (java/javascript) will call solr using the following params
q="aaa bbb"~100 ...base query, all the typed words except the last
fq=ccc* ...suggest word filter using last typed word
hl=true
hl.q=ccc* ...highlight word will be the one to suggest
fl=NONE ...return empty docs in result tag
hl.pre=### ...escape chars to locate highlight word in the response
hl.post=### ...see above
you can also control the number of suggestion with 'rows' and 'hl.fragsize' parameters
the highlight words in each document will be the right candidates for the suggestion with "aaa bbb" string
more suggestion words are the ones before/after the highlight words and, of course, you can implement more filters to extract valid words, avoid duplicates, limit suggestions
if interested i can send you some examples...
EDITED: Some further details about the approach
The portion of example i give supposes the 'autocomplete' mechanism given by jquery: we invoke a jsp (or a servlet) inside a web application passing as request param 'q' the words just typed by user.
This is the code of the jsp
ByteArrayInputStream is=null; // Used to manage Solr response
try{
StringBuffer queryUrl=new StringBuffer('putHereTheUrlOfSolrServer');
queryUrl.append("/select?wt=xml");
String typedWords=request.getParameter("q");
String base="";
if(typedWords.indexOf(" ")<=0) {
// No space typed by user: the 'easy case'
queryUrl.append("&q=text:");
queryUrl.append(URLEncoder.encode(typedWords+"*", "UTF-8"));
queryUrl.append("&hl.q=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
} else {
// Space chars present
// we split the search in base phrase and last typed word
base=typedWords.substring(0,typedWords.lastIndexOf(" "));
queryUrl.append("&q=text:");
if(base.indexOf(" ")>0)
queryUrl.append("\""+URLEncoder.encode(base, "UTF-8")+"\"~1000");
else
queryUrl.append(URLEncoder.encode(base, "UTF-8"));
typedWords=typedWords.substring(typedWords.lastIndexOf(" ")+1);
queryUrl.append("&fq=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
queryUrl.append("&hl.q=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
}
// The additional parameters to control the solr response
queryUrl.append("&rows="+suggestPageSize); // Number of results returned, a parameter to control the number of suggestions
queryUrl.append("&fl=A_FIELD_NAME_THAT_DOES_NOT_EXIST"); // Interested only in highlights section, Solr return a 'light' answer
queryUrl.append("&start=0"); // Use only first page of results
queryUrl.append("&hl=true"); // Enable highlights feature
queryUrl.append("&hl.simple.pre=***"); // Use *** as 'highlight border'
queryUrl.append("&hl.simple.post=***"); // Use *** as 'highlight border'
queryUrl.append("&hl.fragsize="+suggestFragSize); // Another parameter to control the number of suggestions
queryUrl.append("&hl.fl=content,title"); // Look for result only in some fields
queryUrl.append("&facet=false"); // Disable facets
/* Omitted section: use a new URL(queryUrl.toString()) to get the solr response inside a byte array */
is=new ByteArrayInputStream(solrResponseByteArray);
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//response/lst[#name=\"highlighting\"]/lst/arr[#name=\"content\"]/str");
NodeList valueList = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
Vector<String> suggestions=new Vector<String>();
for (int j = 0; j < valueList.getLength(); ++j) {
Element value = (Element) valueList.item(j);
String[] result=value.getTextContent().split("\\*\\*\\*");
for(int k=0;k<result.length;k++){
String suggestedWord=result[k].toLowerCase();
if((k%2)!=0){
//Highlighted words management
if(suggestedWord.length()>=suggestedWord.length() && !suggestions.contains(suggestedWord))
suggestions.add(suggestedWord);
}else{
/* Words before/after highlighted words
we can put these words inside another vector
and use them if not enough suggestions */
}
}
}
/* Finally we build a Json Answer to be managed by our jquery function */
out.print(request.getParameter("json.wrf")+"({ \"suggestions\" : [");
boolean firstSugg=true;
for(String suggestionW:suggestions) {
out.print((firstSugg?" ":" ,"));
out.print("{ \"suggest\" : \"");
if(base.length()>0) {
out.print(base);
out.print(" ");
}
out.print(suggestionW+"\" }");
firstSugg=false;
}
out.print(" ]})");
}catch (Exception x) {
System.err.println("Exception during main process: " + x);
x.printStackTrace();
}finally{
//Gracefully close streams//
try{is.close();}catch(Exception x){;}
}
Hope to be helpfull,
Nik
This might help you out.I am trying to do the same.
http://solr.pl/en/2010/10/18/solr-and-autocomplete-part-1/

Search query with Subsonic

Ok,
Today I am trying to learn Subsonic. Pretty cool stuff.
I am trying to build some search functionality into my website but am struggling about how I might achieve this in Subsonic.
I have one search field that could contain multiple keywords. I want to return results that match all of the keywords. The target on the search is a single text column.
So far I have this (it runs but never returns results):
return new SubSonic.Select().From(Visit.Schema)
.InnerJoin(InfopathArchive.VisitIdColumn, Visit.VisitIdColumn)
.Where(InfopathArchive.XmlDocColumn).Like(keywords)
.ExecuteTypedList<Visit>();
There is a one to one mapping between the Visit table and the InfoPathArchive table. I just want to return the collection of Visits that have the keywords in the related XMLDocColumn.
If I could get that working it would be great. Now the second problem is that if someone searches for 'australia processmodel' then obviously the above code should only return that exact phrase. How can I create a query that splits up my search term so that it must return documents that contain ALL of the individual search terms?
Any help appreciated.
Edit: Ok, so the basic search works, but the multiple keyword search doesnt. I did what Adam suggested but it seems Subsonic only uses one parameter for the query.
Here is the code:
List<string> wordsInQueryList = keywords.Split(' ').ToList();
SqlQuery q = Select.AllColumnsFrom<Visit>()
.InnerJoin(InfopathArchive.VisitIdColumn, Visit.VisitIdColumn)
.Where(Visit.IsDeletedColumn).IsEqualTo(false);
foreach(string wordInQuery in wordsInQueryList)
{
q = q.And(InfopathArchive.XmlDocColumn).Like("%" + wordInQuery + "%");
}
return q.ExecuteTypedList();
Then if I look at the query that Subsonic generates:
SELECT (bunch of columns)
FROM [dbo].[Visit]
INNER JOIN [dbo].[InfopathArchive] ON [dbo].[Visit].[VisitId] = [dbo].[InfopathArchive].[VisitId]
WHERE [dbo].[Visit].[IsDeleted] = #IsDeleted
AND [dbo].[InfopathArchive].[XmlDoc] LIKE #XmlDoc
AND [dbo].[InfopathArchive].[XmlDoc] LIKE #XmlDoc
So it ends up that only the last keyword is being searched for.
Any ideas?
First question:
return new SubSonic.Select().From(Visit.Schema)
.InnerJoin(InfopathArchive.VisitIdColumn, Visit.VisitIdColumn)
.Where(InfopathArchive.XmlDocColumn).Like("%" + keywords + "%")
.ExecuteTypedList<Visit>();
Second question:
Pass a List of words in your query to a function that builds a SubSonic query as follows
SqlQuery query = DB.Select().From(Visit.Schema)
.InnerJoin(InfopathArchive.VisitIdColumn, Visit.VisitIdColumn)
.Where("1=1");
foreach(string wordInQuery in wordsInQueryList)
{
query = query.And(InfopathArchive.XmlDocColumn).Like("%" + wordInQuery + "%")
}
return query.ExecuteTypedList<Visit>();
Obviously this is untested but it should point you in the right direction.
You can do what Adam is suggesting or with 2.2 you can simply use "Contains()" instead of Like("%...%"). We also support StartsWith and EndsWith() :)

Resources