How to find a match within a single term using Lucene - search

I am using the Lucene search engine but it only seems to find matches that occur at the beginning of terms.
For example:
Searching for "one" would match "onematch" or "one day a time" but not "loneranger".
The Lucene documentation says it doesn't support wildcards at the front of a search string, so I am not sure whether Lucene can match inside a term at all or can only match terms that start with the search term.
Is this a problem with how I have created my index, how I am building my search query or just a limitation of Lucene?

Found some info in another post here on Stack Overflow, "[LUCENE.NET] Leading wildcard throws an error":
You can call SetAllowLeadingWildcard(true) on your QueryParser (setAllowLeadingWildcard in Java Lucene) to allow leading wildcards during your search. This will of course have a large performance impact, but it will allow users to find matches within a term.

Lucene will find a document if the search term appears anywhere within it, but by default it doesn't allow wildcard queries where the wildcard is at the front of the search term, because they perform horribly. If that is functionality you care about, you will have to either change a config flag (thanks for the interesting link), find a third-party library that already supports it, or find a different search implementation (for small enough datasets, the built-in search of many RDBMS engines is sufficient).
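To illustrate why a leading wildcard is so much more expensive than a trailing one, here is a minimal sketch in plain Python. It does not use Lucene; the sorted list TERMS is a hypothetical stand-in for an index's term dictionary, and both function names are made up for this example. A prefix lookup can jump to the right place with a binary search, while a "contains" lookup has to inspect every term.

```python
import bisect

# Toy stand-in for an index's sorted term dictionary (hypothetical data).
TERMS = sorted(["day", "loneranger", "one", "onematch", "time", "zone"])


def prefix_matches(terms, prefix):
    """Find terms starting with `prefix` using binary search on the sorted list."""
    start = bisect.bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        matches.append(term)
    return matches


def infix_matches(terms, fragment):
    """Find terms containing `fragment`; every term has to be inspected."""
    return [term for term in terms if fragment in term]


print(prefix_matches(TERMS, "one"))
print(infix_matches(TERMS, "one"))
```

The second function is roughly what a query such as *one* forces the engine to do, which is why leading wildcards are disabled unless you opt in.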

Your query should be
Query query = new WildcardQuery(new Term("contents", "*one*"));
where contents is the name of the field you are searching (WildcardQuery is in org.apache.lucene.search and Term is in org.apache.lucene.index).
The term "one" must be enclosed in asterisks, with no space between the term and either asterisk.

Related

SharePoint Online - Search File Name by middle characters

We have files in a SharePoint Online library that were migrated from an old network drive. Unfortunately, all the files have names like "ReportOnTheMexicoArea20010101.doc". A user who wants to find that exact file may search for "Mexico", but that will not return the file, because the word is only a run of characters in the middle of a longer string. Is there any custom query or trick to search for characters in the middle of a file name?
Thanks in advance
SharePoint does not support suffix wildcard search (a wildcard at the start of the word); it only supports prefix matching. When you use words in a free-text KQL query, Search in SharePoint returns results based on exact matches of your words with the terms stored in the full-text index. You can use just the beginning of a word by adding the wildcard operator (*) to enable prefix matching. In prefix matching, Search in SharePoint matches results with terms that contain the word followed by zero or more characters.
For example: Report*
I would suggest you expand your approach to tagging your documents: use SharePoint metadata to tag them rather than relying on the title alone.
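One way to get such tags for the migrated files is to derive keywords from the run-together file names before storing them as metadata. The following is only a sketch under that assumption: filename_keywords is a hypothetical helper, it does not call any SharePoint API, and it assumes the names follow the CamelCase-plus-digits pattern from the question.

```python
import re


def filename_keywords(filename):
    """Split a CamelCase file name into words and digit runs.

    Hypothetical helper: the result could be stored as metadata so that a
    search for "Mexico" can match a whole term.
    """
    stem = filename.rsplit(".", 1)[0]  # drop the extension if there is one
    # Capitalised words, runs of capitals (acronyms), lowercase runs, digit runs.
    return re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", stem)


print(filename_keywords("ReportOnTheMexicoArea20010101.doc"))
```

For the example file this yields the separate words Report, On, The, Mexico, Area and the date string 20010101, each of which then exists as its own term.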

SOLR: Return missing words for multi word searches

I'm trying to get back, for each hit, the words of a Solr search query that were not part of the match.
Let's say I'm searching for "Red Hat Linux chickpeas" (without quotes) and one of the hits is "Red Hat Enterprise Linux operating system". Then I'd like to be told that the word "chickpeas" is missing from this result.
I think this should be possible with Solr, but I couldn't come up with the right Google or Stack Overflow query to find a solution.
You could try using a facet to get the number of results for each of the given terms:
q=Red+Hat+Linux+chickpeas&facet=true&facet.field={!terms=red,hat,linux,chickpeas}text
Where text is a catch-all field (tokenized, lowercase filtered). Note that the facets are case-sensitive.
The answer to my question is using an exists function query for each search term.
See here:
https://stackoverflow.com/a/26163945/467944
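As a rough sketch of that approach, the request can ask Solr for one boolean pseudo-field per search term, built from the exists() and query() functions. The field name text is taken from the facet example above; the has_ aliases and both function names are assumptions made for this illustration, and the code does not escape terms that contain special characters.

```python
from urllib.parse import urlencode


def missing_term_params(user_query, field="text"):
    """Build Solr params with one exists(query(...)) pseudo-field per term."""
    terms = user_query.lower().split()
    # "has_<term>" is a hypothetical alias; Solr returns it as true/false per doc.
    flags = [f"has_{t}:exists(query({{!v='{field}:{t}'}}))" for t in terms]
    return {"q": user_query, "fl": ",".join(["*"] + flags)}


def missing_terms(doc, user_query):
    """List the query terms whose pseudo-field is false (or absent) in a result doc."""
    return [t for t in user_query.lower().split() if not doc.get(f"has_{t}")]


params = missing_term_params("Red Hat Linux chickpeas")
print(urlencode(params))  # query string to append to the select URL
```

Each returned document then carries the flags, so a hit on "Red Hat Enterprise Linux operating system" would come back with has_chickpeas set to false, and missing_terms would report chickpeas for it.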

Azure search, search by partial terms

Here are two example searches in the portal. I would expect the second search to return some results, even with one letter missing.
The search is in Hebrew.
The full term returns some results.
The same term with one letter missing returns no results.
There are a few ways you can search for partial terms in Azure Search. You'll need to decide which of the following methods works best in your scenario. Based on the example, it seems either fuzzy search or prefix search will do the job. You can learn about the differences between these methods in the documentation.
Fuzzy search: blog, documentation
Wildcard search, specifically prefix search: documentation
Regular expression search: documentation
Index partial terms by defining a custom analyzer: blog, documentation
Let me know if you have any questions about any of the above
Check this answer: I solved this by using a regex and by changing the GET request to a POST request.
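The three query-time options listed above can be sketched as JSON bodies for a POST search request. This is an illustration only: the function names are hypothetical, the service URL, index name and API key are left out, and it assumes the full Lucene query syntax is enabled with queryType set to full. The fragment passed to the regex variant is not escaped.

```python
import json


def fuzzy_search_body(term, max_edits=1):
    """Fuzzy match: term~N tolerates up to N single-character edits."""
    return {"search": f"{term}~{max_edits}", "queryType": "full"}


def prefix_search_body(prefix):
    """Prefix match: a trailing * matches terms that start with the prefix."""
    return {"search": f"{prefix}*", "queryType": "full"}


def regex_search_body(fragment):
    """Regex match: the pattern goes between forward slashes."""
    return {"search": f"/.*{fragment}.*/", "queryType": "full"}


# ensure_ascii=False keeps non-Latin terms (such as Hebrew) readable in the body.
print(json.dumps(fuzzy_search_body("seattle"), ensure_ascii=False))
```

Sending the query in a POST body rather than in a GET URL also avoids URL-encoding problems with non-Latin search terms.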

Exact match in google search

I am trying to make an application that finds all the copied code in a project.
But my question is really just about Google search.
I made a search for the keyword "public void bubbleSort(int[] arr){"
and this was the result.
In the first page of search results, only the last url makes a perfect match with my keyword.
Can I tell Google, with some search keywords, to give more importance to pages that contain an exact match of my search keyword?
Although the plus sign (+) is no longer an available Google search operator, you can put the phrase in quotes, or run the query and then select Search Tools and choose Verbatim under the All Results drop-down.
You can also search the Google code archives at https://code.google.com/, or try some of the other code search engines around the Internet.
+"public void bubbleSort(int[] arr){"
The plus sign means to include this term no matter what. The quotes turn the loosely coupled words into a single term.
For a full list of Google syntax operators, see:
https://support.google.com/websearch/answer/136861?hl=en
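If the application has to generate such searches itself, the quoted phrase needs to be URL-encoded. A minimal sketch, assuming the standard https://www.google.com/search endpoint with its q parameter; exact_phrase_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def exact_phrase_url(phrase):
    """Wrap the phrase in double quotes and URL-encode it as the q parameter."""
    return "https://www.google.com/search?" + urlencode({"q": f'"{phrase}"'})


print(exact_phrase_url("public void bubbleSort(int[] arr){"))
```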

Solr behind Drupal returns too many results for specific query

We've got Solr sat behind one of our client's Drupal 7 websites, and while it's working well, it returns too many results for what should be quite specific queries. (It also has relevance/weighting problems; but I'm hoping that solving this problem will remove the - literally - irrelevant results.)
For example, searching for the phrase 'particular phrase in london' should return the node with that as its title, quite high up; I don't even think that any other content should be returned. But I find that it's returning lots of content, purely on the fact that it mentions "London"!
Frivolously, searching for the ridiculous phrase 'piecrusts in london' returns a lot of results too, apparently just because they mention London. No content on the site mentions actual piecrusts.
When I search for 'particular phrase in london', here are the parameters that end up in the catalina.out log on the server (whitespace added for clarity):
{spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1
&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR
&spellcheck.q=particular+phrase+in+london
&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0
&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200
&facet.date=ds_created
&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR
&f.bundle.facet.mincount=1&hl.fl=content,ts_comments
&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,
label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,
tos_name,tm_node,zs_entity
&start=0&facet.sort=count&f.bundle.facet.limit=50&q=particular+phrase+in+london
&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR
&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0
&facet.field=im_field_health_topic&facet.field=bundle
&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50}
hits=1998 status=0 QTime=14
Note that these parameters have been built by Drupal's Apache Solr module; I don't believe we've got any particular custom code of our own that's doing anything to it.
This corresponds to the following URL, if entered directly in the browser:
http://example.com:8081/solr/CLIENT/select?spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR&spellcheck.q=particular+phrase+in+London&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200&facet.date=ds_created&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR&f.bundle.facet.mincount=1&hl.fl=content,ts_comments&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,tm_node,zs_entity&start=0&facet.sort=count&f.bundle.facet.limit=50&q=particular+phrase+in+London&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0&facet.field=im_field_health_topic&facet.field=bundle&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50
This URL returns nearly 2000 results - that's most of the content on the site! I've experimented with removing the query parameters one at a time, and the only ones that seem to make any difference are qf and q: if I remove qf, I get zero results; if I remove q, I get more results back!
I guess there are two questions here:
Is there anything in these parameters that tells Solr "don't worry whether 'particular phrase' or 'piecrusts' appears: just collate the results for 'london'" and then order by relevancy? I would add that I think 'in' is listed in the stopwords file, so we can probably ignore the effect of that (?)
Or is this something in the (standard Drupal) schema that I need to change?
I appreciate that sometimes search is better for the visitor if it's inclusive; Google does return results even if it doesn't find perfect matches. But, stopwords and stemming aside, the client does require that searches return only results where all words appear in the content.
As mentioned in the post at http://drupal.org/node/1783454, the Apache Solr Search Integration module makes use of the mm (minimum should match) param, which is more or less configured to affect rankings by how closely the keywords match the dataset. Looking through the docs, there are other ways you can use the parameter to affect rankings as well. Therefore the results produced by Apache Solr Search Integration are weighted towards the behaviour of the AND operator, even though more results will be returned as you add more keywords. The benefit of this param is that when the user enters keywords that are too restrictive, results will still be returned. Displaying no results is a really quick way to guide people away from your site.
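To make the effect of mm concrete, here is a toy simulation in plain Python. It is not Solr code: the stopword list, the whitespace tokenizer, the sample documents and the function names are all assumptions for illustration. It only models a percentage mm value, rounded down, with at least one clause always required.

```python
import math

STOPWORDS = {"in"}  # assumption: mirrors the stopwords file mentioned in the question


def required_matches(num_clauses, mm_percent):
    """Number of optional clauses that must match; percentages round down."""
    return max(1, math.floor(num_clauses * mm_percent / 100))


def doc_matches(doc_text, query, mm_percent):
    """Toy minimum-should-match check on whitespace-separated, lowercased tokens."""
    tokens = set(doc_text.lower().split())
    clauses = [w for w in query.lower().split() if w not in STOPWORDS]
    hits = sum(1 for clause in clauses if clause in tokens)
    return hits >= required_matches(len(clauses), mm_percent)


doc = "hospitals in london"  # hypothetical document that only mentions London
print(doc_matches(doc, "piecrusts in london", 50))   # one of two clauses is enough
print(doc_matches(doc, "piecrusts in london", 100))  # every clause is required
```

With a lax mm, a document that mentions only London still matches. Requiring every clause (an mm of 100%) is what gives the "all words must appear" behaviour the client asked for.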
How are you displaying the search results?
Maybe you could use Solr Views to limit the search range?
http://drupal.org/project/apachesolr_views
thanks
Nick

Resources