Solr, managing entities - search

I have the following situation when using Solr. My document contains "entities" for example "peanut butter". I have a list of such entities. These are items that go together and are not to be treated as two individual words. During indexing, I want solr to realize this and treat "peanut butter" as an entity. For example if someone searches for
"peanut"
then documents that have the word peanut should rank higher than documents that have the word "peanut butter". However if someone searches for
"peanut butter"
then the document that has peanut butter should show up higher than ones that have just peanut. Is there a config setting somewhere which can be modified such that the entity list can be specified in a file and Solr would do the needful?

Configure that field to use a StrField type, instead of a TextField. TextField is designed to handle tokenization and full-text search on textual content. StrField treats it's contents as a keyword, and so does not tokenize.

Related

Solr search using contains, sound like

Problem:
I have a movie information in solr. Two string fields define the movie title and director name. A copy field define another field which solr search for default.
I would like to have google like search with limited scope as follows. How to achieve it.
1)How to search solr for contains
E.g.
a) If the movie director name is "John Cream", searching for joh won't return anything. However, searchign for John return the correct result.
b) If there is a movie title called aaabbb and another one called aaa, searching for aaa returns only one result. I need to return the both results.
2) How to account for misspelling
E.g.
If the movie director name is "John Cream", searching for Jon returns no results. Is there a good sounds like (soundex) implementation for solr. If so how to enable it?
You can use solr query syntax
Searching for contains is obviously possible using wildcards (eg: title:*aaa* will match 'aaabbb' and also 'cccaaabbb'), but be careful about it, becouse it doesn't use indexes efficently. Do you really need this?
A soundex like search is possible applying solr.PhoneticFilterFactory filter to both your index and query. To achieve this define your fieldType like this in schema:
<fieldType name="text_soundex" class="solr.TextField">
...
<filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="true"/>
</fieldType>
If you define your "director" field as "text_soundex" you'll be able to search for "Jon" and find "John"
See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for more information.
Things you are asking, the first one is definitely achievable from Solr. I don't know about soundex.
1)How to search solr for contains
You can store data into string type of field or text type of field. In string field by wild card searching you can achieve the result (E.g field1:"John*"). Also you should look into different types of analyzers. But before everything, please look into the Solr reference http://wiki.apache.org/solr/.
def self.get_search_deals(search_q, per = 50)
data = Sunspot.search(Deal) do
fulltext '*'+search_q +'*', fields: :title
paginate page: page_no, per_page: per
end
data.results
end
searchable do
text :title
end
just pass string as "*sam*"

Implement Search Everything using Solr

How the search everything kind of application is indexing & keeping track of data into its search indexes.
Recently I have been working on Apache Solr which is producing amazing results for a search. But it was for one particular products catalog section that is being searched. As Solr is a stores it's data document, we indexed searchable fields as document in solr. I'm not sure how it can be used to build a search everything kind of search? And how should I index data into Solr?
By search everything I mean, to search into different module for information like Customers, Services, Accounts, Orders, Catalog, Support Ticket, etc. So search return results which is combined as a result from a single search form and user don't need to go into different forms for search that module?
Do I need to build different indexes for each such data models or store them into solr as single document? What is the best strategy to implement this.
You can store all that data in a single index with each document having an extra field that stores its type (Customer, Order, etc.). For the within-module search, just restrict the search query to documents of that type. For the Search All functionality, use copyField to copy all the relevant fields in each document type into one big field, and search with the document type field unconstrained.

Does Solr store the original contents of the document after indexing?

If I mark a field as "don't store," does Solr retain the original contents of that field anywhere, or does it only retain the "bag of words" that it culls for the index itself?
I'm asking from the standpoint of document security. If someone cracked into the machine running our Solr index, could they get the original text passed into Solr for this "don't store" field, or not?
No, the Solr index does not store the original value in any retrievable or viewable way for fields that are set to stored="false". Common Field options on the Solr wiki states the following behavior of setting the stored option.
True if the value of the field should be retrievable during a search
If someone cracked into the machine running the Solr index and ran Solr queries based on the above they would not be able to see the contents of the field as Solr would not return that field. However if they had access to the disk and the actual index folder and segment files as written by Lucene, they could see the terms that Solr stored for each document in that field using Luke - Lucene Index Toolbox to examine the index folder.
When a field is Storable.No, only enough information is stored for Lucene to perform the search.
However, if you specify WITH_POSITIONS_OFFSETS when constructing each field, there is usually enough information to retrieve:
lowercase(EXACTSTRINGINDEXED) - LUCENEDELIMITERS - STOPWORDS
For example, if you indexed:
Jerry&Mary's Live Bait and Yellow Cab
with an analyzer that treats "&" and "'" as delimiters, did not index single letters, and treated 'and' as a stopword, you would see in the index something like:
jerry mary live bait [null word] yellow cab
(You can verify this with Luke, as mentioned above.)

sunspot solr attribute substring/wildcard searches

Is it possible to search solr attribute fields (non solr.TextField types) using a substring/wildcard/partial string match?
For instance if I have a solr.StrField field and documents that contain the string "1234567890" I want to be able to search on "456" and have that document returned.
From what I can see only textfields can be searched in this method using things like EdgeNGram and the like, but not attribute fields??
You can have the partial matches working for String as well Text fields for wildcards.
If the query parsers you are using, supports leading wildcard queries, you can easily search for *456*, and this should match 1234567890.
However, EdgeNGram would only work for solr.TextField, as solr.strField do not allow analysers to be added to it.
So you can only define fields with class as solr.TextField and have the EdgeNGram in the analysis chain, which would break down the indexed terms into shingles for partial matching.

How to find related items by tags in Lucene.NET

My indexed documents have a field containing a pipe-delimited set of ids:
a845497737704e8ab439dd410e7f1328|
0a2d7192f75148cca89b6df58fcf2e54|
204fce58c936434598f7bd7eccf11771
(ignore line breaks)
This field represents a list of tags. The list may contain 0 to n tag Ids.
When users of my site view a particular document, I want to display a list of related documents.
This list of related document must be determined by tags:
Only documents with at least one matching tag should appear in the "related documents" list.
Document with the most matching tags should appear at the top of the "related documents" list.
I was thinking of using a WildcardQuery for this but queries starting with '*' are not allowed.
Any suggestions?
Setting aside for a minute the possible uses of Lucene for this task (which I am not overly familiar with) - consider checking out the LinkDatabase.
Sitecore will, behind the scenes, track all your references to and from items. And since your multiple tags are indeed (I assume) selected from a meta hierarchy of tags represented as Sitecore Items somewhere - the LinkDatabase would be able to tell you all items referencing it.
In some sort of pseudo code mockup, this would then become
for each ID in tags
get all documents referencing this tag
for each document found
if master-list contains document; increase usage-count
else; add document to master list
sort master-list by usage-count descending
Forgive me that I am not more precise, but am unavailable to mock up a fully working example right at this stage.
You can find an article about the LinkDatabase here http://larsnielsen.blogspirit.com/tag/XSLT. Be aware that if you're tagging documents using a TreeListEx field, there is a known flaw in earlier versions of Sitecore. Documented here: http://www.cassidy.dk/blog/sitecore/2008/12/treelistex-not-registering-links-in.html
Your pipe-delimited set of ids should really have been separated into individual fields when the documents were indexed. This way, you could simply do a query for the desired tag, sorting by relevance descending.
You can have the same field multiple times in a document. In this case, you would add multiple "tag" fields at index time by splitting on |. Then, when you search, you just have to search on the "tag" field.
Try this query on the tag field.
+(tag1 OR tag2 OR ... tagN)
where tag1, .. tagN are the tags of a document.
This query will return documents with at least one tag match. The scoring automatically will take care to bring up the documents with highest number of matches as the final score is sum of individual scores.
Also, you need to realizes that if you want to find documents similar to tags of Doc1, you will find Doc1 coming at the top of the search results. So, handle this case accordingly.

Resources