How to configure Solr for one-to-many relationship - search

I'm developing a search application using Solr that is required to search 'books' that are split into chapters. A book might look like this:
title: "book title"
author: "mr whoever"
chapters: [
{
title: "some chapter title"
text: "blah blah blah"
},
{
title: "some other title"
text: "blah blah blah"
},
... etc.
]
Requirements for the search:
The user is searching for books not chapters, so the top results must be the most relevant books overall, given all the chapter text inside.
The user needs to see which chapters from a book have matched, information about those chapters and how many matches there were per chapter.
Progress:
Multivalued fields
Solr supports multivalued fields (i.e. multiple chapters per book), but each value is a single flat value, so one multivalued field on the book document can't hold a chapter's title and text together as a pair.
Solr "Join"
I don't know if this is necessary. Each chapter will only be owned by one book so it seems like we could just put them all in one document without too much repetition.
Dynamic fields
Use fields such as "chapter1text_txt", "chapter1title_txt" and "chapter2text_txt", and join up the per-chapter information outside Solr, so Solr doesn't know that "chapter1text_txt" and "chapter1title_txt" belong to the same chapter.
What is the proper way of configuring schema.xml to support and search this type of document?

Document structure
So far the best solution has been using multivalued fields for both chapter_title and chapter_text, and enforcing a consistent ordering of these values in the upload documents, so the first chapter_title always corresponds to the first chapter_text and so on.
Here's the section of schema.xml:
<field name="report_title"
type="text_en" indexed="true" stored="true"/>
<field name="chapter_title"
type="text_en" indexed="true" stored="true" multiValued="true"/>
<field name="chapter_text"
type="text_en" indexed="true" stored="true" multiValued="true"/>
This is a compromise because the index cannot know about this relationship between chapter_title and chapter_text, so it is impossible to ask for "chapters with X in the title and Y in the text".
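Because the pairing lives only in the ordering of the values, the client has to rebuild the chapters after a search. A minimal sketch, assuming the result document arrives as a Python dict with parallel chapter_title and chapter_text lists in indexing order (rebuild_chapters is a hypothetical helper, not part of any Solr client):

```python
def rebuild_chapters(doc):
    """Pair up the parallel multivalued fields by position."""
    titles = doc.get("chapter_title", [])
    texts = doc.get("chapter_text", [])
    if len(titles) != len(texts):
        # The upload step is supposed to keep both lists the same length.
        raise ValueError("chapter_title and chapter_text are out of step")
    return [{"title": title, "text": text} for title, text in zip(titles, texts)]


# Example document shaped like a Solr result (values are made up).
doc = {
    "report_title": "book title",
    "chapter_title": ["some chapter title", "some other title"],
    "chapter_text": ["blah blah blah", "more blah"],
}
chapters = rebuild_chapters(doc)
```

The length check is there because a single missing value would silently shift every later chapter onto the wrong title.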
Match Counts
I still haven't found a way of doing this, but I'm considering using highlighting and counting the number of highlighted terms after asking for one large snippet covering the whole document.
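The counting step could look like the sketch below. It assumes the highlighter's default <em>...</em> markers, and that the snippets for chapter_text come back one per chapter in chapter order (for example by asking for hl.preserveMulti=true with hl.fragsize=0; whether that holds depends on the Solr version and highlighter in use, so treat it as an assumption). count_matches is a hypothetical helper.

```python
def count_matches(snippets, marker="<em>"):
    """Return the number of highlighted terms in each snippet."""
    return [snippet.count(marker) for snippet in snippets]


# Made-up highlighting output for the chapter_text field of one book.
snippets = [
    "blah <em>solr</em> blah <em>solr</em>",
    "nothing matched in this chapter",
    "<em>solr</em> again",
]
counts = count_matches(snippets)
```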

Related

Solr search for matching fields

We are trying to use Solr to search our document contents, however I want to be able to search for fields that match internally. I have looked but cannot find anything on self-referential or inner joins.
So for example:
<doc>
<field name="id">12345</field>
<field name="author">Smith</field>
<field name="last_edit">Smith</field>
...
</doc>
Obviously a (author:Smith AND last_edit:Smith) would work, but I would like to be able to search for all documents where author and last_edit are the same, not necessarily a fixed value. Defining a new field is fine.
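Since a new field is acceptable, one option is to do the comparison at index time and store the result as a boolean. A sketch, assuming documents are prepared as Python dicts before being sent to Solr and that the schema defines a boolean field called author_is_last_editor (both the field name and the helper are hypothetical):

```python
def add_same_editor_flag(doc):
    """Return a copy of doc with a flag saying whether author equals last_edit."""
    author = doc.get("author")
    flagged = dict(doc)
    # A missing author never counts as a match.
    flagged["author_is_last_editor"] = (
        author is not None and author == doc.get("last_edit")
    )
    return flagged


same = add_same_editor_flag({"id": "12345", "author": "Smith", "last_edit": "Smith"})
different = add_same_editor_flag({"id": "12346", "author": "Smith", "last_edit": "Jones"})
```

The search then becomes a plain filter such as fq=author_is_last_editor:true, with no fixed name in the query.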

Solr search using contains, sound like

Problem:
I have movie information in Solr. Two string fields hold the movie title and the director name. A copyField populates another field, which Solr searches by default.
I would like a Google-like search with limited scope, as follows. How can I achieve it?
1) How to search Solr for "contains"
E.g.
a) If the movie director name is "John Cream", searching for joh won't return anything. However, searching for John returns the correct result.
b) If there is a movie title called aaabbb and another one called aaa, searching for aaa returns only one result. I need both results returned.
2) How to account for misspelling
E.g.
If the movie director name is "John Cream", searching for Jon returns no results. Is there a good sounds-like (Soundex) implementation for Solr? If so, how do I enable it?
You can use solr query syntax
Searching for "contains" is possible using wildcards (e.g. title:*aaa* will match 'aaabbb' and also 'cccaaabbb'), but be careful: a leading wildcard cannot use the index efficiently, because Solr has to scan the terms to find matches. Do you really need this?
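If the search term comes from user input, it should be escaped before the wildcards are added, otherwise characters such as ":" or "*" change the meaning of the query. A small sketch of that step (contains_query is a hypothetical helper; the character list follows the special characters of the Lucene query syntax, plus whitespace):

```python
import re

# Characters with special meaning in the Solr/Lucene query syntax, plus whitespace.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')


def contains_query(field, term):
    """Build field:*term* with every special character in term backslash-escaped."""
    escaped = _SPECIAL.sub(r"\\\1", term)
    return f"{field}:*{escaped}*"


print(contains_query("title", "aaa"))
print(contains_query("director", "joh"))
```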
A Soundex-like search is possible by applying the solr.PhoneticFilterFactory filter at both index and query time. To achieve this, define your fieldType like this in the schema:
<fieldType name="text_soundex" class="solr.TextField">
...
<filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="true"/>
</fieldType>
If you define your "director" field as "text_soundex", you'll be able to search for "Jon" and find "John".
See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for more information.
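To see why "Jon" finds "John", it helps to look at the codes a Soundex encoder produces. The sketch below is a plain Python implementation of American Soundex for illustration only; Solr uses its own encoder classes, so edge cases may differ.

```python
# Digit groups of American Soundex; letters not listed (vowels, y, h, w) have no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word):
    """Return the 4-character Soundex code of word, or "" if it has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters that share a code
        code = _CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (encoded + "000")[:4]


print(soundex("John"), soundex("Jon"))  # J500 J500
```

Both spellings reduce to the same code, so with the filter applied at index and query time they end up matching the same indexed token.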
Of the things you are asking, the first is definitely achievable in Solr. I don't know about Soundex.
1) How to search Solr for "contains"
You can store the data in a string field or a text field. On a string field you can get the result with a wildcard search (e.g. field1:John*, unquoted, since a wildcard inside a quoted phrase is taken literally). You should also look into the different types of analyzers. But before anything else, please read the Solr reference: http://wiki.apache.org/solr/.
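# Wildcard search on the title; page_no was undefined in the original and is now a parameter.
def self.get_search_deals(search_q, per = 50, page_no = 1)
  data = Sunspot.search(Deal) do
    fulltext "*#{search_q}*", fields: :title
    paginate page: page_no, per_page: per
  end
  data.results
end

# In the Deal model: index the title as full text.
searchable do
  text :title
end
With this, a search for "sam" is passed on as "*sam*".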

solr index for multi-valued multi-type field

I am indexing a collection of XML documents with the following structure:
<mydoc>
<id>1234</id>
<name>Some Name</name>
<experiences>
<experience years="10" type="Java"/>
<experience years="4" type="Hadoop"/>
<experience years="1" type="Hbase"/>
</experiences>
</mydoc>
Is there any way to create a Solr index that supports the following query:
find all docs with experience type "Hadoop" and years>=3
So far my best idea is to put delimited years||type into multiValued string field, search for all docs with type "Hadoop" and after that iterate through the results to select years>=3. Obviously this is very inefficient for a large set of docs.
I think there is no obvious solution for indexing data coming from the many-to-many relationship. In this case I would go with dynamic fields: http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
Field definition in schema.xml:
<dynamicField name="experience_*" type="integer" indexed="true" stored="true"/>
So, using your example you would end up with something like this:
<mydoc>
<id>1234</id>
<name>Some Name</name>
<experience_Java>10</experience_Java>
<experience_Hadoop>4</experience_Hadoop>
<experience_Hbase>1</experience_Hbase>
</mydoc>
Then you can use the following query (note that the range keyword TO must be upper case): fq=experience_Hadoop:[3 TO *]
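The flattening from the nested experiences into dynamic fields can be done in the indexing client. A minimal sketch, assuming the parsed source document is available as plain Python values and that the type names are safe to use inside a field name (to_solr_doc and min_years_filter are hypothetical helpers):

```python
def to_solr_doc(doc_id, name, experiences):
    """Flatten a list of {type, years} entries into experience_<type> fields."""
    solr_doc = {"id": doc_id, "name": name}
    for exp in experiences:
        solr_doc[f"experience_{exp['type']}"] = exp["years"]
    return solr_doc


def min_years_filter(exp_type, min_years):
    """Build the filter query for 'at least min_years of exp_type'."""
    return f"experience_{exp_type}:[{min_years} TO *]"


solr_doc = to_solr_doc(
    "1234",
    "Some Name",
    [
        {"type": "Java", "years": 10},
        {"type": "Hadoop", "years": 4},
        {"type": "Hbase", "years": 1},
    ],
)
fq = min_years_filter("Hadoop", 3)
```

Because each type gets its own integer field, the range filter runs against the index and no post-filtering of results is needed.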

Searching locations in Solr

I have four pieces of data that I want to make searchable.
Town, City, Postcode, Country
What is the best way to make these searchable in any of the following forms:
London, England
Swindon, Wiltshire, England
Wiltshire, England
England
Wiltshire
Swindon
I could normalise the data, but then I would get duplicate results if someone searched for simply "London".
If I had only "London, England" stored, but not just "London", then if someone searched for "London" it wouldn't find any results.
It's a catch-22. How should one store addresses to allow flexibility when the user is searching?
The best approach would be to use Solr's spatial search features (http://wiki.apache.org/solr/SpatialSearch/), but that requires access to a mapping data service that can return the latitude/longitude of a location, which you then store with the Solr record. Do the same lookup at search time to get the latitude/longitude of the query, and you will be able to do radius searches and get much more accurate results than with text searching on locations.
It is best to follow the suggestion of the previous answer.
You should add a location field and configure schema.xml.
Add to the fieldType section:
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
Add to the field section:
<field name="location" type="location" indexed="true" stored="true" required="true" />
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
Now update your index: solr/dataimport?command=delta-import
You can then make your query, with sfield naming the location field defined above: &q=*:*&fq={!geofilt pt=45.15,-93.85 sfield=location d=5}
http://wiki.apache.org/solr/SpatialSearch
http://wiki.apache.org/solr/SpatialSearchDev
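Conceptually, a geofilt with a point pt and a distance d (in kilometers) keeps the documents whose location lies within d of pt. The sketch below reproduces that idea in plain Python with the haversine great-circle formula; it is an illustration of the radius test, not Solr's actual implementation, and the sample coordinates are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(center, docs, d_km):
    """Keep the docs whose (lat, lon) is at most d_km from center."""
    lat, lon = center
    return [doc for doc in docs if haversine_km(lat, lon, doc["lat"], doc["lon"]) <= d_km]


docs = [
    {"id": "near", "lat": 45.16, "lon": -93.85},  # roughly 1 km north of the center
    {"id": "far", "lat": 46.15, "lon": -93.85},   # roughly 111 km north of the center
]
hits = within((45.15, -93.85), docs, 5)
```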
If you don't have the geospatial data available, you could give hierarchical faceting a try. It indexes the data in a specific manner, allowing queries within a hierarchy, e.g.:
Document: England > London > Chelsea
Index: 0/England, 1/England/London, 2/England/London/Chelsea
Query: facet.field = category, facet.prefix = 1/England, facet.mincount = 1
There is some redundancy in the index, but it should be negligible in most cases.
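The depth-prefixed tokens are easy to generate in the indexing client. A minimal sketch, assuming each location is available as an ordered list from broadest to narrowest (hierarchy_tokens and child_prefix are hypothetical helpers):

```python
def hierarchy_tokens(path):
    """Return one depth-prefixed token per level, e.g. '1/England/London'."""
    tokens = []
    for depth in range(len(path)):
        joined = "/".join(path[: depth + 1])
        tokens.append(f"{depth}/{joined}")
    return tokens


def child_prefix(path):
    """Return the facet.prefix value that lists the direct children of path."""
    joined = "/".join(path)
    return f"{len(path)}/{joined}/"


tokens = hierarchy_tokens(["England", "London", "Chelsea"])
prefix = child_prefix(["England"])
```

Each document indexes all of its tokens into the facet field, and drilling down one level means faceting with the prefix for the current path.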

search in all fields for multiple values?

I have two fields:
title
body
and I want to search for two words
dog
OR
cat
in each of them.
I have tried q=*:dog OR cat
but it doesn't work.
How should I type it?
PS. Could I set the default search field to ALL fields in schema.xml in some way?
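As Mauricio noted, using a copyField (see http://wiki.apache.org/solr/SchemaXml#Copy_Fields) is one way to allow searching across multiple fields without specifying them in the query string. In that scenario, you define the destination field, and then set the fields that get copied into it. The destination should be a tokenized text type rather than string, so that individual words match, and multiValued, since more than one source is copied into it ("text" below stands for whichever analyzed type your schema defines).
<field name="mysearchfield" type="text" indexed="true" stored="false" multiValued="true"/>
...
<copyField source="title" dest="mysearchfield"/>
<copyField source="body" dest="mysearchfield"/>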
Once you've done that, you could do your search like:
q=mysearchfield:dog OR mysearchfield:cat
With the default OR operator, that can be shortened by grouping the terms, so that the field name applies to both of them:
q=mysearchfield:(dog cat)
If "mysearchfield" is going to be your standard search, you can simplify things even further by declaring that field as the defaultSearchField in the schema (later Solr releases deprecate this element in favor of the df request parameter):
<defaultSearchField>mysearchfield</defaultSearchField>
After that, the query would just become:
q=dog cat
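If a copyField is not an option, the same search can be spelled out per field instead. A sketch of building that query string on the client, assuming the terms are already escaped and contain no special characters (any_field_query is a hypothetical helper):

```python
from urllib.parse import urlencode


def any_field_query(fields, terms):
    """Build a query matching any of the terms in any of the fields."""
    group = " OR ".join(terms)
    return " OR ".join(f"{field}:({group})" for field in fields)


q = any_field_query(["title", "body"], ["dog", "cat"])
print(q)  # title:(dog OR cat) OR body:(dog OR cat)

# URL-encode the parameter before appending it to the select URL.
params = urlencode({"q": q})
```

This keeps the schema unchanged at the cost of a longer query, which is usually acceptable for a handful of fields.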
