I have four pieces of data that I want to make searchable.
Town, City, Postcode, Country
What is the best way to make these results searchable by any of the following:
London, England
Swindon, Wiltshire, England
Wiltshire, England
England
Wiltshire
Swindon
I could normalise the data, but then I would get duplicate results if someone searched for simply "London".
If I had only "London, England" stored, but not just "London", then a search for "London" wouldn't find any results.
It's a catch-22. How should one store addresses to allow flexibility when the user is searching?
The best approach would be to use Solr's spatial search features (http://wiki.apache.org/solr/SpatialSearch/), but that requires access to a mapping data service that can return the latitude/longitude of a location, which you then store with the Solr record. Do the same lookup at search time to get the latitude/longitude of the query, and you will be able to do radius searches and get much more accurate results than text searching on locations.
It is best to follow the suggestion of the previous answer.
You should add a location field
and configure schema.xml.
Add to the <fieldType> section:
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
Add to the <field> section:
<field name="location" type="location" indexed="true" stored="true" required="true" />
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
Now update your index: solr/dataimport?command=delta-import
You can then make your query: &q=*:*&fq={!geofilt pt=45.15,-93.85 sfield=location d=5}
http://wiki.apache.org/solr/SpatialSearch
http://wiki.apache.org/solr/SpatialSearchDev
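As a rough illustration of the radius filter, the request parameters could be assembled like this in Python. This is only a sketch: the helper name geofilt_params is made up for this example, and it assumes the spatial field is named location as in the schema lines above.

```python
from urllib.parse import urlencode


def geofilt_params(lat, lon, radius_km, sfield="location"):
    """Build Solr query parameters for a radius filter around a point.

    geofilt_params is a hypothetical helper, not part of any Solr client.
    """
    # {!geofilt} keeps documents whose sfield lies within d kilometres of pt.
    fq = "{{!geofilt pt={},{} sfield={} d={}}}".format(lat, lon, sfield, radius_km)
    return {"q": "*:*", "fq": fq}


params = geofilt_params(45.15, -93.85, 5)
print(params["fq"])
# urlencode takes care of escaping the braces, spaces and asterisks for the URL.
print(urlencode(params))
```

The latitude/longitude passed in would come from the same mapping service used at index time.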
If you don't have the geospatial data available, you could give Hierarchical Faceting a try. It indexes the data in a specific manner, allowing queries within a hierarchy, e.g.:
Document: England > London > Chelsea
Index: 0/England, 1/England/London, 2/England/London/Chelsea
Query: facet.field = category, facet.prefix = 1/England, facet.mincount = 1
There is some redundancy in the index, but it should be negligible in most cases.
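The depth-prefixed tokens can be generated before the document is sent to Solr. A minimal sketch in Python, assuming the hierarchy is already available as an ordered list from broadest to narrowest (hierarchy_tokens is a hypothetical helper name):

```python
def hierarchy_tokens(path):
    """Return one depth-prefixed token per level of the hierarchy.

    path is ordered from the broadest level to the narrowest,
    e.g. ["England", "London", "Chelsea"].
    """
    tokens = []
    for depth in range(len(path)):
        # Each token carries its depth plus the full path down to that level.
        tokens.append("{}/{}".format(depth, "/".join(path[: depth + 1])))
    return tokens


print(hierarchy_tokens(["England", "London", "Chelsea"]))
```

Each returned token would be stored as one value of the multivalued facet field (category in the example query).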
Related
I'm developing a search application using Solr that is required to search 'books' that are split into chapters. A book might look like this:
title: "book title"
author: "mr whoever"
chapters: [
{
title: "some chapter title"
text: "blah blah blah"
},
{
title: "some other title"
text: "blah blah blah"
},
... etc.
]
Requirements for the search:
The user is searching for books not chapters, so the top results must be the most relevant books overall, given all the chapter text inside.
The user needs to see which chapters from a book have matched, information about those chapters and how many matches there were per chapter.
Progress:
Multivalued fields
Solr supports multivalued fields (i.e. multiple chapters per book), but it isn't possible to have two sub-fields (title and text) per value on the book document.
Solr "Join"
I don't know if this is necessary. Each chapter will only be owned by one book so it seems like we could just put them all in one document without too much repetition.
Dynamic fields
Have fields like "chapter1text_txt", "chapter1title_txt" and "chapter2text_txt", for example, and join up the per-chapter information outside of Solr, so Solr doesn't know that "chapter1text_txt" and "chapter1title_txt" are part of the same thing.
What is the proper way of configuring schema.xml to support and search this type of document?
Document structure
So far the best solution has been using multivalued fields for both chapter_title and chapter_text, and enforcing a consistent ordering of these values in the upload documents, so the first chapter_title always corresponds to the first chapter_text and so on.
Here's the section of schema.xml:
<field name="report_title"
type="text_en" indexed="true" stored="true"/>
<field name="chapter_title"
type="text_en" indexed="true" stored="true" multiValued="true"/>
<field name="chapter_text"
type="text_en" indexed="true" stored="true" multiValued="true"/>
This is a compromise because the index cannot know about this relationship between chapter_title and chapter_text, so it is impossible to ask for "chapters with X in the title and Y in the text".
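Because the pairing only exists by position, the client has to put titles and texts back together. A minimal Python sketch, assuming the returned document is a dict with the two multivalued fields from the schema above (pair_chapters is a hypothetical helper):

```python
def pair_chapters(doc):
    """Re-associate chapter titles and texts by their position in the lists."""
    titles = doc.get("chapter_title", [])
    texts = doc.get("chapter_text", [])
    # The scheme only works if the upload kept both lists the same length.
    if len(titles) != len(texts):
        raise ValueError("chapter_title and chapter_text are out of step")
    return [{"title": t, "text": x} for t, x in zip(titles, texts)]


doc = {
    "report_title": "book title",
    "chapter_title": ["some chapter title", "some other title"],
    "chapter_text": ["blah blah blah", "blah blah blah"],
}
for chapter in pair_chapters(doc):
    print(chapter["title"], "->", chapter["text"])
```

The length check is there to catch uploads where one list was built without the other, which would silently shift every later chapter.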
Match Counts
I still haven't found a way of doing this, but I'm considering using highlighting and counting the number of highlighted terms after asking for one large snippet covering the whole document.
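The counting idea could look roughly like this in Python. It is a sketch under two assumptions: the highlighter uses Solr's default <em>...</em> markers, and the snippets for chapter_text come back one per chapter in order, which is not guaranteed by default and would need checking against the highlighting settings in use. count_highlights is a hypothetical helper.

```python
import re

# Assumption: default highlight markers; adjust if hl.simple.pre was changed.
EM_OPEN = re.compile(r"<em>")


def count_highlights(snippets):
    """Return the number of highlighted terms in each snippet."""
    return [len(EM_OPEN.findall(snippet)) for snippet in snippets]


snippets = [
    "the <em>whale</em> surfaced and the <em>whale</em> dived",
    "no match in this chapter",
]
print(count_highlights(snippets))
```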
We are trying to use Solr to search our document contents, however I want to be able to search for fields that match internally. I have looked but cannot find anything on self-referential or inner joins.
So for example:
<doc>
<field name="id">12345</field>
<field name="author">Smith</field>
<field name="last_edit">Smith</field>
...
</doc>
Obviously a (author:Smith AND last_edit:Smith) would work, but I would like to be able to search for all documents where author and last_edit are the same, not necessarily a fixed value. Defining a new field is fine.
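Since defining a new field is acceptable, one option is to compute the comparison before indexing and store the result as a boolean. This is only a sketch of that idea: the field name author_is_last_edit and the helper are made up here, and the schema would need a matching boolean field.

```python
def add_same_editor_flag(doc):
    """Return a copy of doc with a flag saying whether author equals last_edit."""
    flagged = dict(doc)
    author = doc.get("author")
    # A missing author never counts as a match, even if last_edit is missing too.
    flagged["author_is_last_edit"] = author is not None and author == doc.get("last_edit")
    return flagged


print(add_same_editor_flag({"id": "12345", "author": "Smith", "last_edit": "Smith"}))
```

With the flag indexed, the search becomes a plain field query such as author_is_last_edit:true, at the cost of having to reindex a document whenever either field changes.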
I'm having trouble querying on a StrField with a large value (say 70k characters). I'm using Solr 4.4 and have documents with a string type:
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
and field:
<dynamicField name="someFieldName_*" type="string" indexed="true" stored="true" />
Note that it's stored, in case that matters.
Across my documents, the length of the value in this StrField can be up to ~70k characters or more.
The query I'm trying is someFieldName_1:*. If someFieldName_1 has values with length < 32,767 characters, then it works fine and I get back various documents with values in that field.
However, if I query someFieldName_2:* and someFieldName_2 has values with length >= 32,767, I don't get back any documents. Even though I know that many documents have a value in someFieldName_2.
I know this because I query *:* and see documents with (large) values in someFieldName_2.
So is there some type of limit to the length of strings in a StrField that I can query against? 32,767 = 2^15 - 1 is mighty suspicious =)
Yonik answered this question on the Solr user mailing list with, "I believe that's the maximum size of an indexed token...". So it seems like the behavior is somewhat expected.
However, another user has opened up a bug report about the lack of errors, "i'll open a bug to figure out why we aren't generating an error for this at index time, but the behavior at query time looks correct..."
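Until an error is raised at index time, oversized values can be detected on the client before the document is sent. A minimal Python sketch, assuming the limit applies to the UTF-8 encoded length of the value; the default of 32766 bytes is an assumption based on Lucene's maximum term length and should be adjusted to whatever your version actually enforces. oversized_fields is a hypothetical helper.

```python
def oversized_fields(doc, max_bytes=32766):
    """Return the names of string fields whose UTF-8 size exceeds max_bytes."""
    too_long = []
    for name, value in doc.items():
        # Measure encoded bytes, not characters: non-ASCII text takes more room.
        if isinstance(value, str) and len(value.encode("utf-8")) > max_bytes:
            too_long.append(name)
    return sorted(too_long)


doc = {"someFieldName_1": "short value", "someFieldName_2": "x" * 70000}
print(oversized_fields(doc))
```

Fields reported here could be indexed under a tokenized text type instead, or stored without being indexed.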
I am trying to model my database using this example from the Solr wiki.
I have a table called item and a table called features with id, featureName, description.
Here is the updated XML (added featureName):
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document>
<entity name="item" query="select * from item">
<entity name="feature" query="select description, featureName as features from feature where item_id='${item.ID}'"/>
</entity>
</document>
</dataConfig>
Now I get two lists in the xml element
<doc>
<arr name="featureName">
<str>number of miles in every direction the universal cataclysm was gathering</str>
<str>All around the Restaurant people and things relaxed and chatted. The</str>
<str>- Do we have... - he put up a hand to hold back the cheers, - Do we</str>
</arr>
<arr name="description">
<str>to a stupefying climax. Glancing at his watch, Max returned to the stage</str>
<str>air was filled with talk of this and that, and with the mingled scents of</str>
<str>have a party here from the Zansellquasure Flamarion Bridge Club from</str>
</arr>
</doc>
But I would like to see the lists together (using XML attributes) so that I don't have to join the values.
Is it possible?
I wanted to suggest the ScriptTransformer, which gives you the flexibility to alter the data as needed, but it will not work in your case since it operates at the row level.
You can always define an aggregation function for string concatenation in SQL, but you may run into performance issues.
If you were using an HTTP/XML data source, the solution would have been to use the flatten attribute.
Nevertheless, search will work as expected even if you end up with multi-valued fields. The downside is on the client, where you will have to concatenate them before the presentation layer, which is not really a problem if you use some sort of pagination.
I am indexing a collection of XML documents with the following structure:
<mydoc>
<id>1234</id>
<name>Some Name</name>
<experiences>
<experience years="10" type="Java"/>
<experience years="4" type="Hadoop"/>
<experience years="1" type="Hbase"/>
</experiences>
</mydoc>
Is there any way to create a Solr index that would support the following query:
find all docs with experience type "Hadoop" and years>=3
So far my best idea is to put delimited years||type into multiValued string field, search for all docs with type "Hadoop" and after that iterate through the results to select years>=3. Obviously this is very inefficient for a large set of docs.
I think there is no obvious solution for indexing data coming from the many-to-many relationship. In this case I would go with dynamic fields: http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
Field definition in schema.xml:
<dynamicField name="experience_*" type="integer" indexed="true" stored="true"/>
So, using your example you would end up with something like this:
<mydoc>
<id>1234</id>
<name>Some Name</name>
<experience_Java>10</experience_Java>
<experience_Hadoop>4</experience_Hadoop>
<experience_Hbase>1</experience_Hbase>
</mydoc>
Then you can use the following query: fq=experience_Hadoop:[3 TO *]
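The flattening step could be done in the indexing client. A minimal Python sketch, assuming the experiences have already been parsed into dicts with type and years keys (both helper names are hypothetical):

```python
def to_dynamic_fields(doc_id, name, experiences):
    """Flatten experience entries into experience_<type> fields."""
    doc = {"id": doc_id, "name": name}
    for exp in experiences:
        # One dynamic field per experience type, holding the years as an int.
        doc["experience_{}".format(exp["type"])] = int(exp["years"])
    return doc


def min_years_filter(skill, years):
    """Build a filter query for at least `years` of experience in `skill`."""
    # Solr range syntax needs an upper-case TO.
    return "experience_{}:[{} TO *]".format(skill, years)


doc = to_dynamic_fields("1234", "Some Name", [
    {"type": "Java", "years": "10"},
    {"type": "Hadoop", "years": "4"},
    {"type": "Hbase", "years": "1"},
])
print(doc)
print(min_years_filter("Hadoop", 3))
```

This assumes type names are safe to use in field names; values containing spaces or punctuation would need normalising first.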