Solr query based on a string field's subset - search

I'd like to send a string to Solr and have it return all records that are a subset of that string.
The string I send contains integers separated by spaces. I want Solr to return all records where a specific string field is a subset of the numbers I provide in the request string.
An example:
Imagine I have a string field indexed in Solr that is really a set of integers separated by spaces. For example, say the following values of that field are indexed in Solr:
"888110"
"888110 888120"
"888110 888120 888130"
"888110 888120 888130 888140"
"888110 888130 888140"
"888110 888140"
"888140"
"888120 888130"
I want Solr to receive a query with, for example, "888110 888140" and reply with the following records:
"888110"
"888110 888140"
"888140"
If I query by "888110 888120 888130" the retrieved records would be...
"888110"
"888110 888120"
"888110 888120 888130"
"888120 888130"
The retrieved records must be exactly a subset of the numbers provided as a string.
Is it possible to make Solr behave like this?

I'm a bit confused why in the first example "888110" is not returned, but it is in the second example.
Anyway, if I understand what you are trying to do, I would add a new multivalued field and use boolean operators (AND, OR) in the query.
e.g. in the schema:
<field name="code_string" ... />
<field name="codes" ... multiValued="true"/>
so you have a document like
<doc>
<arr name="codes">
<str>811001</str>
<str>811002</str>
</arr>
</doc>
and in your query:
?q=codes:811001 OR codes:811002 OR ....
In my experience with Solr, it is generally cleaner and more maintainable to sacrifice a little memory than to create debilitatingly complex chains of filters.
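One caveat worth illustrating: an OR query on a multivalued field matches any record that shares at least one code with the request, so a record like "888110 888120" would still match a request for "888110 888140". The sketch below, in Python, builds the OR query and then keeps only true subsets on the client side. The function names are hypothetical, the field name `codes` is taken from the schema fragment above, and the client-side filter is an assumption about how to finish the job, not something Solr does for you here.

```python
def build_or_query(field, request_string):
    """Build a Solr query that matches records sharing at least one code."""
    codes = request_string.split()
    return " OR ".join(f"{field}:{code}" for code in codes)


def keep_subsets(request_string, records):
    """Keep only records whose codes all appear in the request string."""
    allowed = set(request_string.split())
    return [record for record in records if set(record.split()) <= allowed]


# The records from the question, as space-separated code strings.
RECORDS = [
    "888110",
    "888110 888120",
    "888110 888120 888130",
    "888110 888120 888130 888140",
    "888110 888130 888140",
    "888110 888140",
    "888140",
    "888120 888130",
]

if __name__ == "__main__":
    print(build_or_query("codes", "888110 888140"))
    for record in keep_subsets("888110 888140", RECORDS):
        print(record)
```

In practice `RECORDS` would be the values returned by the OR query rather than the full list; the full list is used here only to keep the example self-contained.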


Hybris: Use same field for search and facet

I have to use a field "manufacturerName" for both Solr search and Solr faceting in Hybris. Solr free-text search requires the field type to be text, but faceting only works properly with the string type.
Is there any way to use the same field for both search and faceting? I think copyField is one way, but I have searched a lot and still don't know how to use it.
Any help would be highly appreciated!
PS: With the field type string, free-text search doesn't fetch proper results. With the field type text, the facet shows truncated values.
Using a copyField instruction is the way to go, but that requires you to define a second field: you have one field of type text with its associated tokenization, and one field of type string that isn't processed in any way. As far as I know, there is no way in Solr to combine these in a single field.
You'll then use the name of the string field to generate the facets, while you use the other field when you're querying.
<copyField source="text_search_field" dest="string_facet_field" />
You'll then have to refer to the name string_facet_field when you're filtering or faceting on the field. You'll want to filter against the facet field after the user selects a facet, since otherwise documents from other facets could leak into your result set. For example, if the facet was "Foo Bar", you'd suddenly get documents that had "Baz Foo Bar Spam" as the facet, since both words are present in the search string.
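As a rough sketch of how the two fields divide the work at query time, the Python below builds request parameters that search the text field and, once a facet value has been selected, add a filter query against the string field. The function name is hypothetical, the field names are the ones from the copyField line above, and the user query is passed through unescaped for brevity, which a real client should not do.

```python
from urllib.parse import urlencode


def build_search_params(user_query, selected_facet=None):
    """Search the tokenized field; facet and filter on the string field."""
    params = [
        ("q", f"text_search_field:({user_query})"),
        ("facet", "true"),
        ("facet.field", "string_facet_field"),
    ]
    if selected_facet is not None:
        # Quote the value so "Foo Bar" is matched as one exact string.
        escaped = selected_facet.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'string_facet_field:"{escaped}"'))
    return params


if __name__ == "__main__":
    params = build_search_params("foo bar", selected_facet="Foo Bar")
    print(urlencode(params))
```

Because the string field is not tokenized, the quoted filter only matches documents whose whole facet value is "Foo Bar", which is what keeps "Baz Foo Bar Spam" out of the results.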
I was not able to implement the copyField approach, but I found another easy way to do this. In solr.impex I had already added my new field manufacturerNameFacet of type string; the field definition also takes the parameters "fieldValueProvider" and "valueProviderParameter". I set these to "springELValueProvider" and "manufacturerName" (the field I wanted to use for search and facet), respectively. After a full Solr indexing run it worked like a charm. No other setting was required, and both search and faceting worked as expected.

Solr search using contains, sound like

Problem:
I have movie information in Solr. Two string fields hold the movie title and the director name. A copyField populates another field, which Solr searches by default.
I would like a Google-like search with a limited scope, as follows. How can I achieve it?
1) How to search Solr for "contains"
E.g.
a) If the movie director name is "John Cream", searching for joh won't return anything. However, searching for John returns the correct result.
b) If there is a movie title called aaabbb and another one called aaa, searching for aaa returns only one result. I need it to return both results.
2) How to account for misspelling
E.g.
If the movie director name is "John Cream", searching for Jon returns no results. Is there a good sounds-like (Soundex) implementation for Solr? If so, how do I enable it?
You can use the Solr query syntax.
Searching for "contains" is possible using wildcards (e.g. title:*aaa* will match 'aaabbb' and also 'cccaaabbb'), but be careful with it, because a leading wildcard cannot use the index efficiently. Do you really need this?
A Soundex-like search is possible by applying the solr.PhoneticFilterFactory filter to both your index and query analysis. To achieve this, define your fieldType like this in the schema:
<fieldType name="text_soundex" class="solr.TextField">
...
<filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="true"/>
</fieldType>
If you define your "director" field as "text_soundex", you'll be able to search for "Jon" and find "John".
See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for more information.
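To see why "Jon" and "John" end up matching, it can help to look at what a Soundex encoder produces. The following is a small Python sketch of the classic American Soundex rules; it is an illustration only, not the encoder Solr actually uses, and the function name is hypothetical.

```python
# Consonant groups of classic American Soundex; vowels, Y, H and W have no code.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a word ("" if no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets the run of identical codes
    return (encoded + "000")[:4]


if __name__ == "__main__":
    for name in ("John", "Jon", "Cream"):
        print(name, soundex(name))
```

Both "John" and "Jon" reduce to the same code, so when the filter is applied at index and query time the two spellings produce the same token.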
Of the things you are asking about, the first is definitely achievable in Solr. I don't know about Soundex.
1) How to search Solr for "contains"
You can store the data in a string field or a text field. In a string field you can get the result with a wildcard search (e.g. field1:John*). You should also look into the different types of analyzers. But before anything else, please look at the Solr reference: http://wiki.apache.org/solr/.
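# Wrap the search term in wildcards so that "sam" is sent as "*sam*".
def self.get_search_deals(search_q, per = 50, page_no = 1)
  data = Sunspot.search(Deal) do
    fulltext "*#{search_q}*", fields: :title
    paginate page: page_no, per_page: per
  end
  data.results
end

# In the model: index the title as a fulltext field.
searchable do
  text :title
end
For the term sam, the query string sent to Solr is "*sam*".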

Solr search with ranking and best match

I am new to this forum. I am looking for your suggestions on one of our search requirements.
We have names, addresses and other relevant data to search. The search input will be a free-form text string with more than one word. The search API should match the input string against the complete data set, including names, addresses and other data. To do this, I have used copyField to copy all the required fields to a search field in the Solr config, and I search that searchField with the incoming input string. The input search string can contain partial words, as in the example below.
Name: Test Insurance company
Address: 123 Main Avenue, Galaxy city
Phone: 6781230000
After solr creates the index, the searchable field will have the document like below
searchField {
Name: Test Insurance company
Address: 123 Main Avenue, Galaxy city
Phone: 6781230000
}
An end user can enter a search string like "Test Company Main Ave", and the search currently returns the above document, but not at the top; I see other documents being returned too.
I am framing the Solr query as "Test* Company Main Ave", adding a "*" after the first word and searching against searchField.
I followed this approach after searching a few forums on the internet. How can I get the best match at the top? I am not sure the above approach is right.
Any help appreciated.
Thanks,
Ram
You could index all fields separately and also use your searchField as a catch-all.
Use an eDisMax search handler to query all fields with a scoring boost, and also query your catch-all field.
e.g.
<str name="qf">
Name^2.0
Address^1.5
.
.
.
searchField^1.0
</str>
To boost relevancy, you could also index each field twice, once with a string type and once with a text_en type, for example:
<str name="qf">
Name^2.0
Name_exact^5.0
Address^1.5
Address_exact^3.0
.
.
.
searchField^1.0
</str>
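The same boosts can also be sent per request instead of being fixed in the handler configuration. The Python sketch below assembles the `qf` string from the field names and boost values shown above and packs it into eDisMax request parameters. The function name and the dictionary are hypothetical, and the `_exact` fields are assumed to exist in the schema as string copies of the originals.

```python
from urllib.parse import urlencode

# Boosts mirroring the qf block above; exact-match fields weigh the most.
FIELD_BOOSTS = {
    "Name": 2.0,
    "Name_exact": 5.0,
    "Address": 1.5,
    "Address_exact": 3.0,
    "searchField": 1.0,
}


def build_edismax_params(user_input, boosts=FIELD_BOOSTS):
    """Return request parameters for an eDisMax query over boosted fields."""
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {"defType": "edismax", "q": user_input, "qf": qf}


if __name__ == "__main__":
    params = build_edismax_params("Test Company Main Ave")
    print(params["qf"])
    print(urlencode(params))
```

Keeping the boosts in one dictionary makes it easy to try different weightings while comparing result order.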
Technically, if there are documents above the one you want to match, then they are a better match, so it depends on why they are getting a higher relevancy score. Try turning debug on (debugQuery=true) and see where the documents above your preferred document are getting the extra relevancy from.
Once you know why they rank higher, ask yourself why your preferred document should come first: what makes it a "better" match in your eyes?
Once you've decided why it should come top, work out how to index and search the content so that the documents you expect to come first actually do. As qux says in his answer, you may need to index multiple versions of the data to allow for better matching.
Si
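As a rough illustration of reading the debug output, the Python sketch below pairs each of the top hits with its score explanation from a JSON response that was requested with debugQuery=true. The function name is hypothetical, the unique key is assumed to be a field called `id`, and the sample response is made up purely to show the shape of the data.

```python
def explain_top_hits(response, limit=3, id_field="id"):
    """Pair each of the first `limit` documents with its score explanation."""
    docs = response["response"]["docs"][:limit]
    # With debugQuery=true the explanations are keyed by the unique key value.
    explanations = response.get("debug", {}).get("explain", {})
    return [(doc[id_field], explanations.get(doc[id_field], "")) for doc in docs]


# Made-up response fragment, only to show the expected structure.
SAMPLE_RESPONSE = {
    "response": {"docs": [{"id": "7"}, {"id": "3"}]},
    "debug": {"explain": {"7": "explanation for 7", "3": "explanation for 3"}},
}

if __name__ == "__main__":
    for doc_id, explanation in explain_top_hits(SAMPLE_RESPONSE):
        print(doc_id, explanation)
```

Comparing the explanation of the document you expected first with the ones ranked above it usually shows which field or term contributes the extra score.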

Querying on StrFields with large values returns no documents in Solr 4.4

I'm having trouble querying on a StrField with a large value (say 70k characters). I'm using Solr 4.4 and have documents with a string type:
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
and field:
<dynamicField name="someFieldName_*" type="string" indexed="true" stored="true" />
Note that it's stored, in case that matters.
Across my documents, the length of the value in this StrField can be up to ~70k characters or more.
The query I'm trying is someFieldName_1:*. If someFieldName_1 has values with length < 32,767 characters, then it works fine and I get back various documents with values in that field.
However, if I query someFieldName_2:* and someFieldName_2 has values with length >= 32,767, I don't get back any documents. Even though I know that many documents have a value in someFieldName_2.
I know this because I query *:* and see documents with (large) values in someFieldName_2.
So is there some type of limit to the length of strings in a StrField that I can query against? 32,767 = 2^15 - 1 is mighty suspicious =)
Yonik answered this question on the Solr user mailing list with, "I believe that's the maximum size of an indexed token...". So it seems like the behavior is somewhat expected.
However, another user has opened up a bug report about the lack of errors, "i'll open a bug to figure out why we aren't generating an error for this at index time, but the behavior at query time looks correct..."
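Since no error is raised at index time, one workaround is to check values on the client before sending them. The Python sketch below flags string fields whose UTF-8 encoding exceeds the term limit. The function name is hypothetical, and the limit of 32,766 bytes is an assumption based on Lucene's maximum indexed term length, which matches the threshold observed in the question; check it against the Lucene version you run.

```python
# Assumed limit: Lucene's maximum indexed term length, counted in UTF-8 bytes.
MAX_TERM_BYTES = 32766


def oversized_fields(doc, limit=MAX_TERM_BYTES):
    """Return the names of string fields too long to index as a single term."""
    return [
        name
        for name, value in doc.items()
        if isinstance(value, str) and len(value.encode("utf-8")) > limit
    ]


if __name__ == "__main__":
    doc = {
        "id": "1",
        "someFieldName_1": "a" * 32766,
        "someFieldName_2": "a" * 70000,
    }
    print(oversized_fields(doc))
```

The length is measured in encoded bytes rather than characters, so text with non-ASCII characters can hit the limit with fewer than 32,766 characters.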

Solr - index JSON query string from database?

I would like to know if it is possible to index data that contains a JSON string, decoding it so that each JSON value is indexed as a separate field.
I am using the DIH to connect to a MySQL database and am able to index the individual columns.
The result would look like the following:
<response name="response" numFound="1" start="0" maxScore="2.7143538">
...
<result name="response" numFound="1" start="0" maxScore="2.7143538">
<doc>
<float name="score">2.7143538</float>
<str name="id">82</str>
<str name="name">jorge</str>
<str name="otherinfo">{"day":15,"year":1989,"month":"January"}</str>
</doc>
</result>
</response>
The problem is that "otherinfo" is a JSON string that I would like to decode and have something like the following in my index:
<response name="response" numFound="1" start="0" maxScore="2.7143538">
...
<result name="response" numFound="1" start="0" maxScore="2.7143538">
<doc>
<float name="score">2.7143538</float>
<str name="id">82</str>
<str name="name">jorge</str>
<str name="day">15</str>
<str name="year">1989</str>
<str name="month">January</str>
</doc>
</result>
</response>
Would this be possible to do at all with Solr?
Thanks in advance
I commented on this. I decided that I should answer instead.
The fix for your issue isn't at the Solr level. You shouldn't be storing your data this way in the DB to begin with. In the long run, it would be better to fix this problem there, as opposed to trying to hack this at the Solr indexing level.
Your question proves that someone, probably an end user, is interested in searching by this data. This implies that it should probably be stored in the database as an actual Date or Timestamp field so that it can be properly selected or sorted on.
I'm sure people won't like that this doesn't exactly answer your question, but someone needs to tell you this.
If you know your way around Java you could write your own, custom transformer that would handle your specific case.
Have you tried using DIH RegexTransformer to parse JSON?
I think that should be doable, especially if you have fixed json format (doesn't contain document in document in document in ...).
I've just noticed ScriptTransformer, which allows you to write your own parser. I think this is the way to go...
Is the otherinfo field in the DB a JSON string to start with?
You would need dynamic fields and client-side code to let Solr store data with an arbitrary schema.
You would need to define dynamic fields in your schema like:
dyn_string_*: store text as it is
dyn_text_*: store text and index it for search
etc.
Then you will need to tell DIH to map DB fields to solr dynamic fields (pseudocode warning; sorry, but I am not familiar with DIH):
SELECT
  day AS dyn_number_day,
  name AS dyn_text_name
FROM
  tablename
Edit
You do have the requirement to query into the data structure. This needs a schema-less datastore.
Document DBs like MongoDB offer exactly this functionality: they store data in arbitrary fields you determine at insert time, and can run any kind of ad hoc query on your data.
I am not aware of a request handler that can index your data for that. You can write code that fetches updated (or added or removed) rows periodically, decodes the JSON field and indexes it to Solr.
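The decoding step of such client-side code could look roughly like the Python sketch below. It takes one database row as a dictionary, parses the JSON column and merges its keys into the document as separate fields. The function name is hypothetical, the column name `otherinfo` and the sample row come from the question, and fetching rows from MySQL and posting the result to Solr are left out.

```python
import json


def flatten_row(row, json_column="otherinfo"):
    """Merge the keys of a JSON column into the row as separate fields."""
    doc = {key: value for key, value in row.items() if key != json_column}
    extra = json.loads(row[json_column]) if row.get(json_column) else {}
    for key, value in extra.items():
        if key in doc:
            # Refuse to silently overwrite a real column with a JSON key.
            raise ValueError(f"JSON key {key!r} clashes with a column name")
        doc[key] = value
    return doc


if __name__ == "__main__":
    row = {
        "id": "82",
        "name": "jorge",
        "otherinfo": '{"day":15,"year":1989,"month":"January"}',
    }
    print(flatten_row(row))
```

The resulting dictionary has day, year and month as top-level fields, which is the document shape asked for in the question; the schema still needs matching fields (or dynamic fields) for them.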
I recommend a skinny data model to store attributes as properties independent of the current DB schema. I asked a question, 'Set intersection in MySQL: a clean way', a while back.
Recap: MongoDB and friends contain exactly the functionality you need. If you want relations and referential integrity, you can keep using RDBMS. If you still want that JSON thing, develop an active system that will parse it and index it to solr. But I recommend moving to a skinny data model, since you can get the same (conditions apply!) query capabilities that Solr gives you by SQL.
Exotic technology: Graph databases like Neo4j contain document database functionality (ad-hoc queries) and relations: a relation directly links one node to another, no joins involved. So it's just one step short of referential integrity.
