Controlling how elasticsearch fields are tokenized for faceting - search

I'm new to elasticsearch (and the underlying Lucene engine).
We're storing some metadata about documents, e.g. a single document might be described as:
UniqueHash: ABC123
CreatedBy: John Smith
ApplicationName: MSExcel
ContentType: application/vnd.ms-excel
WordCount: 7000
...
This all works very well for indexing/searching but when we come to faceting, things get interesting.
Faceting on (say) CreatedBy would return
John: 1
Smith: 1
or on ContentType
application: 1
vnd.ms: 1
excel: 1
Neither of these is desirable. I have no direct control over the contents of the field (that is to say, I can't change the underlying data). I can perform a transform on the way in, but that would mean storing dodgy data just so that searching works as expected, which feels like the wrong approach.
How can I convince elasticsearch to treat the whole contents of each field (or at least of specified fields) as the value to use for faceting?

You can index your field twice using the Multi Field Type. After reindexing, you will be able to continue using the analyzed version of your field for search, and use the "untouched" (not analyzed) version of the field for facets.
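As a rough illustration of that answer, the sketch below builds a mapping in which each field is indexed twice. It assumes the legacy multi_field syntax that the answer refers to; later Elasticsearch versions replaced it with a `fields` parameter on the main field, so treat the exact keys as version-dependent. The type name `document` and the helper `multi_field_mapping` are made up for the example.

```python
import json


def multi_field_mapping(field_names, doc_type="document"):
    """Build a legacy multi_field mapping (hypothetical helper, not an ES API)."""
    properties = {}
    for name in field_names:
        properties[name] = {
            "type": "multi_field",
            "fields": {
                # Default sub-field: analyzed, used for full-text search.
                name: {"type": "string", "index": "analyzed"},
                # Extra sub-field: indexed as one term, used for faceting.
                "untouched": {"type": "string", "index": "not_analyzed"},
            },
        }
    return {doc_type: {"properties": properties}}


mapping = multi_field_mapping(["CreatedBy", "ContentType"])
print(json.dumps(mapping, indent=2))
```

With a mapping along these lines, a facet on `CreatedBy.untouched` would count "John Smith" as a single value, while searches against `CreatedBy` keep using the tokenized form.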

Related

Can Elasticsearch search long documents?

I have a study project on identifying text content, and it must use JS. The input is a paragraph of at least 15 lines, which is searched for in 100 text files of 3 to 5 pages each. The output is the text file that has the same content as the input text.
Can Elasticsearch solve this? Or can you recommend some other solutions?
I found a blog entry at https://ambar.cloud/blog/2017/01/02/es-large-text/ that answers your question. It has an in-depth example similar to yours.
Elasticsearch can deal with large documents and still deliver good performance, but for cases like yours it's important to set up the index correctly.
Let's suppose you have Elasticsearch documents with a text field holding 3 to 5 pages' worth of text.
When you query for documents that contain a paragraph in the large text field, Elasticsearch will search through all the terms from all the documents and their fields, including the large text field.
During the merge, Elasticsearch collects all the found documents into memory, including the large text field. After building the results in memory, Elasticsearch will try to send these large documents as a single JSON response. This is very expensive in terms of performance.
Elasticsearch should handle the large text field separately from the other fields. To do this, set the parameter store:true for the large text field in the index mapping. This tells Elasticsearch to store the field separately from the document's other fields. You should also exclude the large text field from _source by adding this parameter to the index mapping:
"_source": {
  "excludes": [
    "your_large_text_field"
  ]
}
If you set up your indexes this way, the large text field will be separated from _source. Querying the large text field is now much more efficient, since it is stored separately and there is no need to merge it with _source.
To conclude: yes, Elasticsearch can handle searching large text fields, and with some extra settings it can increase the search performance by 1100 times.
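To make the two settings concrete, here is a minimal sketch of an index-creation body that combines them. It assumes the typeless mapping layout of recent Elasticsearch versions (older versions nest the mapping under a type name) and a `text` field type; the helper name `large_text_index_body` is invented for the example.

```python
import json


def large_text_index_body(field_name):
    """Build an index body that keeps a large field out of _source (sketch)."""
    return {
        "mappings": {
            # Keep the large field out of the _source document.
            "_source": {"excludes": [field_name]},
            "properties": {
                # Store the field on its own so it can still be fetched.
                field_name: {"type": "text", "store": True},
            },
        }
    }


body = large_text_index_body("your_large_text_field")
print(json.dumps(body, indent=2))
```

Because the field is no longer part of _source, it would have to be requested explicitly (for example as a stored field) whenever its content is actually needed in a response.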

Lucene wildcard applied to indexed field

I have a set of indexed fields such as these:
submitted_form_2200FA17-AF7A-4E44-9749-79D3A391A1AF:true
submitted_form_2398389-2-32-43242423:true
submitted_form_54543-32SDf-3242340-32422:true
I understand that it's possible to run wildcard queries such as
submitted_form_2398389-2-32-43242423:t*e
What I'm trying to do is get "any" submitted form via something like:
submitted_form_*:true
Is this possible? Or will I have to do a stream of "OR"s on the known forms (which seems quite heavy)?
That's not the intended use of fields, I think. Field names aren't supposed to be the searchable values, field values are. Field names are supposed to be known a priori.
My suggestion is (if possible) to store the second part of the name as the field value, for instance: submitted_form:2398389-2-32-43242423. submitted_form would be the field known a priori, and the value could then be searched with a PrefixQuery.
Anyway, you could access the collection of field names using IndexReader.getFieldNames() in Lucene 3.x and its equivalent in Lucene 4.x. I wouldn't expect good search performance from that approach, though.
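The restructuring suggested above could be done at indexing time. The sketch below is plain Python rather than Lucene code: it moves the form ids out of the field names and into the values of one known field. The helper `normalize_fields` and the dict-based document are assumptions made for illustration.

```python
PREFIX = "submitted_form_"


def normalize_fields(doc):
    """Move form ids from field names into values of a single known field."""
    normalized = {}
    form_ids = []
    for name, value in doc.items():
        if name.startswith(PREFIX):
            # Only forms flagged "true" are kept as submitted.
            if value == "true":
                form_ids.append(name[len(PREFIX):])
        else:
            normalized[name] = value
    if form_ids:
        # One multi-valued field whose name is known in advance.
        normalized["submitted_form"] = sorted(form_ids)
    return normalized


doc = {
    "submitted_form_2398389-2-32-43242423": "true",
    "submitted_form_54543-32SDf-3242340-32422": "true",
}
print(normalize_fields(doc))
```

After this change, "any submitted form" becomes a question about whether the single `submitted_form` field has a value, instead of a wildcard over field names.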

Solr search with ranking and best match

I am new to this forum. I am looking for your suggestions on one of our search requirements.
We have names, addresses and other relevant data to search. The search input is going to be a free-form text string with more than one word. The search API should match the input string against the complete data set, including names, addresses and other data. To achieve this, I have used copyField to copy all the required fields to a search field in the Solr config, and I search that searchField against the incoming input string. The input search string can contain partial words, as in the example below.
Name: Test Insurance company
Address: 123 Main Avenue, Galaxy city
Phone: 6781230000
After Solr creates the index, the searchable field will contain the document as below:
searchField {
Name: Test Insurance company
Address: 123 Main Avenue, Galaxy city
Phone: 6781230000
}
An end user can enter a search string like "Test Company Main Ave", and the search currently returns the above document, but not at the top; I see other documents being returned too.
I am framing the Solr query as "Test* Company Main Ave", adding a "*" after the first word and querying against the searchField.
I followed this approach after searching a few forums on the internet. How can I get the best match at the top? I'm not sure the above approach is right.
Any help appreciated.
Thanks,
Ram
You could index all fields separately and also use your searchField as a catch-all.
Use an Edismax search handler to query all fields with a scoring boost, and also query your catch-all field.
e.g.
<str name="qf">
Name^2.0
Address^1.5
.
.
.
searchField^1.0
</str>
To boost relevancy, you could also index each field twice, once with a string type and once with a text_en type, and give the exact-match fields a higher boost:
<str name="qf">
Name^2.0
Name_exact^5.0
Address^1.5
Address_exact^3.0
.
.
.
searchField^1.0
</str>
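The same boosts can also be passed as request parameters instead of being fixed in the handler config. The sketch below builds such a request in Python, assuming the standard `defType`, `qf` and `debugQuery` parameters and the field names from the snippets above; the `/select` path and the helper `edismax_params` are assumptions for illustration.

```python
from urllib.parse import urlencode


def edismax_params(user_input, boosts, debug=False):
    """Build Solr request parameters for an edismax query (sketch)."""
    params = {
        "defType": "edismax",
        "q": user_input,
        # qf is a space-separated list of field^boost pairs.
        "qf": " ".join(f"{field}^{boost}" for field, boost in boosts.items()),
    }
    if debug:
        # Ask Solr to explain how each document's score was computed.
        params["debugQuery"] = "true"
    return params


boosts = {
    "Name": 2.0,
    "Name_exact": 5.0,
    "Address": 1.5,
    "Address_exact": 3.0,
    "searchField": 1.0,
}
params = edismax_params("Test Company Main Ave", boosts, debug=True)
print("/select?" + urlencode(params))
```

The debug flag is included because the score explanation is usually the quickest way to see which field pushed an unexpected document to the top.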
Technically, if there are documents above the one you want to match, then they are a better match, so it depends on why they are getting a higher relevancy score. Try turning debug on and see where the documents above your preferred document are getting the extra relevancy from.
Once you know why they are ranking higher, you need to ask yourself why your preferred document should come first, and what makes it a "better" match in your eyes.
Once you've decided why it should come top, you need to work out how to index and search the content so that the documents you expect to come first actually do come first. As qux says in his answer, you may need to index multiple versions of the data to allow for better matching, etc.
Si

Does Solr store the original contents of the document after indexing?

If I mark a field as "don't store," does Solr retain the original contents of that field anywhere, or does it only retain the "bag of words" that it culls for the index itself?
I'm asking from the standpoint of document security. If someone cracked into the machine running our Solr index, could they get the original text passed into Solr for this "don't store" field, or not?
No, the Solr index does not store the original value in any retrievable or viewable way for fields that are set to stored="false". The Common Field Options section of the Solr wiki describes the stored option as follows:
True if the value of the field should be retrievable during a search
If someone cracked into the machine running the Solr index and ran Solr queries, they would not be able to see the contents of the field, as Solr would not return that field. However, if they had access to the disk and to the actual index folder and segment files as written by Lucene, they could see the terms that Solr stored for each document in that field by using Luke (the Lucene Index Toolbox) to examine the index folder.
When a field is not stored (Field.Store.NO in Lucene), only enough information is stored for Lucene to perform the search.
However, if you specify the WITH_POSITIONS_OFFSETS term vector option when constructing each field, there is usually enough information to retrieve:
lowercase(EXACTSTRINGINDEXED) - LUCENEDELIMITERS - STOPWORDS
For example, if you indexed:
Jerry&Mary's Live Bait and Yellow Cab
with an analyzer that treats "&" and "'" as delimiters, did not index single letters, and treated 'and' as a stopword, you would see in the index something like:
jerry mary live bait [null word] yellow cab
(You can verify this with Luke, as mentioned above.)
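To see how much of the original text survives, the example above can be reproduced with a toy analyzer. This is a simplified stand-in written in plain Python, not Lucene's actual analysis chain; the delimiter rule, the stopword list and the `[null word]` placeholder are assumptions taken from the example.

```python
import re

STOPWORDS = {"and"}
PLACEHOLDER = "[null word]"


def toy_analyze(text):
    """Lowercase, split on non-alphanumerics, drop single characters,
    and leave a placeholder where a stopword was removed."""
    tokens = []
    for token in re.split(r"[^a-z0-9]+", text.lower()):
        if len(token) <= 1:
            continue  # skips empty strings and single characters
        tokens.append(PLACEHOLDER if token in STOPWORDS else token)
    return tokens


print(" ".join(toy_analyze("Jerry&Mary's Live Bait and Yellow Cab")))
# prints: jerry mary live bait [null word] yellow cab
```

The word order and most of the words are recoverable, which is why positions and offsets in the index should be treated as sensitive even when the field itself is not stored.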

filtering results in solr

I'm trying to build auto suggest functionality using Solr. The index contains different locations within a city and looks something like
id: unique id
name: the complete name
type: can be one of 'location_zone', 'location_subzone', 'location_city', 'outlet', 'landmark' ...
city: city id
Now, when the user types something, I want it to return suggestions only from the current city and of type location_*, something similar to WHERE city_id = 1 AND type LIKE 'location_%' in SQL.
I guess one way to do it is by faceting, but is that the right way? Will it still search in all documents and then filter the results, or will it apply the condition first, as MySQL would?
PS: I'm new to Solr and would appreciate it if you could point out any mistakes in the approach.
Solr does provide filtering, using the fq parameter. What you're looking for should be something along the lines of:
&fq=city:1&fq=type:location_*&q=...
This page illustrates very well how and when to use filter queries in Solr.
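As a sketch of how such a request might be assembled, the snippet below builds the parameter list in Python. It assumes the field is named `city` as in the question's schema, and it assumes a simple prefix query on `name` for the suggestion text; `suggest_params` is a hypothetical helper, and real user input would need escaping before being placed in a query.

```python
from urllib.parse import urlencode


def suggest_params(user_text, city_id):
    """Build Solr parameters: one main query plus two filter queries."""
    return [
        # Main query: prefix match on what the user has typed so far.
        ("q", f"name:{user_text}*"),
        # Each fq restricts the result set without affecting scoring.
        ("fq", f"city:{city_id}"),
        ("fq", "type:location_*"),
    ]


params = suggest_params("mai", 1)
print(urlencode(params))
```

Keeping the two conditions as separate fq parameters lets Solr cache each filter independently, so the city filter can be reused across requests for the same city.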

Resources