I've got data coming in from Logstash that's being analyzed in an overeager manner. Essentially, the field "OS X 10.8" would be broken into "OS", "X", and "10.8". I know I could just change the mapping and re-index for existing data, but how would I change the default analyzer (either in ElasticSearch or LogStash) to avoid this problem in future data?
Concrete Solution: I created a mapping for the type before I sent data to the new cluster for the first time.
Solution from IRC: Create an Index Template
According to this page, analyzers can be specified per-query, per-field, or per-index.
At index time, Elasticsearch will look for an analyzer in this order:
The analyzer defined in the field mapping.
An analyzer named default in the index settings.
The standard analyzer.
At query time, there are a few more layers:
The analyzer defined in a full-text query.
The search_analyzer defined in the field mapping.
The analyzer defined in the field mapping.
An analyzer named default_search in the index settings.
An analyzer named default in the index settings.
The standard analyzer.
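To make the first couple of layers above concrete, here is a minimal sketch (the type name "logs" and the field name "os" are made up for illustration, using the old string field type): the "os" field gets the keyword analyzer at both index and search time, so a value like "OS X 10.8" stays a single token, while every other field falls back to the analyzer named default in the index settings. The search_analyzer here is redundant, since analyzer already covers both, but it shows where the query-time override lives.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "logs": {
      "properties": {
        "os": {
          "type": "string",
          "analyzer": "keyword",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}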
On the other hand, this page points out an important thing:
An analyzer is registered under a logical name. It can then be referenced from mapping definitions or certain APIs. When none are defined, defaults are used. There is an option to define which analyzers will be used by default when none can be derived.
So the only way to define a custom analyzer as the default is to override one of the pre-defined analyzers, in this case the default analyzer. That means we cannot use an arbitrary name for our analyzer; it must be named default.
Here is a simple example of such an index setting:
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"char_filter": {
"charMappings": {
"type": "mapping",
"mappings": [
"\\u200C => "
]
}
},
"filter": {
"persian_stop": {
"type": "stop",
"stopwords_path": "stopwords.txt"
}
},
"analyzer": {
"default": {<--------- analyzer name must be default
"tokenizer": "standard",
"char_filter": [
"charMappings"
],
"filter": [
"lowercase",
"arabic_normalization",
"persian_normalization",
"persian_stop"
]
}
}
}
}
}
As you know, Elasticsearch uses the standard analyzer when no analyzer is specified explicitly. So when setting up your index templates, you can register your own custom analyzer as the default, and that is where you define your own rules for the analyzer, tokenizer, and token filters.
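For the Logstash case in the question, a minimal sketch of such a template might look like the following. This is an assumption-laden example rather than the exact fix: the template name, the logstash-* pattern, and the old-style PUT _template API are guesses based on a typical Logstash setup, and the keyword tokenizer plus lowercase filter simply keeps values like "OS X 10.8" as a single lowercased token, which may or may not be what you want as the default for every field.
PUT _template/logstash_default_analyzer
{
  "template": "logstash-*",
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
Because the template is applied whenever a new logstash-* index is created, future daily indices pick up the analyzer without re-indexing existing data.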
Here are some links that will help you understand this better:
http://elasticsearch-users.115913.n3.nabble.com/How-we-can-change-Elasticsearch-default-analyzer-td4040411.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html
According to Avro Schema specification (for Unions): https://avro.apache.org/docs/current/spec.html
Unions
Unions, as mentioned above, are represented using JSON arrays. For example, ["null", "string"] declares a schema which may be either a null or string.
(Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing "null", the "null" is usually listed first, since the default value of such unions is typically null.)
It appears from the standard that, when declaring a union with a default value, the type of the default must be listed as the first branch of the union, with the other data types after it.
In our product, we are using Avro encoding with the following Schema:
{
"name": "data",
"type": {
"name": "data",
"type": "record",
"fields": [
{
"name": "data_asset",
"type": ["string", "null"],
"default": null,
"doc": "The serialized JSON describing the entity - can be null for special cases"
}
]
}
}
What we have found is that, while unions have a "must" requirement that the type of the default value matches the first branch, no errors are thrown by the schema validator when we reverse the order (["string", "null"], as shown above).
The question I have is:
Why does the validation pass, even though it is "incorrect" as per the standard?
This is a case where the implementation doesn't match the specification. Some libraries might implement this check and so it's probably best to make sure your schema matches the specification even if the specific library you are using doesn't check it.
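For reference, a spec-compliant version of the schema from the question would only change the order of the union branches so that "null" comes first and matches the default:
{
  "name": "data",
  "type": {
    "name": "data",
    "type": "record",
    "fields": [
      {
        "name": "data_asset",
        "type": ["null", "string"],
        "default": null,
        "doc": "The serialized JSON describing the entity - can be null for special cases"
      }
    ]
  }
}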
I set up blob indexing and full-text searching for Azure as described in this article: Indexing Documents in Azure Blob Storage with Azure Search.
Some of my documents are failing in the indexer, returning the following error:
Field 'content' contains a term that is too large to process. The max length for UTF-8 encoded terms is 32766 bytes. The most likely cause of this error is that filtering, sorting, and/or faceting are enabled on this field, which causes the entire field value to be indexed as a single term. Please avoid the use of these options for large fields.
The particular pdf that is producing this error is 3.68 MB, and contains a variety of content (text, tables, images, etc).
The index and indexer are set up exactly as described in that article, with the addition of some file type restrictions.
Index:
{
"name": "my-index",
"fields": [{
"name": "id",
"type": "Edm.String",
"key": true,
"searchable": false
}, {
"name": "content",
"type": "Edm.String",
"searchable": true
}]
}
Indexer:
{
"name": "my-indexer",
"dataSourceName": "my-data-source",
"targetIndexName": "my-index",
"schedule": {
"interval": "PT2H"
},
"parameters": {
"maxFailedItems": 10,
"configuration": {
"indexedFileNameExtensions": ".pdf,.doc,.docx,.xls,.xlsx,.ppt,.pptx,.html,.xml,.eml,.msg,.txt,.text"
}
}
}
I tried searching through their docs and some other related articles, but I couldn't really find any information. I'm guessing this is because this feature is still in preview.
There's a limit on the size of a single term in the search index, and it also happens to be 32 KB. If the content field in your search index is marked as filterable, facetable, or sortable, then you'll hit this limit (regardless of whether the field is marked as searchable or not). Typically, for large searchable content you want to enable searchable and sometimes retrievable, but not the rest. That way you won't hit limits on content length from the index side.
Please see this answer for more context as well.
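Applied to the index from the question, a sketch of a definition that keeps content searchable and retrievable while explicitly leaving the other attributes off could look like this (attribute defaults vary between API versions, so spelling them out here is an assumption on the safe side):
{
  "name": "my-index",
  "fields": [{
    "name": "id",
    "type": "Edm.String",
    "key": true,
    "searchable": false
  }, {
    "name": "content",
    "type": "Edm.String",
    "searchable": true,
    "retrievable": true,
    "filterable": false,
    "sortable": false,
    "facetable": false
  }]
}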
tldr;
How do I match and filter a localized search against a localized index?
long version
I have an application where the user's search must be done in the context of their language.
In my Elasticsearch index, I want documents with both i18n and non-i18n properties (I want to avoid creating multiple indices, one per language).
The mapping of the document should look like this:
'entry': {
'properties': {
'name' : {'type': 'string'}, /* unlocalized properties */
'category': { /* localized properties */
"properties" : {
"lang_fr" : {
"type" : "string"
},
"lang_de" : {
"type" : "string"
}
}
}
}
}
Given that mapping, I have two requirements:
1) Matching: when doing a search, exclude the localized fields that do not match the user's language (say the user's language is 'fr'; I want to exclude the 'de' fields from the search). How can I do this without specifying the entire list of fields to search on? To start simple, I tried this, but it doesn't work:
{
"query": {
"match": {
"*.lang_fr": "full_text"
}
}
}
However, "categories.lang_fr": "full_text" works well. But I don't want to maintain the list of fields in the query. I want a general rule like you can do in SolR.
2) Filtering: when I retrieve my results, I want to filter out all localized fields that doesn't corresponds to my user language. In other words, using the source filter, I'd like to have all unlocalized fields, exclude all fields starting with "lang" , but include all fields being 'lang_fr'. I tried the following but it doesn't work:
{
"_source": {
"include": [ "*", "*.lang_fr" ],
"exclude": [ "*.lang_*" ],
}
...}
The wildcard operator doesn't seem to work. I partially get what I want if I specify "categories.lang_de", but again, I don't want to maintain the list of fields; I want a generic rule. The include/exclude operation doesn't work as I would like. The only thing that actually works is a query where I exclude every language explicitly for every field, such as:
{
"_source": {
"exclude": [ "categories.lang_de", "categories.lang_en", "categories.lang_it",
"another_field.lang_de", "catanother_fieldgories.lang_en", "another_field.lang_it"],
}
...}
for an 'fr' search.
I'm quite surprised I couldn't find anything on Google. I see it as a very standard case of i18n applied to Elasticsearch. Maybe I'm modeling i18n the wrong way in ES?
Thank you in advance!
You can achieve the first one using a query_string query, which takes advantage of the powerful Lucene expression language and allows you to specify wildcards in field names:
{
"query": {
"query_string": {
"query": "\\*.lang_fr:full_text"
}
}
}
Or you can specify the field names in the fields parameter, like this:
{
"query": {
"query_string": {
"query": "full_text"
"fields": ["*.lang_fr"]
}
}
}
As for your second one, source filtering is indeed the way to go but I suggest simply excluding all languages but the one you're searching for. For instance, if the search is in French, you'd simply exclude all other languages without necessarily having to enumerate all the fields, just all the languages that you don't want (which would be much less). That would allow you to add localized fields as you go without having to change the query.
{
"_source": {
"exclude": [ "*.lang_de", "*.lang_it" ],
}
...}
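Putting the two parts together, a single search request for an 'fr' user could look roughly like the sketch below. The field names follow the mapping in the question ("name" is the unlocalized field), and the exclude list would need to cover whatever languages your documents actually contain.
{
  "_source": {
    "exclude": [ "*.lang_de", "*.lang_it" ]
  },
  "query": {
    "query_string": {
      "query": "full_text",
      "fields": ["name", "*.lang_fr"]
    }
  }
}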
The Lucene documentation mentions that:
If the documents you are indexing are very large, Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors,
though we can configure this via IndexWriter.setMaxFieldLength(int).
I created an index in elasticsearch - http://localhost:9200/twitter and posted a document with 40,000 terms in it.
Mapping:
{
"twitter": {
"mappings": {
"tweet": {
"properties": {
"filter": {
"properties": {
"term": {
"properties": {
"message": {
"type": "string"
}
}
}
}
},
"message": {
"type": "string",
"analyzer": "standard"
}
}
}
}
} }
I indexed a document whose message field has 40,000 terms - message: "text1 text2 .... text40000".
Since the standard analyzer splits on whitespace, it has indexed 40,000 terms.
My question is: does Elasticsearch set a limit on the number of terms indexed in Lucene? If yes, what is that limit?
If not, how did all 40,000 of my terms get indexed? It shouldn't have indexed more than 10,000 terms.
The source you're citing doesn't seem up-to-date, as IndexWriter.setMaxFieldLength(int) was deprecated in Lucene 3.4 and now isn't available anymore in Lucene 4+, which ES is based on. It's been replaced by LimitTokenCountAnalyzer. However, I don't think such a limit exists anymore, or at least it is not set explicitly within the Elasticsearch codebase.
The only limit you might encounter while indexing documents would be related to either the HTTP payload size or Lucene's internal buffer size, as explained in this post.
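If you actually want to impose a cap on the number of tokens indexed per field, Elasticsearch exposes Lucene's limit token count filter as a token filter of type "limit". A minimal sketch of index settings using it in the default analyzer (the filter name and the 10,000 count are just illustrative):
{
  "settings": {
    "analysis": {
      "filter": {
        "limit_10k": {
          "type": "limit",
          "max_token_count": 10000
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "limit_10k"]
        }
      }
    }
  }
}
With this in place, only the first 10,000 tokens of each field analyzed by the default analyzer end up in the index; the full document is still kept in _source.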
With Elasticsearch I have created an index using a custom mapping and a custom set of analyzers; however, I'm not able to run query searches on the _all field.
I'm using these analyzers:
{
"analysis": {
"analyzer": {
"case_insensitive": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
],
"char_filter": "punctuation"
}
},
"char_filter": {
"punctuation": {
"type": "mapping",
"mappings": [
".=>\\u0020",
"-=>\\u0020",
"_=>\\u0020"
]
}
}
}
}
and this mapping:
{
"article": {
"_all": {
"enabled": true,
"store": "yes",
"index_analyzer": "case_insensitive",
"search_analyzer": "case_insensitive"
},
"properties": {
"title": {
"type": "string",
"index": "analyzed"
},
"subtitle": {
"type": "string",
"analyzer": "case_insensitive"
},
"comment": {
"type": "string",
"index": "not_analyzed"
},
"review": {
"type":"string",
"index": "not_analyzed",
"include_in_all":false
}
}
}
}
Then I add a document like this:
{
"title": "This is the story of a wonderful man.",
"subtitle":"A man goes on vacation in the worst place possible.",
"comment": "I like the movie very much, however I did not undertand it.",
"review":"Very well"
}
and I expect 3 out of the 4 fields to be included in _all, namely title, subtitle and comment.
The analyzer is working as following (tested using the analyzer test in elasticsearch):
"I like the movie very much, however I did not undertand it." -> "i like the movie very much, however i did not undertand it "
"This is the story of a wonderful man." -> "this is the story of a wonderful man "
I expect that, at the very least, searching on _all with the query "This is the story of a wonderful man." should find the document.
What am I doing wrong?
How is elasticsearch populating the _all field?
If the field 'title' is to be added to the _all field, which data is used and how? Is the output of the analyzer selected for the 'title' field used as input for the analyzer of _all, or is the raw data used?
What is the flow of data into the _all field? For example:
input -> analyzer -> title -> index_analyser -> _all
or
input -> analyzer -> title
input -> index_analyser -> _all
Thank you in advance...
Your mapping looks ok to me. The only thing I would try is to set one of the fields explicitly to include_in_all=true and then rerun your query.
According to the docs, it may be that, because you are overriding the default value of include_in_all for one of the fields, the default has changed for all the other fields of that object. See here: _all
Relevant text from the documentation is below:
Inclusion in the _all field can be controlled on a field-by-field basis by using the include_in_all setting, which defaults to true. Setting include_in_all on an object (or on the root object) changes the default for all fields within that object.
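To rule that out, here is a sketch of the mapping from the question with include_in_all spelled out explicitly on every field. This is illustrative rather than the confirmed fix, and it also drops the custom analyzers from _all, which (as the update below notes) turned out to matter; the include_in_all parameter applies to the old _all mechanism in 1.x/2.x mappings.
{
  "article": {
    "_all": {
      "enabled": true,
      "store": "yes"
    },
    "properties": {
      "title": {
        "type": "string",
        "include_in_all": true
      },
      "subtitle": {
        "type": "string",
        "analyzer": "case_insensitive",
        "include_in_all": true
      },
      "comment": {
        "type": "string",
        "index": "not_analyzed",
        "include_in_all": true
      },
      "review": {
        "type": "string",
        "index": "not_analyzed",
        "include_in_all": false
      }
    }
  }
}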
UPDATE:
I think I know why it's not working. Here is what I did. First, I removed the custom analysers from the _all field (so it uses the standard analyser). With this I was able to query and get the results as expected. Results were returned for terms that were in any of the document attributes except review. At least this confirms that the general behaviour of _all is correct. Next, to test the analysers, I did a query on the subtitle field with the exact text (as it uses the keyword analyser). This also worked. Then I realised that _all is an aggregated field which is then analysed.
So the query would have to include all the text from all the fields to work. But again, how do we know in which order they were aggregated :)
This link, _all custom analyser, has some information. The relevant bits are extracted below (from Shay):
You don't want to set the analyzer for _all to be keyword; _all is an aggregation of all the other fields in the doc, so you would basically treat the whole aggregation of text as a single token.
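In other words, with a keyword-tokenized analyzer on _all, a query only matches if it reproduces the entire concatenated text of the document. If you leave _all with the standard (or any word-level) analyzer, an ordinary match query like the sketch below is enough to find the example document:
{
  "query": {
    "match": {
      "_all": "wonderful man"
    }
  }
}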