Max terms indexed in a document by Elasticsearch?

The Lucene documentation mentions that if the documents you are indexing are very large, Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors, though this can be configured via IndexWriter.setMaxFieldLength(int).
I created an index in Elasticsearch (http://localhost:9200/twitter) and posted a document with 40,000 terms in it.
Mapping:
{
  "twitter": {
    "mappings": {
      "tweet": {
        "properties": {
          "filter": {
            "properties": {
              "term": {
                "properties": {
                  "message": {
                    "type": "string"
                  }
                }
              }
            }
          },
          "message": {
            "type": "string",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
I indexed a document whose message field contains 40,000 terms - message: "text1 text2 .... text40000".
Since the standard analyzer splits on whitespace, it has indexed all 40,000 terms.
My question is: does Elasticsearch set a limit on the number of terms Lucene indexes? If yes, what is that limit?
If no, how did all 40,000 terms get indexed? It shouldn't have indexed more than 10,000 terms.
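For reference, the number of terms that actually made it into the index can be checked with the term vectors API. This is only a sketch, assuming the document was indexed with ID 1; the endpoint is spelled _termvector on 1.x releases and _termvectors on later versions:
GET /twitter/tweet/1/_termvectors?fields=message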

The source you're citing doesn't seem up to date: IndexWriter.setMaxFieldLength(int) was deprecated in Lucene 3.4 and is no longer available in Lucene 4+, which ES is based on. It was replaced by LimitTokenCountAnalyzer. However, I don't think such a limit exists anymore, or at least it is not set explicitly within the Elasticsearch codebase.
The only limit you might encounter while indexing documents would be related either to the HTTP payload size or to Lucene's internal buffer size, as explained in this post.
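If you do want to reproduce the old Lucene behaviour and cap the number of tokens per field yourself, Elasticsearch exposes Lucene's limit-token-count functionality as the limit token filter. A sketch under that assumption (the analyzer and filter names here are made up, and max_token_count is set to the old 10,000 default purely for illustration):
PUT /twitter
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_terms": {
          "type": "limit",
          "max_token_count": 10000
        }
      },
      "analyzer": {
        "limited_standard": {
          "tokenizer": "standard",
          "filter": ["lowercase", "truncate_terms"]
        }
      }
    }
  }
}
Any field mapped with the limited_standard analyzer would then keep only the first 10,000 tokens of its value.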

Related

Count duplicate values via Elasticsearch terms aggregation

I am trying to run an Elasticsearch terms aggregation on multiple fields of the documents in my index. Each document contains multiple fields with hashtags, which can be extracted using a custom hashtag analyzer. The goal is to find the most common hashtags in the system.
As stated in the Elasticsearch documentation, it is not possible to run a terms aggregation on multiple fields of a document. I am therefore trying to use a copy_to field. The problem now is that if a document contains the same hashtag in multiple fields, the term should be counted multiple times. This is not the case with the default terms aggregation:
Given Mapping:
{
  "properties": {
    "field_one": {
      "type": "string",
      "copy_to": "hashtags"
    },
    "field_two": {
      "type": "string",
      "copy_to": "hashtags"
    }
  }
}
Given Document:
{
  "field_one": "Hello #World",
  "field_two": "One #World"
}
The aggregation will return a single bucket {"key": "#World", "doc_count": 1}. What I need is a single bucket {"key": "#World", "doc_count": 2}.
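For completeness, this is the kind of aggregation request that yields the doc_count of 1. It is only a sketch: the index name my-index is made up, and the hashtags field is the copy_to target from the mapping above.
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "popular_hashtags": {
      "terms": {
        "field": "hashtags"
      }
    }
  }
}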

"Content" too large when indexing blob content for Azure Search

I set up blob indexing and full-text searching for Azure as described in this article: Indexing Documents in Azure Blob Storage with Azure Search.
Some of my documents are failing in the indexer, returning the following error:
Field 'content' contains a term that is too large to process. The max length for UTF-8 encoded terms is 32766 bytes. The most likely cause of this error is that filtering, sorting, and/or faceting are enabled on this field, which causes the entire field value to be indexed as a single term. Please avoid the use of these options for large fields.
The particular pdf that is producing this error is 3.68 MB, and contains a variety of content (text, tables, images, etc).
The index and indexer are set up exactly as described in that article, with the addition of some file type restrictions.
Index:
{
  "name": "my-index",
  "fields": [{
    "name": "id",
    "type": "Edm.String",
    "key": true,
    "searchable": false
  }, {
    "name": "content",
    "type": "Edm.String",
    "searchable": true
  }]
}
Indexer:
{
  "name": "my-indexer",
  "dataSourceName": "my-data-source",
  "targetIndexName": "my-index",
  "schedule": {
    "interval": "PT2H"
  },
  "parameters": {
    "maxFailedItems": 10,
    "configuration": {
      "indexedFileNameExtensions": ".pdf,.doc,.docx,.xls,.xlsx,.ppt,.pptx,.html,.xml,.eml,.msg,.txt,.text"
    }
  }
}
I tried searching through their docs and some other related articles, but I couldn't really find any information. I'm guessing this is because this feature is still in preview.
There's a limit on the size of a single term in the search index, and it also happens to be 32 KB. If the content field in your search index is marked as filterable, facetable, or sortable, you'll hit this limit (regardless of whether the field is marked as searchable or not). Typically, for large searchable content you want to enable searchable and sometimes retrievable, but not the rest. That way you won't hit limits on content length from the index side.
Please see this answer for more context as well.
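Concretely, a content field definition along these lines should stay clear of the per-term limit. This is only a sketch based on the index above; the attribute defaults vary across API versions, so they are spelled out explicitly here:
{
  "name": "content",
  "type": "Edm.String",
  "searchable": true,
  "retrievable": true,
  "filterable": false,
  "sortable": false,
  "facetable": false
}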

ElasticSearch: Full-Text Search made easy

I am investigating the possibility of switching from SphinxSearch to ElasticSearch.
What is good about SphinxSearch is that full-text search just works out of the box at a pretty good level. Making it work with ElasticSearch turned out to be not as easy as I expected.
In my project I have a search box with typeahead: I type Clint E and see a dropdown with results including Clint Eastwood in first place; I type robert down and see Robert Downey Jr. in first place. All this I achieved with SphinxSearch out of the box, just by providing my DB credentials and an SQL query to pull the necessary fields.
With ElasticSearch, on the other hand, I can't get satisfying results even after a day of reading about the Fuzzy Like This query, matching, partial matching and so on. There is a lot of information, but it doesn't make the task any easier. I feel like I need a PhD in search just to make it work at the simplest level.
So far I have ended up with this configuration:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "movies": {
      "dynamic": true,
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "stem"
        }
      }
    }
  }
}
The query looks like this:
{
  "query": {
    "query_string": {
      "query": "clint eastw",
      "default_field": "title"
    }
  }
}
But the quality of search in this case is not satisfying at all: back to my example, it cannot find the Clint Eastwood profile until I type his name completely.
Then I tried to use:
{
  "query": {
    "fuzzy_like_this": {
      "fields": [
        "title"
      ],
      "like_text": "clint eastw",
      "max_query_terms": 25,
      "fuzziness": 0.5
    }
  }
}
It helps, but not much: now I can find what I need with the shorter query clint eastwo, and after some manipulation of the parameters even with clint eastw, but it is still not encouraging.
So I wonder, is there a simple recipe for cooking up full-text search with ElasticSearch and getting decent-quality results? I spent a day reading but didn't find a solution.
A couple of images to demonstrate what I am talking about:
Elastic: the name is almost complete, but the expected result is missing; note that there is no better match either.
One letter later, Elastic found it!
Meanwhile, Sphinx is shining :)
Elasticsearch ships with a completion suggester for auto-completion.
You don't need to build this into regular query functionality, which works at the token level rather than the partial-token level.
Go for the completion suggester; it also has support for fuzziness.
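A minimal sketch of that approach, using the request shape of more recent Elasticsearch versions (older 1.x releases spell the suggest request slightly differently; the index and field names here are made up):
PUT /movies
{
  "mappings": {
    "properties": {
      "title_suggest": {
        "type": "completion"
      }
    }
  }
}

POST /movies/_search
{
  "suggest": {
    "title_suggestions": {
      "prefix": "clint eastw",
      "completion": {
        "field": "title_suggest",
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}
Each document would then index its title (e.g. Clint Eastwood) into title_suggest, and the suggester completes partial input like clint eastw against it.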

Twitter like search for users using Elasticsearch and python

I am trying to build a Twitter-like search for users using Elasticsearch and Python, that is, a search across first_name, last_name and username. I have decided to go with ngrams. This is how the analyzer is configured:
settings = {
    "analysis": {
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "mynGram"
                ]
            }
        },
        "filter": {
            "mynGram": {
                "type": "nGram",
                "min_gram": 2,
                "max_gram": 20
            }
        }
    }
}
This produces an index size of 700 MB for about 700,000 documents. This covers most of my use cases but one:
John - Gives a set of results
John D - Gives the same set of results as 'John'
John Do - Gives the correct set of results.
My guess is that because the minimum ngram size is 2, the second query above hits a blind spot (the single character D produces no ngrams). I have the option of reducing the minimum ngram size to 1, but I am worried about scalability and performance issues.
Is ngram the correct approach considering scalability and performance?
The problem is probably in your mapping definition. With an ngram analyzer, you want the index_analyzer to be ngram_analyzer, but not the search_analyzer.
Otherwise, your query string itself will be split into ngrams: John becomes Jo, oh, hn, etc., and a term or match filter will match any of those tokens.
Documentation: Index time search-as-you-type
On a related note, if you intend to do only prefix searches, an edge-ngram tokenizer would be more appropriate and would use less memory (both RAM and disk).
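A sketch of what that looks like in the mapping, using the pre-2.0 index_analyzer/search_analyzer parameters that match this setup (later versions use analyzer plus search_analyzer instead); the field name comes from the question, while the user type name is made up:
"mappings": {
  "user": {
    "properties": {
      "username": {
        "type": "string",
        "index_analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
first_name and last_name would get the same treatment, so queries are analyzed with the standard analyzer while the indexed values are split into ngrams.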

View with geospatial and non geospatial keys with CouchDB

I'm using CouchDB and GeoCouch, and I'm trying to understand whether it is possible to build a geospatial index and query the database by both a location and a value from another field.
Data
{
  "_id": "1",
  "profession": "medic",
  "location": [15.12, 30.22]
}
{
  "_id": "2",
  "profession": "secretary",
  "location": [15.12, 30.22]
}
{
  "_id": "3",
  "profession": "clown",
  "location": [27.12, 2.2]
}
Questions
Is there any way to perform the following queries on these documents:
Find all documents with profession = "medic" near location [15.12, 30.22] (more important)
List all the different professions near this location [15.12, 30.22] (a plus)
In case that's not possible, what options do I have? I'm already considering switching to MongoDB, but I'd rather solve this in a different way.
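For context, a plain GeoCouch spatial view keys only on geometry, so on its own it can filter by bounding box but not by profession. A rough sketch of such a view (the design-document, view, and database names are made up, and the emit format follows GeoCouch's GeoJSON convention as I understand it):
{
  "_id": "_design/geo",
  "spatial": {
    "by_location": "function(doc) { if (doc.location) { emit({ type: 'Point', coordinates: doc.location }, doc.profession); } }"
  }
}

GET /mydb/_design/geo/_spatial/by_location?bbox=15.0,30.0,15.2,30.3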
Notes
Data changes quickly; new documents might be added and many might be removed.
References
Faceted search with geo-index using CouchDB
