View with geospatial and non-geospatial keys in CouchDB

I'm using CouchDB and GeoCouch, and I'm trying to understand whether it is possible to build a geospatial index and query the database using both a location and a value from another field.
Data
{
  "_id": "1",
  "profession": "medic",
  "location": [15.12, 30.22]
}
{
  "_id": "2",
  "profession": "secretary",
  "location": [15.12, 30.22]
}
{
  "_id": "3",
  "profession": "clown",
  "location": [27.12, 2.2]
}
Questions
Is there any way to perform the following queries on these documents:
Find all documents with profession = "medic" near location [15.12, 30.22] (more important)
List all the different professions near this location [15.12, 30.22] (a plus)
In case that's not possible, what options do I have? I'm already considering switching to MongoDB, but I'd rather solve this in a different way.
Notes
Data changes quickly; new documents might be added and many might be removed.
References
Faceted search with geo-index using CouchDB
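For context, a GeoCouch spatial index over these documents would normally be defined in a design document along the lines of the sketch below (the design document name, the view name, and the assumption that location is [longitude, latitude] are illustrative, not taken from the question):
{
  "_id": "_design/geo",
  "spatial": {
    "by_location": "function(doc) { if (doc.location && doc.profession) { emit({ type: 'Point', coordinates: doc.location }, doc.profession); } }"
  }
}
A bounding-box query such as GET /mydb/_design/geo/_spatial/by_location?bbox=15.0,30.0,15.3,30.5 would then return the profession as each row's value, but this only gives location-based lookups: filtering by profession, or collecting the distinct professions, would still have to happen on the client side of such a query.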

Related

Apply a filter on array field of couchDB

I'm working on Hyperledger Fabric. I need a particular value from an array, not the full document, from CouchDB.
Example
{
  "f_id": "1",
  "History": [
    {
      "amount": "1",
      "contactNo": "-",
      "email": "i2#mail.com"
    },
    {
      "amount": "5",
      "contactNo": "-",
      "email": "i#gmail.com"
    }
  ],
  "size": "12"
}
I want only the object with email "i2#mail.com" from the History array, not the full History array.
Mango query:
{
  "selector": {
    "History": {
      "$elemMatch": {
        "email": "i2#mail.com"
      }
    }
  }
}
Output:
{
  "f_id": "1",
  "History": [
    {
      "amount": "1",
      "contactNo": "-",
      "email": "i2#mail.com"
    },
    {
      "amount": "5",
      "contactNo": "-",
      "email": "i#gmail.com"
    }
  ],
  "size": "12"
}
This returns the full History array, but I need only the first object of the History array.
Can anyone guide me?
Thanks.
I think it's not possible, because rich queries retrieve complete records (key-value pairs) matching the given selector.
You may want to reconsider your design. For example, if you want to keep a history and query it, this approach may work out:
GetState of your special key my_record.
If the key exists:
PutState the new value with key my_record.
Enrich the old value with additional attributes, e.g. {"DocType": "my_history", "time": "789546"}. With the help of these new attributes it becomes possible to create indexes and search via queries (see the sketch after this list).
PutState the enriched old value under a new key my_record_<uniqueId>.
If the key doesn't exist, just put your value with key my_record without any new attributes.
With this approach the my_record key will always hold the latest value. You can query the history by any attribute, with or without pagination, using indexes (or not, depending on your performance concerns).
This approach is also less space-consuming: if you accumulate the history on a single key, the existing history is copied into each new version, so every entry costs previous_size + delta instead of just delta.
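As a rough illustration of the querying half of this design (the DocType and email attributes follow the example above, while the index and design-document names are hypothetical), the archived history entries could then be fetched with a Mango query like:
{
  "selector": {
    "DocType": "my_history",
    "email": "i2#mail.com"
  },
  "use_index": ["_design/historyIndexDoc", "historyIndex"]
}
In Fabric, the matching CouchDB index definition would normally be packaged with the chaincode under META-INF/statedb/couchdb/indexes.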

How to index a blob document page-wise in Azure Cognitive Search?

I am new to Azure Search. I am indexing a few PDF documents using this method.
But I want to get search results page-wise. It currently provides results from the whole document; instead, I want results to be returned per page, and I also need the file name and page number of the page with the highest score.
As you have noticed, the document cracking by default shoves all text into one field (content). If you have an OCR skill involved (assuming you have images within the PDF that contain text), it does the same thing by default in merged_content. I do not believe there is a way to force these two tasks to break your data out into pages.
I say "believe" because it is difficult to find documentation on the shape of the document object that is input into your skillsets. For example, look at the input to this merge skill. It uses /document/content and other document-related data and pushes it all into a field called merged_content. If you could find documentation on all the fields in document, it might have your pages broken down.
{
  "#odata.type": "#Microsoft.Skills.Text.MergeSkill",
  "name": "#BookMergeSkill",
  "description": "Some description",
  "context": "/document",
  "insertPreTag": " ",
  "insertPostTag": " ",
  "inputs": [
    {
      "name": "text",
      "source": "/document/content"
    },
    {
      "name": "itemsToInsert",
      "source": "/document/normalized_images/*/text"
    },
    {
      "name": "offsets",
      "source": "/document/normalized_images/*/contentOffset"
    }
  ],
  "outputs": [
    {
      "name": "mergedText",
      "targetName": "merged_content"
    }
  ]
},
The only way I know to approach this is to use a custom skill, which would reside in an Azure Function and be called as part of the document skillset pipeline. Inside that Azure Function, you would have to use a PDF reader, like iText7, and crack open the documents yourself and return data that you would place in the index document as an array of text or custom objects.
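To give an idea of the shape involved, such a custom skill might return something like the sketch below; the values/recordId/data envelope is the standard custom Web API skill contract, while the pages, pageNumber, and text fields are made up for illustration:
{
  "values": [
    {
      "recordId": "1",
      "data": {
        "pages": [
          { "pageNumber": 1, "text": "Text extracted from page 1..." },
          { "pageNumber": 2, "text": "Text extracted from page 2..." }
        ]
      },
      "errors": [],
      "warnings": []
    }
  ]
}
Each page object could then be mapped into a collection field of the index so that a hit can be traced back to a file name and page number.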
We were going to go down the custom-cracking route with a client (not for this, but for other reasons), but the project was canned due to the cost of holding large amounts of data within an index.

Count duplicate values via Elasticsearch terms aggregation

I am trying to run an Elasticsearch terms aggregation on multiple fields of the documents in my index. Each document contains multiple fields with hashtags, which can be extracted using a custom hashtag analyzer. The goal is to find the most common hashtags in the system.
As stated in the Elasticsearch documentation, it is not possible to run a terms aggregation on multiple fields of a document, so I am trying to use a copy_to field instead. The problem now is that if a document contains the same hashtag in multiple fields, the term should be counted multiple times. This is not the case with the default terms aggregation:
Given Mapping:
{
  "properties": {
    "field_one": {
      "type": "string",
      "copy_to": "hashtags"
    },
    "field_two": {
      "type": "string",
      "copy_to": "hashtags"
    }
  }
}
Given Document:
{
  "field_one": "Hello #World",
  "field_two": "One #World"
}
The aggregation will return a single bucket {"key": "#World", "doc_count": 1}. What I need is a single bucket {"key": "#World", "doc_count": 2}.
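For reference, the aggregation being discussed would presumably be a plain terms aggregation on the copied field, along these lines (the hashtags field comes from the mapping above; the aggregation name is arbitrary):
{
  "size": 0,
  "aggs": {
    "top_hashtags": {
      "terms": {
        "field": "hashtags"
      }
    }
  }
}
Since doc_count counts matching documents rather than term occurrences, it reports 1 here even though #World appears twice in the document.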

ElasticSearch: Full-Text Search made easy

I am investigating the possibility of switching from SphinxSearch to Elasticsearch.
What is good about SphinxSearch is that full-text search just works out of the box at a pretty good level. Making it work in Elasticsearch turned out not to be as easy as I expected.
In my project I have a search box with typeahead, meaning I type Clint E and see a dropdown with results including Clint Eastwood in first place. Type robert down and see Robert Downey Jr. in first place. All this I achieved with SphinxSearch out of the box, just by providing it my DB credentials and an SQL query to pull the necessary fields.
On the other hand, with Elasticsearch I can't get satisfying results even after a day of reading about the Fuzzy Like This query, matching, partial matching and more. There is a lot of information, but it does not make the task easier. I feel like I need a PhD in search just to make it work at the simplest level.
So far I have ended up with this configuration:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "movies": {
      "dynamic": true,
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "stem"
        }
      }
    }
  }
}
The query looks like this:
{
  "query": {
    "query_string": {
      "query": "clint eastw",
      "default_field": "title"
    }
  }
}
But the quality of search in this case is not satisfying at all: going back to my example, it cannot find the Clint Eastwood profile until I type his name completely.
Then I tried to use
{
  "query": {
    "fuzzy_like_this": {
      "fields": [
        "title"
      ],
      "like_text": "clint eastw",
      "max_query_terms": 25,
      "fuzziness": 0.5
    }
  }
}
It helps, but not much: now I can find what I need with the shorter request clint eastwo, and after some manipulation of the parameters even with clint eastw, but it is still not encouraging.
So I wonder, is there a simple recipe for cooking full-text search with Elasticsearch and getting decent-quality results? I spent a day reading but didn't find a solution.
A couple of images to demonstrate what I am talking about:
Elasticsearch: the name is almost complete, but no expected result, and note that there is no better match either.
One letter later, Elasticsearch finds it!
At the same moment, Sphinx is shining :)
Elasticsearch ships with an autocompletion suggester.
You should not try to build this with regular query functionality; queries work at the whole-token level, not on partial tokens.
Go for the completion suggester; it also has support for fuzzy matching.
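A minimal sketch of that direction, assuming a dedicated completion field (the title_suggest and movie_suggest names are illustrative, and the exact syntax differs between Elasticsearch versions), would be a mapping like:
{
  "mappings": {
    "properties": {
      "title_suggest": {
        "type": "completion"
      }
    }
  }
}
queried through the suggest API with a prefix:
{
  "suggest": {
    "movie_suggest": {
      "prefix": "clint eastw",
      "completion": {
        "field": "title_suggest",
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}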

Max terms indexed in a document by Elasticsearch?

The Lucene documentation mentions that if the documents you are indexing are very large, Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors, though this can be configured via IndexWriter.setMaxFieldLength(int).
I created an index in Elasticsearch at http://localhost:9200/twitter and posted a document with 40,000 terms in it.
Mapping:
{
  "twitter": {
    "mappings": {
      "tweet": {
        "properties": {
          "filter": {
            "properties": {
              "term": {
                "properties": {
                  "message": {
                    "type": "string"
                  }
                }
              }
            }
          },
          "message": {
            "type": "string",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
I indexed a document whose message field has 40,000 terms: message: "text1 text2 .... text40000".
Since the standard analyzer splits on whitespace, it indexed 40,000 terms.
My question is: does Elasticsearch set a limit on the number of terms Lucene indexes? If yes, what is that limit?
If not, how did all 40,000 of my terms get indexed? It shouldn't have indexed more than 10,000 terms.
The source you're citing doesn't seem up to date: IndexWriter.setMaxFieldLength(int) was deprecated in Lucene 3.4 and is no longer available in Lucene 4+, which ES is based on. It has been replaced by LimitTokenCountAnalyzer. However, I don't think such a limit exists anymore, or at least it is not set explicitly in the Elasticsearch codebase.
The only limit you might encounter while indexing documents would be related to either the HTTP payload size or Lucene's internal buffer size, as explained in this post.
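If you did want to reproduce that old cap yourself, the closest Elasticsearch-level equivalent would presumably be the limit token filter inside a custom analyzer; a minimal sketch, with made-up analyzer and filter names:
{
  "settings": {
    "analysis": {
      "filter": {
        "first_10k_tokens": {
          "type": "limit",
          "max_token_count": 10000
        }
      },
      "analyzer": {
        "capped_standard": {
          "tokenizer": "standard",
          "filter": ["lowercase", "first_10k_tokens"]
        }
      }
    }
  }
}
Any field mapped with this analyzer would then only index the first 10,000 tokens of its value.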
