Count duplicate values via Elasticsearch terms aggregation - search

I am trying to run an Elasticsearch terms aggregation on multiple fields of the documents in my index. Each document contains multiple fields with hashtags, which can be extracted using a custom hashtag analyzer. The goal is to find the most common hashtags in the system.
As stated in the Elasticsearch documentation, it is not possible to run a terms aggregation on multiple fields of a document. I am thus trying to use a copy_to field. The problem is that if a document contains the same hashtag in multiple fields, the term should be counted multiple times. This is not the case with the default terms aggregation:
Given Mapping:
{
"properties": {
"field_one": {
"type": "string",
"copy_to": "hashtags"
},
"field_two": {
"type": "string",
"copy_to": "hashtags"
}
}
}
Given Document:
{
"field_one": "Hello #World",
"field_two": "One #World"
}
The aggregation will return a single bucket {"key": "#World", "doc_count": 1}. What I need is a single bucket {"key": "#World", "doc_count": 2}.
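The counting described above can be made concrete with a small client-side sketch. This is plain JavaScript rather than an Elasticsearch query, and the `extractHashtags` helper with its regex is a hypothetical stand-in for the custom hashtag analyzer; it only illustrates the expected result, where a hashtag that appears in two fields of one document is counted twice.

```javascript
// Hypothetical helper: pulls hashtags such as "#World" out of a text value.
// The regex is an assumption standing in for the custom hashtag analyzer.
function extractHashtags(text) {
  return String(text).match(/#\w+/g) || [];
}

// Counts every occurrence of a hashtag across the listed fields of all
// documents, so the same tag in two fields of one document counts twice.
function countHashtags(docs, fields) {
  const counts = new Map();
  for (const doc of docs) {
    for (const field of fields) {
      if (doc[field] == null) continue; // skip missing fields
      for (const tag of extractHashtags(doc[field])) {
        counts.set(tag, (counts.get(tag) || 0) + 1);
      }
    }
  }
  return counts;
}

const sampleDocs = [{ field_one: "Hello #World", field_two: "One #World" }];
const hashtagCounts = countHashtags(sampleDocs, ["field_one", "field_two"]);
console.log(hashtagCounts.get("#World")); // 2
```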

Related

Cloudant Sorting on a nullable field

I want to sort on a field, say name, which is indexed in Cloudant DB. When I use the index without a sort, I get all the documents, both those that have the name field and those that don't. But when I sort on the name field, the documents that don't have it are not returned.
Is there any way to do this using the query indexes? I want all the documents in sorted order, including those without the name field.
For Example :
Below are some documents:
{
"_id": 1234,
"classId": "abc",
"name": "Happa"
}
{
"_id": 12345,
"classId": "abc",
"name": "Prasanth"
}
{
"_id": 123456,
"classId": "abc",
}
Below is the Query what i am trying to execute:
{
"selector": {
"classId": "abc",
"name" :{
"or" : [
{"$exists": true},{"$exists": false}
]
}
},
"sort": [{ "classId": "asc" }, { "name": "asc" }],
"use_index": "idx-classId_name"
}
I am expecting all the documents to be returned in a sorted order including the document which doesn't have that name field.
Your query makes no sense to me as it stands. You're requesting a listing of documents which either have, or don't have a specific field (meaning every document), and expecting to sort those on this field that may or may not exist. Such an order isn't defined out of the box.
I'd remove the name clause from the selector, sort only on the classId field, which appears in every document, and then do the secondary partial ordering on the client side, so you can decide how to mix the documents without the name field in with those that have it.
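The client-side ordering could look roughly like the sketch below. It assumes the documents have already been fetched, and it places documents without a name after the named ones within each classId; that placement is an arbitrary choice for illustration, and the function name is made up.

```javascript
// Sorts by classId first, then by name. Documents without a name go last
// within their classId group (an arbitrary choice made for this sketch).
function sortByClassThenName(docs) {
  return [...docs].sort((a, b) => {
    if (a.classId !== b.classId) return a.classId < b.classId ? -1 : 1;
    const aHasName = typeof a.name === "string";
    const bHasName = typeof b.name === "string";
    if (aHasName && bHasName) {
      return a.name < b.name ? -1 : a.name > b.name ? 1 : 0;
    }
    if (aHasName) return -1; // a has a name, b does not
    if (bHasName) return 1; // b has a name, a does not
    return 0; // neither has a name
  });
}

const fetchedDocs = [
  { _id: "123456", classId: "abc" },
  { _id: "12345", classId: "abc", name: "Prasanth" },
  { _id: "1234", classId: "abc", name: "Happa" },
];
console.log(sortByClassThenName(fetchedDocs).map((d) => d._id));
```

Swapping the two `return -1` / `return 1` branches for the missing-name cases would put the unnamed documents first instead.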
Another solution is to use a view instead of a Cloudant Query index. I've not tested this, but hopefully the intent is clear:
function(doc) {
if (doc && doc.classId) {
var name = doc.name || "[notfound]";
emit(doc.classId+"-"+name, 1);
}
}
which will key the docs on "classId-name" and for docs with no name, a specified sentinel value.
Querying the view should return the documents lexicographically ordered on this compound key (which you can reverse with a query parameter if you wish).

How to sort on multiple fields individually using a single index

I am trying to declare multiple fields in a single index, as shown below, and then sort on a single field only. Is that possible?
Is there any way to use a single combined-fields index and sort on individual fields dynamically?
{
"index": {
"fields": ["name","createdDate","updatedDate"]
},
"name" : "multi-filter",
"ddoc" : "MultiFilter",
"type" : "json"
}
After that, I can sort on the fields in the same sequence, like this:
{
"selector": {"name": "Robert De Niro"},
"sort": [{"name": "asc"}, {"createdDate": "asc"},{"updatedDate": "asc"}]
}
But if I change the sequence, or want to filter/sort on a single field like this:
{
"selector": {"name": "Robert De Niro"},
"sort": [{"name": "asc"}]
}
it gives the error shown below. My goal is to use a single index but sort on individual fields. It looks like this is a limitation of CouchDB and I need to create three separate indexes to make it work, but I am still hoping for a better option.
{"error":"no_usable_index","reason":"No index exists for this sort, try indexing by the sort fields."}
I found this answer here: "Unknown Error: mango_idx :: {no_usable_index,missing_sort_index}"}
You could define an index that contains only the field you sort on, e.g.:
{
"index": {
"fields": ["name"]
},
"name" : "name_sort",
"type" : "json"
}
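If one index per sortable field is the route taken, the definitions can be generated rather than written by hand. The sketch below only builds the JSON bodies, which would then be sent to the database's `_index` endpoint; the `<field>_sort` naming scheme and the function name are assumptions made for illustration.

```javascript
// Builds one single-field JSON index definition per sortable field.
// The "<field>_sort" naming scheme is a made-up convention for this sketch.
function buildSortIndexes(fields) {
  return fields.map((field) => ({
    index: { fields: [field] },
    name: `${field}_sort`,
    type: "json",
  }));
}

const indexDefs = buildSortIndexes(["name", "createdDate", "updatedDate"]);
for (const def of indexDefs) {
  console.log(JSON.stringify(def));
}
```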

How to reduce query execution time using mango query in CouchDB?

I am paginating through 15000 records using a mango query in CouchDB, but the more records I skip, the longer the query takes to execute.
Here is my query:
{
"selector": {
"name": {"$ne": "null"}
},
"fields": ["_id", "_rev", "name", "email" ],
"sort": [{"name": "asc" }],
"limit": 10,
"skip": '.$skip.'
}
Here the number of skipped documents is dynamic and depends on the page number; as the skip value increases, the query execution time increases as well.
CouchDB "Mango" queries that use the $ne (not equal) operator tend to suffer performance issues because of the way the indexing works. One solution is to create an index that only contains documents where name does not equal null, using CouchDB's relatively new partial index feature.
Partial indexes allow the database to be filtered at index time, so that the built index only contains documents that pass the filter test you specify. The index can then be used with a query at query time to further winnow the data set down.
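The two-stage idea can be illustrated with an in-memory sketch. This is not how CouchDB implements indexes; it only mimics the split between filtering once at index time and filtering again at query time. The sample documents and function names are invented for the example.

```javascript
// Index time: keep only the documents that pass the partial filter.
function buildPartialIndex(docs, partialFilter) {
  return docs.filter(partialFilter);
}

// Query time: apply a further selector to the already-filtered index.
function queryIndex(index, selector) {
  return index.filter(selector);
}

// Invented sample data; "null" is the string value used in the question.
const allDocs = [
  { _id: "a", name: "Ann", email: "ann@example.com" },
  { _id: "b", name: "null" },
  { _id: "c", name: "Bob", email: "bob@example.com" },
];

// Built once: documents whose name equals "null" never enter the index.
const partialIndex = buildPartialIndex(allDocs, (doc) => doc.name !== "null");

// Each query then only has to scan the smaller, prefiltered set.
const hits = queryIndex(partialIndex, (doc) => doc.name.startsWith("A"));
console.log(partialIndex.length, hits.map((doc) => doc._id));
```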
An index is created by calling the /db/_index endpoint:
POST /db/_index HTTP/1.1
Content-Type: application/json
Content-Length: 144
Host: localhost:5984
{
"index": {
"partial_filter_selector": {
"name": {
"$ne": "null"
}
},
"fields": ["_id", "_rev", "name", "email"]
},
"ddoc": "mypartialindex",
"type" : "json"
}
This creates an index where only documents whose name is not null are included. We can then specify this index at query time:
{
"selector": {
"name": {
"$ne": "null"
}
},
"use_index": "mypartialindex"
}
In the above query, my selector is choosing all records, but the index it is accessing is already filtered. You may add additional clauses to the selector here to further filter the data at query time.
Partial indexing is described in the CouchDB documentation here and in this blog post.

Linking nested documents together and facetting in ElasticSearch

I have a mapping which looks like this:
"mappings": {
"mydoc": {
"properties": {
"event": {
"type": "nested",
"properties": {
"eventType": {
"type": "string"
},
"idList": {
"type": "integer"
},
"id": {
"type": "integer"
}
}
}
}
}
}
A mydoc document contains a nested array of event documents.
Within a mydoc document, I want to find all IDs where:
There exists an event with event.eventType='A' and event.idList contains some ID X
There exists another event with event.eventType='B' and event.id equals X
Across the index, I want a list of IDs for which these criteria hold, and also a count (for each ID) of the number of mydoc documents in which this occurred.
Is it possible to achieve this in ElasticSearch? I was thinking it might be possible with a nested facet filter or a terms filter lookup but I have not seen a way to do it with these yet.
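To pin down the matching rule, here is a client-side sketch of the logic in plain JavaScript. It is not an Elasticsearch query and says nothing about whether the same result can be obtained server-side; the sample documents and function names are invented, and the field names follow the mapping above.

```javascript
// For one mydoc: the IDs X such that some event of type "A" lists X in
// idList and some event of type "B" has id equal to X.
function matchingIds(doc) {
  const events = doc.event || [];
  const listedByA = new Set();
  for (const e of events) {
    if (e.eventType === "A") {
      for (const id of e.idList || []) listedByA.add(id);
    }
  }
  const result = new Set();
  for (const e of events) {
    if (e.eventType === "B" && listedByA.has(e.id)) result.add(e.id);
  }
  return result;
}

// Across all docs: for each matching ID, the number of docs it matched in.
function countDocsPerId(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const id of matchingIds(doc)) {
      counts.set(id, (counts.get(id) || 0) + 1);
    }
  }
  return counts;
}

// Invented sample data.
const myDocs = [
  { event: [{ eventType: "A", idList: [1, 2] }, { eventType: "B", id: 2 }] },
  { event: [{ eventType: "A", idList: [2, 3] }, { eventType: "B", id: 2 }, { eventType: "B", id: 3 }] },
  { event: [{ eventType: "A", idList: [5] }, { eventType: "B", id: 6 }] },
];
const idCounts = countDocsPerId(myDocs);
console.log(JSON.stringify([...idCounts])); // [[2,2],[3,1]]
```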
I think that a parent-child relation might suit your case better than a nested document.
Then you can query your (child) event documents directly if you're searching only in the scope of the events (or add a condition on the _parent field to limit the search to a specific top document).
And you can use the has_child filter or query to search (or facet) on your top documents with conditions on the events (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-has-child-filter.html ).
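As a rough sketch, a has_child search body for one of the conditions could be assembled like this. The child type name "event" is an assumption, as is the premise that eventType is indexed without analysis so that a term query on "A" matches; neither is confirmed by the mapping in the question.

```javascript
// Builds a search body that matches parent documents having at least one
// child event of the given type whose idList contains the given id.
function buildHasChildQuery(eventType, id) {
  return {
    query: {
      has_child: {
        type: "event", // assumed name of the child type
        query: {
          bool: {
            must: [
              // Assumes eventType is not analyzed, otherwise "A" may be
              // indexed as "a" and this term query would not match.
              { term: { eventType: eventType } },
              { term: { idList: id } },
            ],
          },
        },
      },
    },
  };
}

const hasChildBody = buildHasChildQuery("A", 42);
console.log(JSON.stringify(hasChildBody, null, 2));
```

This covers only one side of the original requirement; linking it to a second event with the same ID inside the same parent still needs a separate step.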

View with geospatial and non geospatial keys with CouchDB

I'm using CouchDB and GeoCouch and I'm trying to understand whether it is possible to build a geospatial index and "query" the database using both a location and a value from another field.
Data
{
"_id": "1",
"profession": "medic",
"location": [15.12, 30.22]
}
{
"_id": "2",
"profession": "secretary",
"location": [15.12, 30.22]
}
{
"_id": "3",
"profession": "clown",
"location": [27.12, 2.2]
}
Questions
Is there any way to perform the following queries on these documents:
Find all documents with profession = "medic" near location [15.12, 30.22] (more important)
List all the different professions near this location [15.12, 30.22] (a plus)
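For clarity, the two queries can be written as plain in-memory filters. This is only a statement of the intended results, not a GeoCouch solution; "near" is modelled here as a bounding box [minX, minY, maxX, maxY], which is an assumption, and the function names are invented.

```javascript
// True if the point [x, y] lies inside the box [minX, minY, maxX, maxY].
function inBox(location, box) {
  const [x, y] = location;
  const [minX, minY, maxX, maxY] = box;
  return x >= minX && x <= maxX && y >= minY && y <= maxY;
}

// Query 1: documents with the given profession inside the box.
function findByProfessionNear(docs, profession, box) {
  return docs.filter((d) => d.profession === profession && inBox(d.location, box));
}

// Query 2: distinct professions inside the box, in first-seen order.
function professionsNear(docs, box) {
  return [...new Set(docs.filter((d) => inBox(d.location, box)).map((d) => d.profession))];
}

const geoDocs = [
  { _id: "1", profession: "medic", location: [15.12, 30.22] },
  { _id: "2", profession: "secretary", location: [15.12, 30.22] },
  { _id: "3", profession: "clown", location: [27.12, 2.2] },
];
const nearBox = [15, 30, 16, 31]; // assumed box around [15.12, 30.22]

console.log(findByProfessionNear(geoDocs, "medic", nearBox).map((d) => d._id));
console.log(professionsNear(geoDocs, nearBox));
```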
In case that's not possible, what options do I have? I'm already considering switching to MongoDB, but I'd rather solve in a different way.
Notes
Data changes quickly, new documents might be added and many might be removed
References
Faceted search with geo-index using CouchDB
