We are evaluating ArangoDB performance in the area of facet calculations.
There are a number of other products capable of doing the same, either via a special API or a query language:
MarkLogic Facets
ElasticSearch Aggregations
Solr Faceting, etc.
We understand that there is no special API in ArangoDB to calculate facets explicitly.
But in reality none is needed: thanks to the comprehensive AQL, it can easily be achieved via a simple query like:
FOR a IN Asset
COLLECT attr = a.attribute1 INTO g
RETURN { value: attr, count: LENGTH(g) }
This query calculates a facet on attribute1 and yields value frequencies in the form:
[
{
"value": "test-attr1-1",
"count": 2000000
},
{
"value": "test-attr1-2",
"count": 2000000
},
{
"value": "test-attr1-3",
"count": 3000000
}
]
It says that across the entire collection, attribute1 takes three values (test-attr1-1, test-attr1-2 and test-attr1-3), with the related counts provided.
Essentially, we run a DISTINCT query and aggregate the counts.
Looks simple and clean - with only one, but really big, issue: performance.
The query above runs for 31 (!) seconds on a test collection of only 8M documents.
We have experimented with different index types and storage engines (with RocksDB and without) and investigated explain plans, to no avail.
The test documents we use are very concise, with only three short attributes.
We would appreciate any input at this point.
Either we are doing something wrong, or ArangoDB is simply not designed to perform in this particular area.
BTW, the ultimate goal would be to run something like the following in sub-second time:
LET docs = (FOR a IN Asset
FILTER a.name like 'test-asset-%'
SORT a.name
RETURN a)
LET attribute1 = (
FOR a IN docs
COLLECT attr = a.attribute1 INTO g
RETURN { value: attr, count: length(g[*])}
)
LET attribute2 = (
FOR a IN docs
COLLECT attr = a.attribute2 INTO g
RETURN { value: attr, count: length(g[*])}
)
LET attribute3 = (
FOR a IN docs
COLLECT attr = a.attribute3 INTO g
RETURN { value: attr, count: length(g[*])}
)
LET attribute4 = (
FOR a IN docs
COLLECT attr = a.attribute4 INTO g
RETURN { value: attr, count: length(g[*])}
)
RETURN {
counts: (RETURN {
total: LENGTH(docs),
offset: 2,
to: 4,
facets: {
attribute1: {
from: 0,
to: 5,
total: LENGTH(attribute1)
},
attribute2: {
from: 5,
to: 10,
total: LENGTH(attribute2)
},
attribute3: {
from: 0,
to: 1000,
total: LENGTH(attribute3)
},
attribute4: {
from: 0,
to: 1000,
total: LENGTH(attribute4)
}
}
}),
items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
facets: {
attribute1: (FOR a IN attribute1 SORT a.count LIMIT 0, 5 RETURN a),
attribute2: (FOR a IN attribute2 SORT a.value LIMIT 5, 10 RETURN a),
attribute3: (FOR a IN attribute3 LIMIT 0, 1000 RETURN a),
attribute4: (FOR a IN attribute4 SORT a.count, a.value LIMIT 0, 1000 RETURN a)
}
}
Thanks!
It turns out the main discussion happened on the ArangoDB Google Group.
Here is a link to the full discussion.
Here is a summary of the current solution:
Run a custom build of ArangoDB from a specific feature branch where a number of performance improvements have been made (hopefully they will make it into a main release soon).
No indexes are required for facet calculations.
MMFiles is the preferred storage engine.
AQL should be written to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)" (see the first sketch after this list).
AQL should be split into smaller pieces and run in parallel; we are running Java 8's Fork/Join to spread the facet AQLs and then join them into a final result (see the second sketch below).
One AQL to filter/sort and retrieve the main entity, if required (when sorting/filtering, add a corresponding skiplist index).
The rest are small AQLs, one per facet, producing value/frequency pairs.
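Applied to the first query in this post, that rewrite looks like this (a sketch; cnt is just a variable name, chosen here to avoid shadowing the LENGTH() function):
FOR a IN Asset
COLLECT attr = a.attribute1 WITH COUNT INTO cnt
RETURN { value: attr, count: cnt }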
In the end we gained a more than 10x performance improvement compared to the original AQL above.
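For reference, the split-and-join idea is not tied to Java; here is a rough, untested equivalent of the Fork/Join setup using the arangojs driver and Promise.all (connection details and attribute names are illustrative, not from our actual setup):
const { Database, aql } = require("arangojs");

// illustrative connection; adjust url/database to your setup
const db = new Database({ url: "http://localhost:8529", databaseName: "test" });

// one small AQL per facet; a[@attr] reads the attribute dynamically
async function facet(attribute) {
  const cursor = await db.query(aql`
    FOR a IN Asset
      COLLECT attr = a[${attribute}] WITH COUNT INTO cnt
      RETURN { value: attr, count: cnt }
  `);
  return cursor.all();
}

// run the facet queries concurrently, then join them into one result
async function allFacets() {
  const [attribute1, attribute2, attribute3] = await Promise.all(
    ["attribute1", "attribute2", "attribute3"].map(facet)
  );
  return { attribute1, attribute2, attribute3 };
}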
Related
I currently have three collections that need to be routed into one endpoint. I want to get the Course collection and sort it; then, for each course, I have to use nested subqueries to fetch a random review (there can be multiple tied to the same course) and also get the related user.
User{
name:
_id:User/4638
key: ...}
Review{
_from: User/4638
_to: Course/489
date: ....}
Course{
_id: Course/489
title: ...}
The issue I'm having is fetching the user based on the review. I've tried MERGE, but that seems to limit the query to one user when there should be multiple. Below is the current output using LET.
"course": {
"_key": "789",
"_id": "Courses/789",
"_rev": "_ebjuy62---",
"courseTitle": "Pandas Essential Training",
"mostRecentCost": 15.99,
"hours": 20,
"averageRating": 5
},
"review": [
{
"_key": "543729",
"_id": "Reviews/543729",
"_from": "Users/PGBJ38",
"_to": "Courses/789",
"_rev": "_ebOrt9u---",
"rating": 2
}
],
"user": []
},
Here is the current LET subquery method I'm using. I was wondering if there is any way to pass or maybe nest the subqueries so that user can read review. Currently I try to pass the LET variable, but it isn't read in the output, since a blank array is shown.
FOR c IN Courses
SORT c.averageRating DESC
LIMIT 3
LET rev = (FOR r IN Reviews
FILTER c._id == r._to
SORT RAND()
LIMIT 1
RETURN r)
LET use = (FOR u IN Users
FILTER rev._from == u._id
RETURN u)
RETURN {course: c, review: rev, user: use}
The result of the first LET query, rev, is an array with one element. You can re-write the complete query in two ways:
Set rev to the first element of the LET query result:
FOR c IN Courses
SORT c.averageRating DESC
LIMIT 3
LET rev = (FOR r IN Reviews
FILTER c._id == r._to
SORT RAND()
LIMIT 1
RETURN r)[0]
LET use = (FOR u IN Users
FILTER rev._from == u._id
RETURN u)
RETURN {course: c, review: rev, user: use}
I use this variant in my own projects.
Access the first element of rev in the second LET query:
FOR c IN Courses
SORT c.averageRating DESC
LIMIT 3
LET rev = (FOR r IN Reviews
FILTER c._id == r._to
SORT RAND()
LIMIT 1
RETURN r)
LET use = (FOR u IN Users
FILTER rev[0]._from == u._id
RETURN u)
RETURN {course: c, review: rev, user: use}
This is untested; the syntax might need slight changes. And you have to look at cases where there aren't any reviews - I can't say off the top of my head how this behaves in that case.
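For example, building on the first variant, a guard for the no-review case might look like this (an untested sketch using the same collection names as above):
LET rev = (FOR r IN Reviews
FILTER c._id == r._to
SORT RAND()
LIMIT 1
RETURN r)[0]
LET use = rev == null ? [] : (
FOR u IN Users
FILTER rev._from == u._id
RETURN u)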
The query receives a pair of coordinates, a maximum distance radius, a "skip" integer and a "limit" integer. The function should return the closest and newest locations for the given position. There is no visible error in my code; however, when I call the query again, it returns repeated results. The "skip" variable is updated according to the number of results returned.
Example:
1) I make a query with skip = 0, limit = 10. I receive 10 non-repeated locations.
2) The query is called again, now with skip = 10, limit = 10. I receive another 10 locations, with repeated results from the first query.
QUERY
Locations.find({ coordinates :
{ $near : [ x , y ],
$maxDistance: maxDistance }
})
.sort('date_created')
.skip(skip)
.limit(limit)
.exec(function(err, locations) {
console.log("[+]Found Locations");
callback(locations);
});
SCHEMA
var locationSchema = new Schema({
date_created: { type: Date },
coordinates: [],
text: { type: String }
});
I have tried looking everywhere for a solution. Could my only option be the Mongo versions? I use Mongoose 4.x.x and MongoDB 2.5.6, I believe. Any ideas?
There are a couple of things to consider here in the sort of results that you want, with the first consideration being that you have a "secondary" sort criterion in "date_created" to deal with.
The basic problem is that the $near operator and similar operators in MongoDB do not at present "project" any field to indicate the "distance" from the queried location; they simply "default sort" the data. So in order to do that "secondary" sort, a field with the "distance" needs to be present. There are other options for this, therefore.
The second consideration is that "skip" and "limit" style paging is horrible for performance on large sets of data and should be avoided where you can. It is better to select data based on the "range" in which it occurs than to "skip" through all the results you have previously displayed.
The first thing to do here is use a command that can "project" the distance into the document along with the other information. The $geoNear aggregation stage is good for this, especially since we want to do other sorting:
var seenIds = [],
lastDistance = null,
lastDate = null;
Locations.aggregate(
[
{ "$geoNear": {
"near": [x,y],
"maxDistance": maxDistance
"distanceField": "dist",
"limit": 10
}},
{ "$sort": { "dist": 1, "date_created": -1 }
],
function(err,results) {
results.forEach(function(result) {
// compare dates by value, not by object reference
if ( ( result.dist != lastDistance ) || ( String(result.date_created) != String(lastDate) ) ) {
seenIds = [];
lastDistance = result.dist;
lastDate = result.date_created;
}
seenIds.push(result._id);
});
// save those variables to session or other persistence
// do something with results
}
)
That is the first iteration of your results, where you fetch the first 10. Note the logic inside the loop: each document in the results is inspected for a change in either "date_created" or the projected "dist" field now present in the document, and where this occurs the "seenIds" array is wiped of all current entries. The general action is that all the variables are tested and possibly updated on each iteration, and where there is no change, items are added to the list of "seenIds".
All three of those variables need to be stored somewhere awaiting the next request. For web applications the session store is ideal, but approaches vary. For example, with Express and express-session it could be as simple as this (a minimal sketch; the session middleware setup is assumed):
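// after processing a page of results, stash the paging state
req.session.lastDistance = lastDistance;
req.session.lastDate = lastDate;
req.session.seenIds = seenIds;

// on the next request, restore the state before building the query
var lastDistance = req.session.lastDistance || null,
    lastDate = req.session.lastDate || null,
    seenIds = req.session.seenIds || [];
You just want those values to be recalled when the next request starts, as on the next and subsequent iterations we alter the query a bit: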
Locations.aggregate(
[
{ "$geoNear": {
"near": [x,y],
"maxDistance": maxDistance,
"minDistance": lastDistance,
"distanceField": "dist",
"limit": 10,
"query": {
"_id": { "$nin": seenIds },
"date_created": { "$lt": lastDate }
}
}},
{ "$sort": { "dist": 1, "date_created": -1 }
],
function(err,results) {
results.forEach(function(result) {
// compare dates by value, not by object reference
if ( ( result.dist != lastDistance ) || ( String(result.date_created) != String(lastDate) ) ) {
seenIds = [];
lastDistance = result.dist;
lastDate = result.date_created;
}
seenIds.push(result._id);
});
// save those variables to session or other persistence
// do something with results
}
)
So here the "minDistance" parameter is entered, since you want to exclude any of the "nearer" results that have already been seen. The additional checks go into the query: "date_created" needs to be "less than" the recorded "lastDate", since we are sorting in descending order, and the final "sure" filter excludes any "_id" values recorded in the list while the distance and date values had not changed.
Now with geospatial data that "seenIds" list is not likely to grow much, as you will generally not find many results at exactly the same distance, but this is a general process for paging a sorted list of data, so it is worth understanding the concept.
So if you want to sort on a secondary field with geospatial data while also considering the "near" distance, this is the general approach: project a distance value into the document results, and store the last-seen values before any change that would make them non-unique.
The general concept is "advancing the minimum distance" to enable each page of results to get gradually "further away" from the source point of origin used in the query.
I have a set of 2.8 million docs with sets of tags that I'm querying with ElasticSearch, but many of these docs can be grouped together by one ID. I want to query my data using the tags, and then aggregate them by the ID that repeats. Often my search results have tens of thousands of documents, but I only want to aggregate the top 100 results of the search. How can I constrain an aggregation to only the top 100 results from a query?
Sampler Aggregation :
A filtering aggregation used to limit any sub aggregations' processing
to a sample of the top-scoring documents.
"aggs": {
"bestDocs": {
"sampler": {
// "field": "<FIELD>", <-- optional, Controls diversity using a field
"shard_size":100
},
"aggs": {
"bestBuckets": {
"terms": {
"field": "id"
}
}
}
}
}
This query will limit the sub-aggregation to the top 100 (top-scoring) documents per shard and then bucket them by ID.
Optionally, you can use the field or script and max_docs_per_value settings to control the maximum number of documents collected on any one shard which share a common value.
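For example, a sketch of such a diversified sample (on recent Elasticsearch versions these diversity settings live in the separate diversified_sampler aggregation; here we diversify on the same id field as above, so no single ID contributes more than 3 of the 100 sampled docs per shard):
"aggs": {
"bestDocs": {
"diversified_sampler": {
"shard_size": 100,
"field": "id",
"max_docs_per_value": 3
},
"aggs": {
"bestBuckets": {
"terms": {
"field": "id"
}
}
}
}
}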
The size parameter can be set to define how many term buckets should be returned out of the overall terms list.
By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned).
If set to 0, the size will be set to Integer.MAX_VALUE.
Here is example code to return the top 100:
{
"aggs" : {
"products" : {
"terms" : {
"field" : "product",
"size" : 100
}
}
}
}
You can refer to this for more information.
You can use the min_doc_count parameter. (Note that min_doc_count only returns terms that appear in at least that many documents; it filters buckets rather than restricting the aggregation to the top 100 results of the query.)
{
"aggs" : {
"products" : {
"terms" : {
"field" : "product",
"min_doc_count" : 100
}
}
}
}
Here is the doc "schema":
{
type: "offer",
product: "xxx",
price: "14",
valid_from: [2012, 7, 1, 0, 0, 0]
}
There are a lot of such documents, with many valid dates in the past and the future, and often two or three offers in the same month. I can't find a way to make the following view: given a date, give me a list of the products and their running offer for that date.
I think I need to emit the valid_from field in order to set the endkey of the query to the given date, and then I need to reduce on the max of this field, which means I can't emit it.
Have I got it wrong? I am totally new to the map/reduce concept. Any suggestions on how to do it?
I'm really thrown by your comments about wanting to reduce; based on your requirements you want just a map function - no reduce. Here's the map function based on what you asked for:
function(d) {
if( d.type === 'offer' ) {
var dd = d.valid_from;
// zero-pad month and day so the string keys sort correctly
dd[1] = ( '0' + ( dd[1] + 1 )).slice(-2); // remove the +1 if 7 is July, not August
dd[2] = ( '0' + dd[2] ).slice(-2);
// the key is a "YYYY-MM-DD" string; no value is needed
emit( dd.slice(0,3).join('-') );
}
}
Then to show all offers valid for a given day you'd query this view with params like:
endkey="2012-08-01"&include_docs=true
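A complete request could then look like this (database, design document and view names are made up for illustration):
GET /mydb/_design/offers/_view/by_valid_from?endkey="2012-08-01"&include_docs=true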
For monitoring an application with CouchDB I need to sum up a field of my data (for example, the time needed to execute a method, which has been logged).
That's no problem for me with map/reduce, but I need to sum up only the data recorded in a specific time slice.
Example records:
{_id: 1, methodID:1, recorded: 100, timeneeded: 10},
{_id: 2, methodID:1, recorded: 200, timeneeded: 11},
{_id: 3, methodID:2, recorded: 200, timeneeded: 2},
{_id: 4, methodID:1, recorded: 300, timeneeded: 6},
{_id: 5, methodID:2, recorded: 310, timeneeded: 3},
{_id: 6, methodID:1, recorded: 400, timeneeded: 9}
Now I would like to get the sum of timeneeded over all records recorded in the range 200 to 350, grouped by methodID. (That would be 17 for methodID 1 and 5 for methodID 2.)
How can I do that?
I have now tried it with a list function that uses WickedGrey's idea. See my functions here:
map function:
function(doc) {
emit([doc.recorded], { methodID: doc.methodID, timeneeded: doc.timeneeded });
}
list function:
"function(head, req) {
var combined_values = {};
var row;
while (row = getRow()) {
if( row.values.methodID in combined_values) {
combined_values[ row.values.methodID] +=row.values.timeneeded;
}
else {
combined_values[ row.values.methodID] = row.values.timeneeded;
}
}
for(var methodID in combined_values){
send( toJSON({method: methodID, timeneeded:combined_values[methodID]}) );
}
}"
Now I have two problems:
1. I always get the results as a file, and Firefox asks me whether I want to download it, instead of viewing it in the browser like when I query a classic view.
2. As I understand it, the results are now calculated on the fly in the list function. I expect this to be not really fast with hundreds of millions of records... Any ideas how to make it faster?
Thank you for your help!
andy
You can't use a map key to filter by one set of criteria, but group by another in CouchDB. However, you can filter the keys by time range, and group with a reduce function. Try something like this:
function map(doc) {
// build the {methodID: timeneeded} value with a computed key;
// an object literal like {doc.methodID: ...} would be a syntax error
var value = {};
value[doc.methodID] = doc.timeneeded;
emit(doc.recorded, value);
}
function reduce(key, values, rereduce) {
var combined_values = {};
for (var i in values) {
var totals = values[i];
for (var methodID in totals) {
if (methodID in combined_values) {
combined_values[methodID] += totals[methodID];
}
else {
combined_values[methodID] = totals[methodID];
}
}
}
return combined_values;
}
That should allow you to specify a start/end key, and with group_level=0 should get you a value containing the dictionary that you're looking for.
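Querying the view for the example time slice could then look like this (design document and view names are illustrative):
GET /mydb/_design/stats/_view/timeneeded?startkey=200&endkey=350&group_level=0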
Edit: Also, this thread might be of interest:
http://couchdb-development.1959287.n2.nabble.com/reduce-limit-error-td2789734.html
It discusses an option to turn off the "reduce must shrink" message, and further down the list provides other ways of achieving the same goal: using a list function. That might be a better approach than what I've outlined here. :(
function map(doc) {
if(doc.methodID && doc.recorded && doc.timeneeded) {
emit([doc.methodID,doc.recorded], doc.timeneeded);
}
}
// reduce: use the built-in _sum function
_sum
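With this view, the time-slice sum comes out per method by issuing one range query per methodID and grouping on the first key component (names illustrative again); for methodID 1 and the 200-350 slice from the example above:
GET /mydb/_design/stats/_view/by_method?startkey=[1,200]&endkey=[1,350]&group_level=1
This returns a row like {"key": [1], "value": 17}, matching the expected sum for methodID 1.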