ArangoDB AQL nested subqueries relying on the data from another - arangodb

I currently have three collections that need to be routed into one endpoint. I want to get the Course collection and sort it; then, for each course, I have to use nested subqueries to fetch a random review (there could be multiple tied to the same course) and also get the related user.
User {
  name: ...
  _id: User/4638
  _key: ...
}
Review {
  _from: User/4638
  _to: Course/489
  date: ...
}
Course {
  _id: Course/489
  title: ...
}
The issue I'm having is fetching the user based on the review. I've tried MERGE, but that seems to limit the query to one user when there should be multiple. Below is the current output using LET.
"course": {
"_key": "789",
"_id": "Courses/789",
"_rev": "_ebjuy62---",
"courseTitle": "Pandas Essential Training",
"mostRecentCost": 15.99,
"hours": 20,
"averageRating": 5
},
"review": [
{
"_key": "543729",
"_id": "Reviews/543729",
"_from": "Users/PGBJ38",
"_to": "Courses/789",
"_rev": "_ebOrt9u---",
"rating": 2
}
],
"user": []
},
Here is the current LET subquery method I'm using. I was wondering if there is any way to pass or nest the subqueries so that the user subquery can read review. Currently I try to pass the LET variable, but it isn't read in the output, since a blank array is shown.
FOR c IN Courses
  SORT c.averageRating DESC
  LIMIT 3
  LET rev = (FOR r IN Reviews
    FILTER c._id == r._to
    SORT RAND()
    LIMIT 1
    RETURN r)
  LET use = (FOR u IN Users
    FILTER rev._from == u._id
    RETURN u)
  RETURN {course: c, review: rev, user: use}

The result of the first LET query, rev, is an array with one element. You can rewrite the complete query in two ways:
Set rev to the first element of the LET query result:
FOR c IN Courses
  SORT c.averageRating DESC
  LIMIT 3
  LET rev = (FOR r IN Reviews
    FILTER c._id == r._to
    SORT RAND()
    LIMIT 1
    RETURN r)[0]
  LET use = (FOR u IN Users
    FILTER rev._from == u._id
    RETURN u)
  RETURN {course: c, review: rev, user: use}
I use this variant in my own projects.
Access the first element of rev in the second LET query:
FOR c IN Courses
  SORT c.averageRating DESC
  LIMIT 3
  LET rev = (FOR r IN Reviews
    FILTER c._id == r._to
    SORT RAND()
    LIMIT 1
    RETURN r)
  LET use = (FOR u IN Users
    FILTER rev[0]._from == u._id
    RETURN u)
  RETURN {course: c, review: rev, user: use}
This is untested; the syntax might need slight changes. And you have to look at cases where there aren't any reviews - I can't say off the top of my head how this behaves in that case.
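For instance, a sketch of the first variant that also guards against courses without reviews (untested; FIRST() returns null for an empty array, and the ternary skips the user lookup in that case):
FOR c IN Courses
  SORT c.averageRating DESC
  LIMIT 3
  LET rev = FIRST(
    FOR r IN Reviews
      FILTER c._id == r._to
      SORT RAND()
      LIMIT 1
      RETURN r
  )
  LET use = rev == null ? [] : (
    FOR u IN Users
      FILTER rev._from == u._id
      RETURN u
  )
  RETURN {course: c, review: rev, user: use}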

Related

Hyperledger Composer, queries breaking with more than 2 WHERE conditions

Here's my query
query Q_BUY_SELL {
  description: "Select filtered Orders"
  statement:
    SELECT namespace.Order
      WHERE ((((orderType != _$filterType AND orderStatus == _$filterStatus) AND bidTokenPrice == _$bidTokenPrice) AND orderer != _$orderer) AND property == _$property)
}
And here's how I'm using it:
return query('Q_BUY_SELL', {
filterStatus: 'PENDING',
filterType: 'SELL',
bidTokenPrice: 10,
orderer:'resource:com.contrachain.User#o1',
property:'resource:com.contrachain.Property#p2'
})
.then(function (assets) {
console.log(assets);
// Some alterations to assets
Here's the only Asset in my DB, which I wasn't expecting in the result because of the 'orderer' field (see orderer != _$orderer in the query):
{
"$class": "com.contrachain.Order",
"orderId": "Thu Feb 22 2018 15:57:05 GMT+0530 (IST)-30",
"orderType": "BUY",
"orderStatus": "PENDING",
"bidTokenPrice": 10,
"tokenQuantity": 30,
"orderTime": "2018-02-22T10:27:05.089Z",
"orderer": "resource:com.contrachain.User#o1",
"property": "resource:com.contrachain.Property#p2"
}
But it's still there in the response in the console.
TL;DR: I have 5 conditions (1,2,3,4,5) in the query Q_BUY_SELL, of which (1,5) are working fine, but the 2nd, 3rd and 4th conditions are not being applied to the results.
I feel silly posting this question as the problem seems trivial, but I've been stuck on this for a while now and need some external perspective to identify what I'm missing here.
UPDATE: Relevant part of the models
asset Property identified by propertyId {
  o String propertyId
  --> User owner
}
asset Order identified by orderId {
  o String orderId
  o OrderType orderType
  o OrderStatus orderStatus default = 'PENDING'
  o Double bidTokenPrice
  o Double tokenQuantity
  o DateTime orderTime
  --> User orderer
  --> Property property
}
abstract participant Account identified by emailId {
  o String emailId
  o String name default = ''
  o DateTime joiningDate
  o Boolean isActive default = false
}
participant User extends Account {
  o Double balanceINR default = 0.0
}
transaction PlaceOrder {
  o OrderType orderType
  o Double bidTokenPrice
  o Double tokenQuantity
  o DateTime orderTime
  --> User orderer
  --> Property property
}
enum OrderType {
  o BUY
  o SELL
}
enum OrderStatus {
  o PENDING
  o SUCCESSFUL
  o VOID
}
It's difficult to replicate without the model. But I suggest paring the query back to 4 criteria to begin with (remove the property resource comparison, for example, as sketched below) and seeing whether it does or doesn't return the orderer as you described. In any case, I would ALSO create a second record so that the query does return something that IS EXPECTED to be a match (and hopefully omits the record that shouldn't match), just for testing - so you can see that the query returns an orderer matching your criteria. But first, try to see if the query works with 4 criteria including the orderer check. What I'm suggesting is to see whether there's a breakage in the aggregation of criteria (or not).
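For example, a pared-back 4-criteria variant of the original query with the property comparison removed (an untested sketch; the name Q_BUY_SELL_4 is just for illustration):
query Q_BUY_SELL_4 {
  description: "Select filtered Orders, property check removed"
  statement:
    SELECT namespace.Order
      WHERE (((orderType != _$filterType AND orderStatus == _$filterStatus) AND bidTokenPrice == _$bidTokenPrice) AND orderer != _$orderer)
}
It would then be called with the property parameter dropped:
return query('Q_BUY_SELL_4', {
  filterStatus: 'PENDING',
  filterType: 'SELL',
  bidTokenPrice: 10,
  orderer: 'resource:com.contrachain.User#o1'
});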
An example of building queries in a transaction and parsing the results is shown here, FYI: https://github.com/hyperledger/composer/blob/master/packages/composer-tests-functional/systest/data/transactions.queries.js
Just to say, I don't see the problems you're seeing with multiple criteria - I've tried five and six, mixing them up, and all worked fine. Perhaps you can give feedback.
I went and tested some queries as follows with a parameterised query with criteria (essentially the same as you did, for a different business network) in the online playground. I create the queries.qry file, then call the query in my TP function (that and the sample function code are at the bottom, FYI):
The numbers below represent the left-side 'fields' in the query definition (ordinal, left to right, as shown in the query below). My 'data' is at the bottom; essentially I always (except for one test) 'miss out' the record with tradingSymbol == "1".
query myQryTest {
  description: "Select all commodities"
  statement:
    SELECT org.acme.trading.Commodity
      WHERE (tradingSymbol != _$tradingSymbol AND (description == _$description OR description == _$description) AND mainExchange == _$mainExchange AND quantity == _$quantity AND owner == _$owner)
}
RESULTS:
(1 AND 2 AND 3 AND 4 AND 5) - i.e. all with '==' comparisons (so the 1st criterion was changed to tradingSymbol == "1" above, just for this one test only). All worked fine. 1 record.
(!= 1 AND 2 AND 3 AND 4 AND 5) - so one negation in the criteria. Worked fine. Got the results I wanted (2 records - see my sample data below).
(!= 1 AND (2 == "2" OR 2 == "3") AND 3 AND 4 AND 5) - 6 criteria, as shown above, one in parentheses (where the field match is an OR). Worked fine. Got the right results. I changed the description in record 3 to be "4" and got one record. My sample data:
{
"$class": "org.acme.trading.Commodity",
"tradingSymbol": "1",
"description": "2",
"mainExchange": "3",
"quantity": 4,
"owner": "resource:org.acme.trading.Trader#1"
}
{
"$class": "org.acme.trading.Commodity",
"tradingSymbol": "2",
"description": "2",
"mainExchange": "3",
"quantity": 4,
"owner": "resource:org.acme.trading.Trader#1"
}
{
"$class": "org.acme.trading.Commodity",
"tradingSymbol": "3",
"description": "2",
"mainExchange": "3",
"quantity": 4,
"owner": "resource:org.acme.trading.Trader#1"
}
You can try this out for yourself with the trade-network sample from the online playground at https://composer-playground.mybluemix.net/ (deploy the trade-network sample) and create a Trader participant with id 1.
/**
 * Try to access elements of an array of B assets
 * @param {org.acme.trading.Qry} qry - function to access A elements (and B assets)
 * @transaction
 */
function myQryfunc(qry) {
return query('myQryTest', {
"tradingSymbol": "1",
"description": "2",
"mainExchange": "3",
"quantity": 4,
"owner": "resource:org.acme.trading.Trader#1"
})
.then(function (results) {
var promises = [];
for (var n = 0; n < results.length; n++) {
var asset = results[n];
console.log('Query result is ' + (n+1) + ', object is ' + asset);
console.log('Asset identifier is ' + asset.getIdentifier());
}
});
}
The model definition for my Qry transaction (to call it in the playground) is:
transaction Qry {
  --> Commodity commodity
}

ArangoDB Faceted Search Performance

We are evaluating ArangoDB performance in the space of facet calculations.
There are a number of other products capable of doing the same, either via a special API or a query language:
MarkLogic Facets
ElasticSearch Aggregations
Solr Faceting, etc.
We understand there is no special API in Arango to calculate facets explicitly.
But in reality it is not needed; thanks to the comprehensive AQL, it can easily be achieved via a simple query, like:
FOR a IN Asset
  COLLECT attr = a.attribute1 INTO g
  RETURN { value: attr, count: LENGTH(g) }
This query calculates a facet on attribute1 and yields frequencies in the form of:
[
{
"value": "test-attr1-1",
"count": 2000000
},
{
"value": "test-attr1-2",
"count": 2000000
},
{
"value": "test-attr1-3",
"count": 3000000
}
]
It says that across the entire collection, attribute1 takes three forms (test-attr1-1, test-attr1-2 and test-attr1-3), with the related counts provided.
Essentially, we run a DISTINCT query and aggregate the counts.
Looks simple and clean. With only one, but really big, issue: performance.
The query above runs for 31 seconds (!) on top of a test collection with only 8M documents.
We have experimented with different index types and storage engines (with RocksDB and without), and investigated explain plans, to no avail.
The test documents we use are very concise, with only three short attributes.
We would appreciate any input at this point.
Either we are doing something wrong, or ArangoDB is simply not designed to perform in this particular area.
BTW, the ultimate goal would be to run something like the following in under a second:
LET docs = (
  FOR a IN Asset
    FILTER a.name LIKE 'test-asset-%'
    SORT a.name
    RETURN a
)
LET attribute1 = (
  FOR a IN docs
    COLLECT attr = a.attribute1 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute2 = (
  FOR a IN docs
    COLLECT attr = a.attribute2 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute3 = (
  FOR a IN docs
    COLLECT attr = a.attribute3 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute4 = (
  FOR a IN docs
    COLLECT attr = a.attribute4 INTO g
    RETURN { value: attr, count: LENGTH(g[*]) }
)
RETURN {
  counts: (RETURN {
    total: LENGTH(docs),
    offset: 2,
    to: 4,
    facets: {
      attribute1: { from: 0, to: 5, total: LENGTH(attribute1) },
      attribute2: { from: 5, to: 10, total: LENGTH(attribute2) },
      attribute3: { from: 0, to: 1000, total: LENGTH(attribute3) },
      attribute4: { from: 0, to: 1000, total: LENGTH(attribute4) }
    }
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN { id: a._id, name: a.name }),
  facets: {
    attribute1: (FOR a IN attribute1 SORT a.count LIMIT 0, 5 RETURN a),
    attribute2: (FOR a IN attribute2 SORT a.value LIMIT 5, 10 RETURN a),
    attribute3: (FOR a IN attribute3 LIMIT 0, 1000 RETURN a),
    attribute4: (FOR a IN attribute4 SORT a.count, a.value LIMIT 0, 1000 RETURN a)
  }
}
Thanks!
It turns out the main thread happened on the ArangoDB Google Group.
Here is a link to the full discussion.
Here is a summary of the current solution:
Run a custom build of ArangoDB from a specific feature branch where a number of performance improvements have been made (hopefully they will make it into a main release soon)
No indexes are required for facet calculations
MMFiles is the preferred storage engine
AQL should be written to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)" (see the sketch after this list)
AQL should be split into smaller pieces and run in parallel (we are running Java 8's Fork/Join to spread the facet AQLs and then join them into a final result)
One AQL to filter/sort and retrieve the main entity (if required; while sorting/filtering, add a corresponding skiplist index)
The rest are small AQLs, one per facet, producing value/frequency pairs
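For reference, a sketch of the recommended form applied to the original facet query (the same pattern as the first query above, just using WITH COUNT INTO):
FOR a IN Asset
  COLLECT attr = a.attribute1 WITH COUNT INTO length
  RETURN { value: attr, count: length }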
In the end we gained a more than 10x performance improvement compared to the original AQL provided above.

How to find the most occurring attribute values and their count in documents of an ArangoDB collection?

I have a collection in ArangoDB where each document contains some attributes like
{
"contributor_name": "Rizano",
"action": "create",
"id": 3633,
"type": "newusers",
"logtitle": "What to do",
"timestamp": "2006-07-05",
"contributor_id": 7878
}
The collection contains millions of documents. Now I want to find out which contributor_name occurs most often in the documents, and its count.
You can simply group by contributor_name and use the special COLLECT syntax variant WITH COUNT INTO ... to efficiently compute how often each value occurs in the dataset:
FOR doc IN coll
  COLLECT name = doc.contributor_name WITH COUNT INTO count
  RETURN { name, count }
The result may look like this:
[
{ "name": "Rizano", "count": 5 },
{ "name": "Felipe", "count": 8 },
...
]
You may merge the result together like this if you prefer that format:
[
  {
    "Rizano": 5,
    "Felipe": 8,
    ...
  }
]
Query:
RETURN MERGE(
  FOR doc IN coll
    COLLECT name = doc.contributor_name WITH COUNT INTO count
    RETURN { [name]: count }
)
You may also sort by count and limit the result to the most frequently occurring values, e.g. like this (top contributor only):
FOR doc IN coll
  COLLECT name = doc.contributor_name WITH COUNT INTO count
  SORT count DESC
  LIMIT 1
  RETURN { name, count }
There's also COLLECT AGGREGATE, although there should be no difference in performance for this particular query:
FOR doc IN coll
  COLLECT name = doc.contributor_name AGGREGATE count = LENGTH(1)
  SORT count DESC
  LIMIT 1
  RETURN { name, count }
The value passed to LENGTH doesn't really matter; all we want is for it to return a length of 1 (thus increasing the counter by 1 for the given contributor).

How to construct a query to filter data from a cloudant database using python?

I am a complete novice in Python. I have connected to a database in my Cloudant account through Python. I need to construct a query for filtering the data.
Here x is a Unix timestamp which I have calculated using the formula:
s = str(datetime.date.today() - datetime.timedelta(days=2))
x = datetime.datetime.strptime(s, "%Y-%m-%d").timestamp()
Method 1:
query = Query(my_database, selector={"type": "add", "uid": { "$gt": 0 }, "$_viewts": {"$gt": x }})
This query raises an HTTP error stating
invalid operator for: $_viewts
Method 2:
query = Query(my_database, selector={"type": "add", "uid": { "$gt": 0 }, "_viewts": {"$gt": x }})
This query runs when I remove the $ sign from the key $_viewts, but it produces an empty list as a result.
Any solutions/suggestions would be of great help.
Thanks

Mongoose query returning repeated results

The query receives a pair of coordinates, a maximum distance radius, a "skip" integer and a "limit" integer. The function should return the closest and newest locations for the given position. There is no visible error in my code; however, when I call the query again, it returns repeated results. The "skip" variable is updated according to the number of results returned.
Example:
1) I make a query with skip = 0, limit = 10. I receive 10 non-repeated locations.
2) The query is called again, now with skip = 10, limit = 10. I receive another 10 locations, with repeated results from the first query.
QUERY
Locations.find({ coordinates :
{ $near : [ x , y ],
$maxDistance: maxDistance }
})
.sort('date_created')
.skip(skip)
.limit(limit)
.exec(function(err, locations) {
console.log("[+]Found Locations");
callback(locations);
});
SCHEMA
var locationSchema = new Schema({
date_created: { type: Date },
coordinates: [],
text: { type: String }
});
I have tried looking everywhere for a solution. Could it be down to the versions of Mongo? I use Mongoose 4.x.x and MongoDB 2.5.6, I believe. Any ideas?
There are a couple of things to consider here about the sort of results you want, the first being that you have a "secondary" sort criterion in "date_created" to deal with.
The basic problem is that the $near operator and like operators in MongoDB do not at present "project" any field to indicate the "distance" from the queried location; they simply "default sort" the data. So in order to do that "secondary" sort, a field with the "distance" needs to be present. There are therefore other options for this.
The second consideration is that "skip" and "limit" style paging is horrible for performance on large sets of data and should be avoided where you can. It's better to select data based on a "range" of where it occurs rather than to "skip" through all the results you have previously displayed.
The first thing to do here is use a command that can "project" the distance into the document along with the other information. The aggregation stage $geoNear is good for this, especially since we want to do other sorting:
var seenIds = [],
    lastDistance = null,
    lastDate = null;

Locations.aggregate(
  [
    { "$geoNear": {
      "near": [x, y],
      "maxDistance": maxDistance,
      "distanceField": "dist",
      "limit": 10
    }},
    { "$sort": { "dist": 1, "date_created": -1 } }
  ],
  function(err, results) {
    results.forEach(function(result) {
      // Dates must be compared by value; the unary + coerces a Date
      // (or null) to a number so equal timestamps compare as equal
      if ( result.dist != lastDistance || +result.date_created != +lastDate ) {
        seenIds = [];
        lastDistance = result.dist;
        lastDate = result.date_created;
      }
      seenIds.push(result._id);
    });
    // save those variables to session or other persistence
    // do something with results
  }
)
That is the first iteration of your results, where you fetch the first 10. Note the logic inside the loop: each document in the results is inspected for a change in either "date_created" or the projected "dist" field now present in the document, and where this occurs the "seenIds" array is wiped of all current entries. The general action is that all the variables are tested and possibly updated on each iteration, and where there is no change, items are added to the list of "seenIds".
All three variables being worked on need to be stored somewhere awaiting the next request. For web applications, the session store is ideal, but approaches vary. You just want those values to be recalled when we start the next request, as on the next and subsequent iterations we alter the query a bit:
Locations.aggregate(
  [
    { "$geoNear": {
      "near": [x, y],
      "maxDistance": maxDistance,
      "minDistance": lastDistance,
      "distanceField": "dist",
      "limit": 10,
      "query": {
        "_id": { "$nin": seenIds },
        "date_created": { "$lt": lastDate }
      }
    }},
    { "$sort": { "dist": 1, "date_created": -1 } }
  ],
  function(err, results) {
    results.forEach(function(result) {
      // Same value-based date comparison as in the first iteration
      if ( result.dist != lastDistance || +result.date_created != +lastDate ) {
        seenIds = [];
        lastDistance = result.dist;
        lastDate = result.date_created;
      }
      seenIds.push(result._id);
    });
    // save those variables to session or other persistence
    // do something with results
  }
)
So there the "minDistance" parameter is entered as you want to exclude any of the "nearer" results that have already been seen, and the additional checks are placed in the query with the "date_created" needing to be "less than" the "lastDistance" recorded as well since we are in descending order of sort, with the final "sure" filter in excluding any "_id" values that were recorded within the list because the values had not changed.
Now with geospatial data that "seenIds" list is not likely to grow as generally you are not going to find things all at the same distance, but it is a general process of paging a sorted list of data like this, so it is worth understanding the concept.
So if you want to be able to use a secondary field to sort on with geospatial data and also considering the "near" distance then this is the general approach, by projecting a distance value into the document results as well as storing the last seen values before any changes that would not make them unique.
The general concept is "advancing the minimum distance" to enable each page of results to get gradually "further away" from the source point of origin used in the query.
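To make the flow concrete, here is a minimal sketch (the helper name buildPipeline is hypothetical) that derives the pipeline for any page from the three persisted variables:
// Sketch: build the $geoNear paging pipeline from persisted state.
// `state` is { seenIds: [], lastDistance: null, lastDate: null } on the
// first page, then whatever was saved after the previous request.
function buildPipeline(x, y, maxDistance, state) {
  var geoNear = {
    "near": [x, y],
    "maxDistance": maxDistance,
    "distanceField": "dist",
    "limit": 10
  };
  if (state.lastDistance != null) {
    // Subsequent pages: advance the minimum distance and exclude
    // boundary documents that were already returned.
    geoNear.minDistance = state.lastDistance;
    geoNear.query = {
      "_id": { "$nin": state.seenIds },
      "date_created": { "$lt": state.lastDate }
    };
  }
  return [
    { "$geoNear": geoNear },
    { "$sort": { "dist": 1, "date_created": -1 } }
  ];
}
The first call passes the empty state; each response handler then re-runs the forEach bookkeeping shown above before saving the state again.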
