How to unwind data held in edges with a "common neighbors" style query? (ArangoDB)

I have a simple model with a single document collection A:
[{ _key: 'doc1', _id: 'a/doc1', name: 'Doc 1' }, { _key: 'doc2', _id: 'a/doc2', name: 'Doc 2' }]
and a single edge collection B joining documents of A, with a weight integer held on each edge:
[{ _key: 'xxx', _id: 'b/xxx', _from: 'a/doc1', _to: 'a/doc2', weight: 256 }]
I'm trying to write a "common neighbors" style query that takes two documents as input and yields their common neighbors, along with the respective weight on each side.
For example, with doc1 and doc26 as input, here is the goal to achieve:
[
{ _key: 'doc6', weightWithDoc1: 43, weightWithDoc26: 57 },
{ _key: 'doc12', weightWithDoc1: 98, weightWithDoc26: 173 },
{ _key: 'doc21', weightWithDoc1: 3, weightWithDoc26: 98 },
]
I successfully started by targeting a single side:
FOR associated, association
  IN 1..1
  ANY ${d1}
  ${EdgeCollection}
  SORT association.weight DESC
  LIMIT 20
  RETURN { _key: associated._key, weight: association.weight }
Then I successfully went on with the INTERSECTION approach from the documentation:
FOR proj IN INTERSECTION(
  (FOR associated, association
    IN 1..1
    ANY ${d1}
    ${EdgeCollection}
    RETURN { _key: associated._key }),
  (FOR associated, association
    IN 1..1
    ANY ${d2}
    ${EdgeCollection}
    RETURN { _key: associated._key })
)
LIMIT 20
RETURN proj
But I'm now struggling to extract the weight of each side: adding the weight to the inner RETURN clauses makes the projected objects differ between the two subqueries, so the intersection returns nothing.
Questions:
Is there any way to do some kind of "selective INTERSECTION" that groups some fields in the process?
Is there an alternative to INTERSECTION that achieves my goal?
Bonus question:
Ideally, after successfully extracting weightWithDoc1 and weightWithDoc26, I'd like to SORT DESC by weightWithDoc1 + weightWithDoc26.

I managed to find an acceptable answer myself:
FOR associated IN INTERSECTION(
  (FOR associated
    IN 1..1
    ANY ${doc1}
    ${EdgeCollection}
    RETURN { _key: associated._key }),
  (FOR associated
    IN 1..1
    ANY ${doc2}
    ${EdgeCollection}
    RETURN { _key: associated._key })
)
LET association1 = FIRST(
  FOR association IN ${EdgeCollection}
    FILTER association._from == CONCAT(${DocCollection.name}, '/', MIN([${doc1._key}, associated._key]))
       AND association._to == CONCAT(${DocCollection.name}, '/', MAX([${doc1._key}, associated._key]))
    RETURN association
)
LET association2 = FIRST(
  FOR association IN ${EdgeCollection}
    FILTER association._from == CONCAT(${DocCollection.name}, '/', MIN([${doc2._key}, associated._key]))
       AND association._to == CONCAT(${DocCollection.name}, '/', MAX([${doc2._key}, associated._key]))
    RETURN association
)
SORT (association1.weight + association2.weight) DESC
LIMIT 20
RETURN { _key: associated._key, weight1: association1.weight, weight2: association2.weight }
I believe re-querying the edges after intersecting is neither ideal nor the most performant solution, so I'm leaving this open for now in the hope of a better answer.
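One possible alternative that avoids INTERSECTION entirely (an untested sketch, assuming at most one edge per document pair and the same bind variables as above): run both traversals in a single query, tag each side, and COLLECT by neighbor key so the per-side weights survive the grouping.

// Traverse from both start documents in one pass, tagging each side.
FOR side IN [{ start: ${doc1}, tag: 1 }, { start: ${doc2}, tag: 2 }]
  FOR associated, association IN 1..1 ANY side.start ${EdgeCollection}
    // Group by neighbor key; each group keeps the tag and weight of every hit.
    COLLECT key = associated._key INTO hits = { tag: side.tag, weight: association.weight }
    // A common neighbor is reached from both sides (assumes one edge per pair).
    FILTER LENGTH(hits) == 2
    LET weight1 = FIRST(hits[* FILTER CURRENT.tag == 1 RETURN CURRENT.weight])
    LET weight2 = FIRST(hits[* FILTER CURRENT.tag == 2 RETURN CURRENT.weight])
    // Bonus question: sort by the combined weight.
    SORT weight1 + weight2 DESC
    LIMIT 20
    RETURN { _key: key, weightWithDoc1: weight1, weightWithDoc26: weight2 }

Since both weights are available before the final RETURN, this would also cover the bonus SORT.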

Related

Arangodb AQL nested subqueries relying on the data from another

I currently have three collections that need to be combined into one endpoint. I want to get the Course collection and sort it; then, for each course, I have to use nested subqueries to fetch a random review (there could be multiple tied to the same course) and also get the related user.
User {
  name: ...
  _id: User/4638
  _key: ...
}
Review {
  _from: User/4638
  _to: Course/489
  date: ...
}
Course {
  _id: Course/489
  title: ...
}
The issue I'm having is fetching the user based on the review. I've tried MERGE, but that seems to limit the query to one user when there should be multiple. Below is the current output using LET.
"course": {
"_key": "789",
"_id": "Courses/789",
"_rev": "_ebjuy62---",
"courseTitle": "Pandas Essential Training",
"mostRecentCost": 15.99,
"hours": 20,
"averageRating": 5
},
"review": [
{
"_key": "543729",
"_id": "Reviews/543729",
"_from": "Users/PGBJ38",
"_to": "Courses/789",
"_rev": "_ebOrt9u---",
"rating": 2
}
],
"user": []
},
Here is the current LET subquery method I'm using. I was wondering if there was any way to pass or maybe nest the subqueries so that user can read review. Currently I try to pass the LET variable, but it isn't read in the output, since a blank array is shown.
FOR c IN Courses
SORT c.averageRating DESC
LIMIT 3
LET rev = (FOR r IN Reviews
FILTER c._id == r._to
SORT RAND()
LIMIT 1
RETURN r)
LET use = (FOR u IN Users
FILTER rev._from == u._id
RETURN u)
RETURN {course: c, review: rev, user: use}
The result of the first LET query, rev, is an array with one element. You can rewrite the complete query in two ways:
Set rev to the first element of the LET query result:
FOR c IN Courses
SORT c.averageRating DESC
LIMIT 3
LET rev = (FOR r IN Reviews
FILTER c._id == r._to
SORT RAND()
LIMIT 1
RETURN r)[0]
LET use = (FOR u IN Users
FILTER rev._from == u._id
RETURN u)
RETURN {course: c, review: rev, user: use}
I use this variant in my own projects.
Access the first element of rev in the second LET query:
FOR c IN Courses
SORT c.averageRating DESC
LIMIT 3
LET rev = (FOR r IN Reviews
FILTER c._id == r._to
SORT RAND()
LIMIT 1
RETURN r)
LET use = (FOR u IN Users
FILTER rev[0]._from == u._id
RETURN u)
RETURN {course: c, review: rev, user: use}
This is untested; the syntax might need slight changes. And you have to look at cases where there aren't any reviews - I can't say off the top of my head how this behaves in that case.
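A possible guard for the missing-review case (an untested sketch built on the first variant): let rev be null when there is no review and skip the user lookup explicitly.

FOR c IN Courses
  SORT c.averageRating DESC
  LIMIT 3
  // FIRST() yields null when the course has no reviews at all.
  LET rev = FIRST(FOR r IN Reviews
    FILTER c._id == r._to
    SORT RAND()
    LIMIT 1
    RETURN r)
  // Only look up the user when a review actually exists.
  LET use = rev == null ? [] : (FOR u IN Users
    FILTER u._id == rev._from
    RETURN u)
  RETURN {course: c, review: rev, user: use}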

How to group results in ArangoDB into a single record?

I have a list of events of certain types, structured as in the following example:
{
  createdAt: 123123132,
  type: "STARTED",
  metadata: {
    emailAddress: "foo@bar.com"
  }
}
The number of types is predefined (START, STOP, REMOVE, ...). Users produce one or more events over time.
I want to get the following aggregation:
For each user, calculate the number of events for each type.
My AQL query looks like this:
FOR event IN events
COLLECT
email = event.metadata.emailAddress,
type = event.type WITH COUNT INTO count
LIMIT 10
RETURN {
email,
t: {type, count}
}
This produces the following output:
{ email: '_84@example.com', t: { type: 'CREATE', count: 203 } }
{ email: '_84@example.com', t: { type: 'DEPLOY', count: 214 } }
{ email: '_84@example.com', t: { type: 'REMOVE', count: 172 } }
{ email: '_84@example.com', t: { type: 'START', count: 204 } }
{ email: '_84@example.com', t: { type: 'STOP', count: 187 } }
{ email: '_95@example.com', t: { type: 'CREATE', count: 189 } }
{ email: '_95@example.com', t: { type: 'DEPLOY', count: 173 } }
{ email: '_95@example.com', t: { type: 'REMOVE', count: 194 } }
{ email: '_95@example.com', t: { type: 'START', count: 213 } }
{ email: '_95@example.com', t: { type: 'STOP', count: 208 } }
...
i.e. I got a row for each type. But I want results like this:
{ email: foo@bar.com, count1: 203, count2: 214, count3: 172 ...}
{ email: aaa@fff.com, count1: 189, count2: 173, count3: 194 ...}
...
OR
{ email: foo@bar.com, CREATE: 203, DEPLOY: 214, ... }
...
i.e. to group the results again.
I also need to sort the results (not the events) by the counts: e.g. to return the top 10 users with the most CREATE events.
How to do that?
ONE SOLUTION
One solution is here, check the accepted answer for more.
FOR a in (FOR event IN events
COLLECT
emailAddress = event.metadata.emailAddress,
type = event.type WITH COUNT INTO count
COLLECT email = emailAddress INTO perUser KEEP type, count
RETURN MERGE(PUSH(perUser[* RETURN {[LOWER(CURRENT.type)]: CURRENT.count}], {email})))
SORT a.create desc
LIMIT 10
RETURN a
You could group by user and event type, then group again by user, keeping only the type and the already calculated event type counts. In the second aggregation, it is important to know into which groups the events fall in order to construct the result. An array inline projection can be used for that to keep the query short:
FOR event IN events
COLLECT
emailAddress = event.metadata.emailAddress,
type = event.type WITH COUNT INTO count
COLLECT email = emailAddress INTO perUser KEEP type, count
RETURN MERGE(PUSH(perUser[* RETURN {[CURRENT.type]: CURRENT.count}], {email}))
Another way would be to group by user and keep the event types, then group the types in a subquery. But it is significantly slower in my test (at least without any indexes defined):
FOR event IN events
LET type = event.type
COLLECT
email = event.metadata.emailAddress INTO groups KEEP type
LET byType = (
FOR t IN groups[*].type
COLLECT t2 = t WITH COUNT INTO count
RETURN {[t2]: count}
)
RETURN MERGE(PUSH(byType, {email}))
Returning the top 10 users with the most CREATE events is much simpler. Filter for CREATE event type, then group by user and count the number of events, sort by this number in descending order and return the first 10 results:
FOR event IN events
FILTER event.type == "CREATE"
COLLECT email = event.metadata.emailAddress WITH COUNT INTO count
SORT count DESC
LIMIT 10
RETURN {email, count}
EDIT1: Return one document per user with event types grouped and counted (like in the first query), but capture the MERGE result, sort by the count of one particular event type (here: CREATE) and return the top 10 users for this type. The result is the same as with the solution given in the question, but it spares the subquery à la FOR a IN (FOR event IN events ...) ... RETURN a:
FOR event IN events
COLLECT
emailAddress = event.metadata.emailAddress,
type = event.type WITH COUNT INTO count
COLLECT email = emailAddress INTO perUser KEEP type, count
LET ret = MERGE(PUSH(perUser[* RETURN {[CURRENT.type]: CURRENT.count}], {email}))
SORT ret.CREATE DESC
LIMIT 10
RETURN ret
EDIT2: Query to generate example data (requires a collection events to exist):
FOR i IN 1..100
LET email = CONCAT(RANDOM_TOKEN(RAND()*4+4), "@example.com")
FOR j IN SPLIT("CREATE,DEPLOY,REMOVE,START,STOP", ",")
FOR k IN 1..RAND()*150+50
INSERT {metadata: {emailAddress: email}, type: j} INTO events RETURN NEW

ArangoDB Faceted Search Performance

We are evaluating ArangoDB performance in the space of facet calculations.
There are a number of other products capable of doing the same, either via a special API or a query language:
MarkLogic Facets
ElasticSearch Aggregations
Solr Faceting, etc.
We understand there is no special API in Arango to calculate facets explicitly.
But in reality it is not needed; thanks to the comprehensive AQL, it can easily be achieved via a simple query, like:
FOR a in Asset
COLLECT attr = a.attribute1 INTO g
RETURN { value: attr, count: length(g) }
This query calculates a facet on attribute1 and yields frequencies in the form of:
[
{
"value": "test-attr1-1",
"count": 2000000
},
{
"value": "test-attr1-2",
"count": 2000000
},
{
"value": "test-attr1-3",
"count": 3000000
}
]
It says that across the entire collection, attribute1 takes three values (test-attr1-1, test-attr1-2 and test-attr1-3), with the related counts provided.
Essentially we run a DISTINCT query and aggregate the counts.
It looks simple and clean. With only one, but really big, issue: performance.
The query above runs for 31 seconds (!) on top of a test collection with only 8M documents.
We have experimented with different index types and storage engines (with rocksdb and without), and investigated explain plans, to no avail.
The test documents we use are very concise, with only three short attributes.
We would appreciate any input at this point.
Either we are doing something wrong, or ArangoDB simply is not designed to perform in this particular area.
By the way, the ultimate goal would be to run something like the following in sub-second time:
LET docs = (FOR a IN Asset
FILTER a.name like 'test-asset-%'
SORT a.name
RETURN a)
LET attribute1 = (
FOR a in docs
COLLECT attr = a.attribute1 INTO g
RETURN { value: attr, count: length(g[*])}
)
LET attribute2 = (
FOR a in docs
COLLECT attr = a.attribute2 INTO g
RETURN { value: attr, count: length(g[*])}
)
LET attribute3 = (
FOR a in docs
COLLECT attr = a.attribute3 INTO g
RETURN { value: attr, count: length(g[*])}
)
LET attribute4 = (
FOR a in docs
COLLECT attr = a.attribute4 INTO g
RETURN { value: attr, count: length(g[*])}
)
RETURN {
counts: (RETURN {
total: LENGTH(docs),
offset: 2,
to: 4,
facets: {
attribute1: {
from: 0,
to: 5,
total: LENGTH(attribute1)
},
attribute2: {
from: 5,
to: 10,
total: LENGTH(attribute2)
},
attribute3: {
from: 0,
to: 1000,
total: LENGTH(attribute3)
},
attribute4: {
from: 0,
to: 1000,
total: LENGTH(attribute4)
}
}
}),
items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
facets: {
attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 return a)
}
}
Thanks!
It turns out the main thread happened on the ArangoDB Google Group.
Here is a link to the full discussion.
Here is a summary of the current solution:
Run a custom build of Arango from a specific feature branch where a number of performance improvements have been made (hopefully they will make it into a main release soon)
No indexes are required for facet calculations
MMFiles is the preferred storage engine
AQL should be written to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)" (see the sketch below)
AQL should be split into smaller pieces and run in parallel (we are using Java 8's Fork/Join to spread facet AQLs and then join them into a final result):
One AQL to filter/sort and retrieve the main entity (if required; while sorting/filtering, add a corresponding skiplist index)
The rest are small AQLs, one per facet, returning value/frequency pairs
In the end we gained a >10x performance improvement compared to the original AQL provided above.
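For reference, a minimal sketch of the facet query rewritten as the fourth point suggests (same collection and attribute names as above):

// WITH COUNT INTO computes the group size directly, so the grouped
// documents never have to be materialized into a variable like g.
FOR a IN Asset
  COLLECT attr = a.attribute1 WITH COUNT INTO count
  RETURN { value: attr, count: count }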

SQL all rows in child 'categories' and child child 'categories' : recursive?

I'm having trouble writing a query that solves the following problem, which I believe needs some kind of recursion:
I have a table with houses, each of them having a specific house_type, e.g. house, bungalow, villa, etc. The house_types inherit from each other, as declared in a table called house_types.
table: houses
id | house_type
1 | house
2 | villa
3 | bungalow
etcetera...
table: house_types
house_type | parent
house | null
villa | house
bungalow | villa
etcetera...
In this logic, a bungalow is also a villa, and a villa is also a house. So when I want all villas, houses 2 and 3 should show up; when I want all houses, houses 1, 2 and 3 should show up; and when I want all bungalows, only house 3 should show up.
Is a recursive query the answer, and how should I work this out? I use knex/objection.js in a Node.js application.
Here is a recursive CTE that gets every pair in the hierarchy:
with recursive house_types as (
select 'house' as housetype, null as parent union all
select 'villa', 'house' union all
select 'bungalow', 'villa'
),
cte(housetype, alternate) as (
select housetype, housetype as alternate
from house_types
union all
select ht.housetype, cte.alternate
from cte join
house_types ht
on cte.housetype = ht.parent
)
select *
from cte;
(The house_types CTE is just to set up the data.)
You can then join this to other data to get any level of the hierarchy.
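For instance, a hedged sketch (assuming the houses table from the question): replace the final select with a join against the pair table to fetch every house that counts as a villa.

-- appended after the cte definition above, in place of "select * from cte"
-- returns all houses whose type is 'villa' or any subtype of it
select h.*
from houses h
join cte
  on cte.housetype = h.house_type
where cte.alternate = 'villa';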
To start with, @Gordon Linoff's answer is awesome. I'm just here to add specifics on how to do this with knex / objection.js.
That sounds like pretty nasty DB design. I would denormalise the type data so that queries are easier to write without recursive common table expressions (knex doesn't support them currently).
Anyway, here is some runnable code showing how to declare the objection.js models and denormalise the type info on the JavaScript side, to be able to make the queries you are trying to do: https://runkit.com/mikaelle/stackoverflow-43554373
Since Stack Overflow likes to have code contained in the answer, I'll copy-paste it here too. The example uses sqlite3 as the DB backend, but the same code also works with Postgres.
const _ = require('lodash');
require("sqlite3");
const knex = require("knex")({
client: 'sqlite3',
connection: ':memory:'
});
const { Model } = require('objection');
// init schema and test data
await knex.schema.createTable('house_types', table => {
table.string('house_type');
table.string('parent').references('house_types.house_type');
});
await knex.schema.createTable('houses', table => {
table.increments('id');
table.string('house_type').references('house_types.house_type');
});
await knex('house_types').insert([
{ house_type: 'house', parent: null },
{ house_type: 'villa', parent: 'house' },
{ house_type: 'bungalow', parent: 'villa' }
]);
await knex('houses').insert([
{id: 1, house_type: 'house' },
{id: 2, house_type: 'villa' },
{id: 3, house_type: 'bungalow' }
]);
// show initial data from DB
await knex('houses')
.join('house_types', 'houses.house_type', 'house_types.house_type');
// create models
class HouseType extends Model {
static get tableName() { return 'house_types' };
// http://vincit.github.io/objection.js/#relations
static get relationMappings() {
return {
parent: {
relation: Model.HasOneRelation,
modelClass: HouseType,
join: {
from: 'house_types.parent',
to: 'house_types.house_type'
}
}
}
}
}
class House extends Model {
static get tableName() { return 'houses' };
// http://vincit.github.io/objection.js/#relations
static relationMappings() {
return {
houseType: {
relation: Model.HasOneRelation,
modelClass: HouseType,
join: {
from: 'houses.house_type',
to: 'house_types.house_type'
}
}
}
}
}
// get all houses and all house types with recursive eager loading
// http://vincit.github.io/objection.js/#eager-loading
JSON.stringify(
await House.query(knex).eager('houseType.parent.^'), null, 2
);
// however code above doesn't really allow you to filter
// queries nicely and is pretty inefficient so as far as I know recursive
// with query is only way how to do it nicely with pure SQL
// since knex doesn't currently support them we can first denormalize housetype
// hierarchy (and maybe cache this one if data is not changing much)
const allHouseTypes = await HouseType.query(knex).eager('parent.^');
// initialize house types with empty arrays
const denormalizedTypesByHouseType = _(allHouseTypes)
.keyBy('house_type')
.mapValues(() => [])
.value();
// create denormalized type array for every type
allHouseTypes.forEach(houseType => {
// every type should be returned with exact type e.g. bungalow is bungalow
denormalizedTypesByHouseType[houseType.house_type].push(houseType.house_type);
let parent = houseType.parent;
while(parent) {
// bungalow is also villa so when searched for villa bungalows are returned
denormalizedTypesByHouseType[parent.house_type].push(houseType.house_type);
parent = parent.parent;
}
});
// just to see that denormalization did work as expected
console.log(denormalizedTypesByHouseType);
// all villas
JSON.stringify(
await House.query(knex).whereIn('house_type', denormalizedTypesByHouseType['villa']),
null, 2
);

Sort using a field in an embedded doc in an array without considering other equivalent fields in mongodb

I have a collection called Products.
Documents in Products look like this:
{
  id: 123456,
  recommendationByCategory: [
    { categoryId: "a01", recommendation: 3 },
    { categoryId: "0a2", recommendation: 8 },
    { categoryId: "0b10", recommendation: 99 },
    { categoryId: "0b5", recommendation: 1 }
  ]
}
{
  id: 567890,
  recommendationByCategory: [
    { categoryId: "a7", recommendation: 3 },
    { categoryId: "0a2", recommendation: 1 },
    { categoryId: "0b10", recommendation: 999 },
    { categoryId: "0b51", recommendation: 12 }
  ]
}
I want to find all docs whose recommendationByCategory contains categoryId: "0a2", sorted in ascending order by the recommendation value of category "0a2" alone. The recommendations of other categoryIds must not be considered. I need id: 567890 followed by id: 123456.
I cannot use aggregation. Is this possible using MongoDB/Mongoose? I tried the sort option 'recommendationByCategory.recommendation: 1', but it's not working.
Expected Query: db.collection('products').find({'recommendaionByCategory.categoryId': categoryId}).sort({'recommendationByCategory.recommendation: 1'})
Expected Result:
[
{doc with id:567890},
{doc with id: 123456}
]
If you cannot use mapReduce or the aggregation pipeline, there is no easy way to both search for the matching embedded document and sort on that document's prop.
I would recommend doing the find as you do above (note the typo in the find's nested key), and then sorting in memory:
const categoryId = '0a2';
// Look up the matching embedded doc with Array.prototype.find (it takes a
// predicate function, not an object) and return its recommendation value.
const findRec = doc =>
  doc.recommendationByCategory.find(rc => rc.categoryId === categoryId).recommendation;
db.collection('products')
  .find({ 'recommendationByCategory.categoryId': categoryId })
  .toArray()
  .then(docs => docs.sort((a, b) => findRec(a) - findRec(b)));
In regard to the aggregation pipeline being resource-intensive: it is several orders of magnitude more efficient than a map-reduce query, and it solves your particular issue. Either you accept that this task runs at a certain cost and frequency, taking Mongo's built-in caching into account, or you restructure your document schema to make this query more efficient.
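For comparison, a hedged sketch of what that pipeline could look like (untested, using only standard aggregation operators and the document layout from the question): extract the matching category's recommendation into a temporary sort key, sort on it, then drop it.

// Untested sketch of the aggregation-pipeline equivalent.
db.collection('products').aggregate([
  // Keep only products that have a recommendation for this category.
  { $match: { 'recommendationByCategory.categoryId': '0a2' } },
  // Pull that category's recommendation value out as a sort key.
  { $addFields: {
    sortKey: { $arrayElemAt: [
      { $map: {
        input: { $filter: {
          input: '$recommendationByCategory',
          cond: { $eq: ['$$this.categoryId', '0a2'] }
        } },
        in: '$$this.recommendation'
      } },
      0
    ] }
  } },
  { $sort: { sortKey: 1 } },
  // Drop the temporary key from the output.
  { $project: { sortKey: 0 } }
]);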
