Can I filter multiple collections? - arangodb

I want to filter multiple collections and return only the documents that meet the filter conditions. The problem is that when more than one document matches in one collection, the returned elements are repeated.
FOR TurmaA IN TurmaA
FOR TurmaB IN TurmaB
FILTER TurmaA.Disciplinas.Mat >10
FILTER TurmaB.Disciplinas.Mat >10
RETURN {TurmaA,TurmaB}

What your query does is to iterate over all documents of the first collection, and for each record it iterates over the second collection. The applied filters reduce the number of results, but this is not how you should go about it as it is highly inefficient.
Do you actually want to return the union of the matches from both collections (SELECT ... UNION SELECT ... in SQL)?
What you get with your current approach are all possible combinations of the documents from both collections. I believe what you want is:
LET a = (FOR t IN TurmaA FILTER t.Disciplinas.Mat > 10 RETURN t)
LET b = (FOR t IN TurmaB FILTER t.Disciplinas.Mat > 10 RETURN t)
FOR doc IN UNION(a, b)
RETURN doc
Both collections are filtered individually in sub-queries, then the results are combined and returned.
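To see why the nested loops repeat documents while the sub-query approach does not, here is a minimal Python sketch of the two query shapes. The sample documents and their nome field are made up for illustration, and plain lists stand in for the collections. It only mimics the semantics of the queries, not how ArangoDB executes them.

```python
from itertools import product

# Hypothetical sample data standing in for the two collections.
turma_a = [
    {"nome": "Ana", "Disciplinas": {"Mat": 12}},
    {"nome": "Rui", "Disciplinas": {"Mat": 8}},
]
turma_b = [
    {"nome": "Eva", "Disciplinas": {"Mat": 15}},
    {"nome": "Leo", "Disciplinas": {"Mat": 11}},
]


def passes(doc):
    """Same condition as FILTER t.Disciplinas.Mat > 10."""
    return doc["Disciplinas"]["Mat"] > 10


# Nested FOR loops: every matching A document is paired with every matching B document.
pairs = [(a, b) for a, b in product(turma_a, turma_b) if passes(a) and passes(b)]

# Sub-queries plus UNION: each collection is filtered once, then the results are concatenated.
union = [d for d in turma_a if passes(d)] + [d for d in turma_b if passes(d)]

print([(a["nome"], b["nome"]) for a, b in pairs])
print([d["nome"] for d in union])
```

With this sample data, Ana shows up in two pairs in the nested-loop result but only once in the union.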
Another solution would be to store all documents in one collection Turma and have another attribute e.g. Type with a value of "A" or "B". Then the query would be as simple as:
FOR t IN Turma
FILTER t.Disciplinas.Mat > 10
RETURN t
If you want to return TurmaA documents only, you would do:
FOR t IN Turma
FILTER t.Disciplinas.Mat > 10 AND t.Type == "A"
RETURN t
By the way, I recommend giving variables names that differ from collection names, e.g. t instead of Turma if there is a collection Turma.

Related

ArangoDB: COLLECT statement syntax error when adding the SUM function for a collection attribute

I am trying to sum all the Amounts in the edge collection named Transaction and also extract the days from its date attribute.
However, I am getting an error in the COLLECT statement.
for d in Transaction
filter d._to == "Account/123"
COLLECT aggregate ct =count(d._id),
aggregate totamnt=sum(d.Amount),
aggregate daysactive= count(distinct date_trunc(d.Time))
return distinct {"Incoming Accounts":length, "Days Active": daysactive}
If I understand what you want to achieve correctly, this is a query to achieve it:
FOR d IN Transaction
FILTER d._to == "Account/123"
COLLECT AGGREGATE length = COUNT_UNIQUE(d._id),
totamnt = SUM(d.Amount),
daysactive = COUNT_UNIQUE(DATE_TRUNC(d.Time, "day"))
RETURN {
"Incoming Accounts": length,
"Days Active": daysactive,
"Total Amount": totamnt
}
Note: DISTINCT is not necessary. I also included the total amount in the return value and specified "day" as the unit to truncate the date to.
I tested a slightly adapted version of this on a collection of mine and got sensible results.

Timeseries differencing - ArangoDB (AQL or Python)

I have a collection which holds documents, with each document having a data observation and the time that the data was captured.
e.g.
{
_key:....,
"data":26,
"timecaptured":1643488638.946702
}
where timecaptured for now is a utc timestamp.
What I want to do is get the duration between consecutive observations, i.e. the difference in timestamps between two documents in time order. With SQL I could do this with LAG, for example, but with ArangoDB and AQL I am struggling to see how to do this in the database. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.
Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
SORT d.timecaptured
WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
RETURN timediff
We simply calculate the sum of the timecaptured values of the previous and the current document. Subtracting the current document's timecaptured from that sum gives the timecaptured of the previous document, so we can easily calculate the requested difference.
I only use the COUNT to return null for the first document (which has no predecessor). If you remove it, the first document's difference comes out as its own timecaptured value, because the previous value is then computed as 0.
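The sum trick can be checked outside the database. The following is a minimal Python sketch that imitates a window over the previous and the current row of the sorted values. The function name consecutive_diffs and the sample timestamps are made up for illustration, and a list slice stands in for the WINDOW operation.

```python
def consecutive_diffs(timestamps):
    """Imitate WINDOW { preceding: 1 } with SUM and COUNT over sorted values."""
    ordered = sorted(timestamps)
    diffs = []
    for i, current in enumerate(ordered):
        window = ordered[max(0, i - 1): i + 1]  # previous row (if any) and current row
        s, cnt = sum(window), len(window)
        previous = s - current  # window sum minus the current value
        diffs.append(None if cnt == 1 else current - previous)
    return diffs


print(consecutive_diffs([10.0, 13.5, 11.0]))
```

The first entry is None because the first row has no predecessor; the remaining entries are the gaps between consecutive sorted timestamps.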
However, neither approach is very straightforward or obvious. I have put it on my TODO list to add an APPEND aggregate function that could be used in WINDOW and COLLECT operations.
The WINDOW operation doesn't give you direct access to the data in the sliding window, but here is a rather clever workaround:
FOR doc IN collection
SORT doc.timecaptured
WINDOW { preceding: 1 }
AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
LET timediff = doc.timecaptured - d[0].timecaptured
RETURN MERGE(doc, {timediff})
The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.
The time difference is calculated by subtracting the previous document's timecaptured value from the current document's one. In the case of the first record, d[0] is actually equal to the current document and the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured, on the other hand, will give you the negated timestamp for the first record, because d[1] is null (there is no second element) and evaluates to 0.
There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:
LET timediff = doc.timecaptured - (
FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)
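As a quick sanity check of the order-independent variant, here is a small Python sketch. The helper name abs_diff and the sample values are hypothetical, and a Python list stands in for the aggregation variable d, which holds one or two projected documents.

```python
def abs_diff(window):
    """Absolute difference between the last and the first element of the window."""
    return abs(window[-1]["timecaptured"] - window[0]["timecaptured"])


prev = {"_key": "1", "timecaptured": 100.0}
curr = {"_key": "2", "timecaptured": 104.5}

print(abs_diff([prev, curr]))  # window in time order
print(abs_diff([curr, prev]))  # window reordered
print(abs_diff([prev]))        # first record: the window holds only itself
```

Both orderings give the same difference, and a single-element window gives 0, matching the behaviour described for the first record.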

ClickHouse - Search within nested fields

I have a nested field named items.productName wherein I want to check if the product name contains a particular string.
SELECT * FROM test WHERE hasAny(items.productName,['Samsung'])
This works only when the product name is exactly Samsung.
I have tried ARRAY JOIN:
SELECT
*
FROM test
ARRAY JOIN items
WHERE items.productName LIKE '%Samsung%'
This works, but it is very slow (~1 sec for 5 million records).
Is there a way to perform a LIKE match within hasAny?
You can achieve this using the arrayFilter function (see the ClickHouse docs).
Query
SELECT * FROM test WHERE arrayFilter(x -> x LIKE '%Samsung%', items.productName) != []
If you do not use != [] then you will get an error "DB::Exception: Illegal type Array(String) of column for filter. Must be UInt8 or Nullable(UInt8) or Const variants of them."
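The logic of that WHERE clause can be sketched in Python as follows. The function name and the sample rows are made up for illustration, and a case-sensitive substring test stands in for LIKE '%Samsung%'. The point is that the filtered array is compared against the empty array to obtain a boolean per row.

```python
def has_matching_product(product_names, needle="Samsung"):
    """Keep the names containing the needle, then test the result for non-emptiness."""
    matches = [name for name in product_names if needle in name]
    return matches != []


# Hypothetical rows, each with an array of product names.
rows = [
    ["Samsung Galaxy S21", "Phone case"],
    ["Apple iPhone 13"],
    [],
]

print([has_matching_product(names) for names in rows])
```

Only the first row qualifies, since it is the only one with a product name that contains the search string.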

Many to many AQL query

I have 2 collections and one edge collection: USERS, FILES and FILES_USERS.
I'm trying to get all FILES documents that have the field "what" set to "video" for a specific user. I also want to embed in each result another document from the collection FILES, one where "what" is set to "trailer" and which belongs to that "video".
I have tried the code below, but it's not working correctly. I'm getting a lot of duplicate results and it's a mess. I'm definitely doing it wrong.
FOR f IN files
FILTER f.what=="video"
LET trailer = (
FOR f2 IN files
FILTER f2.parent_key==f._key
AND f2.what=="trailer"
RETURN f2
)
FOR x IN files_users
FILTER x._from=="users/18418062"
AND x.owner==true
RETURN DISTINCT {f,trailer}
There may be a better way to do this with graph query syntax, but try this. Adjust the UNIQUE functions based on your data-model.
// _to values of the edges owned by the user (document IDs, e.g. "files/123")
LET user_files = UNIQUE(FOR u IN files_users
FILTER u._from == "users/18418062" AND u.owner
RETURN u._to)
FOR uf IN user_files
FOR f IN files
// _to holds a document _id, so compare against f._id rather than f._key
FILTER f._id == uf AND f.what == "video"
LET trailers = UNIQUE(FOR t IN files
FILTER t.parent_key == f._key AND t.what == "trailer"
RETURN t)
RETURN {"video": f, "trailers": trailers}
Check whether you have duplicate data, as suggested by TMan, but check your query too. There appears to be no link between the f loop and the x loop in the main query. That could cause the query to return a lot of duplicates if there are multiple records in the collection files_users for the user users/18418062.
Try adding a join in the main query. Something like:
FOR x IN files_users
FILTER x._from=="users/18418062"
AND x.owner==true
AND x._to == f._id
RETURN DISTINCT {f,trailer}
On a related note, if you run into performance issues doing a subquery for trailers, you could instead try just doing a join and array expansion and see if that works for your case.
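The effect of the missing link between the two loops can be sketched in Python. The documents, IDs and the helper name join are hypothetical, and nested loops over plain lists stand in for the two FOR loops. Without the x._to == f._id condition, every video is emitted once per matching edge, whether or not the user owns it.

```python
# Hypothetical sample data: two videos and two ownership edges of one user.
files = [
    {"_id": "files/1", "what": "video"},
    {"_id": "files/2", "what": "video"},
]
files_users = [
    {"_from": "users/1", "_to": "files/1", "owner": True},
    {"_from": "users/1", "_to": "files/9", "owner": True},
]


def join(linked):
    """Nested loops over files and edges, optionally with the join condition."""
    rows = []
    for f in files:
        for x in files_users:
            if x["_from"] != "users/1" or not x["owner"]:
                continue
            if linked and x["_to"] != f["_id"]:
                continue  # the extra x._to == f._id condition
            rows.append(f["_id"])
    return rows


print(join(linked=False))
print(join(linked=True))
```

In this sample, the unlinked version returns each video twice, including one the user has no edge to, while the linked version returns only the owned video.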

How to check if ArangoDB query is not empty?

I would like to write the equivalent of a PostgreSQL EXISTS query.
Let's say I have a Q ArangoDB query (AQL). How can I check if Q returns any result?
Example:
Q = "For u in users FILTER 'x#example.com' = u.email"
What is the best way to do it (most performant)?
I have ideas, but couldn't find an easy way to measure the performance:
Idea 1: using LENGTH:
RETURN LENGTH(%Q RETURN 1) > 0
Idea 2: using FIRST:
RETURN FIRST(%Q RETURN 1) != null
Above, %Q is a substitution for the query defined at the beginning.
I think the best way to achieve this for a generic selection query with a structure like
Q = "For u in users FILTER 'x#example.com' = u.email"
is to first add a LIMIT clause to the query, and only make it return a constant value (in contrast to the full document).
For example, the following query returns a single match if there is such a document, or an empty array if there is no match:
FOR u IN users FILTER 'x#example.com' == u.email LIMIT 1 RETURN 1
(please note that I also changed the operator from = to == because otherwise the query won't parse).
Please note that this query may benefit a lot from creating an index on the search attribute, i.e. email. Without the index the query will do a full collection scan and stop at the first match, whereas with the index it will just read at most a single index entry.
Finally, to answer your question, the template for the EXISTS-like query will then become
LENGTH(%Q LIMIT 1 RETURN 1)
or fleshed out via the example query:
LENGTH(FOR u IN users FILTER 'x#example.com' == u.email LIMIT 1 RETURN 1)
LENGTH(...) will return the number of matches, which in this case will be either 0 or 1. It can also be used in filter conditions as follows:
FOR ....
FILTER LENGTH(...)
RETURN ...
because LENGTH(...) will be either 0 or 1, which in context of a FILTER condition will evaluate to either false or true.
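The early-exit behaviour of the LIMIT 1 pattern can be illustrated with a short Python sketch. The function name exists, the row counter and the sample users are made up for illustration, and a generator stands in for the full-scan case without an index. It stops scanning as soon as the first match has been produced.

```python
from itertools import islice


def exists(rows, predicate):
    """Return (found, scanned): stop at the first match, like LIMIT 1."""
    scanned = 0

    def matches():
        nonlocal scanned
        for row in rows:
            scanned += 1
            if predicate(row):
                yield 1  # constant value, like RETURN 1

    # Taking at most one item mirrors LENGTH(... LIMIT 1 RETURN 1).
    found = len(list(islice(matches(), 1))) > 0
    return found, scanned


# Hypothetical sample data.
users = [
    {"email": "a#example.com"},
    {"email": "x#example.com"},
    {"email": "c#example.com"},
]

print(exists(users, lambda u: u["email"] == "x#example.com"))
print(exists(users, lambda u: u["email"] == "nobody#example.com"))
```

A hit on the second row means only two rows are scanned, whereas a miss has to walk through all rows, which is where an index on the search attribute helps.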
Do you need an AQL solution?
Only the count:
// The query needs == for comparison and a RETURN clause to parse
var q = "FOR u IN users FILTER 'x#example.com' == u.email RETURN 1";
var res = db._createStatement({query: q, count: true}).execute();
var ct = res.count();
This is the fastest approach I can think of.

Resources