ArangoDB Hash Index Ignored (?)

My database currently consists of 3 document collections with between 250k and 1.5M documents. I set my own document _keys and have added hash indexes on a few top-level fields and lists (the lists containing references to other keys or (indexed) fields).
The collections A and C have an n:m relationship via B. The query I first came up with looks like this:
for a in collection_a
    filter a.name != null
    filter length(a.bs) > 0
    limit 1
    return {
        'akey': a._key,
        'name': a.name,
        'cs': (
            for b in collection_b
                filter b.a == a._key
                for c in collection_c
                    filter b.c == c._key
                    return c.name
        )
    }
This is excruciatingly slow. I also tried other approaches, such as turning the middle for into for b in a.bs (bs being a list of keys of collection_b documents).
Printing explain() for the above query shows an immense cost, and getExtra() indicates that no indexes were used:
{
  "stats" : {
    "writesExecuted" : 0,
    "writesIgnored" : 0,
    "scannedFull" : 6009930,
    "scannedIndex" : 0
  },
  "warnings" : [ ]
}
An alternate approach works as fast as I'd expected it to be in the first place:
for a in collection_a
    filter a.name != null
    filter length(a.bs) > 0
    limit 1
    return {
        'akey': a._key,
        'name': a.name,
        'cs': (
            for b in a.bs
                return DOCUMENT(collection_c, DOCUMENT(collection_b, b).c).name
        )
    }
But even here, no indexes appear to be used:
{
  "stats" : {
    "writesExecuted" : 0,
    "writesIgnored" : 0,
    "scannedFull" : 3000,
    "scannedIndex" : 0
  },
  "warnings" : [ ]
}
One thing that may already explain this: do hash indexes simply not work for elements of a list (or did I make a mistake when creating them)? The getExtra() output of the second example would hint at this.
My expectation, however, would be that ArangoDB indexes all elements of the lists (such as a.bs) and that the query optimizer realizes when indexed attributes are used in a query.
If I run for b in collection_b filter b.a == 'somekey', I get an instantaneous result as expected. And that's just running the middle for in isolation. Same behaviour when I run the innermost for in isolation.
Is this a bug? Is there an explanation for this behaviour? Am I doing something wrong in the first query? The AQL examples themselves use nested fors, so that's what I naturally ended up trying first.

This has been fixed in release 2.3.2.
Clarification: the query you posted is correct. There was an issue in release 2.3.0 that prevented indexes in subqueries from being used; it has been fixed in release 2.3.2.
The initial query you posted should properly use indexes in 2.3.2. If there is a hash index available on the join attributes, it should be used, because the query only contains equality lookups.
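For reference, a minimal arangosh sketch of creating a hash index on the join attribute used above (the field name is taken from the question; adjust it if your documents differ):
    // run in arangosh; serves the lookup  filter b.a == a._key
    db.collection_b.ensureHashIndex("a");
The inner join on c._key needs no extra index, since _key lookups go through the primary index. With this in place, the scannedIndex counter reported by getExtra() should become non-zero.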

Related

Presto subqueries: Key not present in map

I have been banging my head for a while against the Superset -> Presto (PrestoSQL) -> Prometheus combination (as Superset does not yet support Prometheus) and got stymied by an issue when trying to extract columns from Presto's map-type column containing Prometheus labels.
In order to get the necessary labels mapped as columns from Superset's point of view, I created an extra table (or, I guess, a view in this case) in Superset on top of the existing table, using the following SQL to create the necessary columns:
SELECT labels['system_name'] AS "system",
       labels['instance'] AS "instance",
       "timestamp" AS "timestamp",
       "value" AS "value"
FROM "up"
This table is then used as a data source in a Superset chart, which treats it as a subquery. The resulting SQL query created by Superset and sent to Presto looks, for example, like this:
SELECT "system" AS "system",
"instance" AS "instance",
"timestamp" AS "timestamp",
"value" AS "value"
FROM
(SELECT labels['system_name'] AS "system",
labels['instance'] AS "instance",
"timestamp" AS "timestamp",
"value" AS "value"
FROM "up") AS "expr_qry"
WHERE "timestamp" >= from_iso8601_timestamp('2020-10-19T12:00:00.000000')
AND "timestamp" < from_iso8601_timestamp('2020-10-19T13:00:00.000000')
ORDER BY "timestamp" ASC
LIMIT 250;
However, what I get from the above is an error:
io.prestosql.spi.PrestoException: Key not present in map: system_name
at io.prestosql.operator.scalar.MapSubscriptOperator$MissingKeyExceptionFactory.create(MapSubscriptOperator.java:173)
at io.prestosql.operator.scalar.MapSubscriptOperator.subscript(MapSubscriptOperator.java:143)
at io.prestosql.$gen.CursorProcessor_20201019_165636_32.filter(Unknown Source)
After reading a bit about queries in Presto's user guide, I tried a modified query from the command line using WITH:
WITH x AS (
    SELECT labels['system_name'] AS "system",
           labels['instance'] AS "instance",
           "timestamp" AS "timestamp",
           "value" AS "value"
    FROM "up"
)
SELECT system, timestamp, value FROM x
WHERE "timestamp" >= from_iso8601_timestamp('2020-10-19T12:00:00.000000')
AND "timestamp" < from_iso8601_timestamp('2020-10-19T13:00:00.000000')
LIMIT 250;
And that went through without any issues. But it seems I have no way to define how Superset executes its queries, so I'm stuck with the first option. The question is: is there anything wrong with it that could be fixed?
I guess one option (if everything else fails) would be to define extra tables on the Presto side that do the same trick for mapping the columns, hopefully avoiding the above issue.
The map subscript operator in Presto requires that the key be present in the map. Otherwise, you get the failure you described.
If some keys can be missing, you can use the element_at function instead, which will return a NULL result:
Returns value for given key, or NULL if the key is not contained in the map.
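Applied to the view definition from the question, the rewrite would look like this (a sketch using the same table and label names):
    SELECT element_at(labels, 'system_name') AS "system",
           element_at(labels, 'instance') AS "instance",
           "timestamp" AS "timestamp",
           "value" AS "value"
    FROM "up"
Because element_at returns NULL for rows whose label map lacks the key, the outer query that Superset generates no longer fails on those rows.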

Cosmos: DISTINCT results with JOIN and ORDER BY

I'm trying to write a query that uses a JOIN to perform a geo-spatial match against locations in an array. I got it working, but added DISTINCT in order to de-duplicate (Query A):
SELECT DISTINCT VALUE u
FROM u
JOIN loc IN u.locations
WHERE ST_WITHIN(
    {'type':'Point','coordinates':[loc.longitude,loc.latitude]},
    {'type':'Polygon','coordinates':[[[-108,-43],[-108,-40],[-110,-40],[-110,-43],[-108,-43]]]})
However, I then found that combining DISTINCT with continuation tokens isn't supported unless you also add ORDER BY:
System.ArgumentException: Distict query requires a matching order by in order to return a continuation token. If you would like to serve this query through continuation tokens, then please rewrite the query in the form 'SELECT DISTINCT VALUE c.blah FROM c ORDER BY c.blah' and please make sure that there is a range index on 'c.blah'.
So I tried adding ORDER BY like this (Query B):
SELECT DISTINCT VALUE u
FROM u
JOIN loc IN u.locations
WHERE ST_WITHIN(
    {'type':'Point','coordinates':[loc.longitude,loc.latitude]},
    {'type':'Polygon','coordinates':[[[-108,-43],[-108,-40],[-110,-40],[-110,-43],[-108,-43]]]})
ORDER BY u.created
The problem is, the DISTINCT no longer appears to be taking effect because it returns, for example, the same record twice.
To reproduce this, create a single document with this data:
{
  "id": "b6dd3e9b-e6c5-4e5a-a257-371e386f1c2e",
  "locations": [
    { "latitude": -42, "longitude": -109 },
    { "latitude": -42, "longitude": -109 }
  ],
  "created": "2019-03-06T03:43:52.328Z"
}
Then run Query A above. You will get a single result, despite the fact that both locations match the predicate. If you remove the DISTINCT, you'll get the same document twice.
Now run Query B and you'll see it returns the same document twice, despite the DISTINCT clause.
What am I doing wrong here?
Indeed, I reproduced your issue. Based on my research, it seems to be a defect in the Cosmos DB DISTINCT query. Please refer to this link: Provide support for DISTINCT.
This feature is broken in the data explorer. Because Cosmos can only return 100 results per page at a time, the distinct keyword will only apply to a single page. So, if your result set contains more than 100 results, you may still get duplicates back - they will simply be on separately paged result sets.
You could describe your own situation and vote up this feedback case.
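Until that is addressed, one workaround is to drain all result pages and de-duplicate on the client. A minimal sketch, assuming the .NET SDK v3 (Microsoft.Azure.Cosmos) and a hypothetical User class with Id and Created properties matching the document above:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

static async Task<List<User>> QueryDistinctUsersAsync(Container container)
{
    // Query B from above; DISTINCT is only applied per page server-side,
    // so duplicates can still arrive across page boundaries
    var query = new QueryDefinition(
        "SELECT DISTINCT VALUE u FROM u JOIN loc IN u.locations " +
        "WHERE ST_WITHIN({'type':'Point','coordinates':[loc.longitude,loc.latitude]}, " +
        "{'type':'Polygon','coordinates':[[[-108,-43],[-108,-40],[-110,-40],[-110,-43],[-108,-43]]]}) " +
        "ORDER BY u.created");
    var seen = new HashSet<string>();
    var results = new List<User>();
    var iterator = container.GetItemQueryIterator<User>(query);
    while (iterator.HasMoreResults)                      // one iteration per page
        foreach (var doc in await iterator.ReadNextAsync())
            if (seen.Add(doc.Id))                        // Add returns false for duplicates
                results.Add(doc);
    return results;
}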

In ArangoDB AQL how do I update fields of every traversed object, regardless of collection?

The UPDATE statement requires a collection name. In a graph, I want to traverse an edge and update each vertex visited.
Something like:
FOR v IN 1..50 INBOUND 'pmconfig/6376876' pm_content OPTIONS {uniqueVertices: "global"} [update v.contentsChanged=true]
Since the vertices may be found in different collections, do I have to partition the objects into per-collection lists?
Updating documents using a dynamic (run-time) collection name is not possible. For example, the following AQL query will not compile:
FOR v IN 1..50 INBOUND 'pmconfig/6376876' pm_content
UPDATE v WITH { contentsChanged : true } IN PARSE_IDENTIFIER(v._id).collection
The reason is that UPDATE and other data modification statements require their collection names to be known at query compile time. Using bind parameters will work, of course, but no runtime expressions are allowed.
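To illustrate the bind-parameter route, a hedged sketch from arangosh (the collection name is still supplied per query rather than computed at runtime; the FILTER restricts the UPDATE to vertices of the bound collection):
    db._query(
      "FOR v IN 1..50 INBOUND 'pmconfig/6376876' pm_content " +
      "FILTER PARSE_IDENTIFIER(v._id).collection == @col " +
      "UPDATE v WITH { contentsChanged: true } IN @@col",
      { col: "vertexCollection1", "@col": "vertexCollection1" }
    );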
One workaround is to run the traversal multiple times, each time with a hard-coded collection to update. This still requires knowledge of the collection names in advance. There may also be consistency issues if the graph changes in between:
FOR v IN 1..50 INBOUND 'pmconfig/6376876' pm_content
FILTER PARSE_IDENTIFIER(v._id).collection == 'vertexCollection1'
UPDATE v WITH { contentsChanged : true } IN vertexCollection1
FOR v IN 1..50 INBOUND 'pmconfig/6376876' pm_content
FILTER PARSE_IDENTIFIER(v._id).collection == 'vertexCollection2'
UPDATE v WITH { contentsChanged : true } IN vertexCollection2
...
Another workaround is to have the traversal already generate the per-collection lists, and issue follow-up queries for each collection. For example, the following query should return the keys to update per collection:
FOR v IN 1..50 INBOUND 'pmconfig/6376876' pm_content
    LET parts = PARSE_IDENTIFIER(v._id)
    COLLECT collection = parts.collection INTO keys = parts.key
    RETURN { collection, keys }
Example result:
[
  {
    "collection" : "vertexCollection1",
    "keys" : [ "abc" ]
  },
  {
    "collection" : "vertexCollection2",
    "keys" : [ "xyz", "ddd" ]
  }
]
Using this result structure, the follow-up update queries can be constructed easily and the updates sent.
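A sketch of such a follow-up, again from arangosh, with the collection name passed as a collection bind parameter (@@collection):
    // 'entries' holds the result array of the COLLECT query above
    entries.forEach(function (entry) {
      db._query(
        "FOR k IN @keys UPDATE k WITH { contentsChanged: true } IN @@collection",
        { keys: entry.keys, "@collection": entry.collection }
      );
    });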

Why is sorting in ArangoDB slow?

I am experimenting to see whether ArangoDB might be suitable for our use case.
We will have large collections of documents with the same schema (like an SQL table).
To try some queries, I have inserted about 90K documents, which is low, as we expect document counts in the order of 1 million or more.
Now I want to get a simple page of these documents, without filtering, but with descending sorting.
So my AQL is:
for a in test_collection
    sort a.ARTICLE_INTERNALNR desc
    limit 0, 10
    return { 'nr': a.ARTICLE_INTERNALNR }
When I run this in the AQL Editor, it takes about 7 seconds, while I would expect a couple of milliseconds or something like that.
I have tried creating a hash index and a skiplist index on it, but that didn't have any effect:
db.test_collection.getIndexes()
[
  {
    "id" : "test_collection/0",
    "type" : "primary",
    "unique" : true,
    "fields" : [ "_id" ]
  },
  {
    "id" : "test_collection/19812564965",
    "type" : "hash",
    "unique" : true,
    "fields" : [ "ARTICLE_INTERNALNR" ]
  },
  {
    "id" : "test_collection/19826720741",
    "type" : "skiplist",
    "unique" : false,
    "fields" : [ "ARTICLE_INTERNALNR" ]
  }
]
So, am I missing something, or is ArangoDB not suitable for these cases?
If ArangoDB needs to sort all the documents, this will be a relatively slow operation (compared to not sorting). So the goal is to avoid the sorting at all.
ArangoDB has a skiplist index, which keeps indexed values in sorted order, and if that can be used in a query, it will speed up the query.
There are a few gotchas at the moment:
- AQL queries without a FILTER condition won't use an index.
- The skiplist index is fine for forward-order traversals, but it has no backward-order traversal facility.
Both these issues seem to have affected you.
We hope to fix both issues as soon as possible.
At the moment there is a workaround to enforce using the index in forward-order using an AQL query as follows:
FOR a IN SKIPLIST(test_collection, { ARTICLE_INTERNALNR: [ [ '>', 0 ] ] }, 0, 10)
    RETURN { nr: a.ARTICLE_INTERNALNR }
The above picks up the first 10 documents via the index on ARTICLE_INTERNALNR with a condition "value > 0". I am not sure if there is a solution for sorting backwards with limit.

Error in Linq: The text data type cannot be selected as DISTINCT because it is not comparable

I've a problem with LINQ. Basically, a third-party database that I need to connect to uses the now-deprecated text data type (I can't change this), and I need to execute a Distinct() in my LINQ on results that contain this field.
I don't want to do a ToList() before executing the Distinct() as that will result in thousands of records coming back from the database that I don't require and will annoy the client as they get charged for bandwidth usage. I only need the first 15 distinct records.
Anyway, the query is below:
var query = (from s in db.tSearches
join sc in db.tSearchIndexes on s.GUID equals sc.CPSGUID
join a in db.tAttributes on sc.AttributeGUID equals a.GUID
where s.Notes != null && a.Attribute == "Featured"
select new FeaturedVacancy
{
Id = s.GUID,
DateOpened = s.DateOpened,
Notes = s.Notes
});
return query.Distinct().OrderByDescending(x => x.DateOpened);
I know I can do a subquery to do the same thing as above (tSearches contains unique records), but I'd prefer a more straightforward solution if available, as I need to change a number of similar queries throughout the code to get this working.
No answers on how to do this, so I went with my first suggestion: I retrieved the unique records from tSearches first, then constructed a subquery with the non-unique records and filtered the search results by this subquery. Answer below:
var query = (from s in db.tSearches
where s.DateClosed == null && s.ConfidentialNotes != null
orderby s.DateOpened descending
select new FeaturedVacancy
{
Id = s.GUID,
Notes = s.ConfidentialNotes
});
/* Now filter by our 'Featured' attribute */
var subQuery = from sc in db.tSearchIndexes
join a in db.tAttributes on sc.AttributeGUID equals a.GUID
where a.Attribute == "Featured"
select sc.CPSGUID;
query = query.Where(x => subQuery.Contains(x.Id));
return query;
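Since only the first 15 distinct records are needed, the caller can append Take(15) before enumerating; because the result is still an IQueryable, the paging is translated into the generated SQL (a TOP clause) rather than applied in memory. A small sketch, assuming the method above is named GetFeaturedVacancies() (the name is hypothetical):

// only 15 rows come back from the database; nothing is materialized early
var featured = GetFeaturedVacancies().Take(15).ToList();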
