AQL supports basic paging via LIMIT offset, count, but I need the total number of results in order to know the total number of pages. How can I get the total count of a query?
I know the LENGTH function can return the count of a collection, but it doesn't seem to suit the following:
FOR v IN 2 ANY 'Collection/id1' GRAPH 'graph-name' FILTER ... LIMIT 10 RETURN DISTINCT v
I want to get the total number, but I can't get it with RETURN DISTINCT LENGTH(v).
For now I can implement this in an ungraceful way:
LET nodeList = (FOR v IN 2 ANY 'Collection/id1' GRAPH 'graph-name' FILTER ... RETURN DISTINCT v)
FOR v IN 2 ANY 'Collection/id1' GRAPH 'graph-name' FILTER ... LIMIT 10 RETURN DISTINCT {'nodes': v, 'total': LENGTH(nodeList)}
Is there a better way to get this?
I found the answer in the arangodb-spring-data project: AqlQueryOptions has a fullCount() option that makes the query report the total number of results (ignoring LIMIT), and you can return a PageImpl which contains the query results together with the pagination info.
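For plain AQL outside of Spring Data, the same mechanism is exposed as the fullCount query option: execute the query with the option set and read the total from the cursor's extra stats. A minimal arangosh sketch, reusing the placeholder FILTER from the question:

db._query(
  "FOR v IN 2 ANY 'Collection/id1' GRAPH 'graph-name' FILTER ... LIMIT 10 RETURN DISTINCT v",
  {},                  // bind variables (none here)
  { fullCount: true }  // count matches as if there were no LIMIT
).getExtra().stats.fullCount

Note that fullCount is only meaningful for queries with a top-level LIMIT.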
Given the following ArangoSearch query, I'm trying to search for the term "Clim Do" across the fields of multiple Views.
LET QR1 = (
  FOR doc_asview_global1 IN asview_global1
    SEARCH
      ANALYZER(
        LIKE(doc_asview_global1.f_name, "%Clim%") OR
        LIKE(doc_asview_global1.f_name, "%Do%"), "identity") OR
      ANALYZER(
        LEVENSHTEIN_MATCH(doc_asview_global1.f_name, "Clim", 1, true) OR
        LEVENSHTEIN_MATCH(doc_asview_global1.f_name, "Do", 1, true), "text_en") OR
      ANALYZER(
        PHRASE(doc_asview_global1.f_name, "Clim Do"), "text_en")
    LIMIT 10000
    SORT BM25(doc_asview_global1) DESC
    RETURN doc_asview_global1)

LET QR2 = (
  FOR doc_asview_global2 IN asview_global2
    SEARCH
      ANALYZER(
        LIKE(doc_asview_global2.f_cause, "%Clim%") OR
        LIKE(doc_asview_global2.f_cause, "%Do%"), "identity") OR
      ANALYZER(
        LEVENSHTEIN_MATCH(doc_asview_global2.f_cause, "Clim", 1, true) OR
        LEVENSHTEIN_MATCH(doc_asview_global2.f_cause, "Do", 1, true), "text_en") OR
      ANALYZER(
        PHRASE(doc_asview_global2.f_cause, "Clim Do"), "text_en")
    LIMIT 10000
    SORT BM25(doc_asview_global2) DESC
    RETURN doc_asview_global2)

RETURN UNION_DISTINCT(QR1, QR2)
The problem here is that the results of each subquery are ranked on their own. Is there any way to merge this into a single query, or at least to rank the merged results together?
By the way, could this query be optimized in any way?
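To illustrate ranking the merged results together, here is a hedged sketch (assuming the BM25() score can be returned from inside each subquery; only one search condition is repeated for brevity):

LET QR1 = (
  FOR d IN asview_global1
    SEARCH ANALYZER(PHRASE(d.f_name, "Clim Do"), "text_en") // plus the other conditions above
    LET score = BM25(d)  // keep the score alongside the document
    SORT score DESC
    LIMIT 10000
    RETURN { doc: d, score: score })
LET QR2 = (
  FOR d IN asview_global2
    SEARCH ANALYZER(PHRASE(d.f_cause, "Clim Do"), "text_en") // plus the other conditions above
    LET score = BM25(d)
    SORT score DESC
    LIMIT 10000
    RETURN { doc: d, score: score })
FOR r IN UNION(QR1, QR2)
  SORT r.score DESC  // rank the merged results by their stored scores
  RETURN r.doc

Note that BM25 values from two different Views are not strictly comparable, since they depend on each View's term statistics, so treat the merged ranking as approximate.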
In my AutoQuery request I have a left join specified so I can query on properties in the joined table.
public class ProductSearchRequest : QueryDb<Book>
, ILeftJoin<Book, BookAuthor>, ILeftJoin<BookAuthor, Author>
{}
If I use the standard AutoQuery approach like so:
var q = AutoQuery.CreateQuery(request, base.Request);
var results = AutoQuery.Execute(request, q);
and 100 rows are requested, then often fewer than 100 will be returned, because the Take() is applied to the left-joined result set.
To remedy this I am doing this instead:
var q = AutoQuery.CreateQuery(request, base.Request);
q.OrderByExpression = null; // throws an error if an ORDER BY exists
var total = Db.Scalar<int>(q.Select(x => Sql.CountDistinct(x.Id))); //returns 0
var q1 = AutoQuery.CreateQuery(request, base.Request).GroupBy(x => x);
var results = Db.Select<Book>(q1);
return new QueryResponse<Book>
{
Offset = q1.Offset.GetValueOrDefault(0),
Total = total,
Results = results
};
The GroupBy() appears to return the correct number of results, so paging works, but Total returns 0.
I also tried:
var total2 = (int)Db.Count(q1);
But even though q1 has a GroupBy(), it returns the number of rows produced by the left join, not the count for the actual query.
How can I get the true total of the query?
(Some official docs on how to do paging and totals with AutoQuery and left joins would be very helpful, as right now it's a bit confusing.)
Your primary issue stems from trying to return a different total than that of the actual query AutoQuery executes. If you have multiple left joins, the total is the number of results of the query that's executed, not the number of rows in your source table.
So you're not looking for the "true total"; rather, you're looking to execute a different query to get a different total than the query that's executed, while still deriving from the original query as its basis. First consider using normal INNER JOINs (IJoin<>) instead of LEFT JOINs, so the query only returns results that have related rows in the joined tables, which the total will then reflect accordingly.
Your total query that returns 0 is likely returning no results, so I'd inspect the executed query in an SQL profiler to see what is actually run. You can also log OrmLite queries with Debug logging enabled and this in your AppHost:
OrmLiteUtils.PrintSql();
Also note that GroupBy() over the entire table is unusual; you would normally group by one or more explicitly selected columns, e.g.:
.GroupBy(x => x.Id);
.GroupBy(x => new { x.Id, x.Name });
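If you still need a distinct total derived from the original query, a hedged sketch along the lines of the code in the question (unverified; it assumes the Book.Id column and the OrmLite APIs shown above) would be:

// derive COUNT(DISTINCT Book.Id) from the same AutoQuery expression
var countQ = AutoQuery.CreateQuery(request, base.Request);
countQ.OrderByExpression = null; // aggregate queries can't carry an ORDER BY
var total = Db.Scalar<long>(countQ.Select(x => Sql.CountDistinct(x.Id)));

Inspecting the generated SQL (e.g. via the PrintSql() call above) is the quickest way to see why the original attempt returned 0.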
I'm sure there is an easy and fast way to do this, but it's escaping me. I have a large dataset that contains some duplicate records, and I want to get rid of the duplicates. (The duplicates are uniquely identified by one property, but the rest of the document should be identical as well.)
I've attempted to create a new collection that only has unique values in a few different ways, but they are all quite slow. For example:
FOR doc IN Documents
  COLLECT docId = doc.myId, doc2 = doc
  INSERT doc2 IN Documents2
or
FOR doc IN Documents
  LET existing = (
    FOR doc2 IN Documents2
      FILTER doc.myId == doc2.myId
      RETURN doc2
  )
  UPDATE existing WITH doc IN Documents2
or (this gives me a "violated unique constraint" error):
FOR doc IN Documents
  UPSERT {myId: doc.myId}
    INSERT doc
    UPDATE doc IN Documents2
TL;DR
It does not take that long to de-duplicate the records and write them to another collection (less than 60 seconds), at least on my desktop machine (Windows 10, Intel 6700K 4x4.0GHz, 32GB RAM, Evo 850 SSD).
Certain queries require proper indexing, however, or they will take forever. Indexes require some memory, but compared to the memory needed during query execution for grouping the records, it is negligible. If you're short of memory, performance will suffer because the operating system has to swap data between memory and mass storage. This is especially a problem with spinning disks, and not so much with fast flash storage devices.
Preparation
I generated 2.2 million records with 5–20 random attributes and 160 chars of gibberish per attribute. In addition, every record has an attribute myid. 1.87M records have a unique myid, 60k myids exist twice, and 70k three times (2M distinct myids in total). The collection size was reported as 4.83GB:
// timings per run; re-running the loop with the smaller ranges below
// over the existing data is what creates the duplicate myids
// 1..2000000: 300s
// 1..130000: 20s
// 1..70000: 10s
FOR i IN 1..2000000
  LET randomAttributes = MERGE(
    FOR j IN 1..FLOOR(RAND() * 15) + 5
      RETURN { [CONCAT("attr", j)]: RANDOM_TOKEN(160) }
  )
  INSERT MERGE(randomAttributes, {myid: i}) INTO test1
Memory consumption before starting ArangoDB was at 3.4GB, after starting 4.0GB, and around 8.8GB after loading the test1 source collection.
Baseline
Reading from test1 and inserting all documents (2.2m) into test2 took 20s on my system, with a memory peak of ~17.6GB:
FOR doc IN test1
INSERT doc INTO test2
Grouping by myid without writing took approx. 9s for me, with 9GB RAM peak during query:
LET result = (
  FOR doc IN test1
    COLLECT myid = doc.myid
    RETURN 1
)
RETURN LENGTH(result)
Failed grouping
I tried your COLLECT docId = doc.myId, doc2 = doc approach on a dataset with just 3 records and one duplicate myid. It showed that the query does not actually group/remove duplicates. I therefore tried to find alternative queries.
Grouping with INTO
To group duplicate myids together but retain the possibility to access the full documents, COLLECT ... INTO can be used. I simply picked the first document of every group to remove redundant myids. The query took about 40s for writing the 2m records with unique myid attribute to test2. I didn't measure memory consumption accurately, but I saw different memory peaks spanning 14GB to 21GB. Maybe truncating the test collections and re-running the queries increases the required memory because of some stale entries that get in the way somehow (compaction / key generation)?
FOR doc IN test1
COLLECT myid = doc.myid INTO groups
INSERT groups[0].doc INTO test2
Grouping with subquery
The following query showed a more stable memory consumption, peaking at 13.4GB:
FOR doc IN test1
  COLLECT myid = doc.myid
  LET doc2 = (
    FOR doc3 IN test1
      FILTER doc3.myid == myid
      LIMIT 1
      RETURN doc3
  )
  INSERT doc2[0] INTO test2
Note however that it required a hash index on myid in test1 to achieve a query execution time of ~38s. Otherwise the subquery will cause millions of collection scans and take ages.
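For reference, such an index could be created in arangosh like this (ArangoDB 3.x syntax):

db.test1.ensureIndex({ type: "hash", fields: ["myid"] });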
Grouping with INTO and KEEP
Instead of storing the whole documents that fell into a group, we can assign just the _id to a variable and KEEP it so that we can look up the document bodies using DOCUMENT():
FOR doc IN test1
LET d = doc._id
COLLECT myid = doc.myid INTO groups KEEP d
INSERT DOCUMENT(groups[0].d) INTO test2
Memory usage: 8.1GB after loading the source collection, 13.5GB peak during the query. It only took 30 seconds for the 2m records!
Grouping with INTO and projection
Instead of KEEP I also tried a projection out of curiosity:
FOR doc IN test1
COLLECT myid = doc.myid INTO groups = doc._id
INSERT DOCUMENT(groups[0]) INTO test2
RAM was at 8.3GB after loading test1, and the peak at 17.8GB (there were actually two heavy spikes during the query execution, both going over 17GB). It took 35s to complete for the 2m records.
Upsert
I tried something with UPSERT, but saw some strange results. It turned out to be an oversight in ArangoDB's upsert implementation. v3.0.2 contains a fix and I get correct results now:
FOR doc IN test1
UPSERT {myid: doc.myid}
INSERT doc
UPDATE {} IN test2
It took 40s to process with a (unique) hash index on myid in test2, with a RAM peak around 13.2GB.
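The unique variant for test2 might be created similarly in arangosh:

db.test2.ensureIndex({ type: "hash", fields: ["myid"], unique: true });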
Delete duplicates in-place
I first copied all documents from test1 to test2 (2.2m records), then I tried to REMOVE just the duplicates in test2:
FOR doc IN test2
  COLLECT myid = doc.myid INTO keys = doc._key
  LET allButFirst = SLICE(keys, 1) // or SHIFT(keys)
  FOR k IN allButFirst
    REMOVE k IN test2
Memory was at 8.2GB (with only test2 loaded) and went up to 13.5GB during the query. It took roughly 16 seconds to delete the duplicates (200k).
Verification
The following query groups by myid and counts how often every id occurs. Run against the target collection test2, the result should be {"1": 2000000}; otherwise there are still duplicates. I double-checked the query results above and everything checked out.
FOR doc IN test2
  COLLECT myid = doc.myid WITH COUNT INTO count
  COLLECT c = count WITH COUNT INTO cc
  RETURN {[c]: cc}
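For comparison, run against the source collection test1, the same query should reflect the duplicate distribution from the preparation step, i.e. something like [{"1": 1870000}, {"2": 60000}, {"3": 70000}].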
Conclusion
The performance appears to be reasonable with ArangoDB v3.0, although it may degrade if not enough RAM is available. The different queries completed roughly within the same time, but showed different RAM usage characteristics. For certain queries, indexes are necessary to avoid high computational complexity (here: full collection scans; 2,200,000,000,000 reads in the worst case?).
Can you try my presented solutions on your data and check what the performance is on your machine?
I have the following parse query which times out when the number of records is large.
var query = new Parse.Query("UserLocation");
query.withinMiles("geo", geo, MAX_LOCATION_RADIUS);
query.ascending("createdAt");
if (createdAt !== undefined) {
query.greaterThan("createdAt", createdAt);
}
query.limit(1000);
It runs OK if the UserLocation table is small, but the query times out from time to time when the table has ~100k records:
[2015-07-15 21:03:30.879] [ERROR] [default] - Error while querying for locations: [latitude=39.959064, longitude=-75.15846]: {"code":124,"message":"operation was slow and timed out"}
The UserLocation table has a latitude/longitude pair and a radius. Given a geo point (latitude, longitude), I'm trying to find the list of UserLocations whose circle (lat,long)+radius covers the given geo point. It doesn't seem like I can use the value from another column in the table for the distance query (something like query.withinMiles("geo", inputGeo, "radius"), where "geo" and "radius" are the column names for the GeoPoint and radius). There's also the limitation that "limit" combined with "skip" can only return a maximum of 10,000 records (1,000 records at a time, skipping up to 10 times). So I had to do an almost full table scan by using "createdAt" as a filter criterion and keep querying until the query doesn't return any more results.
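For reference, that keep-querying loop looks roughly like this (a sketch only; fetchAll is a placeholder name and error handling is omitted):

function fetchAll(geo, createdAt, results) {
  results = results || [];
  var query = new Parse.Query("UserLocation");
  query.withinMiles("geo", geo, MAX_LOCATION_RADIUS);
  query.ascending("createdAt");
  if (createdAt !== undefined) {
    query.greaterThan("createdAt", createdAt);
  }
  query.limit(1000);
  return query.find().then(function (batch) {
    if (batch.length === 0) {
      return results; // no more rows: the scan is complete
    }
    var last = batch[batch.length - 1];
    // use the last createdAt as the cursor for the next batch
    return fetchAll(geo, last.createdAt, results.concat(batch));
  });
}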
Is there any way I can improve the algorithm so that it doesn't time out on a large data set?
It seems like OrmLite's Select(predicate) function brings back everything matching the WHERE clause (across the network) and then applies .Take(x) on top of that.
I need a way to only bring back the TOP x so the results are faster and use less bandwidth.
Is there a way to limit TOP rows returned by OrmLite select (using a Linq Expression)?
Limit and Offset support is available using the Limit() expression, e.g.:
Take 10 Rows
var rows = db.Select<Table>(q => q.Where(x => x.Name != null).Limit(10));
Skip 5 Rows, Take 10
var rows = db.Select<Table>(q => q.Where(x => x.Name != null).Limit(5,10));
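Note: in the two-argument form the first argument is the number of rows to skip and the second is the number of rows to take, as the Skip 5 Rows, Take 10 example shows.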