ArangoDB AQL optimisation for a cluster configuration

With a data structure like
Departure -> Trip -> Driver
using an ArangoDB Spring Data derived query in the Trip repository like findByDriverIdNumberAndDepartureStartTimeBetween( String idNumber, String startTime, String endTime ) results in an AQL query like
WITH driver, departure
FOR e IN trip
FILTER
(FOR e1 IN 1..1 OUTBOUND e._id tripToDriver FILTER e1.idNumber == '999999-9999' RETURN 1)[0] == 1
AND
(FOR e1 IN 1..1 INBOUND e._id departureToTrip FILTER e1.startTime >= '2019-08-14T00:00:00' AND e1.startTime <= '2019-08-14T23:59:59' RETURN 1)[0] == 1
RETURN e
which performs fine (~1s) with a single-instance setup. After setting up a cluster with the Kubernetes ArangoDB Operator with default settings (3 nodes and coordinators), the query time increased tenfold, which is probably due to sharding and the multi-machine communication needed to fulfil the query.
This attempt to optimise the query gave better results, query time around 3 to 4 seconds:
WITH driver, departure
FOR doc IN trip
LET drivers = (FOR v IN 1..1 OUTBOUND doc tripToDriver RETURN v)
FILTER drivers[0].idNumber == '999999-9999'
LET departures = (FOR v in 1..1 INBOUND doc departureToTrip RETURN v)
FILTER departures[0].startTime >= '2019-08-14T00:00:00' AND departures[0].startTime <= '2019-08-14T23:59:59'
RETURN doc
But can I optimise the query further for the cluster setup, to come closer to the single instance query time of one second?

Related

Getting AutoQuery pagination to work with left join

In my AutoQuery request I have a left join specified so I can query on properties in the joined table.
public class ProductSearchRequest : QueryDb<Book>
, ILeftJoin<Book, BookAuthor>, ILeftJoin<BookAuthor, Author>
{}
If I use the standard AutoQuery approach like so:
var q = AutoQuery.CreateQuery(request, base.Request);
var results = AutoQuery.Execute(request, q);
and 100 results are requested, then often fewer than 100 are returned, because the Take() is based on the results with the left join.
To remedy this I am doing this instead:
var q = AutoQuery.CreateQuery(request, base.Request);
q.OrderByExpression = null; //throws error if orderby exists
var total = Db.Scalar<int>(q.Select(x => Sql.CountDistinct(x.Id))); //returns 0
var q1 = AutoQuery.CreateQuery(request, base.Request).GroupBy(x => x);
var results = Db.Select<Book>(q1);
return new QueryResponse<Book>
{
    Offset = q1.Offset.GetValueOrDefault(0),
    Total = total,
    Results = results
};
The group by appears to return the correct number of results, so paging works, but the Total returns 0.
I also tried:
var total2 = (int)Db.Count(q1);
But even though q1 has a GroupBy(), it returns the number of rows including the left join and not the count of the actual grouped query.
How can I get the true total of the query?
(Getting some official docs on how to do paging and totals with autoquery & left join would be very helpful as right now it's a bit confusing)
Your primary issue stems from trying to return a different total than that of the actual query AutoQuery executes. If you have multiple left joins, the total is the total number of results of the query it executes, not the number of rows in your source table.
So you're not looking for the "True total", rather you're looking to execute a different query to get a different total than the query that's executed, but still deriving from the original query as its basis. First consider using normal INNER JOINS (IJoin<>) instead of LEFT JOINS so it only returns results for related rows in joined tables which the total will reflect accordingly.
Your total query that returns 0 is likely returning no results, so I'd inspect it in a SQL profiler to see the query that's actually executed. You can also log OrmLite queries by enabling Debug logging and adding this to your AppHost:
OrmLiteUtils.PrintSql();
Also note that GroupBy() of the entire table is unusual, you would normally group by a single or multiple explicit selected columns, e.g:
.GroupBy(x => x.Id);
.GroupBy(x => new { x.Id, x.Name });
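The difference between the joined row count and the number of distinct source rows can be reproduced outside AutoQuery. The following is a minimal sketch using Python's built-in sqlite3 module rather than OrmLite; the book and book_author tables and their columns are hypothetical stand-ins for the Book and BookAuthor types above. It shows that COUNT(*) over a left join counts joined rows, COUNT(DISTINCT id) counts source rows, and grouping by the key makes LIMIT page over source rows.

```python
import sqlite3

# In-memory database with hypothetical tables standing in for Book / BookAuthor.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE book_author (book_id INTEGER, author_id INTEGER);
INSERT INTO book VALUES (1, 'A'), (2, 'B'), (3, 'C');
-- book 1 has two authors, book 2 has one, book 3 has none
INSERT INTO book_author VALUES (1, 10), (1, 11), (2, 10);
""")

join = "FROM book b LEFT JOIN book_author ba ON ba.book_id = b.id"

# Rows produced by the join: book 1 appears twice, books 2 and 3 once each.
joined_rows = conn.execute(f"SELECT COUNT(*) {join}").fetchone()[0]

# Number of distinct books behind those rows.
distinct_books = conn.execute(f"SELECT COUNT(DISTINCT b.id) {join}").fetchone()[0]

# Grouping by the key makes LIMIT/OFFSET page over books, not joined rows.
page = [row[0] for row in conn.execute(
    f"SELECT b.id {join} GROUP BY b.id ORDER BY b.id LIMIT 2 OFFSET 0")]

print(joined_rows, distinct_books, page)
```

Comparing the SQL that OrmLite prints for the total query against a hand-written COUNT(DISTINCT ...) like the one above may help locate why the total comes back as 0.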

how to get total number of ArangoDB AQL for paging

AQL supports basic paging with LIMIT offset, count. But I need the total number of results of the query in order to know the total number of pages. How can I get the total count of the query?
I know the LENGTH function can return the count of a collection, but it doesn't seem to suit a query like the following:
FOR v in 2 any 'Collection/id1' GRAPH 'graph-name' FILTER ... LIMIT 10 RETURN distinct v.
I want to get the total number, but I can't get it with RETURN distinct LENGTH(v).
For now I can implement this in an ungraceful way:
LET nodeList=(FOR v IN 2 any 'Collection/id1' GRAPH 'graph-name' FILTER ... RETURN distinct v)
FOR v IN 2 any 'Collection/id1' GRAPH 'graph-name' FILTER ... limit 10 RETURN distinct {'nodes': v, 'total':LENGTH(nodeList)}
Is there a better way to get this?
I found this answer in the ArangoDB Spring Data project.
AqlQueryOptions has a fullCount() option that makes the query return the total count in addition to the limited result.
You can then return a PageImpl which contains the query content and the pagination info.
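To make the paging arithmetic concrete, here is a small in-memory sketch in plain Python; it does not call ArangoDB. The paginate and total_pages helpers are hypothetical. The list slice plays the role of LIMIT offset, count, and full_count plays the role of the count of results before the limit was applied, which is what the fullCount option is meant to report.

```python
import math

def paginate(rows, offset, count):
    """Return one page plus the number of rows before the limit was applied."""
    page = rows[offset:offset + count]   # equivalent of LIMIT offset, count
    full_count = len(rows)               # stand-in for the fullCount statistic
    return page, full_count

def total_pages(full_count, page_size):
    """Number of pages needed to show full_count results."""
    return math.ceil(full_count / page_size)

# 23 hypothetical result nodes, requesting the third page of 10.
rows = [f"node{i}" for i in range(23)]
page, full_count = paginate(rows, offset=20, count=10)
pages = total_pages(full_count, page_size=10)

print(page, full_count, pages)
```

With the real option the query runs once and the total comes back alongside the page, so the traversal does not have to be repeated inside a LET as in the workaround above.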

Cassandra Modelling for Date Range

Cassandra Newbie here. Cassandra v 3.9.
I'm modelling the Travellers Flight Checkin Data.
My main query criterion is searching for travellers within a date range (a window of at most 7 days).
Here is what I've come up with, given my limited exposure to Cassandra.
create table IF NOT EXISTS travellers_checkin (checkinDay text, checkinTimestamp bigint, travellerName text, travellerPassportNo text, flightNumber text, from text, to text, bookingClass text, PRIMARY KEY (checkinDay, checkinTimestamp)) WITH CLUSTERING ORDER BY (checkinTimestamp DESC)
Per day, I'm expecting up to a million records, so each partition would hold up to a million rows.
Now my users want a search in which the date window is mandatory (at most one week). In this case, should I use an IN clause that spans multiple partitions? Is this the correct way, or should I think of re-modelling the data? Alternatively, I'm also wondering if issuing 7 queries (one per day) and merging the responses would be efficient.
Your data model seems good, but it would scale better if you could add another field to the partition key. You should also use separate queries with executeAsync.
If you are using an IN clause, this means that you're waiting on a single coordinator node to give you a response, it's keeping all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.
Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Instead of using an IN clause, use a separate query for each day and execute each one with executeAsync.
Java example:
PreparedStatement statement = session.prepare("SELECT * FROM travellers_checkin WHERE checkinDay = ? AND checkinTimestamp >= ? AND checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
// days, fromTimestamp and toTimestamp are placeholders for the requested window:
// one checkinDay value per day, plus the timestamp bounds
for (String day : days) {
    // Bind all three placeholders; each day is a separate single-partition query
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(day, fromTimestamp, toTimestamp));
    futures.add(resultSetFuture);
}
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    // You get the result set of each query; merge them here
}
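The one-query-per-day approach needs the list of partition keys for the requested window. Below is a minimal Python sketch of that step. It assumes checkinDay is stored as an ISO date string such as '2019-08-14', which the question does not state (the column is only declared as text), and the day_partitions helper is hypothetical. It also enforces the 7-day maximum mentioned in the question.

```python
from datetime import date, timedelta

MAX_WINDOW_DAYS = 7  # maximum search window stated in the question

def day_partitions(start, end):
    """Return one partition key per day in [start, end], inclusive.

    Assumes checkinDay values are ISO date strings (an assumption,
    not something the original schema specifies).
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    if days > MAX_WINDOW_DAYS:
        raise ValueError(f"window of {days} days exceeds {MAX_WINDOW_DAYS}")
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]

# Each key would be bound to its own asynchronous single-partition query.
days = day_partitions(date(2019, 8, 14), date(2019, 8, 16))
print(days)
```

Each returned key maps to one asynchronous query, so a full week results in at most seven single-partition reads whose results are merged on the client.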

fastest way of inserting into cassandra using python cassandra driver

I am inserting and updating multiple entries into a table in Cassandra using python Cassandra driver. Currently my code looks like:
from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect('db')
for a in list:
    if bool:
        # calculate b
        session.execute("UPDATE table SET col2 = %s WHERE col1 = %s", (b, a))
    else:
        # calculate b
        session.execute("INSERT INTO table(col1, col2) VALUES(%s, %s)", (a, b))
This approach to inserting and updating is quite slow because the number of entries in the list (all unique) is very large. Is there any faster way of doing this?
Generally for this scenario, you will see the best performance by increasing the number of concurrent writes to Cassandra.
You can do this with the DataStax Python Cassandra driver using execute_concurrent.
From your description, it is worth noting that for your case there is no difference between an UPDATE and an INSERT in Cassandra (i.e. you can simply run the INSERT statement from your else clause for all values of (a, b)).
You will want to create a prepared statement.
Rather than doing the inserts one at a time in your for-loop, consider pre-computing groups of (a,b) pairs as input for execute_concurrent; you can also write a generator or generator expression as input for execute_concurrent.
Example:
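from cassandra.concurrent import execute_concurrent_with_args
# calculate_b, my_list, my_session and my_prepared_statement are placeholders for your own code
parameters = ((a, calculate_b(a)) for a in my_list)
execute_concurrent_with_args(my_session, my_prepared_statement, parameters)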

Insert is 10 times faster than Update in Cassandra. Is it normal?

My Java application accessing Cassandra can insert 500 rows per second, but only update 50 rows per second (the updated rows did not actually exist beforehand).
Updating one hundred fields is as fast as updating one field.
I just use CQL statements in the Java application.
Is this situation normal? How can I improve my application?
public void InsertSome(List<Data> data) {
    String insertQuery = "INSERT INTO Data (E,D,A,S,C,......) values(?,?,?,?,?,.............); ";
    if (prepared == null)
        prepared = getSession().prepare(insertQuery);
    count += data.size();
    for (int i = 0; i < data.size(); i++) {
        List<Object> objs = getFiledValues(data.get(i));
        BoundStatement bs = prepared.bind(objs.toArray());
        getSession().execute(bs);
    }
}
public void UpdateOneField(Data data) {
    String updateQuery = "UPDATE Data set C=? where E=? and D=? and A=? and S=?; ";
    if (prepared == null)
        prepared = getSession().prepare(updateQuery);
    BoundStatement bs = prepared.bind(data.getC(), data.getE(),
            data.getD(), data.getA(), data.getS());
    getSession().execute(bs);
}
public void UpdateOne(Data data) {
    String updateQuery = "UPDATE Data set C=?,U=?,F........where E=? and D=? and A=? and S=? and D=?; ";
    if (prepared == null)
        prepared = getSession().prepare(updateQuery);
    ......
    BoundStatement bs = prepared.bind(objs2.toArray());
    getSession().execute(bs);
}
Schema:
Create Table Data (
E,
D,
A,
S,
D,
C,
U,
S,
...
PRIMARY KEY ((E, D), A, S)
) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 }
AND compaction = { 'class' : 'LeveledCompactionStrategy' };
Another scenario:
I used the same application to access another Cassandra cluster. The result was different: UPDATE was as fast as INSERT, but it only managed 5 INSERT/UPDATE rows per second. This Cassandra cluster is DataStax Enterprise running on GCE (I used the default DataStax Enterprise on Google Cloud Launcher).
So I think some configuration settings are probably the reason, but I don't know which ones.
Conceptually UPDATE and INSERT are the same so I would expect similar performance. UPDATE doesn't check to see if the data already exists (unless you are doing a lightweight transaction with IF EXISTS).
I noticed that each of your methods prepares a statement if the prepared field is null. Is it possible the statement is being re-prepared each time? That would add a round trip for every method invocation. I also noticed that InsertSome does multiple inserts per invocation, whereas UpdateOne / UpdateOneField execute one statement. So if the statement were prepared every time, that's a prepare per update, whereas it's only done once per list of inserts.
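One way to rule out repeated preparation is to keep one prepared statement per query string; note that the posted methods all share a single prepared field. The sketch below illustrates the idea in Python with a stand-in session. FakeSession and StatementCache are hypothetical names, not driver classes, and the two CQL strings are shortened versions of the ones in the question. It only demonstrates that each distinct query is prepared once, however often it is requested.

```python
class FakeSession:
    """Stand-in for a driver session; counts how often prepare() is called."""

    def __init__(self):
        self.prepare_calls = 0

    def prepare(self, query):
        self.prepare_calls += 1
        return ("prepared", query)  # placeholder for a real prepared statement


class StatementCache:
    """Prepares each distinct query string once and reuses the result."""

    def __init__(self, session):
        self._session = session
        self._statements = {}

    def get(self, query):
        if query not in self._statements:
            self._statements[query] = self._session.prepare(query)
        return self._statements[query]


# Shortened, hypothetical versions of the statements from the question.
INSERT = "INSERT INTO Data (E, D, A, S, C) VALUES (?, ?, ?, ?, ?)"
UPDATE = "UPDATE Data SET C = ? WHERE E = ? AND D = ? AND A = ? AND S = ?"

session = FakeSession()
cache = StatementCache(session)
for _ in range(100):
    cache.get(INSERT)
    cache.get(UPDATE)

print(session.prepare_calls)
```

In the Java code the same effect could be had with one field per statement, or a map keyed by the query string, so that the insert and update paths never reuse each other's statement.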
Cassandra uses log-structured merge trees for an on-disk format, meaning all writes are done sequentially (the database is the append-only log). That implies a lower write latency.
At the cluster level, Cassandra is also able to achieve greater write scalability by partitioning the key space such that each machine is only responsible for a portion of the keys. That implies a higher write throughput, as more writes can be done in parallel.
