Getting random records efficiently from ArangoDB

Here is a problem I would like an efficient AQL command for.
ArangoDB 3.4.0-RC.1, RocksDB storage engine
Collection gigs has 33,000 records
I have a primary index on _key
What I want to do is the following:
I want to pull 25 random records out of the collection.
I have looked at the following things I could do:
FOR g IN gigs
SORT RAND()
LIMIT 25
RETURN g
This takes on the order of 2.8 seconds on my machine.
RETURN NTH(gigs, 30)
Where 30 would be replaced by a random number.
It also takes 3 seconds.
I do not know if it is possible, but can I tell it to pick the nth record out of the primary index (the _key is all I really need)?
Any ideas on how I can get better results?

The challenge here is that you need to combine indexes with randomness, since the RocksDB engine doesn't have a fast way for any() to return a random document from the collection.
One good approach would be to combine a random value with a range comparison and a LIMIT statement:
FOR g IN gigs
  FILTER g.someIndexedNumber > @externallyGeneratedRandomNumber
  LIMIT 25
  RETURN g
You should use explain() to verify that your query is actually able to utilize an index.
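For completeness, here is a minimal sketch of that approach from the application side, assuming the ArangoDB Java driver, a database named mydb, and that someIndexedNumber is roughly uniformly distributed between 0 and a known maximum (the bound and names are assumptions, not part of the question); exact query(...) overloads differ slightly between driver versions:
import com.arangodb.ArangoCursor;
import com.arangodb.ArangoDB;
import com.arangodb.entity.BaseDocument;

import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class RandomGigs {
    // Assumed upper bound of gigs.someIndexedNumber; adjust to your data.
    private static final double MAX_INDEXED_NUMBER = 1_000_000.0;

    public static void main(String[] args) {
        ArangoDB arango = new ArangoDB.Builder().build();

        // Generate the random lower bound in the application ...
        double lowerBound = ThreadLocalRandom.current().nextDouble() * MAX_INDEXED_NUMBER;

        // ... and pass it to AQL as a bind parameter so the FILTER can use the index.
        String aql = "FOR g IN gigs "
                   + "FILTER g.someIndexedNumber > @lowerBound "
                   + "LIMIT 25 "
                   + "RETURN g";

        Map<String, Object> bindVars = Collections.singletonMap("lowerBound", lowerBound);
        ArangoCursor<BaseDocument> cursor = arango.db("mydb")
                .query(aql, bindVars, null, BaseDocument.class);

        cursor.forEachRemaining(doc -> System.out.println(doc.getKey()));
        arango.shutdown();
    }
}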

Related

Performance of hazelcast using executors on imap with million entries

We are applying a few predicates to an IMap containing just 100,000 objects in order to filter data. These predicates change per user. While doing a POC on my local machine (16 GB) with two nodes (each node shows 50,000) and 100,000 records, I am getting output in 30 seconds, which is way slower than querying the database directly.
Will increasing the number of nodes reduce the time? I even tried PagingPredicate, but it takes around 20 seconds per page.
IMap<String, Object> objectMap = hazelcastInstance.getMap("myMap");
MultiMap<String, Object> resultMap = hazelcastInstance.getMultiMap("myResultMap");
/* Option 1: passing a Hazelcast predicate to imap.values() */
objectMap.values(predicate).parallelStream().forEach(value -> resultMap.put(userId, value));
/* Option 2: applying a Java predicate to entrySet() OR localKeySet() */
objectMap.entrySet().parallelStream().filter(entryPredicate).forEach(entry -> resultMap.put(userId, entry.getValue()));
More nodes will help, but the improvement is difficult to quantify. It could be large, it could be small.
Part of the work in the code sample involves applying a predicate across 100,000 entries. If there is no index, the scan stage checks 50,000 entries per node when there are 2 nodes. Double up to 4 nodes and each has 25,000 entries to scan, so the scan time will roughly halve.
The scan time is only part of the query time; the overall result set also has to be assembled from the partial results returned by each node. So doubling the number of nodes might nearly halve the run time as a best case, or it might not be a significant improvement.
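If the predicate attributes can be indexed, adding an index usually buys more than adding nodes, because values(predicate) can then avoid the full scan entirely. A minimal sketch, assuming the Hazelcast 3.x Java API and that the predicates filter on an attribute named status (the attribute name is an assumption, not from the question):
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class IndexedMapSetup {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Object> objectMap = hz.getMap("myMap");

        // Ordered index on the attribute the predicates filter on, so that
        // values(predicate) can use the index instead of scanning every entry.
        // (Hazelcast 4.x+ spells this objectMap.addIndex(IndexType.SORTED, "status").)
        objectMap.addIndex("status", true);
    }
}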
Perhaps the bigger question here is: what are you trying to achieve?
objectMap.values(predicate) in the code sample retrieves the result set to a central point, which then has parallelStream() applied to merge the results in parallel into a MultiMap. So this looks more like an ETL than a query.
Use of executors, as per the title, and something like objectMap.localKeySet(predicate) might allow this to be parallelised better, as there would be no central point holding intermediate results.
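A minimal sketch of that executor-based variant, assuming the Hazelcast 3.x Java API; the map and MultiMap names come from the question, while the task and executor names are illustrative and the predicate is assumed to be serializable:
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IMap;
import com.hazelcast.core.MultiMap;
import com.hazelcast.query.Predicate;

import java.io.Serializable;
import java.util.concurrent.Callable;

/** Runs the filter on each member against only that member's local entries. */
public class LocalFilterTask implements Callable<Integer>, HazelcastInstanceAware, Serializable {

    private final String userId;
    private final Predicate<String, Object> predicate;
    private transient HazelcastInstance hz;

    public LocalFilterTask(String userId, Predicate<String, Object> predicate) {
        this.userId = userId;
        this.predicate = predicate;
    }

    @Override
    public void setHazelcastInstance(HazelcastInstance hz) {
        this.hz = hz;
    }

    @Override
    public Integer call() {
        IMap<String, Object> objectMap = hz.getMap("myMap");
        MultiMap<String, Object> resultMap = hz.getMultiMap("myResultMap");
        int matches = 0;
        // localKeySet(predicate) only touches entries owned by this member, so no
        // single node ever gathers the whole intermediate result set.
        for (String key : objectMap.localKeySet(predicate)) {
            resultMap.put(userId, objectMap.get(key));
            matches++;
        }
        return matches;
    }
}

// Submitting the task to every member in parallel would then look roughly like:
//   IExecutorService executor = hazelcastInstance.getExecutorService("filterExecutor");
//   Map<Member, Future<Integer>> perMember = executor.submitToAllMembers(new LocalFilterTask(userId, predicate));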

How to perform a sum on a lot of entitities in Google Datastore?

I am trying to understand if Google Datastore can fit my needs.
I have a lot of entities and I need to perform a sum on a certain property.
Basically, I would like to be able to do select sum(value1) from entity1 where [some filter], where entity1 is an entity that keeps track of some sort of data in its field/property value1.
I know these kinds of functions are not available in Datastore since it's not a relational database, so the most immediate solution would be to perform a select and then calculate the sum over the result set in the application. So I would have something like (using Node.js, but I don't care about the language):
query = client.query(kind='Task')
query.add_filter('done', '=', False)
results = list(query.fetch())
total = 0
for v in results:
    total += v['value']
The problem is that I have thousands of records, so the result may be something like 300,000 records.
What's the best way to do this without hitting a bottleneck?
You can store a total sum in a separate entity. No matter how frequently users request it, you can return it within milliseconds.
When an entity which is included in the total changes, you change the total entity. For example, if a property changes from 300 to 500, you increase the total by 200. This way your total is always accurate.
If updates are very frequent, you can implement these updates to the total as tasks (Task Queue API) in order to prevent race conditions. These tasks will be executed very quickly, so your users will get a very "fresh" total every time they ask.
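A minimal sketch of that delta update, using the google-cloud-datastore Java client inside a transaction (the TaskTotal kind, the done-tasks key name and the value property are illustrative placeholders, not from the question):
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.Transaction;

public class RunningTotal {

    /** Applies a delta (e.g. +200 when a value changes from 300 to 500) to the stored total. */
    public static void applyDelta(long delta) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        // Hypothetical kind/key for the single entity that holds the running total.
        Key totalKey = datastore.newKeyFactory().setKind("TaskTotal").newKey("done-tasks");

        Transaction txn = datastore.newTransaction();
        try {
            Entity current = txn.get(totalKey);
            long value = (current == null) ? 0L : current.getLong("value");
            txn.put(Entity.newBuilder(totalKey).set("value", value + delta).build());
            txn.commit(); // throws on contention, so the caller (or the queued task) can retry
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}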
Maybe the best way to maintain a count on Google Datastore is the official solution: sharded counters.

ArangoDB - Performance issue with AQL query

I'm using ArangoDB for a web application through StrongLoop.
I've got a performance problem when I run this query:
FOR result IN Collection SORT result.field ASC RETURN result
I added an index to speed up the query: a skiplist index on the sorted field.
My collection contains more than 1M records.
The application is hosted on n1-highmem-2 on Google Cloud.
Below some specs:
2 CPUs - Xeon E5 2.3Ghz
13 GB of RAM
10GB SSD
Unfortunately, my query takes a long time to finish.
What can I do?
Best regards,
Carmelo
Summarizing the discussion above:
If there is a skiplist index present on the field attribute, it could be used for the sort. However, if it is created sparse, it can't. This can be revalidated by running
db.Collection.getIndexes();
in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sorting will be required - which can be revalidated using Explain.
However, the query will still build a huge result in memory which will take time and consume RAM.
If a large result set is desired, LIMIT can be used to retrieve slices of the results in several chunks, which will cause less stress on the machine.
For example, first iteration:
FOR result IN Collection SORT result.field LIMIT 10000 RETURN result
Then process these first 10,000 documents offline, and note the result.field value of the last processed document.
Now run the query again, but now with an additional FILTER:
FOR result IN Collection
  FILTER result.field > @lastValue
  SORT result.field
  LIMIT 10000
  RETURN result
until there are no more documents. That should work fine if result.field is unique.
If result.field is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method will be at least an approximation.
Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already.
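For illustration, a minimal sketch of that chunked iteration from the application side, assuming the ArangoDB Java driver, a database named mydb and a unique, sortable result.field; the process() step is a placeholder and the exact query(...) overload varies between driver versions:
import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseDocument;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ChunkedExport {
    public static void main(String[] args) {
        ArangoDatabase db = new ArangoDB.Builder().build().db("mydb");

        String firstChunk = "FOR result IN Collection SORT result.field LIMIT 10000 RETURN result";
        String nextChunk  = "FOR result IN Collection FILTER result.field > @lastValue "
                          + "SORT result.field LIMIT 10000 RETURN result";

        Map<String, Object> noBindVars = Collections.emptyMap();
        List<BaseDocument> chunk = db.query(firstChunk, noBindVars, null, BaseDocument.class).asListRemaining();
        while (!chunk.isEmpty()) {
            chunk.forEach(ChunkedExport::process);   // offline processing of this slice
            Object lastValue = chunk.get(chunk.size() - 1).getAttribute("field");
            Map<String, Object> bindVars = Collections.singletonMap("lastValue", lastValue);
            chunk = db.query(nextChunk, bindVars, null, BaseDocument.class).asListRemaining();
        }
    }

    private static void process(BaseDocument doc) {
        // placeholder for whatever offline work is done per document
        System.out.println(doc.getKey());
    }
}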

Retrieving many rows using a TableBatchOperation is not supported?

Here is a piece of code that initializes a TableBatchOperation designed to retrieve two rows in a single batch:
TableBatchOperation batch = new TableBatchOperation();
batch.Add(TableOperation.Retrieve("somePartition", "rowKey1"));
batch.Add(TableOperation.Retrieve("somePartition", "rowKey2"));
//second call throws an ArgumentException:
//"A batch transaction with a retrieve operation cannot contain
//any other operation"
As mentioned, an exception is thrown, and it seems retrieving N rows in a single batch is not supported.
This is a big deal to me, as I need to retrieve about 50 rows per request. The issue is as much about performance as about cost. As you may know, Azure Table Storage pricing is based on the number of transactions, which means that 50 retrieve operations are 50 times more expensive than a single batch operation.
Have I missed something?
Side note
I'm using the new Azure Storage api 2.0.
I've noticed this question has never been raised on the web. This constraint might have been added recently?
edit
I found a related question here: Very Slow on Azure Table Storage Query on PartitionKey/RowKey List.
It seems using TableQuery with "or" on row keys will result in a full table scan.
There's really a serious issue here...
When designing your Partition Key (PK) and Row Key (RK) scheme in Azure Table Storage (ATS), your primary consideration should be how you're going to retrieve the data. As you've said, each query you run costs both money and, more importantly, time, so you need to get all of the data back in one efficient query. The efficient queries that you can run on ATS are of these types:
Exact PK and RK
Exact PK, RK range
PK Range
PK Range, RK range
Based on your comments I'm guessing you've got some data that is similar to this:
PK      RK   Data
Guid1   A    {Data:{...}, RelatedRows: [{PK:"Guid2", RK:"B"}, {PK:"Guid3", RK:"C"}]}
Guid2   B    {Data:{...}, RelatedRows: [{PK:"Guid1", RK:"A"}]}
Guid3   C    {Data:{...}, RelatedRows: [{PK:"Guid1", RK:"A"}]}
and you've retrieved the data at Guid1, and now you need to load Guid2 and Guid3. I'm also presuming that these rows have no common denominator like they're all for the same user. With this in mind I'd create an extra "index table" which could look like this:
PK        RK        Data
Guid1-A   Guid2-B   {Data:{....}}
Guid1-A   Guid3-C   {Data:{....}}
Guid2-B   Guid1-A   {Data:{....}}
Guid3-C   Guid1-A   {Data:{....}}
Here the PK is the combined PK and RK of the parent, and the RK is the combined PK and RK of the child row. You can then run a query that says return all rows with PK="Guid1-A", and you will get all related data with just one call (or two calls overall). The biggest overhead this creates is on your writes: when you write a row, you also have to write rows for each of the related rows and make sure that the data is kept up to date (this may not be an issue for you if this is a write-once kind of scenario).
If any of my assumptions are wrong or if you have some example data I can update this answer with more relevant examples.
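For illustration, a sketch of that single index-table lookup using the legacy Azure Storage SDK for Java (the .NET TableQuery API works the same way; the connection string, table name and key values here are placeholders):
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.table.CloudTable;
import com.microsoft.azure.storage.table.DynamicTableEntity;
import com.microsoft.azure.storage.table.TableQuery;
import com.microsoft.azure.storage.table.TableQuery.QueryComparisons;

public class IndexTableLookup {
    public static void main(String[] args) throws Exception {
        CloudStorageAccount account = CloudStorageAccount.parse(System.getenv("STORAGE_CONNECTION_STRING"));
        CloudTable indexTable = account.createCloudTableClient().getTableReference("relatedRowsIndex");

        // One efficient call: every row whose PK is the parent's combined "PK-RK".
        String filter = TableQuery.generateFilterCondition("PartitionKey", QueryComparisons.EQUAL, "Guid1-A");
        for (DynamicTableEntity related : indexTable.execute(TableQuery.from(DynamicTableEntity.class).where(filter))) {
            System.out.println(related.getPartitionKey() + " -> " + related.getRowKey());
        }
    }
}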
Try something like this:
TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>()
.Where(TableQuery.CombineFilters(
TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "partition1"),
TableOperators.And,
TableQuery.CombineFilters(
TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row1"),
TableOperators.Or,
TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row2"))));
I know that this is an old question, but as Azure STILL does not support secondary indexes, it seems it will be relevant for some time.
I was hitting the same type of problem. In my scenario, I needed to look up hundreds of items within the same partition, where there are millions of rows (imagine a GUID as the row key). I tested a couple of options for looking up 10,000 rows:
(PK && RK)
(PK && RK1) || (PK && RK2) || ...
PK && (RK1 || RK2 || ... )
I was using the Async API, with a maximum 10 degrees of parallelism (max 10 outstanding requests). I also tested a couple of different batch sizes (10 rows, 50, 100).
Test                           Batch Size   API calls   Elapsed (sec)
(PK && RK)                     1            10000       95.76
(PK && RK1) || (PK && RK2)     10           1000        25.94
(PK && RK1) || (PK && RK2)     50           200         18.35
(PK && RK1) || (PK && RK2)     100          100         17.38
PK && (RK1 || RK2 || …)        10           1000        24.55
PK && (RK1 || RK2 || …)        50           200         14.90
PK && (RK1 || RK2 || …)        100          100         13.43
NB: These are all within the same partition - just multiple rowkeys.
I would have been happy to just reduce the number of API calls. But as an added benefit, the elapsed time is also significantly less, saving on compute costs (at least on my end!).
Not too surprisingly, the batches of 100 rows delivered the best elapsed performance. There are obviously other performance considerations, especially network usage (#1 hardly uses the network at all, for example, whereas the others push it much harder).
EDIT
Be careful when querying for many row keys. There is (of course) a URL length limitation on the query. If you exceed the length, the query will still succeed because the service cannot tell that the URL was truncated. In our case, we limited the combined query length to about 2500 characters (URL encoded!).
Batch "Get" operations are not supported by Azure Table Storage. Supported operations are: Add, Delete, Update, and Merge. You would need to execute queries as separate requests. For faster processing, you may want to execute these queries in parallel.
Your best bet is to create a LINQ/OData select query... that will fetch what you're looking for.
For better performance you should make one query per partition and run those queries simultaneously.
I haven't tested this personally, but I think it would work.
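A rough sketch of that fan-out; queryPartition() is a hypothetical stand-in for whatever single-partition query your table SDK provides (none of these names come from the original answers):
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelPartitionFetch {

    // Hypothetical: runs one efficient single-partition query and returns its rows.
    static List<String> queryPartition(String partitionKey, List<String> rowKeys) {
        throw new UnsupportedOperationException("replace with your table SDK call");
    }

    static List<String> fetchAll(Map<String, List<String>> rowKeysByPartition) {
        // One async query per partition, all running concurrently.
        List<CompletableFuture<List<String>>> futures = rowKeysByPartition.entrySet().stream()
                .map(e -> CompletableFuture.supplyAsync(() -> queryPartition(e.getKey(), e.getValue())))
                .collect(Collectors.toList());

        // Merge the partial results client-side once every partition query has completed.
        return futures.stream()
                .map(CompletableFuture::join)
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}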
How many entities do you have per partition? With one retrieve operation you can pull back up to 1000 records per query. Then you could do your Row Key filtering on the in memory set and only pay for 1 operation.
Another option is to do a Row Key range query to retrieve part of a partition in one operation. Essentially you specify an upper and lower bound for the row keys to return, rather than an entire partition.
Okay, so for a batch retrieve operation the best-case scenario is a table query. The less optimal situation would require parallel retrieve operations.
Depending on your PK, RK design, you can figure out, based on a list of (PK, RK) pairs, the smallest/most efficient set of retrieve/query operations that you need to perform. You then fetch all of these in parallel and sort out the exact answer client-side.
IMAO, it was a design miss by Microsoft to add the Retrieve method to the TableBatchOperation class because it conveys semantics not supported by the table storage API.
Right now, I'm not in the mood to write something super efficient, so I'm just gonna leave this super simple solution here.
var retrieveTasks = new List<Task<TableResult>>();
foreach (var item in list)
{
retrieveTasks.Add(table.ExecuteAsync(TableOperation.Retrieve(item.pk, item.rk)));
}
var retrieveResults = new List<TableResult>();
foreach (var retrieveTask in retrieveTasks)
{
retrieveResults.Add(await retrieveTask);
}
This asynchronous block of code will fetch the entities in list in parallel and store the results in retrieveResults, preserving the order. If you have contiguous ranges of entities that you need to fetch, you can improve this by using a range query.
There is a sweet spot (that you'll have to find by testing) where it's probably faster/cheaper to query more entities than you need for a specific batch retrieve and then discard the results you don't need.
If you have a small partition you might benefit from a query like so:
where pk=partition1 and (rk=rk1 or rk=rk2 or rk=rk3)
If the lexicographic (i.e. sort-order) distance between your keys is great, you might want to fetch them in parallel. For example, if you store the alphabet in table storage, fetching a and z, which are far apart, is best done with parallel retrieve operations, while fetching a, b and c, which are close together, is best done with a query. Fetching a, b, c and z would benefit from a hybrid approach.
If you know all this up front, you can compute the best thing to do given a set of PKs and RKs. The more you know about how the underlying data is sorted, the better your results will be. I'd advise against a one-size-fits-all approach here; instead, try to apply what you learn from these different query patterns to solve your problem.

Fastest way of querying for latest items in a Azure table?

I have an Azure table where customers post messages, and there may be millions of messages in a single table. I want to find the fastest way of getting the messages posted within the last 10 minutes (which is how often I refresh the web page). Since only the partition key is indexed, I have played with the idea of using the date & time the message was posted as the partition key, for example a string in ISO 8601 date format like "2009-06-15T13:45:30.0900000".
Example pseudo code:
var message = "Hello world!";
var messagePartitionKey = DateTime.Now.ToString("o");
var messageEntity = new MessageEntity(messagePartitionKey, message);
dataSource.Insert(messageEntity);
, and then query for the messages posted within the last 10 minutes like this (untested pseudo code again):
// Get the date and time 10 minutes ago
var tenMinutesAgo = DateTime.Now.Subtract(new TimeSpan(0, 10, 0)).ToString("o");
// Query for the latest messages
var latestMessages = (from t in
context.Messages
where t.PartitionKey.CompareTo(tenMinutesAgo) >= 0
select t
)
But will this be taken well by the index? Or will it cause a full table scan? Anyone have a better idea of doing this? I know there is a timestamp on each table item, but it is not indexed so it will be too slow for my purpose.
I think you've got the right basic idea. The query you've designed should be about as efficient as you could hope for. But there are some improvements I could offer.
Rather than using DateTime.Now, use DateTime.UtcNow. From what I understand, instances are set to use UTC time as their base anyway, but this just makes sure you're comparing apples with apples, and you can reliably convert the time back into whatever timezone you want when displaying it.
Rather than storing the time as .ToString("o"), turn the time into ticks and store that; you'll end up with fewer formatting problems (sometimes you'll get the timezone specification at the end, sometimes not). Also, if you always want to see these messages sorted from most recent to oldest, you can subtract the number of ticks from the max number of ticks, e.g.
var messagePartitionKey = (DateTime.MaxValue.Ticks - _contactDate.Ticks).ToString("d19");
It would also be a good idea to specify a row key. While it is highly unlikely that two messages will be posted at exactly the same time, it's not impossible. If you don't have an obvious row key, then just set it to a Guid.
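For illustration, here is a sketch of the reverse-tick key and the matching "last 10 minutes" bound, written as an assumed Java port of the idea above with the .NET tick constants spelled out (the "d19" formatting is mirrored with %019d):
public class ReverseTickKeys {
    // .NET ticks are 100 ns intervals since 0001-01-01.
    private static final long TICKS_AT_UNIX_EPOCH = 621_355_968_000_000_000L; // new DateTime(1970,1,1).Ticks
    private static final long MAX_TICKS = 3_155_378_975_999_999_999L;         // DateTime.MaxValue.Ticks
    private static final long TICKS_PER_MILLISECOND = 10_000L;

    static String reverseTickKey(long epochMillis) {
        long ticks = epochMillis * TICKS_PER_MILLISECOND + TICKS_AT_UNIX_EPOCH;
        // Newer timestamps produce smaller values, so the latest messages sort first.
        return String.format("%019d", MAX_TICKS - ticks);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        String messagePartitionKey = reverseTickKey(now);

        // "Posted within the last 10 minutes" becomes an upper bound on the reversed key:
        String upperBound = reverseTickKey(now - 10 * 60 * 1000);
        System.out.println(messagePartitionKey + " <= PartitionKey <= " + upperBound);
        // i.e. query with PartitionKey le upperBound (and optionally ge messagePartitionKey).
    }
}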
The primary key for a table is the combination of PartitionKey and RowKey (which forms a clustered index).
In your case, just go for the RowKey instead of the PartitionKey (and provide a constant value for the PartitionKey).
You can also follow the Diagnostics approach, e.g. create a new partition key every ten minutes. But that approach is mainly for requirements like archiving/purging, etc.
I would suggest doing something similar to what the Diagnostics API does with WADPerformanceCountersTable. There, the PartitionKey groups a number of timestamps into a single value, i.e. it rounds all timestamps to the nearest few minutes (say, the nearest 5 minutes). This way you do not end up with an unbounded number of partition keys, yet you are still able to do range queries on them.
So, for example, you can have a PartitionKey that maps each timestamp rounded to 00:00, 00:05, 00:10, 00:15, etc., and then converted to ticks.
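A sketch of that bucketing, reusing the tick constants from the earlier snippet (the 5-minute window size and the key format are illustrative assumptions):
public class BucketedPartitionKey {
    private static final long TICKS_AT_UNIX_EPOCH = 621_355_968_000_000_000L;
    private static final long TICKS_PER_MILLISECOND = 10_000L;
    private static final long FIVE_MINUTES_MS = 5L * 60 * 1000;

    static String partitionKeyFor(long epochMillis) {
        // Round down to the start of the 5-minute window the timestamp falls into.
        long bucketStartMs = (epochMillis / FIVE_MINUTES_MS) * FIVE_MINUTES_MS;
        long bucketTicks = bucketStartMs * TICKS_PER_MILLISECOND + TICKS_AT_UNIX_EPOCH;
        return String.format("%019d", bucketTicks);
    }

    public static void main(String[] args) {
        // "Last 10 minutes" now touches at most three partitions, which can be
        // queried with a small PartitionKey range (or three exact matches).
        long now = System.currentTimeMillis();
        System.out.println(partitionKeyFor(now));
        System.out.println(partitionKeyFor(now - 10 * 60 * 1000));
    }
}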
From my understanding, an exact match ("=") on the partition key will be much faster than a range comparison ("<" or ">").
Also, put more effort into finding a unique combination of partition key and row key for your condition if you can.
Also, keep the number of distinct partition key values reasonable to avoid creating too many partitions.
