I have a query that compares two collections and finds the "missing" documents from one side. Both collections (existing and temp) contain about 250K documents.
FOR existing IN ExistingCollection
  LET matches = (
    FOR temp IN TempCollection
      FILTER temp._key == existing._key
      RETURN true
  )
  FILTER LENGTH(matches) == 0
  RETURN existing
When this runs in a single-server environment (DB and Foxx are on the same server/container), this runs like lightning in under 0.5 seconds.
However, when I run this in a cluster (single DB, single Coordinator), even when the DB and Coord are on the same physical host (different containers), I have to add a LIMIT 1000 after the initial FOR existing ... to keep it from timing out! Still, this limited result returns in almost 7 seconds!
Looking at the Execution Plan, I see that there are several REMOTE and GATHER statements after the LET matches ... SubqueryNode. From what I can gather, the problem stems from the separation of the data storage and memory structure used to filter this data.
My question: can this type of operation be done efficiently on a cluster?
I need to detect obsolete (to be deleted) documents, but this is obviously not a feasible solution.
Your query executes one subquery for each document in the existing collection. Each subquery will require many HTTP roundtrips for setup, the actual querying and shutdown.
You can avoid subqueries with the following query. It loads all document _key values into RAM, but that should be no problem with your rather small collections.
LET ExistingCollection = (FOR existing IN c2 RETURN existing._key)
LET TempCollection = (FOR temp IN c1 RETURN temp._key)
RETURN MINUS(ExistingCollection, TempCollection)
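To make the idea behind the MINUS approach concrete, here is a minimal Python sketch of the same set difference on plain lists of keys. It is only an illustration: the function name missing_keys and the sample data are made up, and unlike MINUS this version preserves the input order.

```python
def missing_keys(existing_keys, temp_keys):
    """Return keys present in existing_keys but absent from temp_keys."""
    temp = set(temp_keys)  # one pass to build a hash set of the temp keys
    # one pass over the existing keys, constant-time membership test each
    return [k for k in existing_keys if k not in temp]

existing = ["a", "b", "c", "d"]
temp = ["b", "d", "e"]
obsolete = missing_keys(existing, temp)
```

The point is that each collection is read once, instead of running one lookup per existing document.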
I was wondering about performance differences between dedicated views in CouchDb/PouchDb VS simply retrieving allDocs plus filtering them with Array.prototype.filter later on.
Let's say we want to get 5,000 todo docs stored in a database.
// Method 1: get all tasks with a dedicated view "todos"
// in CouchDB
function (doc) {
  if (doc.type == "todo") {
    emit(doc._id);
  }
}
// on Frontend
var tasks = (await db.query('myDesignDoc/todos', {include_docs: true})).rows;
// Method 2: get allDocs, and then filter via Array.filter
var tasks = (await db.allDocs({include_docs: true})).rows;
tasks = tasks.filter(task => {return task.doc.type == 'todo'});
What's better? What are the pros and cons of each of the 2 methods?
The use of the view will scale better. But which is "faster" will depend on so many factors that you will need to benchmark for your particular case on your hardware, network and data.
For the "all_docs" case, you will effectively be transferring the entire database to the client, so network speed will be a large factor here as the database grows. If you do this as you have, by putting all the documents in an array and then filtering, you're going to hit memory usage limits at some point - you really need to process the results as a stream. This approach is O(N), where N is the number of documents in the database.
For the "view" case, a B-Tree index is used to find the range of matching documents. Only the matching documents are sent to the client, so the savings in network time and memory depend on the proportion of matching documents from all documents. Time complexity is O(log(N) + M) where N is the total number of documents and M is the number of matching documents.
If N is large and M is small then this approach should be favoured. As M approaches N, both approaches are pretty much the same. If M and N are unknown or highly variable, use a view.
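The difference between the two access patterns can be sketched in Python with a sorted list standing in for the view's B-tree. This is an assumption-laden toy (in-memory data, hypothetical function names), not how CouchDB is implemented, but it shows where the O(N) and O(log(N) + M) costs come from.

```python
from bisect import bisect_left, bisect_right

def scan_all(docs, wanted_type):
    # O(N): every document is inspected (and, over a network, transferred)
    return [d for d in docs if d["type"] == wanted_type]

def build_index(docs):
    # Sorted (type, position) pairs stand in for the view's B-tree
    return sorted((d["type"], i) for i, d in enumerate(docs))

def query_index(docs, index, wanted_type):
    # O(log N) to locate the range, then O(M) to read the M matches
    lo = bisect_left(index, (wanted_type, -1))
    hi = bisect_right(index, (wanted_type, len(docs)))
    return [docs[i] for _, i in index[lo:hi]]

docs = [{"_id": str(i), "type": "todo" if i % 3 == 0 else "note"} for i in range(10)]
index = build_index(docs)
todos = query_index(docs, index, "todo")
```

Both functions return the same documents; only the amount of data touched differs.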
You should consider one other thing - do you need the entire document returned? If you need only a few fields from large documents then views can return just those fields, reducing network and memory usage further.
Mango queries may also be of interest instead of views for this sort of query. You can create an index over the "type" field if the dataset size warrants it, but it's not mandatory.
Personally, I'd use a Mango query and add the index if/when necessary.
I'm using ArangoDB for a Web Application through Strongloop.
I've got some performance problem when I run this query:
FOR result IN Collection SORT result.field ASC RETURN result
I added an index to speed up the query, namely a skiplist index on the sorted field.
My collection contains more than 1M records.
The application is hosted on n1-highmem-2 on Google Cloud.
Below some specs:
2 CPUs - Xeon E5 2.3Ghz
13 GB of RAM
10GB SSD
Unfortunately, my query takes a long time to finish.
What can I do?
Best regards,
Carmelo
Summarizing the discussion above:
If there is a skiplist index present on the field attribute, it could be used for the sort. However, if it was created as a sparse index, it can't. This can be verified by running
db.Collection.getIndexes();
in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sorting will be required, which can be verified using Explain.
However, the query will still build a huge result in memory which will take time and consume RAM.
If a large result set is desired, LIMIT can be used to retrieve slices of the results in several chunks, which will cause less stress on the machine.
For example, first iteration:
FOR result IN Collection SORT result.field LIMIT 10000 RETURN result
Then process these first 10,000 documents offline, and note the value of result.field in the last processed document.
Now run the query again, this time with an additional FILTER:
FOR result IN Collection
FILTER result.field > @lastValue SORT result.field LIMIT 10000 RETURN result
Repeat until there are no more documents. That should work fine if result.field is unique.
If result.field is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method will be only an approximation, because documents that share the boundary value can be skipped.
Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already.
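The chunking loop can be sketched in Python over an in-memory list. Note one deliberate deviation from the queries above: this sketch orders and filters by the pair (field, _key) rather than field alone, which is an assumption added here to show how a unique tie-breaker avoids skipping documents with equal field values. The names fetch_chunk and iterate_in_chunks are hypothetical.

```python
def fetch_chunk(rows, last_seen, limit):
    """Stand-in for one chunked query: rows ordered by (field, _key),
    restricted to those strictly after last_seen."""
    ordered = sorted(rows, key=lambda r: (r["field"], r["_key"]))
    if last_seen is not None:
        ordered = [r for r in ordered if (r["field"], r["_key"]) > last_seen]
    return ordered[:limit]

def iterate_in_chunks(rows, limit):
    last_seen = None
    while True:
        chunk = fetch_chunk(rows, last_seen, limit)
        if not chunk:
            break  # no more documents
        yield chunk
        last = chunk[-1]
        last_seen = (last["field"], last["_key"])  # remember the boundary

rows = [{"_key": "k%d" % i, "field": v} for i, v in enumerate([3, 1, 2, 2, 1])]
chunks = [[r["_key"] for r in c] for c in iterate_in_chunks(rows, 2)]
```

With the tie-breaker, documents k2 and k3 (both with field 2) land in the same or adjacent chunks and neither is lost.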
I am working on an ASP.NET Web Forms project and I use jQuery DataTables to visualize data fetched from SQL Server. I need to pass the results for the current page and the total number of results, for which so far I have this code:
var queryResult = query.Select(p => new[] { p.Id.ToString(),
p.Name,
p.Weight.ToString(),
p.Address })
.Skip(iDisplayStart)
.Take(iDisplayLength).ToArray();
and the value that I get when I return the result to the view like this:
iTotalRecords = queryResult.Count(),
is the number of records that the user has chosen to see per page. Logical, but I hadn't thought about it while building my method chain. Now I am thinking about the optimal way to implement this. Since it's likely to be used with relatively large amounts of data (10,000 rows, maybe more), I would like to leave as much work as I can to SQL Server. However, I found several questions asked about this, and the impression that I get is that I have to make two queries to the database, or manipulate the total result in my code. But I think this won't be efficient when you have to work with many records.
So what can I do here to get best performance?
In regards to what you’re looking for I don’t think there is a simple answer.
I believe the only way you can currently do this is by running more than one query like you have already suggested, whether this would be encapsulated inside a stored procedure (SPROC) call or generated by EF.
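As a rough illustration of the two-query pattern (one query for the page, one for the total), here is a Python sketch using the standard library's sqlite3 module. SQLite stands in for SQL Server purely so the example is self-contained; the table name my_table, its columns and the function get_paged_results are assumptions made for this sketch.

```python
import sqlite3

def get_paged_results(conn, skip, take):
    """Return (page_rows, total_count) using two queries on one connection."""
    page = conn.execute(
        "SELECT id, name, weight, address FROM my_table "
        "ORDER BY id LIMIT ? OFFSET ?",
        (take, skip),
    ).fetchall()
    # The total is counted by the database, not by materialising all rows
    (total,) = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()
    return page, total

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT, weight REAL, address TEXT)"
)
conn.executemany(
    "INSERT INTO my_table (id, name, weight, address) VALUES (?, ?, ?, ?)",
    [(i, f"name{i}", float(i), f"addr{i}") for i in range(1, 26)],
)
page, total = get_paged_results(conn, skip=10, take=10)
```

The total (here 25) is what would be reported as iTotalRecords, while the page holds only the ten rows to display.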
However, I believe you can make optimisations to make your query run quicker.
First of all, every query execution MAY result in the query being recached, as you are chaining your methods together. This means that the actual query being executed will need to be recompiled and cached by SQL Server (if that is your chosen technology) before being executed. This normally only takes a few milliseconds, but if the query being executed only takes a few milliseconds then this is relatively expensive.
Entity Framework will translate this LINQ query and execute it using derived tables. With a small result set of approx. 1k records to be paged, your current solution may be best suited. This would also depend on how complex your SQL filtering is as generated by your method chaining.
If your result set to be paged grows up towards 15k, I would suggest writing a SPROC to get the best performance and scalability which would insert the records into a temp table and run two queries against it, firstly to get the paged records, and secondly to get the total results.
alter proc dbo.usp_GetPagedResults
    @Skip int = 10,
    @Take int = 10
as
begin
    select
        row_number() over (order by id) [RowNumber],
        t.Name,
        t.Weight,
        t.Address
    into
        #results
    from
        dbo.MyTable t
    declare @To int = @Skip+@Take-1
    select * from #results where RowNumber between @Skip and @To
    select max(RowNumber) from #results
end
go
You can use the EF to map a SPROC call to entity types or create a new custom type containing the results and the number of results.
Stored Procedures with Multiple Results
I found that the cost of running the above SPROC was approximately a third of the cost of running the query which EF generated to get the same result, based upon a result set size of 15k records. It was, however, three times slower than the EF query for a result set of only 1k records, due to the temp table creation.
Encapsulating this logic inside a SPROC allows the query to be refactored and optimised as your result set grows without having to change any application based code.
The suggested solution doesn’t use the derived table queries as created by the Entity Framework inside a SPROC as I found there was a marginal performance difference between running the SPROC and the query directly.
I'm trying to optimize a fuzzy search query. It's fairly large, as it searches most properties in the database for a single word. I have some questions about some things I've been doing to improve the search speed.
Test Info: I added about 10,000 nodes and I'm searching on about 40 properties. My query times are about 3-30 seconds depending on the criteria.
MATCH (n) WHERE
(n:Type__Exercise and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' )) or
(n:Type__Fault and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' ))
with n LIMIT 100
return count(n)
This is basically my query, but with a lot more OR clauses. I also use parameters when sending the query to the execution engine. I realize it's very expensive to use the regular expressions on every single property. I'm hoping I can get good enough performance without doing exact matches up to a certain amount of data (This application will only have 1-10 users querying at a time). This is a possible interim effort we're investigating until the new label indexes support full text queries.
First of all, how do I tell if my query was cached? I make a call to my server plug-in via the curl command and the times I'm seeing are almost identical each time I pass the same criteria (the time is for the entire curl command to finish). I'm using a single instance of the execution engine that was created by using the GraphDatabaseService that is passed in to the plug-in via a @Source parameter. How much of an improvement should I see if a query is cached?
Is there a query size where Neo4j doesn't bother caching the query?
How effective is the LIMIT clause at speeding up queries? I added one, but didn't see a great performance boost (for queries that do have results). Does the execution engine stop once it finds enough nodes?
My queries are read-only, do I still have to wrap my calls with a transaction?
I could split up my query so I only search one property at a time or say 4 properties at a time. Then I could run the whole set of queries via the execution engine. It seems like this would be better for caching, but is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worth while?
Is there a way to use parameters when using PROFILE in the Neo4j console? I've been trying to use this to see how many db hits I'm getting on my queries.
How effective is the Neo4j browser for comparing times it takes to execute a query?
Does caching happen here?
If I want to warm up Neo4j data for queries - can I run the exact queries I'm expecting? Does the query need to return data, or will a count type query warm the cache? As an alternative, should I just iterate over all the nodes? I'd rather just pull in the nodes that are likely to be searched vs all of them.
I think for the time being you'd be better served using the fulltext-legacy indexing facilities, I recently wrote a blog post about it: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/
If you don't want to do that:
I would probably also rewrite your query to turn it around:
MATCH (n)
WHERE
(n:Type__Exercise OR n:Type__Fault) AND
(n.description =~ '(?i).*criteria.*' OR n.name =~ '(?i).*criteria.*' )
You can probably also benefit a bit more from having a secondary "search" field that is just the concatenation of your description and name fields. You probably also want to tighten your regexp, for example by adding a word boundary \b on the left and right.
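The two suggestions (a concatenated search field and word boundaries) can be sketched in Python with the standard re module. This only illustrates the matching logic, not Cypher or Neo4j behaviour; the helper names build_search_field and matches are made up for the example.

```python
import re

def build_search_field(node):
    # Concatenate the searchable properties once, e.g. when the node is written
    return " ".join(str(node.get(p, "")) for p in ("name", "description"))

def matches(search_field, criteria):
    # \b anchors the term to word boundaries; re.escape guards special characters
    pattern = re.compile(r"\b" + re.escape(criteria) + r"\b", re.IGNORECASE)
    return pattern.search(search_field) is not None

node = {"name": "Bench Press", "description": "Chest exercise"}
search = build_search_field(node)
```

With the boundaries in place, "press" matches but the fragment "pres" does not, and only one property has to be tested per node instead of two.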
Regarding your questions:
First of all, how do I tell if my query was cached?
Your query will be cached if you use parameters (for the regexps). There is a configurable query-cache size (defaulting to 100 queries).
Is there a query size where Neo4j doesn't bother caching the query?
Neo4j currently caches all queries that come in regardless of size
My queries are read-only, do I still have to wrap my calls with a transaction?
Cypher will create its own transaction. In general read transactions are mandatory. For cypher you need outer transactions if you want multiple queries to participate in the same tx-scope.
is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worth while?
It depends. Smaller queries are executed more quickly (if they touch less of the total dataset), but you have to combine their results in the client.
If they touch the same nodes you do double work.
For bigger queries you have to watch out when you span cross products or exponential path explosions.
Regarding running smaller queries with many threads
Good question. It should be faster, but there are currently some bottlenecks that we're about to remove. Just try it out.
Is there a way to use parameters when using PROFILE in the Neo4j console?
You can use the shell variables for that, with export name=value and list them with env
e.g.
export name=Lisa
profile match (n:User {name:{name}}) return n;
How effective is the Neo4j browser for comparing times it takes to execute a query?
The browser measures the complete roundtrip with potentially more data loading, so its timing is not very accurate.
Warmup
The exact queries would make sense
You don't have to return data, it is enough to return count(*) but you should access the properties you want to access to make sure they are loaded.
Here is a piece of code that initializes a TableBatchOperation designed to retrieve two rows in a single batch:
TableBatchOperation batch = new TableBatchOperation();
batch.Add(TableOperation.Retrieve("somePartition", "rowKey1"));
batch.Add(TableOperation.Retrieve("somePartition", "rowKey2"));
//second call throws an ArgumentException:
//"A batch transaction with a retrieve operation cannot contain
//any other operation"
As mentioned, an exception is thrown, and it seems that retrieving N rows in a single batch is not supported.
This is a big deal to me, as I need to retrieve about 50 rows per request. This issue is as much performance wise as cost wise. As you may know, Azure Table Storage pricing is based on the amount of transactions, which means that 50 retrieve operations is 50 times more expensive than a single batch operation.
Have I missed something?
Side note
I'm using the new Azure Storage api 2.0.
I've noticed this question has never been raised on the web. This constraint might have been added recently?
edit
I found a related question here: Very Slow on Azure Table Storage Query on PartitionKey/RowKey List.
It seems that using TableQuery with "or" on row keys results in a full table scan.
There's really a serious issue here...
When designing your Partition Key (PK) and Row Key (RK) scheme in Azure Table Storage (ATS), your primary consideration should be how you're going to retrieve the data. As you've said, each query you run costs money but, more importantly, time, so you need to get all of the data back in one efficient query. The efficient queries that you can run on ATS are of these types:
Exact PK and RK
Exact PK, RK range
PK Range
PK Range, RK range
Based on your comments I'm guessing you've got some data that is similar to this:
PK     RK  Data
Guid1  A   {Data:{...}, RelatedRows: [{PK:"Guid2", RK:"B"}, {PK:"Guid3", RK:"C"}]}
Guid2  B   {Data:{...}, RelatedRows: [{PK:"Guid1", RK:"A"}]}
Guid3  C   {Data:{...}, RelatedRows: [{PK:"Guid1", RK:"A"}]}
and you've retrieved the data at Guid1, and now you need to load Guid2 and Guid3. I'm also presuming that these rows have no common denominator like they're all for the same user. With this in mind I'd create an extra "index table" which could look like this:
PK       RK       Data
Guid1-A  Guid2-B  {Data:{....}}
Guid1-A  Guid3-C  {Data:{....}}
Guid2-B  Guid1-A  {Data:{....}}
Guid3-C  Guid1-A  {Data:{....}}
Here the PK is the combined PK and RK of the parent, and the RK is the combined PK and RK of the child row. You can then run a query that returns all rows with PK="Guid1-A", and you will get all related data with just one call (or two calls overall). The biggest overhead this creates is in your writes: when you write a row, you also have to write a row for each of the related rows and make sure that the data is kept up to date (this may not be an issue for you if this is a write-once kind of scenario).
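The key construction for such an index table can be sketched in Python as follows. The dict layout (PK, RK, Data) and the function name index_rows are assumptions made for illustration; the sketch only builds the rows and does not talk to Table Storage.

```python
def index_rows(parent, children):
    """Build one index-table row per related child.
    parent and each child are dicts with 'PK', 'RK' and 'Data'."""
    parent_key = f"{parent['PK']}-{parent['RK']}"
    return [
        # PK = parent's combined key, RK = child's combined key,
        # Data = a copy of the child's data so one query returns everything
        {"PK": parent_key, "RK": f"{c['PK']}-{c['RK']}", "Data": c["Data"]}
        for c in children
    ]

parent = {"PK": "Guid1", "RK": "A", "Data": {}}
children = [
    {"PK": "Guid2", "RK": "B", "Data": {"x": 1}},
    {"PK": "Guid3", "RK": "C", "Data": {"y": 2}},
]
rows = index_rows(parent, children)
```

Because all rows for one parent share a partition key, they can be fetched with a single exact-PK query.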
If any of my assumptions are wrong or if you have some example data I can update this answer with more relevant examples.
Try something like this:
TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>()
.Where(TableQuery.CombineFilters(
TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "partition1"),
TableOperators.And,
TableQuery.CombineFilters(
TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row1"),
TableOperators.Or,
TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row2"))));
I know that this is an old question, but as Azure STILL does not support secondary indexes, it seems it will be relevant for some time.
I was hitting the same type of problem. In my scenario, I needed to look up hundreds of items within the same partition, where there are millions of rows (imagine a GUID as row key). I tested a few options to look up 10,000 rows:
(PK && RK)
(PK && RK1) || (PK && RK2) || ...
PK && (RK1 || RK2 || ... )
I was using the Async API, with a maximum 10 degrees of parallelism (max 10 outstanding requests). I also tested a couple of different batch sizes (10 rows, 50, 100).
Test                         Batch size   API calls   Elapsed (sec)
(PK && RK)                   1            10000       95.76
(PK && RK1) || (PK && RK2)   10           1000        25.94
(PK && RK1) || (PK && RK2)   50           200         18.35
(PK && RK1) || (PK && RK2)   100          100         17.38
PK && (RK1 || RK2 || … )     10           1000        24.55
PK && (RK1 || RK2 || … )     50           200         14.90
PK && (RK1 || RK2 || … )     100          100         13.43
NB: These are all within the same partition - just multiple rowkeys.
I would have been happy to just reduce the number of API calls. But as an added benefit, the elapsed time is also significantly less, saving on compute costs (at least on my end!).
Not too surprisingly, the batches of 100 rows delivered the best elapsed time. There are obviously other performance considerations, especially network usage (test #1, (PK && RK), hardly uses the network at all, for example, whereas the others push it much harder).
EDIT
Be careful when querying for many row keys. There is (of course) a URL length limitation on the query. If you exceed the length, the query will still succeed because the service cannot tell that the URL was truncated. In our case, we limited the combined query length to about 2500 characters (URL encoded!)
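One way to stay under such a limit is to split the row keys into batches and measure the URL-encoded length of each filter before sending it. The Python sketch below builds filter strings of the form PK && (RK1 || RK2 || ...) in OData syntax; the function names, the default limits, and the assumption that keys contain no single quotes are all choices made for this illustration.

```python
from urllib.parse import quote

def build_filter(partition_key, row_keys):
    # Assumes keys contain no single quotes (no escaping is done here)
    rk = " or ".join(f"RowKey eq '{k}'" for k in row_keys)
    return f"PartitionKey eq '{partition_key}' and ({rk})"

def batch_filters(partition_key, row_keys, max_encoded_len=2500, max_batch=100):
    batches, current = [], []
    for key in row_keys:
        candidate = current + [key]
        too_long = len(quote(build_filter(partition_key, candidate))) > max_encoded_len
        if current and (too_long or len(candidate) > max_batch):
            # Close the current batch and start a new one with this key
            batches.append(build_filter(partition_key, current))
            current = [key]
        else:
            current = candidate
    if current:
        batches.append(build_filter(partition_key, current))
    return batches

filters = batch_filters("p1", [f"row{i}" for i in range(250)])
```

Each resulting filter can then be issued as its own query, optionally in parallel.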
Batch "Get" operations are not supported by Azure Table Storage. Supported operations are: Add, Delete, Update, and Merge. You would need to execute queries as separate requests. For faster processing, you may want to execute these queries in parallel.
Your best bet is to create a Linq/OData select query... that will fetch what you're looking for.
For better performance you should make one query per partition and run those queries simultaneously.
I haven't tested this personally, but think it would work.
How many entities do you have per partition? With one retrieve operation you can pull back up to 1000 records per query. Then you could do your Row Key filtering on the in memory set and only pay for 1 operation.
Another option is to do a Row Key range query to retrieve part of a partition in one operation. Essentially you specify an upper and lower bound for the row keys to return, rather than an entire partition.
Okay, so for a batch retrieve operation, the best-case scenario is a table query. The less optimal situation requires parallel retrieve operations.
Depending on your PK, RK design, you can take a list of (PK, RK) pairs and figure out the smallest/most efficient set of retrieve/query operations that you need to perform. You then fetch all these things in parallel and sort out the exact answer client side.
IMAO, it was a design miss by Microsoft to add the Retrieve method to the TableBatchOperation class because it conveys semantics not supported by the table storage API.
Right now, I'm not in the mood to write something super efficient, so I'm just gonna leave this super simple solution here.
var retrieveTasks = new List<Task<TableResult>>();
foreach (var item in list)
{
    retrieveTasks.Add(table.ExecuteAsync(TableOperation.Retrieve(item.pk, item.rk)));
}
var retrieveResults = new List<TableResult>();
foreach (var retrieveTask in retrieveTasks)
{
    retrieveResults.Add(await retrieveTask);
}
This asynchronous block of code will fetch the entities in list in parallel and store the results in retrieveResults, preserving the order. If you have contiguous ranges of entities that you need to fetch, you can improve this by using a range query.
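For readers outside .NET, the same start-everything-then-await pattern can be sketched with Python's asyncio. The retrieve coroutine is a hypothetical stand-in for a single-entity lookup (it just sleeps and echoes the keys), and the semaphore that caps the number of outstanding requests is an addition that is not in the C# snippet.

```python
import asyncio

async def retrieve(pk, rk):
    # Hypothetical stand-in for a single-entity lookup against the table service
    await asyncio.sleep(0.01)
    return {"PartitionKey": pk, "RowKey": rk}

async def retrieve_all(keys, max_parallel=10):
    limit = asyncio.Semaphore(max_parallel)  # cap on outstanding requests

    async def bounded(pk, rk):
        async with limit:
            return await retrieve(pk, rk)

    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(bounded(pk, rk) for pk, rk in keys))

results = asyncio.run(retrieve_all([("p1", "a"), ("p1", "z"), ("p2", "b")]))
```

As in the C# version, the lookups overlap in time while the result list keeps the order of the input list.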
There is a sweet spot (that you'll have to find by testing) where it's probably faster/cheaper to query more entities than you need for a specific batch retrieve and then discard the results you don't need.
If you have a small partition you might benefit from a query like so:
where pk=partition1 and (rk=rk1 or rk=rk2 or rk=rk3)
If the lexicographic distance (i.e. the distance in sort order) between your keys is great, you might want to fetch them in parallel. For example, if you store the alphabet in table storage, fetching a and z, which are far apart, is best done with parallel retrieve operations, while fetching a, b and c, which are close together, is best done with a query. Fetching a, b, c and z would benefit from a hybrid approach.
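The hybrid idea can be sketched as a small planning step that groups nearby keys into range queries and leaves isolated keys as point retrieves. This Python sketch assumes single-character row keys and uses the difference in code points as the distance, which is a simplification chosen to mirror the alphabet example; plan_fetches and max_gap are hypothetical names.

```python
def plan_fetches(row_keys, max_gap=1):
    """Split single-character keys into (range queries, point retrieves).
    Keys whose neighbours are within max_gap code points form one run."""
    runs = []
    for key in sorted(set(row_keys)):
        if runs and ord(key) - ord(runs[-1][-1]) <= max_gap:
            runs[-1].append(key)  # close enough: extend the current run
        else:
            runs.append([key])    # too far: start a new run
    ranges = [(r[0], r[-1]) for r in runs if len(r) > 1]
    points = [r[0] for r in runs if len(r) == 1]
    return ranges, points

ranges, points = plan_fetches(["a", "b", "c", "z"])
```

For a, b, c and z this yields one range query from a to c plus one point retrieve for z, which can then run in parallel.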
If you know all this up front, you can compute the best thing to do for a given set of PKs and RKs. The more you know about how the underlying data is sorted, the better your results will be. I'd advise against a general approach to this one; instead, try to apply what you learn from these different query patterns to solve your problem.