How to resolve ArangoDB Foxx deadlocking problems?

Under testing, our Foxx app is running into "deadlock detected" errors. These seem to be caused by traversal queries. A priori, it is difficult if not impossible to know which collections will be used during a traversal. However, I did take one specific case where I could determine the collections involved and wrap the AQL in a transaction for testing:
var result = db._executeTransaction({
  collections: {
    read: [ "pmlibrary", "pmvartype", "pmvariant", "pmproject", "pmsite", "pmpath", "pmattic" ]
  },
  action: function () {
    var db = require("@arangodb").db;
    var res = db._query("FOR o IN ['pmlibrary/199340787'] FOR v, e, p IN 0..7 INBOUND o pm_child RETURN p.vertices");
    return res.toArray();
  }
});
FYI, the list of collections in the transaction's read set does not include the edge collections.
However, the deadlocks on this statement continue. I'm not sure what to try next. Thanks.

I'm not an indexing/performance expert with ArangoDB, but I recently had some issues with deadlocking too, and by reconstructing and reshaping the queries I got huge performance gains; some queries that took 120 seconds dropped to 0.2 seconds.
A key thing I did was to help ArangoDB recognise when it could use an index, and to make sure that index was actually there to be used.
In your example query, there is a problem that stops ArangoDB from recognising that an index applies:
FOR o IN ['pmlibrary/199340787']
FOR v,e,p IN 0..7 INBOUND o pm_child
RETURN p.vertices
The key issue is that your outer FOR loop iterates over a literal array, which may prevent ArangoDB from identifying a usable index. You could easily replace o with a bind parameter, or put the 'pmlibrary/199340787' value in directly, and then use the Explain button to see whether an index is picked up.
One thing I found was that making the intended index use very explicit sometimes meant adding extra LET statements so that ArangoDB builds the contents of arrays first; it then seems better able to work out which index to use.
If you were to test the performance of the query above, versus something like:
LET my_targets = (FOR o IN pmlibrary FILTER o._key == '199340787' RETURN o._id)
FOR o IN my_targets
FOR v,e,p IN 0..7 INBOUND o pm_child
RETURN p.vertices
It may seem counter-intuitive, but in testing I found amazing performance gains by breaking queries apart to help ArangoDB notice that it could use an index.
If a query filters on values inside arrays, or uses dynamic attribute names, it won't always use an index even if one exists.
Also, I found that the server caches query responses, so to test changes to a long-running query without cache hits on the second and subsequent runs, I had to stop and restart the arangodb3 service.
I hope that helps give you a target for shuffling around your query to work with indexes.
Remember that you already have built-in indexes on _id, _key, _from and _to, so you need to make sure the query will also use whatever other indexes can help speed it up.
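For example, here is a minimal arangosh sketch of that workflow; the indexed field name is purely illustrative and not taken from the question:
// add a persistent index on a field you filter on (hypothetical field "name")
db.pmlibrary.ensureIndex({ type: "persistent", fields: ["name"] });

// inspect the execution plan to confirm the index is actually used
db._explain(
  "FOR o IN pmlibrary FILTER o.name == @name RETURN o._id",
  { name: "some value" }
);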

The proper solution turned out to be adding a WITH clause to the AQL query rather than wrapping it in a transaction.
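For reference, a rough sketch of what that looks like for the traversal above (assuming the same collection names as in the question); the WITH clause declares up front which vertex collections the traversal may read, so the necessary locks can be acquired without deadlocking:
var res = db._query(
  "WITH pmlibrary, pmvartype, pmvariant, pmproject, pmsite, pmpath, pmattic " +
  "FOR o IN ['pmlibrary/199340787'] " +
  "  FOR v, e, p IN 0..7 INBOUND o pm_child " +
  "    RETURN p.vertices"
);
var result = res.toArray();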

Related

Inconsistent results when sorting documents by _doc

I want to fetch elasticsearch hits using the sort+search_after paging mechanism.
The elasticsearch documentation states:
_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.
However, when performing the same query multiple times, I get different results. More specifically, the first hit alternates randomly between two different hits, where the returned sort field is 0 for one hit, and some specific number for the other.
This obviously breaks the paging, as it relies on the sort value returned being fed into search_after for the next query.
No data is being written to the index while I am querying it, so this is not because of refreshes.
My questions are therefore:
Is it wrong to sort by _doc for paging? Seems the results I get are inconsistent.
How does sorting by _doc work internally? The documentation is lacking in this regard as it simply states the sort is performed by "index order".
The data was written to the index in parallel using Spark. I thought the problem might be the parallel write combined with the "index order" sorting, but I did not manage to reproduce this behaviour with other indices that were also written to from Spark.
Elasticsearch 7; the index has 2 shards: one primary and one replica.
cheers.
The reason this happened is that the index has two shard copies: one primary and one replica. The documents were not indexed in the same internal order on both copies, so the _doc order of the results depends on which copy served the request. This is fine when using scrolling, because Elasticsearch keeps internal state for the duration of the scroll, but not with search_after paging, which is stateless.
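A common workaround (sketched here with the Node.js Elasticsearch client; the index and field names are made up) is to sort on a field of your own that is unique and identical across shard copies, so that search_after sees a stable order no matter which copy answers:
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

// fetch one page; pass the `sort` values of the last hit as `searchAfter`
async function fetchPage(searchAfter) {
  const { body } = await client.search({
    index: 'my-index',                       // hypothetical index name
    body: {
      size: 1000,
      // sort on your own unique field instead of _doc,
      // so the order does not depend on primary vs replica
      sort: [{ my_unique_id: 'asc' }],
      ...(searchAfter ? { search_after: searchAfter } : {})
    }
  });
  return body.hits.hits;
}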

Querying a MongoDB database collection with over 60k documents

I have tried searching for similar answers and applied their solutions, but they do not seem to work in my case. I am querying a Mongoose collection that contains 60k documents, and I need all 60k of them to apply combinatorics, so I can't apply limits. I could reduce the data volume by querying several times based on a certain property, but that would also be costly in terms of performance. I don't see what else to try. Can someone help me out?
I am using this simple code for now:
StagingData.find({})
  .lean()
  .exec(function (err, results) {
    console.log(results); // I don't get any output
  });
When I use:
let data = await StagingData.find({}).lean() //it takes forever
What should I do?
You might want to apply indexing first, precompute some values as a separate operation, use parallel processing, etc. For that you may want to jump to a different technology, maybe Elasticsearch or Spark, depending on your code.
You may also want to identify the bottleneck in your process: memory or CPU. Try experimenting with a smaller set of documents and see how quickly you get results; from that you can estimate how long the whole dataset will take.
You may also try breaking the operation down into smaller chunks and measuring the cost of each processing step.
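One way to keep memory under control while still visiting all 60k documents, sketched against the Mongoose API (the batch size is arbitrary), is to stream the collection with a cursor instead of materialising one huge array:
async function processAll() {
  // stream documents in batches instead of loading all 60k at once
  const cursor = StagingData.find({}).lean().cursor({ batchSize: 1000 });

  let processed = 0;
  for await (const doc of cursor) {
    // accumulate whatever you need from each document here
    processed += 1;
  }
  console.log(`processed ${processed} documents`);
}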

Are Objection.js WHERE IN queries slowing my Node.js application down?

OK, so this is basically a yes/no question. I have a Node application running on Heroku with a Postgres plan. I use Objection.js as the ORM. I'm seeing 300+ ms response times on most of the endpoints, and I have a theory about why. I would like to have it confirmed.
Most of my api endpoints do around 5-10 eager loads. Objection.js handles eager loading by doing additional WHERE IN queries and not by doing one big query with a lot of JOINS. The reason for this is that it was easier to build that way and that it shouldn't harm performance too much.
But that got me thinking: Heroku Postgres presumably doesn't run on the same Heroku dyno as the Node application, so there is network latency for every query. Could it be that all these latencies add up, causing a total of 300 ms delay?
Summarizing: would it be speedier to build your own queries with Knex instead of generating them through Objection.js, in case you have a separately hosted database?
Speculating about why this kind of slowdown happens is mostly useless (I'll speculate a bit at the end anyway ;) ). The first thing to do in this kind of situation is to measure where those 300 ms are actually spent.
From the database side you should be able to see query times, to spot whether any slow queries are causing the problem.
Knex also prints some performance information to the console when you run it with the DEBUG=knex:* environment variable set.
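If you also want per-query timings from the application side, here is a rough sketch using Knex's query events; it assumes knex is your configured Knex instance (Objection exposes it via Model.knex()):
// time each SQL query issued by knex/Objection
const started = new Map();

knex.on('query', q => {
  started.set(q.__knexQueryUid, process.hrtime.bigint());
});

knex.on('query-response', (_response, q) => {
  const t0 = started.get(q.__knexQueryUid);
  if (t0 !== undefined) {
    started.delete(q.__knexQueryUid);
    console.log(`${Number(process.hrtime.bigint() - t0) / 1e6} ms  ${q.sql}`);
  }
});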
Node nowadays has built-in profiling support, which you can enable by passing the --inspect flag when starting Node. You can then connect the process to Chrome DevTools and see where Node is spending its time; from that profile you can tell, for example, whether parsing the database query results dominates execution time.
The best way to figure out the slowness is to isolate the slow part of the app and inspect it carefully, or even post that isolated example to Stack Overflow so other people can tell you why it might be slow. A general description like this doesn't give others much to work with when trying to resolve the real issue.
Objection.js handles eager loading by doing additional WHERE IN queries and not by doing one big query with a lot of JOINS. The reason for this is that it was easier to build that way and that it shouldn't harm performance too much.
With Objection you can select which eager-loading algorithm you want to use. In most cases (when there are one-to-many or many-to-many relations) making multiple queries is actually more performant than using a join, because with a join the amount of data explodes, and the transfer time plus result parsing on the Node side take too much time.
would it be speedier to build your own queries with Knex instead of generating them through Objection.js, in case you have a separately hosted database?
Usually no.
Speculation part:
You mentioned that most of your API endpoints do around 5-10 eager loads. Most of the time when I have encountered this kind of slowness, the reason has been that the app is querying overly large chunks of data from the database. When a query returns, for example, tens of thousands of rows, that is several megabytes of JSON data, and just parsing that much data from the database into JavaScript objects takes hundreds of milliseconds. If your queries also cause high CPU load during that 300 ms, then this might be your problem.
Slow queries. Sometimes the database doesn't have the right indexes, so queries have to scan tables linearly to produce results. Checking for slow queries in the DB logs will help find these. Also, if getting a response takes long but CPU load on the Node process is low, this might be the case.
I can hereby confirm that every query takes approx. 30 ms, regardless of its complexity. Objection.js indeed issues around 10 separate queries because of the eager loading, which explains the cumulative 300 ms.
Just FYI; I'm going down this road now ⬇
I've started to delve into writing my own more advanced SQL queries. It seems like you can do pretty advanced stuff, achieving results similar to the eager loading of Objection.js:
select
"product".*,
json_agg(distinct brand) as brand,
case when count(shop) = 0 then '[]' else json_agg(distinct shop) end as shops,
case when count(category) = 0 then '[]' else json_agg(distinct category) end as categories,
case when count(barcode) = 0 then '[]' else json_agg(distinct barcode.code) end as barcodes
from "product"
inner join "brand" on "product"."brand_id" = "brand"."id"
left join "product_shop" on "product"."id" = "product_shop"."product_id"
left join "shop" on "product_shop"."shop_code" = "shop"."code"
left join "product_category" on "product"."id" = "product_category"."product_id"
left join "category" on "product_category"."category_id" = "category"."id"
left join "barcode" on "product"."id" = "barcode"."product_id"
group by "product"."id"
This takes 19ms on 1000 products, but usually the limit is 25 products, so very performant.
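If you stay on Knex for this, here is a minimal sketch of running such a raw query from Node; it assumes knex is your configured Knex instance, and the query is a shortened version of the one above:
// inside an async function; returns plain rows from Postgres
const { rows } = await knex.raw(
  `select "product".*, json_agg(distinct brand) as brand
     from "product"
     inner join "brand" on "product"."brand_id" = "brand"."id"
     group by "product"."id"
     limit ?`,
  [25]
);
console.log(rows.length);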
As mentioned by other people, Objection uses multiple queries instead of joins by default to perform eager loading. It's a safer default than fully join-based loading, which can become really slow in some cases. You can read more about the default eager algorithm here.
You can choose the join-based algorithm simply by calling joinEager instead of eager. joinEager executes one single query.
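For example (a rough sketch against the Objection API of that era, inside an async function; Product and the relation names are hypothetical):
// default algorithm: one extra WHERE IN query per relation
const viaWhereIn = await Product.query()
  .eager('[brand, shops, categories]')
  .limit(25);

// join-based algorithm: a single query built with joins
const viaJoin = await Product.query()
  .joinEager('[brand, shops, categories]')
  .limit(25);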
Objection also used to have a (pretty stupid) default of one parallel query per operation, which meant that all queries inside an eager call were executed sequentially. That default has since been removed, and you should get better performance even in cases like yours.
Your trick of using json_agg is pretty clever and actually avoids the slowness problems I referred to that can arise in some cases when using joinEager. However, that cannot be easily used for nested loading or with other database engines.

How to implement custom pagination with LINQ2Entities using 1 call to the database

I am working on an ASP.NET Web Forms project and I use jQuery DataTables to visualize data fetched from SQL Server. I need to pass both the results for the current page and the total number of results, for which so far I have this code:
var queryResult = query.Select(p => new[] { p.Id.ToString(),
                                            p.Name,
                                            p.Weight.ToString(),
                                            p.Address })
                       .Skip(iDisplayStart)
                       .Take(iDisplayLength)
                       .ToArray();
and the result I get when I return it to the view like this:
iTotalRecords = queryResult.Count(),
is the number of records the user has chosen to see per page. Logical, but I hadn't thought about it while building my method chain. Now I'm considering the optimal way to implement this. Since it will likely be used with relatively large amounts of data (10,000 rows, maybe more), I would like to leave as much work as possible to the SQL server. The several questions I found about this give me the impression that I have to either make two queries to the database or manipulate the total count in my code, and I don't think that will be efficient when working with many records.
So what can I do here to get best performance?
With regard to what you're looking for, I don't think there is a simple answer.
I believe the only way you can currently do this is by running more than one query, as you have already suggested, whether that is encapsulated inside a stored procedure (SPROC) call or generated by EF.
However, I believe you can make optimisations so that your query runs quicker.
First of all, because you are chaining your methods together, every execution MAY result in the query being re-cached; that means the actual query will need to be recompiled and cached by SQL Server (if that is your chosen technology) before being executed. This normally takes only a few milliseconds, but if the query itself also only takes a few milliseconds, it is relatively expensive.
Entity Framework will translate this LINQ query and execute it using derived tables. With a small result set of approx. 1k records to be paged, your current solution may be best suited. This also depends on how complex the SQL filtering generated by your method chaining is.
If the result set to be paged grows towards 15k records, I would suggest writing a SPROC for the best performance and scalability. It would insert the records into a temp table and run two queries against it: first to get the paged records, and second to get the total count.
alter proc dbo.usp_GetPagedResults
    @Skip int = 10,
    @Take int = 10
as
begin
    select
        row_number() over (order by id) [RowNumber],
        t.Name,
        t.Weight,
        t.Address
    into
        #results
    from
        dbo.MyTable t

    declare @To int = @Skip + @Take - 1

    select * from #results where RowNumber between @Skip and @To
    select max(RowNumber) from #results
end
go
You can use EF to map the SPROC call to entity types, or create a new custom type containing the results and the total count.
Stored Procedures with Multiple Results
I found that the cost of running the above SPROC was approximately a third of the query EF generated for the same result, based on a result set of 15k records. It was, however, three times slower than the EF query with only a 1k-record result set, due to the temp table creation.
Encapsulating this logic inside a SPROC allows the query to be refactored and optimised as your result set grows without having to change any application based code.
The suggested solution doesn't wrap the derived-table queries generated by Entity Framework inside a SPROC, because I found only a marginal performance difference between running such a SPROC and running the query directly.

Neo4j Optimization Questions for Server Plug-in Queries

I'm trying to optimize a fuzzy search query. It's fairly large, as it searches most properties in the database for a single word. I have some questions about some things I've been doing to improve the search speed.
Test Info: I added about 10,000 nodes and I'm searching on about 40 properties. My query times are about 3-30 seconds depending on the criteria.
MATCH (n) WHERE
(n:Type__Exercise and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' )) or
(n:Type__Fault and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' ))
with n LIMIT 100
return count(n)
This is basically my query, but with a lot more OR clauses. I also use parameters when sending the query to the execution engine. I realize it's very expensive to use the regular expressions on every single property. I'm hoping I can get good enough performance without doing exact matches up to a certain amount of data (This application will only have 1-10 users querying at a time). This is a possible interim effort we're investigating until the new label indexes support full text queries.
First of all, how do I tell if my query was cached? I call my server plug-in via the curl command, and the times I'm seeing are almost identical each time I pass the same criteria (the time is for the entire curl command to finish). I'm using a single instance of the execution engine, created from the GraphDatabaseService that is passed into the plug-in via a @Source parameter. How much of an improvement should I see if a query is cached?
Is there a query size where Neo4j doesn't bother caching the query?
How effective is the LIMIT clause at speeding up queries? I added one, but didn't see a great performance boost (for queries that do have results). Does the execution engine stop once it finds enough nodes?
My queries are read-only; do I still have to wrap my calls in a transaction?
I could split up my query so I only search one property at a time or say 4 properties at a time. Then I could run the whole set of queries via the execution engine. It seems like this would be better for caching, but is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worth while?
Is there a way to use parameters when using PROFILE in the Neo4j console? I've been trying to use this to see how many db hits I'm getting on my queries.
How effective is the Neo4j browser for comparing times it takes to execute a query?
Does caching happen here?
If I want to warm up Neo4j data for queries - can I run the exact queries I'm expecting? Does the query need to return data, or will a count type query warm the cache? As an alternative, should I just iterate over all the nodes? I'd rather just pull in the nodes that are likely to be searched vs all of them.
I think for the time being you'd be better served using the legacy fulltext indexing facilities; I recently wrote a blog post about it: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/
If you don't want to do that:
I would probably also rewrite your query to turn it around:
MATCH (n)
WHERE
(n:Type__Exercise OR n:Type__Fault) AND
(n.description =~ '(?i).*criteria.*' OR n.name =~ '(?i).*criteria.*' )
You can probably also benefit a bit more by having a secondary "search" field that is just the concatenation of your description and name fields. You probably also want to improve your regexp, e.g. by adding a word boundary \b on the left and right.
Regarding your questions:
First of all, how do I tell if my query was cached?
Your query will be cached if you use parameters (for the regexps); there is a configurable query-cache size (defaulting to 100 queries).
Is there a query size where Neo4j doesn't bother caching the query?
Neo4j currently caches all queries that come in, regardless of size.
My queries are read-only, do I still have to wrap my calls with a transaction?
Cypher will create its own transaction. In general, read transactions are mandatory. For Cypher you need an outer transaction only if you want multiple queries to participate in the same transaction scope.
is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worth while?
It depends: smaller queries execute more quickly (if they touch less of the total dataset), but you have to combine their results in the client.
If they touch the same nodes you do double work.
For bigger queries you have to watch out for spanning up cross products or exponential path explosions.
Regarding running smaller queries with many threads
Good question; it should be faster, but there are currently some bottlenecks that we're about to remove. Just try it out.
Is there a way to use parameters when using PROFILE in the Neo4j console?
You can use the shell variables for that: set them with export name=value and list them with env,
e.g.
export name=Lisa
profile match (n:User {name:{name}}) return n;
How effective is the Neo4j browser for comparing times it takes to execute a query?
The browser measures the complete round trip, with potentially more data loading, so its timing is not very accurate.
Warmup
The exact queries would make sense.
You don't have to return data; it is enough to return count(*), but you should access the properties you intend to query later so that they are actually loaded.