ArangoDB Slow Query - arangodb

I'm new to ArangoDB, having trouble optimizing my queries, and am hoping for some help.
The query I've provided below is a real example that I'm struggling with: it takes 758.078 ms on my dev database, but on staging, with a much larger dataset, it takes 531.511 s.
I'll also provide the size of each of the edge tables I'm traversing in dev and staging. Any help is really appreciated.
for doc in document
  filter doc._key == "my-key"
  for v, e, p in 3 any doc edge1, edge2, edge3
    options {uniqueVertices: 'global', bfs: true}
    filter DATE_ISO8601(p.vertices[2].date) > DATE_ISO8601("2017-09-04T00:00:01Z")
      and DATE_ISO8601(p.vertices[2].date) < DATE_ISO8601("2017-09-15T23:59:59Z")
    limit 1
    return {
      commit: p.vertices[2].hash,
      date: p.vertices[2].date,
      message: p.vertices[2].message,
      author: p.vertices[1].email,
      loc: p.vertices[3].stats.additions
    }
DEV
edge1: 2,638
edge2: 2,560
edge3: 386
STAGING
edge1: 5,438,811
edge2: 5,544,028
edge3: 423,545

The query is potentially slow because the filter condition
filter DATE_ISO8601(p.vertices[2].date) > DATE_ISO8601("2017-09-04T00:00:01Z")
  and DATE_ISO8601(p.vertices[2].date) < DATE_ISO8601("2017-09-15T23:59:59Z")
is not applied during the traversal but only afterwards.
Potentially this is due to the function calls (to DATE_ISO8601) in the filter conditions. If your date values are stored as numbers, can you check whether the following filter condition speeds up the query:
filter p.vertices[2].date > DATE_TIMESTAMP("2017-09-04T00:00:01Z")
  and p.vertices[2].date < DATE_TIMESTAMP("2017-09-15T23:59:59Z")
That modified filter condition should allow pulling the filter condition inside the traversal, so it is executed earlier.
You can verify the query execution plans using db._explain(<query string goes here>); in the ArangoShell or from the web interface's AQL editor.
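For example, here is a minimal sketch of that check using the python-arango driver (database name and credentials are placeholders, and it assumes the date values are stored as numeric timestamps as suggested above):
from arango import ArangoClient

# Placeholder connection details
db = ArangoClient().db("mydb", username="root", password="secret")

query = """
for doc in document
  filter doc._key == "my-key"
  for v, e, p in 3 any doc edge1, edge2, edge3
    options {uniqueVertices: 'global', bfs: true}
    filter p.vertices[2].date > DATE_TIMESTAMP("2017-09-04T00:00:01Z")
      and p.vertices[2].date < DATE_TIMESTAMP("2017-09-15T23:59:59Z")
    limit 1
    return p.vertices[2]
"""

# Inspect the execution plan; with the rewritten filter the date condition
# should be pulled into the traversal rather than sit in a FilterNode after it.
plan = db.aql.explain(query)
print(plan)
If the plan still shows the filter applied after the traversal, the date values are probably not stored as numbers.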

Probably a bit late, but it may help someone: using a DATE function in the filter will make the query much slower, so remove it if possible. In my profiling, the filter that called a DATE function took about 7 s, while the equivalent filter written without the DATE function ran in about 0.5 s; both versions query the data of 2018-09-29.
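One way to avoid DATE functions in the filter entirely is to compute the boundaries client-side and pass them as bind variables. A hedged sketch with python-arango, assuming the date field is stored as a Unix timestamp in milliseconds (the commits collection name and connection details are placeholders):
from datetime import datetime, timezone
from arango import ArangoClient

db = ArangoClient().db("mydb", username="root", password="secret")  # placeholders

# Compute the day's boundaries in Python and pass them as bind variables,
# so the AQL filter is a plain numeric comparison with no function calls.
start = int(datetime(2018, 9, 29, tzinfo=timezone.utc).timestamp() * 1000)
end = start + 24 * 60 * 60 * 1000

cursor = db.aql.execute(
    "for c in commits filter c.date >= @start and c.date < @end return c",
    bind_vars={"start": start, "end": end},
)
results = list(cursor)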

Related

[Shopware6]: How can I add SQL Filter to Criteria?

So, the criteria are already quite powerful, yet I came across a case I can't seem to replicate on the criteria object.
I needed to filter out all entries that were no longer timely.
In a world, where you'd be able to mix SQL with the field definition, it would look like this:
...->addFilter(
new RangeFilter('DATEDIFF(NOW(), INTERVAL createdAt DAY)', [RangeFilter::LTE => 1])
)
Unfortunately that doesn't work in our world.
When I pass the criteria to a search function, I only get:
"DATEDIFF(NOW(), INTERVAL createdAt DAY)" is not a field on xyz
I tried to do it with ->addExtensions and several other experiments, but I couldn't get it to work. I resorted to using the queryBuilder from Doctrine with queryParts, but the data I'm getting is not very clean and not assigned to an ORMEntity like it should be.
Is it possible to write a criteria that incorporates native SQL filtering?
The DAL is designed in a way that explicitly does not accept SQL statements, as that is a core concept of the abstraction. Since the DAL offers extensibility for third-party extensions, it should be preferred over raw SQL in most cases. I would suggest writing a lightweight query that only fetches the IDs using your SQL query and then using these pre-filtered IDs to fetch complete data sets through the DAL.
$ids = (new QueryBuilder($connection))
    ->select(['LOWER(HEX(id))'])
    ->from('product')
    ->where('...')
    ->execute()
    ->fetchFirstColumn();

$criteria = new Criteria($ids);
This should offer the best of both worlds: the freedom of using raw SQL and the extensibility features of the DAL.
In your specific case you could also just take the current day, subtract the number of days that should have passed, and use this threshold date to compare against the creation date:
$now = new \DateTimeImmutable();
$dateInterval = new \DateInterval('P1D');
$thresholdDate = $now->sub($dateInterval);
// filter to get all with a creation date greater than now -1 day
$filter = new RangeFilter(
'createdAt',
[RangeFilter::GTE => $thresholdDate->format(Defaults::STORAGE_DATE_TIME_FORMAT)]
);

Optimise Pymongo query time

I am running a query on MongoDB and looking for solutions to optimise the time taken.
My query is like collection.find({'nameId':989080880,'Date':{'$gte':startDate}})
What I did is as below:
pd.DataFrame(collection.find({'nameId':989080880,'Date':{'$gte':startDate}}))
This query took x ms.
Then I tried:
document=[]
for doc in collection.find({'nameId':989080880,'Date':{'$gte':startDate}}):
document.append(doc)
but it gave only a 15% improvement over the original x ms.
I cannot create an index, as 'nameId' is a long integer and indexing would require much more RAM, etc.
Looking forward to some suggestions.
First, optimize the list construction using a list comprehension, something like:
document=[doc for doc in collection.find({'nameId':989080880,'Date':{'$gte':startDate}})]
Second, if you don't need all the fields in your documents, use a projection: an object listing the fields you need returned.
With projection = { name: 1, address: 1 } in your find query, each returned document contains only the name and address fields, plus _id.
For more information, you can read here
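Putting both suggestions together, a minimal pymongo sketch might look like this (the database, collection, and projected field names are placeholders; nameId and Date are from the question):
from datetime import datetime
from pymongo import MongoClient

collection = MongoClient()["mydb"]["mycollection"]  # placeholder names
startDate = datetime(2023, 1, 1)

# Project only the fields you actually need; _id is returned by default.
documents = [
    doc
    for doc in collection.find(
        {"nameId": 989080880, "Date": {"$gte": startDate}},
        {"name": 1, "address": 1},
    )
]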

Referencing external doc in CouchDB view

I am scraping a 90K record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records that differ.
However, I cannot quite figure out how to pull in another doc in the view, everything I have read only discusses external docs using the emit() function, which is too late to permit me to compare it. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
  if (doc._id.slice(0,1) !== '$' && doc._id.slice(0,1) !== "_") {
    // lookup() is the hypothetical function described above
    var otherDoc = lookup('$test' + doc._id);
    if (otherDoc) {
      var keys = doc.value.keys();
      var same = true;
      keys.forEach(function(key) {
        if ((key.slice(0,1) !== '_') && (key.slice(0,1) !== '$') && (key !== 'expires')) {
          if (!Object.equal(otherDoc[key], doc[key])) {
            same = false;
          }
        }
      });
      if (!same) {
        emit(doc._id, 1);
      }
    }
  }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
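As a rough illustration of that last approach, here is a hedged client-side check over such combined documents using the CouchDB HTTP API (server URL, credentials, database, and field names are placeholders):
import requests

BASE = "http://admin:password@localhost:5984/mydb"  # placeholder credentials and db

def batches_match(doc_id):
    doc = requests.get(f"{BASE}/{doc_id}").json()
    return doc.get("batch_one") == doc.get("batch_two")

# List the ids of documents whose two scrapes disagree.
mismatched = [
    row["id"]
    for row in requests.get(f"{BASE}/_all_docs").json()["rows"]
    if not row["id"].startswith("_") and not batches_match(row["id"])
]
print(mismatched)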
Personally I like the last option better, but only if you don't plan to use the database as-is in production, i.e., you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.

Querying azure table storage for null values

Does anyone know the proper way to query Azure table storage for a null value? From what I've read, it's possible (although there is a bug which prevents it on development storage). However, I keep getting the following error when I do so on the live cloud storage:
One of the request inputs is not valid.
This is a dumbed down version of the LINQ query that I've put together.
var query = from fooBar in fooBarSVC.CreateQuery<FooBar>("FooBars")
            where fooBar.PartitionKey == kPartitionID
               && fooBar.Code == kfooBarCode
               && fooBar.Effective_Date <= kFooBarDate.ToUniversalTime()
               && (fooBar.Termination_Date > kFooBarDate.ToUniversalTime() || fooBar.Termination_Date == null)
            select fooBar;
If I run the query without checking for null, it works fine. I know a possible solution would be to run a second query on the collection that this query brings back. I don't mind doing that if I need to, but would like to know if I can get this approach to work first.
Anyone see anything obvious I'm doing wrong?
The problem is that because Azure table storage does not have a schema, the null column actually doesn't exist; that is why your query is not valid. There is no such thing as a null column in table storage. You could do something like store an empty string if you really have to.
Really, though, the fundamental issue here is that Azure table storage is not built to be queried by any columns other than partition key and row key. Every time you query on one of these non-standard columns you are doing a table scan, and if you start to get lots of data you are going to have a very high rate of query timeouts. I would suggest setting up a manual index for these types of queries; for example, you could store the same data in the same table but with different values for the row key. Ultimately, if your app is not getting crazy-high usage, I would just use SQL Azure, as it will be much more flexible for the types of queries you are doing.
Update: Azure has a great guide on table storage design that I would recommend reading. http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
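For what it's worth, here is a rough sketch of that manual-index idea using the azure-data-tables Python package (connection string, table name, and entity fields are placeholders, not from the question):
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="FooBars")

entity = {
    "PartitionKey": "partition-1",
    "RowKey": "original-key",
    "Code": "ABC",
    "Termination_Date": "",  # store an empty string instead of a null
}
table.upsert_entity(entity)

# Manual index: write the same data again with a RowKey derived from the
# column you query on, so lookups on Code become point queries instead of
# table scans.
table.upsert_entity(dict(entity, RowKey="code-ABC"))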
I just had this problem and found a nice little ninja-trick to actually test for nulls. Although I'm using the Azure Storage interface directly, I'm 90% sure it will work for LINQ too if you do the same.
Here's what I did to check if Price (Int32?) is null:
not (Price lt 0 or Price gt 0)
I'm guessing in your case you can do the same in LINQ by testing whether fooBar.Termination_Date is less or greater than DateTime.UtcNow, for example. Something like this:
where fooBar.PartitionKey == kPartitionID
   && fooBar.Code == kfooBarCode
   && fooBar.Effective_Date <= kFooBarDate.ToUniversalTime()
   && (fooBar.Termination_Date > kFooBarDate.ToUniversalTime()
       || !(fooBar.Termination_Date < DateTime.UtcNow
            || fooBar.Termination_Date > DateTime.UtcNow))
select fooBar;
For a string column called MyColumn I was able to type: not(MyColumn gt '')
Mike S's answer above put me on the right path.
For strings, we can compare to the empty string:
IsNotBlank(value)
can be written as:
(Value gt '')
Using the Azure Tables client library for .NET to query for null Guid values:
In the sample code, the property's name is MyColumn.
var filter = Azure.Data.Tables.TableClient
.CreateQueryFilter($"not(MyColumn gt {Guid.Empty})");
The TableClient.CreateQueryFilter method will create the filter:
not(MyColumn gt guid'00000000-0000-0000-0000-000000000000')
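Since these are just OData filter strings, the same null-avoidance trick works from other SDKs as well; for example, a hedged sketch with the azure-data-tables Python package (connection string, table name, and column are placeholders):
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="FooBars")

# Comparisons against a property that doesn't exist evaluate to false,
# so not(MyColumn gt '') also returns entities where MyColumn is missing.
entities = table.query_entities("not(MyColumn gt '')")
for entity in entities:
    print(entity["PartitionKey"], entity["RowKey"])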

Why is GetPaged() Executing two database calls?

I'm a bit new to SubSonic (i.e. evaluating 3.0.0.3) and have come across a strange behavior in GetPaged(int pageIndex, int pageSize). When I execute the method it makes two SQL calls. Any ideas why?
Details
Let's say I have a "Cultures" table with 200 rows. In my code I do something like ...
var sonicCollection = from c in RTE.Culture.GetPaged(1, 25)
select c;
Now, I would expect this to execute a single query returning the first 25 entries in my Cultures table. When I watch SQL Profiler I see two queries go by.
First this--
SELECT [dbo].[Cultures].[cultureCode], [dbo].[Cultures].[cultureName]
FROM [dbo].[Cultures]
Then this--
SELECT *
FROM (SELECT ROW_NUMBER() OVER (
ORDER BY cultureID ASC) AS Row,
[dbo].[Cultures].[cultureCode], [dbo].[Cultures].[cultureName]
FROM [dbo].[Cultures]
)
AS PagedResults
WHERE Row >= 1 AND Row <= 25
I expect the second query to roll by, as it is the one returning the 25 rows I politely requested of SubSonic. The first query, however, appears to return all 200 rows (at least according to SQL Profiler).
Any ideas what's going on?
It's a bug in the code. The code actually queries every record and then iterates over each one for the count. I've created an issue in the GitHub repo here:
https://github.com/subsonic/SubSonic-3.0/issues/259
You can download the source, fix the issue, and recompile pretty easily. I've done this and it fixed my issue.
You just want to use RTE.Culture.GetPaged() - it runs the paged query for you.
