Efficiently count Documents with different values for a given field


I am trying to count the number of documents that are in each possible state in a particular Arango collection.
This should be possible in 1 pass over all of the documents using a bucket-sort like strategy where you iterate over all documents, if the value for the state hasn't been seen before, you add a counter with a value of 1 to a list. If you have seen that state before, you increment the counter. Once you've reached the end, you'll have a counter for each possible state in the DB that indicates how many documents are currently stored with that state.
I can't seem to figure out how to write this type of logic in AQL to submit as a query. Current strategy is like this:
Loop over all documents, filtering only docs of a particular state.
Loop over all documents, filtering only docs of a different particular state.
...
All states have been filtered.
Return size of each set
This works, but I'm sure it's much slower than it should be. This also means that if we add a new state, we have to update the query to loop over all docs an additional time, filtering based on the new state. A bucket-sort like query would be quick, and would need no updating as new states are created as well.
If these were the documents:
{A}
{B}
{B}
{C}
{A}
Then I'd like the result to be
{ A:2, B:2, C:1 }
Where A,B,&C are values for a particular field. Current strategy filters like so
LET docsA = (
FOR doc IN collection
FILTER doc.state == "A"
RETURN doc
)
Then manually construct the return object calling LENGTH on each list of docs
Any help or additional info would be greatly appreciated

What about using the COLLECT operation with WITH COUNT INTO?
FOR doc IN collection
COLLECT s = doc.state WITH COUNT INTO c
RETURN { state: s, count: c }
This would return something like:
[
{ state: 'A', count: 23 },
{ state: 'B', count: 2 },
{ state: 'C', count: 45 }
]
Would that accomplish what you are after?
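If you want the single-object shape from the question ({ A:2, B:2, C:1 }) on the client side, here is a minimal Python sketch. It only mirrors the one-pass counting idea and the reshaping of COLLECT-style rows; the helper names count_states and rows_to_object are made up for illustration and nothing here talks to ArangoDB.

```python
from collections import Counter

def count_states(docs, field="state"):
    """Count documents per distinct value of `field` in a single pass."""
    return Counter(doc[field] for doc in docs)

def rows_to_object(rows):
    """Turn COLLECT-style rows [{state, count}, ...] into {state: count}."""
    return {row["state"]: row["count"] for row in rows}

# Sample documents from the question: A, B, B, C, A
docs = [{"state": s} for s in ["A", "B", "B", "C", "A"]]
counts = count_states(docs)

# Same shape as the rows returned by the AQL query above
rows = [{"state": s, "count": c} for s, c in sorted(counts.items())]
print(rows_to_object(rows))  # {'A': 2, 'B': 2, 'C': 1}
```

New states show up automatically because the keys come from the data, not from the query text.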

Related

Creating Test data for ArangoDB

Hi I would like to insert random test data into an edge collection called Transaction with the fields _id, Amount and TransferType with random data. I have written the following code below, but it is showing a syntax error.
FOR i IN 1..30000
INSERT {
_id: CONCAT('Transaction/', i),
Amount:RAND(),
Time:Rand(DATE_TIMESTAMP),
i > 1000 || u.Type_of_Transfer == "NEFT" ? u.Type_of_Transfer == "IMPS"
} INTO Transaction OPTIONS { ignoreErrors: true }

Your code has multiple issues:
When you are creating a new document you can either not specify the _key key and Arango will create one for you, or you specify one as a string to be used. _id as a key will be ignored.
RAND() produces a random number between 0 and 1, so it needs to be multiplied to bring it into the range you want. If you need integer values, you may also need to round it.
DATE_TIMESTAMP is a function, and you have passed it as a parameter to RAND(), which takes no parameters. But because it generates a numerical timestamp (milliseconds since 1970-01-01 00:00 UTC), it isn't actually needed. The only thing you need is the random number shifted to a range that makes sense (i.e. not in the 1970s).
I could only guess what the i > 1000 ... line was meant to be. The key for the JSON object is missing, and you reference a variable u that is not defined anywhere. I see the first two parts of a ternary operator expression (cond ? true_value : false_value), but the : is missing. My best guess is that you wanted to create a Type_of_Transfer key with the value "NEFT" when i > 1000 and "IMPS" when i <= 1000.
So, I rewrote your AQL and tested it
FOR i IN 1..30000
INSERT {
_key: TO_STRING(i),
Amount: RAND()*1000,
Time: ROUND(RAND()*100000000+1603031645000),
Type_of_Transfer: i > 1000 ? "NEFT" : "IMPS"
} INTO Transaction OPTIONS { ignoreErrors: true }
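If you would rather generate the documents client-side and import them, the same scaling logic can be sketched in Python. This is only an illustration of the ranges used in the AQL above; make_transaction and the two constants' names are my own, and the sketch does not insert anything into ArangoDB.

```python
import random

BASE_MS = 1603031645000   # timestamp offset taken from the AQL above
SPREAD_MS = 100000000     # width of the random time window, also from the AQL

def make_transaction(i, rng=random):
    """Build one test document shaped like the AQL INSERT above."""
    return {
        "_key": str(i),
        "Amount": rng.random() * 1000,                      # 0 <= Amount < 1000
        "Time": round(rng.random() * SPREAD_MS + BASE_MS),  # integer milliseconds
        "Type_of_Transfer": "NEFT" if i > 1000 else "IMPS",
    }

# Same range as FOR i IN 1..30000 (both ends inclusive)
docs = [make_transaction(i) for i in range(1, 30001)]
```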

Timeseries differencing - ArangoDB (AQL or Python)

I have a collection which holds documents, with each document having a data observation and the time that the data was captured.
e.g.
{
_key:....,
"data":26,
"timecaptured":1643488638.946702
}
where timecaptured for now is a utc timestamp.
What I want to do is get the duration between consecutive observations, with SQL I could do this with LAG for example, but with ArangoDB and AQL I am struggling to see how to do this at the database. So effectively the difference in timestamps between two documents in time order. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.

Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
SORT d.timecaptured
WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
RETURN timediff
We simply calculate the sum of the previous and the current document, and by subtracting the current document's timecaptured we can therefore calculate the timecaptured of the previous document. So now we can easily calculate the requested difference.
I only use the COUNT to return null for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.
However, neither approach is very straightforward or obvious. I put on my TODO list to add an APPEND aggregate function that could be used in WINDOW and COLLECT operations.
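To make the sum trick easier to follow, here is a small Python sketch of the same arithmetic on a plain list of timestamps. It only imitates what the WINDOW { preceding: 1 } aggregation computes; time_diffs is a hypothetical helper name and the sample values are made up.

```python
def time_diffs(timestamps):
    """Difference to the previous observation, computed via a 2-element window sum."""
    ordered = sorted(timestamps)
    diffs = []
    for i, current in enumerate(ordered):
        window = ordered[max(0, i - 1): i + 1]   # previous (if any) + current
        s, cnt = sum(window), len(window)
        previous = s - current                   # recover the predecessor from the sum
        diffs.append(None if cnt == 1 else current - previous)
    return diffs

print(time_diffs([10.0, 12.5, 12.0, 20.0]))  # [None, 2.0, 0.5, 7.5]
```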

The WINDOW function doesn't give you direct access to the data in the sliding window but here is a rather clever workaround:
FOR doc IN collection
SORT doc.timecaptured
WINDOW { preceding: 1 }
AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
LET timediff = doc.timecaptured - d[0].timecaptured
RETURN MERGE(doc, {timediff})
The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.
The time difference is calculated by subtracting the previous document's timecaptured value from the current document's. In the case of the first record, d[0] is actually equal to the current document and the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured, on the other hand, will give you the negated timestamp for the first record because d[1] is null (no previous document) and evaluates to 0.
There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:
LET timediff = doc.timecaptured - (
FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)

Is there a way to add an incrementing id in one statement in MongoDB?

I have a small database that is not going to grow much more. I'm implementing an API in Python that retrieves a single document from the db by a given document id. However, it is awkward to ask the user to type one of the random ids from the db. All I need is a function that modifies each document by setting an auto-incrementing id field. As I said, the collection won't grow much, and performance isn't really an issue here.
So far what I've been able to do is this:
var i = 0
db.MyCollection.update({},
{$set : {"new_field":1}},
{upsert:false,
multi:true}
i ++;),
I managed to set an id field, but it sets the same number on every document (the total document count). So if the db has 10 docs, it sets the id to 10 on each of them.

Find-and-modify operation returns the document updated (before or after the update depending on returnDocument setting). You can use this with $inc to implement a counter. Ruby example where c is a collection:
irb(main):005:0> c['foo'].insert_one(counter:true,count:1)
=> #<Mongo::Operation::Insert::Result:0x8040 documents=[{"n"=>1, "opTime"=>{"ts"=>#<BSON::Timestamp:0x00005609f260b7e0 #seconds=1594961771, #increment=2>, "t"=>1}, "electionId"=>BSON::ObjectId('7fffffff0000000000000001'), "ok"=>1.0, "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x00005609f260b538 #seconds=1594961771, #increment=2>, "signature"=>{"hash"=><BSON::Binary:0x8060 type=generic data=0x0000000000000000...>, "keyId"=>0}}, "operationTime"=>#<BSON::Timestamp:0x00005609f260b290 #seconds=1594961771, #increment=2>}]>
irb(main):011:0> c['foo'].find_one_and_update({counter:true},{'$inc':{count:1}})
=> {"_id"=>BSON::ObjectId('5f112f6b2c97a6281f63f575'), "counter"=>true, "count"=>1}
irb(main):012:0> c['foo'].find_one_and_update({counter:true},{'$inc':{count:1}})
=> {"_id"=>BSON::ObjectId('5f112f6b2c97a6281f63f575'), "counter"=>true, "count"=>2}
irb(main):013:0> c['foo'].find_one_and_update({counter:true},{'$inc':{count:1}})
=> {"_id"=>BSON::ObjectId('5f112f6b2c97a6281f63f575'), "counter"=>true, "count"=>3}
irb(main):014:0> c['foo'].find_one_and_update({counter:true},{'$inc':{count:1}})
=> {"_id"=>BSON::ObjectId('5f112f6b2c97a6281f63f575'), "counter"=>true, "count"=>4}

Why not just use this logic? Instead of updating everything in one query, launch multiple queries one by one. Mongo will do this quickly even with more than 1M docs in the database (and you said you have a small database), because the _id field is indexed by default.
This is JavaScript code, but I guess you'll understand the logic:
// toArray() materializes the cursor so it can be indexed and has a length
let all_documents = db.MyCollection.find({}).toArray();
for (let i = 0; i < all_documents.length; i++) {
db.MyCollection.updateOne({_id: all_documents[i]._id}, {$set: {"new_field": i}}, {upsert: false})
}
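Since the API in the question is written in Python, the loop above could look roughly like this there. This is a sketch, assuming collection is a pymongo Collection (it only relies on find and update_one); assign_sequential_ids is a name I made up, and new_field is the field name from the question.

```python
def assign_sequential_ids(collection, field="new_field"):
    """Give every document a distinct, increasing integer in `field`.

    `collection` is assumed to be a pymongo Collection (or anything exposing
    the same find / update_one methods).
    """
    # Read all _id values first so the updates do not interfere with the cursor.
    ids = [doc["_id"] for doc in collection.find({}, {"_id": 1})]
    for i, doc_id in enumerate(ids):
        collection.update_one({"_id": doc_id}, {"$set": {field: i}})
    return len(ids)
```

It issues one update per document, which is fine for a small collection like the one described.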

MongoDB fetch rows before and after find result

I'm using MongoDB, (Mongoose on Node.js) I have a very large db of events, each event has a field seq (sequence), the order of the events.
I want to allow my users to find all the occurrences of a given event.
For example:
The user is searching for the event "ButtonClicked", I should return the all the locations that this event happened, in this example say [239, 1992, 5932]
This is easy, and I can just search for the requested event, and return the seq field.
Now I want to let the user view 20 events before, and 20 events after a specific seq.
It would have been great if I could do something like this:
db.events.find( { id:"ButtonClicked", seq: 1992 } ).before(20).after(20);
How can I do that?
Please note that the field seq might start with any number, and skip numbers, but it is incremental!
For example: [3,4,5,6,7,12,13,15,56,57...]
Also, note that the solution can ignore seq, I mentioned this field because I think that it can help the solution.
Thanks!

You could use comparison query operators, in particular $gte and $lte, using seq as an offset for the comparison.
Try:
var seqOffset = 1992;
db.events.find( { seq: { $gte: seqOffset - 20, $lte: seqOffset + 20 } } );
Note that you may get fewer than 41 events (the match plus 20 on each side), since, as you mentioned, seq might skip numbers.
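To see the difference between a window by seq value (what the range query does) and a window by event count, here is a small pure-Python sketch on the sample seq list from the question. Both function names are made up. In MongoDB, a count-based window would probably need two sorted, limited queries (one on each side of the seq), which is a different technique from the range query and an assumption on my part, not something shown above.

```python
from bisect import bisect_left, bisect_right

def value_window(seqs, center, radius=20):
    """Events whose seq lies in [center - radius, center + radius] (the range query)."""
    lo = bisect_left(seqs, center - radius)
    hi = bisect_right(seqs, center + radius)
    return seqs[lo:hi]

def count_window(seqs, center, before=20, after=20):
    """Up to `before` events preceding `center` and up to `after` events following it."""
    i = bisect_left(seqs, center)
    j = bisect_right(seqs, center)
    return seqs[max(0, i - before): j + after]

seqs = [3, 4, 5, 6, 7, 12, 13, 15, 56, 57]  # sorted sample from the question
print(value_window(seqs, 12, radius=2))           # [12, 13]
print(count_window(seqs, 12, before=2, after=2))  # [6, 7, 12, 13, 15]
```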

Couchdb query for values calculated from key input

Suppose I have the following data in my database:
[1,2],[2,1],[1,3],[3,1]...
where the numbers represent the a and b values of the formula a*x+b.
What I want now is a query that returns the difference to a given point x,y.
For example, the point [2,6] is given. I want my query to return
[1,2] = -2 (1*2+2=4 4-6=-2)
[2,1] = -1 (2*2+1=5 5-6=-1)
[1,3] = -1 (1*2+3=5 5-6=-1)
[3,1] = 1 (3*2+1=7 7-6=1)
I know how to do this in SQL but the data is already in a couchdb. I'm quite new to the NoSQL world and was wondering if something like this would be possible in couchdb.

What you can do is use the standard MapReduce functionality of CouchDB.
Map is a function you put in a view; it selects the data you need and emits key/value pairs. You can use various criteria to locate the docs you need. Next, if the view defines a reduce function and the query runs with reduce=true, the reduce function is executed over the values emitted by map. You can use JavaScript to perform various operations on those values.
In your case, the map can look something like this:
function(doc) {
if(doc.a && doc.b) {
emit(doc._id,[doc.a, doc.b]);
}
}
then, the reduce gets called, like this:
function(keys, values, rereduce) {
var res;
//do something with values...
return res;
}
In your case, keys will be a list of document IDs and values will be an array of your [a, b] pairs.
When you call the MapReduce (depending what method you use to access the DB), you should specify reduce=true.
Good resources on MapReduce (and on Views, Sorting and List functions) are:
http://guide.couchdb.org/draft/views.html
http://www.slideshare.net/okurow/couchdb-mapreduce-13321353
Another way to go is to use a list function on the Map result, if you want to output the result in HTML. A good reason to use List function is that you can pass arguments to it with querystring, in your case it may be the point for which you want to calculate distances.
For detailed description on List functions, have a look here:
http://guide.couchdb.org/draft/transforming.html
Hope this helps.
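For reference, the calculation itself (a*x + b - y for each stored [a, b]) is small enough to check by hand. Here it is as a Python sketch on the sample data from the question; differences is a hypothetical helper, and in CouchDB the same arithmetic would live in the JavaScript of a list function or be done by the client.

```python
def differences(lines, point):
    """For each (a, b), return a*x + b - y for the given point (x, y)."""
    x, y = point
    return {(a, b): a * x + b - y for a, b in lines}

lines = [(1, 2), (2, 1), (1, 3), (3, 1)]
print(differences(lines, (2, 6)))
# {(1, 2): -2, (2, 1): -1, (1, 3): -1, (3, 1): 1}
```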
