Suggestion on MongoDB indexing - node.js

I have the following MongoDB schema
Item
id
title : string
description : string
num_views : number
num_likes : number
num_unlikes : number
val_trending : number
val_rating : number
timestamp : number
val_trending is calculated based on the values of num_views, num_likes, num_unlikes and timestamp
val_rating is calculated based on the values of num_likes, num_unlikes and timestamp
I have to query Items based on num_views, val_trending or val_rating
i.e.
Item.find().sort({ val_trending: -1 }).limit(20)
Item.find().sort({ num_views: -1 }).limit(20)
num_views, num_likes and num_unlikes are updated when a user views, likes or unlikes an Item, respectively
val_trending is updated when a user views, likes or unlikes an Item
val_rating is updated when a user likes or unlikes an Item
That's a lot of updating, and that's what got me worried. I thought of indexing num_views, val_trending and val_rating for faster reads, but this will also slow down updates.
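For concreteness, the single-field indexes I have in mind would be something like this (a mongo shell sketch; the collection name items is an assumption):
// Descending indexes to back the three sort-and-limit queries above
db.items.createIndex({ num_views: -1 })
db.items.createIndex({ val_trending: -1 })
db.items.createIndex({ val_rating: -1 })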
I will be querying Items several times every second, and updates will occur even more frequently, since a user might also like or unlike an Item after viewing it.
So, my question is: what sort of implementation (indexing) should I do in order to get the best performance?
Note: The DB will hold approximately 10K Items at start, and it will increase daily.

Related

Timeseries differencing - ArangoDB (AQL or Python)

I have a collection which holds documents, with each document having a data observation and the time that the data was captured.
e.g.
{
_key:....,
"data":26,
"timecaptured":1643488638.946702
}
where timecaptured for now is a utc timestamp.
What I want to do is get the duration between consecutive observations. With SQL I could do this with LAG, for example, but with ArangoDB and AQL I am struggling to see how to do this in the database. So effectively I want the difference in timestamps between two documents in time order. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.
Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
SORT d.timecaptured
WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
RETURN timediff
We simply calculate the sum of the previous and the current document's timecaptured values, and by subtracting the current document's timecaptured we recover the timecaptured of the previous document. From there we can easily calculate the requested difference.
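To illustrate the trick with made-up numbers (previous timecaptured 100.0, current 103.5):
// Illustration only, values invented for the example
var s = 100.0 + 103.5;            // SUM over the { preceding: 1 } window -> 203.5
var previous = s - 103.5;         // subtract the current value -> 100.0, the previous timecaptured
var timediff = 103.5 - previous;  // -> 3.5, the requested difference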
I only use the COUNT to return null for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.
However, neither approach is very straightforward or obvious. I have put it on my TODO list to add an APPEND aggregate function that could be used in WINDOW and COLLECT operations.
The WINDOW function doesn't give you direct access to the data in the sliding window but here is a rather clever workaround:
FOR doc IN collection
SORT doc.timecaptured
WINDOW { preceding: 1 }
AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
LET timediff = doc.timecaptured - d[0].timecaptured
RETURN MERGE(doc, {timediff})
The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.
The time difference is calculated by subtracting the previous document's timecaptured value from the current document's. In the case of the first record, d[0] is actually equal to the current document and the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured, on the other hand, will give you the inverted timestamp for the first record, because d[1] is null (no previous document) and evaluates to 0.
There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:
LET timediff = doc.timecaptured - (
FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)
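If you are running this from node.js, a minimal sketch with the arangojs driver could look like the following (the connection URL, database name and the collection name collection are assumptions):
// Sketch only: run the WINDOW query from node.js via arangojs
const { Database, aql } = require("arangojs");
const db = new Database({ url: "http://localhost:8529", databaseName: "mydb" });

async function timeDiffs() {
  const cursor = await db.query(aql`
    FOR doc IN collection
      SORT doc.timecaptured
      WINDOW { preceding: 1 } AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
      LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)
      RETURN MERGE(doc, { timediff })
  `);
  return cursor.all();
}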

rethinkdb: How to orderby two attributes and use between on one of those

We have a RethinkDB database with tickets in it. They have a createdAt attribute with a timestamp in milliseconds and a priority attribute.
e.g.
{
createdAt: 12345,
priority: 4,
owner: "Bob",
description: "test",
status: "new"
}
rethinkdb.db('dev').table(tableId)
.orderBy({index: 'createdAt'})
.between(timeFrom,timeTo)
.filter(filter)
.skip(paginator).limit(20).run(this.connection);
We now have the following problem. We want a query that orders by two attributes: first by priority and then by createdAt. So given the filter and the timespan, it should return the tickets with the highest priority, and within each priority the oldest should be on top.
We tried to build a compound index with priority and createdAt. That did work, but the .between didn't work as intended on this index.
rethinkdb.db('dev').table('tickets').indexCreate('prioAndCreatedAt', [rethinkdb.row('priority'), rethinkdb.row('createdAt')]).run(this.connection)
with the query:
rethinkdb.db('dev').table(tableId)
.orderBy({index: 'prioAndCreatedAt'})
.between([rethinkdb.minval, timeFrom],[rethinkdb.maxval , timeTo])
.filter(filter)
.skip(paginator).limit(20).run(this.connection);
In our minds that should order by priority first and then by createdAt, and with the .between we would ignore the priority (because of the .minval and .maxval) and just get all the tickets between timeFrom and timeTo.
But tickets whose createdAt was smaller than timeFrom were also returned, so this doesn't work as we planned.
It's like this "problem": RethinkDB Compound Index Weirdness Using Between
But we can't figure out another way to do this.
Since "it should return the tickets with the highest priority and inside the priority the oldest should be on top", is there a reason not to simply use two orderBy calls?
r.db('dev').table('tickets')
.between(timeFrom, timeTo, {index: 'createdAt'})
.orderBy('createdAt')
.orderBy(r.desc('priority'))
Then you can pipe your filter/paginator onto this selection. It will provide tickets within the correct range, ordered by descending priority and then by ascending creation date (much like SQL's ORDER BY priority DESC, createdAt). And it avoids the (documented) behavior of between with compound indexes.
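Chaining the filter and paginator from your question onto this selection would look roughly like the sketch below (the names filter, paginator, timeFrom, timeTo and this.connection are taken from your question):
// Sketch only: the selection above with the question's filter and pagination piped on
r.db('dev').table('tickets')
  .between(timeFrom, timeTo, {index: 'createdAt'})
  .orderBy('createdAt')
  .orderBy(r.desc('priority'))
  .filter(filter)
  .skip(paginator)
  .limit(20)
  .run(this.connection);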
I think your query is only supposed to work when createdAt is also the primary key. Is it? Otherwise you can create an additional index on the createdAt field and use it in your between statement:
r.db('dev').table('tickets').indexCreate('createdAt', r.row('createdAt'))
r.db...
.between(timeFrom, timeTo, {index: "createdAt"})
You can also use multiple orderBy calls as described by @Stock Overflaw, but it only works correctly if you put both conditions into one orderBy statement:
r.db('dev').table('tickets')
.between(timeFrom, timeTo, {index: 'createdAt'})
.orderBy(r.asc('createdAt'), r.asc('priority'))
Keep in mind that this is less performant, because the orderBy doesn't use the index.

Cloudant 1 to many function

I’ve just started to use Cloudant and I just can’t get my head around the map functions. I’ve been fiddling with the data below but it isn’t working out as I expected.
The relationship is: a user can have many vehicles, and a vehicle belongs to one user. The vehicle's userId is the key of the user. There is a bit of redundancy, as in the user document the _id and userId are the same; I guess the latter is not required.
Anyhow, how can I find for a/every user, the vehicles which belong to it? The closest I’ve come through trial and error is a result which displays the owner of every vehicle, but I would like it the other way round, the user and the vehicles belonging to it. All the examples I’ve found use another document which ‘joins’ two or more documents, but I don’t need to do that?
Any point in the right direction appreciated - I really have no idea.
function (doc) {
if (doc.$doctype == "vehicle")
{
emit(doc.userId, {_id: doc.userId});
}
}
EDIT: Getting closer. I'm not sure exactly what I was expecting, but the result seems a bit 'messy'. Row[0] is the user document, row[n > 0] are the vehicle documents. I guess it's fine when a startkey/endkey is used, but without the results are a bit jumbled up.
function (doc) {
if (doc.$doctype == 'user') {
emit([doc._id, 0], doc);
} else if (doc.$doctype == 'vehicle') {
emit([doc.userId, 1, doc._id], doc);
}
}
A user is described as,
{
"_id": "user:10",
"firstname": “firstnamehere",
"secondname": “secondnamehere",
"userId": "user:10",
"$doctype": "user"
}
a vehicle is described as,
{
"_id": "vehicle:4002”,
“name”: “avehicle”,
"userId": "user:10",
"$doctype": "vehicle",
}
You're going in the right direction! You already got it right with the global IDs. Having the type of the document as part of the ID in some form is a very good idea, so that you don't get confused later (all documents are in the same "pot").
Here are some minor problems with your current solution (before getting to your actual question):
Don't emit the doc as the value in emit(key, value). You can always ask for the document that belongs to a view row by querying with include_docs=true. Having the doc as the view value increases the size of the view index a lot. When you don't need a specific value, use emit(key, null).
You also don't need the ID in the emit value. You'll get the ID of the document that belongs to a view row as part of the row anyway.
View Collation
Now to your problem of aggregating the vehicles with their user. You got the basic pattern right. This pattern is called view collation, you can read more about it in the CouchDB docs (ignore that it is in the "Couchapp" section).
The trick with view collation is that you return two or more types of documents, but make sure that they are sorted in a way that allows for direct grouping. Thus it is important to understand how CouchDB sorts the view result. See the collation specification for more information on that one. An important key to understanding view collation is that rows with array keys are sorted by key elements. So when two rows have the same key[0], they sort by key[1]. If that's equal as well, key[2] is considered, and so on.
Your map function first groups users and vehicles by user ID (key[0]). It then uses the fact that 0 sorts before 1 in the second element of the key, so your view will contain the following:
user 1
vehicle of user 1
vehicle of user 1
vehicle of user 1
user 2
user 3
vehicle of user 3
user 4
etc.
As you can see, the vehicles of a user immediately follow their user. Thus you can group this result into aggregates without performing expensive sort or lookup operations.
Note that users are sorted according to their ID, and vehicles within users also according to their ID. This is because you use the IDs in the key array.
Creating Queries
Now that view isn't worth much if you can't query according to your needs. A view as you have it supports the following queries:
Get all users with their vehicles
Get a range of users with their vehicles
Get a single user with its vehicles
Get a single user without vehicles (you could also use the _all_docs view for that though)
Example query for "all users between user 1 and user 3 (inclusive) with their vehicles"
We want to query for a range, so we use startkey and endkey in the query:
startkey=["user:1", 0]
endkey=["user:3", 1, {}]
Note the use of {} as a sentinel value, which is required so that the end key is larger than any row that has a key of ["user:3", 1, (anyConceivableVehicleId)].
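As a rough node.js illustration of issuing such a query and folding the rows into per-user aggregates (this uses the nano CouchDB client; the design document and view names vehicles/by_user are made up for the example):
// Sketch only: query the collated view and group each user's vehicles under the user
const nano = require('nano')('https://myaccount.cloudant.com');
const db = nano.db.use('mydb');

async function usersWithVehicles() {
  const result = await db.view('vehicles', 'by_user', {
    startkey: ['user:1', 0],
    endkey: ['user:3', 1, {}],
    include_docs: true
  });

  const users = [];
  for (const row of result.rows) {
    if (row.key[1] === 0) {
      // key[1] === 0 -> a user row; start a new aggregate
      users.push({ user: row.doc, vehicles: [] });
    } else {
      // key[1] === 1 -> a vehicle row; it directly follows its user thanks to the collation order
      users[users.length - 1].vehicles.push(row.doc);
    }
  }
  return users;
}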

mongodb: another "how to add a random record" thread

I've come across many instances of this same question here on StackOverflow, none providing a solid solution, so here we go:
I need to pick a random document from around 5 million documents in my MongoDB database in an efficient way.
I've tried getting the .count and using .skip to get a random document, but it takes almost three seconds and is very, very inefficient.
I can't make changes to the documents (like adding a "random" field to each document) or change their _id's.
I've tried the solution of adding documents with an incremental _id (to pick a random _id and bypass .skip), but it brought more headaches than it solved when I tried to add many documents in a short amount of time.
Adding data in an incremental way, or picking a random document, should not be this hard. I'm either missing some common knowledge, doing something wrong, or this is just how it is.
Wanted to bring up the topic and get your responses.
Here is a way using the default ObjectId values for _id and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters
var min = parseInt(db.collection.find()
.sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
max = parseInt(db.collection.find()
.sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
diff = max - min;
// Get a random value within diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;
// work out a "random" _id value in the range:
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
.sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value since that is what we are looking for. Using integers as the _id value is essentially simplier but the same basic idea in the points.
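Here is roughly how the same idea could look with the official node.js driver, as a sketch (the collection handle and connection setup are assumed; ObjectId.createFromTime takes care of the hex padding):
// Sketch only: min/max timestamp from the ObjectIds, then seek the first _id >= a random one
const { ObjectId } = require('mongodb');

async function randomDoc(collection) {
  const [first] = await collection.find().sort({ _id: 1 }).limit(1).toArray();
  const [last] = await collection.find().sort({ _id: -1 }).limit(1).toArray();

  const min = first._id.getTimestamp().getTime();   // milliseconds
  const max = last._id.getTimestamp().getTime();
  const random = Math.floor(Math.random() * (max - min));

  // Build an ObjectId from the random timestamp (in whole seconds), zero-padded by the driver
  const randomId = ObjectId.createFromTime(Math.floor((min + random) / 1000));

  const [doc] = await collection.find({ _id: { $gte: randomId } })
    .sort({ _id: 1 }).limit(1).toArray();
  return doc;
}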

cassandra data model for web logging

I've been playing around with Cassandra and I am trying to evaluate the best data model for storing things like views or hits for unique page ids. Would it be best to have a single column family per page id, or one super column (logs) with a column per page id? Each page has a unique id, and I would like to store the date and some other metrics for each view.
I am just not sure which solution scales better: lots of column families, or one giant super column?
page-92838 { date:sept 2, browser:IE }
page-22939 { date:sept 2, browser:IE5 }
OR
logs {
page-92838 {
date:sept 2,
browser:IE
}
page-22939 {
date:sept 2,
browser:IE5
}
}
And secondly, how do I handle lots of different date: entries for page-92838?
You don't need a column-family per pageid.
One solution is to have a row for each page, keyed on the pageid.
You could then have a column for each page-view or hit, keyed and sorted on a time-UUID (assuming having the views in time-sorted order would be useful) or another unique, always-increasing counter. Note that all Cassandra columns are time-stamped anyway, so you would have a precise timestamp 'for free' regardless of what other time- or date-stamps you use. Using a precise time-UUID as the key also solves the problem of storing many hits on the same date.
The value of each column could then be a textual value or JSON document containing any other metadata you want to store (such as browser).
page-12345 -> {timeuuid1:metadata1}{timeuuid2:metadata2}{timeuuid3:metadata3}...
page-12346 -> ...
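In current CQL terms, that layout corresponds to one partition per page with a timeuuid clustering column; here is a rough node.js sketch with the cassandra-driver (the keyspace, table and column names are made up for the example):
// Sketch only: one partition per page, one row per hit keyed by a timeuuid
// CREATE TABLE weblog.page_hits (
//   page_id text,
//   hit_time timeuuid,
//   metadata text,          -- e.g. a JSON blob with the browser and other view metrics
//   PRIMARY KEY (page_id, hit_time)
// ) WITH CLUSTERING ORDER BY (hit_time ASC);
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'weblog'
});

async function recordHit(pageId, metadata) {
  await client.execute(
    'INSERT INTO page_hits (page_id, hit_time, metadata) VALUES (?, now(), ?)',
    [pageId, JSON.stringify(metadata)],
    { prepare: true }
  );
}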
With Cassandra, it is best to start with the queries you need to run and model your schema to support those queries.
Assuming you want to query hits on a page, and hits by browser, you can have a counter column for each page like,
stats { #cf
page-id { #key
hits : # counter column for hits
browser-ie : #counts of views with ie
browser-firefox : ....
}
}
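A counter-based sketch of that layout for the node.js driver follows (the keyspace and table names are placeholders; note that a counter table can only hold counter columns besides its key):
// Sketch only: per-page counters incremented on every view
// CREATE TABLE weblog.stats (
//   page_id text,
//   metric text,            -- 'hits', 'browser-ie', 'browser-firefox', ...
//   count counter,
//   PRIMARY KEY (page_id, metric)
// );
async function countView(client, pageId, browser) {
  const query = 'UPDATE stats SET count = count + 1 WHERE page_id = ? AND metric = ?';
  await client.execute(query, [pageId, 'hits'], { prepare: true });
  await client.execute(query, [pageId, 'browser-' + browser], { prepare: true });
}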
If you need to do time-based queries, look at how Twitter's Rainbird denormalizes as it writes to Cassandra.
