How to create graphs or histograms with MongoDB - Node.js

I need to build a graph showing the total number of registered users over a time interval. (The language is TypeScript, but an answer in another language, or one using MongoDB's aggregation framework, is fine.)
Example:
Day 1: 10 total users registered
Day 2: 139 ...
Day 3: 1230 ...
Day 4: 2838 ...
...
...
Current day: X number of users ... and so it would end.
Note that every user has a field called createdAt, which is of type Date.
I tried grouping the users into buckets, but it is not an optimal solution.
const response = await this.userModel.aggregate([
  {
    $bucketAuto: {
      groupBy: '$createdAt',
      buckets: 4,
    },
  },
]);
console.log(response);
I have also considered using MongoDB's mapReduce and passing a custom function to it. For performance reasons, though, I would like to know whether this can be done with an aggregation pipeline alone. mapReduce would be my second option (slightly slower), and the last resort would be to fetch all users (only the createdAt field) and process them in my backend.
Thank you in advance for your answers
Update
To clarify: $bucketAuto does not group by fixed time intervals. It balances the buckets by number of users and derives the date boundaries from that. Also, when I passed explicit date boundaries to MongoDB with $bucket, the result was not what I expected.
$bucketAuto Option
$bucket option
input Example:
const list = [
{
"createdAt": "2021-08-30T23:47:16.663Z",
"_id": "612d6e044007a95446848cef"
},
{
"createdAt": "2021-08-31T04:18:11.820Z",
"_id": "612dad830541fa001bb63671"
},
{
"createdAt": "2021-08-31T04:18:47.794Z",
"_id": "612dada70541fa001bb63674"
},
{
"createdAt": "2021-08-31T04:20:14.415Z",
"_id": "612dadfe0541fa001bb63678"
},
{
"createdAt": "2021-08-31T04:22:45.580Z",
"_id": "612dae950541fa001bb63682"
},
{
"createdAt": "2021-08-31T11:24:28.471Z",
"_id": "612e116c0541fa001bb63688"
},
{
"createdAt": "2021-08-31T18:47:09.452Z",
"_id": "612e792dba2a3e1d081c9f3d"
}
];
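One possible approach, sketched under a few assumptions: createdAt is stored as a BSON Date, days are counted in UTC, and the names dailyCountsPipeline and toRunningTotals are made up for this example. The pipeline groups users by calendar day with $dateToString, and the helper turns the per-day counts into cumulative totals in the backend.

```typescript
// Per-day registration counts, grouped by UTC calendar day.
// Assumes `createdAt` is stored as a Date, not a string.
const dailyCountsPipeline = [
  {
    $group: {
      _id: { $dateToString: { format: "%Y-%m-%d", date: "$createdAt" } },
      count: { $sum: 1 },
    },
  },
  { $sort: { _id: 1 } }, // "YYYY-MM-DD" strings sort chronologically
];

interface DailyCount {
  _id: string; // day, e.g. "2021-08-31"
  count: number; // users registered on that day
}

interface DailyTotal {
  day: string;
  total: number; // users registered up to and including that day
}

// Converts per-day counts (already sorted by day) into running totals.
function toRunningTotals(rows: DailyCount[]): DailyTotal[] {
  let total = 0;
  return rows.map((row) => {
    total += row.count;
    return { day: row._id, total };
  });
}
```

With the sample input above, the pipeline would yield one user on 2021-08-30 and six on 2021-08-31, so the running totals would be 1 and 7. Usage might look like toRunningTotals(await this.userModel.aggregate(dailyCountsPipeline)). Days without any registration produce no row and would have to be filled in by the caller. On MongoDB 5.0 or later, the running total could instead be computed inside the pipeline with $setWindowFields.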

Related

Searching for sub-objects with a date range containing the queried date value

Let's say we're handling the advertising of various job openings across several channels (newspapers, job boards, etc.). For each channel, we can buy a "publication period" which will mean the channel will advertise our job openings during that period. How can we find the jobs for a given channel that have a publication period valid for today (i.e. starting on or before today, and ending on or after today)? The intent is to be able to generate a feed of "active" job openings that (e.g.) a job board can consume periodically to determine which jobs should be displayed to its users.
Another wrinkle is that each job opening is associated with a given tenant id: the feeds will have to be generated scoped to tenant and channel.
Let's say we have the following simplified documents (if you think the data should be modeled differently, please let me know also):
{
"_id": "A",
"tenant_id": "foo",
"name": "Job A",
"publication_periods": [
{
"channel": "linkedin",
"start": "2021-03-10T00:00:0.0Z",
"end": "2021-03-17T00:00:0.0Z"
},
{
"channel": "linkedin",
"start": "2021-04-10T00:00:0.0Z",
"end": "2021-04-17T00:00:0.0Z"
},
{
"channel": "monster.com",
"start": "2021-03-10T00:00:0.0Z",
"end": "2021-03-17T00:00:0.0Z"
}
]
}
{
"_id": "B",
"tenant_id": "foo",
"name": "Job B",
"publication_periods": [
{
"channel": "linkedin",
"start": "2021-04-10T00:00:0.0Z",
"end": "2021-04-17T00:00:0.0Z"
},
{
"channel": "monster.com",
"start": "2021-03-15T00:00:0.0Z",
"end": "2021-03-20T00:00:0.0Z"
}
]
}
{
"_id": "C",
"tenant_id": "foo",
"name": "Job C",
"publication_periods": [
{
"channel": "monster.com",
"start": "2021-05-15T00:00:0.0Z",
"end": "2021-05-20T00:00:0.0Z"
}
]
}
{
"_id": "D",
"tenant_id": "bar",
"name": "Job D",
"publication_periods": [
...
]
}
How can I query the jobs linked to tenant "foo" that have an active publication period for "monster.com" on the date of 17.03.2021? (I.e. this query should return both jobs A and B.)
Note that the DB will contain documents of other (irrelevant) types.
Since I essentially need to "find all job openings containing an object in the publication_periods array having: CHAN as the channel value, "start" <= DATE, "end" >= DATE" it appears I'd require a Mango query to achieve this, as standard view queries don't provide comparison operators (if this is mistaken, please correct me).
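To make the matching rule concrete, here is a minimal in-memory sketch in TypeScript. It is not a CouchDB query; the names JobOpening and activeJobs are made up for illustration, and timestamps are compared as plain strings, which only works if they all share the same format.

```typescript
interface PublicationPeriod {
  channel: string;
  start: string; // timestamp string, same format across all documents
  end: string;
}

interface JobOpening {
  _id: string;
  tenant_id: string;
  name: string;
  publication_periods: PublicationPeriod[];
}

// Returns the jobs of `tenantId` that have at least one period for `channel`
// with start <= date <= end. Both bounds must hold for the SAME period.
function activeJobs(
  jobs: JobOpening[],
  tenantId: string,
  channel: string,
  date: string,
): JobOpening[] {
  return jobs.filter(
    (job) =>
      job.tenant_id === tenantId &&
      job.publication_periods.some(
        (p) => p.channel === channel && p.start <= date && p.end >= date,
      ),
  );
}
```

The important detail is that channel, start and end are tested against one and the same array element, which is the behaviour I would need from the real query as well.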
Naturally, I want the Mango query to be executed only on relevant data (i.e. exclude documents that aren't job openings), but I can't find references on how to do this (whether in the docs or elsewhere): all resources I found simply seem to define the Mango index on the entire set of documents, relying on the fact that documents where the indexed field is absent won't be indexed.
How can I achieve what I'm after?
Initially, I was thinking of creating a view that would emit the publication period information along with a {'_id': id} object in order to "JOIN" the job opening document to the matching periods at query time (per Best way to do one-to-many "JOIN" in CouchDB). However, I realized that I wouldn't be able to query this view as needed (i.e. "start" value before today, "end" value after today) since I wouldn't have a definite start/end key to use... And I have no idea how to properly leverage a Mango index/query for this. Presumably I'd have to create a partial index based on document type and the presence of publication periods, but how can I even index the multiple publication periods that can be located within a single document? Can a Mango index be defined against a specific view as opposed to all documents in the DB?
I stumbled upon this answer Mango search in Arrays indicating that I should be able to index the data with
{
"index": {
"fields": [
"tenant_id",
"publication_periods.[].channel",
"publication_periods.[].start",
"publication_periods.[].end"
]
},
"ddoc": "job-openings-periods-index",
"type": "json"
}
And then query them with
{
"selector": {
"tenant_id": "foo",
"publication_periods": {
"$elemMatch": {
"$and": [
{
"channel": "monster.com"
},
{
"start": {
"$lte": "2021-03-17T00:00:0.0Z"
}
},
{
"end": {
"$gte": "2021-03-17T00:00:0.0Z"
}
}
]
}
}
},
"use_index": "job-openings-periods-index",
"execution_stats": true
}
Sadly, I'm informed that the index "was not used because it does not contain a valid index for this query", and performance is terrible; I will leave that for another question.

Cannot query on a date range, get back no results each time

I'm having a hard time understanding why I keep getting 0 results back from a query I am trying to perform. Basically, I am trying to return only results within a date range. On a given table I have a createdAt field, which is a DateTime scalar. It gets filled in automatically by Prisma (or GraphQL, I'm not sure which one sets it). So every table has a createdAt holding a DateTime string for when the row was created.
Here is my schema for this given table:
type Audit {
id: ID! #unique
user: User!
code: AuditCode!
createdAt: DateTime!
updatedAt: DateTime!
message: String
}
I queried this table and got back some results, I'll share them here:
"getAuditLogsForUser": [
{
"id": "cjrgleyvtorqi0b67jnhod8ee",
"code": {
"action": "login"
},
"createdAt": "2019-01-28T17:14:30.047Z"
},
{
"id": "cjrgn99m9osjz0b67568u9415",
"code": {
"action": "adminLogin"
},
"createdAt": "2019-01-28T18:06:03.254Z"
},
{
"id": "cjrgnhoddosnv0b67kqefm0sb",
"code": {
"action": "adminLogin"
},
"createdAt": "2019-01-28T18:12:35.631Z"
},
{
"id": "cjrgnn6ufosqo0b67r2tlo1e2",
"code": {
"action": "login"
},
"createdAt": "2019-01-28T18:16:52.850Z"
},
{
"id": "cjrgq8wwdotwy0b67ydi6bg01",
"code": {
"action": "adminLogin"
},
"createdAt": "2019-01-28T19:29:45.616Z"
},
{
"id": "cjrgqaoreoty50b67ksd04s2h",
"code": {
"action": "adminLogin"
},
"createdAt": "2019-01-28T19:31:08.382Z"
}]
Here is my getAuditLogsForUser schema definition
getAuditLogsForUser(userId: String!, before: DateTime, after: DateTime): [Audit!]!
So to test I would want to get all the results in between the last and first.
2019-01-28T19:31:08.382Z is last
2019-01-28T17:14:30.047Z is first.
Here is my code that would inject into the query statement:
if (args.after && args.before) {
where['createdAt_lte'] = args.after;
where['createdAt_gte'] = args.before;
}
console.log(where)
return await context.db.query.audits({ where }, info);
In playground I execute this statement
getAuditLogsForUser(before: "2019-01-28T19:31:08.382Z" after: "2019-01-28T17:14:30.047Z") { id code { action } createdAt }
So I want anything that createdAt_lte (less than or equal) set to 2019-01-28T17:14:30.047Z and that createdAt_gte (greater than or equal) set to 2019-01-28T19:31:08.382Z
However, I get literally no results back even though we KNOW there are results.
I tried to look up some documentation on DateTime scalar in the graphql website. I literally couldn't find anything on it, but I see it in my generated prisma schema. It's just defined as Scalar. With nothing else special about it. I don't think I'm defining it elsewhere either. I am using Graphql-yoga if that makes any difference.
(generated prisma file)
scalar DateTime
I'm wondering if it's truly even handling this as a true datetime? It must be though because it gets generated as a DateTime ISO string in UTC.
Just having a hard time grasping what my issue could possibly be at this moment, maybe I need to define it in some other way? Any help is appreciated
Sorry I misread your example in my first reply. This is what you tried in the playground correct?
getAuditLogsForUser(
before: "2019-01-28T19:31:08.382Z",
after: "2019-01-28T17:14:30.047Z"
){
id
code { action }
createdAt
}
This will not work since before and after do not refer to time, but are cursors used for pagination. They expect an id. Since ids are also strings, this query does not throw an error but will not find anything. Here is how pagination is used: https://www.prisma.io/docs/prisma-graphql-api/reference/queries-qwe1/#pagination
What I think you want to do is use a filter in the query. For this you can use the where argument. The query would look like this:
getAuditLogsForUser(
where:{AND:[
{createdAt_lte: "2019-01-28T19:31:08.382Z"},
{createdAt_gte: "2019-01-28T17:14:30.047Z"}
]}
) {
id
code { action }
createdAt
}
Here are the docs for filtering: https://www.prisma.io/docs/prisma-graphql-api/reference/queries-qwe1/#filtering
OK, so I figured out it had to do with the fact that I used "after" and "before" as argument names. I have no clue why this completely screws everything up, but it just won't return ANY results if you have these as arguments. Very strange. Must be shadowing some other variable somehow, probably a bug on GraphQL's end.
As soon as I tried new variable names, voilà, it works.
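A small sketch of the resolver-side filter, assuming the arguments are renamed to something that does not collide with the pagination cursors. The names from, to and buildCreatedAtFilter are hypothetical; createdAt_gte and createdAt_lte are the filter fields already used above.

```typescript
// Filter shape for the `where` argument; field names as used in the question.
interface CreatedAtFilter {
  createdAt_gte?: string;
  createdAt_lte?: string;
}

// Builds an inclusive date-range filter. `from` is the lower bound (gte),
// `to` is the upper bound (lte); either one may be omitted.
function buildCreatedAtFilter(from?: string, to?: string): CreatedAtFilter {
  const where: CreatedAtFilter = {};
  if (from) where.createdAt_gte = from;
  if (to) where.createdAt_lte = to;
  return where;
}
```

In the resolver this might be used as context.db.query.audits({ where: buildCreatedAtFilter(args.from, args.to) }, info). It is also worth noting that the snippet in the question assigns args.after to createdAt_lte and args.before to createdAt_gte; with the values shown, that asks for createdAt <= 17:14:30 and createdAt >= 19:31:08 at the same time, which no row can satisfy.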
This is also possible:
const fileData = await prismaClient.fileCuratedData.findFirst({
  where: {
    fileId: fileId,
    createdAt: {
      gte: fromdate,
    },
  },
});

Mongoose - Aggregation of two queries with condition

I have two different collections that are connected by the garden's id. One holds a list of gardens and the other a list of allocations, each storing the start and end date of the allocation. I can check whether a garden is allocated by verifying that today falls between the two dates in the allocation collection.
Garden
{
"_id": "5b98df3c9275f2291c0d7dc3",
"id": "h1",
"size": 43
}
Allocation
{
"_id": "5b9bcb8ecb9dee0015150549",
"user": "5b9a2cd21eb58700141a3449",
"garden": "5b98df5c9275f2291c0d7dc6",
"start_date":"2018-09-14T00:00:00.000Z",
"end_date": "2018-11-14T00:00:00.000Z"
}
How can I return all the existing gardens with an additional field 'occupied' set to true or false, depending on whether an allocation document exists for them with today between start_date and end_date?
I'd like to get an array of gardens with the following data
{
"_id": "5b98df3c9275f2291c0d7dc3",
"id": "h1",
"size": 43,
"occupied": true
}
You can do it one of two ways.
var today = ISODate();
Using $lookup
db.garden.aggregate([
  {"$lookup":{
    "from":"allocation",
    "localField":"_id",
    "foreignField":"garden",
    "as":"garden"
  }},
  // Keep gardens that have no allocation at all
  {"$unwind":{"path":"$garden","preserveNullAndEmptyArrays":true}},
  {"$addFields":{
    "occupied":{
      // Occupied when start_date <= today < end_date
      "$and":[
        {"$lte":["$garden.start_date",today]},
        {"$gt":["$garden.end_date",today]}
      ]
    }
  }},
  {"$project":{"garden":0}}
])
Using $lookup with pipeline
db.garden.aggregate([
  {"$lookup":{
    "from":"allocation",
    "let":{"garden_id":"$_id"},
    "pipeline":[
      // Keep only allocations for this garden where start_date <= today < end_date
      {"$match":{"$expr":{"$eq":["$$garden_id","$garden"]},"start_date":{"$lte":today},"end_date":{"$gt":today}}}
    ],
    "as":"garden"
  }},
  {"$addFields":{
    "occupied":{"$gt":[{"$size":"$garden"},0]}
  }},
  {"$project":{"garden":0}}
])
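For checking the pipeline output, or for small collections, the same rule can be sketched in application code. This assumes both collections are already loaded into memory, that ids are compared as strings, and that an allocation covers start_date (inclusive) up to end_date (exclusive); markOccupied is a hypothetical helper name.

```typescript
interface Garden {
  _id: string;
  id: string;
  size: number;
}

interface Allocation {
  garden: string; // _id of the allocated garden
  start_date: Date;
  end_date: Date;
}

// Adds `occupied: true` to every garden that has an allocation whose
// [start_date, end_date) interval contains `today`; all others get false.
function markOccupied(
  gardens: Garden[],
  allocations: Allocation[],
  today: Date,
): Array<Garden & { occupied: boolean }> {
  const t = today.getTime();
  const occupiedIds = new Set(
    allocations
      .filter((a) => a.start_date.getTime() <= t && t < a.end_date.getTime())
      .map((a) => a.garden),
  );
  return gardens.map((g) => ({ ...g, occupied: occupiedIds.has(g._id) }));
}
```

Unlike a plain $unwind, this keeps gardens without any allocation and never duplicates a garden that has several allocations.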

Mongoose find time furthest from other times in db

I have a program that runs every 5 minutes and checks the last time a user's data was updated. If it's been more than 4 hours, an update routine is called, but as the service grows I've seen some spikes in the number of calls at given times. I want to start spreading out the update times. Since I know when the program last updated each user's data, I was wondering if there was an elegant way to find the largest gap between times and set the new user's update time to that?
Here's an example. Given the following data:
{
"_id": "1",
"updatedAt": "2018-01-17T01:12:33.807Z"
},{
"_id": "2",
"updatedAt": "2018-01-17T03:17:33.807Z"
},{
"_id": "3",
"updatedAt": "2018-01-17T02:22:33.807Z"
},{
"_id": "4",
"updatedAt": "2018-01-17T02:37:33.807Z"
}
The largest gap between the given updates is 1 hour and 10 minutes, between id: 1 and id: 3. I want a function that finds that largest gap and returns a suggested update time for the next item added to the database, here '2018-01-17T01:47:33.807Z'. That value is calculated by dividing the 1 hour and 10 minutes by 2 and adding the result to id: 1's date.
I would also like to spread out all the existing users' update times, but I suppose that would be a different function.
You can't use aggregation framework for a difference style comparison. However you can use map reduce to get the largest time diff between documents.
Something like
db.col.mapReduce(
function () {
if (typeof this.updatedAt != "undefined") {
var date = new Date(this.updatedAt);
emit(null, date);
}
},
function(key, dates) {
// Track the previous timestamp separately so every consecutive pair is compared
var prev = dates[0].getTime();
var result = {"last":prev, "diff":0};
for (var ix = 1; ix < dates.length; ix++) {
var value = dates[ix].getTime();
var curdiff = value - prev;
if (result.diff < curdiff)
result = {"last":prev, "diff":curdiff};
prev = value;
}
return result;
},
{
"sort":{"updatedAt":1},
"out": { "inline": 1 },
"finalize":function(key, result) {
return new Date(result.last + result.diff/2);
}
}
)
Aggregation query:
db.col.aggregate([
{"$match":{"updatedAt":{"$exists":true}}},
{"$sort":{"updatedAt":1}},
{"$group":{
"_id":null,
"dates":{"$push":"$updatedAt"}
}},
{"$project":{
"_id":0,
"next":{
"$let":{
"vars":{
"result":{
"$reduce":{
"input":{"$slice":["$dates",1,{"$subtract":[{"$size":"$dates"},1]}]},
"initialValue":{"prev":{"$arrayElemAt":["$dates",0]},"last":{"$arrayElemAt":["$dates",0]},"diff":0},
"in":{
"$cond":[
{"$lt":["$$value.diff",{"$subtract":["$$this","$$value.prev"]}]},
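{"prev":"$$this","last":"$$value.prev","diff":{"$subtract":["$$this","$$value.prev"]}},
// No larger gap found: still advance prev so only consecutive dates are compared
{"prev":"$$this","last":"$$value.last","diff":"$$value.diff"}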
]
}
}
}
},
"in":{
"$add":["$$result.last",{"$divide":["$$result.diff",2]}]
}
}
}
}}
])
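If the timestamps are already available in the application, the same calculation can be sketched in plain TypeScript. This is an illustration under the assumption that all updatedAt values fit in memory; suggestNextUpdate is a made-up name.

```typescript
// Returns the midpoint of the largest gap between consecutive timestamps,
// or null when fewer than two timestamps are given.
function suggestNextUpdate(dates: Date[]): Date | null {
  // Sort a copy of the timestamps (in milliseconds) in ascending order.
  const times = dates.map((d) => d.getTime()).sort((a, b) => a - b);
  if (times.length < 2) return null;

  let gapStart = times[0];
  let largestGap = 0;
  for (let i = 1; i < times.length; i++) {
    const gap = times[i] - times[i - 1];
    if (gap > largestGap) {
      largestGap = gap;
      gapStart = times[i - 1];
    }
  }
  return new Date(gapStart + largestGap / 2);
}
```

For the four sample documents in the question, the largest gap is the 70 minutes between ids 1 and 3, so the function returns 2018-01-17T01:47:33.807Z.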

Is reading whole object from DocumentDb faster and more efficient?

I'm trying to understand if it would actually be more efficient to read the entire document from Azure DocumentDb than it is to read a property that may have multiple objects in it?
Let's use this basketball team object as an example:
{
id: 123,
name: "Los Angeles Lakers",
coach: "Byron Scott",
players: [
{ id: 24, name: "Kobe Bryant" },
{ id: 3, name: "Anthony Brown" },
{ id: 4, name: "Ryan Kelly" },
]
}
If I want to get only a list of players, is it more efficient/faster for me to read the entire team document from which I can extract the players OR is it better to send SQL statement and try to read only the players from the document?
Returning only the players will be more efficient on the network, as you're returning less data. And, you should also be able to look at the Request Units burned for your query.
For example, I put your document into one of my collections and ran two queries in the portal (and if you do the same, and look at the bottom of the portal, you'll see the resulting Request Unit cost). I slightly modified your document with unique ID and quotes around everything, so I could load it via the portal:
{
"id": "basketball123",
"name": "Los Angeles Lakers",
"coach": "Byron Scott",
"players": [
{ "id": 24, "name": "Kobe Bryant" },
{ "id": 3, "name": "Anthony Brown" },
{ "id": 4, "name": "Ryan Kelly" }
]
}
I first selected just player data:
SELECT c.players FROM c where c.id="basketball123"
with an RU cost of 2.2.
I then asked for the entire document:
SELECT * FROM c where c.id="basketball123"
with an RU cost of 2.24.
Note: Your document size is very small, so there's really not much difference here. But at least you can see that returning a subset costs less than returning the entire document.
