How to sort data by rating with AWS Lambda using Node.js

I have a DynamoDB table, and I'm writing user scores to it from a Lambda function written in Node.js. I want to get the top 10 users with the most points. How can I query (or scan) for these users?
Thanks a lot.

Max() in NoSQL is much trickier than in SQL, and it doesn't really scale. (If you need very high scalability for this, let me know, but let's get back to the question.)
Assuming your table looks like:
User
----------
userId - hashKey
score
...
Add a dummy category attribute to your table, which will be constant (for example the value "A"). Create a global secondary index:
category - hash key
score - sort key
Query this index by hash key "A" in reverse order to get results much faster than a scan. Note this scales only to 10 GB (the maximum partition size), since all the data lives in the same partition. Also, project only the attributes you need into this index, in order to save space.
You can go up to 30 GB, for example, by using three categories ("A", "B", "C"), executing three queries, and merging the results programmatically (see the sketch after the EDIT below). This hurts performance a bit, but is still far better than a full scan.
EDIT
var AWS = require('aws-sdk');
var dynamo = new AWS.DynamoDB();

var params = {
  TableName: 'MyTableName',
  IndexName: 'category-score-index', // name of the GSI described above
  Limit: 10,
  // ScanIndexForward: false returns items in descending order of the
  // sort key (score), so the highest scores come first
  ScanIndexForward: false,
  KeyConditionExpression: 'category = :category',
  ExpressionAttributeValues: {
    ':category': {
      S: 'A', // the constant dummy category value
    },
  },
};
dynamo.query(params, function(err, data) {
  // handle err / data.Items here
});
source: https://www.debassociates.com/blog/query-dynamodb-table-from-a-lambda-function-with-nodejs-and-apex-up/
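If you split the data across three categories as suggested, the merge step might look roughly like this (a sketch reusing the query above; the GSI name 'category-score-index' is an assumption, not from the original answer):
var AWS = require('aws-sdk');
var dynamo = new AWS.DynamoDB();

// Fetch the top 10 scores within one category partition of the GSI.
function queryCategory(category, callback) {
  dynamo.query({
    TableName: 'MyTableName',
    IndexName: 'category-score-index', // assumed GSI name
    Limit: 10,
    ScanIndexForward: false, // highest scores first
    KeyConditionExpression: 'category = :category',
    ExpressionAttributeValues: { ':category': { S: category } },
  }, callback);
}

// Run one query per category, then merge and keep the overall top 10.
var categories = ['A', 'B', 'C'];
var pending = categories.length;
var merged = [];
categories.forEach(function(category) {
  queryCategory(category, function(err, data) {
    if (err) return console.log(err);
    merged = merged.concat(data.Items);
    if (--pending === 0) {
      // Sort descending by score and take the global top 10.
      merged.sort(function(a, b) {
        return Number(b.score.N) - Number(a.score.N);
      });
      console.log(merged.slice(0, 10));
    }
  });
});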

Related

Unable to execute a timeseries query using a timeuuid as the primary key

My goal is to sum the messages_sent and emails_sent for each DISTINCT provider_id value within a given time range (fromDate < stats_date_id < toDate), but without specifying a provider_id. In other words, I need to find any and all providers within the specified time range and sum their messages_sent and emails_sent.
I have a Cassandra table using an express-cassandra schema (in Node.js) as follows:
module.exports = {
  fields: {
    stats_provider_id: {
      type: 'uuid',
      default: {
        '$db_function': 'uuid()'
      }
    },
    stats_date_id: {
      type: 'timeuuid',
      default: {
        '$db_function': 'now()'
      }
    },
    provider_id: 'uuid',
    provider_name: 'text',
    messages_sent: 'int',
    emails_sent: 'int'
  },
  key: [
    [
      'stats_date_id'
    ],
    'created_at'
  ],
  table_name: 'stats_provider',
  options: {
    timestamps: {
      createdAt: 'created_at', // defaults to createdAt
      updatedAt: 'updated_at' // defaults to updatedAt
    }
  }
}
To get it working, I was hoping it'd be as simple as doing the following:
let query = {
  stats_date_id: {
    '$gt': db.models.minTimeuuid(fromDate),
    '$lt': db.models.maxTimeuuid(toDate)
  }
};
let selectQueries = [
  'provider_name',
  'provider_id',
  'count(direct_sent) as direct_sent',
  'count(messages_sent) as messages_sent',
  'count(emails_sent) as emails_sent',
];
// Query stats_provider table
let providerData = await db.models.instance.StatsProvider.findAsync(query, {select: selectQueries});
This, however, complains about needing to filter the results:
Error during find query on DB -> ResponseError: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.
I'm guessing you can't have a primary key and do date range searches on it? If so, what is the correct approach to this sort of query?
While I haven't used Express-Cassandra, I can tell you that running a range query on your partition key is a hard "no." The reason is that Cassandra can't route such a query to a single node, so it has to poll every node. Since that's essentially a full scan of your table across multiple nodes, it throws that error to prevent you from running a bad query.
However, you can run a range query on a clustering key, provided that you are filtering on all of the keys prior to it. In your case, if I'm reading this right, your PRIMARY KEY looks like:
PRIMARY KEY (stats_date_id, created_at)
That primary key definition is going to be problematic for two reasons:
stats_date_id is a TimeUUID. This is great for data distribution. But it sucks for query flexibility. In fact, you will need to provide that exact TimeUUID value to return data for a specific partition. As a TimeUUID has millisecond precision, you'll need to know the exact time to query down to the millisecond. Maybe you have the ability to do that, but usually that doesn't make for a good partition key.
Any rows underneath that partition (created_at) will have to share that exact time, which usually leads to a lot of 1:1 cardinality ratios for partition:clustering keys.
My advice on fixing this, is to partition on a date column that has a slightly lower level of cardinality. Think about how many provider messages are usually saved within a certain timeframe. Also pick something that won't store too many provider messages together, as you don't want unbound partition growth (Cassandra has a hard limit of 2 billion cells per partition).
Maybe something like: PRIMARY KEY (week, created_at)
So then your CQL queries could look something like:
SELECT * FROM stats_provider
WHERE week='201909w1'
AND created_at > '20190901'
AND created_at < '20190905';
TL;DR:
Partition on a time bucket not quite as precise as something down to the ms, yet large enough to satisfy your usual query.
Apply the range filter on the first clustering key, within a partition.
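Translated back into the question's express-cassandra schema, the suggested layout might look like this (a sketch; the week bucket is a plain text column you compute at write time, and the '201909w1' format is just an illustration):
module.exports = {
  fields: {
    week: 'text', // time bucket computed at write time, e.g. '201909w1'
    provider_id: 'uuid',
    provider_name: 'text',
    messages_sent: 'int',
    emails_sent: 'int'
  },
  key: [
    ['week'],      // partition key: one partition per bucket
    'created_at'   // clustering key: range filters work within a partition
  ],
  table_name: 'stats_provider',
  options: {
    timestamps: {
      createdAt: 'created_at',
      updatedAt: 'updated_at'
    }
  }
}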

How to query count for each column in DynamoDB

I have a DynamoDB table with 50 different columns labeled question1 through question50. Each of these columns has either a, b, c, or d as the answer to a multiple-choice question. What is the most efficient way of getting the count of how many people answered 'a' for question1?
I'm trying to return the count of a, b, c, and d for ALL questions, so I want to see how many answered a for question1, how many answered b for question1, etc. In the end I should have a count for each question and answer.
Currently I have this, but I don't feel like it's efficient to type everything out. Is there a simplified way of doing this?
const AWS = require('aws-sdk');
const dynamoDb = new AWS.DynamoDB();

exports.handler = async function(event, ctx, callback) {
  const params = {
    ScanFilter: {
      'question1': {
        ComparisonOperator: 'EQ',
        // AttributeValueList must be an array of attribute values
        AttributeValueList: [{ S: 'a' }]
      }
    },
    TableName: 'app',
    Select: 'COUNT'
  };
  try {
    const data = await dynamoDb.scan(params).promise();
    console.log(data);
  }
  catch (err) {
    console.log(err);
  }
};
You haven't mentioned two things: is this a one-time operation, or do you need to do it regularly? And how many records do you have?
If this is a one-time operation:
Since you have 50 questions and 4 options for each (200 combinations), and assuming you have a lot of data, the easiest solution is to export the entire table to a CSV and do a pivot table there. This is easier than scanning the entire table and doing aggregation operations in memory. Alternatively, you can export the table to S3 as JSON and use Athena to run queries on the data.
If you need to do this regularly, you can do one of the following:
Save your aggregate counts in the same table (exposed via a GSI), in a new table, or somewhere else entirely. Enable streams on the table and send them to a Lambda function that increments these counts as new data comes in.
Use Elasticsearch - enable streams on your DDB table and have a Lambda function send them to an Elasticsearch index. Index the existing data as well. Then run aggregate queries on this index.
RDBMSs aggregate quite easily... DDB, not so much.
The usual answer with DDB is to enable streams and have a Lambda attached to the stream that calculates the needed aggregations and stores them in a separate record in DDB.
Read through the Using Global Secondary Indexes for Materialized Aggregation Queries section of the docs.
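Both answers point at the same stream-driven pattern. A minimal sketch of what that Lambda might look like, assuming the question's 'app' table with its stream enabled (NEW_IMAGE) and a hypothetical 'answer_counts' aggregate table keyed by question and answer:
const AWS = require('aws-sdk');
const dynamoDb = new AWS.DynamoDB.DocumentClient();

// Triggered by the DynamoDB stream on the 'app' table. For every new
// record, bump one counter per question/answer pair in 'answer_counts'.
exports.handler = async function(event) {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT') continue;
    const newImage = AWS.DynamoDB.Converter.unmarshall(
      record.dynamodb.NewImage
    );
    for (let i = 1; i <= 50; i++) {
      const answer = newImage['question' + i]; // 'a' | 'b' | 'c' | 'd'
      if (!answer) continue;
      await dynamoDb.update({
        TableName: 'answer_counts', // hypothetical aggregate table
        Key: { question: 'question' + i, answer: answer },
        UpdateExpression: 'ADD #count :one',
        ExpressionAttributeNames: { '#count': 'count' },
        ExpressionAttributeValues: { ':one': 1 },
      }).promise();
    }
  }
};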

Query Dynamodb for multiple strings in a single field

I'm trying to query for two different values from the same field or column. In this instance I would like to retrieve rows where the fulfilled attribute is true or false. Below is an example of what I'm trying to do.
const params = {
  TableName: "Orders",
  IndexName: 'fulfilled-shippingId-index',
  KeyConditionExpression: "fulfilled = :fulfilled",
  //FilterExpression : 'contains(fulfilled=true) OR fulfilled=false',
  ExpressionAttributeValues: {
    ":fulfilled": "true",
    ":fulfilled": "false"
  }
};
Let me know if this isn't possible, or if there's a different way to do this through a loop or maybe just multiple requests from the application. As of now it just returns results for the last ExpressionAttributeValues entry.
Thanks!
Unfortunately, this isn't possible.
KeyConditionExpression
The condition must perform an equality test on a single partition key
value.
You must specify a single partition key value.
Using a FilterExpression here won't help since FilterExpressions are applied after the data is read. Also, a FilterExpression cannot contain partition key or sort key attributes.
Since fulfilled = true OR fulfilled = false is a tautology, I would recommend just using Scan to read all of the items in your fulfilled-shippingId-index.
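A minimal sketch of that Scan, reusing the index name from the question (Scan returns at most 1 MB per call, so you have to follow LastEvaluatedKey to read everything):
const AWS = require('aws-sdk');
const dynamoDb = new AWS.DynamoDB.DocumentClient();

// Page through the whole index; every order matches fulfilled = true or false.
async function scanAllOrders() {
  let items = [];
  let lastKey;
  do {
    const data = await dynamoDb.scan({
      TableName: 'Orders',
      IndexName: 'fulfilled-shippingId-index',
      ExclusiveStartKey: lastKey,
    }).promise();
    items = items.concat(data.Items);
    lastKey = data.LastEvaluatedKey;
  } while (lastKey);
  return items;
}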

How should I attack a large GroupBy recordset in a JavaScript heavy stack?

I'm currently using Node.js and Firebase on a project, and I love both. My challenge is that I need to store millions of sales order rows that would look something like this:
{ companyKey: 'xxx',
  orderKey: 'xxx',
  rowKey: 'xxx',
  itemKey: 'xxx',
  orderQty: '5',
  orderDate: '12/02/2015'
}
I'd like to query these records like the pseudocode below:
Select sum(orderQty) from mydb where companyKey = 'xxx' and itemKey = 'xxx' group by orderDate
As discussed in questions like Firebase count group by, group by in general can be a tough nut to crack. I've done it before using Oracle materialized views, but I'd like to use some kind of service that does all of that backend work for me, so I can CRUD those sales orders without worrying about maintaining the aggregates. I read in another Stack Overflow post that Keen.io might be a good approach to this problem.
How would the internet experts attack this problem if they were using a JavaScript heavy stack and they wanted an outside service to do aggregation by day for them?
A couple of points I'm considering. I'll update as they come up:
1) It seems I might have to take Keen.io off the list. It's $125 for 1M rows, and I don't need all the power Keen.io provides, only aggregation by day.
2) Going the Sequelize + PostgreSQL route seems to be a decent compromise (see the sketch below). I can still use JavaScript and an ORM to alleviate the pain, and PostgreSQL hosting is usually cheap.
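If you go the Sequelize + PostgreSQL route, the daily sum could look roughly like this (a sketch; the connection string and model name are hypothetical):
const { Sequelize, DataTypes } = require('sequelize');
const sequelize = new Sequelize('postgres://user:pass@localhost/mydb');

// Mirror of the sales order rows shown above.
const OrderRow = sequelize.define('OrderRow', {
  companyKey: DataTypes.STRING,
  orderKey: DataTypes.STRING,
  rowKey: DataTypes.STRING,
  itemKey: DataTypes.STRING,
  orderQty: DataTypes.INTEGER,
  orderDate: DataTypes.DATEONLY
});

// SELECT "orderDate", SUM("orderQty") AS "totalQty"
// FROM "OrderRows" WHERE ... GROUP BY "orderDate"
function dailyTotals(companyKey, itemKey) {
  return OrderRow.findAll({
    attributes: [
      'orderDate',
      [sequelize.fn('SUM', sequelize.col('orderQty')), 'totalQty']
    ],
    where: { companyKey: companyKey, itemKey: itemKey },
    group: ['orderDate']
  });
}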
It sounds like you want to show a trend in sales of an item over time. That's a very good fit for an event data platform because showing trends over time is really native to the query language. In Keen IO, the idea of "grouping by time" is instead expressed as the concept of "timeframe" (e.g. previous_7_days) and "interval" (e.g. daily).
Here's how you would run that with a simple sum query in Keen:
var sum = new Keen.Query("sum", {
  event_collection: "sales",
  target_property: "orderQty",
  timeframe: "previous_12_weeks",
  interval: "weekly",
  filters: [
    {
      property_name: "companyKey",
      operator: "eq",
      property_value: "xxx"
    },
    {
      property_name: "itemKey",
      operator: "eq",
      property_value: "yyy"
    }
  ]
});
In fact you could calculate the sum for ALL of your companies and products in a single query by using group_by.
var sum = new Keen.Query("sum", {
  event_collection: "sales",
  target_property: "orderQty",
  timeframe: "previous_12_weeks",
  interval: "weekly",
  group_by: ["companyKey", "itemKey"]
});
Keen recently updated their pricing. Depending on the frequency of querying, something like this would be pretty light, in the $10s of dollars per month if you have millions of new transactions monthly.

Index multiple MongoDB fields, make only one unique

I've got a MongoDB database of metadata for about 300,000 photos. Each has a native unique ID that needs to stay unique to protect against duplicate insertions. It also has a time stamp.
I frequently need to run aggregate queries to see how many photos I have for each day, so I also have a date field in the format YYYY-MM-DD. This is obviously not unique.
Right now I only have an index on the id property, like so (using the Node driver):
collection.ensureIndex(
  { id: 1 },
  { unique: true, dropDups: true },
  function(err, indexName) { /* etc etc */ }
);
The group query for getting the photos by date takes quite a long time, as one can imagine:
collection.group(
  { date: 1 },
  {},
  { count: 0 },
  function(curr, result) {
    result.count++;
  },
  function(err, grouped) { /* etc etc */ }
);
I've read through the indexing strategy, and I think I need to also index the date property. But I don't want to make it unique, of course (though I suppose it's fine to make it unique in combination with the unique id). Should I do a regular compound index, or can I chain the .ensureIndex() function and only specify uniqueness for the id field?
MongoDB does not have "mixed"-type indexes which can be partially unique. On the other hand, why not use _id instead of your id field, if possible? It's already indexed and unique by definition, so it will prevent you from inserting duplicates.
Mongo can only use a single index in a query clause, which is important to consider when creating indexes. For this particular query and requirements, I would suggest having a separate unique index on the id field, which you get for free if you use _id. Additionally, you can create a non-unique index on the date field only. If you run a query like this:
db.collection.find({"date": "01/02/2013"}).count();
Mongo will be able to answer the query from the index alone (a covered query), which is the best performance you can get.
Note that Mongo won't be able to use a compound index on (id, date) if you are searching by date only. Your query has to match the index prefix first, i.e. if you search by id, then the (id, date) index can be used.
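Concretely, the two separate indexes might be created like this (a sketch matching the ensureIndex style from the question):
// Unique index to guard against duplicate photo IDs
// (or simply rely on _id, which is unique by definition).
collection.ensureIndex(
  { id: 1 },
  { unique: true, dropDups: true },
  function(err, indexName) { /* etc etc */ }
);

// Separate non-unique index on date for the daily counts.
collection.ensureIndex(
  { date: 1 },
  function(err, indexName) { /* etc etc */ }
);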
Another option is to pre-aggregate in the schema itself: whenever you insert a photo, increment a counter for that day. This way you don't need to run any aggregation jobs at all. You can also run some tests to determine whether this approach is more performant than aggregation.
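A minimal sketch of that pre-aggregation, assuming a hypothetical daily_counts collection keyed by the YYYY-MM-DD string:
// On every photo insert, bump the counter document for that day;
// upsert creates the day's document the first time it is seen.
db.collection('daily_counts').update(
  { _id: photo.date }, // e.g. '2013-02-01'
  { $inc: { count: 1 } },
  { upsert: true },
  function(err, result) { /* etc etc */ }
);

// The daily report then becomes a simple read instead of a group job:
db.collection('daily_counts').find().toArray(function(err, days) {
  // [{ _id: '2013-02-01', count: 42 }, ...]
});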
