How To Get Item From DynamoDB Based On Multiple (one primary) Attributes Lambda/NodeJS - node.js

My table structure in DynamoDB looks like the following:
uuid (Primary Key) | ip | userAgent
From within a NodeJS function inside of Lambda, I would like to get the uuid of an item whose ip and userAgent match the information I provide.
Scan becomes less and less efficient and more expensive over time as millions of items are added to the table every week.
Here is the code I am using to try and accomplish this:
function tieDown(sIP, uA){
  const userQuery = {
    Key : {
      "ip" : "192.168.0.1",
      "userAgent" : "sample"
    },
    TableName: "mytable"
  };
  return ddb.get(userQuery, function(err, data){
    if (err) console.log(err.stack);
  }).promise();
}
When this code executes, the following error is thrown: ValidationException: The provided key element does not match the schema.
So I guess my questions are:
Is it even possible to get one specific item based on non-primary attributes?
Are there any issues with the code sample I provided that could lead to this error being thrown? (I'm using DocumentClient, so there is no need to explicitly declare strings, numbers, etc.)
Thanks!

You cannot get a single item using the get operation without specifying the partition key (and the sort key, if the table has one). That is why the call fails: Key contains ip and userAgent, but the table's key schema consists only of uuid. Scans should be avoided in most cases. What you probably need is a Global Secondary Index (GSI) that allows you to query by ip and userAgent. Keep in mind that key values in a GSI are not guaranteed to be unique, so you may get more than one result.
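As a rough sketch of the index-based lookup, assuming a GSI has already been created with ip as its partition key and userAgent as its sort key: the index name ip-userAgent-index and the helper buildUuidLookupParams are hypothetical and do not come from the question. The helper only builds the parameter object for DocumentClient's query call, so it can be inspected without touching AWS.

```javascript
// Builds parameters for DocumentClient.query() against a GSI.
// Assumption: a GSI named "ip-userAgent-index" exists on "mytable" with
// "ip" as partition key and "userAgent" as sort key (hypothetical names).
function buildUuidLookupParams(ip, userAgent) {
  return {
    TableName: "mytable",
    IndexName: "ip-userAgent-index",
    KeyConditionExpression: "#ip = :ip AND #ua = :ua",
    // Placeholders sidestep clashes with DynamoDB reserved words.
    ExpressionAttributeNames: { "#ip": "ip", "#ua": "userAgent", "#id": "uuid" },
    ExpressionAttributeValues: { ":ip": ip, ":ua": userAgent },
    // Only the table's key attribute is needed from each match.
    ProjectionExpression: "#id",
  };
}

const params = buildUuidLookupParams("192.168.0.1", "sample");
console.log(JSON.stringify(params, null, 2));
```

With the question's ddb client this object would be passed to ddb.query(params).promise(), and the matching uuids read from the Items array of the result. Because index keys need not be unique, Items may hold more than one entry.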

Related

Which is the correct way to scan a table on DynamoDB?

As the title says, I want to know which is the best way to scan a table in Amazon DynamoDB, searching by another field than the primary key.
I searched about this and read a lot, but I found this solution for me:
let DynamoDBServiceObj = new AWS.DynamoDB({apiVersion: '2012-08-10'});
let params = {
  ExpressionAttributeValues: {
    ':hash' : { S: req.param('wildcard') }
  },
  ProjectionExpression: 'directory',
  FilterExpression: 'qrCode = :hash',
  TableName: 'business'
};
let business = await DynamoDBServiceObj.scan(params).promise();
if (business.Count == 1) return res.ok();
else return res.view('404');
This works for me, but I also read that performing a scan on a table is a bad idea for both performance and pricing. But how should I do it, then?
Which is the correct way to scan a table, searching by a field other than the primary key?
What is the difference between DocumentClient and DynamoDB Object?
I always use .get() to obtain what I want from DynamoDB. Is this good or bad practice?
I read these posts, and I suppose that GSI is the solution, but I don't understand how it works.
Global Secondary Indexes (GSI)
DynamoDB: Scan on multiple non key attribute
Step 4: Query and Scan the table
Querying and Scanning a DynamoDB Table
What is the difference between scan and query in dynamodb?
How to fetch/scan all items from AWS dynamodb using node.js
Like you said, scanning the table is not a great idea and you have already read about it. I would suggest two things.
Use a composite primary key (if you're not doing so yet). Using the combination of partition key and sort key gives you more possibilities to query (and not scan) your table depending on your frequent access patterns.
If you still need to query the table by an attribute other than the ones included in your composite primary key, you are right that the GSI is the solution. You can check this post on how the GSI works: Choose primary index for Global secondary index
You can think of a GSI as a copy of your table with a different primary key.
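To make the "copy with a different primary key" idea concrete, here is a minimal sketch of how the scan from the question might become a query. It assumes a GSI named qrCode-index (a hypothetical name) exists on the business table with qrCode as its partition key, and that the index projects the directory attribute. The values use the typed format of the low-level AWS.DynamoDB client, as in the question.

```javascript
// Builds parameters for AWS.DynamoDB#query (low-level client, typed values).
// Assumption: a GSI named "qrCode-index" exists on "business" with "qrCode"
// as its partition key and "directory" projected (index name is hypothetical).
function buildQrCodeQuery(qrCode) {
  return {
    TableName: "business",
    IndexName: "qrCode-index",
    // A key condition replaces the scan's FilterExpression, so only
    // matching index entries are read instead of the whole table.
    KeyConditionExpression: "qrCode = :hash",
    ExpressionAttributeValues: { ":hash": { S: qrCode } },
    ProjectionExpression: "directory",
  };
}

console.log(buildQrCodeQuery("abc123"));
```

The result would be passed to DynamoDBServiceObj.query(...) instead of scan(...); the Count check from the question can stay the same.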

DynamoDB begins with not returning expected results

I'm using NodeJS and DynamoDB. I've never used DynamoDB before, and I'm primarily a C# developer (where this would simply be a .Where(x => x...) call; I'm not sure why Amazon made it any more complicated than that). I'm trying to query the table based on whether an id starts with certain characters. For example, we have the year as the first 2 characters of the Id field. So something like this: 180192, so the year is 2018. The 20 part is irrelevant, I just wanted to give a human-readable example. So the Id starts with either 18 or 17, and I simply want to query the db for all rows whose Id starts with 18 (for example; it could be 17 or whatever). I did look at the documentation and I'm not sure I fully understand it. Here's what I have so far, which is just returning all results and not the expected results.
let params = {
  TableName: db.table,
  ProjectionExpression: "id,CompetitorName,code",
  KeyConditionExpression: "begins_with(id, :year)",
  ExpressionAttributeValues: {
    ':year': '18'
  }
};
return db.docClient.scan(params).promise();
So as you can see, I'm thinking that this would be a begins_with call, where I look for 18 against the Id. But again, this is returning all results (as if I didn't have KeyConditionExpression at all).
Would love to know where I'm wrong here. Thanks!
UPDATE
So I guess begins_with won't work, since it only works on strings and my id is not a string. As per a commenter's suggestion, I can use BETWEEN, but even that is not working. I either get back all the results or a "Query key condition not supported" error (if I use .scan, I get back all results; if I use .query, I get the error).
Here is the code I'm trying.
let params = {
  TableName: db.table,
  ProjectionExpression: "id,CompetitorName,code",
  KeyConditionExpression: "id BETWEEN :start and :end",
  ExpressionAttributeValues: {
    ':start': 18000,
    ':end': 189999
  }
};
return db.docClient.query(params).promise();
It seems as if there's no actual solution for what I was originally trying to do, unfortunately, which is a huge downfall of DynamoDB. There really needs to be some way to do a 'where' using the values of columns, like you can in virtually any other language. However, I have to admit, part of the problem was the way that id was structured. You shouldn't have to rely on the id to get info out of it. Anyways, I did find another column, DateofFirstCapture, and using contains on it (the dates are not all in the same format, it's a mess) with a year such as 2018 or 2017 seems to be working.
If you want to fetch data by id, make it the partition key. If you want to fetch data by part of a string, you can use begins_with on the sort key.
begins_with (a, substr)— true if the value of attribute a begins with a particular substring.
source: https://docs.amazonaws.cn/en_us/amazondynamodb/latest/developerguide/Query.html
In a KeyConditionExpression, begins_with and BETWEEN can only be applied to the sort key; the partition key only supports an equality test.
For a query you must always supply the partition key.
So if you change your design to have a unique partition key (or a unique combination of partition and sort keys) and store values like 180192 as a string sort key, you will be able to query with begins_with(sortkey, ...).
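A minimal sketch of such a query, under an assumed schema that is not in the question: a string partition key called competitionId (a hypothetical name) and the id stored as a string sort key. The helper buildYearQuery is likewise only for illustration and just builds the parameter object for DocumentClient's query call.

```javascript
// Assumed (hypothetical) schema: partition key "competitionId" (string),
// sort key "id" stored as a string such as "180192".
function buildYearQuery(tableName, competitionId, yearPrefix) {
  return {
    TableName: tableName,
    // Equality on the partition key is mandatory; begins_with is only
    // allowed on the sort key and only for string (or binary) values.
    KeyConditionExpression: "competitionId = :pk AND begins_with(id, :year)",
    ExpressionAttributeValues: {
      ":pk": competitionId,
      ":year": String(yearPrefix), // coerce 18 to "18"
    },
  };
}

console.log(buildYearQuery("competitors", "regional-cup", 18));
```

The object would be passed to db.docClient.query(params).promise(). A scan is not needed, because the key condition narrows the read to one partition and one sort-key prefix.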

Rethinkdb replace document if document exists, else insert document

I would like to insert a document if it doesn't exist (client_nr not found).
If it exists, replace the whole document with new values.
The only other thing is that client_nr is not the primary key. The primary key is the default id created by the RethinkDB database.
I tried the below code in node js, but nothing happened. The data is in the variable jsonArray. I use the for loop to go through the whole jsonArray.
Any idea how to solve this problem?
Thanks!!!
for(var Ticker in jsonArray){
  r.db(db).table('trades').filter({client_nr: jsonArray[Ticker].client_nr}).forEach(function(post) {
    return r.branch(
      post.eq(null),
      r.db(db).table('log').insert(jsonArray[Ticker]),
      r.db(db).table('log').replace(jsonArray[Ticker])
    )
  }).run()
}
This is much easier to do if client_nr is your primary key. I'd consider doing that instead of using the autogenerated IDs. That will also enforce uniqueness on the field, which is probably what you want.
I was also a little confused by your example because your description made it sound like you wanted to be inserting/replacing into the same table that you're filtering on, but your example is referencing two different tables.
Assuming you want to be using a single table, something like this should do it:
TABLE.filter({client_nr: jsonArray[Ticker].client_nr}).replace(function(row) {
  return r(jsonArray[Ticker]).merge(row.pluck('id'));
}).do(function(res) {
  return r.branch(
    res('replaced').add(res('unchanged')).eq(0),
    TABLE.insert(jsonArray[Ticker]),
    res);
})

Referencing external doc in CouchDB view

I am scraping a 90K record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check to ensure that the two settings are not producing different records (due to dropped updates, etc). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records with a difference between them.
However, I cannot quite figure out how to pull in another doc in the view, everything I have read only discusses external docs using the emit() function, which is too late to permit me to compare it. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
  if(doc._id.slice(0,1)!=='$' && doc._id.slice(0,1)!== "_"){
    var otherDoc = lookup('$test' + doc._id);
    if(otherDoc){
      var keys = doc.value.keys();
      var same = true;
      keys.forEach(function(key) {
        if ((key.slice(0,1) !== '_') && (key.slice(0,1) !=='$') && (key!=='expires')) {
          if (!Object.equal(otherDoc[key], doc[key])) {
            same = false;
          }
        }
      });
      if(!same){
        emit(doc._id, 1);
      }
    }
  }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
Personally I like the last option better, but only if you don't plan to use the database as is in production; i.e., you wouldn't want to carry around all that extra data in each record.
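The brute force pairing approach could look roughly like the sketch below. It assumes a hypothetical view whose map function emits the id with the $test prefix stripped, so both copies of a record share a key, and that the view is read with include_docs=true. Only the client-side comparison is shown; the names findMismatches and comparableFields are illustrative, and the ignored fields mirror the rules in the question.

```javascript
const { isDeepStrictEqual } = require("node:util");

// Drop fields that are expected to differ between scrapes, mirroring the
// question's rules: keys starting with "_" or "$", and "expires".
function comparableFields(doc) {
  const out = {};
  for (const key of Object.keys(doc)) {
    if (key[0] !== "_" && key[0] !== "$" && key !== "expires") out[key] = doc[key];
  }
  return out;
}

// rows: view rows ({ key, doc }) where both copies of a record share a key.
// Returns the keys whose pair is incomplete or whose contents differ.
function findMismatches(rows) {
  const byKey = new Map();
  for (const row of rows) {
    if (!byKey.has(row.key)) byKey.set(row.key, []);
    byKey.get(row.key).push(row.doc);
  }
  const mismatched = [];
  for (const [key, docs] of byKey) {
    const same = docs.length === 2 &&
      isDeepStrictEqual(comparableFields(docs[0]), comparableFields(docs[1]));
    if (!same) mismatched.push(key);
  }
  return mismatched;
}

const sampleRows = [
  { key: "a", doc: { _id: "a", name: "x", expires: 1 } },
  { key: "a", doc: { _id: "$testa", name: "x", expires: 2 } },
  { key: "b", doc: { _id: "b", name: "y" } },
  { key: "b", doc: { _id: "$testb", name: "z" } },
];
console.log(findMismatches(sampleRows));
```

Grouping in a Map keeps the comparison independent of row order, at the cost of holding all rows in memory; for 90K records that is likely acceptable, but it is a trade-off rather than a requirement.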
Hope that helps.
Cheers.

Point-in-time restores of databases and documents using Cloudant

How can I save changes in CouchDB / Cloudant in order to later do point-in-time restores of my databases, or even specific documents?
We’re working on making this a first-class feature, but until we roll it out, this is how one of our customers did it:
You have collections, and within those collections, resources. So, you keep a logging database where every document has an ID like collection-resource, so for a collection named "cars" and a resource named "Ford", you'd have a document in your logging database named cars-ford. That document looks like this:
{
  versions: [...]
}
Any time that resource is touched or modified, your application updates the logging document by appending the new version to the end of the versions field. That version might look like this:
{
  timestamp: '...', // some integer timestamp, for sorting
  doc: {...} // attributes of the document as of the save
}
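The append step might be sketched as follows. The versions, timestamp and doc field names come from the description above; the helper name appendVersion and the sample values are only illustrative, and saving the returned document back to the logging database is left out.

```javascript
// Returns a new logging document with one more entry in "versions".
// Field names follow the description above; the helper name is hypothetical.
function appendVersion(logDoc, resourceDoc, timestamp = Date.now()) {
  const versions = Array.isArray(logDoc.versions) ? logDoc.versions : [];
  return {
    ...logDoc, // keeps _id and _rev so the result can be saved back
    versions: [...versions, { timestamp, doc: resourceDoc }],
  };
}

let log = { _id: "cars-ford", versions: [] };
log = appendVersion(log, { color: "blue" }, 1000);
log = appendVersion(log, { color: "red" }, 2000);
console.log(log.versions.length, log.versions[log.versions.length - 1].doc);
```

Returning a new object instead of mutating logDoc makes it easier to retry the save if the database reports a revision conflict.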
A view over these logging documents (defined below) returns a list of all versions of all documents, sorted by when each change occurred.
Then, here's how you use that to do restores and the like:
Getting the most recent version of a resource
Get the document in its entirety, and grab the last element in the versions field. That's the most recent version.
See all versions relative to a timestamp
We'll create a view to sort by timestamp. The view looks like this:
{
  map: "function(doc) {
    for(var i in doc.versions){
      emit(doc.versions[i].timestamp, doc.versions[i].doc);
    }
  }"
}
Say our database is named loggy, the design doc where our views live is named restore, and the view itself is named time. Then we'll make a GET request to this URL:
{CLOUDANT_HOST}/loggy/_design/restore/_view/time?startkey='...'
...where the value for startkey is some timestamp. This, unmodified, will return every version after the indicated timestamp. Add limit=X and you'll get the X versions after the timestamp. Add descending=true and you'll get versions before the timestamp, instead of after.
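As a small sketch of how these query options combine, the helper below builds the request URL. The database, design doc and view names are the ones from the text; the function name buildTimeViewUrl and the host are placeholders. View keys are sent JSON-encoded, so an integer timestamp goes out as a bare number.

```javascript
// Builds the URL for the "time" view described above.
// The host passed in is a placeholder, not a real account.
function buildTimeViewUrl(host, { startkey, limit, descending } = {}) {
  const query = new URLSearchParams();
  // View keys are JSON values, so encode them before adding to the URL.
  if (startkey !== undefined) query.set("startkey", JSON.stringify(startkey));
  if (limit !== undefined) query.set("limit", String(limit));
  if (descending) query.set("descending", "true");
  const qs = query.toString();
  return `${host}/loggy/_design/restore/_view/time${qs ? "?" + qs : ""}`;
}

// Five versions before the given timestamp.
const url = buildTimeViewUrl("https://example.cloudant.com", {
  startkey: 1500000000,
  limit: 5,
  descending: true,
});
console.log(url);
```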
See the Nth revision for a resource
Much like above, but we'll tweak our view a little:
{
  map: "function(doc){
    for(var i in doc.versions){
      emit(i, doc.versions[i].doc);
    }
  }"
}
Now our view results are keyed by index rather than timestamp. So, instead of passing a timestamp to startkey, we just pass N to get the versions around the Nth revision.
Getting the number of revisions for a collection or resource
We'll use another view to group by collection and resource:
{
  map: "function(doc){
    // split the ID into collection and resource
    var parts = doc._id.split('-');
    // emit them as keys so we can group by them
    emit([parts[0], parts[1]], null);
  }",
  reduce: "_count"
}
Use the query parameters group and group_level to group results by their keys. So, if we want the number of events that have touched resources in the cars collection, we would use a querystring like this:
?group=true&group_level=1&key="cars"
group groups results whose keys are the same, but group_level=1 says "only group on the first key", which in our case is the collection. key specifies to only return documents whose key matches the given value.
Getting all resources for a given collection
Using the _all_docs view, we'll use a querystring like this:
?reduce=false&startkey="{collection}-"&endkey="{collection}0"
Remember the reduce part of our function? That _count value means "return the number of records emitted by map". reduce=false means "Don't do that." Instead, only the map function is run.
That startkey and endkey pair uses how Cloudant sorts results to exclude everything but the values matching IDs that start with the given collection.
Updating docs
Once you've got the versions you'd like to restore, GET the current version of the resource, GET the past version from the loggy database, and PUT the past version to the resource using the current version's _rev value. Bam, restored. Rinse and repeat for point-in-time restore.
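The last step, building the body for the PUT, might be sketched like this. The helper name buildRestoreBody and the sample documents are made up for illustration; the HTTP calls themselves are left out.

```javascript
// Combines a past version's fields with the current document's identity.
// CouchDB/Cloudant only accepts the update if _rev matches the current
// revision, so _id and _rev always come from the current document.
function buildRestoreBody(currentDoc, pastVersion) {
  // Drop any stale _id/_rev that may have been stored with the old version.
  const { _id, _rev, ...fields } = pastVersion;
  return { ...fields, _id: currentDoc._id, _rev: currentDoc._rev };
}

const current = { _id: "ford", _rev: "3-abc", color: "red", doors: 4 };
const past = { _id: "ford", _rev: "1-xyz", color: "blue" };
console.log(buildRestoreBody(current, past));
```

Fields that exist only in the current document (doors in this sample) are dropped on purpose, since the goal is to restore the old state rather than merge the two.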

Resources