RethinkDB: How do I create a custom duplicate check on insert - node.js

I want to bulk insert an array of data using Node.js and RethinkDB, but I don't want to insert existing records (records where name and value already exist; I don't want to dup-check on the primary key id).
[
  {name: "Robert", value: "1337"},
  {name: "Martin", value: "0"},
  {name: "Oskar", value: "1"}
]
If any of the above values already exist, don't insert, but update "value".
My current working solution is that I loop through the array and first check with a filter whether the record exists; if not, I insert it. But this is very slow for 10,000 records.

I don't think RethinkDB has that kind of concept. I read through the docs some more: to insert a new document, use insert; to update a field, use update; to replace a whole document, use replace (the primary key won't change)... So I don't think it's possible as a single built-in operation in RethinkDB.
Here are some ways you can make it run faster:
Create a compound index containing those two fields: name and value
Then use that index to check for existence instead of using filter
Generate your own id field instead of letting RethinkDB generate it. That way you know the primary key and can use it to look up documents with get, which is very fast.
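A minimal sketch of that suggestion, assuming the official rethinkdb Node driver and a table named items (both names are assumptions), with the primary key derived from name:
const r = require('rethinkdb');
const crypto = require('crypto');

// Hypothetical helper: derive a deterministic primary key from the field that must be unique.
function makeId(name) {
  return crypto.createHash('md5').update(name).digest('hex');
}

function upsertRecord(conn, rec) {
  const id = makeId(rec.name);
  // get() by primary key is a fast point lookup, much cheaper than filter().
  return r.table('items').get(id).run(conn).then(function (existing) {
    if (existing === null) {
      return r.table('items').insert({ id: id, name: rec.name, value: rec.value }).run(conn);
    }
    return r.table('items').get(id).update({ value: rec.value }).run(conn);
  });
}
Newer RethinkDB releases also accept a conflict: 'update' option on insert, which lets you send the whole array with pre-computed ids in a single bulk call.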

I had a similar requirement in a RethinkDB project, but in that case the primary key was being checked for duplicates, and it was also custom instead of being auto-generated.
What you could do is run a two-step check with async.series or async.waterfall. First pick a single object from your array, then filter the database for the name-value pair of your current object. If the result comes back empty, it is unique; if not, you have a pre-existing record with the same details.
Depending on the result, you can then pass control to the next step, which will either insert the new document or update the existing one. This is simpler if you carry a flag through the async.waterfall.
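A rough sketch of that flow for one record, assuming the async library, the rethinkdb driver and a table named items (all assumptions). The existence check here is on name only, a simplification of the name-value check above, since the goal is to refresh value when the name already exists:
const async = require('async');
const r = require('rethinkdb');

function upsertOne(conn, rec, done) {
  async.waterfall([
    function (cb) {
      // Step 1: look for an existing document and pass an "exists" flag forward.
      r.table('items').filter({ name: rec.name }).limit(1).coerceTo('array')
        .run(conn, function (err, rows) { cb(err, rows && rows.length > 0); });
    },
    function (exists, cb) {
      // Step 2: update the existing document's value, or insert a new one.
      if (exists) {
        r.table('items').filter({ name: rec.name }).update({ value: rec.value }).run(conn, cb);
      } else {
        r.table('items').insert(rec).run(conn, cb);
      }
    }
  ], done);
}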

Related

Mass delete items in DynamoDB

I have a table called Media, where the primary key is the mediaId. I have an additional table called Media_Comments, which has commentId as the primary key and a mediaId attribute that stores the mediaId the comment is linked to. Same with Media_Likes: it has a primary key of mediaId and a sort key of userId. I want to handle the case where a Media item is deleted by a user, which should then cause a mass deletion of all comments and likes of that Media item. I am currently writing this code in a Lambda using Node.js.
I tried using a regular delete based on the condition 'mediaId = :mediaId', but it complained about needing the table's primary key. Unfortunately, many times when I want a media item deleted I won't have the specific key values available to satisfy that condition. I looked into deleting by an index, like setting a GSI on the mediaId in each table and deleting by that, but unfortunately that does not seem to be an option either.
Basically, am I missing something? Is there actually a way to delete by an index? And if not, what would be the best way to do this? Setting a TTL on each item in DynamoDB that is affiliated with the Media item? Or is there another recommended way to handle this problem?
Any help is greatly appreciated, thank you.
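For reference, one common pattern here (a sketch only, with assumed names: a GSI called mediaId-index on Media_Comments and commentId as the table key) is to query the GSI for the keys and then delete them in batches, since DynamoDB has no "delete by index" call:
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function deleteCommentsForMedia(mediaId) {
  // Query the GSI to find the primary keys of every comment for this media item.
  // (Pagination via LastEvaluatedKey is omitted for brevity.)
  const result = await docClient.query({
    TableName: 'Media_Comments',
    IndexName: 'mediaId-index',
    KeyConditionExpression: 'mediaId = :mediaId',
    ExpressionAttributeValues: { ':mediaId': mediaId },
    ProjectionExpression: 'commentId'
  }).promise();

  const items = result.Items || [];
  // BatchWriteItem accepts at most 25 delete requests per call.
  for (let i = 0; i < items.length; i += 25) {
    const chunk = items.slice(i, i + 25).map(item => ({
      DeleteRequest: { Key: { commentId: item.commentId } }
    }));
    await docClient.batchWrite({ RequestItems: { Media_Comments: chunk } }).promise();
  }
}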

Node.JS/Express - how to avoid multiple database queries

I have a basic Express app and I'm getting started with DB queries. I want to know how to avoid multiple DB queries, because I don't think the way I do it is efficient:
app.get('/:word', function (req, res) {
  var word = req.params.word;   // read the word from the URL
  db.create({ name: word });
  console.log('the word is ' + word);
});
What I want to do is:
get the word from the URL
check if it exists in the database (or was previously requested, because if it was then it was probably already added through this basic code)
if it doesn't exist, add it and then proceed to console.log
I want to add each word to my database once only and not run the DB query again and again.
Here's what I'm thinking:
Not so efficient way:
query to check if it exists before inserting
Good way, but I don't know how to start here:
cache the word being queried and maintain the cache to prevent DB queries
More info (edit):
I'm using MongoDB via mongoose
the 'word' key is already unique, so I know it's not creating duplicate values
I don't want to run ANY DB queries if that value or that URL has already been hit once
The only way to check if the word already exists is to query the database before inserting. There are libraries (and also databases) that implement a findOrCreate method, but this is always just an abstraction: behind the scenes, the database will search for an existing value before writing.
If your database is huge and querying is not suitable, you could use a caching system (like Redis). But this definitely depends on your logic and your data size.
You can probably optimize the process just by adding an index to the field you want to be unique (I guess it's name?).
You could also define the field name as unique. When inserting, the database will throw an error if the document already exists. But keep in mind again that, behind the scenes, the database queries for an existing equal value before inserting. The advantage of a unique field is that its index is created automatically, and in your app logic (Node.js) you can just call the insert method and add a little bit of error-handling logic.
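A small sketch of that error-handling approach with mongoose (model and field names are assumptions): mark the field unique, insert unconditionally, and treat MongoDB's duplicate-key error (code 11000) as "already exists":
const mongoose = require('mongoose');

// unique: true makes MongoDB build a unique index on "name".
const wordSchema = new mongoose.Schema({ name: { type: String, unique: true } });
const Word = mongoose.model('Word', wordSchema);

async function addWord(name) {
  try {
    await Word.create({ name: name });
  } catch (err) {
    if (err.code === 11000) {
      return;               // duplicate key: the word is already stored, nothing to do
    }
    throw err;
  }
}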
MongoDB will create any collections you use in your app if they do not already exist.
Insert unique value:
Create a unique index on your key, so that a value can only be added once. If you try to add it again, MongoDB will throw an error.
To create a unique index:
db.collection.createIndex( { "name": 1 }, { unique: true } )
Caching:
For caching, store your data in a cache system (like memory-cache or Redis). The first time, the data will be queried from MongoDB; for subsequent requests you can serve it from the cache.
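A tiny sketch of that idea with a plain in-process Set (swap in memory-cache or Redis for anything beyond a single process); it reuses the hypothetical Word model from the sketch above and only touches MongoDB the first time a word is seen:
const seenWords = new Set();   // words this process has already confirmed are stored

app.get('/:word', async function (req, res) {
  const word = req.params.word;
  if (!seenWords.has(word)) {
    // First time we see this word: one upsert-style write, then remember it.
    await Word.updateOne({ name: word }, { name: word }, { upsert: true });
    seenWords.add(word);
  }
  console.log('the word is ' + word);
  res.send(word);
});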
In MongoDB you can use findOneAndUpdate with the optional flag upsert: true (see the documentation).
To ensure that every word appears only once you should also set a unique index on that field. However, remember that a unique index is case sensitive, so Cat and cat are different words.
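With mongoose that is roughly one call (the Word model name and the word variable from the route are assumptions): the document is created when missing and left alone otherwise:
// upsert: true creates the document if no match is found; a single round trip either way.
Word.findOneAndUpdate(
  { name: word },
  { $setOnInsert: { name: word } },
  { upsert: true, new: true }
).then(function (doc) { console.log('the word is ' + doc.name); });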

How to check for duplication before creating a new document in CouchDB/Cloudant?

We want to check if a document already exists in the database with the same fields and values as a new object we are trying to save, to prevent duplicated items.
Note: This question is not about updating documents or about duplicated document IDs; we only check the data to prevent saving a new document with the same data as an existing one.
Preferably we'd like to accomplish this with Mango/Cloudant queries and not rely on views.
The idea so far is:
1) Scan the data that we are trying to save and dynamically create a selector that matches that document's structure. (We can't hardcode the selectors because we have many types of documents.)
2) Query the DB for any documents matching that selector, to see whether a document already exists that matches those criteria.
However, I wonder about the performance of this approach, since many of the selector fields will not be indexed.
I would also much rather follow best practices than create something out of the blue, but I haven't been able to find any known solutions for this specific scenario.
If you happen to know of any, please share.
Option 1 - Define a meaningful ID for your documents
The ID could be a logical composition or a computed hash of the values that should be unique.
If you want to check whether a document ID already exists you can use the HEAD method
HEAD /db/docId
which returns 200 OK if the docId exists in the database.
If you would like to check whether the new document has the same content as the previous one, you can use a Validate Document Update Function, which allows you to compare both documents.
function(newDoc, oldDoc, userCtx, secObj) {
...
}
Option 2 - Use content hash computed outside CouchDB
Before creating or updating a document, compute a hash using the values of the attributes that should be unique.
Include the hash in the document in a new attribute, e.g. "key_hash".
Create a Mango index on the "key_hash" attribute.
When a new doc is about to be inserted, compute the hash and use a Mango query to look for documents with the same hash value before the doc is inserted.
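A rough Node sketch of option 2, assuming the nano client, an already-opened database handle db, and a Mango index on key_hash (all assumptions): hash the fields that must be unique, look the hash up, and only insert when nothing matches:
const crypto = require('crypto');

// Hash the values of the attributes that define "the same document".
function contentHash(doc, fields) {
  const material = fields.map(function (f) { return String(doc[f]); }).join('|');
  return crypto.createHash('sha256').update(material).digest('hex');
}

async function insertIfNew(db, doc, fields) {
  doc.key_hash = contentHash(doc, fields);
  // Mango query against the indexed key_hash attribute.
  const existing = await db.find({ selector: { key_hash: doc.key_hash }, limit: 1 });
  if (existing.docs.length > 0) {
    return existing.docs[0];   // duplicate content: return the document that already exists
  }
  return db.insert(doc);
}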
Option 3 - Compute hash in a View
Define a view which emits the computed hash for each document as its key.
CouchDB's JavaScript support does not include hashing functions, so this could be difficult to include in a design document.
Use Erlang to define the map function, where you have access to Erlang's hashing support.
Before creating a new document you should query the view using the hash, which you need to compute beforehand.
One solution would be to take Juanjo's and Alexis's comment one step further.
Select the keys you wish to keep unique
Put the values in a string and generate a hash
Set the document's _id to that hash
PUT the document on the database.
Check the return for failure
If another document already exists on the database with the same _id value, the PUT request will fail.
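A short sketch of that flow, again assuming the nano client (the hash-building helper mirrors the one above): the content hash becomes the _id, so the duplicate check is simply whether the PUT comes back with a 409 conflict:
const crypto = require('crypto');

async function insertUnique(db, doc, fields) {
  // The _id is the content hash, so CouchDB enforces uniqueness for us.
  const material = fields.map(function (f) { return String(doc[f]); }).join('|');
  doc._id = crypto.createHash('sha256').update(material).digest('hex');
  try {
    return await db.insert(doc);
  } catch (err) {
    if (err.statusCode === 409) {
      return null;             // conflict: a document with the same key fields already exists
    }
    throw err;
  }
}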

Data modelling for consistent secondary keys with Cassandra

With Cassandra,
I want to represent all user objects with a unique uuid, but also keep a set of zero or more secondary user keys that map to a user. Each secondary key should map to one and only one user(id). Because I need to be able to look up a user quickly by secondary key, I maintain a separate lookup table instead of a secondary INDEX.
I've modelled the data like this, but I am open to alternatives:
CREATE TABLE users (
  userid uuid PRIMARY KEY,
  name text,
  secondarykeys set<text>
);
CREATE TABLE user_secondarykeys (
  secondarykey text,
  userid uuid,
  PRIMARY KEY(secondarykey)
);
A typical use case is this:
I got this user with a secondary key mail:andreas@example.org, and I would like to see if there exists any user with that secondary key; if it does not exist, I would like to create a new user object.
I can look for the secondary key:
SELECT * FROM user_secondarykeys WHERE secondarykey = 'mail:andreas@example.org';
and if I do not find any matches, I can insert a new user:
BEGIN BATCH
INSERT INTO users (userid, name, secondarykeys) VALUES (77059e45-5fac-460b-9c4f-47528c292be0, 'Andreas', {'mail:andreas@example.org'});
INSERT INTO user_secondarykeys (secondarykey, userid) VALUES ('mail:andreas@example.org', 77059e45-5fac-460b-9c4f-47528c292be0);
APPLY BATCH;
My problem is that this can lead to inconsistent data, because a user can be inserted with that secondary key in the time between my select and my inserts.
I'm thinking that if I could make my INSERT fail when the secondary key already exists in user_secondarykeys, that would work, because it should then also revert the insert into the users table, due to the atomic property of the transaction. However, I do not know of any way to make the INSERT fail if the secondary key exists. If I add IF NOT EXISTS to the second insert, it will not revert the transaction; it will just avoid inserting into user_secondarykeys, but it will still insert into users.
Any suggestions on how to implement this use case in a reliable way is appreciated. Thanks.
First of all, I think that your model is pretty complicated, and I'm not sure I understand all of your requirements correctly.
So if you receive the secondary key first, and then have to decide what to do - add the user or not - then the following will work for you:
Instead of checking the user_secondarykeys table with a SELECT statement for the occurrence of a particular secondary key, go with the following:
INSERT INTO user_secondarykeys (secondarykey, userid) VALUES ('mail:andreas@example.org', 77059e45-5fac-460b-9c4f-47528c292be0) IF NOT EXISTS;
If it applies, it means that this secondary key is not connected to any user - so there are two cases: the user doesn't exist, or the user exists and someone wants to add a new secondary key for him. The following will do the job in both cases:
UPDATE users SET name = 'Andreas', secondarykeys = secondarykeys + {'mail:andreas@example.org'} WHERE userid = 77059e45-5fac-460b-9c4f-47528c292be0;
Because inserts/updates in Cassandra are idempotent (except counters), this will work even if there is already a user with that id in the users table - it will just add another secondary key for him.
The pros of this solution are that you remove the gap in time that can make you 'inconsistent', and you have a guarantee that no one will insert two users with the same secondary key. You specified that a user can have no secondary keys at all - in that situation you can add him straight to the users table.
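In Node this flow could look roughly like the sketch below, assuming the DataStax cassandra-driver and placeholder connection settings: the lightweight transaction on user_secondarykeys acts as the guard, and the users write only happens when it was applied:
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],          // placeholder settings
  localDataCenter: 'datacenter1',
  keyspace: 'mykeyspace'
});

async function claimSecondaryKey(secondaryKey, userId, name) {
  const claim = await client.execute(
    'INSERT INTO user_secondarykeys (secondarykey, userid) VALUES (?, ?) IF NOT EXISTS',
    [secondaryKey, userId],
    { prepare: true }
  );
  // For lightweight transactions the first row carries the [applied] flag.
  if (!claim.rows[0]['[applied]']) {
    return false;                        // someone else already owns this secondary key
  }
  // Safe to attach the key to the user; adding to a set is idempotent.
  await client.execute(
    'UPDATE users SET name = ?, secondarykeys = secondarykeys + ? WHERE userid = ?',
    [name, [secondaryKey], userId],
    { prepare: true }
  );
  return true;
}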
I'm thinking that if I could make my INSERT fail when the secondary key already exists in user_secondarykeys, that would work, because it should then also revert the insert into the users table, due to the atomic property of the transaction. However, I do not know of any way to make the INSERT fail if the secondary key exists. If I add IF NOT EXISTS to the second insert, it will not revert the transaction; it will just avoid inserting into user_secondarykeys, but it will still insert into users.
Since Cassandra 2.0.6 you can use conditional statements inside a batch, and if any of the conditions is not met then none of the instructions in that batch will fire. This sounds great, but there is a limitation - all of the statements inside the batch have to operate on the same single partition. Because of this, it is impossible to make a cross-partition/table conditional insert/update/delete. So in your case this:
BEGIN BATCH
INSERT INTO users (userid, name, secondarykeys) VALUES (77059e45-5fac-460b-9c4f-47528c292be0, 'Andreas', {'mail:andreas@example.org'});
INSERT INTO user_secondarykeys (secondarykey, userid) VALUES ('mail:andreas@example.org', 77059e45-5fac-460b-9c4f-47528c292be0) IF NOT EXISTS;
APPLY BATCH;
would not even pass query validation, because you are trying to operate on two different tables.
I'm not sure if this will be suitable for the rest of your requirements; I would need more information about your queries and the velocity/volume of the data. For sure there are other ways of modeling this.
It would greatly simplify the problem if every user had to have at least one specified secondary key (e.g. email would be a great unique key for your users table), but those are your requirements, so unless you can change them there is not much to discuss.
Hope this will help you a bit.
Good luck!

How to efficiently bulk insert and update mongodb document values from an array?

I have a Tags collection which contains documents of the following structure:
{
  word: "movie",  // tag word
  count: 1        // count of times the tag word has been used
}
I am given an array of new tags that need to be added/updated in the Tags collection:
["music","movie","book"]
I can update the counts of all Tags currently existing in the Tags collection by using the following query:
db.Tags.update({word: {$in: ["music","movies","books"]}}, {$inc: {count: 1}}, true, true);
While this is an effective strategy to update, I am unable to see which tag values were not found in the collection, and setting the upsert flag to true did not create new documents for the unfound tags.
This is where I am stuck: how should I handle the bulk insert of "new" values into the Tags collection?
Is there any other way I could better utilize the update so that it does upsert the new tag values?
(Note: I am using Node.js with mongoose, solutions using mongoose/node-mongo-native would be nice but not necessary)
Thanks ahead
The concept of using upsert and the $in operator simultaneously is incongruous. This simply will not work, as there is no way to differentiate between "upsert if *any* are in" and "upsert if *none* are in".
In this case, MongoDB is doing the version you don't want it to do, and you can't make it change that behaviour.
I would suggest simply issuing three consecutive writes by looping through the array of tags. I know that it's annoying and it has a bad code smell, but that's just how MongoDB works.
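Sketched with mongoose (the Tag model name is an assumption), that loop is one upsert per tag: $inc bumps existing counters, and upsert: true creates missing tags with count 1:
async function bumpTags(words) {
  for (const word of words) {
    // upsert: true inserts { word: word, count: 1 } when the tag does not exist yet.
    await Tag.updateOne({ word: word }, { $inc: { count: 1 } }, { upsert: true });
  }
}

bumpTags(['music', 'movie', 'book']);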
