Delete multiple Couchbase entities having a common key pattern - Node.js

I have a use case where I have to remove a subset of entities stored in Couchbase, e.g. removing all entities with keys starting with "pii_".
I am using the Node.js SDK, but it has only one remove method, which takes one key at a time: http://docs.couchbase.com/sdk-api/couchbase-node-client-2.0.0/Bucket.html#remove
In some cases thousands of entities need to be deleted, and deleting them one by one takes a very long time, especially because I don't keep a list of keys in my application.

I agree with @ThinkFloyd when he says: a delete on the server should be a delete on the server, rather than requiring three steps: get the data from the server, iterate over it on the client side, and finally fire a delete on the server for each record.
In this regard, I think old-fashioned RDBMSs were better: all you need to do is 'DELETE FROM table WHERE something=something'.
Fortunately, Couchbase has something similar to SQL called N1QL (pronounced "nickel"). I don't know the JavaScript syntax (or that of other languages), but this is how I did it in Python.
Query to be used: DELETE from <bucketname> b where META(b).id LIKE "<key_prefix>%"
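from couchbase.n1ql import N1QLQuery
from couchbase.exceptions import CouchbaseError
# cb (an open bucket), cb_layer_key and logger are assumed to exist already
layer_name_prefix = cb_layer_key + "|" + "%"
try:
    # $1 is a positional parameter bound to layer_name_prefix
    query = N1QLQuery('DELETE from `test-feature` b where META(b).id LIKE $1', layer_name_prefix)
    cb.n1ql_query(query).execute()
except CouchbaseError as e:
    logger.exception(e)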
To achieve the same thing, an alternative query could be the one below, if you are storing 'type' and/or other metadata such as 'parent_id'.
DELETE from <bucket_name> where type='Feature' and parent_id=8;
But I prefer the first version of the query, as it operates on the key, and I believe Couchbase must have some internal indexes that make queries on the key (and other metadata) faster.
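Since the question asks about Node.js, here is a minimal sketch of how the LIKE pattern for such a delete could be built in JavaScript. It only constructs the statement and its positional parameter; actually sending them is left as a comment, because the exact call depends on the SDK version. The helper name likePrefixPattern is an illustrative assumption. One detail worth keeping in mind: in a LIKE pattern "_" matches any single character and "%" any sequence, so a literal prefix such as "pii_" should be escaped (with a backslash, as far as I know).

```javascript
// Hypothetical helper: turn a literal key prefix into a LIKE pattern.
// Backslash, "%" and "_" are escaped so that they match themselves.
function likePrefixPattern(prefix) {
  return prefix.replace(/[\\%_]/g, '\\$&') + '%';
}

// Statement with a positional parameter, mirroring the Python example above.
// The bucket name `test-feature` is taken from that example.
const statement = 'DELETE FROM `test-feature` b WHERE META(b).id LIKE $1';
const params = [likePrefixPattern('pii_')];

console.log(statement, params);
// With Node SDK 2.x this would presumably be sent with something like
//   bucket.query(couchbase.N1qlQuery.fromString(statement), params, callback);
// check the documentation of your SDK version before relying on it.
```

Passing the pattern as a parameter rather than concatenating it into the statement keeps user-supplied prefixes out of the query text.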

The best way to accomplish this is to create a Couchbase view by key and then range query over that view via your NodeJS code, making deletes on the results.
http://docs.couchbase.com/admin/admin/Views/views-querySample.html
http://docs.couchbase.com/couchbase-manual-2.0/#couchbase-views-writing-querying-selection-partial
http://docs.couchbase.com/sdk-api/couchbase-node-client-2.0.8/ViewQuery.html
For example, your Couchbase view could look like the following:
function(doc, meta) {
    emit(meta.id, null);
}
Then in your NodeJS code, you could have something that looks like this:
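var couchbase = require('couchbase');
var ViewQuery = couchbase.ViewQuery;
var query = ViewQuery.from('designdoc', 'by_id');
// End key is the prefix plus a high code point, so every "pii_..." id falls inside the range
query.range("pii_", "pii_" + "\uefff", false);
// myCluster is assumed to be an existing couchbase.Cluster instance
var myBucket = myCluster.openBucket();
myBucket.query(query, function(err, results) {
    if (err) { return console.error(err); }
    results.forEach(function(row) {
        // Delete code in here; row.id is the id of the matching document
        myBucket.remove(row.id, function(removeErr) {
            if (removeErr) { console.error(removeErr); }
        });
    });
});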
Of course your Couchbase design document and view will be named differently than the example that I gave, but the important part is the ViewQuery.range function that was used.
All document ids prefixed with pii_ would be returned, in which case you can loop over them and start deleting.
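With thousands of ids, firing every remove at once may overload the client or the server. The following is a minimal sketch of deleting in bounded batches; it is independent of the SDK and only assumes a removeOne(id) function that returns a Promise. Both removeInBatches and removeOne are hypothetical names, not part of the Couchbase API.

```javascript
// Delete every id by calling removeOne(id), with at most `batchSize`
// requests in flight at a time. Resolves to the ids that failed.
async function removeInBatches(ids, removeOne, batchSize = 100) {
  const failed = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    const batch = ids.slice(i, i + batchSize);
    // allSettled waits for the whole batch, even if some removes fail
    const results = await Promise.allSettled(batch.map((id) => removeOne(id)));
    results.forEach((result, j) => {
      if (result.status === 'rejected') failed.push(batch[j]);
    });
  }
  return failed;
}

// Assumed wrapper around the callback-style remove of the bucket:
//   const removeOne = (id) => new Promise((resolve, reject) =>
//     myBucket.remove(id, (err) => (err ? reject(err) : resolve())));
//   removeInBatches(results.map((row) => row.id), removeOne, 200);
```

Returning the failed ids instead of throwing lets the caller retry just those documents.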

Related

Getting database names from server

I want to do a simple thing: get the database names on a RavenDB server. It looks straightforward according to the docs (https://ravendb.net/docs/article-page/4.1/csharp/client-api/operations/server-wide/get-database-names), however I'm facing a chicken-and-egg problem.
The problem arises because I want to get the database names without knowing them in advance. The code in the docs works great, but it requires an active connection to a DocumentStore. And to get an active connection to a DocumentStore, it is mandatory to select a valid database. Otherwise I can't execute the GetDatabaseNamesOperation.
That makes me think that I'm missing something. Is there any way to get the database names without having to know at least one of them?
A database isn't mandatory to open a store. The following code works with no problems:
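using Raven.Client.Documents;
using Raven.Client.ServerWide.Operations;
using (var store = new DocumentStore
{
    Urls = new[] { "http://live-test.ravendb.net" }
})
{
    store.Initialize();
    // No database is selected; the operation is sent to the server itself
    var dbs = store.Maintenance.Server.Send(new GetDatabaseNamesOperation(0, 25));
}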
We send GetDatabaseNamesOperation to the ServerStore, which is common for all databases and holds common data (like database names).

Insert or update multiple documents in MongoDB

I want to implement hashtag functionality with NodeJS and MongoDB, so that I can also count the uses. Whenever a user adds hashtags to a page, I want to push or update them in the database. Each hashtag looks like this:
{_id:<auto>, name:'hashtag_name', uses: 0}
The problem I'm facing is that the user can add new tags as well, so when he clicks 'done', I have to increment the 'uses' field for the existing tags, and add the new ones. The trick is how to do this with only one Mongo instruction? So far I thought of 2 possible ways of achieving this, but I'm not particularly happy with either:
Option 1
I have a service which fetches the existing tags from the db before the user starts to write a new article. Based on this, I can detect which tags are new, and run 2 queries: one which will add the new tags, and another which will update the existing ones.
Option 2
I will send the list of tags to the server, and there I will run a find() for every tag; if I find one, I'll update it, if not, I'll create it.
Option 3 (without solution for now)
The best option would be to run a query which takes an array of tag names, does a $inc operation for the existing ones, and adds the missing ones.
The question
Is there a better solution? Can I achieve the end result from option #3?
You should do something like this; all of the operations will be executed in one batch. This is only a sketch of the idea:
var mongodb = require('mongodb');
var Db = mongodb.Db, Server = mongodb.Server;
var db = new Db('DBName', new Server('localhost', 27017));
// Establish connection to db
db.open(function(err, db) {
    if (err) { throw err; }
    // Get the collection
    var col = db.collection('myCollection');
    var batch = col.initializeUnorderedBulkOp();
    // hashTagList is assumed to be an array of tag objects
    hashTagList.forEach(function(tag) {
        // Add all tags to be executed (inserted or updated)
        batch.find({_id: tag.id}).upsert().updateOne({$inc: {uses: 1}});
    });
    batch.execute(function(err, result) {
        db.close();
    });
});
I would use the Bulk API that MongoDB has offered since version 2.6. In the same batch you can insert a tag when it is new and update its counter when it already exists.
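As a minimal sketch of option 3, the list of upserts can be built as plain objects in the shape that the Node.js driver's bulkWrite accepts. Matching on the name field (rather than _id) and counting a tag only once per article are assumptions made here for illustration; buildTagOps is a hypothetical helper name.

```javascript
// Hypothetical helper: build one upsert per distinct tag name.
// An upsert with $inc creates { name, uses: 1 } when the tag is new
// and increments `uses` when it already exists.
function buildTagOps(tagNames) {
  const distinct = [...new Set(tagNames)];
  return distinct.map((name) => ({
    updateOne: {
      filter: { name },
      update: { $inc: { uses: 1 } },
      upsert: true,
    },
  }));
}

const ops = buildTagOps(['nodejs', 'mongodb', 'nodejs']);
console.log(JSON.stringify(ops, null, 2));
// The array can then be handed to the driver in a single round trip, e.g.
//   collection.bulkWrite(ops, { ordered: false }, callback);
```

A unique index on name would be advisable with this approach, so that concurrent upserts cannot create duplicate tags.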

Can you use CouchDB 'document update handlers' with replication?

I am replicating docs from DB A to DB B. Every time a doc from DB A arrives in DB B, I want to run a 'stored procedure' to remove most of its fields (DB A is private, but has attachments that I want to be publicly available).
So far I've seen that this might be achieved using the _changes feed (continuous) and then running an 'update' handler on each document.
The document update handlers doc: https://wiki.apache.org/couchdb/Document_Update_Handlers
This seems like something that CouchDB would implement for me... (and I'm not really sure yet how to do the above).
Is there something like a 'hook' that can be run on every document that enters the database?
== EDIT ==
It seems that I would want to somehow include the update handler command in the replication trigger?
It sounds like, with some changes to how you're storing documents, you may be able to benefit from CouchDB's filtered replication. You'd need to store the attachments in documents that could be copied unmodified between the two databases.
If that's not an option, then you could potentially use transform-pouch plus PouchDB's .replicate.from() method to manage the replication.
Some quick pseudo-code for this idea looks a bit like this:
var PouchDB = require('pouchdb');
PouchDB.plugin(require('transform-pouch'));
var dbA = new PouchDB('a'); // "a" could be a URL to CouchDB or Cloudant
var dbB = new PouchDB('b');
dbB.transform({
    incoming: function (doc) {
        // do something to the document before storage
        return doc;
    }
});
dbB.replicate.from(dbA);
In theory, that (or something like it) should do what you want, or at least give you a framework in which to do it. ^_^
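To make the "do something to the document" step concrete, here is a minimal sketch of a function that could serve as the incoming transform. It keeps only the fields needed for identity, replication, and attachments and drops everything else. Which fields are safe to publish is an assumption; stripPrivateFields and KEEP are illustrative names.

```javascript
// Fields kept in the public copy. This list is an assumption; adjust it
// to whatever your own documents need to expose.
const KEEP = ['_id', '_rev', '_attachments', '_deleted', '_revisions'];

// Returns a new document containing only the fields listed in KEEP.
function stripPrivateFields(doc) {
  const publicDoc = {};
  for (const field of KEEP) {
    if (Object.prototype.hasOwnProperty.call(doc, field)) {
      publicDoc[field] = doc[field];
    }
  }
  return publicDoc;
}

console.log(stripPrivateFields({
  _id: 'doc1',
  _rev: '1-abc',
  secret: 'private',
  _attachments: { 'photo.png': { content_type: 'image/png' } },
}));
```

It would be plugged in as dbB.transform({ incoming: stripPrivateFields }) in the pseudo-code above.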
Hope that helps!

Referencing external doc in CouchDB view

I am scraping a 90K-record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records that differ.
However, I cannot quite figure out how to pull another doc into the view. Everything I have read only discusses external docs using the emit() function, which is too late to permit me to compare them. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
    if (doc._id.slice(0,1) !== '$' && doc._id.slice(0,1) !== "_") {
        // lookup() is the hypothetical function that would fetch the twin document
        var otherDoc = lookup('$test' + doc._id);
        if (otherDoc) {
            var keys = doc.value.keys();
            var same = true;
            keys.forEach(function(key) {
                if ((key.slice(0,1) !== '_') && (key.slice(0,1) !== '$') && (key !== 'expires')) {
                    if (!Object.equal(otherDoc[key], doc[key])) {
                        same = false;
                    }
                }
            });
            if (!same) {
                emit(doc._id, 1);
            }
        }
    }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
Personally I like the last option best, but only if you don't plan to use the database as-is in production, i.e., you wouldn't want to carry around all that extra data in each record.
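For the last option, a view that indexes the docs whose two batches disagree could be sketched as follows. The map function depends only on the document it is given, so it stays within the rule described above. The field names batch_one and batch_two come from the structure shown earlier; the ignored-field rules are borrowed from the question's code, and the emit() defined here is only a local stand-in so the sketch can be run in Node.

```javascript
// Local stand-in for CouchDB's emit(), only for running the sketch in Node.
var emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

// Map function: emits the names of the fields that differ between batches.
function mapMismatches(doc) {
  function ignored(k) {
    return k.charAt(0) === '_' || k.charAt(0) === '$' || k === 'expires';
  }
  // Recursive structural comparison of two JSON values.
  function same(a, b) {
    if (a === b) return true;
    if (a === null || b === null || typeof a !== 'object' || typeof b !== 'object') return false;
    if (Array.isArray(a) !== Array.isArray(b)) return false;
    var ka = Object.keys(a), kb = Object.keys(b);
    return ka.length === kb.length && ka.every(function (k) {
      return Object.prototype.hasOwnProperty.call(b, k) && same(a[k], b[k]);
    });
  }
  if (!doc.batch_one || !doc.batch_two) return;
  var one = doc.batch_one, two = doc.batch_two;
  // Union of the field names of both batches, minus the ignored ones.
  var fields = Object.keys(one).concat(Object.keys(two)).filter(function (k, i, all) {
    return !ignored(k) && all.indexOf(k) === i;
  });
  var different = fields.filter(function (k) { return !same(one[k], two[k]); });
  if (different.length > 0) emit(doc._id, different);
}

mapMismatches({ _id: 'rec1', batch_one: { name: 'a', expires: 1 }, batch_two: { name: 'a', expires: 2 } });
mapMismatches({ _id: 'rec2', batch_one: { name: 'a', tags: ['x'] }, batch_two: { name: 'b', tags: ['x'] } });
console.log(emitted);
```

In a design document only the body of mapMismatches would be stored as the view's map function; the helpers are nested inside it for that reason.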
Hope that helps.
Cheers.

How do I design a CouchDB view for the following case?

I am migrating an application from MySQL to CouchDB. (Okay, please don't pass judgement on this.)
There is a function with signature
getUserBy($column, $value)
Now you can see that in the case of SQL it is trivial to construct a query and fire it.
However, as far as CouchDB is concerned, I am supposed to write views with map functions.
Currently I have many views such as
get_user_by_name
get_user_by_email
and so on. Can anyone suggest a better, more scalable way of doing this?
Sure! One of my favorite views, for its power, is by_field. It's a pretty simple map function.
function(doc) {
    // by_field: map function
    // A single view for every field in every document!
    var field, key;
    for (field in doc) {
        key = [field, doc[field]];
        emit(key, 1);
    }
}
Suppose your documents have a .name field for their name, and .email for their email address.
To get users by name (ex. "Alice" and "Bob"):
GET /db/_design/example/_view/by_field?include_docs=true&key=["name","Alice"]
GET /db/_design/example/_view/by_field?include_docs=true&key=["name","Bob"]
To get users by email, from the same view:
GET /db/_design/example/_view/by_field?include_docs=true&key=["email","alice@gmail.com"]
GET /db/_design/example/_view/by_field?include_docs=true&key=["email","bob@gmail.com"]
The reason I like to emit 1 is so you can write reduce functions later to use sum() to easily add up the documents that match your query.
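To tie this back to getUserBy($column, $value), here is a minimal sketch of building the request path for an arbitrary field and value. The key must be valid JSON and URL-encoded, which the raw examples above leave out for readability. The helper name byFieldUrl is an illustrative assumption, as are the database and design document names it is called with.

```javascript
// Hypothetical helper mirroring getUserBy($column, $value) from the question.
// The view key is the JSON array [field, value], URL-encoded.
function byFieldUrl(db, designDoc, field, value) {
  const key = encodeURIComponent(JSON.stringify([field, value]));
  return `/${db}/_design/${designDoc}/_view/by_field?include_docs=true&key=${key}`;
}

console.log(byFieldUrl('db', 'example', 'email', 'alice@gmail.com'));
```

Because JSON.stringify handles quoting, the same helper works for numeric or boolean field values as well as strings.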

Resources