CouchDB attachment manipulation before document update - couchdb

I have a requirement to transform images attached to every document (the images actually need to be shrunk to 400px width). What is the best way to achieve that? I was thinking of having Node.js code listen on _changes and perform the necessary manipulations on document save. However, this has a bunch of drawbacks:
a) a document change does not always mean that a new attachment was added
b) we would constantly re-process images that have already been shrunk (or at least have to check the image width every time)

I think you basically have some data in a database and most of your problem is simply application logic and implementation. I could imagine a very similar requirements list for an application using Drizzle. Anyway, how can your application "cut with the grain" and use CouchDB's strengths?
A Node.js _changes listener sounds like a very good starting point. Node.js has plenty of hype and silly debates. But for receiving a "to-do list" from CouchDB and executing that list concurrently, Node.js is ideal.
Memoizing
I immediately think that image metadata in the document will help you. Fetching an image and checking if it is 400px could get expensive. If you could indicate "shrunk":true or "width":400 or something like that in the document, you would immediately know to skip the document. (This is an optimization, you could possibly skip it during the early phase of your project.)
But how do you keep the metadata in sync with the images? Maybe somebody will attach a large image later, and the metadata still says "shrunk":true. One answer is the validation function. validate_doc_update() has the privilege of examining both the old and the new (candidate) document version. If it is not satisfied, it can throw() an exception to prevent the change. So it could enforce your policy in a few ways:
Any time new images are attached, the "shrunk" key must also be deleted
Or, your external Node.js tool has a dedicated username to access CouchDB. Documents must never set "shrunk":true unless the user is your tool (a sketch of such a rule follows this list)
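For example, that second rule might look roughly like the following validate_doc_update function (a sketch only; "image_shrinker" is a made-up account name for your Node.js tool):
function(newDoc, oldDoc, userCtx) {
  // Did this update set or change the "shrunk" marker?
  var changed_shrunk = newDoc.shrunk && (!oldDoc || newDoc.shrunk !== oldDoc.shrunk)

  // Only the dedicated tool account ("image_shrinker", hypothetical) may do that.
  if(changed_shrunk && userCtx.name !== "image_shrinker")
    throw({"forbidden": "Only the image shrinker tool may set the shrunk field"})
}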
Another idea worth investigating is, instead of setting "shrunk":true, you set it to the MD5 checksum of the image. (That is already in the document, in the ._attachments object.) So if your Node.js tool sees this document, it knows that it has work to do.
{ "_id": "a_doc"
, "shrunk": "md5-D2yx50i1wwF37YAtZYhy4Q=="
, "_attachments":
{ "an_image.png":
{ "content_type":"image/png"
, "revpos": 1
, "digest": "md5-55LMUZwLfzmiKDySOGNiBg=="
}
}
}
In other words:
if(doc.shrunk == doc._attachments["an_image.png"].digest)
  console.log("This doc is fine")
else
  console.log("Uh oh, I need to check %s and maybe shrink the image", doc._id)
Execution
I am biased because I wrote the following tools. However I have had success, and others have reported success using the Node.js package Follow to watch the _changes events: https://github.com/iriscouch/follow
And then use Txn for ACID transactions in the CouchDB documents: https://github.com/iriscouch/txn
The pattern is,
Run follow() on the _changes URL, perhaps with "include_docs":true in the options.
For each change, decide if it needs work. If it does, execute a function to make the necessary changes, and let txn() take care of fetching and updating, and possible retries if there is a temporary error.
For example, Txn helps you atomically resize the image and also update the metadata, pretty easily.
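Roughly, the pattern might look like this (a sketch only: needs_work() and shrink_attachment() are made-up helpers, and the txn() call assumes its uri-plus-operation-function style; check the Txn README for the exact API):
var follow = require('follow')
var txn = require('txn')

var couch = "http://localhost:5984/my_db"

follow({"db": couch, "include_docs": true}, function(er, change) {
  if(er) throw er

  var doc = change.doc
  if(!needs_work(doc))   // e.g. compare doc.shrunk to the attachment digest
    return

  // txn fetches the current revision, runs the operation, and retries on conflict.
  txn({"uri": couch + '/' + doc._id}, shrink_and_mark, function(er, newDoc) {
    if(er) console.error("Could not update %s: %s", doc._id, er.message)
  })
})

function shrink_and_mark(doc, to_txn) {
  // shrink_attachment() is a made-up helper that resizes the image to 400px
  // and updates doc._attachments accordingly.
  shrink_attachment(doc, "an_image.png", 400)
  doc.shrunk = doc._attachments["an_image.png"].digest
  return to_txn()
}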
Finally, if your program crashes, you might fetch a lot of documents that you already processed. That might be okay (if you have your metadata working); however you might want to record a checkpoint occasionally. Remember which changes you saw.
var follow = require('follow')

var db = "http://localhost:5984/my_db"
var checkpoint = get_the_checkpoint_somehow() // Synchronous, for simplicity

follow({"db":db, "since":checkpoint}, function(er, change) {
  if(change.seq % 100 == 0)
    store_the_checkpoint_somehow(change.seq) // Another synchronous call
})
Work queue
Again, I am embarrassed to point to all my own tools. But image processing is a classic example of a work-queue situation. Every document that needs work is placed in the queue. An unlimited, elastic army of workers receives a job, fixes the document, and marks the job done (deleted).
I use this a lot myself, and that is why I made CQS, the CouchDB Queue System: https://github.com/iriscouch/cqs
It is for Node.js, and it is identical to Amazon SQS, except it uses your own CouchDB server. If you are already using CouchDB, then CQS might simplify your project.
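A very rough, hypothetical sketch of the pattern (the calls below, CreateQueue / send / ReceiveMessage / del, mirror SQS-style operations and are assumptions; check the CQS README for the real API):
var cqs = require('cqs').defaults({"couch": "http://localhost:5984", "db": "cqs_queues"})

// Producer: for each document that needs work, enqueue a job.
cqs.CreateQueue("shrink_jobs", function(er, queue) {
  if(er) throw er
  queue.send({"doc_id": "a_doc"}, function(er) { /* job queued */ })
})

// Worker: receive a job, do the image work, then delete the job to mark it done.
cqs.ReceiveMessage("shrink_jobs", function(er, messages) {
  if(er) throw er
  messages.forEach(function(msg) {
    // shrink_document() is a made-up helper that does the actual resizing.
    shrink_document(msg.Body.doc_id, function(er) {
      if(!er) msg.del(function() { /* job done */ })
    })
  })
})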

Related

Should I cache results of functions involving mass file I/O in a node.js server app?

I'm writing my first 'serious' Node/Express application, and I'm becoming concerned about the number of O(n) and O(n^2) operations I'm performing on every request. The application is a blog engine, which indexes and serves up articles stored in markdown format in the file system. The contents of the articles folder do not change frequently, as the app is scaled for a personal blog, but I would still like to be able to add a file to that folder whenever I want, and have the app include it without further intervention.
Operations I'm concerned about
When /index is requested, my route is iterating over all files in the directory and storing them as objects
When a "tag page" is requested (/tag/foo) I'm iterating over all the articles, and then iterating over their arrays of tags to determine which articles to present in an index format
Now, I know that this is probably premature optimisation as the performance is still satisfactory over <200 files, but definitely not lightning fast. And I also know that in production, measures like this wouldn't be considered necessary/worthwhile unless backed by significant benchmarking results. But as this is purely a learning exercise/demonstration of ability, and as I'm (perhaps excessively) concerned about learning optimal habits and patterns, I worry I'm committing some kind of sin here.
Measures I have considered
I get the impression that a database might be a more typical solution, rather than filesystem I/O. But this would mean monitoring the directory for changes and processing/adding new articles to the database, a whole separate operation/functionality. If I did this, would it make sense to be watching that folder for changes even when a request isn't coming in? Or would it be better to check the freshness of the database, then retrieve results from the database? I also don't know how much this helps ultimately, as database calls are still async/slower than internal state, aren't they? Or would a database query, e.g. articles where tags contain x be O(1) rather than O(n)? If so, that would clearly be ideal.
Also, I am beginning to learn about techniques/patterns for caching results, e.g. a property on the function containing the previous result, which could be checked for and served up without performing the operation. But I'd need to check if the folder had new files added to know if it was OK to serve up the cached version, right? But more fundamentally (and this is the essential newbie query at hand) is it considered OK to do this? Everyone talks about how node apps should be stateless, and this would amount to maintaining state, right? Once again, I'm still a fairly raw beginner, and so reading the source of mature apps isn't always as enlightening to me as I wish it was.
Also have I fundamentally misunderstood how routes work in node/express? If I store a variable in index.js, are all the variables/objects created by it destroyed when the route is done and the page is served? If so I apologise profusely for my ignorance, as that would negate basically everything discussed, and make maintaining an external database (or just continuing to redo the file I/O) the only solution.
First off, the request and response objects that are part of each request last only for the duration of a given request and are not shared by other requests. They will be garbage collected as soon as they are no longer in use.
But, module-scoped variables in any of your Express modules last for the duration of the server. So, you can load some information in one request, store it in a module-level variable and that information will still be there when the next request comes along.
Since multiple requests can be "in flight" at the same time (whenever your request handlers do any async work), sharing or updating information between requests means you have to make sure updates are atomic so the data is shared safely. In node.js this is much simpler than in a multi-threaded web server, but there can still be issues if you do part of an update to a shared object, then an async operation, then the rest of the update. During that async operation, another request can run and see the shared object in its half-updated state.
When not doing an async operation, your Javascript code is single threaded so other requests won't interleave until you go async.
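A contrived sketch of that hazard (db.query() here is just a stand-in for any async call):
// Shared, module-level state used by every request.
var stats = { count: 0, lastId: null };

app.post('/things', function(req, res) {
  stats.count += 1;                          // first half of the update
  db.query('INSERT ...', function(err) {     // async gap: other requests may run here
    stats.lastId = req.body.id;              // second half of the update
    res.sendStatus(200);                     // between the halves, stats is inconsistent
  });
});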
It sounds like you want to cache your parsed state into a simple in-memory Javascript structure and then intelligently update this cache of information when new articles are added.
Since you already have the code to parse your set of files and tags into in-memory Javascript variables, you can just keep that code. You will want to package that into a separate function that you can call at any time and it will return a newly updated state.
Then, you want to call it when your server starts and that will establish the initial state.
All your routes can be changed to operate on the cached state and this should speed them up tremendously.
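A minimal sketch of that idea (parseArticles() stands in for your existing markdown/file-reading code; the name is made up):
var express = require('express');
var app = express();

// Module-level cache: lives for the lifetime of the server process
// and is shared by all requests.
var cache = { articles: [], byTag: {} };

function rebuildCache() {
  // parseArticles() returns something like [{ title, tags, body, ... }].
  var articles = parseArticles();
  var byTag = {};
  articles.forEach(function(article) {
    (article.tags || []).forEach(function(tag) {
      (byTag[tag] = byTag[tag] || []).push(article);
    });
  });
  cache = { articles: articles, byTag: byTag };
}

rebuildCache(); // establish the initial state when the server starts

// Routes now read from the in-memory cache instead of hitting the file system.
app.get('/index', function(req, res) {
  res.json(cache.articles);
});

app.get('/tag/:tag', function(req, res) {
  res.json(cache.byTag[req.params.tag] || []);
});

app.listen(3000);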
Then, all you need is a scheme to decide when to update the cached state (e.g. when something in the file system changed). There are lots of options and which to use depends a little bit on how often things will change and how often the changes need to get reflected to the outside world. Here are some options:
You could register a file system watcher for a particular directory of your file system and, when it triggers, figure out what has changed and update your cache (a small sketch follows this list). You can make the update function as dumb (just start over and parse everything from scratch) or as smart (figure out the one item that changed and update only that part of the cache) as it is worth doing. I'd suggest you start simple and only invest more in it when you're sure that effort is needed.
You could just manually rebuild the cache once every hour. Updates would take an average of 30 minutes to show, but this would take 10 seconds to implement.
You could create an admin function in your server to instruct the server to update its cache now. This might be combined with option 2, so that if you added new content, it would automatically show within an hour, but if you wanted it to show immediately, you could hit the admin page to tell it to update its cache.
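As a sketch of option 1 (fs.watch behaviour varies by platform and can fire several events for one save, so a simple debounce helps; "./articles" is an assumed folder name and rebuildCache() is the function sketched above):
var fs = require('fs');

var rebuildTimer = null;
fs.watch('./articles', function(eventType, filename) {
  // Debounce: editors and file copies often trigger a burst of events.
  clearTimeout(rebuildTimer);
  rebuildTimer = setTimeout(rebuildCache, 500);
});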

How to make a trigger in RethinkDB

My Requirement:
Whenever there is a data change in a table (whether insert, update, or delete), I should be able to update my cache using my own logic, which does its manipulation using the table(s).
Technology: Node, RethinkDB
My Implementation:
I heard about table.changes() in RethinkDB, which emits a stream of objects representing changes to a table.
I tried this code:
r.table('games').changes().run(conn, function(err, cursor) {
  cursor.each(console.log);
});
It's working fine; I'm getting the events, and that's where I put my logic for the manipulations.
My question is: for how long will it keep emitting the changes? I mean, is there any limit?
And how does it work?
I read this in their doc,
The server will buffer up to 100,000 elements. If the buffer limit is hit, early changes will be discarded, and the client will receive an object of the form {error: "Changefeed cache over array size limit, skipped X elements."} where X is the number of elements skipped.
I didn't understand this properly. I guess that after 100,000 elements it won't deliver the changes (the old_val and new_val objects) in the event any more.
Please explain this constraint, and also, will this work for my requirement?
I'm very new to this technology. Please help me.
Short answer: there's no limit.
The 100,000-element buffer only matters if you do not retrieve changes from the cursor; the server will keep buffering them up to 100,000 elements. If you use each, you retrieve the changes as soon as they are available, so you will not be impacted by the limit.

How to pipeline in node.js to redis?

I have lots of data to insert (SET / INCR) into a Redis DB, so I'm looking for pipelining / mass insertion through node.js.
I couldn't find any good example/API for doing so in node.js, so any help would be great!
Yes, I must agree that there is a lack of examples for that, but I managed to create a stream on which I sent several insert commands in a batch.
You should install the redis-stream module:
npm install redis-stream
And this is how you use the stream:
var redis = require('redis-stream'),
    client = new redis(6379, '127.0.0.1');

// Open stream
var stream = client.stream();

// Example of setting 10000 records
for(var record = 0; record < 10000; record++) {
    // Command is an array of arguments:
    var command = ['set', 'key' + record, 'value'];
    // Send command to stream, but parse it before
    stream.redis.write( redis.parse(command) );
}

// Create event when stream is closed
stream.on('close', function () {
    console.log('Completed!');
    // Here you can create stream for reading results or similar
});

// Close the stream after batch insert
stream.end();
Also, you can create as many streams as you want and open/close them as you want at any time.
There are several examples of using the Redis stream in node.js on the redis-stream node module page.
In node_redis, all commands are pipelined:
https://github.com/mranney/node_redis/issues/539#issuecomment-32203325
You might want to look at batch() too. The reason it would be slower with multi() is that multi() is transactional: if something failed, nothing would be executed. That may be what you want, but you do have a choice for speed here.
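For instance, with node_redis (a sketch; batch() queues the commands client-side and sends them together, without the MULTI/EXEC transaction semantics):
var redis = require('redis');
var client = redis.createClient();

client.batch()
  .set('key1', 'value1')
  .incr('counter')
  .exec(function(err, replies) {
    // One reply per queued command, in order.
    console.log(err || replies);
  });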
The redis-stream package doesn't seem to make use of Redis' mass insert functionality, so it's also slower than the mass insert that Redis' site describes using redis-cli.
Another idea would be to use redis-cli and give it a file to stream from, which this NPM package does: https://github.com/almeida/redis-mass
Not keen on writing to a file on disk first? This repo: https://github.com/eugeneiiim/node-redis-pipe/blob/master/example.js
...also streams to Redis, but without writing to file. It streams to a spawned process and flushes the buffer every so often.
On Redis' site under mass insert (http://redis.io/topics/mass-insert) you can see a little Ruby example. The repo above basically ported that to Node.js and then streamed it directly to that redis-cli process that was spawned.
So in Node.js, we have:
var spawn = require('child_process').spawn;
var redisPipe = spawn('redis-cli', ['--pipe']);
spawn() returns a reference to a child process that you can pipe to with stdin. For example: redisPipe.stdin.write().
You can just keep writing to a buffer, streaming that to the child process, and then clearing it every so often. That way it won't fill up and will therefore be a bit easier on memory than perhaps the node_redis package (whose docs literally say that data is held in memory), though I haven't looked into it that deeply, so I don't know what the memory footprint ends up being. It could be doing the same thing.
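To make that concrete, here is a rough sketch of generating the Redis protocol and feeding it to the spawned redis-cli --pipe process (the 64 KB flush threshold is arbitrary):
var spawn = require('child_process').spawn;

var redisPipe = spawn('redis-cli', ['--pipe']);
redisPipe.stdout.pipe(process.stdout); // redis-cli prints a summary when done

// Encode one command in the Redis protocol, as described on the mass-insert page.
function toRedisProtocol(args) {
  var out = '*' + args.length + '\r\n';
  args.forEach(function(arg) {
    arg = String(arg);
    out += '$' + Buffer.byteLength(arg) + '\r\n' + arg + '\r\n';
  });
  return out;
}

var buffer = '';
for (var i = 0; i < 100000; i++) {
  buffer += toRedisProtocol(['SET', 'key:' + i, 'value:' + i]);
  if (buffer.length > 64 * 1024) {   // flush the buffer every so often
    redisPipe.stdin.write(buffer);
    buffer = '';
  }
}
redisPipe.stdin.write(buffer);
redisPipe.stdin.end();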
Of course keep in mind that if something goes wrong, it all fails. That's what tools like fluentd were created for (and that's yet another option: http://www.fluentd.org/plugins/all - it has several Redis plugins)...But again, it means you're backing data on disk somewhere to some degree. I've personally used Embulk to do this too (which required a file on disk), but it did not support mass inserts, so it was slow. It took nearly 2 hours for 30,000 records.
One benefit to a streaming approach (not backed by disk) is if you're doing a huge insert from another data source. Assuming that data source returns a lot of data and your server doesn't have the hard disk space to support all of it - you can stream it instead. Again, you risk failures.
I find myself in this position as I'm building a Docker image that will run on a server with not enough disk space to accommodate large data sets. Of course it's a lot easier if you can fit everything on the server's hard disk...But if you can't, streaming to redis-cli may be your only option.
If you are really pushing a lot of data around on a regular basis, I would probably recommend fluentd to be honest. It comes with many great features for ensuring your data makes it to where it's going and if something fails, it can resume.
One problem with all of these Node.js approaches is that if something fails, you either lose it all or have to insert it all over again.
By default, node_redis, the Node.js library, sends commands in pipelines and automatically chooses how many commands will go into each pipeline (https://github.com/NodeRedis/node-redis/issues/539#issuecomment-32203325). Therefore, you don't need to worry about this. However, other Redis clients may not use pipelines by default; you will need to check the client documentation to see how to take advantage of pipelining.

Nodejs - How to maintain a global datastructure

So I have a backend implementation in node.js which mainly contains a global array of JSON objects. The JSON objects are populated by user requests (POSTs), so the size of the global array grows in proportion to the number of users. The JSON objects inside the array are not identical. This is a really bad architecture to begin with, but I just went with what I knew and decided to learn on the fly.
I'm running this on an AWS micro instance with 6GB RAM.
How to purge this global array before it explodes?
Options that I have thought of:
At a periodic interval, write the global array to a file and purge it. The disadvantage here is that if any clients are in the middle of a transaction, that transaction state is lost.
Restart the server every day and write the global array to a file at that time. Same disadvantage as above.
Follow 1 or 2, and for every incoming request - if the global array is empty look for the corresponding JSON object in the file. This seems absolutely absurd and stupid.
Somehow I can't think of any other solution without having to completely rewrite the node.js application. Can you guys think of any? I will greatly appreciate any discussion on this.
I see that you are using memory as storage. If that is the case and your code is synchronous (you don't seem to use a database, so it might be), then solution 1 is actually correct. This is because JavaScript is single-threaded, which means that while one piece of code is running, no other code can run. There is no concurrency in JavaScript; it is only an illusion, because Node.js is sooooo fast.
So your cleaning code won't fire until the transaction is over. This is of course assuming that your code is synchronous (and from what I see it might be).
But still, there are like 150 reasons for not doing that. The most important is that you are reinventing the wheel! Let the database do the hard work for you. Using a proper database will save you all the trouble in the future. There are many possibilities: MySQL, PostgreSQL, MongoDB (my favourite), CouchDB and many, many others. It shouldn't matter at this point which one; just pick one.
I would suggest that you start saving your JSON to a non-relational DB like http://www.couchbase.com/.
Couchbase is extremely easy to set up and use, even in a cluster. It uses a simple key-value design, so saving data is as simple as:
couchbaseClient.set("someKey", "yourJSON")
then to retrieve your data:
data = couchbaseClient.get("someKey")
The system is also extremely fast and is used by OMGPOP for Draw Something. http://blog.couchbase.com/preparing-massive-growth-revisited

CQRS/EventStore - how do you manage a large tree if command should not fail?

I read that commands in CQRS are designed to not fail and should be async in nature.
In my case, I have a tree (think windows explorer) where users have folders that represent locations for video content and each child is a video/media file. Multiple users can all be working on the same branch of the tree moving folders and files around (and uploading new files and creating new folders as well as deleting files/folders).
If I was ignoring the async nature of commands, I could let the first user make their change and raise an exception on the second if say a folder the user is moving a video to is no longer there. It is now the responsibility of the second user to refresh part of his tree and then reapply his changes.
How would I do this with CQRS if I need instant feedback when my change has not been allowed (i.e. I try to move a video file to another folder and another user has deleted the folder or moved it elsewhere)?
Your command is supposed to be valid when you send it to your domain. Therefore, before sending it, you have to validate it on your client to know whether the folder is still there or not. This allows you to tell the client exactly what is happening, with a clear error message.
This also greatly reduces the margin of error. The timeframe in which something can fail shrinks to the time it takes to send the command over the network plus the time it takes to execute the command on the server.
If this risk is really low, we may only receive a failed-command answer (e.g. an enum) from the domain. For the user, it might end up as a generic exception message, and we could refresh his data to show him that things are different and that he cannot do what he intended to. At such a low percentage, these messages should not be a huge problem if they occur only 2-3 times in the year, I expect.
If this risk is really high, or if validation is not possible from the client but must occur only on the domain side, then I have no answer to give you at the moment. I am myself learning CQRS, and I cannot say more.
hope it helped,
[Edit]
I assumed command handling is not async. If so, I try/catch whatever exception occurs during execution of the command, and I can return some failure notification to the client without saying exactly what it is. The projection to the various read models remains async.
public void Handle(ICommand command)
{
    try
    {
        CommandService.Execute(command);
        Bus.Return(ErrorCodes.None);
    }
    catch (Exception e)
    {
        Bus.Return(ErrorCodes.Fail);
    }
}
My CommandService delegates to the right executor to do the job:
public class TheGoodExecutor
{
    protected void Execute(IUOW context, MyCommand command)
    {
        var myDomainObject = context.GetById<DomainObjectType>(command.Id);
        myDomainObject.DoStuff(command.Data);

        // Accept all the work we just did.
        context.Accept();
    }
}
If the good executor runs into an error, then Bus.Return(ErrorCodes.Fail) is sent instead, and this can be received by my client either synchronously or asynchronously.
If you wish to go fully async (that's what I am trying to do, or at least I would like to explore that way), I would try subscribing to events the client might be interested in.
For validation, I think listening to events does not make a lot of sense in most cases. But in the second case I was speaking of, it might. In other cases, aside from validation, it might too...
Everything in italics is a personal tryout: I have not read anything about it, nor finished anything that works properly along those lines. So take it with big brackets.. he he!!
[/Edit]
