My requirement:
Whenever there is a data change in a table (insert, update, or delete), I should be able to update my cache using my own logic, which does the manipulation based on the table(s).
Technology: Node, RethinkDB
My implementation:
I heard of table.changes() in RethinkDB, which emits a stream of objects representing changes to a table.
I tried this code:
// conn is an already-established connection from r.connect()
r.table('games').changes().run(conn, function(err, cursor) {
  cursor.each(console.log);
});
It's working fine; I am getting the change events, and in the callback I put my logic for the manipulations.
My question is: for how long will it emit the changes? I mean, is there any limit?
And how does it work?
I read this in their docs:
The server will buffer up to 100,000 elements. If the buffer limit is hit, early changes will be discarded, and the client will receive an object of the form {error: "Changefeed cache over array size limit, skipped X elements."} where X is the number of elements skipped.
I didn't understand this properly. I guess that after 100,000 elements it won't give the changes in the event, like old_val and new_val.
Please explain this constraint, and will this work for my requirement?
I'm very new to this technology. Please help me.
Short answer: There's no limit.
The 100,000-element buffer only matters if you do not retrieve changes from the cursor: the server will keep buffering them, up to 100,000 elements. If you use each, you retrieve the changes as soon as they are available, so you will not be affected by the limit.
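To illustrate, here is a minimal sketch of a changefeed consumer that applies your cache logic per change and also watches for the buffer-overflow error object quoted above (the connection options and the updateCache function are placeholders, not part of the original code):

var r = require('rethinkdb');

r.connect({ host: 'localhost', port: 28015 }, function(err, conn) {
  if (err) throw err;

  r.table('games').changes().run(conn, function(err, cursor) {
    if (err) throw err;

    // each pulls changes as soon as they arrive, so the server-side
    // buffer never builds up towards the 100,000-element limit.
    cursor.each(function(err, change) {
      if (err) throw err;

      if (change.error) {
        // e.g. "Changefeed cache over array size limit, skipped X elements."
        console.warn(change.error);
        return;
      }

      // change.old_val is null for inserts; change.new_val is null for deletes.
      updateCache(change.old_val, change.new_val); // your own cache logic
    });
  });
});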
Related
I built an app that manages sports tournaments using MongoDB and Mongoose on Node.js. I'd like to know if I am using the best solution to handle multiple concurrent writes to a large document (5 MB) in rapid succession.
Each "Event" (tournament) is a single document that contains a list of teams. There is a maximum number of teams that can register for each Event. So normally, when a team registers, my Node.js server will load the event, check that the maximum number of teams has not been reached, add the team to the sub-documents, and save the Event.
The problem is that some tournaments make players frantic to get a spot and you can have 60 teams complete their registration in the opening seconds which would cause concurrency errors.
For example, if 2 teams click on "save" at the same time, 2 threads (requests) will open on the NodeJS server, both threads will load identical copies of the event, modify them and save two different versions of the document over one another. Obviously, you will get a version error for one of the two threads. Now imagine 60 teams registering within the same second.
The second problem is that the Event document is quite large. Let's be dramatic and say it's 5 MB in size (rare but possible). If I have to load, modify, and write 5 MB per registration, the registration system is going to grind to a halt (since my MongoDB is on a different server).
So I need to know if I built the right solution and if you guys foresee problems with this.
On my Node server, I built a Singleton class (accessible to all requests) to manage access to documents. So if a request comes along and asks for Document X, the singleton returns a Promise to the request, which will be resolved once this document becomes available to edit. The singleton then turns around, loads the document and grants access to the first request by resolving its promise. When the request is done editing this document, it tells the singleton that it's done. The singleton then checks if there is a queue of other requests waiting to edit this document (other teams that want to register). If so, it does NOT save the document but rather resolves the next promise, allowing the next request to edit the document.
When the last request has finished editing the document and there are no more requests in the queue, the singleton saves the document and clears it from memory.
So in short, the singleton allows the system to load the document once, allow modifications from multiple requests, and then save the document at the end of the rush. This is especially useful since the document is rather large (up to 5 MB) and it minimizes the number of reads/writes to the MongoDB server. The other use is that if we're accepting 50 teams and we get 55 requests wanting to append their teams, the last 5 requests in the queue will take into account that the live document has reached its team limit and return a "sorry we're full" response.
Is this the best way to manage concurrent writes to a large document?
MongoDB provides a multitude of update operators that you should be using on the specific fields instead of modifying the entire document in your application. For example, for adding to arrays, use $push: https://docs.mongodb.com/manual/reference/operator/update/push/.
This way you 1) will only be sending the changed data on each write and 2) avoid racing yourself and clobbering your other changes.
This doesn't help with the time it takes the server to rewrite that 5 MB document each time it's modified; split the document up to fix this (if you find it to be an issue).
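As a concrete sketch of the atomic approach (the connection string, database/collection names, and the teams field are assumptions for illustration, not taken from your schema), a single updateOne can enforce the 50-team cap and append the team in one server-side operation:

const { MongoClient } = require('mongodb');

async function registerTeam(eventId, team) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const events = client.db('tournaments').collection('events');

  // "teams.49" only exists once the array already holds 50 entries,
  // so this filter matches only while there is still room.
  const result = await events.updateOne(
    { _id: eventId, 'teams.49': { $exists: false } },
    { $push: { teams: team } }
  );

  await client.close();

  // matchedCount === 0 means the event was full (or not found).
  return result.matchedCount === 1;
}

Because the filter and the $push are evaluated atomically on the server, 60 simultaneous registrations simply race for the remaining slots, and the losers get a clean "we're full" result instead of a version error.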
I'm creating a kind of real-time chat app and I'm having trouble. What I want to do is read the previous 50 messages (documents) before a specified _id. I'll explain in more detail.
The first time a user enters the room, the app automatically loads the 50 most recent messages. After that, if the user scrolls up to the top, it loads the previous 50 messages.
The problem is that I don't get how to do it. What I thought was to find all documents and move the cursor, but everything I tried failed. If I log the "cursor" object to the console, it says:
Promise { <pending> }
so if I do this:
let cursor = db.find('room', { ... });
while (cursor.hasNext()) {
  cursor.next();
}
it goes into an infinite loop and never stops. I would be very thankful if you could give me a hand. :)
And if there is an alternative way that doesn't need a cursor, that would be really nice.
One more final question: does using a cursor cause low performance?
I'm not sure which library you use, but it seems that cursor is an asynchronous object (that's what the Promise suggests), so the while loop is incorrect anyway. It will always be pending because you never allow the other event (i.e. "I got a response") to occur, due to the single-threaded nature of Node.js. You probably have to use callbacks, not synchronous loops.
But that aside I do believe that your whole approach is incorrect.
If you know how to load the most recent 50 messages, then it means that you have some kind of logical ordering on the collection. Perhaps a timestamp (which might be part of _id).
So what I propose instead is something similar to "pagination" (a sketch follows the list):
1. On the client side, set timestamp_pointer = now()
2. Do a query: get the 50 most recent messages such that timestamp < timestamp_pointer
3. On the client side, set timestamp_pointer = the smallest timestamp of the loaded messages
4. If the user scrolls up, go back to point 2.
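A minimal sketch of step 2 with the official MongoDB Node.js driver (the messages collection and the roomId/createdAt field names are assumptions; substitute whatever your schema uses):

// Fetch the next page of 50 messages older than the client's pointer.
async function loadOlderMessages(db, roomId, timestampPointer) {
  return db.collection('messages')
    .find({ roomId: roomId, createdAt: { $lt: timestampPointer } })
    .sort({ createdAt: -1 }) // newest first within the page
    .limit(50)
    .toArray();              // materialize the batch; no long-lived cursor
}

With a compound index on { roomId: 1, createdAt: -1 }, this stays fast even for large rooms.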
There are several advantages to this method; one of them is that you don't have to worry if the connection drops for a short moment, since the state is tracked on the user side, not on the database side. And with a proper index it will be very fast.
And yes, using a cursor the way you do causes low performance, because the database has to keep track of the query until it is fully iterated. Apart from pure memory and/or CPU usage, it has some other nasty drawbacks; for example, Mongo has timeouts on cursors. What if a user scrolls up after 15 minutes? By default the cursor timeout is 10 minutes. It would be very hard to implement your idea properly.
Use Postgres. #PostgresEvangelist
I am trying to debug an issue with the `node-pg-cursor` module in Node.js against a PostgreSQL server (version 9.3).
This module allows for sequential reads of N rows in a select and works by sending
cur.read(N): 'Execute' on portal=unnamed, rows=N
this command fetches up to N rows and we can continue fetching rows incrementally until the end, where we receive
CommandComplete
ReadyForQuery
Now my problem is that I want to bail out of the extended query before fetching all the rows and reaching the end of the Execute sequence: I would like to incrementally fetch N rows, N rows, N rows, ... and at some point decide that I have had enough.
When I do that (stop fetching via Execute), the query seems to never reach CommandComplete or ReadyForQuery. This seems normal, since nothing tells the extended query that I am never going to ask it for rows again.
Apart from closing the connection, is there a command to reach CommandComplete or ReadyForQuery while not fetching all the rows from the portal?
I tried sending Close and received CloseComplete, but it did not go to ReadyForQuery.
If I force an ErrorResponse by sending garbage on the protocol, I reach ReadyForQuery, but that does not seem very clean...
I think you're referring to this, in the documentation:
If Execute terminates before completing the execution of a portal (due to reaching a nonzero result-row count), it will send a PortalSuspended message; the appearance of this message tells the frontend that another Execute should be issued against the same portal to complete the operation. The CommandComplete message indicating completion of the source SQL command is not sent until the portal's execution is completed. Therefore, an Execute phase is always terminated by the appearance of exactly one of these messages: CommandComplete, EmptyQueryResponse (if the portal was created from an empty query string), ErrorResponse, or PortalSuspended.
Presumably, you're getting PortalSuspended and you want to discard the portal without executing any more of it or consuming any more results.
If so, I think you can just send a Sync message:
At completion of each series of extended-query messages, the frontend should issue a Sync message. This parameterless message causes the backend to close the current transaction if it's not inside a BEGIN/COMMIT transaction block ("close" meaning to commit if no error, or roll back if error). Then a ReadyForQuery response is issued.
You may wish to issue a Close against the portal first:
The Close message closes an existing prepared statement or portal and releases resources.
So what I think you need to do is, in message-flow terms:

Parse
Bind a named portal
Describe
Loop:
    Execute with a row-count limit to fetch some rows
    If no more rows are needed:
        Close the portal
        Break out of the loop
    If CommandComplete is received:
        Break out of the loop
Sync
Wait for ReadyForQuery
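At the driver level, node-pg-cursor exposes roughly this flow through read() and close(); here is a hedged sketch (the table name, batch size, and the haveEnough() predicate are made up, and I'm assuming cursor.close() takes care of closing the portal and returning the connection to ReadyForQuery):

const { Client } = require('pg');
const Cursor = require('pg-cursor');

async function readUntilSatisfied() {
  const client = new Client();
  await client.connect();

  // Each read() issues another Execute against the same portal
  // with the given row limit.
  const cursor = client.query(new Cursor('SELECT * FROM big_table'));
  const batchSize = 100;

  try {
    for (;;) {
      const rows = await new Promise((resolve, reject) =>
        cursor.read(batchSize, (err, r) => (err ? reject(err) : resolve(r)))
      );
      if (rows.length === 0) break; // portal exhausted (CommandComplete)
      if (haveEnough(rows)) break;  // our own early-exit condition
    }
  } finally {
    // Assumed to Close the portal and Sync so the connection is reusable.
    await new Promise((resolve, reject) =>
      cursor.close(err => (err ? reject(err) : resolve()))
    );
    await client.end();
  }
}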
It sounds like you might want to be using the asynchronous query processing API, if your driver is a libpq wrapper. If it's a native implementation the source code for libpq may offer you clues.
Overall, it looks like you'll need to cancel the query using a new connection, then continue to consume input until the buffer is empty. You'll receive however much result data was buffered, then an error message indicating the query was cancelled (if it didn't buffer all its output before you cancelled it) and finally a ReadyForQuery.
I quote the libpq manual:
A client that uses PQsendQuery/PQgetResult can also attempt to cancel a command that is still being processed by the server; see Section 31.6. But regardless of the return value of PQcancel, the application must continue with the normal result-reading sequence using PQgetResult. A successful cancellation will simply cause the command to terminate sooner than it would have otherwise.
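If you do go the cancellation route from Node, one option is a second connection calling pg_cancel_backend() against the first connection's backend PID; a rough sketch (error handling omitted, and client.processID is assumed to be populated after connect):

const { Client } = require('pg');

// Cancel whatever is currently running on `victim` from a second connection.
async function cancelBackend(victim) {
  const canceller = new Client(); // separate connection, same credentials
  await canceller.connect();
  // processID comes from the BackendKeyData message at connection time.
  await canceller.query('SELECT pg_cancel_backend($1)', [victim.processID]);
  await canceller.end();
}

You must still drain the cancelled query's buffered results on the original connection, as described above.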
Systems usually have quite big TCP send buffers, and they're typically dynamic. See Linux's tcp(7), the SO_SNDBUF option to setsockopt(2), etc. So quite a lot of data might be buffered before the PostgreSQL server blocks on writing to the socket. PostgreSQL doesn't offer per-connection control of the send buffer size, or even a global config option; you must do it at the operating-system level. (That said, it'd be trivial to patch PostgreSQL to set a send buffer size with setsockopt and SO_SNDBUF if you wanted to.)
PostgreSQL can't just flush the output buffer when you cancel a query. Even if it were safe to do so and the platform supported it, Pg doesn't know for sure that the buffer has been emptied of results from prior queries and other relevant messages, since you might have pipelined multiple queries.
So all you can really do is reduce the maximum size of the TCP output buffer. That'll reduce the amount of data you must read and throw away, but it may impact the performance of other queries that send bulk data.
Instead of trying to run the query and cancelling it when you've seen enough, I suggest reading rows in batches, requesting a new batch when you've consumed the current one. You can do this by using protocol-level cursors. That way you can control how much data the server queues up and you don't have to mess with buffer sizes. You may already be doing this - using a named portal, and sending an Execute with a maximum row-count, waiting for the PortalSuspended to say there are more rows to read.
I have a requirement to transform images attached to every document (actually I need images to be shrunk to 400px width). What is the best way to achieve that? I was thinking of having Node.js code listening on _changes and performing the necessary manipulations on document save. However, this has a bunch of drawbacks:
a) a document change does not always mean that a new attachment was added
b) we would have to process already-shrunk images all the time (or at least check the image width)
I think you basically have some data in a database and most of your problem is simply application logic and implementation. I could imagine a very similar requirements list for an application using Drizzle. Anyway, how can your application "cut with the grain" and use CouchDB's strengths?
A Node.js _changes listener sounds like a very good starting point. Node.js has plenty of hype and silly debates. But for receiving a "to-do list" from CouchDB and executing that list concurrently, Node.js is ideal.
Memoizing
I immediately think that image metadata in the document will help you. Fetching an image and checking if it is 400px could get expensive. If you could indicate "shrunk":true or "width":400 or something like that in the document, you would immediately know to skip the document. (This is an optimization, you could possibly skip it during the early phase of your project.)
But how do you keep the metadata in sync with the images? Maybe somebody will attach a large image later, and the metadata will still say "shrunk":true. One answer is the validation function. validate_doc_update() has the privilege of examining both the old and the new (candidate) document version. If it is not satisfied, it can throw() an exception to prevent the change. So it could enforce your policy in a few ways (a sketch follows the list):
Any time new images are attached, the "shrunk" key must also be deleted
Or, your external Node.js tool has a dedicated username to access CouchDB. Documents must never set "shrunk":true unless the user is your tool
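For example, a minimal sketch of the second policy as a design-document validation function (the image_shrinker account name is made up):

function (newDoc, oldDoc, userCtx, secObj) {
  // Only the dedicated tool account may set or change the "shrunk" field.
  var oldShrunk = oldDoc ? oldDoc.shrunk : undefined;
  if (newDoc.shrunk !== oldShrunk && userCtx.name !== 'image_shrinker') {
    throw({ forbidden: 'Only the image tool may modify "shrunk"' });
  }
}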
Another idea worth investigating is, instead of setting "shrunk":true, you set it to the MD5 checksum of the image. (That is already in the document, in the ._attachments object.) So if your Node.js tool sees this document, it knows that it has work to do.
{ "_id": "a_doc"
, "shrunk": "md5-D2yx50i1wwF37YAtZYhy4Q=="
, "_attachments":
  { "an_image.png":
    { "content_type": "image/png"
    , "revpos": 1
    , "digest": "md5-55LMUZwLfzmiKDySOGNiBg=="
    }
  }
}
In other words:
if(doc.shrunk == doc._attachments["an_image.png"].digest)
  console.log("This doc is fine")
else
  console.log("Uh oh, I need to check %s and maybe shrink the image", doc._id)
Execution
I am biased because I wrote the following tools. However I have had success, and others have reported success using the Node.js package Follow to watch the _changes events: https://github.com/iriscouch/follow
And then use Txn for ACID transactions in the CouchDB documents: https://github.com/iriscouch/txn
The pattern is,
Run follow() on the _changes URL, perhaps with "include_docs":true in the options.
For each change, decide if it needs work. If it does, execute a function to make the necessary changes, and let txn() take care of fetching and updating, and possible retries if there is a temporary error.
For example, Txn helps you atomically resize the image and also update the metadata, pretty easily (a rough sketch follows the checkpoint example below).
Finally, if your program crashes, you might fetch a lot of documents that you already processed. That might be okay (if you have your metadata working); however you might want to record a checkpoint occasionally. Remember which changes you saw.
var follow = require("follow")

var db = "http://localhost:5984/my_db"
var checkpoint = get_the_checkpoint_somehow() // Synchronous, for simplicity

follow({"db":db, "since":checkpoint}, function(er, change) {
  if(change.seq % 100 == 0)
    store_the_checkpoint_somehow(change.seq) // Another synchronous call
})
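And a rough idea of how the metadata side might look with Txn, assuming its txn(options, operation, callback) call shape (the "uri" option value and the mark_shrunk name are my own; the actual image resizing and attachment upload would happen outside this snippet):

var txn = require("txn")

// After the attachment has been resized and re-uploaded, record that fact
// in the document; Txn re-fetches and retries if there is an update conflict.
function mark_shrunk(doc_id, callback) {
  txn({"uri": "http://localhost:5984/my_db/" + doc_id}, update_metadata, callback)

  function update_metadata(doc, to_txn) {
    doc.shrunk = doc._attachments["an_image.png"].digest
    return to_txn()
  }
}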
Work queue
Again, I am embarrassed to point to all my own tools. But image processing is a classic example of a work-queue situation. Every document that needs work is placed in the queue. An unlimited, elastic army of workers receives a job, fixes the document, and marks the job done (deleted).
I use this a lot myself, and that is why I made CQS, the CouchDB Queue System: https://github.com/iriscouch/cqs
It is for Node.js, and it is identical to Amazon SQS, except it uses your own CouchDB server. If you are already using CouchDB, then CQS might simplify your project.
Right now, whenever I need to access my data set's size (and it can be quite frequent), I perform a countForFetchRequest on the managedObjectContext. Is this a bad thing to do? Should I manage the count locally instead? The reason I went this route is to ensure I am getting a 100% correct answer. With Core Data being accessed from more than one place (for example, through NSFetchedResultsController as well), it's hard to keep an accurate count locally.
-countForFetchRequest: is always evaluated in the persistent store. When using the SQLite store, this will result in I/O being performed.
Suggested strategy:
Cache the count returned from -countForFetchRequest:.
Observe NSManagedObjectContextObjectsDidChangeNotification for your own context.
Observe NSManagedObjectContextDidSaveNotification for related contexts.
For the simple case (no fetch predicate), you can update the count from the information contained in the notification without additional I/O.
Alternatively, you can invalidate your cached count and refresh it via -countForFetchRequest: as necessary.