Storm Bolt Database Connection - cassandra

I am using Storm (java) with Cassandra.
One of my bolts inserts data into Cassandra. Is there any way to hold the connection to Cassandra open between instantiations of this bolt?
The write rate of my application is high. The bolt needs to run several times a second, and performance is being hindered by the fact that it connects to Cassandra each time.
It would run a lot faster if I could have a static connection that was held open, but I am not sure how to achieve this in Storm.
To clarify the question:
What is the scope of a static connection in a Storm topology?
Unlike other messaging systems, which have workers where the "work" goes on in a loop or callback that can make use of a variable (maybe a static connection) outside this loop, Storm's bolts seem to be instantiated each time they are called and cannot have parameters passed in to them. So how can I use the same connection to Cassandra?

"Unlike other messaging systems, which have workers where the "work" goes on in a loop or callback that can make use of a variable (maybe a static connection) outside this loop, Storm's bolts seem to be instantiated each time they are called and cannot have parameters passed in to them"
It's not exactly right to say that Storm bolts get instantiated each time they are called. For example, the prepare method is called only once, during the initialization phase. The documentation says it is "called when a task for this component is initialized within a worker on the cluster. It provides the bolt with the environment in which the bolt executes."
So the best bet would be to put the initialization code in the prepare method (or open, in the case of spouts), as these are called when the tasks are starting. But you need to make it thread safe, as it is called for every task, each in its own thread, possibly concurrently.
The execute(Tuple tuple) method, on the other hand, is what actually performs the processing logic. It is called every time the bolt receives a tuple from the corresponding spouts or bolts, so this is what gets called every single time the bolt runs.
The cleanup method is called when an IBolt is going to be shut down. The documentation says:
"There is no guarantee that cleanup will be called, because the
supervisor kill -9's worker processes on the cluster. The one context
where cleanup is guaranteed to be called is when a topology is killed
when running Storm in local mode."
So it's not true that you can't pass a variable to it: you can initialize any instance variables in the prepare method and then use them during processing.
Regarding the DB connection, I am not exactly sure about your use case as you have not posted any code, but maintaining a pool of resources sounds like a good choice to me.

Related

What is the intended usage of Qt threads in conjunction with dependency injection?

Let's have a worker thread which is accessed from a wide variety of objects. This worker object has some public slots, so anyone who connects its signals to the worker's slots can use emit to trigger the worker thread's useful tasks.
This worker thread needs to be almost global, in the sense that several different classes use it, some of them are deep in the hierarchy (child of a child of a child of the main application).
I guess there are two major ways of doing this:
1. All the methods of the child classes pass their messages up the hierarchy via their return values, and let the main (e.g. the GUI) object handle all the emitting.
2. All those classes which require the services of the worker thread have a pointer to the Worker object (which is a member of the main class), and they all connect() to it in their constructors. Every such class then does the emitting by itself. Basically, dependency injection.
Option 2 seems much cleaner and more flexible to me; I'm only worried that it will create a huge number of connections. For example, if I have an array of objects which need the thread, I will have a separate connection for each element of the array.
Is there an "official" way of doing this, as the creators of Qt intended it?
There is no magic silver bullet for this. You'll need to consider many factors, such as:
Why do those objects emit the data in the first place? Is it because they need to do something, that is, emission is a “command”? Then maybe they could call some sort of service to do the job without even worrying about whether it's going to happen in another thread or not. Or is it because they inform about an event? In that case they probably should just emit signals but not connect them. It's up to the using code to decide what to do with events.
How many objects are we talking about? Some performance tests are needed. Maybe it's not even an issue.
If there is an array of objects, what purpose does it serve? Perhaps instead of using a plain array some sort of “container” class is needed? Then the container could handle the emission and connection and objects could just do something like container()->handle(data). Then you'd only have one connection per container.

Can I use child process or cluster to do custom function calls in node?

I have a node program that does a lot of heavy synchronous work. The work that needs to be done could easily be split into several parts. I would like to utilize all processor cores on my machine for this. Is this possible?
From the docs on child processes and clusters I see no obvious solution. Child processes seem to be focused on running external programs, and clusters only work for incoming HTTP connections (or have I misunderstood that?).
I have a simple function var output = fn(input) and would just like to run it several times, spread all the calls across the cores on my machine and provide the result in a callback. Can that be done?
Yes, child processes and clusters are the way to do that. There are a couple of ways of implementing a solution to your problem.
Your server creates a queue and manages that queue. Whenever you need to call your function, you will drop it into the queue. You will then process the queue N items at a time, where N equals the number of your cores. When you start processing, you will spawn a child process, probably either using spawn or exec, with the argument being another standalone Node.js script, along with any additional parameters (it's just a command line call, basically). Inside that script you will do your work, and emit the result back to the server. The worker is then freed up.
You can create a dedicated server with cluster, where all it will do is run your function. With the cluster module, you can (once again) create N other workers, and delegate work to these workers.
Now this may seem like a lot of work, and it is. For that reason you should use an existing library, as this is, for the most part, a solved problem at this point. I really like redis-based queues, so if you're interested in that see this answer for some queue recommendations.
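For illustration only, the first approach (a queue processed N items at a time, each item in its own child process) could be sketched roughly as below. This is a minimal sketch, not a production queue: WORKER_SOURCE, runInChild and runAll are hypothetical names, the summing loop merely stands in for the real fn(input), and the worker script is passed inline through node -e only to keep the example in one file. A real setup would more likely point at a standalone script, as described above.

```javascript
const { execFile } = require('child_process');
const os = require('os');

// Hypothetical stand-in for the heavy synchronous fn(input); this runs in the child.
const WORKER_SOURCE = `
  const input = Number(process.argv[process.argv.length - 1]);
  let sum = 0;
  for (let i = 1; i <= input; i++) sum += i; // pretend this is expensive
  process.stdout.write(JSON.stringify(sum));
`;

// Run the work for one input in a separate node process.
function runInChild(input, callback) {
  execFile(process.execPath, ['-e', WORKER_SOURCE, String(input)], (err, stdout) => {
    if (err) return callback(err);
    callback(null, JSON.parse(stdout));
  });
}

// Work through all inputs, keeping at most `limit` child processes alive at once.
function runAll(inputs, limit, done) {
  const results = new Array(inputs.length);
  let next = 0;
  let finished = 0;
  let failed = false;
  if (inputs.length === 0) return done(null, results);

  function startNext() {
    if (failed || next >= inputs.length) return;
    const index = next++;
    runInChild(inputs[index], (err, output) => {
      if (failed) return;
      if (err) { failed = true; return done(err); }
      results[index] = output; // keep results in input order
      finished++;
      if (finished === inputs.length) return done(null, results);
      startNext(); // a worker slot is free again
    });
  }

  for (let i = 0; i < Math.min(limit, inputs.length); i++) startNext();
}

// Usage: one child per core at most.
runAll([10, 100, 1000], Math.max(1, os.cpus().length), (err, results) => {
  if (err) throw err;
  console.log(results);
});
```

Starting a fresh process per item is simple but has a start-up cost, so for many small tasks a pool of long-lived workers (or an existing queue library) is probably the better choice.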

Understanding the Event-Loop in node.js

I've been reading a lot about the Event Loop, and I understand the abstraction provided whereby I can make an I/O request (let's use fs.readFile(foo.txt)) and just pass in a callback that will be executed once an event indicating completion of the file read is fired. However, what I do not understand is where the function that is doing the work of actually reading the file is being executed. JavaScript is single-threaded, but there are two things happening at once: the execution of my node.js file and of some program/function actually reading data from the hard drive. Where does this second function take place in relation to node?
The Node event loop is truly single threaded. When we start up a program with Node, a single instance of the event loop is created and placed into one thread.
However, for some standard library function calls, the Node C++ side and libuv decide to do expensive work outside of the event loop entirely, so that it does not block the main loop. Instead, they make use of a thread pool: a set of (by default) four threads that can be used for running computationally intensive tasks. Only four things use this thread pool: DNS lookup, fs, crypto and zlib. Everything else executes in the main thread.
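A small sketch can make the ordering visible. It assumes nothing beyond the standard fs, os and path modules; readAndTrace and the demo file name are hypothetical, chosen only for this example. The call to fs.readFile returns immediately, the read happens off the main thread, and the callback runs on the event loop afterwards.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Record the order in which things happen around an asynchronous read.
function readAndTrace(file, done) {
  const trace = [];
  fs.readFile(file, 'utf8', (err, text) => {
    // Runs later on the event loop thread, once the read has completed.
    trace.push('callback');
    done(err, text, trace);
  });
  // readFile has already returned; the read itself is in progress elsewhere.
  trace.push('after readFile call');
}

// Hypothetical demo file in the system temp directory.
const demoFile = path.join(os.tmpdir(), 'event-loop-demo.txt');
fs.writeFileSync(demoFile, 'hello from disk');

readAndTrace(demoFile, (err, text, trace) => {
  if (err) throw err;
  console.log(trace.join(' -> ')); // prints: after readFile call -> callback
  console.log(text);               // prints: hello from disk
});
```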
"Of course, on the backend, there are threads and processes for DB access and process execution. However, these are not explicitly exposed to your code, so you can’t worry about them other than by knowing that I/O interactions e.g. with the database, or with other processes will be asynchronous from the perspective of each request since the results from those threads are returned via the event loop to your code. Compared to the Apache model, there are a lot less threads and thread overhead, since threads aren’t needed for each connection; just when you absolutely positively must have something else running in parallel and even then the management is handled by Node.js." via http://blog.mixu.net/2011/02/01/understanding-the-node-js-event-loop/
It's like using setTimeout(function(){/*file reading code here*/},1000);. JavaScript can keep several things in flight at once, such as three setInterval(function(){/*code to execute*/},1000); timers, but their callbacks still take turns on a single thread, so this is interleaving rather than true multi-threading. And for actually reading from or writing to the hard drive in NodeJS, you can use:
var child = require("child_process");
var fs = require("fs");

// Write text to a file by shelling out to echo.
// The text is not escaped, so only pass trusted input.
function put_text(file, text) {
  child.exec("echo " + text + " > " + file);
}

// Read the file contents back as a string (synchronously).
function get_text(file) {
  return fs.readFileSync(file, "utf8");
}
These can also be used for reading and writing to/from the hard drive using NodeJS.

Concurrent processing via scala singleton object

I'm trying to build a simple orchestration engine in a functional test like the following:
object Engine {
  def orchestrate(apiSequence: Seq[Any]): Unit = {
    val execUnitList = getExecutionUnits(apiSequence) // build a specific list
    schedule(execUnitList) // call multiple APIs
  }
}
In the methods called underneath (getExecutionUnits and schedule), the pattern I've applied is one where I incrementally build a list (hence, not a val but a var), iterate over the list, call specific APIs and run some custom validation on each one.
I'm aware that an object in Scala is sort of equivalent to a singleton (so there's only one instance of Engine, in my case). I'm wondering if this is an appropriate pattern if I'm expecting hundreds of concurrent invocations of the orchestrate method. I'm not managing any other internal variables within the Engine object and I'm simply acting on the provided arguments in the method. Assuming that the schedule method can take up to 10 seconds, I'm worried about the behavior when it comes to concurrent access. If client1, client2 and client3 call this method at the same time, will 2 of the clients get queued up and be blocked by the current client being processed?
Is there a safer idiomatic way to handle the use-case? Do you recommend using actors to wrap up the "orchestrate" method to handle concurrent requests?
Edit: To clarify, it is absolutely essential that the 2 methods (getExecutionUnits and schedule) are called in sequence. Moreover, the schedule method in turn calls multiple APIs (anywhere between 1 and 10) and it is important that they too get executed in sequence. As of right now I have a simple for loop that tackles 1 API at a time, waits for the response, then moves on to the next one if appropriate.
"I'm not managing any other internal variables within the Engine object and I'm simply acting on the provided arguments in the method."
If you are using any vars in Engine at all, this won't work. However, from your description it seems like you don't: you have a local var in getExecutionUnits method and (possibly) a local var in schedule which is initialized with the return value of getExecutionUnits. This case should be fine.
"If client1, client2 and client3 call this method at the same time, will 2 of the clients get queued up and be blocked by the current client being processed?"
No, if you don't add any synchronization (and if Engine itself has no state, you shouldn't).
"Do you recommend using actors to wrap up the "orchestrate" method to handle concurrent requests?"
If you wrap it in one actor, then the clients will be blocked waiting while the engine is handling one request.

Independent server side processing in node

Is it possible, or even practical, to create a node program (or subprogram/loop) that executes independently of the connected clients?
In my specific use case, I would like to make a multiplayer game where each turn a player performs actions, and at the end of that turn those actions are computed. Is it possible to perform those computations at a specific time regardless of the clients/players connecting?
I assume this involves the use of threads somewhere.
Possibly an easier solution would be to compute the outcome when it is observed, but this could cause difficulties if it has an influence on other entities. But this problem has been a curiosity of mine for a while.
Well, basically, the easiest solution would probably be to run the computation in a cluster worker. This spawns a worker process that runs an independent task and communicates with the main process through messages.
If you wish, however, to run a completely separate process (I probably wouldn't, but it is an option), this can be done too. You then just need a communication protocol between the two processes. Usually this would be handled by a messaging or task queue system. A popular queue for solving this issue is RabbitMQ.
If the computations each turn are not too heavy, you could solve the issue with a simple setInterval():
function turnCalculations() {
  // do loads of stuff every 30 seconds
}
setInterval(turnCalculations, 30000);
// normal node server stuff here
This would do the turn calculations every 30 seconds regardless of the users connected, but if the calculations take too long they might block your server.
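To show how player actions and the timer can stay decoupled, here is a minimal sketch under assumed names: pendingActions, submitAction, resolveTurn and startTurns are all hypothetical, and the "game rules" are reduced to counting actions per player. Clients only queue actions; the timer alone decides when a turn is resolved.

```javascript
// Actions submitted by players during the current turn (hypothetical shape).
const pendingActions = [];

// Called from whatever handles client messages; clients never trigger the computation.
function submitAction(playerId, action) {
  pendingActions.push({ playerId, action });
}

// Take everything queued so far and compute the outcome of the turn.
function resolveTurn() {
  const actions = pendingActions.splice(0, pendingActions.length);
  // Game-specific rules would go here; this sketch only counts actions per player.
  const counts = {};
  for (const { playerId } of actions) {
    counts[playerId] = (counts[playerId] || 0) + 1;
  }
  return counts;
}

// Resolve a turn every intervalMs, whether or not anyone is connected.
// Returns the timer so the caller can stop it with clearInterval.
function startTurns(intervalMs, onTurn) {
  return setInterval(() => onTurn(resolveTurn()), intervalMs);
}
```

A server would call something like startTurns(30000, broadcastOutcome) once at start-up, where broadcastOutcome is whatever function sends results to the players who happen to be connected.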

Resources