What are the issues associated with enabling threadsafe in Google App Engine?

This is going to be a self-answered question, but I thought that such a specific question (and answer) could be of use to others...
What are the potential issues associated with enabling multi-threading in Google App Engine (GAE/J) using the <threadsafe> element in appengine-web.xml?

I have looked at this for a project I'm working on, and I have written up what I have found in an analysis here: http://devcon5.blogspot.com
I would very much appreciate any comments or additional questions I should cover.
Thanks.

One important thing to mention is that during the loading request of an instance, no additional requests are handled in other threads. Only after the first request has completely finished will the instance go into multi-threaded mode. This is especially noticeable when loading the initial instance after a deploy (or after all instances have died with no idle instances left).
This will impact applications that use URLFetch to call other servlets in the same application. The first request will try to call the same instance, but that instance won't handle the call yet. After a timeout the scheduler will spin up a second instance, after which the call is handled. (Latency on top of latency...)
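
For reference, multi-threading is opted into with the <threadsafe> element in appengine-web.xml; a minimal sketch, where the application ID and version are placeholders:

    <?xml version="1.0" encoding="utf-8"?>
    <appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
      <application>your-app-id</application> <!-- placeholder -->
      <version>1</version>
      <threadsafe>true</threadsafe> <!-- enables concurrent requests -->
    </appengine-web-app>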

Related

How to run long running synchronous operation in nodejs

I am writing a payroll management web application in Node.js for my organisation. In many cases the application will involve CPU-intensive mathematical calculations to produce the figures, with many users trying to do this simultaneously.
If I write the logic plainly (setting aside the fact that I have already done my best, from an algorithm and data structure point of view, to contain the complexity), it will run synchronously, blocking the event loop and slowing down requests and responses.
How do I resolve this scenario? What are the possible options for doing this asynchronously? I should also mention that this calculation work can be left to run in the background, and I can later notify the user about its status. I have searched for a solution all over the place and found some candidates, but only in theory; I haven't tested them by implementing them. They are listed below:
Clustering the node server
Use worker threads
Use an alternate server and do some load balancing.
Use a message queue and couple it with worker threads to do background tasks.
Can someone offer some tried and battle-tested advice on this scenario, and perhaps some associated tutorial links?
You might want to try web workers; they are easy to use and well documented.
https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers
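
In Node.js specifically, the built-in worker_threads module plays the same role as browser web workers. A minimal sketch, with a made-up payroll calculation standing in for the real one:

    const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

    if (isMainThread) {
      // Main thread: offload the CPU-heavy work so the event loop stays free.
      function runPayrollJob(employees) {
        return new Promise((resolve, reject) => {
          const worker = new Worker(__filename, { workerData: employees });
          worker.on('message', resolve); // result of the calculation
          worker.on('error', reject);
          worker.on('exit', (code) => {
            if (code !== 0) reject(new Error('worker exited with code ' + code));
          });
        });
      }

      // Hypothetical usage: kick off the job, notify the user when it is done.
      runPayrollJob([{ id: 1, hours: 160, rate: 20 }])
        .then((result) => console.log('payroll done:', result));
    } else {
      // Worker thread: the blocking computation runs here without
      // stalling the server's event loop.
      const result = workerData.map((e) => ({ id: e.id, pay: e.hours * e.rate }));
      parentPort.postMessage(result);
    }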

What are common development issues, pitfalls and suggestions?

I've been developing in Node.js for only 2 weeks and started re-creating a web site previously written in PHP. So far so good, and it looks like I can do the same things in Node (with Express) that were done in PHP, in the same amount of time or less.
I have run into things you just have to get used to, such as using modules, modules not sharing a common environment, and getting into the habit of using callbacks for file system and database operations.
But is there anything a developer might discover much later that is important to development in Node? Issues that everyone developing in Node hits but that don't surface until later? Pitfalls? Anything the pros know and the noobs don't?
I would appreciate any suggestions and advice.
Here are the things you might not realize until later:
Node will pause execution to run the garbage collector periodically. Your server will hiccup when this happens. For most people this is not a significant problem, but it can be a barrier for building near-real-time systems. See Does Node.js scalability suffer because of garbage collection when under high load?
Node is a single process and thus by default will only use 1 CPU. There is built-in clustering support to run multiple processes (typically 1 per CPU), and for the most part the Node community believes this to be a solid approach. You might be surprised by this reality, though.
Stack traces are often lost due to the event queue, so your logging and debugging methodology needs to change significantly.
Here are some minor stumbling blocks you may run into for a while (I still bump up against these):
Remembering to do callback(null, value) on a successful callback. Passing null as the first parameter is weird, so I forget to do it. Instead I accidentally do callback(value), which is interpreted as an error by the caller until I debug it for a while and slap my forehead. (Both mistakes are sketched below.)
Forgetting to use return when you invoke the callback in a guard clause and don't want the function to continue executing past that point. Sometimes this results in the callback getting invoked twice, which causes all manner of misbehavior.
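
A tiny sketch of both pitfalls in one place (the function and data are made up):

    // Hypothetical async function following the Node callback convention.
    function loadUser(id, callback) {
      if (!id) {
        // Guard clause: without this `return`, execution would fall
        // through and invoke the callback a second time below.
        return callback(new Error('missing id'));
      }
      const user = { id: id, name: 'Ada' };
      // Success must pass null as the first argument; doing
      // callback(user) here would make the caller treat it as an error.
      callback(null, user);
    }

    loadUser(1, (err, user) => {
      if (err) return console.error(err);
      console.log(user);
    });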
Here are some NICE things you might not realize initially:
It is much easier in node.js, using one of the awesome flow control libraries, to do complex operations like loading 3 network resources in parallel, then making 2 DB calls in series, then writing to 2 log files in parallel, then sending an HTTP response. This stuff is trivial and beautiful in node and damn near impossible in many synchronous environments. (See the sketch after this list.)
ALL of node's modules are new and modern, and for the most part, you can find a beautifully-designed module with a great API to do what you need. Python has great libraries by now, too, but compare Node's cheerio or jsdom module to python's BeautifulSoup and see what I mean. Compare python's requests module to node's superagent.
There's a community benefit that comes from working with a modern platform where people are focused on modern web development. The contrast between the node community and the PHP community cannot be overstated.
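
For instance, with the async flow control library the exact pipeline described above is a handful of lines; a sketch with stub tasks standing in for real network, DB, and filesystem calls:

    const async = require('async');

    // Stand-in async task factory; real code would do HTTP fetches,
    // DB queries, and fs.writeFile calls instead.
    const task = (name) => (done) => setImmediate(() => done(null, name));

    async.series([
      // 3 network resources in parallel
      (next) => async.parallel([task('res1'), task('res2'), task('res3')], next),
      // then 2 DB calls in series
      (next) => async.series([task('db1'), task('db2')], next),
      // then 2 log writes in parallel
      (next) => async.parallel([task('log1'), task('log2')], next),
    ], (err, results) => {
      if (err) throw err;
      console.log('done:', results); // here you would send the HTTP response
    });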

Implementing concurrency in Java EE Web application

We are creating a web app where we need concurrency for a few business cases. The application will be deployed in a Tomcat container. I know that creating user-defined threads in the web container is a bad idea, so I am trying to explore the options I have:
1. Have my multi-threaded library used as a JCA component. We are averse to this approach because of the learning curve that might be involved.
2. I know there are WorkManager APIs available, but I guess they are not implemented by Tomcat, so this option is out.
3. I did some research and found that the CommonJ library is recommended for Tomcat. Has anyone used it?
4. I also see that there is a ManagedExecutorService available, but I am not sure how to use it, and is it different from the WorkManager APIs (and the CommonJ library)?
Any help on this is appreciated. By the way, using JMS is out of the question because of the deployment environment. I am inclined towards points 3 and 4, but I do not have much knowledge of them. Could someone please guide me?
Since you're using Tomcat, don't worry about it and do whatever you want. The Servlet section of Java EE makes no mention of threads etc. That's mostly under the EJB section.
Tomcat itself doesn't do much at all in terms of worrying about managing threads, it's a pretty non-invasive container.
It's best to tie your threads to a ServletContextListener so that you can pay attention to the application lifecycle and shut down your stuff when your app shuts down, but beyond that, don't overly concern yourself about it and use whatever you're happy with.
Addenda:
The simple truth is Tomcat does not care, and it's not that sophisticated. Tomcat has a thread pool for each of the HTTP listeners and that's about the end of its level of management. Tomcat is not going to take threads from a quiet HTTP listener and dedicate them to a busy one, for example. If Tomcat was truly interested in how you create threads, it would prevent you from doing so -- and it doesn't.
That means that thread management outside of the HTTP context falls squarely on your shoulders as an implementor. Java EE exposes these kinds of facilities, and the interfaces make great reads. But the simple truth is that the theoretical capabilities espoused by the Java EE API docs and the reality of modern implementations are far apart, particularly on low-end systems such as Tomcat.
Not to disparage Tomcat. Tomcat is a great piece of software. But for most of its use cases, the extra management capability simply is not necessary.
Setting up your own thread pool (using the JDK provided facilities) and working with your own thread lifecycle model will likely see you successfully through whatever project you're working on. It's really not a big deal.
There are a couple of options. Regardless of whatever container restrictions might or might not be in place, spawning individual threads on demand is nearly always a bad idea. It's not that this wouldn't work in a Servlet environment, but the number of threads you can potentially create might get completely out of hand.
The simplest solution is a plain old Java SE thread pool via a normal executor service. Start the pool in a Servlet listener and provide access to it via some static variable. Not overly pretty, but it gets the job done. Depending on your exact use case this might actually be the best solution (if your use case is pretty low-level).
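
A minimal sketch of that approach, assuming a Servlet 3.0 container (Tomcat 7+); the pool size and attribute name are arbitrary choices, not anything Tomcat prescribes:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;

    @WebListener
    public class WorkerPoolListener implements ServletContextListener {
        private ExecutorService pool;

        @Override
        public void contextInitialized(ServletContextEvent sce) {
            pool = Executors.newFixedThreadPool(4); // size is an arbitrary choice
            // Expose the pool to servlets via the application scope.
            sce.getServletContext().setAttribute("workerPool", pool);
        }

        @Override
        public void contextDestroyed(ServletContextEvent sce) {
            pool.shutdown(); // stop the pool when the app shuts down
        }
    }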
Another option is to add OpenEJB to your war, and then take advantage of the @Asynchronous annotation.
Yet another option is to realize that one typically uses Tomcat when the business requirements are extremely simple or low-level. That's pretty much the entire point of using something as bare-bones as Tomcat. As soon as you find yourself adding (tons of) libraries, you might have outgrown Tomcat and might be better off using a server that already has the functionality you need (in this case asynchronous execution). Examples are TomEE, GlassFish, Resin, JBoss AS, Geronimo, etc.
Every Servlet (the Java EE base component for HTTP request processing) in your web application is a singleton, and each request runs in its own independent thread, so there is no need to start/stop user-generated threads on your own. Your web container (in this case Tomcat) manages all of that.
Besides that, you need to keep some multi-threading considerations in mind in your code. For example, since Servlets are singletons and many threads run through the same instance, it is a bad idea to keep mutable instance attributes in these components, as sketched below.
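
A small illustration of that pitfall (a hypothetical servlet; the counter field is shared by every concurrent request):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CounterServlet extends HttpServlet {
        private int hits = 0; // shared, unsynchronized state: a data race

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            hits++; // not atomic; concurrent requests can lose updates
            resp.getWriter().println("hits: " + hits);
        }
    }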
I have used CommonJ many times and it works very well. It can be initialized and destroyed from a ServletContextListener.

Is there a compelling reason to use an AMQP based server over something like beanstalkd or redis?

I'm writing a piece of a project that's responsible for processing tasks outside of the main application-facing data server, which is written in javascript using Node.js. It needs to handle tasks which are scheduled in the future and potentially tasks that are "right now". The "right now" just means the next time a worker becomes available it will operate on that task, so that bit might not matter. The workers are all going to talk to external resources; an example job would be to send an email. We are a small shop and we don't have a ton of resources, so one thing I don't want to do is start mixing languages at this point in the process. I already see that Node can do this for us pretty easily, so that's what we're going to go with unless I see a compelling reason not to before I start coding, which is soon.
All that said, I can't tell if there is a compelling reason to use an AMQP-based server, like OpenAMQ or RabbitMQ, over something like Kue or Beanstalkd with a node client. So, here we go:
Is there a compelling reason to use an AMQP-based server over something like beanstalkd or redis with Kue? If yes, which AMQP-based server would fit best with the architecture I laid out? If no, which nosql solution (beanstalkd, redis/Kue) would be easiest to set up and fastest to deploy?
FWIW, I'm not accepting my answer yet, I'm going to explain what I've decided and why. If I don't get any answers that appear to be better than what I've decided, I'll accept my own later.
I decided on Kue. It supports multiple workers running asynchronously, and with cluster it can take advantage of multicore systems. It is easily extended to provide security. It's backed by Redis, which is used all over for exactly this kind of thing, so I know I'm not backing my job process server with unproven software (that's not to say any of the others are unproven).
The most compelling reason I picked Kue is that it provides a JSON API, so client applications (the first client will be a web application, but we're planning smartphone apps too) can add jobs easily without going through the main application-facing node instance. That means I can stay totally out of the way of the rest of my team as I write this: I don't need a route, I don't need anything, and it's all provided for me. This has another advantage: with an extension to provide login/password security, only authorized clients can add jobs, so I don't have to expose my redis server to client applications directly. Kue also has a built-in web console, and the API lets a client pull back the list of jobs associated with a given user very easily, so we can show users all of their scheduled tasks in a nifty calendar view with zero effort on my part.
The other compelling reason is the lack of a steep learning curve in getting Redis and Kue going. I've set up Redis before, and Kue is simple and effective.
Yes, I'm a lazy developer, but I'm the good kind of lazy developer.
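
For flavor, a minimal Kue sketch covering both the "right now" and the scheduled cases; the email payload and the mail helper are made up:

    const kue = require('kue');
    const queue = kue.createQueue(); // connects to a local Redis by default

    // Producer: one "right now" job and one delayed by an hour.
    queue.create('email', { to: 'user@example.com', subject: 'Welcome' }).save();
    queue.create('email', { to: 'user@example.com', subject: 'Reminder' })
      .delay(60 * 60 * 1000)
      .save();

    // Stand-in for a real mail helper.
    function sendEmail(data, done) {
      console.log('sending:', data.subject);
      done();
    }

    // Worker: processes jobs as they become available.
    queue.process('email', (job, done) => sendEmail(job.data, done));

    // The built-in JSON API and web console mentioned above.
    kue.app.listen(3000);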
UPDATE:
I have it working and doing jobs, and the throughput is amazing. I split the task-marshaling logic out into its own node instance; basically, all I have to do is deploy my repo to a new machine and run node task-server.js to scale out my workers. I may need to add some more job-searching calls to Kue because of how I implemented a few things, but that will be easy.

Node.js event vs thread programming on server side

We are planning to start a fairly complex web portal which is expected to attract good local traffic, and I've been told by my boss to consider/analyse node.js for the server side.
I think scalability and multi-core support can be handled with an Nginx or Cherokee in front.
1) Is this node.js ready for some serious/big business?
2) Does this 'event/asynchronous' paradigm on the server side have the potential to support heavy traffic and data operations, considering the fact that 'everything' is processed in a single thread and all live connections would be lost if it crashed (though it's easy to restart)?
3) What are the advantages of event-based programming compared to the thread-based style, and vice versa?
(I know of the higher cost associated with thread switching, but hardware can be squeezed harder with the event model.)
Following are interesting but (to some extent) contradictory papers:
1) http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html
2) http://pdos.csail.mit.edu/~rtm/papers/dabek:event.pdf
Node.js is developing extremely rapidly, and most of its functionality is sturdy and ready for business. However, there are a lot of places where it's lacking, like database drivers, jquery and DOM, multiple http headers, etc. There are plenty of modules coming up tackling every aspect, but for a production environment you'll have to be careful to pick ones that are stable.
It's actually much, MUCH more efficient to use a single thread than a thousand (or even fifty) from an operating system perspective, and benchmarks I've read (sorry, I don't have them on hand; I'll try to find and link them later) show that it's able to support heavy traffic. I'm not sure about file-system access, though.
Event-based programming is:
Cleaner-looking code than threaded code (in JavaScript, that is).
The JavaScript engine is extremely efficient at processing events and handling callbacks, and it's easily one of the languages seeing the most runtime optimization right now.
Harder to fit when you are thinking in terms of control flow. With events, you can never be sure of the flow. However, you can also come to think of it as more dynamic programming, treating each fired event as independent.
It forces you to be more security-conscious when programming, for the above reason. In that sense, it's better than linear systems, where you sometimes take sanitized input for granted.
As for the two papers, both are relatively old. The first benchmarks against the project linked below, which, as you can see, has a more recent note about these studies:
http://www.eecs.harvard.edu/~mdw/proj/seda/
It also cites the second paper you linked about what they have done, but refuses to comment on its relevance to the comparison between event-based systems and thread-based ones :)
Try it yourself to discover the truth.
See What is Node.js? where we cover exactly that:
Node in production is definitely possible, but far from the "turn-key" deployment seemingly promised by the docs. With Node v0.6.x, "cluster" has been integrated into the platform, providing one of the essential building blocks, but my "production.js" script is still ~150 lines of logic to handle stuff like creating the log directory, recycling dead workers, etc. For a "serious" production service, you also need to be prepared to throttle incoming connections and do all the stuff that Apache does for PHP.
To be fair, Rails has this exact problem. It is solved via two complementary mechanisms:
1) Putting Rails/Node behind a dedicated webserver (written in C and tested to hell and back) like Nginx (or Apache / Lighttpd). The webserver can efficiently serve static content, do access logging, rewrite URLs, terminate SSL, enforce access rules, and manage multiple sub-services. For requests that hit the actual node service, the webserver proxies the request through.
2) Using a framework like "Unicorn" that will manage the worker processes, recycle them periodically, etc. I've yet to find a Node serving framework that seems fully baked; it may exist, but I haven't found it yet and still use ~150 lines in my hand-rolled "production.js".
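
As a point of reference, the cluster bootstrapping at the heart of such a script is only a few lines; a minimal sketch (a real production.js adds logging, throttling, and graceful shutdown on top of this):

    const cluster = require('cluster');
    const http = require('http');
    const os = require('os');

    if (cluster.isMaster) {
      // Fork one worker per CPU and recycle any worker that dies.
      os.cpus().forEach(() => cluster.fork());
      cluster.on('exit', (worker) => {
        console.log('worker ' + worker.id + ' died, restarting');
        cluster.fork();
      });
    } else {
      http.createServer((req, res) => res.end('ok')).listen(8000);
    }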
