I've been using JMS Message Driven Beans for a while and they work great for asynchronous tasks. I know there are many ways to handle asynchronous processes, but I am just curious what the benefits are of a JMS Message Driven Bean versus a ScheduledThreadPoolExecutor?
For example, I have a web service which handles some tasks asynchronously. I see two main differences. If I were using a ScheduledThreadPoolExecutor I wouldn't need an application server; I could use a servlet container, e.g. Tomcat, because I am not using any EJB stuff. For an MDB I need an application server, e.g. GlassFish. But in terms of handling the actual asynchronous process, what are the advantages of each, ScheduledThreadPoolExecutor and MDB?
ScheduledThreadPoolExecutor is used to schedule tasks; the abstraction that corresponds best to an MDB is ExecutorService. But back to your question.
An MDB is more heavyweight, the API is much more complex, and in principle it was designed for transferring data, not logic. ExecutorService, on the other hand, is a thin layer on top of an actual thread pool. So if you need performance, low latency, and small overhead, go for an ordinary thread pool.
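A minimal sketch of the ExecutorService route, assuming a simple fire-and-forget task (the pool size and task body are placeholders):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncTasks {
    // A small, fixed pool; size it to your workload.
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    public static void main(String[] args) {
        // Submit a fire-and-forget task; it runs on a pool thread.
        pool.submit(() -> {
            // ... the actual asynchronous work goes here ...
            System.out.println("task executed on " + Thread.currentThread().getName());
        });
        pool.shutdown(); // stop accepting new tasks, let queued ones finish
    }
}
```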
The only reason for MDB and JMS is when you need durability and transaction support. That of course introduces even bigger overhead, as each message needs to be persisted. But tasks that are queued, or even in the middle of processing, won't be lost due to a crash.
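For comparison, a rough sketch of the durable, transactional MDB route; the queue name is illustrative, and how the destination is bound (activation config property vs. mappedName) varies by container:

```java
import javax.ejb.ActivationConfigProperty;
import javax.ejb.EJBException;
import javax.ejb.MessageDriven;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

// The destination name "jms/taskQueue" is an example only.
@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType",
                              propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination",
                              propertyValue = "jms/taskQueue")
})
public class TaskMdb implements MessageListener {
    @Override
    public void onMessage(Message message) {
        // Runs in a container-managed transaction by default; if this method
        // throws, the message is rolled back and redelivered, not lost.
        try {
            String payload = ((TextMessage) message).getText();
            // ... process payload ...
        } catch (JMSException e) {
            throw new EJBException(e);
        }
    }
}
```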
I have a multi-client, single-server application where clients and the server are connected through sockets. Client and server are on different machines.
In the client application, the client socket connects to the server and sends data to it periodically.
In the server application, a server socket listens for clients to connect. When a client connects, a new thread is created for that client to receive its data.
For example: 1 client = 1 thread created by the server for receiving data. If there are 10,000 clients, the server creates 10,000 threads. This doesn't seem good or scalable.
My application is in Java.
Is there an alternate method for this problem?
Thanks in advance
This is a typical C10K problem. There are patterns to solve this; one example is the Reactor pattern.
Java NIO is another way: incoming requests can be processed in a non-blocking way. See a reference implementation here.
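To sketch the idea, a minimal single-threaded NIO accept/read loop (error handling, partial reads, and buffer management are omitted; the port is arbitrary):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(1024);
        while (true) {
            selector.select();                       // block until something is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {            // new client: register it for reads
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {       // data from an existing client
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    int read = client.read(buffer);
                    if (read == -1) {                // client closed the connection
                        key.cancel();
                        client.close();
                    }
                    // ... hand buffer contents off for processing ...
                }
            }
        }
    }
}
```

One thread services all connections; the work done per readable key must stay small, which is the trade-off discussed further down this page.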
Yes, you should not need a separate thread for each client. There's a good tutorial here that explains how to use await to handle asynchronous socket communication. Once you receive data over the socket you can use a fixed number of threads. The tutorial also covers techniques to handle thousands of simultaneous communications.
Unfortunately, given the complexity, it's not possible to post the code here, so although link-only answers are frowned upon ...
I would say it's a perfect candidate for an Erlang/Elixir application. WhatsApp, RabbitMQ...
Erlang processes are cheap and fast to start; Erlang manages the scheduling for you, so you don't have to think about the number of threads, CPUs, or even machines; and Erlang handles garbage collection for each process once you no longer need it.
Haskell is slow; Erlang is fast enough for most applications that are not doing heavy calculations, and even then you can use it and hand off the heavy lifting to a C process.
What are you writing in?
Yes, you can use the actor model, with e.g. Akka or Akka.NET. This allows you to create millions of actors that run on, say, 4 threads. Erlang is a programming language that implements the actor model natively.
However, actors and non-blocking code won't do you much good if you depend on blocking library calls to backend services, such as (the most prominent example in the JVM world) JDBC calls.
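A common workaround is to confine the blocking calls to their own bounded pool so they can't starve the actor or event-loop threads. A rough sketch, where the table name and pool size are just placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingIoOffload {
    // Dedicated pool for blocking JDBC work, kept away from the event-loop/actor threads.
    private static final ExecutorService jdbcPool = Executors.newFixedThreadPool(8);

    public static CompletableFuture<Integer> countUsers(String jdbcUrl) {
        return CompletableFuture.supplyAsync(() -> {
            // The blocking call runs on jdbcPool, not on the caller's thread.
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM users")) {
                rs.next();
                return rs.getInt(1);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }, jdbcPool);
    }
}
```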
There is also a rather interesting approach that Haskell uses, called green threads. It means that the runtime threads are very lightweight and are dynamically mapped to OS threads. It also means that you get a certain amount of scalability "for free", with no need to write non-blocking IO code. It does however require a good IO manager in the runtime to schedule the IO operations efficiently, and GHC Haskell has had a substantial amount of work put into that in recent years.
In a Spring Integration application, I am using concurrent consumers to consume and process multiple messages at a time.
In my application, all beans are configured as singletons. I assume that if I parallelize the processing by using concurrent consumers, multiple messages will enter the same integration components at the same time.
Does this lead to data collisions between two objects?
Does this lead to data collisions between two objects?
No, it doesn't. If you don't keep any state in your components, there aren't going to be any collisions, since a thread performs only one task at a time. So, if you use the same component from different threads to perform stateless work, there is no inter-thread interaction, because each thread gets its own call stack.
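To illustrate (the class names are made up): the first handler keeps everything on the call stack and is safe to share between concurrent consumers, while the second keeps a running field on the shared singleton and is not:

```java
import java.math.BigDecimal;
import java.util.List;

// Safe as a shared singleton: every value lives on the calling thread's stack.
class StatelessTotaler {
    public BigDecimal total(List<BigDecimal> prices) {
        BigDecimal total = BigDecimal.ZERO;   // local variable, one per invocation
        for (BigDecimal p : prices) {
            total = total.add(p);
        }
        return total;
    }
}

// NOT safe as a shared singleton: concurrent consumers mutate the same field
// and will see each other's partial results.
class StatefulTotaler {
    private BigDecimal total = BigDecimal.ZERO;   // shared instance state

    public BigDecimal total(List<BigDecimal> prices) {
        for (BigDecimal p : prices) {
            total = total.add(p);
        }
        return total;
    }
}
```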
I would like to implement a mechanism in my server application but I'm not sure which OTL abstraction would be most appropriate.
My application collects data about various types of equipment.
Some of them use synchronous communication, thus generating a Delphi event in my server application (push-like).
Some of them use asynchronous communication, requiring my application to periodically request the latest data available (pull-like).
Because I want my server application to stay responsive while requesting the newly available data as frequently as possible, I want to put that "pull driver" in a separate thread that requests all the configured data points one by one.
I'd like my main thread to spawn this OTL object and then receive the result as a Delphi event in the main thread. This would emulate the push-like behavior my server's main code is already written for.
Think of it as a thread you launch that periodically requests a value you want to monitor and only sends you an event when the value has changed.
Which OTL abstraction (high level? low level?) do you think would be appropriate for this behavior?
Thank you.
I'm not sure OTL gives you much benefit here at all, to be honest. I write a lot of classes for managing hardware devices and the class model is almost invariably a plain TThread descendant. OTL is nice for spinning off tasks and work packages, queues, parallel calculations, etc. In this case, however, you don't want to do any of that. What you do want is a class that models your device and encapsulates the functions it can perform.
This is going to be a single worker thread dedicated to pumping reads and writes to the device. It is going to be a long-lived thread that will persist as long as the class that encapsulates the device remains alive - TThread makes sense for this. Your thread is going to be a simple loop that runs continuously, polling all the required data and flushing any write requests.
The class will also serve as a data cache for the device parameters, and you will need some sort of synchronization primitive (mutex, critical section, etc.) to protect reads and writes to those fields through properties, so again it makes sense that these sync objects also exist as class fields and that your thread and class model live together in a single entity. If you want event notifications, these too conveniently wrap into the same model. One device, one thread, one class. It's a perfect job for a TThread descendant.
I have a couple of questions regarding EJB transactions. I have a situation where a process has become longer-running than originally intended and is sometimes failing because server timeouts are exceeded. While I have increased the timeouts for now (both total transaction and max transaction), I know that for a long-running process it makes more sense to segment the work as much as possible into smaller units that don't fail based on a timeout. As a result, I'm looking for some thoughts or references regarding the next course of action based on the background below and the questions that follow.
Environment:
EJB 3.1, JPA 2.0, WebSphere 8.5
Background:
I built a set of POJOs to do some batch-oriented work for an enterprise application. They are non-EJB POJOs that were intended to implement several business processes (5 related, sequential processes, each depending on its predecessor). The POJOs are in a plain Java project, not an EJB project.
However, these POJOs access an EJB facade for database access via JPA. The abstract core of the 5 business processes does the JNDI lookup for the EJB facade in order to return the domain objects for processing. Originally, the design was to run entirely on the server; however, a need arose to initiate these processes externally. As a result, I created an EJB wrapper so that the processes could be called remotely (individually or as a single process based on a common strategy interface). Unfortunately, the size of the data, both row width and row count, has grown well beyond the original intent.
The processing time required to complete these batch processes has increased significantly (from a couple of hours to around half a day, and it could increase beyond that). Only one of the 5 processes made sense to multi-thread (I did implement it multi-threaded). Since I have the wrapper EJB to initiate one or all, I have decided to create a new container transaction for each process, as opposed to the single default transaction of "required" when I run all as a single process. Since the one process is multi-threaded, it would make sense to attempt to create a new transaction per thread; however, being plain POJOs, they have no transaction capability.
Question:
So my question is, what makes more sense and why? Re-engineer the POJOs to be EJBs themselves and have the wrapper EJB instantiate each process as a child process, where each can have its own transaction and, more importantly, the multi-threaded process can create a transaction per thread? Or does it make more sense to attempt to create a UserTransaction in the POJOs from a JNDI lookup in the container and try to manage it as if it were a bean-managed transaction (if that's even a viable solution)? I know this may be application-dependent, but what is reasonable with regard to timeouts for a Java EE container? Obviously, I don't want runaway processes, but I want to make sure that I can complete these batch processes.
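For reference, the bean-managed alternative mentioned above would look roughly like this in a POJO; whether the container actually permits it from code already running inside a container-managed transaction is exactly the open question (the class name is illustrative):

```java
import javax.naming.InitialContext;
import javax.transaction.UserTransaction;

public class BatchStep {
    public void run() throws Exception {
        // Look up the container-provided transaction from the component context.
        UserTransaction utx = (UserTransaction)
                new InitialContext().lookup("java:comp/UserTransaction");
        utx.begin();
        try {
            // ... one unit of batch work via the JPA/EJB facade ...
            utx.commit();
        } catch (Exception e) {
            utx.rollback();
            throw e;
        }
    }
}
```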
Unfortunately, this application has already been deployed as a production system. Re-engineering, though it may be little more than assembling the strategy logic in EJBs, is a large change to the functionality.
I did look around for some other threads here and via general internet searches, but thought I would see if anyone had compelling arguments for one over the other or another solution entirely. Additional links that talk about a topic such as this are appreciated. I wrestled with whether to post this since some may construe this as subjective, however, I felt the narrowed topic was worth the post and potentially relevant to others attempting processes like this.
This is not a direct answer to your question, but something you could consider.
WebSphere 8.5 provides a batch container especially for this kind of (batch) application. The batch function accommodates applications that must perform batch work alongside transactional applications. Batch work might take hours or even days to finish and can use large amounts of memory or processing power while it runs. You can reuse your Java classes in batch applications, batch steps can be run in parallel across a cluster, and there is transaction checkpoint management.
Take a look at following resources:
IBM Education Assistant - Batch applications
Getting started with the batch environment
Since I really didn't get a whole lot of response or thoughts for this question over the past couple of weeks, I figured I would answer this question to hopefully help others in making a decision if they run across this or a similar situation.
Ultimately, I re-engineered one of the POJOs into an EJB that acts as a wrapper to call the other POJOs. The wrapper EJB performs the same activity as when it was just a POJO, except that I added transaction semantics (REQUIRES_NEW) on the primary method. The primary method calls the other POJOs based on a strategy pattern, so each call (or POJO) gets its own transaction. Other methods in the EJB that call the primary method were defined with NOT_SUPPORTED so that I could separate the transactions for each call to the primary method and not join an existing transaction.
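In outline, such a wrapper might look roughly like the sketch below (names are illustrative; BatchProcess stands in for the real strategy interface):

```java
import javax.annotation.Resource;
import javax.ejb.SessionContext;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;

@Stateless
public class BatchWrapperBean {

    @Resource
    private SessionContext ctx;

    // Hypothetical stand-in for the strategy interface the POJOs implement.
    public interface BatchProcess {
        void execute();
    }

    // Each call gets its own container transaction, independent of the caller's.
    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public void runProcess(BatchProcess process) {
        process.execute();   // the POJO strategy, running inside this new transaction
    }

    // Suspends any incoming transaction so the REQUIRES_NEW above really applies.
    @TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)
    public void runAll(Iterable<BatchProcess> processes) {
        // Call through the container proxy, not "this", so the transaction
        // attribute of runProcess() is actually honored.
        BatchWrapperBean self = ctx.getBusinessObject(BatchWrapperBean.class);
        for (BatchProcess process : processes) {
            self.runProcess(process);
        }
    }
}
```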
Full disclosure: the original addition of transaction semantics significantly increased the processing time (on the order of days), but the process did not fail due to exceeding transaction timeouts. The increase was the result of some unexpected problems with JPA many-to-one relationships that were bringing back too much data, retrieved as a side effect of those relationships. As I mentioned originally, the row width of some of my data increased unexpectedly. That increase was in the related table's object, but the query did not need that data at the time. I corrected those issues by changing my queries (creating objects for SELECT NEW queries, changing relationships to FetchType.LAZY, etc.).
Going forward, if I am able to dedicate enough time, I will transform the rest of those POJOs into EJBs. The POJO doing the most significant amount of work that is threaded has been implemented with a Callable implementation that is run via an ExecutorService. If I can transform that one, the plan will be to make each thread its own transaction. However, while I'm not sure yet, it appears that my container may already be creating transactions for each thread group (of 10 threads) due to status updates I'm seeing. I will have to do more investigation.
As I've done more research into web server software, I've begun to question whether Apache's thread/process-based method is the way to go versus the asynchronous request handling provided by servers like Nginx and Lighttpd, which tend to scale better under heavier loads.
I understand there are many other differences between these latter two and Apache. My question is: under what circumstances would I pick a thread/process-based method over asynchronous handling?
Are there any features/technologies that I can't use with an asynchronous method (or would function poorly/not as well)?
What situations would cause an asynchronous method to perform worse than a thread/process-based approach? Are these common or rare cases, and how big is the difference?
Are there any other factors I should take into consideration when comparing the two? Keep in mind I'm focusing mainly on the thread/process based method vs. asynchronous, not any particular server software which happens to utilize one of these methods. These concerns might be difficulty of managing/debugging, security issues, etc.
This is old, but worth answering. Let's first start by saying how each model works.
In the threaded model, a request comes in to a handler, the handler spawns a new OS thread to handle that request, and all work for that request happens on that thread until a response is sent and the thread ends. This model supports as many concurrent requests as your server can spawn threads (but threads can be somewhat heavyweight).
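In Java terms, the threaded model boils down to something like this sketch (one thread per accepted connection; the port and request handling are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerConnectionServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();          // block until a client connects
                new Thread(() -> handle(client)).start(); // one OS thread per connection
            }
        }
    }

    private static void handle(Socket client) {
        try (Socket c = client;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(c.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // ... all work for this request happens on this thread ...
            }
        } catch (Exception e) {
            // log and drop the connection
        }
    }
}
```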
When doing async, a request comes in to a handler, but instead of creating a thread to deal with it, the handler adds the connection to what's known as an event loop. The event loop listens for data/state changes on the connection and fires callbacks each time "something" happens. Once the connection is added to the event loop, the handler immediately goes back to listening for new connections. This allows you to have many (sometimes 100K+) concurrent connections at the same time.
Are there any features/technologies that I can't use with an asynchronous method (or would function poorly/not as well)?
Yes, when you're doing number crunching. The architecture of an async (or "evented") system is such that it is great at passing data around but not processing data. It can handle thousands of concurrent operations, but because it only runs on one OS thread, the callbacks it fires need to do as little as possible to get the most throughput. This is because if one of your callbacks does some number crunching that takes 5 seconds, your entire server is frozen for 5 seconds until that operation completes. The idea is to get data, send it to where it's going (database, API, etc) and send a response all with minimal processing.
Async is good for network I/O: passing data between multiple sources/destinations (and also user interfaces, but that's beyond this post).
What situations would cause an asynchronous method to perform worse than a thread/process-based approach? Are these common or rare cases, and how big is the difference?
See above, but any time you're doing more CPU work than network I/O, you should switch to a threaded model. However, there are architecture workarounds...for instance, you could have an async app, and anytime it needs to do real work, it sends a job to a worker queue. However, if every request requires CPU processing then that architecture is overkill and you might as well just use a threaded server.
Are there any other factors I should take into consideration when comparing the two? Keep in mind I'm focusing mainly on the thread/process based method vs. asynchronous, not any particular server software which happens to utilize one of these methods. These concerns might be difficulty of managing/debugging, security issues, etc.
Programming in async is generally more complicated than threaded. That said, if you're not doing the programming yourself (ie you're choosing between nginx and apache) then I usually recommend you go async (nginx) because you'll generally be able to squeeze more juice out of your server that way. I'm always in favor of using as much async in the stack as possible.
That said, if you're programming an app and trying to decide whether to use a threaded or an async model, you will have to take developer time into account. Unless you're using a language that has green threads over an event loop (like Scheme), expect to tear your hair out quite a bit over rogue exceptions crashing your entire app and, in general, over wrapping your head around CPS and using callbacks for everything. Futures/promises are your friend, but they are only a band-aid to make async nicer.
TL;DR
Async, when used in a server, can squeeze (a lot) more concurrent operations than threading if you're doing network IO and nothing else.
If you're doing any kind of number crunching, either use a threaded app server or use an async app with a background queuing system.
Async is a lot harder to program in unless your language supports "fake" threading over it (ie green threads). Once you get past the initial hump you're fine, generally. If you don't have green threads, use promises.
If you have the choice between threaded and async as a component in your stack (apache vs nginx), and they provide the exact same features, slightly favor async. Don't just pick it because you think it will make everything 20x faster though.
Processes have several advantages compared to threads and async models related to security and reliability. Most websites don't need these particular advantages, but sometimes they're indispensable.
Security: you can run your worker processes in a sandbox, as a low-privileged user, and handle only one request per worker process. This mitigates some kinds of security vulnerabilities: even if an attacker takes over an entire worker process, as long as you sandboxed it tightly based on request metadata (i.e. it doesn't have write access to all your data), it can't harm system stability or affect the responses made to other requests.
Security #2: sometimes you need to sandbox untrusted code, or to enforce segregation between different code or different requests, and the only way to do this is with a separate one-shot process. (Think running user-provided code.)
Reliability: memory leaks and memory corruption are much less severe if you tear down and replace worker processes regularly (or for each request).
It's easy to enforce hard limits on CPU time, disk and network quota, etc. spent on handling a user request in a separate process. Even if the request-handling code goes into an infinite loop, the master process (or the OS) can enforce a timeout.