my main concern is on ThreadLocal.
akka actors don't bind to specific thread, so any use of thread local storage will be problematic on akka actors
does Hibernate use ThreadLocal?
can they coexist in this case?
Yes. I use Hibernate (via the JPA interface) together with Akka actors.
I have two approaches to deal with multi-threading:
If each actor updates information that cannot conflict with other actors (for example, each row in a table will never be used by multiple actors), you can create a single entity manager per actor and use that for the life of the actor. Although, if a transaction fails it may be best to discard the entity manager.
Otherwise, in each invocation of the actor (via it's receive function) that needs database access, create a new entity manager, begin a transaction, access the database, commit the transaction, and then close the entity manager.
I have the first approach running in production systems.
Related
In spring integration application, I am using concurrent consumers to consume and process the multiple messages at a time.
In my application, I configured all beans to a singleton. I am assuming if I am going to parallelize the processing by using the concurrent consumer's, multiple messages entered into same integration components.
Does it leads to data collision between two objects?
Does it leads to data collision between two objects?
No, that doesn't mean. If you don't do any state management in your components, then the is not going to be any collisions. Just because one thread can perform only one task at a time. So, if you use the same component in different threads to perform stateless work, there is no any inter-thread interaction. Just because each thread get its own call stack.
Akka persistence query complements Persistence by providing a universal asynchronous stream based query interface that various journal plugins can implement in order to expose their query capabilities.
This is the description of 'Akka Persistence Query' from akka documentation. What I wonder that is it using only for querying, in other words for read-side?
Probably the better question is why isn't Akka persistence query part of Akka persistence? What good is persistence if you can't query it?
Akka persistence seeks to solve one thing and one thing only, make the state of an actor persistent. It doesn't care whether that actor represents a domain entity, or if it's just an operational actor whose state needs to survive restarts (such as is the case for the cluster sharing manager, which used to store its state in Akka persistence before switching to distributed data). It's just a general purpose "make this actors state persistent" feature.
Now, a common use for Akka persistence is to implement event sourced domain entities. Event sourced domain entities however need more than just "make this actor persistent", they usually also need the ability to execute queries across domain entities. And so Akka persistence query exists to allow this, it allows creating cross entity streams that can be processed to populate read side views.
The thing is, not all event stores necessarily make it easy to do cross entity streaming like that. So those that don't can just implement Akka persistence, and be used to make actors persistent, while stores that provide more functionality can also implement Akka persistence query.
This is all greatly simplified, but hopefully explains some of the motivations.
The Akka Persistence Query can be used for querying the journal, and this will be in most of simpler cases sufficient enough.
Nevertheless the main purpose of the Persistence Query is to stream stored events to a specific, or multiple separate, read/query-stores where you will execute your actual queries against.
Aim: Pretend, I have a very popular page (let's say 1 million people per 5 minute) on my Azure Service Fabric based web application. I want to make some kind of cache layer between a data layer and frontend API layer.
Solution: For this purpose, I choose a Reliable Actor performing only one method for readonly operation: GetFrequentlyAskedPage(). This Actor has a volatile type and 5 minutes timeout to be replaced with Garbage Collector.
Questions:
How many read-operations can be handled by the Actor before it lay down?
Should I use in this case "read from secondary replicas" option for that Actor?
Or maybe I am totally wrong in my reasoning and should change the way of implementation.
I would not recommend using actors as a cache. Actor instances force single-threaded turn-based access, meaning an actor instance can only service one request at a time. This obviously will not perform well as a cache. See here for more info: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-reliable-actors-introduction/
Instead I would recommend using a stateful Reliable Service with a Reliable Dictionary to cache data, or better yet, use a stateful Reliable Service as your data layer, in which case you don't need this cache at all.
I'm spending my evenings evaluating Azure Service Fabric as a replacement for our current WebApps/CloudServices stack, and feel a little bit unsure about how to decide when services/actors with state should be stateful actors, and when they should be stateless actors with externally persisted state (Azure SQL, Azure Storage and DocumentDB). I know this is a fairly new product (to the general public at least), so there's probably not a lot of best practices in regards to this yet, but I've read through most of the documentation made available by Microsoft without finding a definite answer for this.
The current problem domain I'm approaching is our event store; parts of our applications are based on event sourcing and CQRS, and I'm evaluating how to move this event store over to the Service Fabric platform. The event store is going to contain a lot time series-data, and as it's our only source of truth for the data being persisted there it must be consistent, replicated and stored to some form of durable storage.
One way I have considered doing this is with stateful "EventStream" actor; each instance of an aggregate using event sourcing stores its events within an isolated stream. This means the stateful actor could keep track of all the events for its own stream, and I'd have met my requirements as to how the data is stored (transactional, replicated and durable). However, some streams may grow very large (hundreds of thousands, if not millions, of events), and this is where I'm starting to get unsure. Having an actor with a large amount of state will, I imagine, have impacts on the performance of the system when these large data models needs to be serialized to or deserialized from disk.
Another option is to keep these actors stateless, and have them just read their data from some external storage like Azure SQL - or just go with stateless services instead of actors.
Basically, when is the amount of state for an actor/service "too much" and you should start considering other ways of handling state?
Also, this section in the Service Fabric Actors design pattern: Some anti-patterns documentation leave me a little bit puzzled:
Treat Azure Service Fabric Actors as a transactional system. Azure Service Fabric Actors is not a two phase commit-based system offering ACID. If we do not implement the optional persistence, and the machine the actor is running on dies, its current state will go with it. The actor will be coming up on another node very fast, but unless we have implemented the backing persistence, the state will be gone. However, between leveraging retries, duplicate filtering, and/or idempotent design, you can achieve a high level of reliability and consistency.
What does "if we do not implement the optional persistance" indicate here? I was under the impression that as long as your transaction modifying the state succeeded, your data was persisted to durable storage and replicated to at least a subset of the replicas. This paragraph leaves me wondering if there are situations where state within my actors/services will get lost, and if this is something I need to handle myself. The impression I got from the stateful model in other parts of the documentation seems to counteract this statement.
One option that you have is to keep 'some' of the state in the actor (let's say what could be considered to be hot data that needs to be quickly available) and store everything else on a 'traditional' storage infrastructure such as SQL Azure, DocDB, ....
It is difficult to have a general rule about too much local state but, maybe, it helps to think about hot vs. cold data.
Reliable Actors also offer the ability to customize the StateProvider so you can also consider implementing a customized StateProvider (by implementing the IActorStateProvider) with the specific policies that you need to be more efficient with the requirements that you have in terms of amount of data, latency, reliability and so on (note: documentation is still very minimal on the StateProvider interface but we can publish some sample code if this is something you want to pursue).
About the anti-patterns: the note is more about implementing transactions across multiple actors. Reliable Actors provides full guarantee on reliability of the data within the boundaries of an actor. Because of the distributed and loosly coupled nature of the Actor model, implementing transactions that involve multiple actors is not a trivial task. If 'distributed' transactions is a strong requirement, the Reliable Services programming model is probably a better fit.
I know this has been answered, but recently found myself in the same predicament with a CQRS/ES system and here's how I went about it:
Each Aggregate was an actor with only the current state stored in it.
On a command, the aggregate would effect a state change and raise an event.
Events themselves were stored in a DocDb.
On activation, AggregateActor instances read events from DocDb if available to recreate its state. This is obviously only performed once per actor activation. This took care of the case where an actor instance is migrated from one node to another.
To answer #Trond's sedcondary question which is, "What does, 'if we do not implement the optional persistance' indicate here?"
An actor is always a stateful service, and its state can be configured, using an attribute on the actor class, to operate in one of three modes:
Persisted. The state is replicated to all replica instances, and it
also written to disk. This the state is maintained even if all
replicas are shut down.
Volatile. The state is replicated to all
replica instances, in memory only. This means as long as one replica
instance is alive the state is maintained. But when all replicas are
shut down the state is lost and cannot be recovered after they are
restarted.
No persistence. The state is not replicated to other
replica instances, nor to disk. This provides the least state
protection.
A full discussion of the topic can be found in the Microsoft documentation
I have a couple of questions regarding EJB transactions. I have a situation where a process has become longer running that originally intended and is sometimes failing due to server timeout's being exceeded. While I have increased the timeouts initially (both total transaction and max transaction), for a long running process, I know that it make more sense to segment this work as much as possible into smaller units of work that don't fail based on timeout. As a result, I'm looking for some thoughts or references regarding next course of action based on the background below and the questions that follow.
Environment:
EJB 3.1, JPA 2.0, WebSphere 8.5
Background:
I built a set of POJOs to do some batch oriented work for an enterprise application. They are non-EJB POJOs that were intended to implement several business processes (5 related, sequential processes, each depending on it's predecessor). The POJOs are in a plain Java project, not an EJB project.
However, these POJOs access an EJB facade for database access via JPA. The abstract core of the 5 business processes does the JNDI lookup for the EJB facade in order to return the domain objects for processing. Originally, the design was to run from the server completely, however, a need arose to initiate these processes externally. As a result, I created an EJB wrapper so that the processes could be called remotely (individually or as a single process based on a common strategy interface). Unfortunately, the size of the data, both row width and row count, has grown well beyond the original intent.
The processing time required to complete these batch processes has increased significantly (from around a couple of hours to around 1/2 a day and could increase beyond that). Only one of the 5 processes made sense to multi-thread (I did implement it multi-threaded). Since I have the wrapper EJB to initiate 1 or all, I have decided to create a new container transaction for each process as opposed to the single default transaction of "required" when I run all as a single process. Since the one process is multi-threaded, it would make sense to attempt to create a new transaction per thread, however, being a group of POJOs, I do not have transaction capability.
Question:
So my question is, what makes more sense and why? Re-engineer the POJOs to be EJBs themselves and have the wrapper EJB instantiate each process as a child process where each can have its own transaction and more importantly, the multi-threaded process can create a transaction per thread. Or does it make more sense to attempt to create a UserTransaction in the POJOs from a JNDI lookup in the container and try to manage it as if it were a bean managed transaction (if that's even a viable solution). I know this may be application dependent, but what is reasonable with regard to timeouts for a Java EE container? Obviously, I don't want run away processes, but want to make sure that I can complete these batch processes.
Unfortunatly, this application has already been deployed as a production system. Re-engineering, though it may be little more than assembling the strategy logic in EJBs, is a large change to the functionality.
I did look around for some other threads here and via general internet searches, but thought I would see if anyone had compelling arguments for one over the other or another solution entirely. Additional links that talk about a topic such as this are appreciated. I wrestled with whether to post this since some may construe this as subjective, however, I felt the narrowed topic was worth the post and potentially relevant to others attempting processes like this.
This is not direct answer to your question, but something you could consider.
WebSphere 8.5 especially for these kind of applications (batch) provides a batch container. The batch function accommodate applications that must perform batch work alongside transactional applications. Batch work might take hours or even days to finish and uses large amounts of memory or processing power while it runs. You can reuse your Java classes in batch applications, batch steps can be run in parallel in cluster and has transaction checkpoint management.
Take a look at following resources:
IBM Education Assistant - Batch applications
Getting started with the batch environment
Since I really didn't get a whole lot of response or thoughts for this question over the past couple of weeks, I figured I would answer this question to hopefully help others in making a decision if they run across this or a similar situation.
Ultimately, I re-engineered one of the POJOs into an EJB that acted as a wrapper to call the other POJOs. The wrapper EJB performed the same activity as when it was just a POJO, except for the fact that I added the transaction semantics (REQUIRES_NEW) on the primary method. The primary method calls the other POJOs based on a stategy pattern so each call (or POJO) gets its own transaction. Other methods in the EJB that call the primary method were defined with NOT_SUPPORTED so that I could separate the transactions for each call to the primary method and not join an existing transaction.
Full disclosure, the original addition of transaction semantics significantly increased the processing time (on the order of days), but the process did not fail due to exceeding transaction timeouts. It was the result of some unexpected problems with JPA Many-To-One relationships that were bringing back too much data. Data retreived as a result of a the Many-To-One relationship. As I mentioned originally, some of my data row width increased unexpectedly. That data increase was in the related table object, but the query did not need that data at the time. I corrected those issues by changing my queries (creating objects for SELECT NEW queries, changed relationships to FetchType.LAZY, etc).
Going forward, if I am able to dedicate enough time, I will transform the rest of those POJOs into EJBs. The POJO doing the most significant amount of work that is threaded has been implemented with a Callable implementation that is run via an ExecutorService. If I can transform that one, the plan will be to make each thread its own transaction. However, while I'm not sure yet, it appears that my container may already be creating transactions for each thread group (of 10 threads) due to status updates I'm seeing. I will have to do more investigation.