I am using Spring Batch with partitioning to do parallel processing, and Hibernate and Spring Data JPA for the database. For the partitioned step, the reader, processor and writer are step-scoped, so I can inject the partition key and range (from-to) into them. In the processor I have one synchronized method that I expected to be run by only one thread at a time, but that is not the case.
I set it up with 10 partitions; all 10 item readers read the correct partitioned range. The problem is with the item processor. The code below has the same logic I use.
public class AccountProcessor implements ItemProcessor<Item, Item> {

    private AccountRepository accountRepo;

    @Override
    public Item process(Item item) {
        createAccount(item);
        return item;
    }

    // Account has unique constraints on username, gender and email.
    /*
     * When one thread executes this method, it will create one account
     * and save it. If the next thread comes in and tries to save the same
     * account, it should find the account created by the first thread and
     * do an update instead. But that is not what happens: findIfExist
     * returns null and another insert of the duplicate data is attempted.
     */
    private synchronized void createAccount(Item item) {
        Account account = accountRepo.findIfExist(item.getUsername(), item.getGender(), item.getEmail());
        if (account == null) {
            // account doesn't exist yet
            account = new Account();
            account.setUsername(item.getUsername());
            account.setGender(item.getGender());
            account.setEmail(item.getEmail());
            account.setMoney(10000);
        } else {
            account.setMoney(account.getMoney() - 10);
        }
        accountRepo.save(account);
    }
}
The expected output is that only one thread runs this method at any given time, so there is no duplicate insertion in the db and no DataIntegrityViolationException.
The actual result is that the second thread can't find the account created by the first thread, tries to create a duplicate account and save it to the db, which causes a DataIntegrityViolationException (unique constraint violation).
Since I synchronized the method, the threads should execute it in order: the second thread should wait for the first thread to finish before running, which means it should be able to find the account created by the first thread.
I tried many approaches, like a volatile set that holds all unique accounts, calling saveAndFlush to commit as soon as possible, and using ThreadLocal; none of these work.
Need some help.
Since you made the item processor step-scoped, you don't really need synchronization, as each worker step execution will have its own instance of the processor.
But it looks like you have a design problem rather than an implementation issue. You are trying to synchronize threads to act in a certain order in a parallel setup. When you decide to go parallel and divide the data into partitions, giving each worker (either local or remote) a partition to work on, you must accept that these partitions will be processed in an undefined order and that there should be no relation between the records of each partition or between the work done by each worker.
When 1 thread executes that method, it will create 1 account and save it. If the next thread comes in and tries to save the same account, it should find the account created by the first thread and do an update. But now it doesn't happen; instead findIfExist returns null and it tries to do another insert of duplicate data
That's because the transaction of thread 1 may not be committed yet, hence thread 2 won't find the record you think has been inserted by thread 1.
It looks like you are trying to create or update some accounts with a partitioned setup. I'm not sure if this setup is suitable for the problem at hand.
As a side note, I would not call accountRepo.save(account); in an item processor but rather do that in an item writer.
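For example, here is a minimal sketch of what that could look like, assuming Spring Batch 4's ItemWriter signature and the Spring Data repository from the question (saveAll is the Spring Data 2.x name); the processor would then return the new or updated Account instead of saving it, and the writer would persist the whole chunk:

import java.util.List;
import org.springframework.batch.item.ItemWriter;

public class AccountItemWriter implements ItemWriter<Account> {

    private final AccountRepository accountRepo;

    public AccountItemWriter(AccountRepository accountRepo) {
        this.accountRepo = accountRepo;
    }

    @Override
    public void write(List<? extends Account> accounts) throws Exception {
        // persist the whole chunk at once; Spring Batch commits the
        // chunk's transaction after the write completes
        accountRepo.saveAll(accounts);
    }
}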
Hope this helps.
I am using Hazelcast for distributed caching (only one instance). In one scenario I am trying to update a value in a map.
Based on https://stackoverflow.com/a/33351291/1212903 it seems I have to use an EntryProcessor when doing update operations, since it is atomic.
Why do I have to use an EntryProcessor even though IMap is distributed?
In the entry processor code I don't really understand the purpose of the backup processor from the documentation, given that the map is distributed.
Why does the process method return an Object when the return value has no effect (we have to call entry.setValue() to actually update the value)?
public class AnalysisResponseProcessor implements EntryProcessor<String, AnalysisResponseMapper> {

    @Override
    public Object process(Map.Entry<String, AnalysisResponseMapper> entry) {
        AnalysisResponseMapper analysisResponseMapper = entry.getValue();
        analysisResponseMapper.increaseCount();
        entry.setValue(analysisResponseMapper);
        return analysisResponseMapper;
    }

    @Override
    public EntryBackupProcessor<String, AnalysisResponseMapper> getBackupProcessor() {
        return null;
    }
}
How to deal with this scenario?
Answers to your questions:
Whether the map is distributed or not, it can be accessed concurrently. If you do a separate get and put, someone else can modify the value in the meantime and you will overwrite their update. If you use an EntryProcessor, you can read and update the value in one atomic operation. If only one client updates the map, you can get away with get and put. The entry processor also needs one network round-trip instead of two.
You can return null for the backup processor if you have 0 backups. But if you ever decide to add a backup, the backup will not be updated. It might be easier to extend AbstractEntryProcessor, where you don't have to deal with the backup processor at all: it executes the same logic on the main and backup replicas. It's only worth implementing the backup processor manually if the computation inside the process method is heavy.
The return value of the process() method isn't the updated entry value; it is the value that will be returned from the map.executeOnKey() method. If you don't need it, just return null.
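For example, a minimal sketch of the processor rewritten on top of AbstractEntryProcessor (Hazelcast 3.x, the same API generation as the EntryBackupProcessor in the question); the same logic then runs on the backup replica automatically:

public class AnalysisResponseProcessor extends AbstractEntryProcessor<String, AnalysisResponseMapper> {

    @Override
    public Object process(Map.Entry<String, AnalysisResponseMapper> entry) {
        AnalysisResponseMapper analysisResponseMapper = entry.getValue();
        analysisResponseMapper.increaseCount();
        // write the modified value back so the change is stored in the map
        entry.setValue(analysisResponseMapper);
        // this return value is only what executeOnKey() hands back to the caller
        return null;
    }
}

It would then be invoked with something like map.executeOnKey("some-key", new AnalysisResponseProcessor());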
Is my application service, which obtains a lock using the JDBC LockRepository, supposed to run inside an @Transactional method?
We have a sample application service that updates a JDBC repository, and since this application can run on multiple JVMs (headless), we needed a global lock to serialize those updates.
I looked at your test and was hoping my use case would work too. ... JdbcLockRegistryDifferentClientTests
My config has a DefaultLockRepository and a JdbcLockRegistry.
I launched my application (java -jar boot.jar) in two terminals to simulate two instances. When I obtain a lock and issue a tryLock() without @Transactional on my application service, both of them get the lock (albeit one after the other, almost immediately). I expected one of them NOT to get it for at least 10 seconds (the default expiry).
Service (Instance -1) {
Obtain("KEY-1")
tryLock()
DoWork()
unlock();
close();
}
Service (Instance -2) {
Obtain("KEY-1")
tryLock() <-- Wait until the lock expires or the unlock happens
DoWork()
unlock();
close();
}
I also noticed in DefaultLockRepository that the transaction scope (if not inherited) is only around the JDBC operation.
When I change my service to
@Transactional
Service (Instance -1) {
Obtain("KEY-1")
tryLock()
DoWork()
unlock();
close();
}
It works as expected.
I am quite sure I missed something, but I expect my lock operation to honor global locks (the fact that a lock exists in a JDBC store with an expiration) until an unlock or expiration.
Is my understanding incorrect?
This works as designed. I didn't configure the DefaultLockRepository correctly and the default TTL was shorter than my service's (artificial wait) lock duration. My apologies. :) Josh Long helped me figure this out :)
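For reference, a minimal sketch of the registry configuration with a TTL long enough to cover the work done while the lock is held (the 30-second value is purely illustrative):

@Bean
public DefaultLockRepository lockRepository(DataSource dataSource) {
    DefaultLockRepository lockRepository = new DefaultLockRepository(dataSource);
    // the time-to-live must exceed the longest expected lock hold time
    lockRepository.setTimeToLive(30000);
    return lockRepository;
}

@Bean
public JdbcLockRegistry jdbcLockRegistry(LockRepository lockRepository) {
    return new JdbcLockRegistry(lockRepository);
}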
You have to use different client ids. The same id means the same client; that is for a special use case. Use different client ids, as they are different instances.
The behavior here is subtle (or obvious once you see how it works) and the general lack of documentation is unhelpful, so here's my experience.
I created a lock table by looking at the SQL in DefaultLockRepository, which appeared to imply a composite primary key of REGION, LOCK_KEY and CLIENT_ID - THIS WAS WRONG.
I subsequently found the SQL script in the spring-integration-jdbc JAR, where I could see that the composite primary key MUST BE on just REGION and LOCK_KEY, as @ArtemBilan says.
The reason is that the lock doesn't care about the client, obviously, so the primary key must be just the REGION and LOCK_KEY columns. These columns are used when acquiring a lock, and it is the key violation raised when another client tries to insert the same lock row that keeps the other client IDs out.
This also implies that, again as @ArtemBilan says, each client instance must have a unique ID, which is the default behavior when no ID is specified at construction time.
I have a multi-threaded batch job reading from a DB, and I am concerned about different threads re-reading records, as ItemReader is not thread-safe in Spring Batch. I went through the Spring Batch FAQ section, which states that:
You can synchronize the read() method (e.g. by wrapping it in a delegator that does the synchronization). Remember that you will lose restartability, so best practice is to mark the step as not restartable and to be safe (and efficient) you can also set saveState=false on the reader.
I want to know why I will lose restartability in this case. What has restartability got to do with synchronizing my read operations? It can always try again, right?
Also, will this piece of code be enough for synchronizing the reader?
public class SynchronizedItemReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedItemReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }
}
When using an ItemReader with multiple threads, the lack of restartability is not about the read itself. It's about saving the state of the reader, which happens in the update method. The issue is that there needs to be coordination between the calls to read() - the method providing the data - and update() - the method persisting the state. When you use multiple threads, the internal state of the reader (and therefore the update() call) may or may not reflect the work that has been done. Take for example the FlatFileItemReader using a chunk size of 5 and running on multiple threads. You could have thread 1 having read 5 items (time to update), yet thread 2 could have read an additional 3. This means the call to update would record that 8 items have been read. If the chunk on thread 2 fails, the state would be incorrect and the restart would miss the three items that were already read.
This is not to say that it is impossible to write a thread-safe ItemReader. However, as your example above illustrates, if the delegate is a stateful ItemReader (one that also implements ItemStream), the state will not be persisted correctly by the calls to update (in fact, your example above doesn't even take the ItemStream aspect of stateful readers into account).
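For illustration only, a wrapper that at least delegates the ItemStream callbacks could look like the sketch below (the class name is made up; Spring Batch later shipped a SynchronizedItemStreamReader along similar lines). Note this still does not make the saved state meaningful across threads, for the reasons above:

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamReader;

public class SynchronizedItemStreamReaderSketch<T> implements ItemStreamReader<T> {

    private final ItemStreamReader<T> delegate;

    public SynchronizedItemStreamReaderSketch(ItemStreamReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }

    @Override
    public void open(ExecutionContext executionContext) {
        delegate.open(executionContext);
    }

    @Override
    public void update(ExecutionContext executionContext) {
        // still racy with respect to read() across threads
        delegate.update(executionContext);
    }

    @Override
    public void close() {
        delegate.close();
    }
}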
If you want to make your job restartable with parallel processing of items, you can save each item the reader has read, plus its processing state, yourself.
I want to use the master-slave (worker) paradigm to solve a problem. I have read that manually opening new threads (for example using a thread pool) is not available and that I need to use a task queue; attached is a code example:
class MyDeferred implements DeferredTask {
    @Override
    public void run() {
        // Do something interesting
    }
}

MyDeferred task = new MyDeferred();
// Set instance variables etc. as you wish
Queue queue = QueueFactory.getDefaultQueue();
queue.add(TaskOptions.Builder.withPayload(task));
How can I get the result of the workers (which were added to the queue)?
I need this info in order to solve the bigger problem.
Actually, you can use threads on GAE, but there are limitations. If you need long-running threads you can use background threads, but this requires you to use backend instances.
If you opt to use the task queue, then keep in mind that tasks do not "return" to the caller. To aggregate results you'll need to use the datastore.
You will have to write the results into the datastore.
Just as a starting point to think about it, you might pass a JobId as a parameter to the tasks, have each task write an entity with the result and the JobId, and then later query the datastore for the given JobId to get all the results.
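A minimal sketch of that idea using the low-level Datastore API; the "TaskResult" kind, the property names and the jobId/computedValue variables are all made up for illustration:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

// inside each task: store this worker's result under the shared job id
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity result = new Entity("TaskResult");
result.setProperty("jobId", jobId);
result.setProperty("value", computedValue);
datastore.put(result);

// later, in the caller: collect all results for that job id
Query query = new Query("TaskResult")
        .setFilter(new FilterPredicate("jobId", FilterOperator.EQUAL, jobId));
for (Entity entity : datastore.prepare(query).asIterable()) {
    // aggregate entity.getProperty("value") into the bigger result
}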
I've got a computation (CTR encryption) that requires results in a precise order.
For this I created a multithreaded design that calculates said results; in this case the result is a ByteBuffer. The calculation itself of course runs asynchronously, so the results may become available at any time and in any order. The "user" is a single-threaded application that uses the results by calling a method, after which the ByteBuffers are returned to the pool of resources by said method - the management of resources is already handled (using a thread-safe stack).
Now the question: I need something that aggregates the results and makes them available in the right order. If the next result is not available, the method that the user called should block until it is. Does anyone know a good strategy or class in java.util.concurrent that can return asynchronously calculated results in order?
The solution must be thread safe. I would like to avoid third-party libraries, Thread.sleep() / Thread.wait() and threading-related keywords other than "synchronized". Furthermore, the tasks may be given to e.g. an Executor in the correct order if that is required. This is for research, so feel free to use Java 1.6 or even 1.7 constructs.
Note: I've tagged these questions [jre] as I want to keep within the classes defined in the JRE, and [encryption] as somebody may already have had to deal with it, but the question itself is purely about Java and multi-threading.
Use the executors framework:
ExecutorService executorService = Executors.newFixedThreadPool(5);
List<Future<ByteBuffer>> futures = executorService.invokeAll(listOfCallables);
for (Future<ByteBuffer> future : futures) {
//do something with future.get();
}
executorService.shutdown();
The listOfCallables will be a List<Callable<ByteBuffer>> that you have constructed to operate on the data. For example:
list.add(new SubTaskCalculator(1, 20));
list.add(new SubTaskCalculator(21, 40));
list.add(new SubTaskCalculator(41, 60));
(arbitrary ranges of numbers, adjust that to your task at hand)
.get() blocks until the result is complete, but at the same time other tasks are also running, so when you reach them, their .get() will be ready.
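Putting the pieces together, a self-contained sketch might look like the following; SubTaskCalculator here is a stand-in Callable that pretends to encrypt a range and return its ByteBuffer:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedResultsDemo {

    static class SubTaskCalculator implements Callable<ByteBuffer> {
        private final int from;
        private final int to;

        SubTaskCalculator(int from, int to) {
            this.from = from;
            this.to = to;
        }

        public ByteBuffer call() {
            // placeholder for the real CTR computation over [from, to)
            return ByteBuffer.allocate(to - from);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executorService = Executors.newFixedThreadPool(5);
        List<Callable<ByteBuffer>> tasks = new ArrayList<Callable<ByteBuffer>>();
        tasks.add(new SubTaskCalculator(1, 20));
        tasks.add(new SubTaskCalculator(21, 40));
        tasks.add(new SubTaskCalculator(41, 60));

        // invokeAll returns the futures in the same order as the task list,
        // so iterating and calling get() hands back the results in order
        List<Future<ByteBuffer>> futures = executorService.invokeAll(tasks);
        for (Future<ByteBuffer> future : futures) {
            ByteBuffer result = future.get(); // blocks until this task is done
            // pass result to the single-threaded consumer here
        }
        executorService.shutdown();
    }
}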
Returning the results in the right order is trivial. As each result arrives, store it in an ArrayList, and once you have ALL the results, just sort the ArrayList. You could use a PriorityQueue to keep the results sorted at all times as they arrive, but there is no point in doing so, since you will not be making any use of the results before all of them have arrived anyway.
So, what you could do is this:
Declare a "WorkItem" class which contains one of your bytearrays and its ordinal number, so that they can be sorted by ordinal number.
In your work threads, do something like this:
...do work and produce a work_item...
synchronized( LockObject )
{
ResultList.Add( work_item );
number_of_results++;
LockObject.notifyAll();
}
In your main thread, do something like this:
synchronized( LockObject )
{
    while( number_of_results != number_of_items )
        LockObject.wait();
}
ResultList.Sort();
...go ahead and use the results...
My new answer after gaining a better understanding of what you want to do:
Declare a "WorkItem" class which contains one of your bytearrays and its ordinal number, so that they can be sorted by ordinal number.
Make use of a java.util.PriorityQueue which is kept sorted by ordinal number. Essentially, all we care about is that the first item in the priority queue at any given time will be the next item to process.
Each work thread stores its result in the PriorityQueue and issues a NotifyAll on some locking object.
The main thread waits on the locking object, and then if there are items in the queue, and if the ordinal of the (peeked, not dequeued) first item in the queue is equal to the number of items processed so far, then it dequeues the item and processes it. If not, it keeps waiting. If all of the items have been produced and processed, it is done.
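A minimal sketch of that aggregator, assuming a WorkItem with an ordinal as described (the class and method names are illustrative, not an established API):

import java.nio.ByteBuffer;
import java.util.PriorityQueue;

public class OrderedResultAggregator {

    static final class WorkItem implements Comparable<WorkItem> {
        final int ordinal;
        final ByteBuffer data;

        WorkItem(int ordinal, ByteBuffer data) {
            this.ordinal = ordinal;
            this.data = data;
        }

        public int compareTo(WorkItem other) {
            return this.ordinal < other.ordinal ? -1 : (this.ordinal > other.ordinal ? 1 : 0);
        }
    }

    private final PriorityQueue<WorkItem> queue = new PriorityQueue<WorkItem>();
    private int nextOrdinal = 0;

    // called by worker threads whenever a result becomes available
    public synchronized void put(int ordinal, ByteBuffer data) {
        queue.add(new WorkItem(ordinal, data));
        notifyAll();
    }

    // called by the single-threaded consumer; blocks until the result
    // with the next ordinal is at the head of the queue
    public synchronized ByteBuffer take() throws InterruptedException {
        while (queue.isEmpty() || queue.peek().ordinal != nextOrdinal) {
            wait();
        }
        nextOrdinal++;
        return queue.poll().data;
    }
}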