I have a Hibernate application that may produce concurrent inserts and updates (via Session.saveOrUpdate) to records with the same primary key, which is assigned. These transactions are somewhat long-running, perhaps 15 seconds on average (since data is collected from remote sources and persisted as it comes in). My DB isolation level is set to Read Committed, and I'm using MySQL and InnoDB.
The problem is that this scenario creates excessive lock waits which time out, either as a result of a deadlock or of the long transactions. This leads me to a few questions:
Does the database engine only release its locks when the transaction is committed?
If this is the case, should I seek to shorten my transactions?
If so, would it be a good practice to use separate read and write transactions, where the write transaction could be made short and only takes place after all of my data is gathered (the bulk of my transaction length involves collecting remote data)?
Edit:
Here's a simple test that approximates what I believe is happening. Since I'm dealing with long-running transactions, commit takes place long after the first flush. So just to illustrate my situation I left commit out of the test:
@Entity
static class Person {
    @Id
    Long id = Long.valueOf(1);

    @Version
    private int version;
}
@Test
public void updateTest() {
    for (int i = 0; i < 5; i++) {
        new Thread() {
            public void run() {
                Session s = sf.openSession();
                Transaction t = s.beginTransaction();
                Person p = new Person();
                s.saveOrUpdate(p);
                s.flush(); // Waits...
            }
        }.run();
    }
}
And the queries this produces, with the second insert waiting as expected:
select id, version from person where id=?
insert into person (version, id) values (?, ?)
select id, version from person where id=?
insert into person (version, id) values (?, ?)
That's correct: the database releases locks only when the transaction is committed. Since you're using Hibernate, you can use optimistic locking, which does not lock the database for long periods of time. Essentially, Hibernate does what you suggest and separates the reading and writing portions into separate transactions. On write, it checks that the data in memory has not been changed concurrently in the database.
Hibernate Reference - Optimistic Transactions
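A minimal sketch of that read-then-write pattern with the versioned Person entity from the question (assuming a row with that id already exists and a SessionFactory named sf; the method name is illustrative, not part of any API):

```java
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.StaleObjectStateException;
import org.hibernate.Transaction;

// Read in one short transaction, gather remote data with no DB locks held,
// then write in a second short transaction; the @Version field detects concurrent changes.
static void gatherAndPersist(SessionFactory sf, long personId) {
    // Short read transaction: load the entity, then detach it by closing the session.
    Session readSession = sf.openSession();
    Transaction readTx = readSession.beginTransaction();
    Person p = (Person) readSession.get(Person.class, Long.valueOf(personId));
    readTx.commit();
    readSession.close();

    // ... the slow part: collect data from remote sources, with no transaction open ...

    // Short write transaction: reattach and update.
    Session writeSession = sf.openSession();
    Transaction writeTx = writeSession.beginTransaction();
    try {
        writeSession.update(p); // the version column is checked at flush/commit time
        writeTx.commit();
    } catch (StaleObjectStateException e) {
        // Someone committed a newer version in the meantime; decide whether to retry or merge.
        writeTx.rollback();
    } finally {
        writeSession.close();
    }
}
```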
Optimistic locking:
Base assumption: update conflicts occur seldom.
Mechanics:
1. Read the dataset, including its version field.
2. Change the dataset.
3. Update the dataset:
3.1. Match the record by its key and the version value you originally read.
If you get it, nobody has changed the record: apply the next version value and update the record.
If you do not get it, the record has been changed; return an appropriate message to the caller and you are done.
Inserts are not affected: you either have a separate primary key anyway, or you accept multiple records with identical values.
Therefore the example given above is not a case for optimistic locking.
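Expressed directly in JDBC, the mechanics above boil down to a version-conditioned update. A minimal sketch, assuming an illustrative person table with a data column (not part of the question's schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Returns true if the optimistic update won, false if the row was changed concurrently.
static boolean updateWithVersionCheck(Connection connection, long id,
                                      String newData, int versionReadEarlier) throws SQLException {
    String sql = "UPDATE person SET data = ?, version = version + 1 WHERE id = ? AND version = ?";
    try (PreparedStatement ps = connection.prepareStatement(sql)) {
        ps.setString(1, newData);
        ps.setLong(2, id);
        ps.setInt(3, versionReadEarlier);
        return ps.executeUpdate() == 1; // 1 row: nobody changed it; 0 rows: conflict.
    }
}
```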
I have a simple use case. I have a system where duplicate requests to a REST service (with dozens of instances) are not allowed. However, they are also difficult to prevent because of a complicated datastore configuration and downstream services.
So the only way I can prevent duplicate "transactions" is to have some centralized place where I write a unique hash of the request data. Each REST endpoint first checks if the hash of a new request already exists and only proceeds if no such hash exists.
For purposes of this question assume that it's not possible to do this with database constraints.
One solution is to create a table in the database where I store my request hashes and always write to this table before proceeding with the request. However, I want something lighter than that.
Another solution is to use something like Redis and write my unique hashes to Redis before proceeding with the request. However, I don't want to spin up a Redis cluster and maintain it, etc.
I was thinking of embedding Hazelcast in each of my app instances and write my unique hashes there. In theory, all instances will see the hash in the memory grid and will be able to detect duplicate requests. This solves my problem of having a lighter solution than a database and the other requirement of not having to maintain a Redis cluster.
OK, now for my question, finally: is it a good idea to use Hazelcast for this use case?
Will Hazelcast be fast enough to detect duplicate requests that come in milliseconds or microseconds apart?
Say request 1 comes into instance 1 and request 2 comes into instance 2 microseconds apart. Instance 1 writes a hash of the request to Hazelcast, and instance 2 checks Hazelcast for the existence of that hash only milliseconds later; will the hash be detected? Is Hazelcast going to propagate the data across the cluster in time? Does it even need to do that?
Thanks in advance, all ideas are welcome.
Hazelcast is definitely a good choice for this kind of use case. Especially if you just use a Map<String, Boolean> and test with Map::containsKey instead of retrieving the element and checking for null. You should also set a TTL when putting the element, so you won't run out of memory. However, same as with Redis, we recommend using Hazelcast as a standalone cluster for "bigger" datasets, as the lifecycle of cached elements normally interferes with the rest of the application and complicates GC optimization. Running Hazelcast embedded is a choice that should be taken only after serious consideration and testing of your application at runtime.
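A minimal sketch of that idea (assuming an embedded member and an illustrative map name "request-hashes"); note it uses the atomic putIfAbsent with a TTL rather than a separate containsKey check, which avoids a race between check and put:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.util.concurrent.TimeUnit;

public class DuplicateRequestGuard {

    private final IMap<String, Boolean> seenHashes;

    public DuplicateRequestGuard(HazelcastInstance hz) {
        this.seenHashes = hz.getMap("request-hashes");
    }

    /**
     * Returns true if this hash has not been seen before and was claimed by this call.
     * putIfAbsent is atomic cluster-wide, so two instances racing on the same hash
     * cannot both see a null ("first") result.
     */
    public boolean tryClaim(String requestHash) {
        Boolean previous = seenHashes.putIfAbsent(requestHash, Boolean.TRUE, 10, TimeUnit.MINUTES);
        return previous == null;
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        DuplicateRequestGuard guard = new DuplicateRequestGuard(hz);
        System.out.println(guard.tryClaim("abc123")); // true: first time
        System.out.println(guard.tryClaim("abc123")); // false: duplicate
        hz.shutdown();
    }
}
```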
Yes, you can use a Hazelcast distributed map to detect duplicate requests to a REST service: whenever there is a put operation on the Hazelcast map, the data becomes available to all the other clustered instances.
From what I've read and seen in the tests, it doesn't actually replicate. It uses a data grid to distribute the primary data evenly across all the nodes rather than each node keeping a full copy of everything and replicating to sync the data. The great thing about this is that there is no data lag, which is inherent to any replication strategy.
There is a backup copy of each node's data stored on another node, and that obviously depends on replication, but the backup copy is only used when a node crashes.
See the code below, which creates two clustered Hazelcast instances and gets the distributed map. One Hazelcast instance puts data into the distributed IMap and the other instance reads it from the IMap.
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class TestHazelcastDataReplication {

    // Create 1st instance
    public static final HazelcastInstance instanceOne = Hazelcast
            .newHazelcastInstance(new Config("distributedFisrtInstance"));

    // Create 2nd instance
    public static final HazelcastInstance instanceTwo = Hazelcast
            .newHazelcastInstance(new Config("distributedSecondInstance"));

    // Insert into distributedMap using instance one
    static IMap<Long, Long> distributedInsertMap = instanceOne.getMap("distributedMap");

    // Read from distributedMap using instance two
    static IMap<Long, Long> distributedGetMap = instanceTwo.getMap("distributedMap");

    public static void main(String[] args) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                for (long i = 0; i < 100000; i++) {
                    // Inserting data into distributedMap using 1st instance
                    distributedInsertMap.put(i, System.currentTimeMillis());
                    // Reading data from distributedMap using 2nd instance
                    System.out.println(i + " : " + distributedGetMap.get(i));
                }
            }
        }).start();
    }
}
I have two methods in my server application:
boolean isMessageExist(messageId), which executes the query below:
SELECT messageId from message Where messageId =1;
insertMessage(int messageId, String data), which executes the query below:
INSERT INTO message (messageId,data) VALUES (1, xyz);
In my code I do the following to meet the requirement that a message is "only inserted if it does not exist":
if (!isMessageExist(1)) {
    insertMessage(1, "xyz");
}
But the above code does not work if requests for the same messageId arrive almost simultaneously.
I.e. at time T0 the Read1(1), Write1(1) and Read2(1), Write2(1) all happen at the same time, since the two requests were sent from the client at the same time. Is there a way to make those requests sequential on the server side? I mean, Read2(1) should always see the result of Write1(1).
I don't want to use a CAS operation (IF NOT EXISTS) due to its performance overhead.
Is there any other way to achieve my requirement? Please suggest.
Using Cassandra's lightweight transactions (LWT) with IF NOT EXISTS should be both less expensive than what you are currently doing and satisfy your requirement for uniqueness.
INSERT INTO message( messageId, data ) VALUES ( 1, xyz ) IF NOT EXISTS
You can test and verify the performance, but two round trips (read, write) is almost certainly more expensive than a single INSERT ... IF NOT EXISTS.
Alternatively, if you can redesign your application so it uses upserts, where new values simply overwrite old data, that would be even better and more in keeping with Cassandra's native style.
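With the DataStax Java driver (the same driver used in the batch example in the next question), the outcome of the LWT can be inspected per request via wasApplied(). A minimal sketch, assuming a connected Session and the message table from the question:

```java
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

// Returns true if this call actually inserted the message, false if it already existed.
static boolean insertIfAbsent(Session session, int messageId, String data) {
    // In real code, prepare once and reuse; prepared inline here to keep the sketch short.
    PreparedStatement ps = session.prepare(
            "INSERT INTO message (messageId, data) VALUES (?, ?) IF NOT EXISTS");
    BoundStatement bound = ps.bind(messageId, data);
    ResultSet rs = session.execute(bound);
    // For conditional (LWT) statements the driver reports whether the condition held.
    return rs.wasApplied();
}
```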
I was following this link to do batch-style inserts without using the BATCH keyword.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .build();
Session session = cluster.newSession();

// Save off the prepared statement you're going to use
PreparedStatement statement = session.prepare(
        "INSERT INTO tester.users (userID, firstName, lastName) VALUES (?,?,?)");

List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
for (int i = 0; i < 1000; i++) {
    // please bind with whatever actually useful data you're importing
    BoundStatement bind = statement.bind(i, "John", "Tester");
    ResultSetFuture resultSetFuture = session.executeAsync(bind);
    futures.add(resultSetFuture);
}

// not returning anything useful, but makes sure everything has completed before you exit the thread
for (ResultSetFuture future : futures) {
    future.getUninterruptibly();
}
cluster.close();
My question is: with the given approach, is it possible to INSERT, UPDATE, or DELETE data in different tables so that if any of those operations fails they all fail, while maintaining the same performance (as described in the link)?
With this approach, what I tried was inserting into and deleting from different tables; one query failed, but all the previous queries had already executed and updated the DB.
With BATCH I can see that if any statement fails, all statements fail. But using BATCH across different tables is an anti-pattern, so what is the solution?
With BATCH I can see that if any statement fails, all statements fail.
Wrong. The guarantee of a LOGGED BATCH is: if some statements in the batch fail, they will be retried until they succeed.
But using BATCH across different tables is an anti-pattern, so what is the solution?
ACID transactions are not possible with Cassandra; they would require some sort of global lock or global coordination and would be prohibitively expensive performance-wise.
However, if you don't care about the performance cost, you can implement a global lock/lease system yourself using lightweight transaction primitives, as described here.
But be ready to face poor performance.
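For reference, if you decide the logged-batch guarantee described above is acceptable for your case, issuing one with the same Java driver as in the question looks roughly like this (a sketch; the second table and the statements are illustrative, not from the question):

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// A LOGGED batch is written to Cassandra's batch log first, so once accepted,
// all statements will eventually be applied (retried until they succeed).
// This is eventual atomicity, not an ACID rollback.
static void writeBoth(Session session, int userId, String firstName, int messageId, String data) {
    PreparedStatement insertUser = session.prepare(
            "INSERT INTO tester.users (userID, firstName, lastName) VALUES (?, ?, ?)");
    PreparedStatement insertMessage = session.prepare(
            "INSERT INTO tester.messages (messageId, data) VALUES (?, ?)");

    BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
    batch.add(insertUser.bind(userId, firstName, "Tester"));
    batch.add(insertMessage.bind(messageId, data));
    session.execute(batch);
}
```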
I'm trying to send a simple batch of Insert operations to Azure Table Storage, but it seems that the whole batch transaction is invalidated and, using the managed Azure storage client, the ExecuteBatch method itself throws an exception if a single Insert in the batch targets a pre-existing record (using the 2.0 client):
public class SampleEntity : TableEntity
{
    public SampleEntity(string partKey, string rowKey)
    {
        this.PartitionKey = partKey;
        this.RowKey = rowKey;
    }
}

var acct = CloudStorageAccount.DevelopmentStorageAccount;
var client = acct.CreateCloudTableClient();
var table = client.GetTableReference("SampleEntities");

var foo = new SampleEntity("partition1", "preexistingKey");
var bar = new SampleEntity("partition1", "newKey");

var batchOp = new TableBatchOperation();
batchOp.Add(TableOperation.Insert(foo));
batchOp.Add(TableOperation.Insert(bar));
var result = table.ExecuteBatch(batchOp); // throws exception: "0:The specified entity already exists."
The batch-level exception is avoided by using InsertOrMerge, but then every individual operation response returns a 204, regardless of whether that particular operation resulted in an insert or a merge. So it seems it's impossible for the client application to know whether it, or another node in the cluster, inserted the record. Unfortunately, in my current case, this knowledge is necessary for some downstream synchronization.
Is there some configuration or technique to allow the batch of inserts to proceed and return the particular response code per-item without throwing a blanket exception?
As you already know, since a batch is a transactional operation you get an all-or-nothing kind of deal. One interesting thing about batch transactions is that you get the index of the first failed entity in the batch. So assuming you're trying to insert 100 entities in a batch and the 50th entity is already present in the table, the batch operation will give you the index of the failed entity (49 in this case).
Is there some configuration or technique to allow the batch of inserts to proceed and return the particular response code per-item without throwing a blanket exception?
I don't think so. The transaction would fail as soon as the first entity fails. It will not even attempt to process other entities.
Possible Solutions (Just thinking out loud :))
If I understand correctly, your key requirement is to identify if an entity was inserted or merged (or replaced). For this the approach would be to separate out failed entities from a batch and process them separately. Based on this, I can think of two approaches:
1. What you could possibly do in this case is split that batch into 3 batches: the 1st batch will contain 49 entities, the 2nd batch will contain just 1 entity (the one that failed), and the 3rd batch will contain 50 entities. You could now insert all entities in the 1st batch, decide what you want to do with the failed entity, and try to insert the 3rd batch. You would need to repeat the process over and over again until the operation is complete.
2. Another idea would be to remove the failed entity from the batch and retry the batch. So in the example above, in your 1st attempt you'll try with 100 entities, in your 2nd attempt with 99 entities, and so on and so forth, keeping track of the failed entities all the while (along with the reason why they failed). Once the batch operation completes successfully, you can work with all the failed entities.
I'd need to understand if/how a call to MutationBatch.execute() is safe against the server running the code going down.
Have a look at the code below (copied from the Astyanax examples). I intend to use this code to modify 2 rows in 2 different column families. I need to ensure (100%) that if the server executing this code crashes/fails at any point during the execution either:
- nothing is changed in the Cassandra datastore
- ALL changes (2 rows) are applied to the Cassandra datastore
I'm especially concerned about the line "OperationResult result = m.execute();". I would assume that this translates into something like: write all modifications to a commit log in Cassandra and then atomically trigger a change to be executed inside Cassandra (with Cassandra guaranteeing execution on some server).
Any help on this is very appreciated.
Thanks,
Sven.
CODE:
MutationBatch m = keyspace.prepareMutationBatch();

long rowKey = 1234;
long rowKey2 = 5678; // assumed second row key; not declared in the original snippet

// Setting columns in a standard column family
m.withRow(CF_STANDARD1, rowKey)
        .putColumn("Column1", "X", null)
        .putColumn("Column2", "X", null);

m.withRow(CF_STANDARD1, rowKey2)
        .putColumn("Column1", "Y", null);

try {
    OperationResult<Void> result = m.execute();
} catch (ConnectionException e) {
    LOG.error(e);
}
http://www.datastax.com/docs/0.8/dml/about_writes
In Cassandra, a write is atomic at the row-level, meaning inserting or updating columns for a given row key will be treated as one write operation. Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation.
This means there is no way to be 100% sure that the mutation will update both rows or neither. But since Cassandra 0.8 you have such a guarantee at least within a single row: all columns modified within a single row will either all succeed or all fail, and that is all.
You can see mutations on different rows as separate transactions; the fact that they are sent within a single mutation call does not change anything. Cassandra internally groups all operations by row key and executes each row mutation as a separate atomic operation.
In your example, you can be sure that the write to rowKey (Column1, Column2) is all-or-nothing and the write to rowKey2 (Column1) is all-or-nothing, but there is no guarantee that both rows are persisted together.
You can enable hinted handoff writes, which increases the probability that a write will propagate over time, but again, this is not an ACID database.
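If knowing which of the two rows made it matters downstream, one option is to execute one mutation batch per row and track each result separately. A sketch reusing the same Astyanax calls as the question's code (no extra guarantees are gained; you simply learn the per-row outcome):

```java
// Row 1 in its own mutation: atomic on its own.
MutationBatch m1 = keyspace.prepareMutationBatch();
m1.withRow(CF_STANDARD1, rowKey)
        .putColumn("Column1", "X", null)
        .putColumn("Column2", "X", null);

// Row 2 in its own mutation: also atomic on its own.
MutationBatch m2 = keyspace.prepareMutationBatch();
m2.withRow(CF_STANDARD1, rowKey2)
        .putColumn("Column1", "Y", null);

boolean row1Persisted = false;
boolean row2Persisted = false;
try {
    m1.execute();
    row1Persisted = true;
    m2.execute();
    row2Persisted = true;
} catch (ConnectionException e) {
    // You now know which row was acknowledged before the failure and can reconcile later.
    LOG.error("row1Persisted=" + row1Persisted + ", row2Persisted=" + row2Persisted, e);
}
```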