I have a simple use case. I have a system where duplicate requests to a REST service (with dozens of instances) are not allowed. However, they are also difficult to prevent because of a complicated datastore configuration and downstream services.
So the only way I can prevent duplicate "transactions" is to have some centralized place where I write a unique hash of the request data. Each REST endpoint first checks whether the hash of a new request already exists and only proceeds if no such hash exists.
For purposes of this question assume that it's not possible to do this with database constraints.
One solution is to create a table in the database where I store my request hashes and always write to this table before proceeding with the request. However, I want something lighter than that.
Another solution is to use something like Redis and write my unique hashes to Redis before proceeding with the request. However, I don't want to spin up and maintain a Redis cluster.
I was thinking of embedding Hazelcast in each of my app instances and writing my unique hashes there. In theory, all instances will see the hash in the memory grid and will be able to detect duplicate requests. This solves my problem of having a lighter solution than a database, and also meets the other requirement of not having to maintain a Redis cluster.
OK, now for my question, finally: is it a good idea to use Hazelcast for this use case?
Will Hazelcast be fast enough to detect duplicate requests that come in milliseconds or microseconds apart?
Suppose request 1 comes into instance 1 and request 2 comes into instance 2 microseconds apart. Instance 1 writes a hash of the request to Hazelcast, and instance 2 checks Hazelcast for the existence of the hash only milliseconds later. Will the hash be detected? Is Hazelcast going to propagate the data across the cluster in time? Does it even need to do that?
Thanks in advance, all ideas are welcome.
Hazelcast is definitely a good choice for this kind of use case, especially if you just use a Map<String, Boolean> and test with Map::containsKey instead of retrieving the element and checking for null. You should also set a TTL when putting the element, so you won't run out of memory. However, same as with Redis, we recommend using Hazelcast as a standalone cluster for "bigger" datasets, as the lifecycle of cached elements normally interferes with the rest of the application and complicates GC optimization. Running Hazelcast embedded is a choice that should be made only after serious consideration and testing of your application at runtime.
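One detail worth spelling out: a containsKey check followed by a put is two separate operations, so two instances could both pass the check before either of them writes. A minimal sketch of an atomic "claim" is below. It is written against java.util.concurrent.ConcurrentMap (which Hazelcast's IMap extends) so that it runs without a cluster; the class and method names are made up for illustration, and the TTL is left out because the plain JDK map has none.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper; "seen" stands in for the distributed map.
final class RequestDeduplicator {
    private final ConcurrentMap<String, Boolean> seen;

    RequestDeduplicator(ConcurrentMap<String, Boolean> seen) {
        this.seen = seen;
    }

    // Returns true only for the first caller that presents this hash.
    // putIfAbsent is one atomic operation, unlike containsKey followed by put.
    boolean tryClaim(String requestHash) {
        return seen.putIfAbsent(requestHash, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        RequestDeduplicator dedup = new RequestDeduplicator(new ConcurrentHashMap<>());
        System.out.println(dedup.tryClaim("abc123")); // first request may proceed
        System.out.println(dedup.tryClaim("abc123")); // duplicate is rejected
    }
}
```

With a real cluster you would pass in the IMap instead of the ConcurrentHashMap, and probably use the IMap overload of putIfAbsent that also takes a TTL.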
Yes, you can use a Hazelcast distributed map to detect duplicate requests to a REST service: whenever there is a put operation on a Hazelcast map, the data becomes available to all the other clustered instances.
From what I've read and seen in the tests, it doesn't actually replicate. It uses a data grid to distribute the primary data evenly across all the nodes rather than each node keeping a full copy of everything and replicating to sync the data. The great thing about this is that there is no data lag, which is inherent to any replication strategy.
There is a backup copy of each node's data stored on another node, and that obviously depends on replication, but the backup copy is only used when a node crashes.
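To illustrate why no propagation is needed, here is a deliberately simplified model of key ownership in a partitioned grid. It is only a sketch: real Hazelcast hashes the serialized key and keeps its own partition table, and the round-robin assignment of partitions to members below is my assumption. The point is just that every instance resolves the same key to the same owner, so a read from instance 2 goes to the node that already holds the write from instance 1.

```java
import java.util.List;

// Simplified model only; names and the assignment scheme are illustrative.
final class PartitionSketch {
    static final int PARTITION_COUNT = 271; // Hazelcast's default partition count

    // Maps a key to a partition. floorMod keeps the result non-negative.
    static int partitionId(String key) {
        return Math.floorMod(key.hashCode(), PARTITION_COUNT);
    }

    // Assumption: partitions are spread over the members round-robin.
    static String owner(String key, List<String> members) {
        return members.get(partitionId(key) % members.size());
    }

    public static void main(String[] args) {
        List<String> members = List.of("node-1", "node-2", "node-3");
        String hash = "abc123";
        // Whichever member handles the request, the key resolves to the same owner.
        System.out.println(hash + " -> partition " + partitionId(hash)
                + " owned by " + owner(hash, members));
    }
}
```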
See the code below, which creates two clustered Hazelcast instances and gets the distributed map from each. One instance puts data into the distributed IMap and the other instance reads the data back from it.
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
public class TestHazelcastDataReplication {
    // Create 1st instance
    public static final HazelcastInstance instanceOne = Hazelcast
            .newHazelcastInstance(new Config("distributedFirstInstance"));
    // Create 2nd instance
    public static final HazelcastInstance instanceTwo = Hazelcast
            .newHazelcastInstance(new Config("distributedSecondInstance"));
    // Insert into distributedMap using instance one
    static IMap<Long, Long> distributedInsertMap = instanceOne.getMap("distributedMap");
    // Read from distributedMap using instance two
    static IMap<Long, Long> distributedGetMap = instanceTwo.getMap("distributedMap");
    public static void main(String[] args) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                for (long i = 0; i < 100000; i++) {
                    // Insert data into distributedMap using the 1st instance
                    distributedInsertMap.put(i, System.currentTimeMillis());
                    // Read the same key from distributedMap using the 2nd instance
                    System.out.println(i + " : " + distributedGetMap.get(i));
                }
            }
        }).start();
    }
}
I have an application that uses a large in memory file (just under 2gb).
I'm trying to use Redis lists (in Azure) as the storage (instead of SQL). Building the list in Redis is pretty fast: I can load it in about 5 minutes. But then I need to read from the list into the application.
This is extremely slow. I've tried increasing the threads, extending synctimeout, etc., to no avail.
ThreadPool.SetMinThreads(200, 200);
I'm using a C# implementation of a Redis list that I found online, and I'm passing it to the code that builds the in-memory collection via a foreach loop. Internally, this is how it processes the data (I've omitted the rest of the class):
public class RedisList<T> : IList<T>
{
    private static ConnectionMultiplexer _cnn;
    private readonly string _key;
    private static readonly Lazy<ConnectionMultiplexer> LazyConnection = new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(ConfigurationManager.AppSettings["AzureRedisCacheUrl"]));
    public ConnectionMultiplexer Connection => LazyConnection.Value;
    public RedisList(string key)
    {
        this._key = key;
        _cnn = Connection;
    }
    public IEnumerator<T> GetEnumerator()
    {
        for (var i = 0; i < this.Count; i++)
        {
            yield return Deserialize<T>(GetRedisDb().ListGetByIndex(_key, i).ToString());
        }
    }
}
Is there a more efficient way of reading in the data? Am I insane to do it this way? :D Thanks
Am I insane to do it this way?
Absolutely. Redis isn't designed to store large binary data such as files of just under 2 GB (or even 100 MB files).
Redis is about indexing small chunks of data to retrieve them later in a very optimized and efficient way in terms of both CPU and memory. Remember that Redis is an in-memory database: the fact that it snapshots its data to disk (for example in RDB files) doesn't change the fact that the working data lives in RAM.
Instead of storing this large binary data in Redis, just use Redis as a file index, and leverage its data structures to look the files up quickly.
OP said in some comment:
Hi Matias, the redis list is a collection of individual files ranging
from a few kb to around 5mb. The legacy application I'm working on
loads all this into a huge static object on application start (I know,
awful). The original version was an in memory file loaded from SQL,
this proved too slow when moving to azure so we need a faster
intermediary store of the data
Anyway, Redis isn't designed to be an in-memory file storage.
I would say that you should take a look at memory-mapped files: you can load all the files into a single memory-mapped file and retrieve each one by its byte range (for example, bytes 0 to 243843 are file1, and so on). This should improve overall performance, and you won't need to use the wrong tool for the job.
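As a rough sketch of that idea (in Java rather than C#, and with every name here invented for the example): the individual files are concatenated into one packed file, and an index maps each name to its offset and length inside the mapping. Note that a single Java MappedByteBuffer is limited to 2 GB, so a larger data set would need several mappings.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader for a packed file; the index maps name -> {offset, length}.
final class PackedFiles {
    private final MappedByteBuffer buffer;
    private final Map<String, long[]> index;

    PackedFiles(Path packed, Map<String, long[]> index) throws IOException {
        try (FileChannel ch = FileChannel.open(packed, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            this.buffer = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
        this.index = index;
    }

    // Copies one entry out of the mapping using its recorded byte range.
    byte[] read(String name) {
        long[] entry = index.get(name);
        byte[] out = new byte[(int) entry[1]];
        ByteBuffer view = buffer.duplicate(); // own position, shared content
        view.position((int) entry[0]);
        view.get(out);
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] a = "first file".getBytes(StandardCharsets.UTF_8);
        byte[] b = "second file".getBytes(StandardCharsets.UTF_8);
        Path packed = Files.createTempFile("packed", ".bin");
        Files.write(packed, a);
        Files.write(packed, b, StandardOpenOption.APPEND);

        Map<String, long[]> index = new HashMap<>();
        index.put("file1", new long[] {0, a.length});
        index.put("file2", new long[] {a.length, b.length});

        PackedFiles files = new PackedFiles(packed, index);
        System.out.println(new String(files.read("file2"), StandardCharsets.UTF_8));
    }
}
```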
EDIT question summary:
I want to expose an endpoint that is capable of returning portions of XML data selected by some query parameters.
I have a stateful service (that keeps the XML data, converted to DTOs, in a reliable dictionary)
I use a single, named partition (I can't tell which partition holds the data from the query parameters passed, so I can't implement a smarter partitioning strategy)
I am using service remoting for communication between the stateless Web API service and the stateful one
XML data may reach 500 MB
Everything is OK when the XML is only around 50 MB
When the data gets larger, Service Fabric complains about MaxReplicationMessageSize
And the summary of my few questions below: how can one store a large amount of data in a reliable dictionary?
TL;DR
Apparently, I am missing something...
I want to parse, and load into a reliable dictionary huge XMLs for later queries over them.
I am using a single, named partition.
I have an XMLData stateful service that loads these XMLs into a reliable dictionary in its RunAsync method via this piece of code:
var myDictionary = await this.StateManager.GetOrAddAsync<IReliableDictionary<string, List<HospitalData>>>("DATA");
using (var tx = this.StateManager.CreateTransaction())
{
    var result = await myDictionary.TryGetValueAsync(tx, "data");
    ServiceEventSource.Current.ServiceMessage(this, "data status: {0}",
        result.HasValue ? "loaded" : "not loaded yet, starts loading");
    if (!result.HasValue)
    {
        Stopwatch timer = new Stopwatch();
        timer.Start();
        var converter = new DataConverter(XmlFolder);
        List<HospitalData> data = converter.LoadData();
        // The whole data set is stored as ONE dictionary value under the key "data"
        await myDictionary.AddOrUpdateAsync(tx, "data", data, (key, value) => data);
        timer.Stop();
        ServiceEventSource.Current.ServiceMessage(this,
            string.Format("Loading of data finished in {0} ms",
                timer.ElapsedMilliseconds));
    }
    await tx.CommitAsync();
}
I have a stateless WebApi service that is communicating with the above stateful one via service remoting and querying the dictionary via this code:
ServiceUriBuilder builder = new ServiceUriBuilder(DataServiceName);
// ServiceProxy.Create<IDataService> returns the interface type
IDataService DataServiceClient = ServiceProxy.Create<IDataService>(builder.ToUri(),
    new Microsoft.ServiceFabric.Services.Client.ServicePartitionKey("My.single.named.partition"));
try
{
    var data = await DataServiceClient.QueryData(SomeQuery);
    return Ok(data);
}
catch (Exception ex)
{
    ServiceEventSource.Current.Message("Web Service: Exception: {0}", ex);
    throw;
}
It works really well when the XMLs do not exceed 50 MB.
After that I get errors like:
System.Fabric.FabricReplicationOperationTooLargeException: The replication operation is larger than the configured limit - MaxReplicationMessageSize ---> System.Runtime.InteropServices.COMException
Questions:
I am almost certain that it is about the partitioning strategy and that I need to use more partitions. But how do I reference a particular partition from within the RunAsync method of the stateful service? (The stateful service is invoked via RPC from the Web API, where I explicitly specify a partition, so there I can easily choose among partitions if using the ranged partitioning strategy. But how do I do that during the initial loading of data in the RunAsync method?)
Are these thoughts of mine correct: the code in a stateful service operates on a single partition, so loading a huge amount of data and partitioning that data should happen outside the stateful service (for example in an Actor). Then, after determining the partition key, I just invoke the stateful service via RPC and point it to that particular partition?
Is it actually a partitioning problem at all, and what (where, who) defines the size of a replication message? I.e., does the partitioning strategy influence the replication message sizes?
Would extracting the loading logic into a stateful Actor help in any way?
For any help on this - thanks a lot!
The issue is that you're trying to add a large amount of data into a single dictionary record. When Service Fabric tries to replicate that data to other replicas of the service, it encounters a quota of the replicator, MaxReplicationMessageSize, which indeed defaults to 50MB (documented here).
You can increase the quota by specifying a ReliableStateManagerConfiguration:
internal sealed class Stateful1 : StatefulService
{
    public Stateful1(StatefulServiceContext context)
        : base(context, new ReliableStateManager(context,
            new ReliableStateManagerConfiguration(new ReliableStateManagerReplicatorSettings
            {
                MaxReplicationMessageSize = 1024 * 1024 * 200
            }))) { }
}
But I strongly suggest you change the way you store your data. The current method won't scale very well and isn't the way Reliable Collections were meant to be used.
Instead, you should store each HospitalData in a separate dictionary item. Then you can query the items in the dictionary (see this answer for details on how to use LINQ). You will not need to change the above quota.
PS - You don't necessarily have to use partitioning for 500MB of data. But regarding your question - you could use partitions even if you can't derive the key from the query, simply by querying all partitions and then combining the data.
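The "query all partitions and combine" idea can be sketched roughly as follows. This is plain Java, not Service Fabric code: each inner list simply stands in for one partition's records, and all names are invented for the example. In a real service each task would be a remoting call to one partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical scatter-gather helper.
final class FanOutQuery {
    // Runs the same filter against every partition in parallel,
    // then concatenates the partial results in partition order.
    static <T> List<T> queryAll(List<List<T>> partitions, Predicate<T> filter) {
        List<CompletableFuture<List<T>>> pending = new ArrayList<>();
        for (List<T> partition : partitions) {
            pending.add(CompletableFuture.<List<T>>supplyAsync(
                    () -> partition.stream().filter(filter).collect(Collectors.toList())));
        }
        List<T> combined = new ArrayList<>();
        for (CompletableFuture<List<T>> f : pending) {
            combined.addAll(f.join());
        }
        return combined;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
                List.of("alpha-1", "beta-1"),
                List.of("alpha-2"),
                List.of("beta-2"));
        System.out.println(queryAll(partitions, s -> s.startsWith("alpha")));
    }
}
```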
I have a three-node Cassandra cluster which is currently serving 50 writes/sec. This will soon grow to 100 writes/sec. The details of my cluster are as follows:
Keyspace definition :
CREATE KEYSPACE keyspacename WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
Partitioner :
org.apache.cassandra.dht.RandomPartitioner
I have a client in C# (DataStax C# driver) and I am using the singleton design pattern, i.e. creating only one Cassandra session object, which is used for writing to and reading from the ring. The reason for doing this was that TCP connections were not getting closed on the ring. So far my ring is working fine and is able to sustain the load of 50 writes/sec. Now it will increase to 100 writes/sec.
So my question is: will the same design pattern be able to handle this load, given the configuration of my ring?
C# code :
public static ISession GetSingleton()
{
    if (_singleton == null)
    {
        Cluster cluster = Cluster.Builder().AddContactPoints(ConfigurationManager.AppSettings["cassandraCluster"].ToString().Split(',')).Build();
        ISession session = cluster.Connect(ConfigurationManager.AppSettings["cassandraKeySpace"].ToString());
        _singleton = session;
    }
    return _singleton;
}
From the Cassandra side, 100 writes/sec is quite low. It would handle it easily.
From the client side, I see no problem with your design. In my opinion, it is a good idea to use the singleton pattern. But I cannot give you an exact answer since I do not know:
What the size of your written data is.
How performant your network is.
Whether you use synchronous or asynchronous execution.
Generally, we can reasonably assume about 10 ms per write. With synchronous execution, where only one write is in flight at a time, that gives you about 100 writes/sec. But you could not keep scaling that way indefinitely, because the driver would not create more connections.
On the other hand, you can use the ExecuteAsync method to execute writes asynchronously. The C# Cassandra driver will manage the connection pool for you.
Another tip I can give you is to use a PreparedStatement.
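The arithmetic behind the 100 writes/sec figure, as a tiny sketch: throughput is the number of writes in flight divided by the latency of one write. The 10 ms comes from the estimate above; the figure of 8 concurrent writes is just an example value I picked to show the effect of asynchronous execution.

```java
// Back-of-the-envelope calculation only; no driver involved.
final class WriteThroughput {
    // Little's law rearranged: throughput = requests in flight / latency per request.
    static double writesPerSecond(int inFlight, double latencyMillis) {
        return inFlight * 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        // Synchronous: one write in flight at a time.
        System.out.println(writesPerSecond(1, 10.0)); // prints 100.0
        // Asynchronous with 8 writes in flight (example value).
        System.out.println(writesPerSecond(8, 10.0)); // prints 800.0
    }
}
```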
We are working on a c# windows service using NHibernate which is supposed to process a batch of records.
The service has to process about 6000 records and it's taking about 3 hours at present to process these. There are a lot of db hits incurred, and while we are trying to minimize these, we are also exploring multithreading options to improve performance.
We are using the UnitOfWork pattern to access the NHibernate session.
This is roughly how the service looks :
public class BatchService
{
    public void DoWork()
    {
        StartUnitOfWork();
        foreach (var record in recordsToBeProcessed)
        {
            Process(record);
            // Perform lots of db operations
        }
        StopUnitOfWork();
    }
}
We were thinking of using the Task Parallel Library to process these records in batches (using the Parallel.ForEach() method).
From what I have read about NHibernate so far, we should provide each thread with a separate NHibernate session.
My question is how to supply this, considering that the UnitOfWork pattern only allows one session to be available.
Should I be looking at wrapping a UnitOfWork around the processing of a single record?
Any help much appreciated.
Thanks
The best way is to start a new unit of work for each thread, using a thread-static contextual session (NHibernate.Context.ThreadStaticSessionContext). You must be aware of detached entities.
The easiest way is to wrap the processing of each record in its own unit of work and then run each UOW on its own thread. You need to make sure that each UOW and session is started, used and completed on a single thread.
To gain performance you could split the batch of records into smaller batches, wrap the processing of each smaller batch in a UOW, and execute them on separate threads.
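A rough sketch of that batching scheme, in Java rather than C#, with all names invented for the example. The unitOfWork callback is a placeholder: in the real service it would open its own session, process the batch and commit, all on the worker thread that runs it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical helper for running one unit of work per batch.
final class BatchRunner {
    // Splits records into consecutive chunks of at most batchSize items.
    static <T> List<List<T>> split(List<T> records, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            int end = Math.min(i + batchSize, records.size());
            batches.add(new ArrayList<>(records.subList(i, end)));
        }
        return batches;
    }

    // Submits each batch as one task, so each unit of work stays on one thread.
    static <T> void runAll(List<List<T>> batches, int threads, Consumer<List<T>> unitOfWork)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : batches) {
                pending.add(pool.submit(() -> unitOfWork.accept(batch)));
            }
            for (Future<?> f : pending) {
                f.get(); // rethrows any failure from that batch
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> records = new ArrayList<>();
        for (int i = 1; i <= 10; i++) records.add(i);
        AtomicInteger processed = new AtomicInteger();
        runAll(split(records, 4), 3, batch -> processed.addAndGet(batch.size()));
        System.out.println("processed " + processed.get() + " records");
    }
}
```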
Depending on your workload, using a second-level cache (memcached/membase) might dramatically improve your performance (e.g. if you need to read some records from the db for each record you process).
I have a Hibernate application that may produce concurrent inserts and updates (via Session.saveOrUpdate) to records with the same primary key, which is assigned. These transactions are somewhat long-running, perhaps 15 seconds on average (since data is collected from remote sources and persisted as it comes in). My DB isolation level is set to Read Committed, and I'm using MySQL and InnoDB.
The problem is that this scenario creates excessive lock waits which time out, either as a result of a deadlock or because of the long transactions. This leads me to a few questions:
Does the database engine only release its locks when the transaction is committed?
If this is the case, should I seek to shorten my transactions?
If so, would it be good practice to use separate read and write transactions, where the write transaction could be made short and only take place after all of my data is gathered (the bulk of my transaction length involves collecting remote data)?
Edit:
Here's a simple test that approximates what I believe is happening. Since I'm dealing with long running transactions, commit takes place long after the first flush. So just to illustrate my situation I left commit out of the test:
@Entity
static class Person {
    @Id
    Long id = Long.valueOf(1);
    @Version
    private int version;
}
@Test
public void updateTest() {
    for (int i = 0; i < 5; i++) {
        new Thread() {
            public void run() {
                Session s = sf.openSession();
                Transaction t = s.beginTransaction();
                Person p = new Person();
                s.saveOrUpdate(p);
                s.flush(); // Waits...
            }
        }.run();
    }
}
And the queries that this produces, as expected, with the thread waiting on the second insert:
select id, version from person where id=?
insert into person (version, id) values (?, ?)
select id, version from person where id=?
insert into person (version, id) values (?, ?)
That's correct, the database releases locks only when the transaction is committed. Since you're using Hibernate, you can use optimistic locking, which does not lock the database for long periods of time. Essentially, Hibernate does what you suggest, separating the reading and writing portions into separate transactions. On write, it checks that the data in memory has not been changed concurrently in the database.
Hibernate Reference - Optimistic Transactions
Optimistic locking:
Base assumption: update conflicts seldom occur.
Mechanics:
1. Read dataset with version field
2. Change dataset
3. Update dataset
3.1. Read dataset with current version field and key
If you get it, nobody has changed the record.
Apply the next version field value.
Update the record.
If you do not get it, the record has been changed; return an appropriate message to the caller and you are done.
Inserts are not affected; you either
have a separate primary key anyway
or you accept multiple records with identical values.
Therefore the example given above is not a case for optimistic locking.
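For updates, the version check described in the mechanics above can be sketched with an in-memory store. This is not Hibernate code; the class and its methods are made up for illustration, and the map stands in for the table. The update mirrors a statement of the form "UPDATE ... SET value = ?, version = version + 1 WHERE id = ? AND version = ?".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a table with a version column.
final class VersionedStore {
    static final class Row {
        final String value;
        final int version;
        Row(String value, int version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<Long, Row> rows = new ConcurrentHashMap<>();

    void insert(long id, String value) {
        rows.put(id, new Row(value, 0));
    }

    Row read(long id) {
        return rows.get(id);
    }

    // Succeeds only if the row still carries the version the caller read;
    // on success the version is incremented.
    boolean update(long id, int expectedVersion, String newValue) {
        Row current = rows.get(id);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        // Atomic compare-and-swap on the row object itself.
        return rows.replace(id, current, new Row(newValue, expectedVersion + 1));
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.insert(1L, "initial");
        int seenByA = store.read(1L).version;
        int seenByB = store.read(1L).version;
        System.out.println(store.update(1L, seenByA, "from A")); // A's version is current
        System.out.println(store.update(1L, seenByB, "from B")); // B's version is stale
    }
}
```

No lock is held between the read and the update; the loser of the race simply finds that the version has moved on and has to re-read or report the conflict.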