I would like to use PN Counters in Hazelcast to maintain inventory numbers for fast lookup, but at the same time I would also like to clear them all from time to time and reload them as a "true up".
Hazelcast allows PN Counter access by assigning a name, but there is no way for a process to clear all the PN Counters in a cluster, as there doesn't seem to be a client API to get a list of all PN Counters in the cluster.
How can I clear all PN Counters in a cluster and start afresh?
You can make use of the getDistributedObjects API, filter the PNCounters and destroy them.
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.DistributedObject;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.crdt.pncounter.PNCounter;

// Enumerate all distributed objects, keep only PN Counters, destroy them.
HazelcastInstance client = HazelcastClient.newHazelcastClient();
client.getDistributedObjects()
      .stream()
      .filter(distributedObject -> distributedObject instanceof PNCounter)
      .forEach(DistributedObject::destroy);
As the documentation of the linked API suggests, the list of distributed objects is returned on a best-effort basis, but it should work just fine most of the time.
I have trouble understanding the new features of Hazelcast 5.0. What is the main difference between these data structures? As I understand it, a PNCounter is a counter that replicates its data, and when there are no more updates the replicas converge. I want to understand how Hazelcast's PNCounter controls concurrency.
Is the maximum replica count related to the maximum number of nodes you have running in Hazelcast?
How does it work internally? I need to understand this because I'm working with a counter that counts the activity of several clients; we create 1000 or more PNCounters for different activities, because I don't know whether a single PNCounter would cope.
Does the client know which counter replica it needs to connect to, or does the counter follow a certain logic flow? I don't understand this feature, and I really want to know the difference between PNCounter and IAtomicLong.
To me it looks like an IAtomicLong with the added feature that it can replicate.
It all boils down to the CAP theorem.
In summary, out of Consistency, Availability and Partition tolerance, you can only pick two out of three. And since Hazelcast is distributed by nature, your choice is between Consistency and Availability.
IAtomicLong is a member of CP Subsystem API
-- https://docs.hazelcast.com/imdg/4.2/data-structures/iatomiclong
A Conflict-free Replicated Data Type (CRDT) is a distributed data structure that achieves high availability by relaxing consistency constraints. There may be several replicas for the same data and these replicas can be modified concurrently without coordination. This means that you may achieve high throughput and low latency when updating a CRDT data structure. On the other hand, all of the updates are replicated asynchronously.
-- https://docs.hazelcast.com/imdg/4.2/data-structures/pn-counter
In summary, IAtomicLong sacrifices Availability for Consistency. The result will always be correct, but it might not always be available.
PNCounter makes the opposite trade-off. It's always available (depending on the number of nodes of the cluster, of course) but it's eventually consistent as it's asynchronously replicated.
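To make the trade-off concrete, here is a minimal sketch using both structures on a member (assuming the Hazelcast 4.x API; the counter names are made up):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.IAtomicLong;
import com.hazelcast.crdt.pncounter.PNCounter;

HazelcastInstance hz = Hazelcast.newHazelcastInstance();

// CRDT counter: the update is applied locally and replicated
// asynchronously, so a read on another member may briefly lag behind.
PNCounter pnCounter = hz.getPNCounter("page-views");
pnCounter.incrementAndGet();

// CP structure: the update goes through the Raft-based CP Subsystem,
// so reads are consistent but may block or fail without a majority.
IAtomicLong atomicLong = hz.getCPSubsystem().getAtomicLong("page-views-cp");
atomicLong.incrementAndGet();

As for the 1000-counters question: a single PNCounter is designed to absorb concurrent increments from many clients; separate counters are only needed for values you want to report separately.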
I am working on a specific project to change my repository to Hazelcast.
I need to find some documents by date range, store type and store ids.
During my tests I got 90k throughput using one c3.large instance, but when I run the same test with more instances the throughput scales poorly (10 instances reach 500k and 20 instances only 700k, far below linear).
These numbers were the best I could get by tuning some properties:
hazelcast.query.predicate.parallel.evaluation
hazelcast.operation.generic.thread.count
hz:query
I have tried changing the instance type to c3.2xlarge to get more processing power, but the numbers don't justify the price.
How can I optimize Hazelcast to be faster in this scenario?
My use case doesn't use map.get(key), only map.values(predicate).
Settings:
Hazelcast 3.7.1
Map as Data Structure;
Complex object using IdentifiedDataSerializable;
Map index configured;
Only 2000 documents on map;
Hazelcast embedded configured by Spring Boot Application (singleton);
All instances in same region.
Test
Gatling
New Relic as service monitor.
Any help is welcome. Thanks.
If your use case only calls map.values with a predicate, I would strongly suggest using OBJECT as the in-memory storage format. This way, no serialization will be involved during query execution; see the sketch below.
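A minimal configuration sketch (the map name "documents" is an assumption):

import com.hazelcast.config.Config;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Keep map entries in deserialized (OBJECT) form so that predicate
// evaluation in map.values(predicate) skips per-entry deserialization.
Config config = new Config();
config.getMapConfig("documents").setInMemoryFormat(InMemoryFormat.OBJECT);
HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);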
On the other hand, it is normal to get very high numbers when you only have one member, because no data moves across the network. To improve, I would look at EC2 instances with more network capacity: for example, c3.8xlarge has a 10 Gbit network, compared to the unrated "High" network performance of c3.2xlarge.
I can't promise how much of an increase you'll get, but I would definitely try these changes first.
I am using Cassandra 2.1.12 to store event data in a column family. Below is the C# code for creating the client for .NET which manages connections to Cassandra. Now the problem is that the rate of inserts/updates is very high, and I increment a column value on each subsequent request. Say on the first request I write the value of the column as 1; on the next request I read the value of this column and update it to 2. But because the insert/update rate is so high, the value may be fetched from another node where it has not yet been set to 1, and so it would again be stored as 1. To solve this problem I have set the consistency level to QUORUM, but the problem still persists. Can anyone tell me a possible solution for this?
private static ISession _singleton;

public static ISession GetSingleton()
{
    if (_singleton == null)
    {
        Cluster cluster = Cluster.Builder()
            .AddContactPoints(ConfigurationManager.AppSettings["cassandraCluster"].ToString().Split(','))
            .Build();
        ISession session = cluster.Connect(ConfigurationManager.AppSettings["cassandraKeySpace"].ToString());
        _singleton = session;
    }
    return _singleton;
}
No, it is not possible to achieve your goal in Cassandra this way. The reason is that every distributed application is subject to the CAP theorem, and Cassandra favours availability and partition tolerance over consistency.
In your scenario, you are updating the same partition key many times in a multi-threaded environment, so it is not guaranteed that every thread sees the latest data. If you leave a small interval between operations, you might see the latest data in all threads.
If your requirement is to increment/decrement integers, you can use Cassandra counters. However, a Cassandra counter does not support retrieving the updated value within the same request: you issue one request to increment the counter and a separate request to read the new value. It is not possible to increment and read the incremented value in a single round trip. If your requirement is only to count something (like the number of times a page is viewed), Cassandra counters are a good fit: they will not miss any increments/decrements, and you can read the accurate total at the end. Hope it helps.
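A minimal sketch of that two-request pattern with the DataStax Java driver (the keyspace, table and column names are made up; the counter table would be created as CREATE TABLE page_views (page text PRIMARY KEY, views counter)):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace");

// Request 1: increment the counter; counters can only be updated, never set.
session.execute("UPDATE page_views SET views = views + 1 WHERE page = 'home'");

// Request 2: a separate read to see the updated value.
long views = session.execute("SELECT views FROM page_views WHERE page = 'home'")
                    .one().getLong("views");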
I'm banging my head on this but, frankly speaking, my brain won't get it - or so it seems.
I have a column family that holds jobs for a rather large group of actors. It is a central job management and scheduling table that must be distributed and available throughout the whole cluster, and will possibly even have to traverse datacenter barriers some day in the near future.
Each job executor actor system (the ones that actually execute the jobs) is installed alongside one Cassandra node, that is, on the same host. Of course there is actually a master actor that pulls the jobs and distributes them to the actor agents, but that has nothing to do with my question.
There are also some actor systems that can create jobs in the central job table to be executed by other actors or even actor systems, but usually the jobs are loaded batch-wise or manually through a web interface.
An actor that is to execute a job always queries only its local Cassandra node. When finished, it updates the job table to indicate the job is done. Under normal circumstances, this write should also only update records for jobs for which its local Cassandra node is authoritative.
Now, sometimes an actor system on a given host may have nothing to do. In that case it should indeed get jobs from other nodes too, but of course it will still only talk to its local Cassandra node. I know this works, and it doesn't bother me a bit.
What keeps me up at night is this:
How would I create a compound key that achieves this local authority of a Cassandra node over the job entries of its local actor system (and thereby its job execution actors), without splitting the job table into multiple column families or the like?
In other words: how can I create a compound key that makes sure that a) jobs are evenly distributed through my cluster,
b) a local query on the job table only returns jobs for which this Cassandra node is authoritative, and
c) my distributed agent system still has the possibility to fetch jobs from other nodes, in case it has no jobs of its own to execute?
A last word on c) above: I do not want to issue two queries in the case there is no local job, but still only one!
Any hints on this?
This is the general structure of the job table so far:
ClusterKey UUID: Primary Key
JobScope String: HOST / GLOBAL / SERVICE / CHANNEL
JobIdentifier String: Web-Crawler, Twitter
Description String:
URL String:
JobType String: FETCH / CLEAN / PARSE /
Job String: Definition of the job
AdditionalData Collection:
JobStatus String: NEW / WORKING / FINISHED
User String:
ValidFrom Timestamp:
ValidUntill Collection:
I'm still in the process of setting everything up, so no queries are defined yet. But an actor will pull jobs out of it, set their status, and so on.
Cassandra has no way of "pinning" a key to a node, if that's what you are after.
If I were you, I'd stop worrying about whether my local node was authoritative for some set of data, and start leveraging the built-in consistency controls in Cassandra for managing the set of nodes that you read from or write to.
Lots of information here on read consistency and write consistency- using the right consistency will ensure that your application scales well while keeping it logically correct: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
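As an illustration, a per-statement consistency sketch with the DataStax Java driver (the keyspace, table and key names are hypothetical):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("jobs_keyspace");

// QUORUM writes plus QUORUM reads overlap on at least one replica,
// so a read observes the latest acknowledged write.
Statement read = new SimpleStatement(
        "SELECT * FROM jobs WHERE clusterkey = 123e4567-e89b-12d3-a456-426655440000")
        .setConsistencyLevel(ConsistencyLevel.QUORUM);
session.execute(read);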
Another item worth mentioning is atomic "compare and swap", also known as lightweight transactions. Let's say you want to ensure that a given job is only performed once. You could add a field indicating whether the job has been "picked up", then query on that field (where picked_up = 0) and simultaneously (and atomically) update the field to indicate that you are "picking up" that work. That way no other actors will pick it up again.
Info on lightweight transactions here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
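And a compare-and-swap sketch for the "picking up" idea (the jobs table and picked_up field are the hypothetical ones from above):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("jobs_keyspace");

// The IF clause turns the update into a lightweight transaction, an
// atomic compare-and-swap run under Paxos; only one writer can win.
ResultSet rs = session.execute(
        "UPDATE jobs SET picked_up = 1 " +
        "WHERE clusterkey = 123e4567-e89b-12d3-a456-426655440000 " +
        "IF picked_up = 0");

// wasApplied() is true only for the single actor that won the race.
if (rs.wasApplied()) {
    // ... safe to execute the job exactly once
}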
OK, so a simple task such as generating a sequential number has caused us an issue in the cloud.
Where you have more than one server, it gets harder and harder to guarantee that the numbers allocated by the different servers do not clash.
We are using Azure servers if it helps.
We thought about using the app cache but you cannot guarantee it will be updated between servers.
We are limited to using:
a SQL table with an identity column, or
some peer-to-peer method between servers, or
a blob store, utilising its locks to store the most up-to-date number (this could have scaling issues).
I just wondered if anyone has an idea of a solution to this?
Surely it's a simple problem and must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but are guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this, since they needed unique int PKs to synchronize back to premises:
Create a queue, and create a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to empty).
Now, your code can first poll the next number from the queue, delete it from the queue, and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequential numbers. In scenarios with frequent inserts, you may end up saving things out of order (two processes poll from the queue; the one that polls first saves last, so the PKs saved to the database are no longer sequential).
The biggest downside is that you now have to maintain/monitor the process that replenishes the pool of PKs.
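A rough sketch of this pattern with the classic Azure Storage queue SDK for Java (the queue name, connection string variable and starting number are all assumptions):

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.queue.CloudQueue;
import com.microsoft.azure.storage.queue.CloudQueueMessage;

public class IdPool {
    public static void main(String[] args) throws Exception {
        CloudStorageAccount account =
                CloudStorageAccount.parse(System.getenv("AZURE_STORAGE_CONNECTION"));
        CloudQueue queue = account.createCloudQueueClient().getQueueReference("id-pool");
        queue.createIfNotExists();

        // Replenisher: pre-generate a pool of sequential ids. In a real
        // deployment the last issued number would itself be persisted.
        long last = 1000L;
        for (int i = 0; i < 100; i++) {
            queue.addMessage(new CloudQueueMessage(Long.toString(++last)));
        }

        // Consumer: claim the next id. Deleting the message makes the
        // claim exclusive; a failed insert just leaves a "hole".
        CloudQueueMessage msg = queue.retrieveMessage();
        if (msg != null) {
            long id = Long.parseLong(msg.getMessageContentAsString());
            queue.deleteMessage(msg);
            // ... use id as the PK for the SQL Azure insert
        }
    }
}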
After reading this, I would not trust the identity column.
I think the best way is, before insert, to get the last stored id and increment it by one (programmatically). Another option is to create a trigger, but it could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
declare @seq int
set @seq = (select max(id) + 1 from table_name)
update table_name
set table_name.id = @seq
from table_name
inner join inserted
on table_name.id = inserted.id
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, then you can use the SnowMaker library, which is available on GitHub and NuGet. It gets around the scale problem by retrieving blocks of ids into a local cache. This guarantees that the ids are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.