I use embedded Hazelcast 4.0.1 in a Spring Boot project to manage the project's cache. I set up Near Cache and also enabled split-brain protection, which was called Quorum before 4.0.
However, I found a problem. For example, I put the cache operation on a service method:
@Cacheable(value = "CacheSpaceName", key = "#id")
public String findById(String id) {
...
}
If the correct data has been cached in Near Cache, even if the split-brain protection is in effect, the service will still return the correct result instead of being rejected by the split-brain protection.
How can I make Near Cache also be controlled by Split Brain Protection? I hope that when split brain occurs, small clusters cannot operate normally, and only large clusters can operate normally.
The following is the near cache configuration and split-brain protection configuration code in the project:
final NearCacheConfig nearCacheConfig = new NearCacheConfig()
        .setInMemoryFormat(InMemoryFormat.OBJECT)
        .setCacheLocalEntries(true)
        .setMaxIdleSeconds(xxx);
MapConfig allMapConfig = new MapConfig().setName("*").setNearCacheConfig(nearCacheConfig)
        .setBackupCount(0).setMaxIdleSeconds(xxx).setInMemoryFormat(InMemoryFormat.OBJECT)
        .setMergePolicyConfig(xxx);
final SplitBrainProtectionConfig splitBrainProtectionConfig = new SplitBrainProtectionConfig("name", true, 2);
splitBrainProtectionConfig.setProtectOn(SplitBrainProtectionOn.READ_WRITE);
allMapConfig.setSplitBrainProtectionName("name");
config.addSplitBrainProtectionConfig(splitBrainProtectionConfig);
config.addMapConfig(allMapConfig);
Near Cache is not covered by split-brain protection: Near Cache is application-side caching, whereas Hazelcast's built-in split-brain protection is meant to protect the cluster members (servers).
Additionally, regarding your point:
I hope that when split brain occurs, small clusters cannot operate
normally, and only large clusters can operate normally
In split brain clusters, both sides regardless of their sizes, work fine. Cluster size becomes relevant when network partitioning is resolved and the two sides are ready to merge. Hazelcast deploys a background task that periodically searches for split clusters. When a split is detected, the side that will initiate the merge process is decided. This decision is based on the cluster size; the smaller cluster, by member count, merges into the bigger one. If they have an equal number of members, then a hashing algorithm determines the merging cluster. When deciding the merging side, both sides ensure that there’s no intersection in their member lists.
I raised the problem that Near Cache is not controlled by split-brain protection because I want the small cluster to be prohibited from providing any service when a split brain occurs, while the large cluster keeps serving normally. The reason for this requirement is that some of our business logic relies on Hazelcast's cache synchronization. We need the cache to be updated at certain times to avoid using outdated data. During a split brain, a cache update cannot reach the complete cluster, so if the small cluster keeps serving at that point, it is likely to return wrong results.
Hazelcast's split-brain protection has a similar "minimum number of members" setting. When Hazelcast detects that the current number of cluster members is below this value, certain cache operations on the cluster are rejected. But because this only restricts cache operations, and I found that Near Cache is not covered by it, I asked my question. I later realized that Hazelcast's split-brain protection may not meet my needs at all.
I have now found another way to meet my needs: a filter that verifies the minimum number of members. Hazelcast's split-brain protection is then no longer needed (split-brain recovery is still needed, so the merge policy is still configured as usual).
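A minimal sketch of such a minimum-member check, not the code from my project. The class name MinimumMembersGate and the IntSupplier hook are hypothetical. With embedded Hazelcast, the supplier could be () -> hazelcastInstance.getCluster().getMembers().size(), and a servlet filter or interceptor would call check() before the request reaches the cached service.

```java
import java.util.function.IntSupplier;

/** Rejects work while the visible cluster is smaller than a configured minimum. */
final class MinimumMembersGate {
    // Hypothetical hook that reports how many members this node currently sees.
    private final IntSupplier clusterSize;
    private final int minimumMembers;

    MinimumMembersGate(IntSupplier clusterSize, int minimumMembers) {
        if (minimumMembers < 1) {
            throw new IllegalArgumentException("minimumMembers must be at least 1");
        }
        this.clusterSize = clusterSize;
        this.minimumMembers = minimumMembers;
    }

    /** True when this side of the cluster is large enough to serve requests. */
    boolean isAllowed() {
        return clusterSize.getAsInt() >= minimumMembers;
    }

    /** Throws when the cluster is too small; a filter would map this to an error response. */
    void check() {
        if (!isAllowed()) {
            throw new IllegalStateException(
                "Cluster has fewer than " + minimumMembers + " members; request rejected");
        }
    }

    public static void main(String[] args) {
        int[] visibleMembers = {3};
        MinimumMembersGate gate = new MinimumMembersGate(() -> visibleMembers[0], 2);
        System.out.println(gate.isAllowed()); // three members visible, minimum is two

        visibleMembers[0] = 1; // simulate ending up on the small side of a split
        System.out.println(gate.isAllowed());
    }
}
```

Because the check runs before the service method, it also covers reads that would otherwise be answered from Near Cache.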
We are using the EventProcessorClient class to read messages from all partitions of an Event Hub.
To enhance the checkpoint writing, we hold a counter and perform the checkpointing after processing multiple messages in ProcessEventAsync instead of after each one.
When the service restarts or gets deployed, we noticed that we re-read previously processed messages due to the fact that the checkpoint was not updated right before stopping.
This blog mentions the pattern of frequent checkpointing and suggests performing one last checkpoint inside the OnPartitionProcessingStopped when implementing EventProcessor<TPartition>.
However, when using EventProcessorClient, the implementation gets wrapped and only the PartitionClosingAsync event gets called, which does not have the required details to update the checkpoint.
Is there a way to update the checkpoint to the latest one when shutting down while using EventProcessorClient?
There are a couple of different ways to do this, with varying degrees of complexity. However, in most scenarios, it's unlikely that checkpointing on shutdown is going to have the effect that you would like it to. For the interested, context is way down below in "Load balancing details".
How-to: A simple approach
The most straightforward approach would be to hold on to the ProcessEventArgs for the last event processed by each partition in a class-level member, and use it to update the checkpoint when processing stops for that partition.
With this, you're still very likely to run into the dueling owners scenario described in the details.
How-to: More complex, less memory, same challenges
If you extend the EventProcessorClient, you could hold onto just the last processed offset for each partition and use it to call the protected UpdateCheckpointAsync method on the processor. This saves you some space, as you're holding onto just a long rather than the state associated with the event args.
With this, you're still very likely to run into the dueling owners scenario described in the details.
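The bookkeeping for holding onto the last offset is small. Here is a language-neutral sketch of it, written in Java rather than C#. LastOffsetTracker and its methods are hypothetical and not part of the Azure SDK. The event-processing handler would call recordProcessed after each event, and the partition-closing handler would pass the result of takeFinalOffset to UpdateCheckpointAsync.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;

/** Remembers only the last processed offset per partition (hypothetical helper). */
final class LastOffsetTracker {
    private final Map<String, Long> lastOffsets = new ConcurrentHashMap<>();

    /** Call after an event has been fully processed. */
    void recordProcessed(String partitionId, long offset) {
        lastOffsets.put(partitionId, offset);
    }

    /**
     * Call when the partition is closing. Returns the offset to checkpoint, if any,
     * and forgets it so a later close does not write the same checkpoint again.
     */
    OptionalLong takeFinalOffset(String partitionId) {
        Long offset = lastOffsets.remove(partitionId);
        return offset == null ? OptionalLong.empty() : OptionalLong.of(offset);
    }

    public static void main(String[] args) {
        LastOffsetTracker tracker = new LastOffsetTracker();
        tracker.recordProcessed("0", 10L);
        tracker.recordProcessed("0", 25L);
        System.out.println(tracker.takeFinalOffset("0")); // offset of the last processed event
        System.out.println(tracker.takeFinalOffset("0")); // empty: nothing new to checkpoint
    }
}
```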
How-to: A lot more complexity, more efficient, and without overlaps
For this one to make sense, you'd have to have a stable number of processors - no dynamic scaling.
Checkpointing takes the same approach as above: you'll extend the processor client and use the offset to call UpdateCheckpointAsync.
To avoid ownership changes and rewinds, you'll assign the processor a set of static partitions and bypass load balancing. The approach is described in the Static partition assignment sample, which can be applied when extending the EventProcessorClient.
With this, you do not have the potential for overlapping owners. Since the partitions are statically assigned, you know this node will be the only one to process them.
The trade-off here is that you lose load balancing and dynamic assignment. If the node crashes or loses network connectivity, nobody will be processing the set of partitions that it owns. This generally works well in host environments where there's an orchestrator monitoring nodes and ensuring they're healthy.
Load balancing details
When partitions change owners, it is not an ordered hand-off; processors do not coordinate to ensure that the old owner has stopped before the new owner begins reading. As a result, there is often a period of overlap where the old owner has a set of events in memory and won't be aware that a new owner has taken over until it tries to read the next batch.
During this time, both the old and new owners are processing events from the same partition. Both may be emitting checkpoints. If the old owner emits a checkpoint when the processing stops for that partition, there's a good chance it will rewind the checkpoint position to an earlier point than the current owner has written. That causes a bigger rewind if ownership changes again before the new owner emits a checkpoint.
We generally recommend that you expect to rewind by one checkpoint any time you're scaling the number of processors or deploying/rebooting nodes. This is something to take into account when devising your checkpoint strategy.
I recently posted a question and received a full answer, but I am now encountering another problem.
The scenario is the same as in my recent question.
How can I configure a member to own a partition key?
For example, the DataCenterOnRussia partition key must always be owned by member1 and the DataCenterOnGermany partition key must always be owned by member2.
That way, member2 could request data from DataCenterOnRussia using PartitionAwareKey.
The intent of the PartitionAwareKey is to allow for data affinity ... orders for a customer should be stored in the same partition as the customer record, for example, since they are frequently accessed together.
The PartitionAwareKey allows grouping items together, but it does not provide a way to specify the placement of those items on a specific cluster member. (I guess if there were such a thing, it would likely be called MemberAwareKey).
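To illustrate the grouping, here is a small sketch that does not use the Hazelcast API. OrderKey, partitionKey() and partitionOf() are hypothetical names, and Arrays.hashCode is only a stand-in for Hazelcast's own hash function; 271 is Hazelcast's default partition count. The point is that the partition is derived from the partition key (the customer id), not from the full key, so an order lands wherever its customer record lands.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AffinityDemo {
    static final int PARTITION_COUNT = 271; // Hazelcast's default partition count

    /** Hypothetical key for an order; the customer id acts as the partition key. */
    static final class OrderKey {
        final String orderId;
        final String customerId;

        OrderKey(String orderId, String customerId) {
            this.orderId = orderId;
            this.customerId = customerId;
        }

        String partitionKey() {
            return customerId;
        }
    }

    /** Stand-in hash: only the partition key decides the partition. */
    static int partitionOf(String partitionKey) {
        byte[] bytes = partitionKey.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(bytes), PARTITION_COUNT);
    }

    public static void main(String[] args) {
        OrderKey first = new OrderKey("order-1", "customer-42");
        OrderKey second = new OrderKey("order-2", "customer-42");
        int customerPartition = partitionOf("customer-42");

        // Both orders share the customer's partition, whichever member owns it.
        System.out.println(partitionOf(first.partitionKey()) == customerPartition);  // true
        System.out.println(partitionOf(second.partitionKey()) == customerPartition); // true
    }
}
```

Nothing in this scheme names a member: which member holds that partition is decided by the cluster and can change as members join or leave.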
A cluster in Hazelcast isn't a fixed-size entity; it is dynamically scalable, so members might be added or removed, and it is fault-tolerant, so a member could be lost without loss of the data that happened to be on that member. In order to support those features, the cluster must have the freedom to move partitions around to different machines as the cluster topology changes.
Hazelcast recommends that all members of a cluster be similarly configured (equivalent memory configuration, most particularly) because of the idea that cluster members are interchangeable, at least as far as data storage is concerned. (The MemberSelector facility does make provision for handling systems that have different processing capability, e.g., number of processor cores; but nothing similar exists to allow placement of specific data entries or partitions on a designated member).
If your use case requires specific placement on machines, it's an indication that those machines probably should not be part of the same cluster.
I use Hazelcast to store data in memory. Sometimes one node loses its connection to the others, and the values in the Hazelcast map may then differ between nodes.
I want to know what happens when the connection is re-established and the values in the map are different.
When network partitioning happens, your cluster might split into two and, as you say, the values might differ because the two sides run in parallel. After the clusters join again, the split-brain recovery mechanism runs. Depending on the data structure you use, the data in both clusters is merged according to the configured merge policy. I suggest reading the Split Brain Recovery section of the Hazelcast manual for more detail.
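As an illustration of what a merge policy decides, here is a sketch of a "latest update wins" rule in plain Java. It is not Hazelcast's implementation; the Versioned class, the merge method, and the tie-breaking choice are assumptions made for the example.

```java
final class LatestUpdateMerge {
    /** Hypothetical entry value with the time of its last update. */
    static final class Versioned {
        final String value;
        final long lastUpdateTime;

        Versioned(String value, long lastUpdateTime) {
            this.value = value;
            this.lastUpdateTime = lastUpdateTime;
        }
    }

    /**
     * Decides which value survives for one key when two sides rejoin.
     * 'merging' comes from the side being merged in, 'existing' from the
     * side that stays. On equal timestamps the merging value wins, which
     * is an assumption of this sketch.
     */
    static Versioned merge(Versioned merging, Versioned existing) {
        if (existing == null) {
            return merging; // key only exists on the merging side
        }
        return merging.lastUpdateTime >= existing.lastUpdateTime ? merging : existing;
    }

    public static void main(String[] args) {
        Versioned sideA = new Versioned("written-on-side-A", 100L);
        Versioned sideB = new Versioned("written-on-side-B", 250L);
        System.out.println(merge(sideA, sideB).value); // written-on-side-B
    }
}
```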
(I could not find a good source explaining this, so if it is available elsewhere, you could just point me to it)
Hazelcast replicates data across all nodes in clusters. So, if data is changed in one of the nodes, does the node update its own copy and then propagate it to other nodes?
I read somewhere that each data is owned by a node, how does Hazelcast determine the owner? Is the owner determined per datastructure or per key in the datastructure?
Does Hazelcast follow "eventually consistent" principle? (When the data is being propagated across the nodes, there could be a small window during which the data might be inconsistent between the nodes)
How are conflicts handled? (Two nodes update the same key-value simultaneously)
Hazelcast does not replicate data (with the exception of the ReplicatedMap, obviously ;-)) but partitions it. That means exactly one node owns a given key. All updates to that key go to the owner, which then sends out notifications about the update.
The owner is determined by consistent hashing using the following formula:
partitionId = hash(serialize(key)) % partitionCount
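A rough sketch of that formula in plain Java. Hazelcast uses its own serialization and hash function; Java serialization and Arrays.hashCode are only stand-ins here, and the class and method names are hypothetical. 271 is the default partition count.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class PartitionLookup {
    /** Stand-in for Hazelcast's serialization: plain Java serialization. */
    static byte[] serialize(Serializable key) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(key);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** partitionId = hash(serialize(key)) % partitionCount, kept non-negative. */
    static int partitionId(Serializable key, int partitionCount) {
        int hash = Arrays.hashCode(serialize(key)); // stand-in hash function
        return Math.floorMod(hash, partitionCount);
    }

    public static void main(String[] args) {
        int partitionCount = 271; // Hazelcast's default
        System.out.println(partitionId("customer-42", partitionCount));
        // The same key always maps to the same partition, so it always has one owner.
        System.out.println(
            partitionId("customer-42", partitionCount) == partitionId("customer-42", partitionCount));
    }
}
```

The partition id is then looked up in the partition table to find the member that currently owns that partition.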
Since there is only one owner per key, it is not eventually consistent but consistent as soon as the mutating operation returns. Under normal operational circumstances, all following read operations will see the new value. When any kind of failure happens (network, host, ...) we choose availability over consistency, and it might happen that a not yet updated backup is reactivated (especially if you use async backups).
Conflicts can happen after a split-brain, when the split cluster re-merges. For this case you have to configure a MergePolicy (or use the default one) to define how conflicting elements are merged together or which of the two wins.
I'm in the process of evaluating GridGain and have read and re-read all the documentation I could find. While much of it is very thorough, you can tell that it's mostly written by the developers. It would be great if there were a reference book written from an outsider's perspective.
Anyway, I have five basic questions I'm hoping someone from GridGain can answer and clarify for me.
1. It's my understanding that GridCacheQueue (and the other Distributed Data Structures) are built on top of the GridCache implementation. Does that mean that each element of the GridCacheQueue is really just a GridCacheElement of the GridCache map, or is each GridCacheQueue a GridCacheElement, or do I have this totally wrong?
2. If I set a default TTL on the GridCache, will the elements of a GridCacheQueue expire in the TTL time, or does it only apply to GridCacheElements (which might be answered in #1 above)?
3. Is there a way to make a GridCacheQueue expire after some period of time without having to remove it manually?
4. If a cache is set-up to be backed-up onto other nodes and the cache is using off-heap memory and/or swap storage, is the off-heap memory and/or swap storage also replicated onto the back-up nodes?
5. Is it possible to create a new cache dynamically, or can it only be created via configuration when the node is created?
Thanks for any insightful information!
-Colin
After experimenting with a GridCache and a GridCacheQueue, here's what I've learned about my 5 questions:
1. I don't know how the GridCacheQueue or its elements are attached to a GridCache, but I know that the elements of a GridCacheQueue DO NOT show up as GridCacheElements of the GridCache.
2. If you set a TTL on a GridCache and add a GridCacheQueue to it, once the elements of the GridCache begin expiring, the GridCacheQueue becomes unusable and will cause a GridRuntimeException to be thrown.
3. Yes, see #2 above. However, there doesn't seem to be a safe way to test if the queue is still in existence once the elements of the GridCache start to expire.
4. Still have no information about this yet. Would REALLY like some feedback on that.
5. That was a question I never should have asked. A GridCache can be created entirely in code and configured.
Let me first of all say that GridGain supports several queue configuration parameters:
Colocated vs. non-colocated. In colocated mode you can have many queues. Each queue will be assigned to some grid node and all the data in that queue will be cached on that grid node. This way, if you have many queues, each queue may be cached on a different node, but queues themselves should be evenly distributed across all nodes. Non-colocated mode, on the other hand is meant for larger queues, where data for the same queue is partitioned across multiple nodes.
Capacity - this parameter defines the maximum queue capacity. When the queue reaches this capacity, it automatically starts evicting the oldest elements.
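To make the capacity behavior concrete, here is a small local sketch in plain Java. It is not the GridGain API and is not distributed; BoundedQueue and its methods are hypothetical names that only mirror the "evict the oldest element when full" rule described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Local, non-distributed illustration of a capacity-bounded queue. */
final class BoundedQueue<E> {
    private final Deque<E> items = new ArrayDeque<>();
    private final int capacity;

    BoundedQueue(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Adds an element; returns the evicted oldest element, or null if nothing was evicted. */
    E offer(E element) {
        E evicted = null;
        if (items.size() == capacity) {
            evicted = items.pollFirst(); // queue is full: drop the oldest element
        }
        items.addLast(element);
        return evicted;
    }

    /** Removes and returns the oldest element, or null if the queue is empty. */
    E poll() {
        return items.pollFirst();
    }

    int size() {
        return items.size();
    }

    public static void main(String[] args) {
        BoundedQueue<String> queue = new BoundedQueue<>(2);
        queue.offer("a");
        queue.offer("b");
        System.out.println(queue.offer("c")); // a (oldest element evicted)
        System.out.println(queue.poll());     // b
    }
}
```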
Now, let me try to tackle some of these questions.
1. I believe each element of GridCacheQueue is a separate element in cache, but the implementation marks them as internal elements. That is why you don't see these elements when iterating through the cache.
2. TTL should not be used with elements in the queue (GridGain will be adding this feature soon). For now, you should limit the maximum size of the queue by specifying queue 'capacity' at creation time.
3. I don't believe so, but I think this feature is being added. For now, you can try using org.gridgain.grid.schedule.GridScheduler to schedule a job that will delete a queue later.
4. The answer is YES. Data in both off-heap and swap spaces is backed up and replicated the same way as main on-heap cache data.
5. A cache should be created in configuration, either from code or XML. However, GridGain has a cool notion of GridCacheProjection, which allows you to create various sub-caches (cache views) on the same cache. For example, if you store Person and Organization classes in the same cache, then you can use a cache projection for type Person when working with the Person class, and a cache projection for type Organization when working with the Organization class.
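The idea of a typed view over a shared cache can be sketched in plain Java as follows. This is not the GridCacheProjection API; TypedView, Person and Organization are hypothetical stand-ins, and an ordinary Map plays the role of the cache.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class ProjectionDemo {
    static final class Person {
        final String name;
        Person(String name) { this.name = name; }
    }

    static final class Organization {
        final String name;
        Organization(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        Map<String, Object> cache = new HashMap<>(); // one shared cache for both types
        cache.put("p1", new Person("Alice"));
        cache.put("o1", new Organization("Acme"));

        TypedView<Person> people = new TypedView<>(cache, Person.class);
        System.out.println(people.get("p1").name);  // Alice
        System.out.println(people.get("o1"));       // null: not a Person
        System.out.println(people.entries().size()); // 1
    }
}

/** A view that only exposes values of one type from the underlying cache. */
final class TypedView<V> {
    private final Map<String, Object> cache;
    private final Class<V> type;

    TypedView(Map<String, Object> cache, Class<V> type) {
        this.cache = cache;
        this.type = type;
    }

    /** Returns the value if it exists and has the view's type, otherwise null. */
    V get(String key) {
        Object value = cache.get(key);
        return type.isInstance(value) ? type.cast(value) : null;
    }

    /** Writes go straight through to the shared cache. */
    void put(String key, V value) {
        cache.put(key, value);
    }

    /** Snapshot of all entries whose value has the view's type. */
    Map<String, V> entries() {
        Map<String, V> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : cache.entrySet()) {
            if (type.isInstance(entry.getValue())) {
                result.put(entry.getKey(), type.cast(entry.getValue()));
            }
        }
        return result;
    }
}
```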