How does partition lease ownership management work in EventProcessorClient? There is an article with a high-level description, but I am looking for more details, such as lease management, timings, expiration, etc.
Note: question is about EventProcessorClient, not EventProcessorHost.
The EventProcessorClient uses a conservative approach to claiming partitions intended to allow processors to cycle up without having partitions "bounce" between them as they fight for ownership until things have stabilized.
The load balancing cycle ticks every 10 seconds, at which point the processor will determine whether or not to claim an additional partition. Partition ownership uses a 30 second lease, which is used primarily to determine when a partition may have been orphaned; each cycle, the lease for claimed partitions is extended when the owner is healthy. The PartitionLoadBalancer implementation is fairly heavily commented, if you're interested in further details.
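To make the timings concrete, here is a toy model in Python. It is only a sketch: the 10-second cycle and 30-second lease come from the answer above, while the `Ownership` record, `is_orphaned`, and `run_cycle` are names made up for this illustration and are not part of the Azure SDK. The real load balancer also weighs how many partitions each processor should own, which this sketch ignores.

```python
from dataclasses import dataclass

# Values quoted in the answer above; everything else here is illustrative.
LOAD_BALANCING_INTERVAL_S = 10
OWNERSHIP_LEASE_S = 30


@dataclass
class Ownership:
    owner: str
    last_renewed_s: float  # time of the most recent renewal, in seconds


def is_orphaned(ownership: Ownership, now_s: float) -> bool:
    """A partition is treated as orphaned once its lease has run out."""
    return now_s - ownership.last_renewed_s >= OWNERSHIP_LEASE_S


def run_cycle(ownerships: dict[str, Ownership], me: str, now_s: float) -> None:
    """One load balancing tick: renew my leases, then claim at most one orphan."""
    for ownership in ownerships.values():
        if ownership.owner == me:
            ownership.last_renewed_s = now_s
    for partition_id, ownership in sorted(ownerships.items()):
        if ownership.owner != me and is_orphaned(ownership, now_s):
            ownerships[partition_id] = Ownership(me, now_s)
            break  # conservative: no more than one new claim per cycle


ownerships = {
    "0": Ownership("A", 0.0),
    "1": Ownership("B", 0.0),
    "2": Ownership("B", 0.0),
}
# Processor B stops renewing at t=0; processor A keeps ticking every 10 s.
for now in range(LOAD_BALANCING_INTERVAL_S, 50, LOAD_BALANCING_INTERVAL_S):
    run_cycle(ownerships, "A", float(now))
    print(now, {p: o.owner for p, o in ownerships.items()})
```

In this model, A picks up B's first partition only once the 30-second lease has lapsed, and the second one a cycle later, which is the "no bouncing" behaviour described above.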
The values for the load balancing configuration can be seen by observing the defaults for the EventProcessorOptions class, which influences the EventProcessor<TPartition> class that serves as a base for the EventProcessorClient.
At present, the load balancing configuration is not exposed within the EventProcessorClientOptions in order to simplify the API surface, though that is a topic of discussion for some changes that are being worked on in the near future.
Related
We are using the EventProcessorClient class to read messages from all partitions of an Event Hub.
To reduce the cost of checkpoint writes, we keep a counter and checkpoint after processing multiple messages in ProcessEventAsync instead of after each one.
When the service restarts or gets deployed, we noticed that we re-read previously processed messages due to the fact that the checkpoint was not updated right before stopping.
This blog mentions the pattern of frequent checkpointing and suggests performing one last checkpoint inside the OnPartitionProcessingStopped when implementing EventProcessor<TPartition>.
However, when using EventProcessorClient, the implementation gets wrapped and only the PartitionClosingAsync event gets called, which does not have the required details to update the checkpoint.
Is there a way to update the checkpoint to the latest one when shutting down while using EventProcessorClient?
There are a couple of different ways to do this, with varying degrees of complexity. However, in most scenarios, it's unlikely that checkpointing on shutdown is going to have the effect that you would like it to. For the interested, context is way down below in "Load balancing details".
How-to: A simple approach
The most straightforward approach would be to hold on to the ProcessEventArgs for the last event processed by each partition in a class-level member, and use that to write the checkpoint when processing for the partition stops.
With this, you're still very likely to run into the dueling owners scenario described in the details.
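The bookkeeping for this can be sketched in a language-neutral way. The Python below is not the Azure SDK: `LastEventTracker`, its method names, and the `checkpoint` callback are hypothetical stand-ins for the event handler, the closing handler, and whatever call actually writes the checkpoint. It keeps the counter-based checkpointing from the question and adds one final write when the partition closes.

```python
from typing import Callable, Dict


class LastEventTracker:
    """Checkpoint every `batch_size` events and once more when a partition closes."""

    def __init__(self, checkpoint: Callable[[str, int], None], batch_size: int = 100):
        self._checkpoint = checkpoint            # hypothetical: (partition_id, offset) -> None
        self._batch_size = batch_size
        self._last_offset: Dict[str, int] = {}   # newest processed offset per partition
        self._pending: Dict[str, int] = {}       # events processed since the last checkpoint

    def on_event(self, partition_id: str, offset: int) -> None:
        self._last_offset[partition_id] = offset
        self._pending[partition_id] = self._pending.get(partition_id, 0) + 1
        if self._pending[partition_id] >= self._batch_size:
            self._flush(partition_id)

    def on_partition_closing(self, partition_id: str) -> None:
        # Final checkpoint, only if something was processed since the last one.
        if self._pending.get(partition_id, 0) > 0:
            self._flush(partition_id)

    def _flush(self, partition_id: str) -> None:
        self._checkpoint(partition_id, self._last_offset[partition_id])
        self._pending[partition_id] = 0


written = []
tracker = LastEventTracker(lambda pid, off: written.append((pid, off)), batch_size=3)
for offset in range(5):
    tracker.on_event("0", offset)
tracker.on_partition_closing("0")
print(written)
```

The same structure works whether the per-partition state is the full event args or just the offset; only what `_last_offset` stores changes.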
How-to: More complex, less memory, same challenges
If you extend the EventProcessorClient, you could hold onto just the last processed offset for each partition and use it to call the protected UpdateCheckpointAsync method on the processor. This saves you some space, as you're holding onto just a long rather than the state associated with the event args.
With this, you're still very likely to run into the dueling owners scenario described in the details.
How-to: A lot more complexity, more efficient, and without overlaps
For this one to make sense, you'd have to have a stable number of processors - no dynamic scaling.
Checkpointing takes the same approach as above: you'll extend the processor client and use the offset to call UpdateCheckpointAsync.
To avoid ownership changes and rewinds, you'll assign the processor a set of static partitions and bypass load balancing. The approach is described in the Static partition assignment sample, which can be applied when extending the EventProcessorClient.
With this, you do not have the potential for overlapping owners. Since the partitions are statically assigned, you know this node will be the only one to process them.
The trade-off here is that you lose load balancing and dynamic assignment. If the node crashes or loses network connectivity, nobody will be processing the set of partitions that it owns. This generally works well in host environments where there's an orchestrator monitoring nodes and ensuring they're healthy.
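One way to derive the static sets, assuming a fixed processor count, is to deal the partition ids out round-robin. This is only a sketch in Python; `assign_partitions` is a made-up helper, not something from the SDK or the sample mentioned above.

```python
def assign_partitions(partition_ids: list[str], processor_count: int) -> list[list[str]]:
    """Deal partitions round-robin so each processor gets a fixed, disjoint set."""
    if processor_count <= 0:
        raise ValueError("processor_count must be positive")
    ordered = sorted(partition_ids)
    return [ordered[i::processor_count] for i in range(processor_count)]


# Eight partitions split across three processors (both numbers are examples).
assignments = assign_partitions([str(n) for n in range(8)], 3)
for index, owned in enumerate(assignments):
    print(f"processor {index} owns {owned}")
```

Because the result depends only on the partition list and the processor count, every node can compute its own set without coordinating, which is also why the count has to stay stable.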
Load balancing details
When partitions change owners, it is not an ordered hand-off; processors do not coordinate to ensure that the old owner has stopped before the new owner begins reading. As a result, there is often a period of overlap where the old owner has a set of events in memory and won't be aware that a new owner has taken over until it tries to read the next batch.
During this time, both the old and new owners are processing events from the same partition. Both may be emitting checkpoints. If the old owner emits a checkpoint when the processing stops for that partition, there's a good chance it will rewind the checkpoint position to an earlier point than the current owner has written. That causes a bigger rewind if ownership changes again before the new owner emits a checkpoint.
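A small simulation shows how the rewind happens. This is a toy in Python with invented offsets; it assumes the checkpoint store simply keeps whatever was written last, which is the behaviour the paragraph above implies.

```python
checkpoints: dict[str, int] = {"0": 100}  # partition id -> last checkpointed offset


def write_checkpoint(partition_id: str, offset: int) -> None:
    # Assumption for this sketch: the store keeps whatever was written last and
    # does not compare against the value already there.
    checkpoints[partition_id] = offset


# The old owner still holds a batch in memory and has processed up to offset 120.
old_owner_position = 120

# The new owner claims partition "0", resumes from 100, and reaches 200.
new_owner_position = 200
write_checkpoint("0", new_owner_position)

# The old owner now notices it lost the partition and checkpoints on the way out.
write_checkpoint("0", old_owner_position)

rewind = new_owner_position - checkpoints["0"]
print(f"checkpoint is now {checkpoints['0']}, {rewind} offsets behind the new owner")
```

If ownership moved again at this point, the next owner would start from the old owner's position rather than from where the current owner actually is.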
We generally recommend that you expect to rewind by one checkpoint any time you're scaling the number of processors or deploying/rebooting nodes. This is something to take into account when devising your checkpoint strategy.
I recently posted a question and received a full answer, but I am now encountering another problem.
The scenario is the same as in my recent question.
How can I configure a member to own a partition key?
e.g. DataCenterOnRussia partition key must always be owned by member1 and DataCenterOnGermany partition key must always be owned by member2.
So member2 could request data from DataCenterOnRussia using PartitionAwareKey.
The intent of the PartitionAwareKey is to allow for data affinity ... orders for a customer should be stored in the same partition as the customer record, for example, since they are frequently accessed together.
The PartitionAwareKey allows grouping items together, but not a way to specify the placement of those items on a specific cluster member. (I guess if there were such a thing, it would likely be called MemberAwareKey).
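The difference between grouping and placement can be sketched as follows. This Python toy is not Hazelcast code: 271 is Hazelcast's default partition count, but the CRC32 hash is a stand-in for Hazelcast's own hashing, and `OrderKey` with its `partition_key` method is a hypothetical analogue of a partition-aware key.

```python
import zlib

PARTITION_COUNT = 271  # Hazelcast's default partition count


def partition_id(partition_key: str) -> int:
    # Stand-in hash for illustration; Hazelcast uses its own hash of the serialized key.
    return zlib.crc32(partition_key.encode("utf-8")) % PARTITION_COUNT


class OrderKey:
    """Hypothetical key that is routed by its customer id rather than its own id."""

    def __init__(self, order_id: str, customer_id: str) -> None:
        self.order_id = order_id
        self.customer_id = customer_id

    def partition_key(self) -> str:
        return self.customer_id


customer_partition = partition_id("customer-42")
order_partition = partition_id(OrderKey("order-1", "customer-42").partition_key())
print(customer_partition, order_partition)
```

The key decides which partition an entry lands in, so the order and its customer end up together. Which member holds that partition is decided by the cluster and can change as members join or leave.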
A cluster in Hazelcast isn't a fixed-size entity; it is dynamically scalable, so members might be added or removed, and it is fault-tolerant, so a member could be lost without loss of the data that happened to be on that member. In order to support those features, the cluster must have the freedom to move partitions around to different machines as the cluster topology changes.
Hazelcast recommends that all members of a cluster be similarly configured (equivalent memory configuration, most particularly) because of the idea that cluster members are interchangeable, at least as far as data storage. (The MemberSelector facility does provide a provision for handling systems that have different processing capability, e.g., number of processor cores; but nothing similar exists to allow placement of specific data entries or partitions on a designated member).
If your use case requires specific placement on machines, it's an indication that those machines probably should not be part of the same cluster.
I am looking for documentation or general guidelines on when more Cassandra servers should be added to a ring. Should this be based on disk usage or other monitoring factors?
Currently I have some concerns about CoordinatorReadLatency, ReadLatency, and DroppedMessages.REQUEST_RESPONSE, but again I cannot find a good guide on how to interpret various components that I am monitoring. I can find good guides on performance tuning, but limited information on devops.
I understand that this question may be more relevant to Server Fault, but they don't have tags for Datastax Enterprise.
Thanks in advance
Next steps based on @bcoverston's response
Nodetool provides access to read and write latency metrics: nodetool cfhistograms
See docs here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCFhisto.html?scroll=toolsCFhisto#
Since we want to tie this into pretty graphs, the nodetool source code points us to the right JMX values:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/NodeTool.java#L82
Each cf has write and read latency metrics.
The question is a little open ended, and it depends on your use case. There are a lot of things to monitor, and it can be overwhelming to look at every possible setting and decide if you need to increase your cluster size.
The general advice here is that you should monitor your read and write latency, decide where your thresholds should be, and plan your capacity accordingly. Because there is no prescriptive hardware for running Cassandra, and your use case can be unique to whatever you're doing, there are only rules of thumb.
Sizing your cluster based on data/node can be helpful, but only if I know how big your working set is and what your latency targets are. In addition, the speed of your storage media matters.
Sizing your cluster based on latency makes more sense. If you need to do N tx/second, you can test your hardware against your workload and see if it can meet your targets. Keep in mind that you'll want to run a long-term test to see whether those targets hold up in a sustained manner, and whether and how quickly performance degrades under that load (a write-heavy workload will degrade over time, and you'll want to add capacity before you start missing your targets).
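As a rough sketch of what "decide where your thresholds should be" can look like in practice, the Python below checks a window of latency samples against a target. The sample values, the choice of the 99th percentile, and the helper names are all assumptions made for the example; the right percentile and target depend on your workload.

```python
import statistics


def p99(samples_ms: list[float]) -> float:
    """99th percentile of a window of latency samples, in milliseconds."""
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    return statistics.quantiles(samples_ms, n=100)[98]


def needs_capacity(samples_ms: list[float], target_ms: float) -> bool:
    """True when the window's p99 latency exceeds the agreed target."""
    return p99(samples_ms) > target_ms


# Made-up read latencies: mostly fast, with a slow tail.
window = [2.0] * 95 + [40.0, 55.0, 60.0, 80.0, 120.0]
print(p99(window), needs_capacity(window, target_ms=50.0))
```

Running a check like this over a sustained test, rather than a short burst, is what shows whether the targets hold as the data set grows.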
I'm in the process of evaluating GridGain and have read and re-read all the documentation I could find. While much of it is very thorough, you can tell that it's mostly written by the developers. It would be great if there were a reference book written from an outsider's perspective.
Anyway, I have five basic questions I'm hoping someone from GridGain can answer and clarify for me.
1. It's my understanding that GridCacheQueue (and the other Distributed Data Structures) are built on top of the GridCache implementation. Does that mean that each element of the GridCacheQueue is really just a GridCacheElement of the GridCache map, or is each GridCacheQueue a GridCacheElement, or do I have this totally wrong?
2. If I set a default TTL on the GridCache, will the elements of a GridCacheQueue expire in the TTL time, or does it only apply to GridCacheElements (which might be answered in #1 above)?
3. Is there a way to make a GridCacheQueue expire after some period of time without having to remove it manually?
4. If a cache is set-up to be backed-up onto other nodes and the cache is using off-heap memory and/or swap storage, is the off-heap memory and/or swap storage also replicated onto the back-up nodes?
5. Is it possible to create a new cache dynamically, or can it only be created via configuration when the node is created?
Thanks for any insightful information!
-Colin
After experimenting with a GridCache and a GridCacheQueue, here's what I've learned about my 5 questions:
1. I don't know how the GridCacheQueue or its elements are attached to a GridCache, but I know that the elements of a GridCacheQueue DO NOT show up as GridCacheElements of the GridCache.
2. If you set a TTL on a GridCache and add a GridCacheQueue to it, once the elements of the GridCache begin expiring, the GridCacheQueue becomes unusable and will cause a GridRuntimeException to be thrown.
3. Yes, see #2 above. However, there doesn't seem to be a safe way to test if the queue is still in existence once the elements of the GridCache start to expire.
4. Still have no information about this yet. Would REALLY like some feedback on that.
5. That was a question I never should have asked. A GridCache can be created entirely in code and configured.
Let me first of all say that GridGain supports several queue configuration parameters:
Colocated vs. non-colocated. In colocated mode you can have many queues. Each queue will be assigned to some grid node and all the data in that queue will be cached on that grid node. This way, if you have many queues, each queue may be cached on a different node, but queues themselves should be evenly distributed across all nodes. Non-colocated mode, on the other hand is meant for larger queues, where data for the same queue is partitioned across multiple nodes.
Capacity - this parameter defines the maximum queue capacity. When the queue reaches this capacity, it will automatically start evicting the oldest elements.
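The capacity behaviour described above is the same idea as a bounded queue that drops from the head. The snippet below uses Python's standard `deque`, not GridGain, and a capacity of 3 picked only for the example.

```python
from collections import deque

queue = deque(maxlen=3)  # capacity of 3, chosen only for the example
for item in ["a", "b", "c", "d", "e"]:
    queue.append(item)   # once full, each append drops the oldest element
print(list(queue))
```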
Now, let me try to tackle some of these questions.
1. I believe each element of GridCacheQueue is a separate element in cache, but the implementation marks them as internal elements. That is why you don't see these elements when iterating through the cache.
2. TTL should not be used with elements in the queue (GridGain will be adding this feature soon). For now, you should limit the maximum size of the queue by specifying queue 'capacity' at creation time.
3. I don't believe so, but I think this feature is being added. For now, you can try using org.gridgain.grid.schedule.GridScheduler to schedule a job that will delete a queue later.
4. The answer is YES. Data in both off-heap and swap spaces is backed up and replicated the same way as main on-heap cache data.
5. A cache should be created in configuration, either from code or XML. However, GridGain has a cool notion of GridCacheProjection, which allows you to create various sub-caches (cache views) on the same cache. For example, if you store Person and Organization classes in the same cache, then you can use a cache projection for type Person when working with the Person class, and a cache projection of type Organization when working with the Organization class.
There is a great talk here about simulating partition issues in Cassandra with Kingsbury's Jepsen library.
My question is - with Cassandra are you mainly concerned with the Partitioning part of the CAP theorem, or is Consistency a factor you need to manage as well?
Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered to be more important than consistency. However, real world systems rarely fall neatly into these categories, so it's more helpful to view CAP as a continuum. Most systems will make some effort to be consistent, available, and partition tolerant, and many (including Cassandra) can be tuned depending on what's most important. Turning knobs like replication factor and consistency level can have a dramatic impact on C, A, and P.
Even defining what the terms mean can be challenging, as various use cases have different requirements for each. So rather than classify a system as CP, AP, or whatever, it's more helpful to think in terms of the options it provides for tuning these properties as appropriate for the use case.
Here's an interesting discussion on how things have changed in the years since the CAP theorem was first introduced.
CAP stands for Consistency, Availability and Partition Tolerance.
In general, it's impossible for a distributed system to guarantee all three of the above at the same time.
Apache Cassandra is classified as an AP system, meaning it favors Availability and Partition Tolerance over Consistency. This can be tuned further via the replication factor (how many copies of the data are kept) and the consistency level (for reads and writes).
For more info: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html
Interestingly, it depends on your Cassandra configuration. Cassandra can at most be an AP system. But if you configure it to read or write based on Quorum, then it does not remain CAP-available (available as per the definition of the CAP theorem) and is only a P system.
Just to explain things in more detail CAP theorem means:
C: (Linearizability or strong consistency) roughly means
If operation B started after operation A successfully completed, then
operation B must see the system in the same state as it was on
completion of operation A, or a newer state (but never older state).
A:
“every request received by a non-failing [database] node in the system must result in a [non-error] response”. It’s not sufficient for some node to be able to handle the request: any non-failing node needs to be able to handle it. Many so-called “highly available” (i.e. low downtime) systems actually do not meet this definition of availability.
P:
Partition Tolerance (terribly misnamed) basically means that you’re communicating over an asynchronous network that may delay or drop messages. The internet and all our data centres have this property, so you don’t really have any choice in this matter.
Source: Martin Kleppmann's awesome work
The CAP theorem states that a database can’t simultaneously guarantee consistency, availability, and partition tolerance.
Since network partitions are part of life, distributed databases tend to be either CP or AP.
Cassandra was meant to be AP, but you can fine-tune consistency at the cost of availability.
Availability: this is ensured with replicas. Cassandra typically writes multiple copies to different cluster nodes (generally 3). If one node is unavailable, data won't be lost.
Writing data to multiple nodes takes time because the nodes are scattered across different locations. The data becomes consistent eventually, not immediately.
So with a preference for high availability, consistency is compromised.
Tunable consistency:
For a read or write operation, you can specify a consistency level. The consistency level refers to the number of replicas that need to respond for a read or write operation to be considered complete.
For non-critical features, you can use a lower consistency level, say ONE.
If you think consistency is important, you can increase the level to TWO, THREE, or QUORUM (a majority of replicas).
Assume that you set the consistency level high (QUORUM) for your critical features and a majority of the nodes are down. In this case, the write operation will fail.
Here Cassandra sacrifices availability for consistency.
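The arithmetic behind this trade-off is small enough to write out. The Python below is a sketch, not driver code: it models only the levels named above, takes QUORUM as floor(RF / 2) + 1, and uses the usual rule of thumb that a read sees the latest write when the read and write replica counts add up to more than the replication factor. The function names are invented for the example.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


# Replicas that must respond for each consistency level modelled in this sketch.
REQUIRED = {
    "ONE": lambda rf: 1,
    "TWO": lambda rf: 2,
    "THREE": lambda rf: 3,
    "QUORUM": quorum,
}


def can_complete(level: str, replication_factor: int, live_replicas: int) -> bool:
    """An operation completes only if enough replicas are up to answer it."""
    return live_replicas >= REQUIRED[level](replication_factor)


def reads_see_latest_write(read_level: str, write_level: str, replication_factor: int) -> bool:
    """Read and write replica sets must overlap: R + W > RF."""
    r = REQUIRED[read_level](replication_factor)
    w = REQUIRED[write_level](replication_factor)
    return r + w > replication_factor


# Replication factor 3 with two of the three replicas down.
print(can_complete("QUORUM", 3, live_replicas=1))   # QUORUM needs 2 replicas
print(can_complete("ONE", 3, live_replicas=1))      # ONE needs 1 replica
print(reads_see_latest_write("QUORUM", "QUORUM", 3))
print(reads_see_latest_write("ONE", "ONE", 3))
```

With a replication factor of 3, QUORUM reads and writes overlap but stop working once two replicas are down, while ONE keeps working without the overlap guarantee.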
Have a look at this article for more details.