Should we be concerned about too many bundles in a Pulsar namespace?

We observed the following warning in our Pulsar cluster. Should we be concerned about it?
18:37:37.165 [pulsar-modular-load-manager-29-1]
WARN org.apache.pulsar.broker.loadbalance.BundleSplitStrategy -
Could not split namespace bundle pulsareval/0x56000000_0x56800000
because namespace pulsareval has too many bundles: 512

First some background on what bundles actually are...
from https://pulsar.apache.org/docs/en/administration-load-balance/#creating-namespaces-and-bundles
Instead of being assigned to an individual topic, each Pulsar broker takes ownership of a subset of the topics for a namespace. This subset is called a "bundle" and effectively it's a sharding mechanism.
Topics are assigned to a particular bundle by taking the hash of the topic name and seeing which bundle the hash falls into. Each bundle is independent of the others and thus is independently assigned to different brokers.
In general, if the expected traffic and number of topics is known in advance, it's a good idea to start with a reasonable number of bundles instead of waiting for the system to auto-correct the distribution.
Namespace bundles splitting
Since the load for the topics in a bundle might change over time, or could just be hard to predict upfront, bundles can be split in 2 by brokers. The new smaller bundles can then be reassigned to different brokers.
Your error message indicates that the number of topics grew enough to trigger the automatic splitting of the bundles, but that split failed due to the maximum number of bundles allowed in a namespace. While this is a non-fatal condition, it will impact performance.
Therefore you might want to increase the value of the loadBalancerNamespaceMaximumBundles property in your broker.conf file, which limits the maximum number of bundles you can have in a namespace; based on the warning, the pulsareval namespace has already reached that limit at 512 bundles.
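Purely as an illustration (the numbers below are placeholders, not recommendations, and the tenant/namespace name is hypothetical), the two knobs look like this:

# broker.conf: raise the per-namespace cap above the 512 bundles the namespace already has
loadBalancerNamespaceMaximumBundles=1024

# for new namespaces, you can also start with more bundles up front
bin/pulsar-admin namespaces create my-tenant/my-namespace --bundles 256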

Related

Azure.Messaging.EventHubs EventProcessorClient - Writing checkpoint on shutdown/restart

We are using the EventProcessorClient class to read messages from all partitions of an Event Hub.
To enhance the checkpoint writing, we hold a counter and perform the checkpointing after processing multiple messages in ProcessEventAsync instead of after each one.
When the service restarts or gets deployed, we noticed that we re-read previously processed messages because the checkpoint was not updated right before stopping.
This blog mentions the pattern of frequent checkpointing and suggests performing one last checkpoint inside the OnPartitionProcessingStopped when implementing EventProcessor<TPartition>.
However, when using EventProcessorClient, the implementation gets wrapped and only the PartitionClosingAsync event gets called, which does not have the required details to update the checkpoint.
Is there a way to update the checkpoint to the latest one when shutting down while using EventProcessorClient?
There are a couple of different ways to do this, with varying degrees of complexity. However, in most scenarios, it's unlikely that checkpointing on shutdown is going to have the effect that you would like it to. For the interested, context is way down below in "Load balancing details".
How-to: A simple approach
The most straightforward approach would be to hold on to the ProcessEventArgs for the last event processed on each partition in a class-level member, and use that to checkpoint when processing stops.
With this, you're still very likely to run into the dueling owners scenario described in the details.
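This thread is about the .NET Azure.Messaging.EventHubs client; purely to illustrate the shape of the approach, here is a rough sketch using the Java azure-messaging-eventhubs EventProcessorClient (connection details and the checkpoint store are placeholders, and the same caveat applies):

import com.azure.messaging.eventhubs.CheckpointStore;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.models.EventContext;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CheckpointOnCloseSketch {
    // Last EventContext seen per partition id.
    private final Map<String, EventContext> lastEventPerPartition = new ConcurrentHashMap<>();

    public EventProcessorClient build(CheckpointStore checkpointStore) {
        return new EventProcessorClientBuilder()
            .connectionString("<connection-string>", "<event-hub-name>") // placeholders
            .consumerGroup(EventHubClientBuilder.DEFAULT_CONSUMER_GROUP_NAME)
            .checkpointStore(checkpointStore) // whatever store you already use, e.g. blob-based
            .processEvent(eventContext -> {
                // Remember the most recent event for this partition.
                lastEventPerPartition.put(
                    eventContext.getPartitionContext().getPartitionId(), eventContext);
                // ... process the event, checkpointing every N events as you do today ...
            })
            .processPartitionClose(closeContext -> {
                EventContext last = lastEventPerPartition.remove(
                    closeContext.getPartitionContext().getPartitionId());
                if (last != null) {
                    last.updateCheckpoint(); // best-effort final checkpoint for this partition
                }
            })
            .processError(errorContext -> { /* log and continue */ })
            .buildEventProcessorClient();
    }
}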
How-to: More complex, less memory, same challenges
If you extend the EventProcessorClient, you could hold onto just the last processed offset for each partition and use it to call the protected UpdateCheckpointAsync method on the processor. This saves you some space, as you're holding onto just a long rather than the state associated with the event args.
With this, you're still very likely to run into the dueling owners scenario described in the details.
How-to: A lot more complexity, more efficient, and without overlaps
For this one to make sense, you'd have to have a stable number of processors - no dynamic scaling.
Checkpointing takes the same approach as above - you'll extend the processor client and use the offset to call UpdateCheckpointAsync.
To avoid ownership changes and rewinds, you'll assign the processor a set of static partitions and bypass load balancing. The approach is described in the Static partition assignment sample, which can be applied when extending the EventProcessorClient.
With this, you do not have the potential for overlapping owners. Since the partitions are statically assigned, you know this node will be the only one to process them.
The trade-off here is that you lose load balancing and dynamic assignment. If the node crashes or loses network connectivity, nobody will be processing the set of partitions that it owns. This generally works well in host environments where there's an orchestrator monitoring nodes and ensuring they're healthy.
Load balancing details
When partitions change owners, it is not an ordered hand-off; processors do not coordinate to ensure that the old owner has stopped before the new owner begins reading. As a result, there is often a period of overlap where the old owner has a set of events in memory and won't be aware that a new owner has taken over until it tries to read the next batch.
During this time, both the old and new owners are processing events from the same partition. Both may be emitting checkpoints. If the old owner emits a checkpoint when the processing stops for that partition, there's a good chance it will rewind the checkpoint position to an earlier point than the current owner has written. That causes a bigger rewind if ownership changes again before the new owner emits a checkpoint.
We generally recommend that you expect to rewind by one checkpoint any time you're scaling the number of processors or deploying/rebooting nodes. This is something to take into account when devising your checkpoint strategy.

Azure Service Bus: Ordered Processing of Session Sequences

Are there any recommended architectural patterns with Service Bus for ensuring ordered processing of nested groups of messages which are sent out of order? We are using Sessions, but when it comes down to ensuring that a set of Sessions must be processed sequentially in a certain order before moving onto another set of Sessions, the architecture becomes cumbersome very quickly. This question might best be illustrated with an example.
We are using Service Bus to integrate changes in real-time from a database to a third-party API. Every N minutes, we get notified of a new 'batch' of changes from the database which consists of individual records of data across different entities. We then transform/map each record and send it along to an API. For example, a 'batch' of changes might include 5 new/changed 'Person' records, 3 new/changed 'Membership' records, etc.
At the outer-most level, we must always process one entire batch before we can move on to another batch of data, but we also have a requirement to process each type of entity in a certain order. For example, all 'Person' changes must be processed for a given batch before we can move on to any other objects.
There is no guarantee that these records will be queued up in any order which is relevant to how they will need to be processed, particularly within a 'batch' of changes (e.g. the data from different entity types will be interleaved).
We actually do not necessarily need to send the individual records of entity data in any order to the API (e.g. it does not matter in which order I send those 5 Person records for that batch, as long as they are all sent before the 3 Membership records for that batch). However, we do group the messages into Sessions by entity type so that we can guarantee homogeneous records in a given session and target all records for that entity type (this also helps us support a separate requirement we have when calling the API to send a batch of records when possible instead of an individual call per record to avoid API rate limiting issues). Currently, our actual Topic Subscription containing the record data is broken up into Sessions which are unique to the entity type and the batch.
"SessionId": "Batch1234\Person"
We are finding that it is cumbersome to manage the requirement that all changes for a given batch must be processed before we move on to the next batch, because there is no Session which reliably groups those "groups of entities" together (let alone processing those groups of entities themselves in a certain order). There is, of course, no concept of a 'session of sessions', and we are currently handling this by having a separate 'Sync' queue whose messages represent an entire batch of changes to be processed and list which sessions of data are contained in that batch:
"SessionId": "Batch1234",
"Body":
{
"targets": ["Batch1234\Person", "Batch1234\Membership", ...]
}
This is quite cumbersome, because something (e.g. a Durable Azure Function) now has to orchestrate the entire process by watching the Sync queue and then spinning off separate processors that it oversees to ensure correct ordering at each level (which makes concurrency management and scalability much more complicated to deal with). If this is indeed a good pattern, then I do not mind implementing the extra orchestration architecture to ensure a robust, scalable implementation. However, I cannot help feeling that I am missing something or not thinking about the architecture the right way.
Is anyone aware of any other recommended pattern(s) in Service Bus for handling ordered processing of groups of data which themselves contain groups of data which must be processed in a certain order?
For the record, I'm not a Service Bus expert, specifically.
The entire batch construct sounds painful - can you do away with it? Often if you have a painful input, you'll have a painful solution - the old "crap in, crap out" maxim. Sometimes it's just hard to find an elegant solution.
Do the 'sets of sessions' need to be processed in a specific order?
Is a 'batch' of changes = a session?
I can't think of a specific pattern, but a "divide and conquer" approach seems reasonable (which is roughly what you have already?):
Watch for new batches; when one occurs, hand it off to a BatchProcessor.
BatchProcessor applies all the rules to the batch, as you outlined.
Consider having the BatchProcessor dump its results onto a queue of some kind which is the source for the API - that way you have some kind of isolation between the batch processing and the API (a rough sketch of the hand-off follows below).
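As an illustration only - this assumes the Azure Service Bus SDK for Java and the Sync-queue message format from the question, and the class and variable names are made up - the BatchProcessor could drain each entity session of a batch in the required order before touching the next one:

import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverClient;
import com.azure.messaging.servicebus.ServiceBusSessionReceiverClient;
import java.time.Duration;
import java.util.List;

public class BatchProcessor {
    // orderedSessionIds comes from the Sync message for the batch,
    // e.g. ["Batch1234\Person", "Batch1234\Membership", ...]
    public void processBatch(ServiceBusSessionReceiverClient sessionClient,
                             List<String> orderedSessionIds) {
        for (String sessionId : orderedSessionIds) {
            // Lock the next session; later sessions are not read until this one is done.
            ServiceBusReceiverClient receiver = sessionClient.acceptSession(sessionId);
            try {
                // Read up to one page of messages (a real drain loop would repeat until empty).
                for (ServiceBusReceivedMessage message
                        : receiver.receiveMessages(100, Duration.ofSeconds(5))) {
                    // ... transform and forward to the API, or to an outbound queue ...
                    receiver.complete(message);
                }
            } finally {
                receiver.close();
            }
        }
    }
}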

Hazelcast Near Cache is not controlled by Split Brain Protection (Quorum)

I use embedded Hazelcast 4.0.1 in a Spring Boot project to manage the project's caches. I set up Near Cache, and also set up split-brain protection, which was called Quorum before 4.0.
However, I found a problem. For example, I put a cache operation on a service method:
@Cacheable(value = "CacheSpaceName", key = "#id")
public String findById(String id) {
...
}
If the correct data has been cached in Near Cache, even if the split-brain protection is in effect, the service will still return the correct result instead of being rejected by the split-brain protection.
How can I make Near Cache also be controlled by Split Brain Protection? I hope that when split brain occurs, small clusters cannot operate normally, and only large clusters can operate normally.
The following is the near cache configuration and split-brain protection configuration code in the project:
final NearCacheConfig nearCacheConfig = new NearCacheConfig()
    .setInMemoryFormat(InMemoryFormat.OBJECT)
    .setCacheLocalEntries(true)
    .setMaxIdleSeconds(xxx);
MapConfig allMapConfig = new MapConfig().setName("*").setNearCacheConfig(nearCacheConfig)
    .setBackupCount(0).setMaxIdleSeconds(xxx).setInMemoryFormat(InMemoryFormat.OBJECT)
    .setMergePolicyConfig(xxx);
final SplitBrainProtectionConfig splitBrainProtectionConfig = new SplitBrainProtectionConfig("name", true, 2);
splitBrainProtectionConfig.setProtectOn(SplitBrainProtectionOn.READ_WRITE);
allMapConfig.setSplitBrainProtectionName("name");
config.addSplitBrainProtectionConfig(splitBrainProtectionConfig);
config.addMapConfig(allMapConfig);
Near Cache is not covered by split-brain protection: Near Cache is application-side caching, while Hazelcast's built-in split-brain protection is meant to protect cluster members (servers).
Additionally, to your point:
I hope that when split brain occurs, small clusters cannot operate
normally, and only large clusters can operate normally
In a split-brain situation, both sides, regardless of their sizes, keep working. Cluster size becomes relevant when the network partitioning is resolved and the two sides are ready to merge. Hazelcast deploys a background task that periodically searches for split clusters. When a split is detected, the side that will initiate the merge process is decided. This decision is based on cluster size; the smaller cluster, by member count, merges into the bigger one. If they have an equal number of members, then a hashing algorithm determines the merging cluster. When deciding the merging side, both sides ensure that there's no intersection in their member lists.
The reason I raised the issue of Near Cache not being controlled by split-brain protection is that I want the smaller cluster to stop providing any service when a split brain occurs, while the larger cluster keeps serving normally. The requirement exists because some of our business functions rely on Hazelcast's cache synchronization: we need the cache to be updated on demand at certain times so that we never serve outdated data. During a split brain, the cache update cannot reach the complete cluster, so if the smaller cluster keeps serving at that point, it is likely to serve wrong data.
Hazelcast's split-brain protection does have a similar "minimum number of members" setting: when Hazelcast detects that the current number of cluster members is below this value, certain cache operations are rejected. But because that only restricts operations against the clustered map, and I found that the Near Cache could not be controlled by it, I asked this question. I later realized that Hazelcast's split-brain protection may not meet my needs at all.
I have since found another way to meet my needs: a filter that verifies the minimum number of visible cluster members before serving a request. Hazelcast's split-brain protection is no longer needed for this (it is still useful for split-brain recovery, so the merge policy is also configured normally).
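For what it's worth, a minimal sketch of that kind of guard (the threshold, class name, and where you call it from are assumptions, not part of Hazelcast's split-brain protection):

import com.hazelcast.core.HazelcastInstance;

public class ClusterSizeGuard {
    private static final int MINIMUM_MEMBERS = 2; // assumed threshold for this example

    private final HazelcastInstance hazelcast;

    public ClusterSizeGuard(HazelcastInstance hazelcast) {
        this.hazelcast = hazelcast;
    }

    // Call this from a filter/interceptor before serving a request: if this member
    // can only see a minority of the cluster, refuse to serve, so potentially stale
    // Near Cache entries on the small side are never returned.
    public void assertClusterIsLargeEnough() {
        int visibleMembers = hazelcast.getCluster().getMembers().size();
        if (visibleMembers < MINIMUM_MEMBERS) {
            throw new IllegalStateException(
                    "Refusing request: only " + visibleMembers + " cluster member(s) visible");
        }
    }
}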

Custom controlled partitioning

I recently posted a question and received a full answer. But I am encountering another problem.
Case scenario is the same as in my recent question.
How can I configure member to own partition key?
e.g. DataCenterOnRussia partition key must always be owned by member1 and DataCenterOnGermany partition key must always be owned by member2.
So member2 could request data from DataCenterOnRussia using PartitionAwareKey.
The intent of the PartitionAwareKey is to allow for data affinity ... orders for a customer should be stored in the same partition as the customer record, for example, since they are frequently accessed together.
The PartitionAwareKey allows grouping items together, but it does not give you a way to specify the placement of those items on a specific cluster member. (I guess if there were such a thing, it would likely be called MemberAwareKey.)
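To make the affinity idea concrete, a minimal sketch (the class and field names are made up, and the PartitionAware package location differs between Hazelcast versions):

import com.hazelcast.partition.PartitionAware; // com.hazelcast.core.PartitionAware in older releases
import java.io.Serializable;

// Orders keyed like this land in the same partition as the customer entry keyed
// by customerId, so they stay together on whichever member currently owns that
// partition - but you still don't get to choose which member that is.
public class OrderKey implements PartitionAware<String>, Serializable {
    private final String orderId;
    private final String customerId;

    public OrderKey(String orderId, String customerId) {
        this.orderId = orderId;
        this.customerId = customerId;
    }

    @Override
    public String getPartitionKey() {
        return customerId; // partition by customer, not by order
    }

    // equals() and hashCode() based on orderId omitted for brevity
}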
A cluster in Hazelcast isn't a fixed-size entity; it is dynamically scalable, so members might be added or removed, and it is fault-tolerant, so a member could be lost without loss of the data that happened to be on that member. In order to support those features, the cluster must have the freedom to move partitions around to different machines as the cluster topology changes.
Hazelcast recommends that all members of a cluster be similarly configured (equivalent memory configuration, most particularly) because of the idea that cluster members are interchangeable, at least as far as data storage. (The MemberSelector facility does provide a provision for handling systems that have different processing capability, e.g., number of processor cores; but nothing similar exists to allow placement of specific data entries or partitions on a designated member; see the sketch below.)
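As a side note on that MemberSelector provision, here is a small sketch of routing work (not data) to particular members; the "dc" member attribute, the executor name, and the 4.x-style packages are assumptions:

import com.hazelcast.cluster.MemberSelector; // com.hazelcast.core.MemberSelector in 3.x
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

public class DataCenterRouting {
    // Runs a (serializable) task only on members tagged with the given "dc" attribute.
    // This steers computation toward certain members; it does not pin data to them.
    public void runInDataCenter(HazelcastInstance hz, String dc, Runnable task) {
        IExecutorService executor = hz.getExecutorService("default");
        MemberSelector inDc = member -> dc.equals(member.getAttribute("dc"));
        executor.executeOnMembers(task, inDc);
    }
}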
If your use case requires specific placement on machines, it's an indication that those machines probably should not be part of the same cluster.

Are GridCacheQueue elements also GridCacheElements?

I'm in the process of evaluating GridGain and have read and re-read all the documentation I could find. While much of it is very thorough, you can tell that it's mostly written by the developers. It would be great if there were a reference book written from an outsider's perspective.
Anyway, I have five basic questions I'm hoping someone from GridGain can answer and clarify for me.
It's my understanding that GridCacheQueue (and the other Distributed Data Structures) are built on top of the GridCache implementation. Does that mean that each element of the GridCacheQueue is really just a GridCacheElement of the GridCache map, or is each GridCacheQueue a GridCacheElement, or do I have this totally wrong?
If I set a default TTL on the GridCache, will the elements of a GridCacheQueue expire in the TTL time, or does it only apply to GridCacheElements (which might be answered in #1 above)?
Is there a way to make a GridCacheQueue expire after some period of time without having to remove it manually?
If a cache is set-up to be backed-up onto other nodes and the cache is using off-heap memory and/or swap storage, is the off-heap memory and/or swap storage also replicated onto the back-up nodes?
Is it possible to create a new cache dynamically, or can it only be created via configuration when the node is created?
Thanks for any insightful information!
-Colin
After experimenting with a GridCache and a GridCacheQueue, here's what I've learned about my 5 questions:
I don't know how the GridCacheQueue or its elements are attached to a GridCache, but I know that the elements of a GridCacheQueue DO NOT show up as GridCacheElements of the GridCache.
If you set a TTL on a GridCache and add a GridCacheQueue to it, once the elements of the GridCache begin expiring, the GridCacheQueue becomes unusable and will cause a GridRuntimeException to be thrown.
Yes, see #2 above. However, there doesn't seem to be a safe way to test if the queue is still in existence once the elements of the GridCache start to expire.
Still have no information about this yet. Would REALLY like some feedback on that.
That was a question I never should have asked. A GridCache can be created entirely in code and configured.
Let me first of all say that GridGain supports several queue configuration parameters:
Colocated vs. non-colocated. In colocated mode you can have many queues. Each queue will be assigned to some grid node and all the data in that queue will be cached on that grid node. This way, if you have many queues, each queue may be cached on a different node, but the queues themselves should be evenly distributed across all nodes. Non-colocated mode, on the other hand, is meant for larger queues, where data for the same queue is partitioned across multiple nodes.
Capacity - this parameter defines the maximum queue capacity. When a queue reaches this capacity, it will automatically start evicting the oldest elements.
Now, let me try to tackle some of these questions.
I believe each element of a GridCacheQueue is a separate element in the cache, but the implementation marks them as internal elements. That is why you don't see these elements when iterating through the cache.
TTL should not be used with elements in the queue (GridGain will be adding this feature soon). For now, you should limit the maximum size of the queue by specifying queue 'capacity' at creation time.
I don't believe so, but I think this feature is being added. For now, you can try using org.gridgain.grid.schedule.GridScheduler to schedule a job that will delete a queue later.
The answer is YES. Both off-heap and swap-space data are backed up and replicated the same way as the main on-heap cache data.
A cache should be created in configuration, either from code or XML. However, GridGain has a cool notion of GridCacheProjection which allows you to create various sub-caches (cache views) on the same cache. For example, if you store Person and Organization classes in the same cache, then you can use a cache projection of type Person when working with the Person class, and a cache projection of type Organization when working with the Organization class.
