I recently posted a question and received a full answer, but now I am encountering another problem.
The scenario is the same as in my recent question.
How can I configure a member to own a particular partition key?
For example, the DataCenterOnRussia partition key must always be owned by member1, and the DataCenterOnGermany partition key must always be owned by member2.
That way member2 could request data from DataCenterOnRussia using a PartitionAwareKey.
The intent of the PartitionAwareKey is to allow for data affinity ... orders for a customer should be stored in the same partition as the customer record, for example, since they are frequently accessed together.
The PartitionAwareKey allows grouping items together, but it does not provide a way to specify the placement of those items on a specific cluster member. (I guess if there were such a thing, it would likely be called MemberAwareKey.)
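To make the affinity idea concrete, here is a minimal sketch, assuming Hazelcast 3.x package names and a hypothetical OrderKey/customerId pair, of how orders can be co-located with their customer record:

import com.hazelcast.core.PartitionAware;
import java.io.Serializable;

// Hypothetical key class: all orders for a customer hash to the customer's partition.
public class OrderKey implements PartitionAware<String>, Serializable {
    private final String orderId;
    private final String customerId;

    public OrderKey(String orderId, String customerId) {
        this.orderId = orderId;
        this.customerId = customerId;
    }

    // Hazelcast hashes this value (not the whole key) to pick the partition, so
    // new OrderKey("o-42", "c-7") lands in the same partition as the entry keyed by "c-7".
    @Override
    public String getPartitionKey() {
        return customerId;
    }

    // equals()/hashCode() omitted for brevity.
}

Note that this only controls which partition the entries share, not which member ends up hosting that partition.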
A cluster in Hazelcast isn't a fixed-size entity; it is dynamically scalable, so members might be added or removed, and it is fault-tolerant, so a member could be lost without loss of the data that happened to be on that member. In order to support those features, the cluster must have the freedom to move partitions around to different machines as the cluster topology changes.
Hazelcast recommends that all members of a cluster be configured similarly (equivalent memory configuration, most particularly) because cluster members are meant to be interchangeable, at least as far as data storage is concerned. (The MemberSelector facility does provide a way to handle systems that have different processing capability, e.g., number of processor cores; but nothing similar exists to allow placement of specific data entries or partitions on a designated member.)
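As a rough illustration of that facility (Hazelcast 3.x packages assumed; the "gpu" member attribute is made up), note that a MemberSelector targets task execution, not data placement:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.MemberSelector;
import java.io.Serializable;

public class GpuTaskSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Select only members that advertise a hypothetical "gpu" member attribute.
        MemberSelector gpuMembers = member -> "true".equals(member.getAttributes().get("gpu"));

        IExecutorService executor = hz.getExecutorService("tasks");
        // The task is shipped to the selected members, so it must be serializable.
        executor.executeOnMembers(
                (Runnable & Serializable) () -> System.out.println("running on a GPU-capable member"),
                gpuMembers);
    }
}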
If your use case requires specific placement on machines, it's an indication that those machines probably should not be part of the same cluster.
We are currently working on a project that implements and uses Azure Event Hubs.
We use the Event Processor Host to process the data from the event hub. We have 32 partitions distributed across 3 nodes and are wondering how the Event Processor Host distributes and balances the partitions across the receivers/nodes – especially when using a partition key.
We currently have 4 different customers (blue, orange, purple and light blue) who send us very different volumes of data. As you can see, the blue customer on the left sends approx. 132k strings of data, while the light blue customer on the right sends only 28.
Our theory was that, given a partition key based on the customer (the color identification), a customer's data would be placed on only one node.
Instead, we can see that the data is distributed fairly evenly across the 3 nodes, as seen below:
(Screenshots of the data distribution on Node 1, Node 2, and Node 3 omitted.)
Is there something we've misunderstood about how the partition key works? From what we've read in the documentation, when we don't specify partition keys, a "round-robin" approach is used – but even with a partition key, the data somehow ends up distributed evenly.
Are we somehow stressing the nodes – with the blue customer having a huge amount of data and another customer having almost nothing? Or what is going on?
To visualize our theory, we've drawn the following:
So are we stressing the top node with the blue customer, so that it eventually has to move a partition to the middle node?
A partition key is intended to be used when you want to be sure that a set of events is routed to the same partition, but you don't want to assign an explicit partition. In short, use of a partition key is an explicit request to control routing and prevents the service from balancing across partitions.
When you specify a partition key, it is used to produce a hash value that the Event Hubs service uses to assign the partition to which the event will be routed. Every event that uses the same partition key will be published to the same partition.
To allow the service to round-robin when publishing, you cannot specify a partition key or an explicit partition identifier.
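To make that concrete, here is a minimal sketch, assuming the azure-messaging-eventhubs Java client (the connection string, hub name, and the "customer-blue" key are placeholders), showing keyed versus unkeyed publishing:

import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;
import com.azure.messaging.eventhubs.models.CreateBatchOptions;

public class PartitionKeySketch {
    public static void main(String[] args) {
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString("<namespace-connection-string>", "<event-hub-name>")
                .buildProducerClient();

        // Keyed: every event published with partition key "customer-blue" hashes to the same partition.
        CreateBatchOptions keyed = new CreateBatchOptions().setPartitionKey("customer-blue");
        EventDataBatch keyedBatch = producer.createBatch(keyed);
        keyedBatch.tryAdd(new EventData("blue payload"));
        producer.send(keyedBatch);

        // Unkeyed (and no partition id): the service round-robins events across partitions.
        EventDataBatch roundRobinBatch = producer.createBatch();
        roundRobinBatch.tryAdd(new EventData("unkeyed payload"));
        producer.send(roundRobinBatch);

        producer.close();
    }
}

Note that the partition key only controls which partition an event lands in; the Event Processor Host balances partition leases across nodes independently of the keys.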
Jesse already explained what partition key is good for so I won't repeat that.
If you want customer-to-consumer-node affinity, you should consider dedicating an independent event hub to each customer so that you can tell your system something like:
node-1 processes data from customerA only by consuming events from eventhub-1
node-2 processes data from customerB only by consuming events from eventhub-2
and so on...
Using a partition key doesn't really address your business logic here.
One more thing: if you plan to run this with a larger number of customers in the future, you also need to consider scaling out your design to create affinity between customer and Event Hubs namespace as well.
I use Hazelcast to store data in memory, and sometimes one node loses its connection to the others, so the values in Hazelcast's map may differ between nodes.
I want to know what happens when the connection is re-established and the values in the map are different.
When network partitioning happens, your cluster might split into two, and as you say, the values might diverge as the two sides run in parallel. After the clusters join again, the split-brain recovery mechanism executes. Depending on the data structure you use, the data in both clusters is merged according to the configured merge policy. I suggest reading the Split-Brain Recovery section of the Hazelcast manual for more detail.
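For example, a minimal sketch (Hazelcast 3.10+ programmatic config assumed; the map name and the choice of LatestUpdateMergePolicy are only illustrative) of configuring the merge policy applied during split-brain recovery:

import com.hazelcast.config.Config;
import com.hazelcast.config.MergePolicyConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class MergePolicySketch {
    public static void main(String[] args) {
        Config config = new Config();

        // After the sub-clusters re-join, keep the entry with the most recent update time.
        MergePolicyConfig mergePolicyConfig = new MergePolicyConfig();
        mergePolicyConfig.setPolicy("com.hazelcast.spi.merge.LatestUpdateMergePolicy");
        config.getMapConfig("myMap").setMergePolicyConfig(mergePolicyConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}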
In a distributed, fault-tolerant system, conflict resolution is vital, as multiple copies of the same data are mirrored and any write request can go to any node concurrently.
I have used Riak and Cassandra before. Riak works based on vector clocks, and we can decide whether the system resolves conflicts automatically or the user has to handle them; generally, in the case of sensitive data, users don't want the system to decide which copy to keep and which to discard. The same goes for Cassandra, which is based on timestamps.
As far as Cosmos DB is concerned, we have various consistency levels ranging from Strong to Eventual. Depending on the choice of consistency, the system might generate siblings.
Q1. My first question is how sibling handling, and hence conflict resolution, happens. Is there a way to handle siblings programmatically on the user side, instead of the system deciding which one to keep and which one to discard?
Q2. My second question is: like the vector clock and timestamp in Riak and Cassandra, what is the mechanism in DocumentDB?
Cosmos DB employs a single master write for a partition key range. Irrespective of the consistency level, writes are guaranteed to be conflict-free. The only time a conflict is possible is during automatic failover of the write region. This scenario is explained in greater detail here: https://learn.microsoft.com/en-us/azure/cosmos-db/regional-failover.
In the case of write-region failover, any unreplicated writes will be registered as conflicts. Applications can perform a manual merge of these records. Here are the details of ReadConflictFeedAsync: https://msdn.microsoft.com/en-us/library/microsoft.azure.documents.client.documentclient.readconflictfeedasync.aspx
@Karthik is right for when Cosmos DB was single-master, but with the introduction of multi-master (where concurrent multi-region writes are possible), conflicts are inevitable.
Either let Azure handle conflicts (LWW, based on timestamp), which is the default if you don't specify any policy, or write your own custom logic to handle them based on your application's needs (Custom).
For LWW you don't have to do anything, but for Custom, once you specify the policy you can read the conflicts and resolve them based on your business logic. Please find below the Java snippets showing how to read and resolve conflicts.
Java Async API
FeedResponse<Conflict> response = client.readConflicts(this.manualCollectionUri, null).first().toBlocking().single();
for (Conflict conflict : response.getResults()) {
    /* Do something with conflict */
}
Java Sync API
Iterator<Conflict> conflictsIterator = client.readConflicts(this.collectionLink, null).getQueryIterator();
while (conflictsIterator.hasNext()) {
    Conflict conflict = conflictsIterator.next();
    /* Do something with conflict */
}
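For completeness, a hedged sketch (same com.microsoft.azure.cosmosdb async Java SDK as above; the collection id and the databaseLink variable are placeholders) of choosing the resolution policy when the collection is created:

DocumentCollection collection = new DocumentCollection();
collection.setId("manualCollection");

// Custom policy: the service records conflicting writes in the conflict feed so the
// application can resolve them (as in the snippets above). Use
// ConflictResolutionPolicy.createLastWriterWinsPolicy() for the default LWW behavior instead.
collection.setConflictResolutionPolicy(ConflictResolutionPolicy.createCustomPolicy());

client.createCollection(databaseLink, collection, null).toBlocking().single();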
Reference: Manage conflict resolution policies
(I could not find a good source explaining this, so if it is available elsewhere, you could just point me to it)
Hazelcast replicates data across all nodes in the cluster. So, if data is changed on one of the nodes, does that node update its own copy and then propagate it to the other nodes?
I read somewhere that each piece of data is owned by a node; how does Hazelcast determine the owner? Is the owner determined per data structure or per key within the data structure?
Does Hazelcast follow the "eventually consistent" principle? (While data is being propagated across the nodes, there could be a small window during which the data is inconsistent between nodes.)
How are conflicts handled? (Two nodes update the same key-value simultaneously)
Hazelcast does not replicate (with the exception of the ReplicatedMap, obviously ;-)) but partitions data. That means you have one node that owns a given key. All updates to that key go to the owner, which then propagates the update to its backups.
The owner is determined by consistent hashing using the following formula:
partitionId = hash(serialize(key)) % partitionCount
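You can observe this mapping at runtime; a small sketch (Hazelcast 3.x packages assumed; any serializable key will do):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Partition;

public class PartitionOwnerSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Look up which partition a key hashes to and which member currently owns it.
        Partition partition = hz.getPartitionService().getPartition("some-key");
        System.out.println("partitionId = " + partition.getPartitionId());
        System.out.println("owner       = " + partition.getOwner());

        hz.shutdown();
    }
}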
Since there is only one owner per key, it is not eventually consistent but consistent as soon as the mutating operation returns: all subsequent read operations will see the new value, under normal operational circumstances. When any kind of failure happens (network, host, ...), we choose availability over consistency, and it might happen that a not-yet-updated backup is reactivated (especially if you use async backups).
Conflicts can happen after a split-brain, when the split clusters re-merge. For this case you have to configure a MergePolicy (or use the default one) to define how conflicting elements are merged together, or which of the two wins.
Two somewhat related questions.
1) Is there any way to get an ID of the server a table entity lives on?
2) Will using a GUID give me the best partition key distribution possible? If not, what will?
We have been struggling for weeks with table storage performance. In short, it's really bad, but early on we realized that using a random-ish partition key will distribute the entities across many servers, which is exactly what we want, as we are trying to achieve 8,000 reads per second. Apparently our partition key wasn't random enough, so for testing purposes I decided to just use a GUID. First impression is that it is way faster.
Really bad get performance is < 1,000 per second. The partition key is Guid.NewGuid() and the row key is the constant "UserInfo". The get is executed using a TableOperation with pk and rk, nothing else, as follows: TableOperation retrieveOperation = TableOperation.Retrieve(pk, rk); return cloudTable.ExecuteAsync(retrieveOperation);. We always use indexed reads and never table scans. Also, the VM size is medium or large, never anything smaller. Parallel no, async yes.
As other users have pointed out, Azure Tables are strictly controlled by the runtime, so you cannot control or check which specific storage nodes are handling your requests. Furthermore, any given partition is served by a single server; that is, entities belonging to the same partition cannot be split across several storage nodes (see HERE):
In Windows Azure table, the PartitionKey property is used as the partition key. All entities with same PartitionKey value are clustered together and they are served from a single server node. This allows the user to control entity locality by setting the PartitionKey values, and perform Entity Group Transactions over entities in that same partition.
You mention that you are targeting 8,000 requests per second. If that is the case, you might be hitting a threshold that requires very good table/partition key design. Please see the article "Windows Azure Storage Abstractions and their Scalability Targets".
The following extract is applicable to your situation:
This will provide the following scalability targets for a single storage account created after June 7th 2012.
Capacity – Up to 200 TBs
Transactions – Up to 20,000 entities/messages/blobs per second
As other users pointed out, if your PartitionKey numbering follows an incremental pattern, the Azure runtime will recognize this and group some of your partitions within the same storage node.
Furthermore, if I understood your question correctly, you are currently assigning partition keys via GUIDs? If that is the case, every PartitionKey in your table will be unique, so every partition key will contain no more than one entity. As per the articles above, the way Azure Table storage scales out is by grouping entities by their partition keys across independent storage nodes. If your partition keys are unique and thus contain no more than one entity, Azure Table storage can only scale out one entity at a time!
Now, we know Azure is not that dumb; it groups partition keys when it detects a pattern in the way they are created. So if you are hitting this behavior and Azure is grouping your partition keys, your scalability is limited to the smartness of this grouping algorithm.
As per the scalability targets above for 2012, each partition key should be able to give you 2,000 transactions per second. Theoretically, you should need no more than 4 partition keys in this case (assuming the workload is distributed equally between the four).
I would suggest designing your partition keys to group entities in such a way that no more than 2,000 entities per second per partition are reached, and dropping GUIDs as partition keys. This will allow you to better support features such as Entity Group Transactions, reduce the complexity of your table design, and get the performance you are looking for.
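As an illustration of that suggestion, here is a hedged sketch (the bucket count and key format are arbitrary, not a prescription) of deriving a bounded set of partition keys instead of one GUID per entity:

public final class PartitionKeys {
    // Aim for enough buckets that no single partition exceeds ~2,000 operations per second.
    private static final int BUCKETS = 8;

    // Map a user id to one of a fixed set of partition keys, e.g. "user-003".
    public static String forUser(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), BUCKETS);
        return String.format("user-%03d", bucket);
    }
}

Entities that land in the same bucket share a partition key, so they can still take part in entity group transactions, and point reads remain a single partition key + row key lookup.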
Answering #1: There is no concept of a server that a particular table entity lives on. There are no particular servers to choose from, as Table Storage is a massive-scale multi-tenant storage system. So... there's no way to retrieve a server ID for a given table entity.
Answering #2: Choose a partition key that makes sense for your application. Just remember that it's partition + row to access a given entity. If you do that, you'll have a fast, direct read. If you attempt to do a table or partition scan, your performance will certainly take a hit.
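A hedged sketch of such a point read, using the classic Azure Storage SDK for Java (the table name, keys, and DynamicTableEntity are placeholders rather than the asker's actual types):

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.table.CloudTable;
import com.microsoft.azure.storage.table.DynamicTableEntity;
import com.microsoft.azure.storage.table.TableOperation;

public class PointReadSketch {
    public static void main(String[] args) throws Exception {
        CloudStorageAccount account = CloudStorageAccount.parse("<storage-connection-string>");
        CloudTable table = account.createCloudTableClient().getTableReference("users");

        // Partition key + row key addresses exactly one entity: a direct read, no scan.
        TableOperation retrieve = TableOperation.retrieve("<partition-key>", "UserInfo", DynamicTableEntity.class);
        DynamicTableEntity entity = table.execute(retrieve).getResultAsType();
        System.out.println(entity != null ? entity.getProperties() : "not found");
    }
}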
See http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx for more info on key selection (note that the numbers are 3 years old, but the guidance is still good).
Also this talk can be of some use as far as best practice : http://channel9.msdn.com/Events/TechEd/NorthAmerica/2013/WAD-B406#fbid=lCN9J5QiTDF.
In general, a given partition can support up to 2,000 transactions per second, so spreading data across partitions will help you achieve greater numbers. Something to consider is that atomic batch transactions only apply to entities that share the same partition key. Additionally, for smaller requests you may consider disabling Nagle's algorithm, as small requests may be getting held up at the client layer.
From the client side, I would recommend using the latest client library (2.1) and the async methods, as you have literally thousands of requests per second (the talk has a few slides on client best practices).
Lastly, the next release of storage will support JSON and JSON no metadata, which will dramatically reduce the size of the response body for the same objects, and consequently the CPU cycles needed to parse them. If you use the latest client libraries, your application will be able to leverage these behaviors with little to no code change.