Controlled partitioning and MapStore - hazelcast

Let's say I have several Hazelcast members (servers) spread across the world (e.g. Germany, Russia, etc.).
The requirement is to store/split the data in the database by region, while all data should remain accessible from any server via an IMap backed by a MapStore.
I recently read this article, which seems to fulfill my requirement, but I am not sure how the MapStore will behave.
The crucial point is: if member1 (e.g. Russia) requests data from the IMap using a key owned by member2 (e.g. Germany), on which side will MapStore.load() be called?

You should not split members of the same cluster across different data centers. Members of a cluster depend on a regular heartbeat message to detect the health of the cluster; wide area networks cannot reliably deliver these in a consistent fashion and you will almost certainly have network partition issues (split brain syndrome).
Each data center (Germany, Russia, etc.) should have a separate cluster with region-specific maps. These maps can then be replicated (WAN replication) to the remote data centers both for disaster recovery and to provide a geographically close server to support users in that region needing access to the other region's data.
Since the data in the database is already split by region, matching this split on the Hazelcast side means that the MapLoader will always be loading from a database in the same location.
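As a rough sketch of that setup (assuming Hazelcast 4.x/5.x; the cluster name, map name and MapStore class below are made up), a member of the German cluster could wire a region-specific MapStore to its map like this, so that every load()/store() stays against the local German database:

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class GermanyMember {
    public static void main(String[] args) {
        Config config = new Config();
        config.setClusterName("cluster-de"); // one cluster per data center

        MapStoreConfig mapStoreConfig = new MapStoreConfig()
                .setEnabled(true)
                // Hypothetical MapStore implementation that talks only to the German regional database
                .setClassName("com.example.GermanyCustomerMapStore")
                .setWriteDelaySeconds(0); // write-through

        MapConfig mapConfig = new MapConfig("customers-de")
                .setMapStoreConfig(mapStoreConfig);
        config.addMapConfig(mapConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        // WAN replication of "customers-de" to the other clusters (an Enterprise feature)
        // is configured separately and gives each region a geographically close copy.
    }
}

The Russian cluster would run the same setup with its own cluster name, map and MapStore.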

Related

Is there a better way to read locally and write globally? (Design Distributed Systems)

Scenario
I need to design a system that can read from a local database and perform writes to all of its replicas (currently using Azure Tables).
I would appreciate it if someone could share their approach to achieving this.
Existing design
Region 1: a node running the "computeScore" service, plus a database with all customer data.
Region 2: a node running the "computeScore" service.
The computeScore service is called on every successful user login. It reads that user's previous login information from the database in Region 1, computes a score, and writes the score back to the database in Region 1.
Issue
Whenever a customer request is routed to Region 2 (requests are routed to the physical server nearest the user's current location), it makes an extra call to the Region 1 database to perform the database operations, which obviously adds latency compared to customers hitting Region 1.
One option is to maintain a copy of the database in both regions, but the challenge then is consistency (writing back the computed score) and how to achieve it.
Looking for a solution to avoid the extra latency for customers directed to Region 2.

How do ArangoDB Graph Traversal Queries Execute in a Cluster?

The description of SmartGraphs here seems to imply that graph traversal queries actually follow edges from machine to machine until the query finishes executing. Is that how it actually works? For example, suppose you have the following query, which retrieves the 1-hop, 2-hop, and 3-hop friends of the person with id 12345:
FOR p IN Person
  FILTER p._key == "12345"
  FOR friend IN 1..3 OUTBOUND p knows
    RETURN friend
Can someone please walk me through the lifetime of this query starting from the client and ending with the results on the client?
What actually happens can be a bit different from the diagrams on our website. What we show there is a kind of "worst case" where the data cannot be sharded perfectly (just to make it a bit more fun). But let's take a quick step back first and describe the different roles within an ArangoDB cluster. If you are already familiar with our cluster lingo/architecture, please skip the next paragraph.
You have the Coordinator which, as the name says, coordinates the query execution and is also the place where the final result set gets built up before being sent back to the client. Coordinators are stateless, host a query engine, and are the place where Foxx services live. The actual data is stored on the DBservers in a stateful fashion, but DBservers also have a distributed query engine which plays a vital role in all our distributed query processing. The brain of the cluster is the Agency, with at least three Agents running the RAFT consensus protocol.
When your graph data set is sharded as a SmartGraph, the following happens when a query is sent to a Coordinator.
- The Coordinator knows which data needed for the query resides on which machines and distributes the query accordingly to the respective DBservers.
- Each DBserver has its own query engine and processes the incoming query from the Coordinator locally, then sends the intermediate result back to the Coordinator, where the final result set gets put together. This happens in parallel across the DBservers.
- The Coordinator then sends the result back to the client.
If you have a perfectly shardable graph (e.g. a hierarchy whose branches become the shards; use cases include Bill of Materials or network analytics), then you can achieve performance close to a single instance, because each query can be sent to the right DBservers and no network hops are required.
If you have a much more "unstructured" graph, like a social network where connections can occur between any two given vertices, sharding becomes an optimization question and, depending on the query, network hops between servers are more likely. This latter case is what the diagrams on our website show. Here the SmartGraph feature can reduce the required network hops to a minimum, but cannot eliminate them completely.
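For illustration, a rough client-side sketch using the ArangoDB Java driver might look like the following (the Coordinator host and database name are placeholders); note that the client only ever talks to a Coordinator and never sees the individual shards:

import com.arangodb.ArangoCursor;
import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseDocument;

public class TraversalClient {
    public static void main(String[] args) {
        // The client connects to a Coordinator endpoint (placeholder host/port).
        ArangoDB arango = new ArangoDB.Builder()
                .host("coordinator.example.com", 8529)
                .build();
        ArangoDatabase db = arango.db("social"); // placeholder database name

        String aql = "FOR p IN Person FILTER p._key == '12345' "
                   + "FOR friend IN 1..3 OUTBOUND p knows RETURN friend";

        // The Coordinator plans the query, pushes the relevant pieces to the DBservers
        // owning the SmartGraph shards, merges their intermediate results and streams
        // the final result set back to the client through this cursor.
        ArangoCursor<BaseDocument> cursor = db.query(aql, BaseDocument.class);
        cursor.forEachRemaining(friend -> System.out.println(friend.getKey()));

        arango.shutdown();
    }
}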
Hope this helped a bit.

Why is the partitioning strategy for a service fabric service tied to the partition instead of to the service?

I am just getting started writing some dynamic endpoint discovery for my Service Fabric application and was looking for examples of how to resolve service endpoints. I found the following code example on Stack Overflow:
https://stackoverflow.com/a/38562986/4787510
I made some minor variations to it, so here is my code:
private readonly FabricClient m_fabricClient;

public async Task RefreshEndpointList()
{
    var appList = await m_fabricClient.QueryManager.GetApplicationListAsync();
    var app = appList.Single(x => x.ApplicationName.ToString().Contains("<MyFabricDeploymentName>"));

    // Go through all running services
    foreach (var service in await m_fabricClient.QueryManager.GetServiceListAsync(app.ApplicationName))
    {
        var partitions = await m_fabricClient.QueryManager.GetPartitionListAsync(service.ServiceName);

        // Go through all partitions
        foreach (var partition in partitions)
        {
            // Check what kind of service we have - depending on that the resolver figures out the endpoints.
            // E.g. Singleton is easy as it is just one endpoint, otherwise we need some load balancing later on
            ServicePartitionKey key;
            switch (partition.PartitionInformation.Kind)
            {
                case ServicePartitionKind.Singleton:
                    key = ServicePartitionKey.Singleton;
                    break;
                case ServicePartitionKind.Int64Range:
                    var longKey = (Int64RangePartitionInformation)partition.PartitionInformation;
                    key = new ServicePartitionKey(longKey.LowKey);
                    break;
                case ServicePartitionKind.Named:
                    var namedKey = (NamedPartitionInformation)partition.PartitionInformation;
                    key = new ServicePartitionKey(namedKey.Name);
                    break;
                default:
                    throw new ArgumentOutOfRangeException($"Can't resolve partition kind for partition with id {partition.PartitionInformation.Id}");
            }

            var resolvedServicePartition = await ServicePartitionResolver.GetDefault().ResolveAsync(service.ServiceName, key, CancellationToken.None);
            m_endpointCache.PutItem(service.ServiceTypeName, new ServiceDetail(service.ServiceTypeName, service.ServiceKind, ServicePartitionKind.Int64Range, resolvedServicePartition.Endpoints));
        }
    }
}
I'm quite happy I found this snippet, but while working through it I came across one thing that confuses me a bit.
So, after reading through the SF docs, this seems to be the architecture it follows from top to bottom, as far as I understand it:
Service Fabric Cluster -> Service Fabric application (e.g. myApp_Fabric) -> Services (e.g. frontend service, profile picture microservice, backend services)
From the services we can drill down to partitions, where a partition is basically a "container" on a node in my cluster in which multiple instances (replicas) can reside, an instance being an actual deployment of the service.
I'm not quite sure if I got the node / partition / replica difference right, though.
However, back to my confusion and actual question:
Why is the information about the partition strategy (singleton, Int64 range, named) attached to the partition information rather than to the service itself? As far as I understand it, a partition is basically the product of how I configured my service to be distributed across the Service Fabric nodes.
So why is the partition strategy not tied directly to the service?
Regarding the services in Service Fabric, there are two types: stateful services and stateless services.
Stateless services do not deal with state using the reliable collections. If they need to maintain state, they have to rely on external persistence solutions such as databases. Since they do not deal with state provided by reliable collections, they are assigned the Singleton partition type.
Stateful services have the ability to store state in reliable collections. In order to scale those services, the data in those collections should be divided over partitions. Each service instance is assigned a specific partition. The number of partitions is specified per service, as in the example below:
<Service Name="Processing">
  <StatefulService ServiceTypeName="ProcessingType" TargetReplicaSetSize="3" MinReplicaSetSize="3">
    <UniformInt64Partition PartitionCount="26" LowKey="0" HighKey="25" />
  </StatefulService>
</Service>
So, given the example above, I do not understand your last remark about the partition strategy not being directly tied to a service.
Given the example above, there will be 26 instances of that service running, one for each partition, multiplied by the number of replicas (26 * 3 = 78 replicas in total).
In the case of a stateless service, there is just one partition (the singleton partition), so the number of actual instances is 1 * 3 (the replica count) = 3. (3 replicas is just an example; most of the time the instance count of a stateless service is set to -1, meaning one instance on every node in the cluster.)
One other thing: in your code there is a comment line in the part that iterates over the partitions:
// E.g. Singleton is easy as it is just one endpoint, otherwise we need some load balancing later on
This comment is wrong in stating that partitioning has to do with load balancing. It does not; it has to do with how data is partitioned over the service instances, and you need to get the address of the instance that deals with a specific partition. Say I have a service with 26 partitions and I want data that is stored in, let's say, the 5th partition. I then need to get the endpoint of the instance that serves that partition.
You probably already read the docs. If not, I suggest reading them as well.
Addressing your comments:
I was just wondering, is it not possible that multiple services run on the same partition?
Reliable collections are coupled to the service using them, and so are the underlying partitions. Hence, no more than one service can run on the same partition.
But service instances can. If a service has a replica set size of, let's say, 3, there will be 3 instances serving that partition. Only one of them is the primary instance, reading and writing the data, which gets replicated to the secondary instances.
Imagine your service as a pizza: when you order a pizza, you ask for a flavor (the type of service); you generally don't specify how you want it sliced (e.g. 8 pieces). The pizzeria usually handles that for you, and some pizzas come sliced into 4, 8 or more pieces depending on their size.
Creating an instance of a service is similar: you ask for a service, the service will hold your data, and you shouldn't have to care how that data is stored.
When you, as a consumer, need to understand the partitioning of your service, it is as if you called the pizzeria and asked them to cut the pizza into 4 slices instead of 8: you still get the same pizza, but now your concern is how many pieces it is sliced into. The main problem with service partitioning is that many application designs leak the partitioning to the client, so the client has to know how many partitions there are, or where they are placed, before consuming the service.
As a consumer you shouldn't have to care about service partitioning, but as a provider (the pizzeria) you should: let's say you order a big pizza and the pizzeria runs out of boxes (nodes) large enough to hold it; they can split the pizza across two smaller boxes. In the end the consumer receives the same pizza, just in separate boxes, and has to open both to find the slices.
With this analogy, we can see the comparison as:
Flavor = Service Type
Pizza = Service
Size and How is sliced = Partition Scheme
Slice = Partition
Box = Node
Number of Pizzas = Replicas
In Service Fabric, the reason these are decoupled is that the consumer can ask for a service while the provider decides how to partition it. In most cases the partitions are statically defined at application creation, but they can be dynamic: as seen with UniformInt64Partition, you can define how many partitions a specific service instance needs, and you can have multiple instances of the same service with different partition counts or different schemes without changing a line of code. How you expose these partitions to the client is an implementation detail.

Cassandra DB - Node is down and a request is made to fetch data in that Node

If we configure our replication factor such that there are no extra replicas (data is stored on one node only) and the node containing the requested data is down, how will Cassandra handle the request?
Will it return no data, or will the other nodes gossip, somehow pick up the data from the failed node's storage, and send the required response? If the data is picked up, will the transfer between nodes happen as soon as the node goes down (via the gossip protocol) or only after a request is made?
I have researched for a long time how gossip works and how Cassandra achieves high availability, but I am wondering about the availability of data in the "no replicas" case, since I do not want to spend additional storage on occasional failures, while at the same time I need availability and no data loss (even if delayed).
I assume that when you say there are "no replica nodes" you mean you have set the replication factor to 1. In this case, if the request is a read it will fail; if the request is a write, it will be stored as a hint, up to the maximum hint window, and replayed when the node comes back. If the node is down for longer than the hint window, that write will be lost. See Hinted Handoff: repair during the write path.
In general, having only a single replica of the data in your C* cluster goes against the basic design of how C* is meant to be used and is an anti-pattern. Data duplication is a normal and expected part of using C* and is what enables its high-availability properties. RF=1 introduces a single point of failure: the server holding the data can go down for any of a variety of reasons (including routine maintenance), which will cause requests to fail.
If you are truly looking for a solution that provides high availability and no data loss, then you need to increase your replication factor (the standard I usually see is RF=3) and set up your cluster's hardware in a way that reduces or removes potential single points of failure.
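As a minimal sketch of what that looks like from the client side (using the DataStax Java driver 3.x; the contact point, keyspace and data center names are placeholders), the replication factor is simply part of the keyspace definition:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class KeyspaceSetup {
    public static void main(String[] args) {
        // Placeholder contact point; any reachable node in the cluster will do.
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {

            // RF=3 per data center: any single node can be down (e.g. for maintenance)
            // and the data is still served from the two remaining replicas.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS app "
              + "WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}");
        }
    }
}

With RF=3 you can also read and write at QUORUM (2 of 3 replicas), which stays available while a single replica is down; with RF=1 that option simply does not exist.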

How to measure effectiveness of using Token Aware connection pool?

My team is testing the token-aware connection pool of Astyanax. How can we measure the effectiveness of this connection pool type, i.e. how can we know how the tokens are distributed in the ring and how client connections are distributed across them?
Our initial tests, counting the number of open connections on the network cards, show that only 3 out of the 4 or more Cassandra instances in the ring are used, and the other nodes participate in request processing only to a very limited extent.
What other information would help in making a valid judgment/verification? Is there a Cassandra/Astyanax API or are there command-line tools that could help us out?
Use OpsCenter. It will show you how balanced your cluster is, i.e. whether each node has the same amount of data, and it can graph the incoming read/write requests per node and for your entire cluster. It is free and works with open-source Cassandra as well as DSE. http://www.datastax.com/what-we-offer/products-services/datastax-opscenter
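On the Astyanax side, here is a rough sketch of how the token-aware pool is usually enabled, with a connection pool monitor attached so its counters can be compared with what you see on the network cards (assumes a recent Astyanax version over Thrift; cluster, keyspace and seed values are placeholders):

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class TokenAwareClient {
    public static void main(String[] args) {
        CountingConnectionPoolMonitor monitor = new CountingConnectionPoolMonitor();

        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")   // placeholder
                .forKeyspace("app")          // placeholder
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                        .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE))
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("TokenAwarePool")
                        .setPort(9160)
                        .setMaxConnsPerHost(10)
                        .setSeeds("10.0.0.1:9160,10.0.0.2:9160")) // placeholder seeds
                .withConnectionPoolMonitor(monitor)
                .buildKeyspace(ThriftFamilyFactory.getInstance());

        context.start();
        Keyspace keyspace = context.getClient();

        // ... run the test workload against 'keyspace' here ...

        // The monitor accumulates operation and connection counters for the pool;
        // log it after the run and correlate with per-node connection counts and nodetool status.
        System.out.println(monitor);

        context.shutdown();
    }
}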
