I am planning to migrate my existing cloud monolithic Restful Web API service to Service fabric in three steps.
The Memory cache (in process) has been heavily used in my cloud service.
Step 1) Migrate cloud service to SF stateful service with 1 replica and single partition. The cache code is as it is. No use of Reliable collection.
Step 2) Horizontal scaling of SF Monolithic stateful service to 5 replica and single partition. Cache code is modified to use Reliable collection.
Step 3) Break down the SF monolithic service to micro services (stateless / stateful)
Is the above approach cleaner? Any recommendation.? Any drawback?
More on Step 2) Horizontal scaling of SF stateful service
I am not planning to use SF partitioning strategy as I could not think of uniform data distribuition in my applictaion.
By adding more replica and no partitioning with SF stateful service , I am just making my service more reliable (Availability) . Is my understanding correct?
I will modify the cache code to use Reliable collection - Dictionary. The same state data will be available in all replicas.
I understand that the GET can be executed on any replica , but update / write need to be executed on primary replica?
How can i scale my SF stateful service without partitioning ?
Can all of the replica including secondory listen to my client request and respond the same? GET shall be able to execute , How PUT & POST call works?
Should i prefer using external cache store (Redis) over Reliable collection at this step? Use Stateless service?
This document has a good overview of options for scaling a particular workload in Service Fabric and some examples of when you'd want to use each.
Option 2 (creating more service instances, dynamically or upfront) sounds like it would map to your workload pretty well. Whether you decide to use a custom stateful service as your cache or use an external store depends on a few things:
Whether you have the space in your main compute machines to store the cached data
Whether your service can get away with a simple cache or whether it needs more advanced features provided by other caching services
Whether your service needs the performance improvement of a cache in the same set of nodes as the web tier or whether it can afford to call out to a remote service in terms of latency
whether you can afford to pay for a caching service, or whether you want to make due with using the memory, compute, and local storage you're already paying for with the VMs.
whether you really want to take on building and running your own cache
To answer some of your other questions:
Yes, adding more replicas increases availability/reliability, not scale. In fact it can have a negative impact on performance (for writes) since changes have to be written to more replicas.
The state data isn't guaranteed to be the same in all replicas, just a majority of them. Some secondaries can even be ahead, which is why reading from secondaries is discouraged.
So to your next question, the recommendation is for all reads and writes to always be performed against the primary so that you're seeing consistent quorum committed data.
Related
We use a Service Fabric cluster to deploy stateless microservices. One of the microservices is designed as a singleton. This means it is designed to be deployed on a single node only.
But does this mean when we scale up or scale down the VM scale set (horizontal scaling) the service will be down? Or does the Service Fabric cluster take care of it?
There are two main concepts do keep in mind about services in service fabric, mainly but not limited to Stateful Services. Partitions and Replicas.
Partitions define the approach used to split the data into groups of data, they are defined as:
Ranged partitioning (otherwise known as UniformInt64Partition). Used to split the data by a range of integer values.
Named partitioning. Applications using this model usually have data that can be bucketed, within a bounded set. Some common examples of data fields used as named partition keys would be regions, postal codes, customer groups, or other business boundaries.
Singleton partitioning. Singleton partitions are typically used when the service does not require any additional routing. For example, stateless services use this partitioning scheme by default.
When you use Singleton for Stateful services, it assumes the data is managed as a single group, no actual data partition is used.
Replicas defined the number of copies a partition will have around the cluster, in order to prevent data-loss on a primary replica failure.
In summary,
If you use a Singleton partition, shouldn't be a problem if the number of replicas is at least 3.
That means, once one NODE gets updated, the replica hosted on that node will be moved to another node, if this replica being moved is a primary replica, it will be demoted to secondary, a secondary will be promoted to primary, and then the demoted replica will shutdown and replicated onto another node.
The third replica is needed in case a replica fails during an upgrade, then the third get promoted to primary.
We are planning to use Service fabric actor model for one of our user services. We have thousands of users and they have their own profile data. By far reading the materials, service fabric actor model maintains its states with their service fabric cluster. I couldn't get a clear picture in disaster recovery/planned shutdown scenarios/offline data access. In such cases, Is it needed to persist the data out side of these actor service?
What happens to the data, if we decided to shutdown all the service fabric cluster one day, and wanted to reactivate few days later?
In an SF cluster in Azure, the data is stored on the temp drive. There's no guarantee that a node that is shutdown retains the temp drive. So shutting down all nodes simultaneously will result in data loss.
To avoid this, you should regularly create backups of your (Actor) Services. For instance by using this Nuget package. Store the resulting files outside the cluster.
The cluster technology will help keep your data safe during failures of nodes, e.g. in a 5 node cluster, 4 remaining healthy nodes can take over the work of a failed node. Data is stored redundantly, so your services remain operational. The same functionality also allows for rolling upgrades of services/actors.
Here's an article about DR.
I had implemented a large enterprise application in service fabric using actor model for management of orders.
Few things that might help while choosing a strategy for data backup and restoration
As the package https://github.com/loekd/ServiceFabric.BackupRestore is not full fledged and you need to take care of some of the scenario.
for example: During deployment your actor partitions moved to other nodes and if you try to take incremental backups it will failed with FabricMissingFullBackupException because on that node after becoming primary you haven't took the Full backup and some one needs to manually fix the issue.
How we added the retry pattern to fix that issue is not in the scope of this question.
Incremental backups didn't restore always during restoration process.
Some time Incremental backup creation failed even if you set the logTrunctationIntervalInMinutes properly.
Some developer by mistake deleted the service or application you will loss all your data.
if your system heavily dependent on Reminder's which was in our case.
During restoration all the reminders gets reset.
Good Solution: Override the default KvsActorStateProvider with your own implementation which stores the data in DocumentDB, MongoDB, Cassandra or Azure SQL if you want to use the power BI for some analytics.
We developed a server service that (in a few words) supports the communications between two devices. We want to make advantage of the scalability given by an Azure Scale Set (multi instance VM) but we are not sure how to share memory between each instance.
Our service basically stores temporary data in the local virtual machine and these data are read, modified and sent to the devices connected to this server.
If these data are stored locally in one of the instances the other instances cannot access and do not have the same information. Is it correct?
If one of the devices start making some request to the server the instance that is going to process the request will not always be the same so the data at the end is spread between instances.
So the question might be, how to share memory between Azure instances?
Thanks
Depending on the type of data you want to share and how much latency matters, as well as ServiceFabric (low latency but you need to re-architect/re-build bits of your solution), you could look at a shared back end repository - Redis Cache is ideal as a distributed cache; SQL Azure if you want to use a relation db to store the data; storage queue/blob storage - or File storage in a storage account (this allows you just to write to a mounted network drive from both vm instances). DocumentDB is another option, which is suited to storing JSON data.
You could use Service Fabric and take advantage of Reliable Collections to have your state automagically replicated across all instances.
From https://azure.microsoft.com/en-us/documentation/articles/service-fabric-reliable-services-reliable-collections/:
The classes in the Microsoft.ServiceFabric.Data.Collections namespace provide a set of out-of-the-box collections that automatically make your state highly available. Developers need to program only to the Reliable Collection APIs and let Reliable Collections manage the replicated and local state.
The key difference between Reliable Collections and other high-availability technologies (such as Redis, Azure Table service, and Azure Queue service) is that the state is kept locally in the service instance while also being made highly available.
Reliable Collections can be thought of as the natural evolution of the System.Collections classes: a new set of collections that are designed for the cloud and multi-computer applications without increasing complexity for the developer. As such, Reliable Collections are:
Replicated: State changes are replicated for high availability.
Persisted: Data is persisted to disk for durability against large-scale outages (for example, a datacenter power outage).
Asynchronous: APIs are asynchronous to ensure that threads are not blocked when incurring IO.
Transactional: APIs utilize the abstraction of transactions so you can manage multiple Reliable Collections within a service easily.
Working with Reliable Collections -
https://azure.microsoft.com/en-us/documentation/articles/service-fabric-work-with-reliable-collections/
We are working on an application that processes excel files and spits off output. Availability is not a big requirement.
Can we turn the VM sets off during night and turn them on again in the morning? Will this kind of setup work with service fabric? If so, is there a way to schedule it?
Thank you all for replying. I've got a chance to talk to a Microsoft Azure rep and documented the conversation in here for community sake.
Response for initial question
A Service Fabric cluster must maintain a minimum number of Primary node types in order for the system services to maintain a quorum and ensure health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss, however this is not guaranteed and the cluster may never be able to recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than a half hour, and this can be automated by using Powershell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to setup the ARM template and deploy using Powershell. You can additionally use a fixed domain name or static IP address so that clients don’t have to be reconfigured to connect to the cluster. If you have need to maintain other resources such as the storage account then you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
Q)Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
Q) Can we do a primary set with cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a ‘Worker’ that is a larger size – and set placement constraints on your application to only deploy to those larger size VMs. However, if your Service Fabric service is storing state then you will still run into a similar problem that once you lose quorum (below 3 replicas/nodes) of your worker VM then there is no guarantee that your SF service itself will come back with all of the state maintained. In this case your cluster itself would still be fine since the Primary nodes are running, but your service’s state may be in an unknown replication state.
I think you have a few options:
Instead of storing state within Service Fabric’s reliable collections, instead store your state externally into something like Azure Storage or SQL Azure. You can optionally use something like Redis cache or Service Fabric’s reliable collections in order to maintain a faster read-cache, just make sure all writes are persisted to an external store. This way you can freely delete and recreate your cluster at any time you want.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster and recreate the cluster and deploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However you can scale down the number of VM's in the cluster.
During the day you would run the number of VM's required. At night you can scale down to the minimum of 5. Check this page on how to scale VM sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with these clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.
I have REST Service hosted as AzureWeb App & Another Cloud-Service WorkerRole, both need to share few common info like DB Connection string / Storage Connection string Etc.,
What is the right way to do this?
Since your question is rather broad I will try to answer in a similar way - A good practice in distributed application and micro service architectures is to have services query a single store for their configuration by so allowing your configuration to be consistent and easily changed.
In these cases you would probably want to set up some kind of database known to all services as they initialize. Depending on how complex your config data is, you can decide between several options on Azure:
Easy, quick store for simple key value pairs such as strings: consider Azure Table Storage
For more complex document like configurations (e.g. JSON): consider DocumentDB
In some rare cases where latency and throughput is a concern and you might even want to consider an in-memory store such as Azure Redis cache, though mostly for configuration data this is an overkill.
Note that all of the suggested services above are Azure managed services meaning you get availability, redundancy and robustness out of the box. This is important since the configuration store you use can be a single point of failure in your system.