Turning off Service Fabric clusters overnight - Azure

We are working on an application that processes Excel files and spits out output. Availability is not a big requirement.
Can we turn the VM scale sets off during the night and turn them on again in the morning? Will this kind of setup work with Service Fabric? If so, is there a way to schedule it?

Thank you all for replying. I got a chance to talk to a Microsoft Azure rep and have documented the conversation here for the community's sake.
Response to the initial question
A Service Fabric cluster must maintain a minimum number of nodes in the Primary node type in order for the system services to maintain quorum and ensure the health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. It is often possible to bring the nodes back up and have Service Fabric automatically recover from this quorum loss, but this is not guaranteed and the cluster may never recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than half an hour, and this can be automated by using PowerShell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to set up the ARM template and deploy using PowerShell. You can additionally use a fixed domain name or static IP address so that clients don't have to be reconfigured to connect to the cluster. If you need to maintain other resources such as the storage account, you could also configure the ARM template to delete only the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
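As a rough sketch of that daily delete/recreate flow, assuming the Az PowerShell module, an already-signed-in session, a ready-made ARM template named cluster.json, and placeholder resource group names:

```powershell
# Evening: tear down the whole resource group (cluster, VMSS, etc.).
Remove-AzResourceGroup -Name "sf-worker-rg" -Force

# Morning: recreate the cluster from the same ARM template and parameters.
New-AzResourceGroup -Name "sf-worker-rg" -Location "westeurope"
New-AzResourceGroupDeployment `
    -ResourceGroupName "sf-worker-rg" `
    -TemplateFile ".\cluster.json" `
    -TemplateParameterFile ".\cluster.parameters.json"

# Afterwards, redeploy the application package against the new cluster
# (Copy-ServiceFabricApplicationPackage, Register-ServiceFabricApplicationType,
#  New-ServiceFabricApplication).
```

Both halves can be dropped into a scheduled job (Azure Automation, a DevOps pipeline, or a plain scheduled task) to run at fixed times.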
Q) Is there a better way to stop/start the VMs other than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
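For what it's worth, a minimal sketch of that stop/start with the Az PowerShell module (resource group and scale set names are placeholders); the quorum-loss caveat above still applies:

```powershell
# Night: deallocate every instance in the node type's scale set (stops compute billing).
Stop-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm" -Force

# Morning: start all instances again; the cluster may or may not recover quorum.
Start-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm"
```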
Q) Can we do a primary set with the cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a 'Worker' that is a larger size – and set placement constraints on your application so that it deploys only to those larger VMs. However, if your Service Fabric service stores state then you will still run into a similar problem: once the worker node type drops below quorum (fewer than 3 replicas/nodes), there is no guarantee that your SF service will come back with all of its state intact. In that case the cluster itself would still be fine since the Primary nodes are running, but your service's state may be in an unknown replication state.
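For illustration, a minimal sketch of constraining a service to the 'Worker' node type with the Service Fabric PowerShell module (cluster endpoint, application and service names are placeholders; it assumes an unsecured cluster connection and a node type actually named Worker):

```powershell
# Connect to the cluster and create the service so it only lands on 'Worker' nodes.
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westeurope.cloudapp.azure.com:19000"

New-ServiceFabricService -ApplicationName "fabric:/ExcelProcessor" `
    -ServiceName "fabric:/ExcelProcessor/Worker" `
    -ServiceTypeName "WorkerServiceType" `
    -Stateless -InstanceCount -1 `
    -PartitionSchemeSingleton `
    -PlacementConstraint "NodeType == Worker"
```

The same constraint can also be declared in the service or application manifest instead of at creation time.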
I think you have a few options:
Instead of storing state within Service Fabric's reliable collections, store your state externally in something like Azure Storage or SQL Azure. You can optionally use something like Redis Cache or Service Fabric's reliable collections to maintain a faster read cache; just make sure all writes are persisted to the external store. This way you can freely delete and recreate your cluster at any time.
Use Service Fabric backup/restore to maintain your state: delete the entire resource group or cluster overnight, then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup to.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high-capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch, which offers native capabilities to quickly burst up compute capacity.

No. You would have to delete the cluster, then recreate it and redeploy the application in the morning.

Turning off the cluster is, as Todd said, not an option. However, you can scale down the number of VMs in the cluster.
During the day you run the number of VMs required; at night you can scale down to the minimum of 5. Check this page on how to scale VM scale sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
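A minimal sketch of that capacity change with the Az PowerShell module (names and counts are placeholders); note that the linked article also covers the Service Fabric side of scaling (durability level, removing node state) that a raw capacity change does not handle:

```powershell
# Fetch the scale set backing the node type and change its instance count.
$vmss = Get-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm"

# Night: scale down to the minimum of 5 nodes.
$vmss.Sku.Capacity = 5
Update-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm" -VirtualMachineScaleSet $vmss

# Day: scale back up to the working capacity.
$vmss.Sku.Capacity = 10
Update-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm" -VirtualMachineScaleSet $vmss
```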

For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with these clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.

Related

Virtual machine with SQL Server recovery using Premium disk

I have a VM with SQL Server and an application that serves no more than 50 users. I don't require a zero-downtime application in case my VM or the datacenter has an issue, but what I need to ensure, at a minimum, is that I can make the app available again in less than 30 minutes.
First approach: using an Availability Set with 2 VMs won't actually work, because my SQL Server lives in the same VM and I don't think an Availability Set will take care of real-time replication of my SQL Server data; it will only cover the web application itself and not the persistent data (if I'm wrong please let me know). Given the above, an Availability Set is not for me. It will also be twice as expensive because of the 2 VMs.
Second approach: using a recovery site for disaster recovery. I was reading that it won't guarantee zero data loss, because there is a minimum replication frequency (I think it is 1 hour), so you have to be prepared to deal with up to 1 hour of data loss, and I don't like that.
Third option: Azure Backup for SQL Server VM. This option could work; the only downside is that it has an RPO of 15 minutes, which is not that much. The problem is that if for some reason a user generates some critical records in the app, we won't be able to get them back into the app, because the users always destroy everything right away once they have registered it in the app.
Fourth approach: because I don't really require a zero-downtime app, I was thinking of just having the actual VM use 2 premium disks, one for the SQL Server data files and another for the SQL Server logs. In case of a VM failure I will be notified by users immediately, and what I can do is create snapshots of the OS disk and the SQL premium disks (3 in total) and then create a new VM from these snapshots, so I will get a new working VM, maybe in a different region, holding the very last data inserted into SQL before the failure happened.
Of course, I guess I will need a load balancer on top of the VM so I can just reroute traffic to the new VM. I will just kill the failed VM and use the new VM as my new system. If a failure happens again I just follow the same process, so this way I only pay for one VM and not two.
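A hedged sketch of the snapshot/rebuild step of this fourth approach, using the Az PowerShell module (disk, VM and resource group names are placeholders):

```powershell
# Snapshot the OS disk of the failed VM; repeat the same three lines for each SQL data/log disk.
$osDisk   = Get-AzDisk -ResourceGroupName "sql-rg" -DiskName "sqlvm01-osdisk"
$snapConf = New-AzSnapshotConfig -SourceUri $osDisk.Id -Location $osDisk.Location -CreateOption Copy
New-AzSnapshot -ResourceGroupName "sql-rg" -SnapshotName "sqlvm01-osdisk-snap" -Snapshot $snapConf

# To rebuild: create managed disks from the snapshots and attach them to a new VM.
$snap     = Get-AzSnapshot -ResourceGroupName "sql-rg" -SnapshotName "sqlvm01-osdisk-snap"
$diskConf = New-AzDiskConfig -SourceResourceId $snap.Id -Location $snap.Location -CreateOption Copy
$newDisk  = New-AzDisk -ResourceGroupName "sql-rg" -DiskName "sqlvm02-osdisk" -Disk $diskConf
# Then build the new VM config with Set-AzVMOSDisk -ManagedDiskId $newDisk.Id -CreateOption Attach
# and Add-AzVMDataDisk for the SQL disks, and create it with New-AzVM.
```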
Has anyone already tried this? Does it sound reasonable and doable, or am I missing something big, or maybe I won't get what I expect to get?
You are better off using Azure SQL (PaaS) instead of a VM; there are many different options to cover your needs. Running the OS/application and SQL Server in the same VM is not recommended; by changing to Azure SQL (PaaS) you can decrease the hardware for the application VM and configure your SQL tier to support 50 users. Also, you can use a load balancer as you said, either Traffic Manager (https://learn.microsoft.com/pt-br/azure/traffic-manager/traffic-manager-overview) or Application Gateway (https://learn.microsoft.com/pt-br/azure/application-gateway/overview), to route traffic to the VMs where the application is running. Depending on your application, you could migrate it to an Azure Web App (https://learn.microsoft.com/en-us/azure/app-service/).
With Azure SQL (PaaS) you can get well under 30 minutes for sure; I would say almost zero downtime, although you don't require it.
Automatic backups and point-in-time restores (a restore sketch follows these links)
https://learn.microsoft.com/pt-br/azure/sql-database/sql-database-automated-backups
Active geo-replication
https://learn.microsoft.com/pt-br/azure/sql-database/sql-database-active-geo-replication
Zone-redundant databases
https://learn.microsoft.com/pt-br/azure/sql-database/sql-database-high-availability
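As an illustration of the point-in-time restore mentioned in the first link, a sketch with the Az PowerShell module (resource group, server and database names are placeholders):

```powershell
# Restore the database to its state 30 minutes ago, as a new database on the same server.
$db = Get-AzSqlDatabase -ResourceGroupName "app-rg" -ServerName "app-sqlserver" -DatabaseName "appdb"

Restore-AzSqlDatabase -FromPointInTimeBackup `
    -PointInTime (Get-Date).AddMinutes(-30) `
    -ResourceGroupName $db.ResourceGroupName `
    -ServerName $db.ServerName `
    -ResourceId $db.ResourceID `
    -TargetDatabaseName "appdb-restored"
```

You would then repoint the application at the restored database (or rename it into place).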
Finally, I don't think an Always On availability groups solution (https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server?view=sql-server-ver15) is a good fit, since it is expensive and there are only 50 users. That's why I believe you are better off thinking of a SaaS + PaaS solution for your application and database. Your 4th option sounds fine, but you need to create a new VM, configure the IP, install SQL, configure SQL and so on to bring your SQL Server back up.
What are the users going to do if it happens when you are not available to fix it immediately? Your 30 minutes won't be met :)

What's the difference between primary and non-primary nodes in Azure Service Fabric?

I can't find any specific documentation that explains the difference between a primary node and a non-primary node, and how they are used. Can somebody shed light on this? Thanks.
If you compare Service Fabric to other orchestration tools like Kubernetes, you will notice a small difference in how clusters are defined.
Kubernetes uses the concept of a Master to host cluster management services and Minions to host your application services (containers). Until version 1.1 it was not possible to run containers on the masters, the idea being that masters should be isolated to avoid conflicts with containers running on them, such as consuming too much memory, disk, CPU, and so on.
On Service Fabric this is a bit different. When you define a NodeType as Primary, it means that this NodeType will be responsible for hosting the Service Fabric management services (the services needed to control cluster health, orchestration and so on).
When you deploy a cluster via the Azure portal, depending on the reliability tier (Bronze, Silver, Gold) you choose, it will require a certain number of nodes in the Primary NodeType to keep the cluster management healthy. For production workloads, 5 nodes is the minimum recommended size for the Primary NodeType (or for a non-primary NodeType running stateful workloads). The minimum supported VM SKU is Standard D1 or Standard D1_V2.
There is a catch for the Primary NodeType: changing the VMSS SKU (VM size) is not supported. You can do it at your own risk, but it is a recipe for disaster, because the risk of losing the management services is too high.
For a non-primary NodeType, there is no difference other than the points mentioned above. Every NodeType gets its own VMSS and load balancer (with its own domain name) on which you can configure access rules, and each NodeType has a limit of 100 nodes.
Compared to Kubernetes, SF does not add any constraints to prevent you from deploying your services alongside the management services on primary nodes; every node is part of a single pool of resources (including the primary ones). So the default behaviour is to deploy applications on every available node, no matter the NodeType.
When you plan bigger clusters (100+ nodes), it is important to take that into account, isolate your Primary NodeType from your workloads, and reduce the pressure on the nodes hosting your management services.
Having multiple node types can be useful in these situations:
You want to run services exposed to the internet & services not exposed. The first set would run on a node type (VMSS) attached to the Load Balancer and the second on a scale set that isn't.
You need to run services for certain customers on premium hardware and trials on cheaper hardware. The first set would run on nodes with lots of CPU and lots of RAM, the second on lower SKUs.
You want to build a cluster that exceeds the max node count that one VMSS can hold.
Or you need to add scale sets on the fly, to support huge growth.
And: The primary nodes run your system services, the secondaries don't.
There is not much of a difference. Nodes of different node types all share the same characteristics of a Service Fabric cluster; they all participate in load balancing, etc.
Except for one thing: system services run on the nodes of the primary node type only (source):
Primary node type is where the system services run, so the VM SKU you choose for it, must take into account the overall peak load you plan to place into the cluster. Here is an analogy to illustrate what I mean here - Think of the primary node type as your "Lungs", it is what provides oxygen to your brain, and so if the brain does not get enough oxygen, your body suffers.
An important purpose of node types is to constrain service placement to specific node types. For example, you can have several node types, one using VMs with higher CPU capacity and one focused on the amount of memory. Then you can place memory-hungry services on one node type and CPU-intensive services on the other.

Azure - Linux Standard B2ms - Turned off automatically?

I have a Linux Standard B2ms Azure virtual machine. I have disabled the auto-shutdown feature you see in the dashboard under Operations. For some reason this server was still shut down after running for about 8 days.
What reasons are there that could shut down this server if I haven't changed anything on it in the last three days?
There are many reasons that could shut down this VM; we should try to find some logs about it.
First, check Azure alerts via the Azure portal and try to find any log entries about your VM.
Second, check the VM's performance; maybe high CPU or memory usage caused it. We can find logs in /var/log/*.
Also, we can check whether there was an Azure service issue: go to Service Health -> Health history to see whether there were any issues in your region.
By the way, if we create just one VM in Azure, we can't avoid a single point of failure. In Azure, Microsoft recommends that two or more VMs be created within an availability set to provide a highly available application and to meet the 99.95% Azure SLA.
An availability set is composed of two additional groupings that protect against hardware failures and allow updates to be applied safely: fault domains (FDs) and update domains (UDs).
Fault domains:
A fault domain is a logical group of underlying hardware that share a common power source and network switch, similar to a rack within an on-premises datacenter. As you create VMs within an availability set, the Azure platform automatically distributes your VMs across these fault domains. This approach limits the impact of potential physical hardware failures, network outages, or power interruptions.
Update domains:
An update domain is a logical group of underlying hardware that can undergo maintenance or be rebooted at the same time. As you create VMs within an availability set, the Azure platform automatically distributes your VMs across these update domains. This approach ensures that at least one instance of your application always remains running as the Azure platform undergoes periodic maintenance. The order of update domains being rebooted may not proceed sequentially during planned maintenance, but only one update domain is rebooted at a time.
In your scenario, there may have been an unplanned maintenance event: when Microsoft updates the VM host, they migrate your VM to another host, which can mean shutting down your VM before migrating it.
To achieve high availability, we should create at least two VMs in one availability set.
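A minimal sketch of that recommendation with the Az PowerShell module (resource group, names, size and region are placeholders; New-AzVm fills in defaults for the image and networking):

```powershell
# Create the availability set (2 fault domains and 5 update domains are common values).
New-AzAvailabilitySet -ResourceGroupName "app-rg" -Name "app-avset" -Location "westeurope" `
    -Sku Aligned -PlatformFaultDomainCount 2 -PlatformUpdateDomainCount 5

# Create two VMs inside it; the platform spreads them across fault/update domains.
$cred = Get-Credential
1..2 | ForEach-Object {
    New-AzVm -ResourceGroupName "app-rg" -Name "appvm0$_" -Location "westeurope" `
        -AvailabilitySetName "app-avset" -Size "Standard_B2ms" -Credential $cred
}
```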

Service fabric actors state

We are planning to use the Service Fabric actor model for one of our user services. We have thousands of users, and they each have their own profile data. From reading the materials so far, the Service Fabric actor model maintains its state within the Service Fabric cluster. I couldn't get a clear picture of disaster recovery, planned shutdown scenarios, or offline data access. In such cases, is it necessary to persist the data outside of the actor service?
What happens to the data if we decide to shut down the whole Service Fabric cluster one day and want to reactivate it a few days later?
In an SF cluster in Azure, the data is stored on the temp drive. There's no guarantee that a node that is shut down retains its temp drive, so shutting down all nodes simultaneously will result in data loss.
To avoid this, you should regularly create backups of your (Actor) Services. For instance by using this Nuget package. Store the resulting files outside the cluster.
The cluster technology will help keep your data safe during failures of nodes, e.g. in a 5 node cluster, 4 remaining healthy nodes can take over the work of a failed node. Data is stored redundantly, so your services remain operational. The same functionality also allows for rolling upgrades of services/actors.
Here's an article about DR.
I implemented a large enterprise application in Service Fabric, using the actor model for order management.
A few things that might help while choosing a strategy for data backup and restoration:
The package https://github.com/loekd/ServiceFabric.BackupRestore is not full-fledged, and you need to handle some scenarios yourself.
For example: during a deployment your actor partitions may move to other nodes, and if you then try to take incremental backups they will fail with FabricMissingFullBackupException, because a full backup has not been taken on that node since it became primary, and someone needs to fix the issue manually.
How we added a retry pattern to fix that issue is outside the scope of this question.
Incremental backups did not always restore during the restoration process.
Sometimes incremental backup creation failed even when logTruncationIntervalInMinutes was set properly.
If a developer deletes the service or application by mistake, you lose all your data.
If your system depends heavily on reminders, which was the case for us: during restoration, all the reminders get reset.
A good solution: override the default KvsActorStateProvider with your own implementation that stores the data in DocumentDB, MongoDB, Cassandra or Azure SQL, for example if you want to use Power BI for some analytics.

Multi regional Azure Container Service DC/OS clusters

I'm experimenting a little with ACS using the DC/OS orchestrator, and while spinning up a cluster within a single region seems simple enough, I'm not quite sure what the best practice would be for doing deployments across multiple regions.
Azure itself does not seem to support deploying to more than one region right now. With that assumption, I guess my only other option is to create multiple, identical clusters in all the regions I wish to be available, and then use Azure Traffic Manager to route incoming traffic to the nearest available cluster.
While this solution works, it also causes a few issues I'm not 100% sure on how I should work around.
Our deployment pipelines must make sure to deploy to all regions when deploying a new version of a service. If we have an East US and a North Europe region, our CI tool has to connect to the Marathon API in both regions to trigger the new deployments (see the sketch after these points). If the deployment fails in one region and succeeds in the other, I suddenly have a disparity between the two regions.
If I have a service that uses local persistent volumes deployed, let's say PostgreSQL or Elasticsearch, it needs to have instances in both regions, since service discovery will only find services local to the region. That brings up the problem of replication between regions to keep all state in all regions; this seems to require some, or a lot of, manual configuration to get working.
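To make the first point concrete, a hedged sketch of what the CI step might look like, pushing the same Marathon app definition to both regional clusters over the Marathon REST API (endpoints and app id are placeholders, and a real pipeline would need authentication or an SSH tunnel to the masters):

```powershell
# Same app definition (image tag bumped by CI) deployed to every regional DC/OS cluster.
$appJson = Get-Content ".\myservice.marathon.json" -Raw
$regions = @("http://eastus-master:8080", "http://northeurope-master:8080")

foreach ($marathon in $regions) {
    # PUT /v2/apps/<id> creates or updates the app; Marathon then rolls out the deployment.
    Invoke-RestMethod -Method Put -Uri "$marathon/v2/apps/myservice" `
        -ContentType "application/json" -Body $appJson
    # A robust pipeline would poll /v2/deployments here and fail (or roll back) if one
    # region does not converge, to avoid the cross-region version drift described above.
}
```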
Has anyone ever used a setup somewhat like this with Azure Container Service (or Amazon's container service, as I assume the same challenges exist there) and have some pointers on how to approach this?
You have multiple options for spinning up across regions. I would use a custom installation together with Terraform for each of them. This is a great starting point: https://github.com/bernadinm/terraform-dcos
Distributing agents across different regions should be no problem, ensuring that your services keep running despite failures.
Distributing masters (giving you control over the services during failures) is a little more difficult, as it involves distributing a ZooKeeper quorum across high-latency links, so you should be careful in choosing the "distance" between regions.
Have a look at the documentation for more details.
You are correct: ACS does not currently support multi-region deployments.
Your first issue is specific to Marathon in DC/OS, I'll ping some of the engineering folks over there to see if they have any input on best practice.
Your second point is something we (I'm the ACS PM) are looking at. There are some solutions you can use in certain scenarios (e.g. ArangoDB is in the DC/OS universe and will provide replication). The DC/OS team may have something to say here too. In ACS we are evaluating the best approaches to providing solutions for this use case but I'm afraid I can't give any indication of timeline.
An alternative solution is to have your database in a SaaS offering. This takes away all the complexity of managing redundancy and replication.
