I have three .NET programs currently running as Windows services. We are migrating to Service Fabric and I have a few questions. Our intent is to migrate the services to stateful services, since we need to keep track of file locations, batch size, etc., which are currently stored in an app.config file. So we can "lift and shift" the code from the onTimer event to RunAsync, as discussed in this Stack Overflow question:
How to Migrate Windows Service in Azure Service Fabric
However there are some questions I have about these services. Of course part of using SF is to have the applications in a reliable environment to keep these applications available as much as possible, so the first question is:
Should we only deploy the service to one node and use the reliable collection to maintain the state of the process should the node go down and have to be brought back up?
Or, should we deploy the application to, say, 3 nodes and just have each application on its node check the reliable collection to see if another application is processing files, and wait if so?
The application will "awake" at a determined interval and look at a folder; if there are any files in the folder, it will process them. This could take from a couple of seconds to many minutes. So if the application was on three nodes, it is entirely possible that the other two instances would wake up to process files while the first is still working. If they could check a reliable dictionary to see whether one of the other instances is running the file processing, they would just wait until the next time they are needed.
I know this is vague; I am looking for input on whether to launch the application on multiple nodes or a single node.
In short: stateful services have partitioned data. So you will have at least one, and probably more than one, partition. For each partition a primary replica will be up and running, serving requests or doing work. Then for each primary replica there will be some secondary replicas that will take over when the primary fails. More info here.
In the configuration of the service you specify the number of partitions and the replica count:
<Service Name="Processing">
  <StatefulService ServiceTypeName="ProcessingType" TargetReplicaSetSize="[Processing_TargetReplicaSetSize]" MinReplicaSetSize="[Processing_MinReplicaSetSize]">
    <UniformInt64Partition PartitionCount="[Processing_PartitionCount]" LowKey="0" HighKey="25" />
  </StatefulService>
</Service>
The primary and secondary replicas will be distributed over the cluster nodes, so when, for example, the node running the primary replica goes down, a secondary replica on another node will take over.
There is more to it than what I have described but this is the basic idea behind it all.
So to answer your question: you should specify enough replicas on other nodes to guarantee high availability.
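The failover idea described above can be illustrated with a toy model. This is only a sketch in Python, not the Service Fabric API: the `Partition` class, the node names, and the "promote the first secondary" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition's replica set (not the Service Fabric API)."""
    primary: str                                      # node hosting the primary replica
    secondaries: list = field(default_factory=list)   # nodes hosting secondary replicas

    def node_down(self, node: str) -> None:
        """Drop the failed node; promote a secondary if the primary was lost."""
        if node == self.primary:
            if not self.secondaries:
                raise RuntimeError("no secondary left to promote")
            # Hypothetical policy: promote the first secondary in the list.
            self.primary = self.secondaries.pop(0)
        elif node in self.secondaries:
            self.secondaries.remove(node)


# Three replicas spread over three (hypothetical) nodes.
partition = Partition(primary="node1", secondaries=["node2", "node3"])
partition.node_down("node1")  # the node running the primary fails
print(partition.primary, partition.secondaries)
```

With only one replica there would be nothing to promote, which is why the replica count matters more than the choice of node.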
We use Service Fabric to deploy stateless microservices. One of the microservices is designed as a singleton. This means it is designed to be deployed on a single node only:
InstanceCount = 1
Normally, if there is more than 1 instance and one fails, the others keep working.
But how does the single instance behave? I cannot find this scenario in the documentation. I only found out that when the node is updated and the parameter IsSingletonReplicaMoveAllowedDuringUpgrade is set to true, it can be moved to another node, but no source explicitly says what happens when the singleton fails during execution.
Does it restart automatically? And if so, then how long is the downtime?
Service Fabric will restart the service for you automatically. The time it takes to restart can depend on how loaded the machine is, how large the service is, and the type of failure, but is typically within a couple of seconds.
The amount of time it takes to restart can also depend on how the service failed. Process crashes are quicker to recover from. Machine failures or networking cuts can take longer to detect, but even in these cases SF will usually restart things within 10-30 seconds.
I have an ASP.NET Core 2.0 Web API that has relatively simple logic (a simple select on a SQL Azure DB, returning about 1000-2000 records; no joins, aggregates, functions, etc.). I have only one GET API, which is called from an Angular SPA. Both are deployed in Service Fabric as stateless services, hosted in Kestrel as self-hosting exes.
Considering the number of users and how often they refresh, I've determined there will be around 15000 requests per minute, in other words 250 req/sec.
I'm trying to understand the different settings when creating my Service Fabric cluster.
I want to know:
How many Node Types? (I've determined as Front-End, and Back-End)
How many nodes per node type?
What is the VM size I need to select?
I have read the Azure documentation on cluster capacity planning. While I understand the concepts, I don't have a frame of reference to determine the actual values I need to provide for the above questions.
In most places where you read about planning a cluster, they will suggest that this subject is part science and part art, because there is no easy answer to this question. It's hard to answer because it depends a lot on the complexity of your application; without knowing the internals of how it works, we can only guess at a solution.
Based on your questions, the best guidance I can give you is: measure first, measure again, measure... plan later. Your application might be memory intensive, network intensive, CPU or disk intensive, and so on; the only way to find the best configuration is to understand it.
To understand your application before you make any decision on the SF structure, you could deploy a simple cluster with multiple node types, each containing one node of a different VM size, and measure your application's behavior on each of them. Then you would add more nodes, spread multiple instances of your service across them, and see which configuration is the best fit for each service.
1. How many Node Types?
I like to map node types 1:1 to the roles in your application, but this is not a law; it depends on how many resources each service consumes. If a service consumes enough resources (memory, CPU, disk, I/O) to keep a single VM (node) busy, it is a good candidate for its own node type. Other services are so lightweight that provisioning an entire VM (node) just for them would be a waste of resources; examples are scheduled jobs, backups, and so on. In that case you could provision a set of machines shared by these services. One important thing to keep in mind when multiple services share a node type is that they will compete for resources (memory, CPU, network, disk), so the performance measurements you took for each service in isolation might no longer hold and the services may need more resources than measured. The solution is to test them together.
Another point is the number of replicas. Having a single instance of your service is not reliable, so you have to create replicas of it (I describe the right number in the next answer). You then end up with the service's load split across multiple nodes, which can leave the node type underutilized; that is where you would consider putting several services on the same node type.
2. How many nodes per node type?
As stated before, it will depend on your service resource consumption, but a very basic rule is a minimum of 3 per node type.
Why 3?
Because 3 is the lowest number at which you can do a rolling update and still keep a quorum (more than half) of nodes/services/instances running.
1 Node: If you have a service running 1 instance in a node type of 1 node, then when you deploy a new version of your service you have to bring this instance down before the new one comes up, so you have no instance to serve the load while upgrading.
2 Nodes: Similar to 1 node, but in this case you keep only 1 node running during the upgrade. In case of failure, you have no failover to handle the load until the new instance comes up. It is worse if you are running a stateful service, because you have only one copy of your data during the upgrade and in case of failure you might lose data.
3 Nodes: During an update you still have 2 nodes available. When the one being updated comes back, the next one is taken down and you still have 2 nodes running. If one node fails, the other node can support the load until a new node is deployed.
3 nodes does not mean that your cluster will be highly reliable; it means the chances of failure and data loss will be lower. You might be unlucky and lose 2 nodes at the same time. As suggested in the docs, in production it is better to always keep the number of nodes at 5 or more, and plan to keep a quorum (more than half) of nodes/services available. I would recommend 5, 7 or 9 nodes in cases where you really need higher uptime (99.9999...%).
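The quorum reasoning above is simple arithmetic, sketched here in Python. The function names are made up for illustration, and "quorum" is taken to mean a strict majority (more than half of the nodes).

```python
def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority is still running."""
    return n - majority(n)


for n in (1, 2, 3, 5, 7, 9):
    print(f"{n} nodes: majority={majority(n)}, can lose {tolerated_failures(n)}")
```

With 1 or 2 nodes no node can be lost without losing the majority; 3 nodes tolerate one loss (for example the node being upgraded), and 5 tolerate two (one upgrading plus one unexpected failure).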
3. What is the VM size I need to select?
As said before, only measurements will give this answer.
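Once you have a measurement, turning it into an instance count is a short calculation. The sketch below is in Python and assumes numbers that are not in the question: the 100 req/sec per instance is a hypothetical measurement, the 30% headroom is an arbitrary safety margin, and `instances_needed` is a made-up helper. Only the 15000 requests per minute comes from the question.

```python
import math


def instances_needed(target_rps: float,
                     measured_rps_per_instance: float,
                     headroom: float = 0.3,
                     minimum: int = 3) -> int:
    """Instances required to serve target_rps, keeping some capacity in reserve.

    headroom is the fraction of each instance's measured capacity left unused
    (assumed value); minimum reflects the "at least 3" rule of thumb above.
    """
    if measured_rps_per_instance <= 0:
        raise ValueError("measured throughput must be positive")
    usable = measured_rps_per_instance * (1 - headroom)
    return max(minimum, math.ceil(target_rps / usable))


target_rps = 15000 / 60  # 15000 requests per minute, from the question
# Hypothetical measurement: one instance sustains 100 req/sec on the chosen VM size.
print(instances_needed(target_rps, measured_rps_per_instance=100))
```

Repeating the measurement on each candidate VM size and comparing the resulting instance counts (and their cost) is one way to make the "measure first" advice concrete.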
Observations:
These recommendations do not take into account the planning for primary node types. It is recommended to have at least 5 nodes on the primary node type, because that is where the SF system services are placed. They are responsible for managing the cluster, so they must be highly reliable; otherwise you risk losing control of your cluster. If you plan to share these nodes with your application services, keep in mind that your services might affect the system services, so you have to monitor them constantly for any such impact.
We have a microservices application with 5 stateless services
eShopWeb
eShopAPI
eShopOrder
eShopBasket
eShopPayments
We created a Service Fabric cluster in Azure with 3 nodes. Now we want to configure it as follows
eShopWeb and eShopOrder need to run on node 1
eShopAPI and eShopPayments needs to run on node 2
eShopOrder needs to run on node 3.
How can we achieve the above configuration to run multiple microservices on the same node?
You shouldn't care which node runs which service. By tying services to nodes you undermine the self-healing capabilities of SF (what if node 2 fails?). Also, you can't do rolling upgrades this way (except for eShopOrder).
I'd recommend avoiding placement constraints if you can, unless you have multiple node types or a large cluster.
Service affinity is for legacy services that are so chatty that they don't perform well when on separate nodes, because of latency in communication.
(And for production, you should use 5 nodes.)
You have two options here.
Placement constraints. You can get details in this question. But I would not recommend this, as it defeats the cluster's natural balancing ability.
Service affinity. You can find more info here. It places some services on the same node as another service. But be aware: this rule can be broken; read the documentation carefully.
I would go with the second approach and see how it works at least in staging environment.
I'm creating a cloud service where I have a worker role running some heavy processing in the background, for which I would like a Redis instance to be running locally on the worker.
What I want to do is set up the worker role project in a way that the Redis instance is installed/configured when the worker is deployed.
The Redis database would be cleared on every job startup.
I've looked at the MSOpenTech Redis for Windows with NuGet installation, but I'm unsure how I would get this working on the worker role instance. Is there a smart way to set it up, or would it be by command-line calls?
Thanks.
I'm not expecting to get this marked as an answer, but just wanted to add that this is a really bad approach for a real-world deployment.
I can understand why you might want to do this from a learning perspective; however, in a production environment it's a really bad idea, for several reasons:
You cannot guarantee when a Worker Role will be restarted by the Azure fabric controller (and you're not guaranteed to get the underlying VM in the same state it was in before it went down), so you could end up re-populating the cache simply because the role was restarted.
In a real-world implementation of Redis, you would run multiple nodes within a cluster so you benefit from a) the ability to automatically split your dataset among multiple nodes and b) continue operations when a subset of the nodes are experiencing failures - running within a Worker Role doesn't give you any of this. You also run the risk of multiple Redis instances (unaware of each other) every time you scale-out your Worker Role.
You will need to manage your Redis installation within the Worker Role, and Worker Roles simply aren't designed for this. PaaS Worker Roles are designed to run the Worker Role package that is deployed and nothing else. If you really want to run Redis yourself, you should probably look at IaaS VMs.
I would recommend that you take a look at the Azure Redis Cache SaaS offering (see http://azure.microsoft.com/en-gb/services/cache/) which offers a fully managed, highly-available, implementation of the Redis Cache. I use it on several projects and can highly recommend it.
To install any software on a worker role instance, you'd need to set this up to happen as a startup task.
You'll reference startup tasks in your ServiceDefinition.csdef file, in the <Startup> element, with a reference to your command file which installs whatever software you want (such as Redis).
I haven't tried installing Redis in a worker role instance, so I can't comment about whether this will succeed. And you'll also need to worry about opening the right ports (whether external- or internal-facing), and scaling (e.g. what happens when you scale to two worker role instances, both running redis?). My answer is specific to how you install software on a role instance.
More info on startup task setup is here.
I am trying to port an application to the Azure platform. I want to run an existing application multiple times. My initial idea is as follows: I have a master_process. I have many slave_processes. Each process is a worker role in Azure. Each slave_process will run an instance of the application independently. I want master_process to start many slave_processes and provide them with the input arguments. At the end, master_process will collect the results. Currently, I have a working setup for calling the whole application from a C# wrapper. So, to succeed, I need two things: First, I have to find a way to start slave workers from inside a master worker (just like threads). Second, I need to find a way to store the results of the slave workers and reach these result files from the master worker. Can anyone help me?
I think I would try and solve the problem differently. Deploying a whole new instance can take 15 to 30 minutes. Adding extra instances to an already running worker role is a little quicker, but not by much. I'm going to presume that you want results faster than that and that this process is something that is run frequently.
I would have just one worker role type that runs your existing logic and as many instances of that worker role that you determine you'll need. Whatever your client is will decide that it needs to break the work up in a certain number of pieces, let's say 10 for the sake of argument. It will give each piece of work an ID (e.g. a guid) and then put 10 messages that contain the parameters and the ID into a queue. Your worker role instances take messages out of the queue, do their work and write their results to storage somewhere (either SQL Azure, Azure Table Storage or maybe even blob storage depending on what the results are). The client polls that storage to wait for all of the results to be complete and then carries on.
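The queue-based flow described above can be sketched locally. This is a Python illustration only: `queue.Queue` stands in for a cloud queue, a dictionary stands in for table or blob storage, threads stand in for worker role instances, and `run_existing_logic` is a placeholder for the real application. None of these names come from an Azure SDK.

```python
import queue
import threading
import time
import uuid

work_queue = queue.Queue()        # stand-in for a cloud queue
results = {}                      # stand-in for table/blob storage, keyed by work ID
results_lock = threading.Lock()


def run_existing_logic(params: int) -> int:
    """Placeholder for the existing application; here it just squares a number."""
    return params * params


def worker() -> None:
    """One worker role instance: take a message, do the work, store the result."""
    while True:
        message = work_queue.get()
        if message is None:       # sentinel: no more work
            break
        work_id, params = message
        output = run_existing_logic(params)
        with results_lock:
            results[work_id] = output


# Client side: split the job into 10 pieces, each tagged with a GUID.
pieces = {str(uuid.uuid4()): n for n in range(10)}
for work_id, params in pieces.items():
    work_queue.put((work_id, params))

workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()

# Client polls "storage" until every work ID has a result.
while True:
    with results_lock:
        if all(work_id in results for work_id in pieces):
            break
    time.sleep(0.01)

for _ in workers:                 # shut the workers down
    work_queue.put(None)
for t in workers:
    t.join()

print(sorted(results.values()))
```

Because the workers only ever see the queue and the storage, the number of instances can change without the client knowing or caring.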
If this is a process that is only run infrequently, then rather than having the worker roles deployed all of the time, you could use the same method I've described, but in addition get the client code to deploy the worker roles when it starts and then delete them when it's finished through the management API. There are samples on MSDN on how to use this.
I have a similar situation you might find useful:
I have a large sequential batch process I run in Azure which requires pre and post processing. The technique I used was to use instances of a single multifunctional worker role, but to use a "quorum" to nominate a head node, which then controls the workflow.
The way I do it is to use an Azure page blob as the quorum (basically a kind of global mutex/lock), because once a node grabs it for writing it's locked. For resilience, in case there's an issue with the head node, all nodes occasionally try to recapture the quorum.
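The head-node election can be sketched as a lock with an expiry. The Python below is an in-memory stand-in, not Azure blob storage code: `LeaseLock`, the 30-second duration, and the node names are assumptions, and the clock is passed in explicitly so the behaviour is easy to follow. A real blob lock is enforced atomically by the storage service, which this single-process sketch does not attempt to reproduce.

```python
class LeaseLock:
    """In-memory stand-in for a blob used as a lock: one holder at a time, with expiry."""

    def __init__(self, duration_s: float):
        self.duration_s = duration_s
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id: str, now: float) -> bool:
        """Grant or renew the lock if it is free, expired, or already held by node_id."""
        if self.holder in (None, node_id) or now >= self.expires_at:
            self.holder = node_id
            self.expires_at = now + self.duration_s
            return True
        return False


lock = LeaseLock(duration_s=30)
print(lock.try_acquire("node-a", now=0))    # node-a becomes the head node
print(lock.try_acquire("node-b", now=10))   # refused: node-a still holds the lock
print(lock.try_acquire("node-b", now=45))   # node-a stopped renewing, node-b takes over
```

Every node calls `try_acquire` periodically; the current head node keeps renewing, and if it dies the lock expires and another node captures it, which matches the "occasionally try to recapture" behaviour described above.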