What architecture and application development best practices must be followed in order to scale a TWX application?
Most applications start with a few devices but quickly grow to thousands over time. Once the traffic is too much for one TWX instance, what strategy should be followed?
The same question applies when the front end is overwhelmed by the number of users.
Any time I have had ThingWorx architecture concerns, I have been redirected to the PTC ThingWorx guide linked below. I do not believe you need a PTC account to view it, but if you do, an account is free.
ThingWorx 8 High Availability Administrators Guide
http://support.ptc.com/WCMS/files/173281/en/ThingWorx_8_High_Availability_Administrators_Guide.pdf
In your case where you have big load concerns, the guide recommends using
two ThingWorx instances to handle the load.
At least two ThingWorx instances are required for HA configuration. A
single instance is started, which becomes leader and fully connects to
the database. Standby servers boot up and can become the leader if
needed, but they do not fully connect to the database or load
information like the leader does. All ThingWorx servers have a service
that is called by the load balancer, which indicates their
availability. Different codes identify the leader, which receives
traffic, and standby nodes, which do not receive traffic but may
become leader.
High-Level Architecture example from the referenced guide:
The Load Balancer determines which ThingWorx instance is to be used by the user. Usually it is used to determine which is available in a redundant architecture (which is what makes it Highly Available). However, it can also be used to determine which to use based on performance. In PTC's HA Admin Guide, they use HAProxy (see page 47) as the Load Balancer. See Section 3.2 of the HAProxy Config Doc for how to configure based on performance.
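To illustrate the leader/standby health check described in the guide excerpt, here is a minimal Python sketch. The status codes, node names, and the `pick_leader` helper are assumptions made up for illustration; the real codes are documented in the PTC guide, and in a real deployment HAProxy performs this check rather than your own code.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical status codes; the real values are defined in the PTC guide.
LEADER_CODE = 200
STANDBY_CODE = 503


@dataclass
class Node:
    name: str
    health_code: Optional[int]  # None means the health check got no answer


def pick_leader(nodes: List[Node]) -> Optional[Node]:
    """Return the node reporting itself as leader, or None if there is not exactly one."""
    leaders = [n for n in nodes if n.health_code == LEADER_CODE]
    return leaders[0] if len(leaders) == 1 else None


# Example cluster: one leader that receives traffic, one standby that does not.
nodes = [Node("twx-1", LEADER_CODE), Node("twx-2", STANDBY_CODE)]
leader = pick_leader(nodes)
```

Traffic goes only to the node returned by `pick_leader`; if the leader stops answering and a standby starts reporting the leader code, the next check routes traffic to it.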
Hope this helps! It is a pretty open-ended topic.
With the ThingWorx 9.0 release, the ThingWorx Foundation platform supports true horizontal scalability with an active-active clustering setup that has no single point of failure. The document here provides the details about the install and setup.
There is also a ThingWorx 9.0 deployment architecture guide for an overview of all the architectural details.
ThingWorx High Availability Clustering setup image
When using Azure web/worker roles, users can specify osVersion to explicitly set the "Guest OS image" version. This ensures that when Microsoft issues new critical updates, they first show up in a newer "OS image", which users can explicitly specify and test their service on.
How is the same achieved with Azure Service Fabric? Suppose I deployed my service into Azure Service Fabric and it's been running for a month, then Microsoft issues updates for the OS on the server where the service is running - how are they applied such that I can test them first to ensure they don't break the service?
Brett is correct. An SF cluster is based on Azure VMSS, and the customer is expected to patch the OS. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-upgrade/
We have heard from the majority of SF customers that this is not at all desirable and that they do not want to be responsible for OS patching.
The feature to enable an OPT-IN automatic OS patching is indeed a very high priority within Azure Compute team. The exact details on how best to offer this is still in design, however the intent is to have this functionality enabled before the end of the year.
Although that is the right long-term solution, to mitigate this issue in the short term the SF team is working on a set of steps that will enable customers to opt into having their VMs patched using WU in a safe manner. Once the steps are tested out, we will blog about it and publish a document detailing the steps. Expect that in the next couple of months.
As I understand it you are currently responsible for managing patching on SF cluster nodes yourself. Apparently moving this to be a SF managed feature is planned but I have no idea how far down the road it might be.
I personally would make this a high priority. Having used Cloud Services for many years, I have come to rely on never having to patch my VMs manually. SF is a large backwards step in this particular area.
It'd be great to hear from an Azure PM on this...
Automatic Image based patching like cloud services in service fabric.
Today you do not have that option. The image-based patching capability is a work in progress. I posted a road map to get there on the team blog: https://blogs.msdn.microsoft.com/azureservicefabric/2017/01/09/os-patching-for-vms-running-service-fabric/ Try out the script and report any issues you hit. Looking forward to your feedback.
Lots of parts of Service Fabric are huge rolling dumpster fires backwards. Whole new hosts of problems have been introduced that the IIS/WAS/WCF team had already solved and that now need to be solved once again. The concept of releasing a PaaS platform while requiring OS patch management is laughable. To add insult to injury, there is no migration path from "classic cloud PaaS" to this stuff. WEEEE, I get to write my very own service host, something that was provided out of the box for a decade by WAS. Not all of us were scared by the ability to control all aspects of service host communication options via configuration. Now we get to use code, so a tweak to channel configuration requires a full patch/release cycle!
I'm trying to come up with a solution for achieving geo-redundancy (2+ datacentres) while using Service Fabric Reliable Actors/Services to manage state. It is implied here that geo-replication is possible:
This may happen when, for example, if you aren’t geo replicated and your entire cluster is in one data center, and the entire data center goes down.
but doesn't explain how to switch it on.
Does anybody know if it's a planned feature for ASF that just hasn't been released yet, or whether it's present but not fully explored yet?
Alternatively does anybody have any recommended approaches for cross DC resilience when the state required to run the app is stored using ASF's StateManager?
thanks,
Alex
Alex,
Apparently the Service Fabric team has yet to crack this problem (more info below). However, you should be able to set up a geo-HA Service Fabric cluster on Azure yourself. Here's an example of that:
https://alexandrebrisebois.wordpress.com/2016/05/31/deploy-a-geo-ha-service-fabric-cluster-on-azure/
Not today, but this is a common request that we continue to investigate.
The core Service Fabric clustering technology knows nothing about Azure regions and can be used to combine machines running anywhere in the world, so long as they have network connectivity to each other. However, the Service Fabric cluster resource in Azure is regional, as are the virtual machine scale sets that the cluster is built on. In addition, there is an inherent challenge in delivering strongly consistent data replication between machines spread far apart. We want to ensure that performance is predictable and acceptable before supporting cross-regional clusters. Source: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-common-questions
Cheers,
Paulo
There is no reason you cannot install a series of nodes in different regions as part of the same Fabric, and use placement constraints to control service allocation. As long as the nodes can properly communicate with each other, there should be no problem with this.
If you're using Azure, you should deploy them to Virtual Networks, and link them together using VPNs. You could even cross to on-prem.
I believe the answer would be to use a custom replicator implementation and to bridge multiple clusters with ExpressRoute.
I have some questions regarding architecting enterprise applications using azure cloud services.
Back Story
We have a system made up of about a dozen WCF Windows Services on a SQL backend. We currently have about 10 clients but expect that to grow to potentially a hundred with perhaps a hundred fold increase in the throughput demands on the system. The current system is poorly engineered and is simply not capable of scaling. So now appears to be the appropriate juncture to reengineer on the azure platform.
Process Flow
Let me briefly describe a simplified set of the services and the process flow and then ask some questions I have regarding utilizing azure cloud services to build the new system.
Service A is logged on to an external system and downloads data continuously.
Service B is logged on to a second external system and downloads data continuously.
There can only ever be one logged in instance each of services A and B.
Both A and B hand off their data to Service C which reconciles the data from the two external sources.
Validated and reconciled data is then passed from C to Service D which performs some accounting functions and then passes the resulting data to Services E and F.
Service E is continually logged in to an external system and uploads data to it.
Service F generates reports and publishes them to clients via FTP etc.
The system is actually far more complex than this but the above illustrates the processes involved. The system runs 24 hours a day 6 days a week. Queues will be used to buffer messaging between all the services.
We could just build this system using Azure persistent VMs and utilise the service bus, queues etc., but that would tie us into a vertical scaling strategy. How could we utilise cloud services to implement it, given the following questions?
Questions
Given that Service A, B and E are permanently logged in to external systems there can only ever be one active instance of each. If we implement these as single instance worker roles there is the issue with downtime and patching (which is unacceptable). If we created two instances of each is there a standard way to implement active-passive load balancing with worker roles on azure or would we have to build our own load balancer? Is there another solution to this problem that I haven’t thought of?
Services C and D are good candidates to scale using multiple worker role instances. However, each instance would have to process related data. For example, we could have 4 instances each processing data for 5 individual clients. How can we get messages to be processed in groups (client-centric) by each instance? Also, how would we redistribute load from one instance to the remaining instances when patching takes place etc.? For example, if instance 1, which processes data for 5 clients, goes down for OS patching, the data for its clients would then have to be processed by the remaining instances until it came back up again. Similarly, how could we redistribute the load if we decide to spin up additional worker roles?
Any insights or suggestions you are able to offer would be greatly appreciated.
Mat
Question #1: you will have to implement your own load-balancing. This shouldn't be terribly complex as you could use Blob storage Lease functionality to keep a mutex on some blob in the storage from one instance while holding the connection active to your external system. Every X period of time you could renew the lease if you know that connection is still active and successful. Every other worker in the Role could be checking on that lease to see if it expires. If it ever expires, the next worker would jump in and acquire the lease, and then open the connection to the external source.
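The lease-as-mutex idea above can be sketched roughly like this in Python. Note that `LeaseStore` is a hypothetical in-memory stand-in that only mimics "one holder at a time, with expiry"; it does not call the Azure storage SDK, and the 60-second duration and worker names are assumptions for illustration.

```python
import time


class LeaseStore:
    """In-memory stand-in for a blob lease: one holder at a time, with expiry."""

    def __init__(self, duration, clock=time.monotonic):
        self.duration = duration
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, worker_id):
        """Take the lease only if nobody holds it or the previous lease expired."""
        now = self.clock()
        if self.holder is None or now >= self.expires_at:
            self.holder = worker_id
            self.expires_at = now + self.duration
            return True
        return False

    def renew(self, worker_id):
        """Extend the lease; only the current holder may renew, and only before expiry."""
        now = self.clock()
        if self.holder == worker_id and now < self.expires_at:
            self.expires_at = now + self.duration
            return True
        return False


class FakeClock:
    """Manually advanced clock so the demo does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
store = LeaseStore(duration=60, clock=clock)
active = store.try_acquire("worker-1")    # worker-1 opens the external connection
standby = store.try_acquire("worker-2")   # worker-2 stays passive and keeps polling
```

The active worker calls `renew` while its external connection is healthy; each passive worker periodically calls `try_acquire` and opens the connection only when that call succeeds.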
Question #2: Look into Azure Service Bus. It has a capability to allow clients to process related messages. More info here: http://geekswithblogs.net/asmith/archive/2012/04/02/149176.aspx
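To make the client-centric grouping from question 2 concrete, here is a small Python sketch. It uses rendezvous (highest-random-weight) hashing, which is a technique swapped in for illustration and not how Service Bus itself groups messages; the instance and client names are made up.

```python
import hashlib


def owner(client_id, instances):
    """Pick the instance with the highest hash score for this client (rendezvous hashing)."""
    def score(instance):
        digest = hashlib.sha256(f"{instance}:{client_id}".encode()).hexdigest()
        return int(digest, 16)
    return max(instances, key=score)


def assignment(client_ids, instances):
    """Map every client to the instance that should process its messages."""
    return {client: owner(client, instances) for client in client_ids}


clients = [f"client-{i}" for i in range(20)]
before = assignment(clients, ["w1", "w2", "w3", "w4"])
# w1 goes down for patching: only its clients are reassigned.
after = assignment(clients, ["w2", "w3", "w4"])
```

The useful property is that removing or adding an instance only moves the clients whose best-scoring instance changed, so the other instances keep their existing groups.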
All queuing methodologies imply that if a message gets picked up but does not get processed within a configurable amount of time, it goes back onto the queue so that the next available instance can pick it up and process it.
You can use something like AzureWatch to monitor the depth of your queues (storage or service bus) and auto-scale the number of instances in your C and D roles to match, and to monitor instance statuses for roles A, B and E to make sure there is always a healthy instance there, auto-scaling if the number of ready instances drops to 0.
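The redelivery behaviour described above can be sketched as a toy in-memory queue. `VisibilityQueue` and its method names are hypothetical and only imitate the idea; real queue services have their own APIs and semantics.

```python
import time
from collections import deque


class VisibilityQueue:
    """Toy queue: a message taken but not completed in time becomes visible again."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.ready = deque()
        self.in_flight = {}  # message -> deadline by which it must be completed

    def put(self, message):
        self.ready.append(message)

    def _requeue_expired(self):
        now = self.clock()
        for message, deadline in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[message]
                self.ready.append(message)

    def get(self):
        """Hand out the next visible message, or None if there is none."""
        self._requeue_expired()
        if not self.ready:
            return None
        message = self.ready.popleft()
        self.in_flight[message] = self.clock() + self.timeout
        return message

    def complete(self, message):
        """Mark a message as processed so it is never redelivered."""
        self.in_flight.pop(message, None)
```

If the instance that took a message goes down for patching before calling `complete`, the next `get` after the timeout hands the same message to another instance.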
HTH
First, back up a step. One of the first things I do when looking at application architecture on Windows Azure is to qualify whether or not the app is a good candidate for migration to Windows Azure. I particularly look at how much integration is in the application — integration is always more difficult than expected, doubly so when doing it in the cloud. If most of your workload needs to be done through a single, always-on connection, then you are going to struggle to get the availability and scalability that we turn to the cloud for.
I don't know the detail of your application, but by way of example, assume services A & B are feeds from a financial data provider. Providers of data feeds are really good at what they do, have high availability, and provide 'enterprise grade' service (whatever that means) for enterprise-grade costs. Their architectures are also old-school and, in some cases, very rigid. So first off, consider asking your feed provider (who gives you a login/connection and expects you to pull data) to push data to you via a web service. Exposed web services are the solution to scaling and performance, and are used everywhere from table storage on Azure to high-throughput database services like DynamoDB. (I'll challenge any enterprise data provider to explain how a service like Amazon S3 is mickey-mouse.) If your data supplier pushed data to a web service via an agreed API, you could perform all sorts of scaling and availability on the service for a low engineering cost.
Your alternative is, as you are discovering, to build a whole lot of stuff to make sure that your architecture fits in with the single-node model of your data supplier. While it can be done, you are going to spend a lot of engineering cash on hand-rolling a whole bunch of distributed computing principles. If you are going to have an active-passive architecture, you need to implement a leader election algorithm in order to determine when a passive node should become active. This is not as trivial as it sounds as an active node may look like it has disappeared, but is still processing — and you don't want to slot another one in its place. So then you will implement a heartbeat, or even a separate 'witness' node that does nothing other than keep an eye on which nodes are alive in order to do something about them. You mention that downtime and patching is unacceptable. So what is acceptable? A few minutes or a few seconds, or less than a second? Do you want the passive node to take over from where the other left off, or start again?
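As a rough picture of the heartbeat-plus-witness idea mentioned above, here is a minimal Python sketch. The `Witness` class, the 10-second timeout, and the node names are assumptions for illustration. It deliberately does not solve the problem the answer warns about: a stale heartbeat does not prove the old active node has stopped processing.

```python
class Witness:
    """Tracks heartbeats and decides which node should currently be active."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat
        self.active = None

    def heartbeat(self, node, now):
        """Record a heartbeat; the first node ever seen becomes active."""
        self.last_seen[node] = now
        if self.active is None:
            self.active = node

    def check(self, now):
        """Promote a live standby only if the active node's heartbeat is stale."""
        alive = [n for n, seen in self.last_seen.items() if now - seen <= self.timeout]
        if self.active not in alive:
            # Pick deterministically so repeated checks agree on the same node.
            self.active = min(alive) if alive else None
        return self.active


witness = Witness(timeout=10)
witness.heartbeat("node-a", now=0)
witness.heartbeat("node-b", now=1)
```

The timeout is where the "what is acceptable?" question shows up: a shorter timeout fails over faster but is more likely to replace a node that is merely slow.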
You will probably find that the development cost to implement all of this is lower than the cost of building and hosting a highly available physical server. Perhaps you can separate the loads and run the data feed services in a co-lo on a physical box, and have the heavy lifting of the processing done on Windows Azure. I wouldn't even look at Azure VMs, because although they don't recycle as much as roles, they are subject to occasional problems — at least more than enterprise-grade hardware. Start off with discussions with your supplier of the data feeds — they may have a solution, or one that can be cobbled together (e.g. two logins for the price of one, and the 'second' account/instance mostly throws away its data).
Be very careful of traditional enterprise integration. They ask for things that seem odd in today's cloud-oriented world. I've had a request that my calling service have a fixed IP address, for example. You may find that the money you spend writing code to work around someone else's architecture would be better spent buying physical servers. Push back on the data providers; it is time that they got out of the 90s.
[Disclaimer] 'Enterprises', particularly those that are in financial services, keep saying that their requirements are special — higher throughput, higher security, high regulations and higher availability. With the exception of a very few cases (e.g. high frequency trading), I tend to call 'bull' on most of this. They are influenced by large IT budgets and vendors of expensive kit taking them to fancy lunches, and are indoctrinated to their server-hugging beliefs. My individual view on the enterprise hardware/software/services business has influenced this answer. Your mileage may vary.
I am conducting some research on emerging web technologies and have created a very simple Azure website which makes use of web sockets and mongo db as the database. I have managed to get all the components working together and now must perform load testing on the application.
The main criterion is the maximum user load that the app can support. At the moment there is 1 web role instance, so I would probably need to test the maximum user load for that instance, then try with 2 instances and so on.
I found some solutions online such as Loadstorm, however I cannot afford to pay to use these services so I need to be able to do this from my own development machine OR from another cloud service.
I have come across Visual Studio Load Tests and they seem quite useful, however it seems they require VS Ultimate and an active msdn subscription - the prerequisites are listed here. Also, from this video which shows the basics of load tests, it seems like these load tests are created completely separately from the actual web project, so does that mean I can only see metrics related to the user? i.e. I cannot see the amount of RAM being used, processor etc.
Any suggestions?
You might create a Linux virtual machine in Azure itself or another hosting provider and use ApacheBench (ab) or JMeter to do simple load testing on your application. Be aware that in such a setup your benchmark servers may be a bottleneck themselves.
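If you want to see what such a tool does before installing one, here is a minimal Python sketch of a load generator using only the standard library. The `run_load` and `fetch` helpers and the URL in the usage note are assumptions for illustration, it only exercises plain HTTP (not web sockets), and it is no substitute for ab or JMeter.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status


def run_load(request, total_requests, concurrency):
    """Call request() total_requests times using `concurrency` threads."""
    def timed_call(_):
        start = time.perf_counter()
        try:
            request()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))
    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        "median_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }
```

A run against your own site would look like `run_load(lambda: fetch("http://your-app.example/"), 1000, 50)`, where the URL is a placeholder. Raise `concurrency` step by step and watch where failures or latency start to climb.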
Another approach is to use online load testing services which allow some free usage, such as:
loader.io, by SendGrid Labs
LoadStorm
Blazemeter
Blitz
Neotys
Loadimpact
For load testing, LoadStorm is very reasonably priced, especially compared to on-premises software (and has a free tier with up to 25 virtual clients). You can install a tool such as JMeter, but you'll still need machines (or VMs) to host and run it from, and you need to make sure that the load-generator machines aren't the bottleneck in your tests.
When you run your tests, you may want to consider separating your web tier from MongoDB. MongoDB will consume as much memory as possible (as that's what gives MongoDB its speed). In a real-world scenario, you'll likely have MongoDB in its own environment. So for your tests, I'd consider offloading MongoDB to its own instance(s), and 10gen has a Worker Role setup that's fairly straightforward to install.
Also remember that NIC bandwidth is 100Mbps per core, which could be a limiting factor on your tests, depending on how much load you're driving.
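To get a feel for when the 100Mbps-per-core figure starts to matter, a quick back-of-the-envelope calculation helps. This sketch takes the figure from the sentence above and assumes a 50 KB average response purely for illustration; it ignores protocol overhead and inbound traffic.

```python
def max_responses_per_second(cores, response_bytes, mbps_per_core=100):
    """Upper bound on responses/s that fit through the NIC, ignoring protocol overhead."""
    bytes_per_second = cores * mbps_per_core * 1_000_000 / 8  # bits -> bytes
    return bytes_per_second / response_bytes


# One core, 50,000-byte responses: 12,500,000 bytes/s divided by 50,000 bytes.
single_core_limit = max_responses_per_second(cores=1, response_bytes=50_000)
```

With those assumed numbers a single-core instance tops out at 250 responses per second on bandwidth alone, so if your test plateaus near such a figure, the NIC rather than your code may be the limit.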
One alternative to self-hosting MongoDB: Offload MongoDB to a hoster such as MongoLab. This will allow you to test the capacity of your web app without worrying about the details around MongoDB setup, configuration, optimization, etc. Currently MongoLab offers their free tier hosted in Azure, US West and US East data centers.
Editing my response; I didn't read the question carefully.
Check out this thread for various tools and links:
Open source Tool for Stress, Load and Performance testing
If you are interested in the performance counters of the application under test, look at some of the latest features added to Visual Studio cloud-based load testing.
http://blogs.msdn.com/b/visualstudioalm/archive/2014/04/07/get-application-performance-data-during-load-runs-with-visual-studio-online.aspx
To get more info on Visual Studio Cloud Load Testing solution - https://www.visualstudio.com/features/vso-cloud-load-testing-vs
I'm a developer of a MMO game and currently we're at my company facing some scalability issues which, I think, can be resolved with proper clustering of the game world.
I don't really want to reinvent the wheel, which is why I think Linux Virtual Server could be a good choice, especially with some Level 7 load balancing technique.
I'm currently looking at ktcpvs as a load balancing solution and wonder if it's a proper choice.
The main idea is to have a number of zones ("locations" in terms of my game) running on dedicated servers. When a player decides to go to some specific location, the load balancer decides which zone server will actually serve the player (that's why I need a Level 7 load balancer).
What do you folks think about all said above?
Update: I posted the same question to LVS users mailing list http://marc.info/?l=linux-virtual-server&m=124976265209769&w=2
Update: I also started the similar topic on the gamedev.net forum http://www.gamedev.net/community/forums/topic.asp?topic_id=544386
To address your question we need to understand whether you need volume or response time; it is difficult to get both at the same time.
Layer 7 load balancing is data-based, application-level balancing, so routing to an end-point depends on the data content of the network packet. You can achieve volume (more users) by implementing routing at the application level, service level or kernel level.
Scalability - I assume you are running out of memory, CPU resources and network bandwidth.
Application level - your application logic receives an application packet and routes accordingly.
Service level - your system framework (front end service of some kind) receives the packet and through a module - performs the routing (think of custom apache module, even network driver modules - like writing a network filter)
Kernel level - Performs routing at network packet level.
The closer you move to the metal, the better your response will be. I suggest using a dedicated Linux server up front to perform the routing: go native, not virtual. Use multiple or teamed network adapters for the WAN and a dedicated adapter for each end-point (one or more for the WAN, one for each connected app server).
If response time is important, you need a kernel/supervisor-state solution. It will save you a few context switches, but be aware that you need to limit hops at all costs, that you may be better served by fewer, larger machines, and that your scalability will always be limited. There is a risk in using KTCPVS: it is quite old and not actively updated. If you judge that it works for you, great; otherwise consider writing something akin to a network filter, as long as it runs in system state.
If volume is important but response time is secondary, implement a custom built high-speed socket switch built in C++ running in problem/user state. It is the easiest to maintain and will offer the best scalability.
You will need to build some prototypes to figure out what suits your needs best.
Final thoughts -
Before doing any of the above first ensure that you have optimized your game design. You may know most of this, I list it here for the benefit of all.
(a) messages should fit comfortably within one network packet, less than 1500 bytes for most home routers
(b) Try to fit the logic of the routing in your game client instead of your servers. A simple download of a small table with zones and IP addresses to a client will allow you to forego all of the above.
(c) Try to limit zone visibility for the clients; they should know about their own zone and adjacent zones only (if you implement point (b) above)
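Points (b) and (c) can be sketched together in a few lines of Python. The zone names, addresses, and the `visible_table` helper are made up for illustration; the idea is only that the client receives a small table covering its current zone and the adjacent ones, then connects directly.

```python
# Hypothetical zone table kept on the server side.
ZONES = {
    "forest": {"address": ("10.0.0.1", 7000), "adjacent": ["village"]},
    "village": {"address": ("10.0.0.2", 7000), "adjacent": ["forest", "harbor"]},
    "harbor": {"address": ("10.0.0.3", 7000), "adjacent": ["village"]},
}


def visible_table(current_zone, zones=ZONES):
    """Return the zone -> address table a client in current_zone is allowed to see."""
    names = [current_zone] + zones[current_zone]["adjacent"]
    return {name: zones[name]["address"] for name in names}


# A client standing in the forest learns about the forest and the village only.
table = visible_table("forest")
```

When the player crosses into the village, the client looks up the village address in its table, connects there, and asks for the table for its new zone.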
Hope this helps, sorry I cannot be more specific regarding KTCPVS.
You haven't specified where the bottleneck is. Network Traffic? Disk IO? CPU Cycles?
Assuming you mean a layer 7 load balancer and don't have enough CPU power, I think LVS is not the optimal choice. I have done web server load balancing with LVS, which is straightforward and isn't exactly complicated.
But I think load balancing an MMORPG this way needs a considerable amount of additional code in LVS. It might be easier to do the load balancing with a multithreaded application distributed over some multicore server. But this isn't fully scalable; it only gets you to 16 cores without a prohibitive cost increase.
The biggest issue in something like this is what happens when players are near a boundary. Obviously they need to be able to see and interact with each other, but they're on separate servers. So you need some pretty fancy inter-server communication, sometimes just duplicating messages to both servers. It can get even more complicated when someone is near a "corner", and then you have to deal with 4 servers!
The book Massively Multiplayer Game Development has a chapter on "The Pitfalls of Shared Server Boundaries" which covers this issue in detail.
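To make the boundary and "corner" cases concrete, here is a small Python sketch that works out which zone servers need to hear about a player. It assumes square zones on a grid, a square area of interest, and a margin no larger than the zone size; all of that, and the `zones_to_notify` name, is an assumption for illustration rather than anything taken from the book.

```python
def zones_to_notify(x, y, zone_size, margin):
    """Return grid coordinates of every zone within `margin` of the point (x, y)."""
    zones = set()
    for dx in (-margin, 0, margin):
        for dy in (-margin, 0, margin):
            # Floor division maps a world coordinate to its zone index.
            zones.add((int((x + dx) // zone_size), int((y + dy) // zone_size)))
    return zones


middle = zones_to_notify(50, 50, zone_size=100, margin=10)   # well inside one zone
edge = zones_to_notify(95, 50, zone_size=100, margin=10)     # near one boundary
corner = zones_to_notify(95, 95, zone_size=100, margin=10)   # near a corner
```

A player in the middle of a zone involves one server, a player near an edge involves two, and a player near a corner involves four, which is where the message duplication between servers comes from.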
I haven't heard of Linux Virtual Server before now, so I don't understand how it fits. I think your actual server application needs to support this game-specific load balancing, rather than trying to run a cluster and assuming that it will automatically know how to split up your application (which it won't). If I were you, I would write the server program to handle its own piece of land, and it should connect to the pieces of land around it, and then design a server-to-server protocol for the passing of these messages ("here comes a player, I'm going to start telling you about him!" "make sure to tell me about messages near our boundary", "okay the player is out of my territory and into yours, here's his detailed data", etc). I think it's a bit more complicated than just running a different flavor of Linux and assuming you'll get automatic load balancing.
Why are you moving the distribution logic to the load balancer? It's a component that's not free and can break. It seems your clients are quite aware of which zone they're in, so they could very well connect to zone<n>.example.com. You'd then handle load balancing at the DNS level.