Good resource for distributed computing / scale-out patterns [closed] - Azure

We're about to start working on a parallel cloud-processing application, and I'm looking for good resources on how to set things up. Let me set the context:
First we load a DB with a whole lot of data
Then we have n-instances of cloud services that will generate PDFs from the data
The PDFs will then be merged; this step should again be scalable
Result stored in DB
Done.
I'm looking for resources to help me answer questions like:
How can you measure progress across all of these instances? I suppose one controller instance does the monitoring. Should we use polling or a pub/sub system?
How can you control these n instances (start/stop/pause and so on)?
Should the controller process be aware of the processors, or should it listen to broadcasts?
We're thinking about a 'data' queue, from which each PDF-generation instance gets its processing instructions; should we also use a 'command' queue for commands like 'start/stop'?
Or is there already something out of the box for this? I'm looking for something like 'Patterns of Enterprise Application Architecture', but tailored to scale-out/parallel/cloud processing.
Any thoughts?
EDIT
Thanks for the -1. In case you're in doubt: I have Googled it, I have searched Pluralsight, and I have looked through Azure videos. I haven't run across any patterns describing a process controller/processor setup.

As @JuneT mentioned, look at the Cloud Design Patterns guide. I would recommend the Leader Election pattern mentioned in the comments.
Some other thoughts:
I don't think you should have one designated controller instance. All instances should have an equal opportunity to become the leader; this way you're guarding against the leader failing. To decide the leader, look into the blob lease functionality (a minimal sketch follows).
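For illustration, here is a hedged sketch of that election via a blob lease, using the modern @azure/storage-blob package rather than the worker-role-era .NET API; the container/blob names and renewal interval are my own illustrative choices:

```typescript
// Illustrative sketch: leader election via an Azure blob lease.
// Assumes @azure/storage-blob and AZURE_STORAGE_CONNECTION_STRING.
import { BlobServiceClient } from "@azure/storage-blob";

const LEASE_SECONDS = 30; // blob leases must be 15-60 seconds, or infinite

export async function tryBecomeLeader(): Promise<boolean> {
  const service = BlobServiceClient.fromConnectionString(
    process.env.AZURE_STORAGE_CONNECTION_STRING!
  );
  const container = service.getContainerClient("coordination");
  await container.createIfNotExists();

  const blob = container.getBlockBlobClient("leader-lock");
  // Create the lock blob if it doesn't exist yet; its content is irrelevant.
  await blob
    .upload("", 0, { conditions: { ifNoneMatch: "*" } })
    .catch(() => {});

  const lease = blob.getBlobLeaseClient();
  try {
    // Only one instance can hold the lease; everyone else gets a conflict.
    await lease.acquireLease(LEASE_SECONDS);
    // Keep renewing while alive; if renewal fails, stop acting as leader.
    setInterval(
      () => lease.renewLease().catch(() => process.exit(1)),
      (LEASE_SECONDS / 2) * 1000
    );
    return true; // we are the leader
  } catch {
    return false; // someone else leads; act as a follower
  }
}
```

Every instance calls tryBecomeLeader() at startup and whenever it notices the leader has gone quiet; whichever instance acquires the lease wins.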
You should look into Windows Azure Diagnostics as a way to monitor the health of these instances. Windows Azure Diagnostics also supports custom performance counters, which you can use to monitor the effectiveness of each instance. For scaling, you can rely on the Windows Azure Auto Scaling feature available in the portal, or look at 3rd-party solutions like AzureWatch from Paraleap Software. IMHO, the process responsible for scaling should not be part of your solution; it should sit somewhere outside.
So the general sequence of the events could be:
All instances will compete among themselves to become the leader. Only one instance will win the election; all the others will wait to hear from the leader (let's call them followers).
The leader will fetch the data from the database and push the information into a queue. Followers will poll this queue; once messages arrive, they will start processing them. When a follower finishes one task, it goes back to the queue to see if there's more work. Once the leader has put all the messages in the queue, it becomes a follower itself and works on processing messages.
Once all messages have been processed, all instances go back to step 1.

I don't think you need a controller at all. What business process starts the data load into the DB? Whatever that process is, it can populate a queue with messages indicating which PDFs need to be generated.
You then want a Worker Role with N servers that keep watching the queue, pull messages off it when there are any, process them (i.e., generate the PDFs), and remove the messages from the queue.
An autoscaling solution, such as Azure's native basic autoscaling or a more powerful one like AzureWatch, can create extra servers when the queue has messages in it and remove the unneeded servers when the queue is depleted.
This is a very standard approach to distributing load across N instances. Progress can be measured by looking at the queue and seeing how many messages are left in it; a sketch of such a worker follows.
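Here is that sketch, hedged: it uses the modern @azure/storage-queue package rather than the worker-role-era .NET API, so treat it as the shape of the loop. The queue name, timeouts, and generatePdf helper are illustrative:

```typescript
// Illustrative sketch: N workers draining an Azure storage queue.
// Assumes @azure/storage-queue and AZURE_STORAGE_CONNECTION_STRING.
import { QueueClient } from "@azure/storage-queue";

const queue = new QueueClient(
  process.env.AZURE_STORAGE_CONNECTION_STRING!,
  "pdf-jobs" // illustrative queue name
);

async function generatePdf(job: unknown): Promise<void> {
  // render the PDF described by the job message
}

export async function workerLoop(): Promise<void> {
  for (;;) {
    const { receivedMessageItems } = await queue.receiveMessages({
      numberOfMessages: 1,
      visibilityTimeout: 300, // seconds; the message reappears if we crash
    });
    if (receivedMessageItems.length === 0) {
      await new Promise((r) => setTimeout(r, 5000)); // back off when idle
      continue;
    }
    const msg = receivedMessageItems[0];
    await generatePdf(JSON.parse(msg.messageText));
    await queue.deleteMessage(msg.messageId, msg.popReceipt); // ack
  }
}

// Progress = how many messages are left on the queue.
export async function remainingWork(): Promise<number> {
  const props = await queue.getProperties();
  return props.approximateMessagesCount ?? 0;
}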

Related

Questions pertaining to micro-service architecture

I have a couple of questions around microservice architecture. For example, take the following services:
orders,
account,
communication &
management
Question 1: From what I read, I understand that each service is supposed to own the data pertaining to it, so orders would have an orders database. How important is that data ownership? Would microservices make sense if they all read from one traditional database, so that all data pertaining to the services lived in a single database? If so, are there any implications of structuring the services this way?
Question 2: Services should be able to communicate with one another. How is that statement any different from simply curling an existing API and basing the logic on the response? Is calling a service more efficient than simply curling the API?
Question 3: Is it worth it? I understand this is a massive generality and fundamentally predicated on the needs of the business. But once that discussion has been had, was the rebuild worth it? And what challenges can you expect to face?
I will try to answer all the questions.
With respect to all services using the same database: if you do that, you have two main problems. First, the database becomes a bottleneck, because all requests go to the same point. Second, you will have coupled all your services, so if the database goes down or needs updating, all your services are affected. (The database becomes a single point of failure.)
The communication between services can be whatever your services need (synchronous, asynchronous, via message passing through a message broker, etc.); it all depends on the use cases you have to support. The recommended way to avoid temporal coupling is to use a message broker like Kafka. Doing this, your services don't have to know about each other, and if some of them go down, the others keep working; when they come back up, they can continue processing the messages they have pending. However, if your services need to respond synchronously, you can define synchronous communication between services and use a circuit breaker to behave properly when the callee service is down.
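As an illustration of the broker-based option, here is a minimal sketch with Kafka via the kafkajs package (the broker address, topic, and group names are illustrative; the service names come from the question):

```typescript
// Illustrative sketch: temporally decoupled messaging through Kafka.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "orders", brokers: ["localhost:9092"] });

// The orders service publishes an event and moves on; it never calls
// the communication service directly.
export async function publishOrderCreated(orderId: string): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "orders.created",
    messages: [{ key: orderId, value: JSON.stringify({ orderId }) }],
  });
  await producer.disconnect();
}

// The communication service consumes at its own pace; if it was down,
// it resumes from its committed offset when it comes back up.
export async function runCommunicationConsumer(): Promise<void> {
  const consumer = kafka.consumer({ groupId: "communication" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["orders.created"], fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const { orderId } = JSON.parse(message.value!.toString());
      // send the confirmation email, etc.
      console.log(`notifying customer for order ${orderId}`);
    },
  });
}
```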
A microservices architecture is far more complicated to make work, to monitor, and to debug than a traditional monolith, so it is only worth it if you have very large scalability and availability requirements, and/or if the system is so large that it will require several teams working on different parts and you want to avoid dependencies among them, so that each team can work at its own pace deploying its own services.
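For the synchronous path, the circuit breaker mentioned above can be as small as this hand-rolled sketch (purely illustrative; in practice a library such as opossum covers this, and account-service is a hypothetical endpoint):

```typescript
// Illustrative sketch: a minimal failure-counting circuit breaker.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,
    private readonly resetAfterMs = 30_000
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.maxFailures) {
      if (Date.now() - this.openedAt < this.resetAfterMs) {
        throw new Error("circuit open: callee presumed down");
      }
      this.failures = 0; // half-open: let one trial call through
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: wrap the synchronous HTTP call to another (hypothetical) service.
const breaker = new CircuitBreaker();
async function getAccount(id: string): Promise<unknown> {
  return breaker.call(() =>
    fetch(`http://account-service/accounts/${id}`).then((r) => r.json())
  );
}
```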

CQRS and Event Sourcing Guide

I want to create a CQRS and event-sourcing architecture that is very cheap, very flexible, and very uncomplicated.
I want to make sure that events never, ever fail to at least reach the publisher/event store, because that's where the business is.
Now, I have several options in mind:
Azure
With Azure, I don't seem to know what to use.
Azure Service Bus
Azure Functions
Azure WebJobs (I suppose these can be replaced with Azure Functions)
?? (something else I forgot or don't know?)
How reliable are these Azure serverless solutions?
Custom
For this I am thinking of using RabbitMQ; the problem is the cost of a virtual machine to run it.
All in all, i want:
Ability to replay the messages/events in case of failure.
Ability to easily add subscribers.
Ability to select the subscribers upon which to replay the messages.
The event store should be able to store very large event messages (or how else shall I queue an image or a file?).
The event store MUST NEVER EVER get choked, or sleep.
Speed of implementation/prototyping would be an added advantage.
What does your experience suggest?
What about other alternatives (e.g. Apache Kafka)?
Why not run Event Store? It was created by Greg Young himself, and you can host it wherever you need.
I am a Java user. I have been using HornetQ (a.k.a. Artemis, which I don't use), an alternative to RabbitMQ, for the longest time; the only problem is that it does not support replication, but it gets the job done when it comes to event sourcing. For your custom scenario, RabbitMQ is a good choice, but try running it on a DigitalOcean instance to keep costs low. If you are looking for simplicity and flexibility, you have only two choices: build your own, or forgo simplicity and pick up Apache Kafka with all its complexity, which will give you flexibility. You can also build an event store with MongoDB: https://www.mongodb.com/blog/post/event-sourcing-with-mongodb
Your requirements are too vague to make the optimal choice. You need to consider a lot of things; among them, for instance, are the number of events per aggregate and the number of aggregates (note that these have to be statistical estimates). They matter primarily because if you allow tens of thousands of events per aggregate, you will need snapshotting, which adds complexity you might not need.
But for regular use cases you could just use a relational database like Postgres as your (linearizable) event store. It also has listen/notify functionality, so you would not really need a message bus either, and your application could be written in a reactive way.
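A hedged sketch of that Postgres approach using the pg package (it assumes an events table already exists; table, channel, and function names are illustrative):

```typescript
// Illustrative sketch: Postgres as an event store with listen/notify.
import { Client } from "pg";

const db = new Client({ connectionString: process.env.DATABASE_URL });

// Call start() once before appending; it connects and subscribes.
export async function start(onEvent: (streamId: string) => void) {
  await db.connect();
  await db.query("LISTEN events");
  // pg surfaces NOTIFY payloads as 'notification' events on the client.
  db.on("notification", (msg) => onEvent(msg.payload ?? ""));
}

// Append an event, then wake subscribers; no separate message bus needed.
export async function appendEvent(
  streamId: string,
  type: string,
  data: object
): Promise<void> {
  await db.query(
    "INSERT INTO events (stream_id, type, data) VALUES ($1, $2, $3)",
    [streamId, type, JSON.stringify(data)]
  );
  await db.query("SELECT pg_notify('events', $1)", [streamId]);
}
```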

Can this technology stack scale? [closed]

My client asked me to build a realtime application for chat and sending images and videos, all in realtime. He asked me to come up with my own technology stack, so I did a lot of research and found that the easiest stack to build with would be the one below:
1) Node.js with cluster, to max out the CPU cores on a single server instance - language/runtime
2) Socket.IO - realtime framework
3) Redis - pub/sub across multiple server instances
4) Nginx - to reverse-proxy and load-balance multiple servers
5) Amazon EC2 - to run the servers
6) Amazon S3 and CloudFront - to store and deliver the images/videos
Correct me if I'm wrong about the above stack. My real question is: can the above tech stack scale to 1,000,000 messages per second (text, images, videos)?
If anyone has experience with Node.js and Socket.IO, could you give me insights on, or alternatives to, the above stack?
Regards,
SinusGob
My real question is: can the above tech stack scale to 1,000,000 messages per second (text, images, videos)?
Sure it can, with the right design and enough hardware. The question your client should really be asking is not whether it can be made to go that big, but at what cost, with what practicality, and whether these are the best choices.
Let's look at each piece you've mentioned:
node.js - For an I/O-centric app, it's an excellent choice for high scale, and it can scale by deploying many CPUs in a cluster (both multi-process per server and multi-server; a sketch of the former follows). How practical this type of scale is depends a lot on what kind of shared data all these server processes need access to. Usually, the data store ultimately ends up being the harder bottleneck in scaling, because it's easy to throw more servers at request processing; it's not so easy to throw more hardware at a centralized data store. There are ways to do that, but how you do it and how hard it is depend a lot on the demands of the app.
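As a small illustration of the multi-process-per-server part, here is a sketch using Node's built-in cluster module (the port and handler are illustrative):

```typescript
// Illustrative sketch: multi-process scale-out on one box with cluster.
import cluster from "node:cluster";
import { createServer } from "node:http";
import { cpus } from "node:os";

if (cluster.isPrimary) {
  // One worker per core; the primary process only supervises.
  for (let i = 0; i < cpus().length; i++) cluster.fork();
  cluster.on("exit", () => cluster.fork()); // replace crashed workers
} else {
  // Workers share the listening port; connections are distributed for us.
  createServer((_req, res) => {
    res.end(`handled by pid ${process.pid}\n`);
  }).listen(3000);
}
```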
socket.io - If you need efficient server push of smallish messages, then socket.io is probably the best way to go, because it's the most efficient at pushing to the client. It is not great at all types of transport, though. For example, I wouldn't move large images or video around through socket.io, as there are more purpose-built ways to do that. So the use of socket.io depends a lot on what exactly the app wants to use it for. If you wanted to push a video to a client, you could push just a URL and have the client turn around and request the video via a regular HTTP URL, using well-known high-scale technology.
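A sketch of that push-a-URL-instead-of-bytes idea with the socket.io package (the event name and CloudFront URL are made up for illustration):

```typescript
// Illustrative sketch: push a pointer over the socket, not the payload.
import { Server } from "socket.io";

const io = new Server(3000);

io.on("connection", (socket) => {
  // Don't stream video bytes through the socket; send a URL instead and
  // let the client fetch it over plain HTTP/CDN.
  socket.emit("video-ready", {
    url: "https://dxxx.cloudfront.net/videos/abc123.mp4",
  });
});

// Client side (browser):
//   socket.on("video-ready", ({ url }) => { videoEl.src = url; });
```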
Redis - Again, great for some things, not great at everything, so it really depends on what you're trying to do. As I explained earlier, the design of your data store and the number of transactions through it are probably where your real scale problems lie. If I were starting this job, I'd begin by understanding the data storage needs of a server: transactions per second of various types, caching strategy, redundancy, fail-over, data persistence, etc., and design the high-scale access to data first. I wouldn't be entirely sure Redis was the preferred choice; I'd probably suggest you need a high-scale database specialist as a consultant early in the project.
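For completeness, here is what the Redis pub/sub piece of the asker's stack might look like with the ioredis package (channel name and message shape are illustrative; Socket.IO also ships an official Redis adapter that packages this same idea):

```typescript
// Illustrative sketch: fanning chat messages out across server instances.
import Redis from "ioredis";

const pub = new Redis(); // one connection for publishing
const sub = new Redis(); // a subscribed connection can't run other commands

async function main(): Promise<void> {
  await sub.subscribe("chat"); // illustrative channel name
  sub.on("message", (_channel, raw) => {
    const msg = JSON.parse(raw);
    // deliver to the websocket clients connected to *this* instance
    console.log(`instance ${process.pid} delivering`, msg);
  });

  // Any instance can publish; every subscribed instance receives it.
  await pub.publish("chat", JSON.stringify({ from: "alice", text: "hi" }));
}

main().catch(console.error);
```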
Nginx - Lots of high-scale sites use nginx, so it's certainly a good tool. Whether it's exactly the right tool for you depends on your design. I'd probably work on this part last, because it seems less central to the design; once the rest of the system is laid out, you can consider what you need here.
Amazon EC2 - One of several possible choices, and these choices are hard to compare in an apples-to-apples way. Large-scale systems have been built on EC2, so there is proof of concept there, and the general architecture seems an appropriate match. If you want to know where the real gremlins are, you'd need a consultant who has done high-scale work on EC2.
Amazon S3 - I personally know some very high-storage and high-bandwidth sites using S3 for both video and images. It works for that.
So... these are all generally good tools, if they are used in the right way. Redis is a question mark, depending on the storage needs of the actual application (you've provided zero requirements, and a database can't be selected with zero requirements). A more reasoned answer would start from a high-level set of requirements that analyzes what the system needs to do to serve 1,000,000 of whatever. Those requirements could be compared with the known capabilities of some of these pieces to get a ballpark on scaling the system. Then you'd put together benchmarking tests to exercise certain pieces of it. Success or failure will depend as much on how the app is built and how the tools are used as on which tools were selected. You can likely scale successfully with many different types of tools. Heck, Facebook runs on PHP (well, a highly modified, customized PHP that is not really typical PHP at all at runtime).

When do I need to use worker processes in Heroku

I have a Node.js app with a small set of users that is currently architected with a single web process. I'm thinking about adding an after-save trigger that will get called when a record is added to one of my tables. When that after-save trigger executes, I want to perform a large number of IO operations against external APIs. The number of IO operations depends on the number of elements in an array column on the record, so I could be performing a large number of asynchronous operations after each record is saved in this particular table.
I thought about moving this work to a background job, as suggested in Worker Dynos, Background Jobs and Queueing. The article gives a rule of thumb that tasks taking longer than 500 ms should be moved to a background job. However, after working through the example using RabbitMQ (Asynchronous Web-Worker Model Using RabbitMQ in Node), I'm not convinced it's worth the time to set everything up.
So, my questions are:
For an app with a limited number of concurrent users, is it OK to leave a long-running function in a web process?
If I eventually decide to send this work to a background job, it doesn't seem like it would be that hard to change my after-save trigger. Am I missing something?
Is there a way to do this that is easier than implementing a message queue?
For an app with a limited number of concurrent users, is it OK to leave a long-running function in a web process?
This is more a question of preference than anything.
In general I say no, it's not OK... but that's based on my experience building RabbitMQ services that run in Heroku workers, and not finding this a difficult thing to do.
With a little practice, you may find that this is the simpler solution, as I have (it allows simpler and more robust code, as it splits the web away from the background processor, letting each run without knowing about the other directly).
If I eventually decide to send this work to a background job, it doesn't seem like it would be that hard to change my after-save trigger. Am I missing something?
Are you missing something? Not really.
As long as you write your current in-the-web-process code in a well-structured and modular fashion, moving it to a background process is not usually a big deal.
Most of the panic people feel about having to move code into the background comes from having their code tightly coupled to the HTTP request/response process (I know from personal experience how painful that can be).
Is there a way to do this that is easier than implementing a message queue?
There are many options for distributed computing and background processing. I personally like RabbitMQ and the messaging patterns it uses.
I would suggest giving it a try and seeing if it's something that can work well for you (a minimal sketch follows the list of options below).
Other options include Redis with a pub/sub library on top of it, direct HTTP API calls to another web server, or just a timer in your background process that checks database tables on a given frequency and runs code based on the data it finds.
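Here is the sketch promised above, using the amqplib package (broker URL, queue name, and the callExternalApis helper are illustrative stand-ins for the asker's after-save work):

```typescript
// Illustrative sketch: the web/worker split with RabbitMQ via amqplib.
import amqp from "amqplib";

const QUEUE = "after-save-jobs";

// Web process: enqueue the job and return to the user immediately.
export async function enqueue(job: object): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue(QUEUE, { durable: true });
  ch.sendToQueue(QUEUE, Buffer.from(JSON.stringify(job)), { persistent: true });
  await ch.close();
  await conn.close();
}

// Worker process: consume at its own pace, ack only on success.
export async function startWorker(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue(QUEUE, { durable: true });
  ch.prefetch(1); // one unacked job at a time
  await ch.consume(QUEUE, async (msg) => {
    if (!msg) return;
    const job = JSON.parse(msg.content.toString());
    await callExternalApis(job); // the long-running IO lives here
    ch.ack(msg);
  });
}

async function callExternalApis(job: unknown): Promise<void> {
  // fan out to external APIs for each element of the record's array column
}
```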
P.S. You may find my RabbitMQ For Developers course of interest if you want to dig deeper into RMQ with Node: http://rabbitmq4devs.com

Architecting multi-service enterprise applications using Azure cloud services

I have some questions regarding architecting enterprise applications using Azure cloud services.
Back Story
We have a system made up of about a dozen WCF Windows services on a SQL backend. We currently have about 10 clients but expect that to grow to potentially a hundred, with perhaps a hundredfold increase in the throughput demands on the system. The current system is poorly engineered and simply not capable of scaling, so now appears to be the appropriate juncture to reengineer on the Azure platform.
Process Flow
Let me briefly describe a simplified set of the services and the process flow, and then ask some questions about using Azure cloud services to build the new system.
Service A is logged on to an external system and downloads data continuously.
Service B is logged on to a second external system and downloads data continuously.
There can only ever be one logged-in instance each of services A and B.
Both A and B hand off their data to Service C, which reconciles the data from the two external sources.
Validated and reconciled data is then passed from C to Service D, which performs some accounting functions and then passes the resulting data to Services E and F.
Service E is continually logged in to an external system and uploads data to it.
Service F generates reports and publishes them to clients via FTP, etc.
The system is actually far more complex than this, but the above illustrates the processes involved. The system runs 24 hours a day, 6 days a week. Queues will be used to buffer messaging between all the services.
We could just build this system using Azure persistent VMs and utilise the service bus, queues, etc., but that would tie us to a vertical scaling strategy. How could we utilise cloud services to implement it, given the following questions?
Questions
Given that Services A, B and E are permanently logged in to external systems, there can only ever be one active instance of each. If we implement these as single-instance worker roles, there is the issue of downtime and patching (which is unacceptable). If we created two instances of each, is there a standard way to implement active-passive load balancing with worker roles on Azure, or would we have to build our own load balancer? Is there another solution to this problem that I haven't thought of?
Services C and D are good candidates for scaling out with multiple worker role instances. However, each instance would have to process related data. For example, we could have 4 instances, each processing data for 5 individual clients. How can we get messages to be processed in groups (client-centric) by each instance? Also, how would we redistribute load from one instance to the remaining instances when patching takes place, etc.? For example, if instance 1, which processes data for 5 clients, goes down for OS patching, the data for its clients would have to be processed by the remaining instances until it comes back up. Similarly, how could we redistribute the load if we decide to spin up additional worker roles?
Any insights or suggestions you are able to offer would be greatly appreciated.
Mat
Question #1: You will have to implement your own load balancing. This shouldn't be terribly complex: you could use the blob storage lease functionality to hold a mutex on some blob from one instance while keeping the connection to your external system active. Every X period of time you renew the lease, as long as you know the connection is still active and successful. Every other worker in the role keeps checking that lease to see if it has expired; if it ever does, the next worker jumps in, acquires the lease, and opens the connection to the external source.
Question #2: Look into Azure Service Bus. It has the capability to let consumers process related messages together. More info here: http://geekswithblogs.net/asmith/archive/2012/04/02/149176.aspx
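As an illustration of that session-based grouping, here is a sketch using the modern @azure/service-bus package rather than the worker-role-era .NET API the link describes (the queue name is illustrative, and the queue must be created with sessions enabled):

```typescript
// Illustrative sketch: client-centric grouping with Service Bus sessions.
import { ServiceBusClient } from "@azure/service-bus";

const sb = new ServiceBusClient(process.env.SERVICEBUS_CONNECTION_STRING!);

// Producer: stamp each message with the client it belongs to.
export async function sendForClient(clientId: string, body: object) {
  const sender = sb.createSender("work-items");
  await sender.sendMessages({ body, sessionId: clientId });
  await sender.close();
}

// Worker: lock the next available session; all of that client's messages
// come to this instance until the session is closed or the lock lapses.
export async function processNextClient(): Promise<void> {
  const receiver = await sb.acceptNextSession("work-items");
  const messages = await receiver.receiveMessages(10, { maxWaitTimeInMs: 5000 });
  for (const msg of messages) {
    // process msg.body for the client identified by receiver.sessionId
    await receiver.completeMessage(msg);
  }
  await receiver.close(); // another instance can now pick this client up
}
```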
All queuing methodologies imply that if a message gets picked up but is not processed within a configurable amount of time, it goes back onto the queue so that the next available instance can pick it up and process it.
You can use something like AzureWatch to monitor the depth of your queues (storage or Service Bus) and auto-scale the number of instances in your C and D roles to match, and to monitor instance statuses for roles A, B and E so there is always a healthy instance, auto-scaling if the number of ready instances drops to 0.
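The scaling rule itself is simple enough to sketch as a pure function (the thresholds and drain target are illustrative; AzureWatch and the portal rules express the same idea declaratively):

```typescript
// Illustrative sketch: a queue-depth-based scaling rule.
function targetInstanceCount(
  queueDepth: number,
  messagesPerInstancePerMinute: number,
  minInstances = 1,
  maxInstances = 20
): number {
  // Aim to drain the backlog in roughly 5 minutes.
  const needed = Math.ceil(queueDepth / (messagesPerInstancePerMinute * 5));
  return Math.min(maxInstances, Math.max(minInstances, needed));
}

// e.g. 900 queued messages at 30 msgs/instance/minute -> 6 instances
console.log(targetInstanceCount(900, 30));
```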
HTH
First, back up a step. One of the first things I do when looking at application architecture on Windows Azure is to qualify whether or not the app is a good candidate for migration to Windows Azure. I look particularly at how much integration is in the application; integration is always more difficult than expected, doubly so when doing it in the cloud. If most of your workload needs to be done through a single, always-on connection, then you are going to struggle to get the availability and scalability that we turn to the cloud for.
Without knowing the details of your application, but by way of example, assume services A and B are feeds from a financial data provider. Providers of data feeds are really good at what they do, have high availability, and provide 'enterprise grade' (whatever that means) at enterprise-grade costs. Their architectures are also old-school and, in some cases, very rigid. So first off, consider asking your feed provider (the one that gives you a login/connection and expects you to pull data) to push data to you via a web service instead. Exposed web services are the solution to scaling and performance, used everywhere from table storage on Azure to high-throughput database services like DynamoDB. (I'll challenge any enterprise data provider to explain how a service like Amazon S3 is mickey-mouse.) If your data supplier pushed data to a web service via an agreed API, you could perform all sorts of scaling and availability work on that service at a low engineering cost.
Your alternative is, as you are discovering, to build a whole lot of machinery to make your architecture fit the single-node model of your data supplier. While it can be done, you are going to spend a lot of engineering cash hand-rolling a whole bunch of distributed computing principles. If you are going to have an active-passive architecture, you need to implement a leader election algorithm to determine when a passive node should become active. This is not as trivial as it sounds, because an active node may look like it has disappeared while it is still processing, and you don't want to slot another one in its place. So then you will implement a heartbeat, or even a separate 'witness' node that does nothing other than keep an eye on which nodes are alive in order to do something about them. You mention that downtime and patching are unacceptable. So what is acceptable? A few minutes, a few seconds, or less than a second? Do you want the passive node to take over from where the other left off, or to start again?
You will probably find that the development cost to implement all of this is higher than the cost of building and hosting a highly available physical server. Perhaps you can separate the loads: run the data feed services in a co-lo on a physical box, and have the heavy lifting of the processing done on Windows Azure. I wouldn't even look at Azure VMs, because although they don't recycle as much as roles, they are subject to occasional problems, at least more so than enterprise-grade hardware. Start off with discussions with your supplier of the data feeds; they may have a solution, or one that can be cobbled together (e.g. two logins for the price of one, where the 'second' account/instance mostly throws away its data).
Be very careful with traditional enterprise integration. They ask for things that seem odd in today's cloud-oriented world; I've had a request that my calling service have a fixed IP address, for example. You may find that the effort spent writing code to work around someone else's architecture would be better spent buying physical servers. Push back on the data providers; it is time they got out of the 90s.
[Disclaimer] 'Enterprises', particularly those in financial services, keep saying that their requirements are special: higher throughput, higher security, heavier regulation and higher availability. With the exception of a very few cases (e.g. high-frequency trading), I tend to call 'bull' on most of this. They are influenced by large IT budgets and vendors of expensive kit taking them to fancy lunches, and are indoctrinated in their server-hugging beliefs. My individual view of the enterprise hardware/software/services business has influenced this answer. Your mileage may vary.
