My client asked me to build a realtime application that can chat and send images and videos, all in real time. He asked me to come up with my own technology stack, so I did a lot of research and found that the easiest one to build would use the tech stack below:
1) Node.js with the cluster module - to max out the CPU cores on a single server instance
2) Socket.io - realtime framework
3) Redis - pub/sub for coordinating messages across multiple server instances
4) Nginx - to reverse proxy and load balance multiple servers
5) Amazon EC2 - to run the server
6) Amazon S3 and CloudFront - to save the images/videos and to deliver
Correct me if I'm wrong about the above stack. My real question is: can the above tech stack scale to 1,000,000 messages per second (text, images, videos)?
Anyone who has experience with node.js and socket.io, could you give me insights or alternatives to the above stack?
Regards,
SinusGob
My real question is, can the above tech stack scale to 1,000,000 messages per second (text, images, videos)?
Sure it can, with the right design and enough hardware. The question your client should really be asking is not whether it can be made to go that big, but at what cost, with what practicality, and whether those are the best choices.
Let's look at each piece you've mentioned:
node.js - For an I/O centric app, it's an excellent choice for high scale and it can scale by deploying many CPUs in a cluster (both multi-process per server and multi-server). How practical this type of scale is depends a lot on what kind of shared data all these server processes need access to. Usually, the data store ultimately ends up being the harder bottleneck in scaling because it's easy to throw more servers at the request processing. It's not so easy to throw more hardware at a centralized data store. There are ways to do that, but it depends a lot on the demands of the app for how you do it and how hard it is.
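As a minimal sketch of the multi-process-per-server part (using only Node's built-in modules; in the real app each worker would run the socket.io/Express server rather than this bare HTTP handler):

    // Fork one worker per CPU core; the workers share the listening port.
    const cluster = require('cluster');
    const http = require('http');
    const os = require('os');

    if (cluster.isMaster) {
      os.cpus().forEach(() => cluster.fork());
      // Replace any worker that dies so capacity stays constant.
      cluster.on('exit', (worker) => {
        console.log(`worker ${worker.process.pid} died, forking a replacement`);
        cluster.fork();
      });
    } else {
      http.createServer((req, res) => {
        res.end(`handled by pid ${process.pid}\n`);
      }).listen(3000);
    }

Multi-server scale then comes from running this on several machines behind the nginx layer you listed, which is where the Redis pub/sub piece earns its place.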
socket.io - If you need efficient server push of smallish messages, then socket.io is probably the best way to go because it's the most efficient at pushing to the client. It is not great at all types of transport though. For example, I wouldn't move large images or video around through socket.io, as there are more purpose-built ways to do that. So, the use of socket.io depends a lot on what exactly the app wants to use it for. If you wanted to push a video to a client, you could also push just a URL and have the client turn around and request the video via a regular HTTP URL using well-known high-scale technology.
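For example, a hypothetical sketch of that last point - it assumes a socket.io server object named `io`, media already uploaded to S3/CloudFront, and made-up event names and CDN paths:

    // Server: emit a small JSON message with a CDN URL instead of the video bytes.
    io.on('connection', (socket) => {
      socket.on('video:request', (videoId) => {
        socket.emit('video:ready', {
          id: videoId,
          url: `https://dxxxxxxxx.cloudfront.net/videos/${videoId}.mp4`, // hypothetical CDN path
        });
      });
    });

    // Client: fetch the media over plain HTTP/CDN once the URL arrives.
    socket.on('video:ready', ({ url }) => {
      document.querySelector('video').src = url;
    });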
Redis - Again, great for some things, not great at everything. So, it really depends upon what you're trying to do. As I explained earlier, the design of your data store and the number of transactions through it is probably where your real scale problems lie. If I were starting this job, I'd start with an understanding of the data storage needs for a server, transactions per second of various types, caching strategy, redundancy, fail-over, data persistence, etc., and design the high-scale access to data first. I wouldn't be entirely sure Redis was the preferred choice. I'd probably suggest you need a high-scale database guy as a consultant early in the project.
Nginx - Lots of high-scale sites use nginx, so it's certainly a good tool. Whether it's exactly the right tool for you depends upon your design. I'd probably work on this part last because it seems less central to the design, and once the rest of the system is laid out, you can then consider what you need here.
Amazon EC2 - One of several possible choices. These choices are hard to compare directly in an apples to apples comparison. Large scale systems have been built out of EC2 so there is proof of concept there and the general architecture seems an appropriate match. If you wanted to know where the real gremlins are there, you'd need a consultant that had done high scale stuff on EC2.
Amazon S3 - I personally know some very high storage and bandwidth sites using S3 for both video and images. It works for that.
So ... these are generally likely good tools to use if they are used in the right way. Redis would be a question mark depending upon the storage needs of the actual application (you've provided zero requirements, and a database can't be selected with zero requirements). A more reasoned answer would be based on putting together a high-level set of requirements that analyze what the system needs to be able to do to serve 1,000,000 whatever. Those requirements could be compared with known capabilities for some of these pieces to start a ballpark on scaling a system. Then, you'd have to put together some benchmarking tests to run on certain pieces of the system. As much of the success or failure would depend upon how the app was built and how the tools were used as on which tools were selected. You can likely build a successful scale with many different types of tools. Heck, Facebook runs on PHP (well, a highly modified, customized PHP that is not really typical PHP at all at runtime).
I have been working on a Web App for visualizing live data. It is crucial that this data is kept up to date on the client side without such updates being invoked directly by the client (e.g. no button presses or refreshing the page). Currently, on page load, I grab the current data set from a database (DynamoDB) via Ajax, and subsequent updates are pushed to any listening clients every 5 minutes via a Websockets connection (using Socket.io).
I have overlooked the computational load of this update job. It has to mine some data, process it, update the database, and send the update out to all clients. As a result, the web server is left unresponsive for about 30 seconds with each update. Furthermore, my current architecture limits me from putting my server behind a load balancer, which is something I anticipate coming up in the future. For both these reasons, I really need to get this update job off my web server.
I am relatively inexperienced in web development, and I don't feel I am knowledgeable enough about these technologies to know the drawbacks of the solutions I have come up with. Currently, I am considering:
Break the update off into a separate process so it does not block the Node event loop. This would solve my issue in the short term, but if I ever want to load balance my application, I can't have the update running on multiple machines.
Drop Websockets entirely and just have the client query the database every 5 minutes, while a separate process (or separate server if I want load balancing) keeps the database up to date without interacting directly with the client. Will this kind of access pattern put too much load on my db?
Have a separate server run the update, and send the result via Websockets (or maybe some other protocol) to my load balanced application servers, which then push that update to all listening clients as usual. Is this even possible?
Perhaps there are other solutions. It seems like this would be a relatively common problem, so I was hoping I could find some guidance here. What are the potential issues with the solutions I have proposed, and are there other possible solutions that may suit my use case better?
It sounds like you want one process sitting somewhere which crunches the data and publishes it to a stream. Clients can then subscribe to the stream as and when they like. Redis handles streams nicely: you could process your data and push it into a Redis stream, then create a small Node service which subscribes to the Redis stream and pushes the formatted data out over a websocket or via polling.
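A rough sketch of that shape, assuming the `redis` v4 client and the `ws` package; the stream name, field names and `crunchNumbers()` are made up:

    const { createClient } = require('redis');
    const { WebSocketServer } = require('ws');

    // Publisher: the number-crunching process appends each result to a Redis stream.
    async function publish() {
      const redis = createClient();
      await redis.connect();
      const result = await crunchNumbers();   // hypothetical: the expensive update job
      await redis.xAdd('updates', '*', { payload: JSON.stringify(result) });
    }

    // Subscriber: a small Node service reads the stream and fans it out over websockets.
    async function serve() {
      const redis = createClient();
      await redis.connect();
      const wss = new WebSocketServer({ port: 8080 });

      let lastId = '$';                        // only entries added from now on
      for (;;) {
        const res = await redis.xRead({ key: 'updates', id: lastId }, { BLOCK: 0 });
        if (!res) continue;
        for (const { id, message } of res[0].messages) {
          lastId = id;
          wss.clients.forEach((client) => client.send(message.payload));
        }
      }
    }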
In this scenario you can then scale up either the publishing process (the one crunching the numbers) if your data load goes up, or the subscriber process (which serves the data over a websocket to browsers) if you get an influx of clients watching the data.
You can also easily distribute the hosting of these services across other machines, and even write them in different languages if you decide the number crunching needs something like threading.
You're then left with the issue of clients (web browsers) consuming this data with a load balancer in between. This can be a hard problem if you use websockets and comes with pros and cons. But importantly, you'll have separated your data crunching from your result publishing, and that isolates your issue to just the load balancing.
I have done pretty much the same thing to check resources on some of our servers.
I have a C# service getting the information on each server that we manage and sending it to a queue (AMQ).
From there, I have a STOMP client fetching data from AMQ and emitting it to a websocket.
My main microservice fetches the data to save it into a DB.
My visualisation webapp is connected to the same websocket and fetches the data as it is sent, in order to display it.
The AMQ step isn't mandatory at all; it's just something I had to work with (historical reasons).
I don't know what type of data you are working with, so I don't know if my solution can apply to you.
Don't hesitate to ask if I'm not clear or you have any questions.
This is a big question and I'm not going to try and give you a definitive answer.
For option 2
It really depends on how expensive your queries are. You can make DynamoDB fast if you pay for enough throughput. That said, on the face of it, re-loading your whole dataset, which sounds like it's probably large, isn't good engineering.
For option 3
This option seems best to me if it's achievable, although admittedly it's hard to say with such a complex system - obviously you can't share your whole project.
Given you are already using AWS, you might want to look into AWS Lambda. If you can move the update process into a standalone job, you can host it on Lambda and move the load off the web server. Lambda is essentially infinitely scalable and you only pay for the compute you use.
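A minimal sketch of what such a Lambda might look like (Node.js runtime, AWS SDK v2 DocumentClient; the table name and the mineData/processData helpers are hypothetical stand-ins for your update job):

    // update-job/index.js - the heavy update moved off the web server into Lambda.
    const AWS = require('aws-sdk');
    const dynamo = new AWS.DynamoDB.DocumentClient();

    exports.handler = async () => {
      const raw = await mineData();            // hypothetical: gather the source data
      const processed = processData(raw);      // hypothetical: the expensive processing step

      // Persist the result; the web tier (or a small push service) reads it from here.
      await dynamo.put({
        TableName: 'LiveData',                 // hypothetical table name
        Item: { id: 'latest', updatedAt: Date.now(), data: processed },
      }).promise();

      return { statusCode: 200 };
    };

A scheduled trigger (for example an EventBridge rule firing every 5 minutes) would then replace the timer currently living in your web process.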
This really depends on you being able to split the update task off into a separate service. It's likely you would need a fair bit of refactoring to isolate it as a service. If you can break little bits off at a time and make the move gradually, even better.
If you consider trying this, and you've not used Lambda before, I would definitely start small with some hello world examples. Then try a very simple service in your application, and build up to taking on the update service.
You might also consider looking into AWS Simple Queue Service (SQS) to handle the comms between clients and server.
Database tuning
If a lot of your update time is spent waiting for database actions to complete, rather than server processing, you can consider tuning that side of things up. Things to consider are:
Buying more throughput
Using batch operations, as these move load from your server to DynamoDB (see the sketch after this list)
Tuning keys, indexes and database access
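For the batch-operations point above, a hedged sketch with the DocumentClient - the table and item shape are made up, and a production version would also retry UnprocessedItems with backoff:

    const AWS = require('aws-sdk');
    const dynamo = new AWS.DynamoDB.DocumentClient();

    // Write up to 25 items in one round trip instead of 25 separate put() calls.
    async function saveBatch(measurements) {
      const params = {
        RequestItems: {
          LiveData: measurements.map((m) => ({   // 'LiveData' is a hypothetical table
            PutRequest: { Item: m },
          })),
        },
      };
      const res = await dynamo.batchWrite(params).promise();
      return res.UnprocessedItems;               // non-empty means some writes need retrying
    }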
I am currently working on a web-based MMORPG game and would like to setup an auto-scaling strategy based on Docker and DigitalOcean droplets.
However, I am wondering how I could manage to do so:
My game server would have to be splittable across different Docker containers BUT every game server instance should act as if it was only one gigantic game server. That means that every modification happening in one (character moving) should also be mirrored in every other game server.
I am trying to get this to work (at least conceptually) but can't find a way to synchronize all my instances properly. Should I use a master only broadcasting events or is there an alternative?
I was wondering the same thing about my MySQL database: since every game server would have to read/write from/to the db, how would I make it scale properly as the game gets bigger and bigger? The best solution I could think of was to keep the database on a single server which would be very powerful.
I understand that this could be easy if all game servers didn't have to "share" their state, but this design is primarily intended so that I can scale quickly in case of a sudden spike of activity.
(There will be different "global" game servers like A, B, C... but each of those global game servers should be, behind the scenes, composed of 1-X docker containers running the "real" game server so that the "global" game server is only a concept)
The problem you state is too generic and it's difficult to give a concrete response. However, let me be reckless and give you some general-purpose scaling advice:
Remove counters from databases. Instead of primary keys that are auto-incremented IDs, try to assign random UUIDs.
Replace data that must be validated against a central point with data that is self-contained. For example, for authentication, instead of having the user credentials in a DB, use JSON Web Tokens that can be verified by any host.
Use techniques such as Consistent Hashing to balance the load without the need for load balancers. Of course, use hashing functions that distribute well, to avoid/minimize collisions.
The above advice is basically about changing the design to migrate from stateful to stateless in as many aspects as you can. If you still need to provide stateful parts, try to guess which entities will have the best chance of sharing stateful data and allocate them to the same (or a nearby) server. For example, if there are cities in your game, try to allocate to the same server the users that are in the same city, since they are more likely to interact with each other (and share stateful data) than users in different cities.
Of course if the city is too big and it's very crowded, you will probably need to partition the city in more servers to avoid overloading the server.
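To make the consistent-hashing advice concrete, here's a tiny illustrative hash ring in plain Node (no virtual nodes or replication; in practice you'd reach for an existing library):

    const crypto = require('crypto');

    // Map any value to a point on a 0..2^32-1 ring.
    function hash(value) {
      return crypto.createHash('md5').update(String(value)).digest().readUInt32BE(0);
    }

    class HashRing {
      constructor(nodes) {
        // One point per node; real rings add many virtual nodes for smoother balance.
        this.ring = nodes
          .map((node) => ({ node, point: hash(node) }))
          .sort((a, b) => a.point - b.point);
      }
      // Pick the first node clockwise from the key's point.
      nodeFor(key) {
        const p = hash(key);
        const entry = this.ring.find((e) => e.point >= p) || this.ring[0];
        return entry.node;
      }
    }

    // Usage: route a player to a game server without a central load balancer.
    const ring = new HashRing(['game-1:4000', 'game-2:4000', 'game-3:4000']);
    console.log(ring.nodeFor('player:42'));   // the same player always lands on the same server

Adding or removing a server only remaps the keys in its slice of the ring, which is the property that makes this useful for the sudden-spike scaling you describe.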
Your question is too broad and a general scaling problem as others have mentioned. It'd have been helpful if you'd stated more clearly what your system requirements are.
If it has to be real-time, then you can choose Redis as your main DB but then you'd need slaves (for replication) and you would not be able to scale automatically as you go*, since Redis doesn't support that. I assume that's not a good option when you're working with games (Sudden spikes are probable)
*there seem to be some managed solutions; you would need to check them out
If it can be near real-time, using Apache Kafka can prove to be useful.
There's also a highly scalable DB which has everything you need called CockroachDB (I'm a contributor, yay!) but you need to run tests to see if it meets your latency requirements.
Overall, going with a very powerful server is a bad choice, since there's a ceiling and it'd cost you more to scale vertically.
There's a great benefit in scaling horizontally such an application. I'll try to write down some ideas.
Option 1 (stateful):
When planning stateful applications you need to take care of synchronisation of the state (via PubSub, network broadcasting or something else) and be aware that every synchronisation takes time to occur (when not blocking each operation). If this is OK for you, let's go ahead.
Let's say you have 80k operations per second on your whole cluster. That means that every process needs to synchronise 80k state changes per second. This will be your bottleneck. Handling 80k changes per second is quite a big challenge for a Node.js application (because it's single threaded and therefore blocking).
In the end you'll need to provision for precisely the maximum number of changes you want to be able to sync and perform some tests with different programming languages. The overhead of synchronising needs to be added to the general workload of the application. It could be beneficial to use a multithreaded language like C, Java/Scala or Go.
Option 2 (stateful with routing):
In some cases it's feasible to implement a different kind of scaling.
When, for example, your application can be broken down into areas of a map, you could start with one app replica which holds the full map, and when it scales up, it splits the map up proportionally across replicas.
You'll need to implement some routing between the application servers, for example to change the state in city A of world B => call server xyz. This could be done automatically but downscaling will be a challenge.
This solution requires more care and knowledge about the application and is not as fault tolerant as option 1 but it could scale endlessly.
Option 3 (stateless):
Move the state to some other application and solve the problem elsewhere (like Redis, Etcd, ...)
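A minimal sketch of option 3, assuming the `redis` v4 client and made-up key and channel names: each game-server container keeps no state of its own, so any container in any of your Docker replicas can serve any player:

    const { createClient } = require('redis');

    const redis = createClient({ url: 'redis://redis:6379' }); // hypothetical service name
    const ready = redis.connect();                             // connect once at startup

    async function moveCharacter(characterId, x, y) {
      await ready;
      // The character's position lives in Redis, not in this container's memory.
      await redis.hSet(`character:${characterId}`, { x: String(x), y: String(y) });
      // Fan the change out so other game-server instances can mirror it.
      await redis.publish('world:events', JSON.stringify({ type: 'move', characterId, x, y }));
    }

    async function getCharacter(characterId) {
      await ready;
      return redis.hGetAll(`character:${characterId}`);
    }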
I am implementing a system composed of a collection of small systems, i.e. Raspberry, Yun, Beaglebone, the occasional PC. Crossbar.io has real promise ... but, as I understand it, doesn't currently support multiple nodes. Am I correct? Does anyone know when that might happen?
In the meantime it occurred to me that each individual node can offer an HTTP interface that I might be able to use for my purposes. My initial thought is to create workers that wrap access to the web interface on subsidiary nodes. This fits the overall architecture of the system I want to create - does it have any merit? Is it tractable? I'm new to websockets - any insight would be a great help.
Thanks for your time,
Al
In general that does sound like a fit for Crossbar.io.
There is no timeline on multi-node (i.e. multiple routers), but we hope to have at least hot-standby nodes for high availability ready in Q1. Other than for high availability, I think that a single instance should provide sufficient performance for most applications out there - on a single current (non-high-end) Xeon we're talking tens of thousands of events a second, and concurrent connections are mostly limited by RAM (and 100s of thousands on a single box are definitely not a problem). (If you need more than that then I'd be very interested in your specific use case - we want to learn more about our users.)
I don't completely understand the second part of your question: What precisely is the architecture you're planning here? If you're talking about the integrated Web server, then with recent optimizations (it can now use multiple cores) this should be enough for even moderately big sites, and with SPAs you're not likely to ever run into performance issues.
Hope this helps, and I'll be glad to answer in more detail once you've clarified the second part.
My website is written in Node.js, has no database or external dependencies, but does have a lot of large media files (images and some video) totalling some 2 GB. The structure of the website is drawn from a couple of simple JSON files.
My problem is drastic and sudden scaling. Traffic to my site is usually easily handled by any small VPS instance, but occasionally traffic can get to hundreds of times its normal level for short periods. My problem is how to scale quickly, without downtime, and automatically. I know there are issues with autoscaling, but perhaps lacking a database will negate some of that.
What sort of scaling issues and options should I be looking at?
(For context, I am currently using a Digital Ocean VPS, but I can't find a clean way to scale it with no downtime. I am not wedded to my provider.)
Scalability is important, but scaling when you need to is also important. We all do not have the scaling needs of Facebook or Twitter : ) This might just be a case of resource management.
Test the problem
Without a database, and using NodeJS, one of the strengths of node is the number of concurrent connections it can handle. For simple I/O load, it would seem you have picked a good choice of framework. And, since your problem set is a particular resource being bombarded, run some load testing on your server. Popular and free tools include:
Apache Bench
httperf
OpenLoad
And there are paid services like NeoLoad, LoadImpact (which is free at small levels), forecastweb, E-Load, etc.
With those results, determine the cause
Is it the size of the file being served? Is it the number of concurrent requests? What resources are being used, or maxed out, during a slowdown (ram, ports, file system, some other IO, CPU, bandwidth, etc...)?
Have a look at this question, which defines a few concepts for server load. To implement a solution, you will need to determine the cause of the slowdown. Is it: 1) some queues filling up? 2) a problem with TCP connections and ports? 3) too-slow allocation of resources? That will help shape your solution.
Plan for scaling.
The type of scaling needed for your project may be only a portion of what another project needs. If you know the root cause in this case, it will increase your options.
Is the problem bandwidth? Perhaps using your web server as a router to multiple cloud instances of file serving would effectively increase the bandwidth your users see. Even just storing your files on a larger cloud that can guarantee the bandwidth you may need.
Is the problem CPU, RAM, etc.? You may need multiple instances of the same web app (or an increased allotment for your VPS). This is the "Elastic" portion of Amazon's Elastic Compute Cloud (EC2), and other models like it. Create a "golden image" and duplicate it when you see traffic start spiking, using built-in monitoring tools, turning it off when the rush is done. This can be programmatic or simply manual.
Is the problem concurrent requests? The bottleneck should not be NodeJS, up to thousands of concurrent requests anyway. Perhaps just check your implementation to ensure there is not a slowdown of the single node thread. Maybe node clustering or some worker threads would alleviate the bottleneck enough for your purposes.
Last Note: For serving static files I've heard nginx or even Apache Tomcat is a little better suited than NodeJS. Depending on your web app's complexity, you might be able to switch or benchmark fairly easily.
In case anyone is reading this rather specific question years later, I have gained some perspective on it. As Clay says, the ultimate answer is to spin up more servers, either manually or programmatically based on load.
However, in my case that would be massive overkill - I'm not running Twitter. The problem was a relatively simple mistake in architecture. My app was reading the JSON data files from disk with every page request, and the disk I/O was getting saturated. I changed to loading the data files into memory on startup, and reloading them when they change using fs.watch().
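The fix looked roughly like this (a simplified sketch; the file path is made up and a careful version would debounce the watcher, since some platforms fire multiple change events):

    const fs = require('fs');
    const path = require('path');

    const DATA_FILE = path.join(__dirname, 'data', 'site.json');  // hypothetical path

    // Load once at startup instead of on every page request.
    let siteData = JSON.parse(fs.readFileSync(DATA_FILE, 'utf8'));

    // Reload the in-memory copy whenever the file changes on disk.
    fs.watch(DATA_FILE, () => {
      try {
        siteData = JSON.parse(fs.readFileSync(DATA_FILE, 'utf8'));
      } catch (err) {
        console.error('reload failed, keeping previous data', err);
      }
    });

    // Request handlers read from memory, so there is no per-request disk I/O.
    function getSiteData() {
      return siteData;
    }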
My modest VPS can now easily handle the sorts of traffic that would previously crash it. I've never seen traffic that would make me want to up-size it.
I need to build a simple analytics back-end for capturing user behaviour. This will be captured via a JavaScript snippet on a webpage, just like Google Analytics or Mixpanel data.
The system needs to capture close-to-realtime browser data (scrolling position of the page, mouse position, etc.). It will record the state of the user's page every 5 seconds. There are only three attributes on each measurement, but they have to be taken frequently.
The data doesn't necessarily need to be sent every 5 seconds - it could be batched up and sent less frequently - however it's imperative that I get all of the data while the user is on the page, i.e. I can't send it once per minute and lose the last 59 seconds of data for someone who leaves after 119 seconds.
If possible I'd like to build a system that will scale for the foreseeable future which means it working for 10,000 sites, each with 100 concurrent visitors, i.e. 100,000 concurrent users each sending one event every 5 seconds.
I'm not worried about querying the data, that can be done using a separate system. I'm most interested in how to handle the capture of the data itself.
Requirements
Based on the budgeting above, the system needs to handle 20,000 events per second coming from a pool of 100,000 users.
I'd like to host this service on Heroku however while I've done a lot of work with Rails, I'm completely new to high throughput systems (other than knowing you don't process them using Rails).
Questions
Is there a commercial system that would be good for doing this (like Pusher but for data capture as well as distribution)?
Should I be looking to do this using HTTP requests or websockets?
Is node.js the right choice for this or just trendy?
If I were to choose a socket-based solution, how many sockets can a dyno on Heroku handle for each webserver?
What are the pertinent considerations for choosing between Mongo / Redis etc. for storage?
Is this the type of problem which actually requires two solutions - the first to get you to reasonable scale quickly and inexpensively and the second to take you past that scale on lower incremental cost but with more development effort required upfront?
My high level comment for you is to build your system following the 12 factor design, and then worry about scaling as the customers arrive. I'm thrilled with Node.js and the npm ecosystem, but I also think you could build a perfectly acceptable platform with Rails. If it took 3 dynos to support 100 K concurrent users with Node, and double that with Rails, you still might be better off with Rails, if your comfort with Ruby got you to market 3 months faster. Anyway, assuming you go with Node, here are my answers:
Here are some alternatives to Pusher that might work for you and a discussion of Pusher vs. Pubnub. Also see Ably.
Use socket.io. It's largely the standard, because it uses the best transport available and falls back from WebSockets to HTTP methods.
Node is a fantastic choice and is also trendy (see the module growth rate). I suspect you could make your system work fine in Node, Rails or several other frameworks.
A Heroku dyno should be able to support tens of thousands of concurrent connections, depending on how efficient you are with RAM. A server with 16 GB of RAM was able to support a million concurrent connections. Assuming you're RAM-limited, a Heroku dyno with 512 MB of RAM (1/32 of 16 GB, so roughly 1,000,000 / 32 ≈ 31 K) should be able to support ~30 K connections.
You likely want to pick two different systems, one for storage and processing of your data, and one for caching. Here's a great post about picking your core data platform from the creator of Instagram. For core data, I recommend Postgres (on Heroku) using the Sequelize ORM. But, Mongo with SOLR for search would probably work fine too. Note that Postgres 9.2 can be used as a NoSQL datastore if that's the way you want to go. For a caching system I highly recommend Redis.
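For the Postgres-plus-Sequelize suggestion, a hedged sketch of what the events table might look like (model and column names are made up; Heroku Postgres exposes its connection string as DATABASE_URL):

    const { Sequelize, DataTypes } = require('sequelize');

    const sequelize = new Sequelize(process.env.DATABASE_URL, { logging: false });

    // One captured measurement; the exact columns depend on your tracking snippet.
    const Event = sequelize.define('Event', {
      siteId:  { type: DataTypes.STRING, allowNull: false },
      userId:  { type: DataTypes.STRING, allowNull: false },
      scrollY: { type: DataTypes.INTEGER },
      mouseX:  { type: DataTypes.INTEGER },
      mouseY:  { type: DataTypes.INTEGER },
    }, { indexes: [{ fields: ['siteId', 'createdAt'] }] });

    async function setup() {
      await sequelize.sync();          // dev convenience; use migrations in production
    }

    async function recordEvent(payload) {
      return Event.create(payload);    // e.g. { siteId: 'abc', userId: 'u1', scrollY: 300 }
    }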
No, I would try to avoid throw-away engineering. Instead, build something that works, and expect that every time you reach an order of magnitude more traffic, some piece of the system will break and need to be replaced. But, if you follow the 12 Factor principles, you should be in good shape to scale horizontally while you're investing in the replacement.
Good luck.
There are many services for sockets, but Pusher and Pubnub seem to be the market leaders in this space. Whatever you do, don't host your own (like socket.io), because Heroku times out requests longer than 30 seconds, including websockets. So hosting your own socket server would definitely be out of the question unless you plan on closing and re-opening the socket every few seconds.
If you were to use a socket service like Pusher, then you will need to implement a http endpoint for the service to send you the data anyway. So I would just cut the middle man out and go with a direct http request. Granted you need to collect constant user interactions, but that can all be recorded on the JavaScript client and sent back to the app periodically through CORS XHR or a tracking image.
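A rough sketch of that client-side batching (the collector endpoint is hypothetical; navigator.sendBeacon is what keeps the final flush from being lost when the user leaves the page):

    const ENDPOINT = 'https://collect.example.com/events';  // hypothetical collector URL
    const buffer = [];
    let lastMouseX = 0;
    let lastMouseY = 0;

    document.addEventListener('mousemove', (e) => {
      lastMouseX = e.clientX;
      lastMouseY = e.clientY;
    });

    // Record the page state every 5 seconds without sending anything yet.
    setInterval(() => {
      buffer.push({ t: Date.now(), scrollY: window.scrollY, x: lastMouseX, y: lastMouseY });
    }, 5000);

    function flush() {
      if (!buffer.length) return;
      const body = JSON.stringify(buffer.splice(0));
      // sendBeacon is delivered even while the page unloads, so the last seconds aren't lost.
      if (!navigator.sendBeacon(ENDPOINT, body)) {
        fetch(ENDPOINT, { method: 'POST', body, keepalive: true });  // fallback
      }
    }

    setInterval(flush, 60000);                   // bus the data up once a minute
    window.addEventListener('pagehide', flush);  // and once more when the user leaves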
Node is a great choice: it's light, pretty easy to set up, and the npm libraries available will have everything you need to get started. Rails can be pretty swift too, especially if you cut out the things you don't need. There is a great Railscast on this subject. The important thing is to keep it as simple as possible. Maybe split it into two applications: one for collecting data, the other for analysing/processing it. This way you could collect the data in Node because it's fast and analyse/process it in Rails because it's easy.
As I mentioned in 1., sockets just aren't going to work on Heroku, and even if you used Pusher you're still going to have to support the same number of HTTP requests, because when Pusher receives the data it's going to send it straight on to you. As for how many dynos you will need, this is something that is easily tested but not something I can estimate. It will depend entirely on the efficiency of the code collecting the data. A simple Apache AB test with the load and concurrency you are expecting will give you a good indication of what you will need. Node comes with its own concurrency, but if you were to use Rails to collect the data then use Unicorn or Puma as your server, because they support concurrency. Also try different configurations when Apache AB testing; Heroku now provides 2x dynos which are 1024 MB instead of 512 MB, which will allow you more concurrency.
This Stack Overflow thread suggests Redis is faster, and faster is what you're going to want for collecting the data. Though after collecting it, you'll probably want to process it and store it in more than a key-value store. Mongo is a good option for that, but I would go with a graph database like Neo4j because of the intricate connections analytics have.
If you're entering new ground here, then you are not going to get it right the first time; you will find yourself iterating over it to get the best performance and the most accurate data. Eventually you'll probably delete it and start again with a new architecture, and the cycle will continue. Keeping the data collection and the analysis separate means you can focus on getting each bit right separately.
A few additional points I would like to mention: use a CDN for distribution of the JavaScript client, or better yet, provide the full JS to serve from the page. Either way, load fast and load asynchronously. It sounds like a fun project. Good luck!
EDIT: In an alternate universe where you do not have to use Heroku, websockets would be an awesome solution.