What are the best papers for learning about algorithms for communicating updates in a distributed system?

I have a distributed system in mind (multiple nodes in a single datacenter) that I want to have the following properties:
Nodes can enter and leave the system at any time.
There is no data replication between nodes.
Which node a client uses is up to the client (i.e. it could be consistent hashing, it could be something else).
There is no master (i.e. no central point of failure).
Each node may receive a piece of information that needs to be forwarded to the rest of the nodes.
What algorithms (links to papers are best) are suitable for this?
(I assume some of the answers will include P2P algorithms, but most of them that I've encountered in the past have acted more like distributed hash tables, where nodes enter and take over some part of the keyspace, etc. I also recognize that multicast with simple UDP messages might be appropriate here, but what existing work would help make the messaging reliable?)
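For a concrete picture of the kind of algorithm involved, here is a minimal sketch of push-style gossip over UDP in Node.js. It is hypothetical: the peer list, port, and fanout are made up, and the duplicate-suppression set is never pruned. The idea, forwarding each new update to a few random peers and dropping already-seen message IDs, is the basic mechanism the epidemic-broadcast literature formalizes.

```js
// Hypothetical push-gossip sketch: forward each new update to FANOUT
// random peers; de-duplicate by message ID so rumors eventually die out.
const dgram = require('dgram');
const crypto = require('crypto');

const PORT = 9999;
const PEERS = ['10.0.0.2', '10.0.0.3', '10.0.0.4']; // made-up peer list
const FANOUT = 2;
const seen = new Set(); // message IDs we've already forwarded

const socket = dgram.createSocket('udp4');
socket.bind(PORT);

// Send a message to FANOUT randomly chosen peers (naive shuffle).
function gossip(msg) {
  const peers = [...PEERS].sort(() => Math.random() - 0.5).slice(0, FANOUT);
  const buf = Buffer.from(JSON.stringify(msg));
  for (const peer of peers) socket.send(buf, PORT, peer);
}

// Inject a brand-new update originating at this node.
function publish(payload) {
  const msg = { id: crypto.randomUUID(), payload };
  seen.add(msg.id);
  gossip(msg);
}

// On receipt, apply and re-forward unless we've seen the ID before.
socket.on('message', (buf) => {
  const msg = JSON.parse(buf.toString());
  if (seen.has(msg.id)) return;
  seen.add(msg.id);
  // applyUpdate(msg.payload); // application-specific
  gossip(msg);
});
```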

How about trying to implement ad hoc nodes with JXTA? See the Practical JXTA II book, available online at Scribd.

Related

Node.js built-in distributed cache

I'm looking for something for distributed cache for node.js. It should be:
built-in (runs inside the Node process)
distributed and replicated between any number of servers (1 or 2 should be enough for consensus)
consistency is not critical
automatic peer discovery is preferable (e.g. via UDP)
there should be a single leader at any given time (which will periodically update the cache)
I tried to find something based on LevelDB (memdown), but all the projects I found have been unmaintained for several years. Maybe they are simply stable, but I'm not sure. It is also possible to take a discovery and consensus algorithm and implement the rest manually, but I can't find a JS implementation of Raft etc. that works with 1+ peers.
Please advise.
I've chosen http://cote.js.org/ - a mesh network library. It's not a full solution, just discovery and a pub/sub service, plus a leader election algorithm which I wrote manually.
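For reference, the hand-rolled leader election can be as small as this sketch (hypothetical; it assumes the discovery layer, e.g. cote, reports peers joining and leaving, and simply treats the lowest live peer ID as the leader):

```js
// Minimal "lowest ID wins" leader election sketch (hypothetical).
// Assumes a discovery layer maintains the set of live peer IDs.
class LeaderElector {
  constructor(selfId) {
    this.selfId = selfId;
    this.livePeerIds = new Set([selfId]);
  }
  peerUp(id)   { this.livePeerIds.add(id); }
  peerDown(id) { this.livePeerIds.delete(id); }
  // The leader is simply the lexicographically smallest live ID.
  leaderId()   { return [...this.livePeerIds].sort()[0]; }
  isLeader()   { return this.leaderId() === this.selfId; }
}

// Usage: only the current leader refreshes the cache periodically.
const elector = new LeaderElector(process.env.PEER_ID || 'peer-1');
setInterval(() => {
  if (elector.isLeader()) {
    // refreshCache(); // leader-only work goes here
  }
}, 5000);
```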

How to scale a NodeJS stateful application

I am currently working on a web-based MMORPG game and would like to set up an auto-scaling strategy based on Docker and DigitalOcean droplets.
However, I am wondering how I could manage to do so:
My game server would have to be splittable across different Docker containers, BUT every game server instance should act as if it were one gigantic game server. That means that every modification happening in one (a character moving, say) should also be mirrored in every other game server instance.
I am trying to get this to work (at least conceptually) but can't find a way to synchronize all my instances properly. Should I use a master that only broadcasts events, or is there an alternative?
I was wondering the same thing about my MySQL database: since every game server would have to read/write from/to the DB, how would I make it scale properly as the game gets bigger and bigger? The best solution I could think of was to keep the database on a single, very powerful server.
I understand that this would be easy if the game servers didn't have to "share" their state, but the sharing is primarily intended so that I can scale quickly in case of a sudden spike of activity.
(There will be different "global" game servers like A, B, C... but each of those global game servers should be, behind the scenes, composed of 1-X docker containers running the "real" game server so that the "global" game server is only a concept)
The problem you state is too generic and it's difficult to give a concrete response. However, let me be reckless and give you some general-purpose scaling advice:
Remove counters from databases. Instead of auto-incremented IDs as primary keys, try to assign random UUIDs.
Replace data that must be validated against a central point with data that is self-contained. For example, for authentication, instead of keeping user credentials in a DB, use JSON Web Tokens that can be verified by any host.
Use techniques such as consistent hashing to balance the load without the need for load balancers (see the sketch after this answer). Of course, use hash functions that distribute well, to avoid/minimize collisions.
The above advice is basically about changing the design to migrate from stateful to stateless in as many aspects as you can. If you still need stateful parts, try to guess which entities have the best chance of sharing stateful data and allocate them on the same (or a nearby) server. For example, if there are cities in your game, try to allocate users who are in the same city on the same server, since they are more likely to interact with each other (and share stateful data) than users in different cities.
Of course, if the city is too big and very crowded, you will probably need to partition the city across more servers to avoid overloading any single one.
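To make the consistent hashing point concrete, here is a minimal ring sketch in Node.js. It is a sketch only: the node names and virtual-node count are made up, and the linear scan in nodeFor would be a binary search in a real implementation.

```js
// Minimal consistent-hash ring sketch (hypothetical).
// Maps a key (e.g. a user or city ID) to one of the game server instances.
const crypto = require('crypto');

function hash(str) {
  // First 8 hex chars of an MD5 digest as a 32-bit ring position.
  return parseInt(
    crypto.createHash('md5').update(str).digest('hex').slice(0, 8), 16);
}

class HashRing {
  constructor(nodes, vnodes = 100) {
    this.ring = []; // sorted [position, node] pairs
    for (const node of nodes) {
      for (let i = 0; i < vnodes; i++) {
        this.ring.push([hash(`${node}#${i}`), node]);
      }
    }
    this.ring.sort((a, b) => a[0] - b[0]);
  }
  // The first vnode clockwise from the key's position owns the key.
  nodeFor(key) {
    const h = hash(key);
    for (const [pos, node] of this.ring) {
      if (pos >= h) return node;
    }
    return this.ring[0][1]; // wrap around the ring
  }
}

// Usage:
const ring = new HashRing(['game-1:4000', 'game-2:4000', 'game-3:4000']);
console.log(ring.nodeFor('user:42')); // same input always routes to the same server
```

Because each node owns many virtual positions on the ring, adding or removing a server only remaps a proportional slice of the keys instead of reshuffling everything.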
Your question is too broad and describes a general scaling problem, as others have mentioned. It would have been helpful if you'd stated your system requirements more clearly.
If it has to be real-time, then you can choose Redis as your main DB, but then you'd need slaves (for replication) and you would not be able to scale automatically as you go*, since Redis doesn't support that. I assume that's not a good option when you're working with games (sudden spikes are probable).
*there seem to be some managed solutions; you need to check them out
If it can be near real-time, using Apache Kafka can prove to be useful.
There's also a highly scalable DB which has everything you need called CockroachDB (I'm a contributor, yay!) but you need to run tests to see if it meets your latency requirements.
Overall, going with a very powerful server is a bad choice, since there's a ceiling and it'd cost you more to scale vertically.
There's a great benefit in scaling horizontally such an application. I'll try to write down some ideas.
Option 1 (stateful):
When planning stateful applications, you need to take care of synchronising the state (via pub/sub, network broadcasting, or something else) and be aware that every synchronisation takes time to propagate (unless you block on each operation). If this is OK for you, let's go ahead.
Let's say you have 80k operations per second on your whole cluster. That means every process needs to synchronise 80k state changes per second. This will be your bottleneck. Handling 80k changes per second is quite a big challenge for a Node.js application (because it's single threaded and therefore blocking).
In the end, you'll need to provision for precisely the maximum number of changes you want to be able to sync and perform some tests with different programming languages. The overhead of synchronising adds to the general workload of the application. It could be beneficial to use a multithreaded language like C, Java/Scala, or Go.
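As an illustration of what Option 1's per-change synchronisation looks like at its most basic, here is a hypothetical sketch using plain UDP between instances (the peer addresses are made up, and UDP gives no delivery guarantee, so a real deployment would use a broker or a reliable pub/sub layer):

```js
// Hypothetical sketch: broadcast every state change to all other instances.
const dgram = require('dgram');

const PORT = 41234;
const PEERS = ['10.0.0.2', '10.0.0.3']; // other instances (made up)
const state = new Map(); // characterId -> position

const socket = dgram.createSocket('udp4');
socket.bind(PORT);

// Apply remote changes to the local copy of the state.
socket.on('message', (buf) => {
  const { id, x, y } = JSON.parse(buf.toString());
  state.set(id, { x, y });
});

// Apply a local change, then push it to every peer.
function moveCharacter(id, x, y) {
  state.set(id, { x, y });
  const msg = Buffer.from(JSON.stringify({ id, x, y }));
  for (const peer of PEERS) socket.send(msg, PORT, peer);
}
```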
Option 2 (stateful with routing):
In some cases it's feasible to implement a different kind of scaling.
When, for example, your application can be broken down into areas of a map, you could start with one app instance which holds the full map, and as it scales up, split the map proportionally across instances.
You'll need to implement some routing between the application servers, for example: to change state in city A of world B => call server xyz. This could be done automatically, but downscaling will be a challenge.
This solution requires more care and knowledge about the application and is not as fault tolerant as option 1 but it could scale endlessly.
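A hypothetical sketch of the routing layer for Option 2 (the area names and server URLs are made up, a static table stands in for whatever automatic assignment you'd build, and Node 18+ is assumed for the global fetch):

```js
// Hypothetical area-based routing table: each map region is owned by one
// server; a state change is forwarded to the owning server.
const AREA_OWNERS = {
  'worldB/cityA': 'http://game-xyz:3000',
  'worldB/cityB': 'http://game-abc:3000',
};

function serverFor(area) {
  const owner = AREA_OWNERS[area];
  if (!owner) throw new Error(`no server owns area ${area}`);
  return owner;
}

// e.g. changing state in city A of world B => call server xyz
async function changeState(area, change) {
  const res = await fetch(`${serverFor(area)}/state`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(change),
  });
  return res.json();
}
```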
Option 3 (stateless):
Move the state to some other application and solve the problem elsewhere (like Redis, Etcd, ...)
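For example, a minimal Option 3 sketch with Redis (hypothetical keys; uses the `redis` npm package's v4 API):

```js
// Option 3 sketch: keep no state in the game process; read/write it in Redis.
const { createClient } = require('redis');

async function main() {
  const redis = createClient({ url: 'redis://localhost:6379' });
  await redis.connect();

  // Any instance can handle any request, because the state lives in Redis.
  await redis.hSet('character:42', { x: '10', y: '20' }); // hypothetical key
  const pos = await redis.hGetAll('character:42');
  console.log(pos); // { x: '10', y: '20' }

  await redis.quit();
}

main().catch(console.error);
```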

synthetic multi-node crossbar system implementation

I am implementing a system composed of a collection of small systems, i.e. Raspberry Pi, Yún, BeagleBone, the occasional PC. Crossbar.io has real promise ... but, as I understand it, it doesn't currently support multiple nodes. Am I correct? Does anyone know when that might happen?
In the meantime it occurred to me that each individual node can offer an HTTP interface that I might be able to use for my purposes. My initial thought is to create workers that wrap access to the web interface on subsidiary nodes. This fits the overall architecture of the system I want to create - does it have any merit? Is it tractable? I'm new to WebSockets - any insight would be a great help.
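Concretely, the wrapper idea might look like this hypothetical sketch with Autobahn|JS: a worker registers a WAMP procedure on the Crossbar.io router and proxies calls to a subsidiary node's HTTP interface (the URI, hosts, and realm are made up; Node 18+ is assumed for the global fetch):

```js
// Hypothetical worker: exposes a subsidiary node's HTTP endpoint as a
// WAMP procedure on the central Crossbar.io router.
const autobahn = require('autobahn');

const connection = new autobahn.Connection({
  url: 'ws://127.0.0.1:8080/ws', // the Crossbar.io router (made up)
  realm: 'realm1',
});

connection.onopen = function (session) {
  // Callers anywhere on the mesh invoke this URI; the worker does the HTTP hop.
  session.register('com.example.node1.status', async function () {
    const res = await fetch('http://node1.local/status'); // subsidiary node
    return res.json();
  });
};

connection.open();
```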
Thanks for your time,
Al
In general that does sound like a fit for Crossbar.io.
There is no timeline on multi-node (i.e. multiple routers), but we hope to have at least hot-standby nodes for high availability ready in Q1. Other than for high availability, I think that a single instance should provide sufficient performance for most applications out there - on a single current (non-high-end) Xeon we're talking tens of thousands of events a second, and concurrent connections are mostly limited by RAM (and 100s of thousands on a single box are definitely not a problem). (If you need more than that then I'd be very interested in your specific use case - we want to learn more about our users.)
I don't completely understand the second part of your question: What precisely is the architecture you're planning here? If you're talking about the integrated Web server, then with recent optimizations (it can now use multiple cores) this should be enough for even moderately big sites, and with SPAs you're not likely to ever run into performance issues.
Hope this helps, and I'll be glad to answer in more detail once you've clarified the second part.

Should users be directed to specific data nodes when using an eventually consistent datastore?

When running a web application in a farm that uses a distributed datastore that's eventually consistent (CouchDB in my case), should I be ensuring that a given user is always directed to the same datastore instance?
It seems to me that the alternate approach, where any web request can use any data store, adds significant complexity to deal with consistency issues (retries, checks, etc). On the other hand, if a user in a given session is always directed to the same couch node, won't my consistency issues revolve mostly around "shared" user data and thus be greatly simplified?
I'm also curious about strategies for directing users but maybe I'll keep that for another question (comments welcome).
According to the CAP theorem, a distributed system cannot provide both complete consistency (all nodes see the same data at the same time) and availability (every request receives a response) during a partition or datastore instance failure; you'll have to trade one for the other.
Should I be ensuring that a given user is always directed to the same datastore instance?
Ideally, you should not! What will you do when the given instance fails? A major feature of a distributed datastore is to be available in spite of network or instance failures.
If a user in a given session is always directed to the same couch node, won't my consistency issues revolve mostly around "shared" user data and thus be greatly simplified?
You're right, the architecture would be a lot simpler that way, but again, what would you do if that instance fails? A lot of engineering effort has gone into distributed systems to allow multiple instances to reply to a query. I am not sure about CouchDB, but Cassandra lets you choose your consistency model; you'll have to trade off availability for a higher degree of consistency. The client is configured to request servers in a round-robin fashion by default, which distributes the load.
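To illustrate the Cassandra point, here is a sketch of tuning consistency per query with the DataStax Node.js driver (the keyspace, table, and contact points are made up):

```js
// Sketch of per-query consistency tuning with the DataStax Node.js driver.
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['10.0.0.10', '10.0.0.11'], // made up
  localDataCenter: 'datacenter1',
  keyspace: 'app', // hypothetical keyspace
});

async function readProfile(userId) {
  // QUORUM: a majority of replicas must answer, trading some availability
  // for a stronger consistency guarantee on this particular read.
  const result = await client.execute(
    'SELECT * FROM profiles WHERE user_id = ?',
    [userId],
    { prepare: true, consistency: cassandra.types.consistencies.quorum }
  );
  return result.first();
}
```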
I would recommend you read the Dynamo paper. The authors describe a lot of engineering details behind a distributed database.

Is it distributed or parallel computing?

The differences between distributed and parallel computing are not that clear to me. I have a "distributed systems" course this semester, and of course there's a project I should work on. I'm interested in security, so I picked a security-related project: a password-cracker system. Please don't get me wrong, it's for educational purposes!
The system consists of several processors/computers, where each computer receives a request to crack a hashed password; if the computer is busy (probably working on another password) it hands the request over to one of its peers (some computer connected to the same network), and if it is free/idle it handles the request itself.
I'm wondering whether this mechanism is considered distributed or parallel computing. Some might even consider it collaborative computing. Please guide me to the right path.
Thanks in advance :)
Distributed computing -- spreading a computation across different network nodes.
Parallel computing -- allowing multiple parts of a computation to happen at the same time.
I don't think the architecture you describe is either distributed or parallel.
It sounds like you have one machine delegating work to others, so no two machines are ever working on the same task at the same time. In that case you're not actually distributing a single task across multiple nodes, so you shouldn't call it distributed computing.
If the machine that is working on a task has multiple threads or processes working at the same time, then you could consider it a parallel computation.
