How does a project like folding#home communicate with your computer to solve protein folding? - multithreading

How does a program like folding#home work? Does my computer
individually perform a unit of "work" on it completely separate to other computers running folding#home? Then send the answer back when it's completed?
Or does Folding#home see all the computers connected to it as the project having let's say 1000 cores and then when work is done it's the equivalent of saying something like make -j <total number of cores>

Projects likes Folding#Home and BOINC are examples of loosely-coupled parallel computing where each task is fully self-contained and can be completed without communication with other computing entities. They are also examples of a pattern known as controller/worker (used to be known as master/worker), in which a central controller splits a large task into a pool of small(er) subtasks and distributes it to a bunch of worker processes on a first come first served basis, which corresponds to your first point.
In F#H (and BOINC), client computers connect to the server, request a task, work on it until it's complete, then connect to the server again to return the result and request a new task. The benefits of this are automatic load balancing, fault tolerance (via redundancy), and no need for scheduling.
When you run make -j #cores, make launches a number of parallel jobs but those jobs are usually interdependent, so make has to schedule them in an optimal way. The jobs are then run as processes on the same computer which affords make full process control. If a build step fails, the entire build job aborts immediately and the user can quickly look into the problem, fix it, and restart the build. This is not a viable model for when a client computer could have an arbitrary compute speed, could connect and disconnect at any time, and/or could decide to simply stop processing tasks. There are distributed versions of make like dmake that run different parts of the build process on different remote nodes, but that still happens in a tightly controlled environment, typically on a build cluster.
Note that on a very high level of abstraction the two are basically equivalent with the main difference being whether jobs are pushed or pulled. While job pulling works fine on all kinds of systems, job pushing usually requires (tightly-coupled) systems with predictable characteristics and good scheduling algorithms to be efficient.

Related

If multiple jobs exist in the event loop for one process. What happens to the remaining jobs if the current job crashes the process?

In Node.js cluster mode, if multiple jobs exist in the event loop for one process, should the current job crash the process, what happens to the remaining job?
I'm assuming the remaining jobs in the event loop would go unfulfilled or return a server error. My question is, why is this an acceptable risk? Why would someone opt to use Node.js cluster mode in production then, rather than use something like PHP in production, where there is no risk of this, because PHP handles each request in its own process.
Edit:
Obviously this doesn't just apply to Node.js cluster mode. It can happen on a single instance, in which case obviously the end user would just get a server error. Cluster mode just happens to be my personal use case.
I'm looking for a way to pick back up a job in the queue job should a previous job cause the process to exit, before the subsequent job gets a change to be fulfilled. I am currently reading about how you can use a tool like RabbitMQ to handle your job queue outside of the node.js cluster, and each cluster instance just pulls jobs from the RabbitMQ queue. If anyone has any input on that, that would also be greatly appreciated.
If multiple jobs exist in the event loop for one process. What happens to the remaining jobs if the current job crashes the process?
If a node.js process crashes, the same thing happens to it that happens to any other process. All open sockets get automatically disconnected and the client will receive an immediate close on their socket (socket connection dropped essentially).
If you were using a Java server that was in the middle of handling 10 requests (perhaps in threads) and it crashed, the consequences would be the same. All 10 socket connections would get dropped.
If process isolation from one request to another is your #1 criteria for selecting a server environment, then I guess you wouldn't pick any environment that ever serves multiple requests from the same process. But, you would give up a lot of get that. One of the reasons for the node.js design is that is scales really, really well for a high number of concurrent connections that are all doing mostly I/O things (disk, networking, database stuff, etc...) which happens to be most web servers. Whereas a design that fires up a new process for every incoming connection does not scale as well for a large number of concurrent connections because a process is a much more heavy-weight thing in the eyes of the operating system (memory usage, other system resource usage, task switching overhead, etc...) than the way node.js does things.
And, there are obviously hundreds of other considerations too when choosing a server environment. So, you kind of have to look at the whole picture of what you're designing for and make the best set of tradeoffs.
In general, I wouldn't put this issue anywhere on the radar for why you should choose one over the other unless you expect to be running risky code (perhaps out of your control) that crashes a lot and this issue is therefore more important in your deployment than all the other differences. And, if that was the case, I'd probably isolate the risky code to its own process (even when using nodejs) to alleviate any pain from that crash. You could have a process pool waiting to process risky things. For example, if you were running code submitted by a user, I might run that code in its own isolated VM.
If you're just worried about your own code crashing a lot, then you probably have bigger problems and need more extensive unit testing, more robust error handling and need to take advantage of other tools just as a linter and other code analysis tools to find potential problem areas. With proper design, implementation and error handling, you should be able to keep a single incoming request from harming anything other than itself. That's certainly the philosophy that every server environment that serves multiple requests from the same process advises and the people/companies deploying those servers use.

Nodejs scaling and prioritising functions

We have a node application running on the server that gets hit a lot and has to compile a zip file for download. That works well so far but I am nervous we will hit a point where performance becomes an issue.
(The application is currently running with forever on a ubuntu 14.04 machine.)
I am now asked to add all kinds of new features to the app which are more secondary and should not decrease the performance of the main function (the zip download). It would be OK to have those additional features fail in case the app is hit too many times in favour of the main zipping process.
What is the best practise here. Creating a REST API for the secondary features and put everything into a waiting list? It surely isn't enough to just create a second app and spawn a new process each time the main zip process finishes? How Can I ensure the most redundancy? I'm not talking about a multi-core cluster setup or load-balancing on NGINX, but a smart way of prioritising application functions on application level.
I hope this is not too broad. Cheers
First off, everything should be using async I/O, no synchronous I/O anywhere in your server. That's the #1 rule for building a scalable node.js server.
Second off, the highest priority tasks that have any significant CPU usage should be allowed to use multiple cores. If, as you say, the highest priority tasks is creating the zip download, then you should makes sure that that operation can take advantage of multiple cores.
You can accomplish that either with clustering (your whole server runs multiple instances that can each be on a separate core) or by creating a set of processes specifically for creating the zip files and then create a work queue in the main process that feeds these other processes work and gets the result back from them. This second option is likely a bit more complex to code than clustering, but it does prioritize the zip file creation so only one core is serving other server needs and all other cores of working on zip file creation. Clustering shares all cores with all server responsibilities.
At the pure server application level, your server can maintain a work queue of all incoming work to be done no matter what kind and it can prioritize that work. For example, if an API call comes in and there are already N zip file requests in the queue, you could immediately fail the API call to keep it from building up on the server. I don't think I'd personally recommend that solution unless your API calls are really heavy operations because it's very hard for a developer to reliably use your API if it regularly just fails on them. They would generally find it better for the API to just be slow sometimes than to regularly fail.
You might not even have to use a queue, you could just use a counter to keep track of how many ZIP file requests were "in process", but you'd have to make absolutely sure the counter was accurate in all cases. If there was ever an accumulating error in the counter, then you might just end up failing all API requests until your server was restarted.

Controlling the flow of requests without dropping them - NodeJS

I have a simple nodejs webserver running, it:
Accepts requests
Spawns separate thread to perform background processing
Background thread returns results
App responds to client
Using Apache benchmark "ab -r -n 100 -c 10", performing 100 requests with 10 at a time.
Average response time of 5.6 seconds.
My logic for using nodejs is that is typically quite resource efficient, especially when the bulk of the work is being done by another process. Seems like the most lightweight webserver option for this scenario.
The Problem
With 10 concurrent requests my CPU was maxed out, which is no surprise since there is CPU intensive work going on the background.
Scaling horizontally is an easy thing to, although I want to make the most out of each server for obvious reasons.
So how with nodejs, either raw or some framework, how can one keep that under control as to not go overkill on the CPU.
Potential Approach?
Could accepting the request storing it in a db or some persistent storage and having a separate process that uses an async library to process x at a time?
In your potential approach, you're basically describing a queue. You can store incoming messages (jobs) there and have each process get one job at the time, only getting the next one when processing the previous job has finished. You could spawn a number of processes working in parallel, like an amount equal to the number of cores in your system. Spawning more won't help performance, because multiple processes sharing a core will just run slower. Keeping one core free might be preferred to keep the system responsive for administrative tasks.
Many different queues exist. A node-based one using redis for persistence that seems to be well supported is Kue (I have no personal experience using it). I found a tutorial for building an implementation with Kue here. Depending on the software your environment is running in though, another choice might make more sense.
Good luck and have fun!

Resource mange external nodes in Jenkins for tests

My problem is that I have code that need a rebooted node. I have many long running Jenkins test jobs that needs to be executed on rebooted nodes.
My existing solution is to define multiple "proxy" machines in Jenkins with the same label (TestLable) and 1 executor per machine. I bind all the test jobs to the label (TestLable). In the test execution script I detect the Jenkins machine (Jenkins env. NODE_NAME) and use that to know what physical physical machine the tests should use.
Do anybody know of a better solution?
The above works but I need to define a high number of “nodes/machines” that may not be needed. What I would like was a plugin that would be able to grant a token to a Jenkins job. This way a job would not be executed before a Jenkins executor and a token was free. The token should be a string so that my test jobs could use it to know what external node it could use.
We have written our own scheduler that allocates stuff before starting Jenkins nodes. There may be a better solution - but this works for us mostly. I've yet to come across an off-the-shelf scheduler that can deal with complicated allocation of different hardware resources. We have n box types, allocated to n build types.
Some build types we have are not compatible together without destroying all persistent data - which may be required as it takes a long time to gather. Some jobs require combinations of these hardware types. We store the details in a DB, and then use business logic to determine how it is allocated. We've often found that particular job types need additional business logic or extra data fields to account for their specific requirements.
So it may be the best way is to write your own scheduler, in your own language of choice, which takes into account your particular needs.

Architecture to Sandbox the Compilation and Execution Of Untrusted Source Code

The SPOJ is a website that lists programming puzzles, then allows users to write code to solve those puzzles and upload their source code to the server. The server then compiles that source code (or interprets it if it's an interpreted language), runs a battery of unit tests against the code, and verifies that it correctly solves the problem.
What's the best way to implement something like this - how do you sandbox the user input so that it can not compromise the server? Should you use SELinux, chroot, or virtualization? All three plus something else I haven't thought of?
How does the application reliably communicate results outside of the jail while also insuring that the results are not compromised? How would you prevent, for instance, an application from writing huge chunks of nonsense data to disk, or other malicious activities?
I'm genuinely curious, as this just seems like a very risky sort of application to run.
A chroot jail executed from a limited user account sounds like the best starting point (i.e. NOT root or the same user that runs your webserver)
To prevent huge chunks of nonsense data being written to disk, you could use disk quotas or a separate volume that you don't mind filling up (assuming you're not testing in parallel under the same user - or you'll end up dealing with annoying race conditions)
If you wanted to do something more scalable and secure, you could use dynamic virtualized hosts with your own server/client solution for communication - you have a pool of 'agents' that receive instructions to copy and compile from X repository or share, then execute a battery of tests, and log the output back via the same server/client protocol. Your host process can watch for excessive disk usage and report warnings if required, the agents may or may not execute the code under a chroot jail, and if you're super paranoid you would destroy the agent after each run and spin up a new VM when the next sample is ready for testing. If you're doing this large scale in the cloud (e.g. 100+ agents running on EC2) you only ever have enough spun up to accommodate demand and therefore reduce your costs. Again, if you're going for scale you can use something like Amazon SQS to buffer requests, or if you're doing a experimental sample project then you could do something much simpler (just think distributed parallel processing systems, e.g. seti#home)

Resources