How does NodeJS handle multi-core concurrency? - node.js

Currently I am working on a database that is updated by another Java application, but I need a NodeJS application to provide a RESTful API for website use. To maximize the performance of the NodeJS application, it is clustered and runs on a multi-core processor.
However, from my understanding, a clustered NodeJS application has its own event loop on each CPU core. If so, does that mean that with the cluster architecture NodeJS has to face traditional concurrency issues like other multi-threaded architectures, for example writing to the same object when it is not write-protected? Or even worse, since these are multiple processes running at the same time, not threads within a process that can be blocked by one another...
I have been searching the Internet, but it seems nobody cares about that at all. Can anyone explain the cluster architecture of NodeJS? Thanks very much.
Add on:
Just to clarify, I am using express. It is not like running multiple instances on different ports; it is actually listening on the same port, but with one process on each CPU competing to handle requests...
The typical problem I am wondering about now is: a request to update Object A based on a given Object B (not finished), then another request to update Object A again with a given Object C (finishes before the first request)... then the result would be based on Object B rather than C, because the first request actually finishes after the second one.
This would not be a problem in a real single-threaded application, because the second one would always be executed after the first request...

The core of your question is:
NodeJS has to face traditional concurrency issues like other multi-threaded architectures, for example writing to the same object when it is not write-protected?
The answer is that this scenario is usually not possible, because node.js processes don't share memory. ObjectA, ObjectB and ObjectC in process A are different from ObjectA, ObjectB and ObjectC in process B. And since each process is single-threaded, contention cannot happen. This is the main reason there are no semaphore or mutex modules shipped with node.js. There were also no threading modules shipped with node.js when this was written (the later worker_threads module still gives each worker its own JavaScript heap).
This also explains why "nobody cares". Because they assume it can't happen.
The problem with node.js clusters is one of caching. Because ObjectA in process A and ObjectA in process B are completely different objects, they can hold completely different data. The traditional solution is of course not to store dynamic state in your application but to store it in the database (or memcache) instead. It's also possible to implement your own cache/data synchronization scheme in your code if you want. That's how database clusters work after all.
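A minimal sketch of that point, assuming a hypothetical startCluster helper and a module-level hits counter (neither comes from the question). Every forked worker runs its own copy of the module, so each one only counts the requests it served itself:

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');

// Module-level state: every process that loads this file gets its own copy.
let hits = 0;

// Hypothetical helper: the primary forks the workers, each worker serves HTTP.
function startCluster(port, workerCount = os.cpus().length) {
  // cluster.isPrimary is the newer name; isMaster is the older alias.
  const isPrimary = cluster.isPrimary ?? cluster.isMaster;
  if (isPrimary) {
    for (let i = 0; i < workerCount; i++) {
      cluster.fork(); // re-runs this script in a new process
    }
    return;
  }
  // Worker branch: all workers share the listening port, but not `hits`.
  http.createServer((req, res) => {
    hits += 1;
    res.end(`worker ${process.pid} has served ${hits} request(s)\n`);
  }).listen(port);
}
```

Calling startCluster(3000) at the top level of the script and requesting that port repeatedly should show different process ids, each with its own count. A total across workers would have to live in the database or a cache, as described above.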
Of course node, being a program written in C++, can easily be extended in C or C++, and there are modules on npm that implement threads, mutexes and shared memory. If you deliberately choose to go against the node.js/javascript design philosophy then it is your responsibility to ensure nothing goes wrong.
Additional answer:
a request to update Object A based on a given Object B (not finished), then another request to update Object A again with a given Object C (finishes before the first request)... then the result would be based on Object B rather than C, because the first request actually finishes after the second one.
This would not be a problem in a real single-threaded application, because the second one would always be executed after the first request...
First of all, let me clear up a misconception: that this is not a problem for a real single-threaded application. Here's a single-threaded application in pseudocode:
function main () {
    timeout = FOREVER
    readFd = []
    writeFd = []

    // Send both update requests without waiting for either response.
    databaseSock1 = socket(DATABASE_IP,DATABASE_PORT)
    send(databaseSock1,UPDATE_OBJECT_B)
    databaseSock2 = socket(DATABASE_IP,DATABASE_PORT)
    send(databaseSock2,UPDATE_OBJECT_C)
    push(readFd,databaseSock1)
    push(readFd,databaseSock2)

    while(1) {
        // Blocks until at least one socket has data.
        event = select(readFd,writeFd,timeout)
        if (event) {
            for (i=0; i<length(readFd); i++) {
                if (readable(readFd[i])) {
                    data = read(readFd[i])
                    if (data == OBJECT_B_UPDATED) {
                        update(objectA,objectB)
                    }
                    if (data == OBJECT_C_UPDATED) {
                        update(objectA,objectC)
                    }
                }
            }
        }
    }
}
As you can see, there are no threads in the program above, just asynchronous I/O using the select system call. The program above can easily be translated directly into single-threaded C or Java etc. (indeed, something similar to it is at the core of the javascript event loop).
However, if the response to UPDATE_OBJECT_C arrives before the response to UPDATE_OBJECT_B the final state would be that objectA is updated based on the value of objectB instead of objectC.
No asynchronous single-threaded program is immune to this in any language and node.js is no exception.
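The same ordering effect can be reproduced in node.js itself. This is only a sketch: fakeDbUpdate is a hypothetical stand-in that simulates a database response with a timer, and the delays are chosen so that the response for C arrives before the response for B.

```javascript
// Hypothetical stand-in for a database round trip: resolves after delayMs.
function fakeDbUpdate(name, delayMs) {
  return new Promise((resolve) => setTimeout(() => resolve(name), delayMs));
}

// Sends the B request first and the C request second, then applies each
// response to objectA in the order the responses arrive.
async function race() {
  const objectA = { basedOn: null, history: [] };
  const apply = (name) => {
    objectA.basedOn = name;
    objectA.history.push(name);
  };
  await Promise.all([
    fakeDbUpdate('B', 50).then(apply), // sent first, answers last
    fakeDbUpdate('C', 10).then(apply), // sent second, answers first
  ]);
  return objectA;
}

// race() resolves to { basedOn: 'B', history: [ 'C', 'B' ] }
```

No threads are involved, yet objectA ends up based on B because B's response was the last one to be processed.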
Note however that you don't end up in a corrupted state (though you do end up in an unexpected state). Multithreaded programs are worse off because without locks/semaphores/mutexes the call to update(objectA,objectB) can be interrupted by the call to update(objectA,objectC) and objectA will be corrupted. This is what you don't have to worry about in single-threaded apps and you won't have to worry about it in node.js.
If you need strict temporally sequential updates you still need to either wait for the first update to finish, flag the first update as invalid, or generate an error for the second update. Typically for web apps (like stackoverflow) an error would be returned (for example if you try to submit a comment while someone else has already updated the comments).
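One way to "flag the first update as invalid" is to give every request a ticket number and ignore responses that are older than one already applied. The LatestWins class below is a hypothetical illustration of that idea, not something node.js provides:

```javascript
// Hypothetical guard: remembers the newest request whose result was applied
// and rejects results that belong to an older request.
class LatestWins {
  constructor(initialValue) {
    this.value = initialValue;
    this.issued = 0;  // ticket number of the most recently started request
    this.applied = 0; // ticket number of the most recently applied result
  }

  // Call when a request is sent; returns a ticket to present with the result.
  begin() {
    this.issued += 1;
    return this.issued;
  }

  // Call when a response arrives; returns false if the response is stale.
  finish(ticket, newValue) {
    if (ticket < this.applied) return false;
    this.applied = ticket;
    this.value = newValue;
    return true;
  }
}

const objectA = new LatestWins('initial');
const ticketB = objectA.begin(); // request based on Object B goes out first
const ticketC = objectA.begin(); // request based on Object C goes out second
objectA.finish(ticketC, 'based on C'); // C's response arrives first: applied
objectA.finish(ticketB, 'based on B'); // B's response arrives late: rejected
console.log(objectA.value); // prints "based on C"
```

Whether a rejected update is silently dropped or reported to the client as an error is an application decision; the sketch only shows how the stale response can be detected.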

Related

How async approach to rest api can reduce thread count?

Many people say that modern REST APIs should be "async", and their main argument is that on some platforms, for example Java, the "blocking" way of doing things produces many threads, while the "async" way allows you to limit thread count and overhead.
What I don't understand is how this is achieved.
Consider an app in a framework like vert.x (it doesn't actually matter, you can think of NodeJS as well), with, say, 1_000_000 concurrent connections to a service which makes some request to a database. The framework allows each request to be processed asynchronously during long I/O operations, so the database data exchange looks syntactically asynchronous in the business logic code. BUT. As I understand it, the DB request is not made in a vacuum: it is processed in some other thread, and that thread actually blocks until the DB request is finished. So even though the request business logic looks async and non-blocking, the long-running operations called from that logic are actually blocking somewhere under the hood of the framework, and the more such operations are in flight, the more threads are consumed anyway (for NodeJS you can think of threads created in the C++ code of the framework itself).
So as I see the big picture: in the async approach there is only one thread which processes all the requests, which is fine, but there is a bunch of threads doing the actual I/O work in the background anyway, and if one doesn't limit their count, then the number of threads will be the same as for the blocking approach + 1. On the other hand, if you limit the size of the background thread pool programmatically, then what is the benefit compared to the blocking approach, which combines a queue for user requests with a limit on the number of request-processing threads?
Since you're asking a fairly low-level question I'll answer with a low-level answer. Hope you're comfortable with C.
First, a disclaimer: I'll be talking mostly about networking code, because the only widely used database I know of that uses file I/O is sqlite. Since you're asking about postgres I can assume you're interested in how socket I/O (be it TCP sockets or unix local sockets) can work with only one thread.
At the core of almost all async systems and libraries is a piece of code that looks like this:
while (1)
{
    read_fd_set = active_fd_set;
    // This blocks until we receive a packet or until timeout expires:
    select(FD_SETSIZE, &read_fd_set, NULL, NULL, timeout);
    // Process timed events:
    timeout = process_timeout();
    // Process I/O:
    for (i = 0; i < FD_SETSIZE; ++i) {
        if (FD_ISSET(i, &read_fd_set)) {
            if (i == sock) {
                /* Connection arriving on listening socket */
                int new;
                size = sizeof(clientname);
                new = accept (sock,(struct sockaddr *) &clientname, &size);
                FD_SET (new, &active_fd_set);
            }
            else {
                /* Data arriving on an already-connected socket. */
                if (read_from_client(i) < 0) {
                    close (i);
                    FD_CLR (i, &active_fd_set);
                }
            }
        }
    }
}
(code example paraphrased from a GNU socket programming example)
As you can see, the code above uses no threading whatsoever. Yet it can handle many connections simultaneously. If you take a look at the for loop it is also obvious that it is basically a simple state machine that processes sockets one at a time if they have any packets waiting to be read (if not it is skipped by the if (FD_ISSET...) statement).
Non-I/O events can logically only come from timed events. And that's where the timeout management (details not shown for clarity) comes in. All I/O related stuff (basically almost all your async code) gets called back from the read_from_client() function (again, details omitted for clarity).
There is zero code running in parallel.
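In node.js the equivalent loop is hidden inside the runtime, but the behaviour can be observed from JavaScript. The sketch below is an illustration under assumptions: startEchoServer, echoOnce and demo are hypothetical helper names, and the server listens on an OS-assigned loopback port. One process, with no threads created by the script, answers several connections that are open at the same time.

```javascript
const net = require('net');

// Starts an echo server on an OS-assigned port; resolves once it is listening.
function startEchoServer() {
  return new Promise((resolve) => {
    const server = net.createServer((socket) => {
      // Each callback runs to completion on the single JavaScript thread.
      socket.on('data', (chunk) => socket.write(chunk));
      socket.on('error', () => socket.destroy());
    });
    server.listen(0, '127.0.0.1', () => resolve(server));
  });
}

// Connects, sends one short message and resolves with the echoed reply.
function echoOnce(port, message) {
  return new Promise((resolve, reject) => {
    const client = net.connect(port, '127.0.0.1', () => client.write(message));
    client.once('data', (chunk) => {
      client.end();
      resolve(chunk.toString());
    });
    client.once('error', reject);
  });
}

// Opens clientCount connections at once and collects every reply.
async function demo(clientCount) {
  const server = await startEchoServer();
  const port = server.address().port;
  const messages = Array.from({ length: clientCount }, (_, i) => `client-${i}`);
  const replies = await Promise.all(messages.map((m) => echoOnce(port, m)));
  await new Promise((resolve) => server.close(resolve));
  return replies;
}
```

All connections are in flight together, yet only one callback executes at any instant, which is the same state-machine behaviour as the for loop in the C example.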
Where does the parallelization come from?
Basically the server you're connecting to. Most databases support some form of parallelism. Some support multithreading. Some even support node.js or vert.x style parallelism by supporting asynchronous disk I/O (like postgres). Some configurations of databases allow a higher level of parallelism by storing data on more than one server via partitioning and/or sharding and/or master/slave servers.
That's where the big parallelism comes from -- parallel computing. Most databases have very strong support for read parallelism but weaker support for write parallelism (master/slave setups for example allow you to write only to the master database). But this is still a big win because most apps read more data than they write.
Where does disk parallelism come from?
The hardware. Mostly this has to do with DMA, which can transfer data without the CPU. DMA is not one thing. It is more like a concept. Different systems like the PCI bus, SATA, USB and even the CPU RAM bus itself have various kinds of DMA to transfer data directly to RAM (and in the case of RAM, to transfer data higher up to the various levels of CPU cache) or to a faster buffer.
While waiting for the DMA to complete, the CPU is not doing anything. And if, while it is doing nothing, a network packet comes in or a setTimeout() expires, the code that handles them can be executed on the CPU. All while a file is being read into RAM.
But Node.js docs keep mentioning I/O threads
Only for disk I/O. It's not impossible to do async disk I/O with a single thread. Tcl has done that for years and many other programming languages and frameworks have too. It's just very, very messy, since BSD does it differently from Linux, which does it differently from Windows, and even OSX may be subtly different from BSD even though it is derived from it, etc.
For the sake of simplicity and solid reliability node developers have opted to process disk I/O in separate threads.
Note that even for socket I/O it is not as simple as the code example I gave above. Since select() has some limitations (for example, you're forced to loop over ALL sockets to check for incoming data even though most won't have incoming data), people have come up with better APIs. And obviously different OSes do it differently. That is why there are a lot of libraries created to handle cross platform event processing like libevent and libuv (the one node.js uses).
OK. But postgres still runs on my PC
Asynchronous, event-oriented systems do not automagically give you performance superpowers. What they DO give you is choice: the app server is blazing fast, so where you put your database servers and what database you use is up to you.
OK. But I can do this with threads. Why async?
Benchmarks.
Since 1999, many people have run many benchmarks and in the majority of cases single threaded (or low thread count), event-oriented systems have outperformed simple multithreaded systems. It was especially true in the old days of single CPU, single core servers. It is still partly true now (since cores are still limited).
That is why Apache was rewritten into Apache2 to use a thread pool of async listeners and why Nginx was written from scratch to use a small pool of worker processes running async code.
Yes, on modern servers ideally you'd still want some threads in order to use all your CPUs. The alternative is a process pool like how the cluster module works in node.js. But you'd want the number of threads/processes to be constant or as constant as possible to avoid the overhead of context switching and thread creation.
This is true of some async frameworks, where the JDBC client is still synchronous.
When querying a DB in Vert.x you reuse the same application threads.
Please see the following example:
@Test
public void testMultipleThreads() throws InterruptedException {
    Vertx vertx = Vertx.vertx();
    System.out.println("Before starting server: " + Thread.activeCount());
    // Start server
    vertx.createHttpServer().
        requestHandler(httpServerRequest -> {
            // System.out.println("Request");
            httpServerRequest.response().end();
        }).
        listen(8080, o -> {
            System.out.println("Server ready");
        });
    // Start counting threads
    vertx.setPeriodic(500, (o) -> {
        System.out.println(Thread.activeCount());
    });
    // Create requests
    HttpClient client = vertx.createHttpClient();
    int loops = 1_000_000;
    CountDownLatch latch = new CountDownLatch(loops);
    for (int i = 0; i < loops; i++) {
        client.getNow(8080, "localhost", "/", httpClientResponse -> {
            // System.out.println("Response received");
            latch.countDown();
        });
    }
    latch.await();
}
You'll notice that the number of threads doesn't change, even though you serve as many connections as you would like. You can also add Vert.x JDBC client to test it.

How do I Yield() to another thread in a Win8 C++/Xaml app?

Note: I'm using C++, not C#.
I have a bit of code that does some computation, and several bits of code that use the result. The bits that use the result are already in tasks, but the original computation is not -- it's actually in the callstack of the main thread's App::App() initialization.
Back in the olden days, I'd use:
while (!computationIsFinished())
    std::this_thread::yield(); // or the like, depending on API
Yet this doesn't seem to exist for Windows Store apps (aka WinRT, pka Metro-style). I can't use a continuation because the bits that use the results are unconnected to where the original computation takes place -- in addition to that computation not being a task anyway.
Searching found Concurrency::Context::Yield(), but Context appears not to exist for Windows Store apps.
So... say I'm in a task on the background thread. How do I yield? Especially, how do I yield in a while loop?
First of all, doing expensive computations in a constructor is not usually a good idea. Even less so when it's the "App" class. Also, doing heavy work in the main (ASTA) thread is pretty much forbidden in the WinRT model.
You can use concurrency::task_completion_event<T> to interface code that isn't task-oriented with other pieces of dependent work.
E.g. in the long serial piece of code:
...
task_completion_event<ComputationResult> tce;
task<ComputationResult> computationTask(tce);
// This task is now tied to the completion event.
// Pass it along to interested parties.
try
{
    auto result = DoExpensiveComputations();
    // Successfully complete the task.
    tce.set(result);
}
catch(...)
{
    // On failure, propagate the exception to continuations.
    tce.set_exception(std::current_exception());
}
...
Should work well, but again, I recommend breaking out the computation into a task of its own, and would probably start by not doing it during construction... surely an anti-pattern for a responsive UI. :)
Qt simply uses Sleep(0) in their WinRT yield implementation.

Multithreaded Game Loop Rendering/Updating (boost-asio)

So I have a single-threaded game engine class, which has separate functions for input, update and rendering, and I've just started learning to use the wonderful boost library (asio and thread components). I was thinking of separating my update and render functions into separate threads (and perhaps separating the input and update functions from each other as well). Of course these functions will sometimes access the same locations in memory, so I decided to use Boost.Asio's strand functionality to prevent them from executing at the same time.
Right now my main game loop looks like this:
void SDLEngine::Start()
{
    int update_time=0;
    quit=false;
    while(!quit)
    {
        update_time=SDL_GetTicks();
        DoInput();//get user input and alter data based on it
        DoUpdate();//update game data once per loop
        if(!minimized)
            DoRender();//render graphics to screen
        update_time=SDL_GetTicks()-update_time;
        SDL_Delay(max(0,target_time-update_time));//insert delay to run at desired FPS
    }
}
If I used separate threads it would look something like this:
void SDLEngine::Start()
{
    boost::asio::io_service io;
    boost::asio::strand strand_(io);//a strand must be constructed from an io_service
    boost::asio::deadline_timer input(io,boost::posix_time::milliseconds(16));
    boost::asio::deadline_timer update(io,boost::posix_time::milliseconds(16));
    boost::asio::deadline_timer render(io,boost::posix_time::milliseconds(16));
    //
    input.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoInput,this)));
    update.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoUpdate,this)));
    render.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoRender,this)));
    //
    io.run();
}
So as you can see, before the loop went: Input->Update->Render->Delay->Repeat
Each one was run one after the other. If I used multithreading I would have to use strands so that updates and rendering wouldn't be run at the same time. So, is it still worth it to use multithreading here? They would still basically be running one at a time in separate cores. I basically have no experience in multithreaded applications so any help is appreciated.
Oh, and another thing: I'm using OpenGL for rendering. Would multithreading like this affect the way OpenGL renders in any way?
You are using the same strand for all handlers, so there is no multithreading at all. Also, your deadline_timer objects are in the scope of Start() and you do not pass them anywhere, so you will not be able to restart them from the handlers (note that deadline_timer is not an "interval" timer, it is just a "one-call" timer).
I see no point in this "revamp", since you are not getting any benefit from asio and/or threads at all in this example.
These methods (input, update, render) are too big and do too many things; you cannot call them without blocking. It's hard to say precisely because I don't know what the game is or how it works, but I'd prefer to take the following steps:
Try to revamp network I/O so it becomes fully async
Try to use all CPU cores
About what you have tried: I think it's possible if you search your code for actions that really can run in parallel right now. For example: if you calculate something for each NPC that does not depend on other characters, you can io_service.post() each calculation to make use of all the threads that are running io_service.run() at the moment. So your program stays single-threaded, but you can use, say, 7 other threads for some "big" operations.

How game servers with Boost:Asio work asynchronously?

I am trying to create a game server, and currently I am making it with threads. Every object (a player, a monster) has its own thread with a while(1) loop, in which particular functions are performed.
And the server basically works like this:
main(){
    //some initialization
    while(1)
    {
        //reads clients packet
        //directs packet info to a particular object
        //object performs some functions
        //then server returns result packet back to client
        Sleep(1);
    }
}
I have heard that is not efficient to make the server using threads like that,
and I should consider to use Boost::Asio, and make the functions work asynchronously.
But I don't know how then the server would work. I would be grateful if someone would explain how basically such servers work.
Every object( a player , monster ), has its own thread.
I have heard that is not efficient to make the server using threads
like that
You are correct, this is not a scalable design. Consider a large game where you may have 10,000 objects or even a million. Such a design quickly falls apart when you require a thread per object. This is known as the C10K problem.
I should consider to use Boost::Asio, and make the functions work
asynchronously. But I don't know how then the server would work.
I would be grateful if someone would explain how basically such
servers work.
You should start by following the Boost::Asio tutorials, and pay specific attention to the Asynchronous TCP daytime server. The concept of asynchronous programming compared to synchronous programming is not difficult after you understand that the flow of your program is inverted. From a high level, your game server will have an event loop that is driven by a boost::asio::io_service. Overly simplified, it will look like this
int
main()
{
    boost::asio::io_service io_service;
    // add some work to the io_service
    io_service.run(); // start event loop
    // should never get here
}
The callback handlers that are invoked from the event loop will chain operations together. That is, once your callback for reading data from a client is invoked, the handler will initiate another asynchronous operation.
The beauty of this design is that it decouples threading from concurrency. Consider a long running operation in your game server, such as reading data from a client. Using asynchronous methods, your game server does not need to wait for the operation to complete. It will be notified when the kernel has completed the operation on its behalf.

What multithreading package for Lua "just works" as shipped?

Coding in Lua, I have a triply nested loop that goes through 6000 iterations. All 6000 iterations are independent and can easily be parallelized. What threads package for Lua compiles out of the box and gets decent parallel speedups on four or more cores?
Here's what I know so far:
luaproc comes from the core Lua team, but the software bundle on luaforge is old, and the mailing list has reports of it segfaulting. Also, it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread.
Lua Lanes makes interesting claims but seems to be a heavyweight, complex solution. Many messages on the mailing list report trouble getting Lua Lanes to build or work for them. I myself have had trouble getting the underlying "Lua rocks" distribution mechanism to work for me.
LuaThread requires explicit locking and requires that communication between threads be mediated by global variables that are protected by locks. I could imagine worse, but I'd be happier with a higher level of abstraction.
Concurrent Lua provides an attractive message-passing model similar to Erlang, but it says that processes do not share memory. It is not clear whether spawn actually works with any Lua function or whether there are restrictions.
Russ Cox proposed an occasional threading model that works only for C threads. Not useful for me.
I will upvote all answers that report on actual experience with these or any other multithreading package, or any answer that provides new information.
For reference, here is the loop I would like to parallelize:
for tid, tests in pairs(tests) do
  local results = { }
  matrix[tid] = results
  for i, test in pairs(tests) do
    if test.valid then
      results[i] = { }
      local results = results[i]
      for sid, bin in pairs(binaries) do
        local outcome, witness = run_test(test, bin)
        results[sid] = { outcome = outcome, witness = witness }
      end
    end
  end
end
The run_test function is passed in as an argument, so a package can be useful to me only if it can run arbitrary functions in parallel. My goal is enough parallelism to get 100% CPU utilization on 6 to 8 cores.
Norman wrote concerning luaproc:
"it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread"
I had the same problem with a use case I was dealing with. I liked luaproc due to its simple and light implementation, but my use case had C code that was calling lua, which was triggering a coroutine that needed to send/receive messages to interact with other luaproc threads.
To achieve my desired functionality I had to add features to luaproc to allow sending and receiving messages from the parent thread or any other thread not running from the luaproc scheduler. Additionally, my changes allow using luaproc send/receive from coroutines created from luaproc.newproc() created lua states.
I added an additional luaproc.addproc() function to the api which is to be called from any lua state running from a context not controlled by the luaproc scheduler in order to set itself up with luaproc for sending/receiving messages.
I am considering posting the source as a new github project or contacting the developers and seeing if they would like to pull my additions. Suggestions as to how I should make it available to others are welcome.
Check the threads library in torch family. It implements a thread pool model: a few true threads (pthread in linux and windows thread in win32) are created first. Each thread has a lua_State object and a blocking job queue that admits jobs added from the main thread.
Lua objects are copied over from main thread to the job thread. However C objects such as Torch tensors or tds data structures can be passed to job threads via pointers -- this is how limited shared memory is achieved.
This is a perfect example of MapReduce
You can use LuaRings to accomplish your parallelization needs.
Concurrent Lua might seem like the way to go, but as I note in my updates below, it doesn't run things in parallel. The approach I tried was to spawn several processes that execute pickled closures received through the message queue.
Update
Concurrent Lua seems to handle first-class functions and closures without a hitch. See the following example program.
require 'concurrent'

local NUM_WORKERS = 4 -- number of worker threads to use
local NUM_WORKITEMS = 100 -- number of work items for processing

-- calls the received function in the local thread context
function worker(pid)
  while true do
    -- request new work
    concurrent.send(pid, { pid = concurrent.self() })
    local msg = concurrent.receive()
    -- exit when instructed
    if msg.exit then return end
    -- otherwise, run the provided function
    msg.work()
  end
end

-- creates workers, produces all the work and performs shutdown
function tasker()
  local pid = concurrent.self()
  -- create the worker threads
  for i = 1, NUM_WORKERS do concurrent.spawn(worker, pid) end
  -- provide work to threads as requests are received
  for i = 1, NUM_WORKITEMS do
    local msg = concurrent.receive()
    -- send the work as a closure
    concurrent.send(msg.pid, { work = function() print(i) end, pid = pid })
  end
  -- shutdown the threads as they complete
  for i = 1, NUM_WORKERS do
    local msg = concurrent.receive()
    concurrent.send(msg.pid, { exit = true })
  end
end

-- create the task process
local pid = concurrent.spawn(tasker)
-- run the event loop until all threads terminate
concurrent.loop()
Update 2
Scratch all of that stuff above. Something didn't look right when I was testing this. It turns out that Concurrent Lua isn't concurrent at all. The "processes" are implemented with coroutines and all run cooperatively in the same thread context. That's what we get for not reading carefully!
So, at least I eliminated one of the options I guess. :(
I realize that this is not a works-out-of-the-box solution, but, maybe go old-school and play with forks? (Assuming you're on a POSIX system.)
What I would have done:
Right before your loop, put all tests in a queue, accessible between processes. (A file, a Redis LIST or anything else you like most.)
Also before the loop, spawn several forks with lua-posix (as many as the number of cores, or even more depending on the nature of the tests). In the parent, wait until all children have quit.
In each fork in a loop, get a test from the queue, execute it, put results somewhere. (To a file, to a Redis LIST, anywhere else you like.) If there are no more tests in queue, quit.
In the parent fetch and process all test results as you do now.
This assumes that test parameters and results are serializable. But even if they are not, I think that it should be rather easy to cheat around that.
I've now built a parallel application using luaproc. Here are some misconceptions that kept me from adopting it sooner, and how to work around them.
Once the parallel threads are launched, as far as I can tell there is no way for them to communicate back to the parent. This property was the big block for me. Eventually I realized the way forward: when it's done forking threads, the parent stops and waits. The job that would have been done by the parent should instead be done by a child thread, which should be dedicated to that job. Not a great model, but it works.
Communication between parent and children is very limited. The parent can communicate only scalar values: strings, Booleans, and numbers. If the parent wants to communicate more complex values, like tables and functions, it must code them as strings. Such coding can take place inline in the program, or (especially) functions can be parked into the filesystem and loaded into the child using require.
The children inherit nothing of the parent's environment. In particular, they don't inherit package.path or package.cpath. I had to work around this by the way I wrote the code for the children.
The most convenient way to communicate from parent to child is to define the child as a function, and to have the child capture parental information in its free variables, known in Lua parlance as "upvalues." These free variables may not be global variables, and they must be scalars. Still, it's a decent model. Here's an example:
local function spawner(N, workers)
  return function()
    local luaproc = require 'luaproc'
    for i = 1, N do
      luaproc.send('source', i)
    end
    for i = 1, workers do
      luaproc.send('source', nil)
    end
  end
end
This code is used as, e.g.,
assert(luaproc.newproc(spawner(randoms, workers)))
This call is how values randoms and workers are communicated from parent to child.
The assertion is essential here, as if you forget the rules and accidentally capture a table or a local function, luaproc.newproc will fail.
Once I understood these properties, luaproc did indeed work "out of the box", when downloaded from askyrme on github.
ETA: There is an annoying limitation: in some circumstances, calling fread() in one thread can prevent other threads from being scheduled. In particular, if I run the sequence
local file = io.popen(command, 'r')
local result = file:read '*a'
file:close()
return result
the read operation blocks all other threads. I don't know why this is---I assume it is some nonsense going on within glibc. The workaround I used was to call directly to read(2), which required a little glue code, but this works properly with io.popen and file:close().
There's one other limitation worth noting:
Unlike Tony Hoare's original conception of communicating sequential processes, and unlike most mature, serious implementations of synchronous message passing, luaproc does not allow a receiver to block on multiple channels simultaneously. This limitation is serious, and it rules out many of the design patterns that synchronous message-passing is good at, but it's still fine for many simple models of parallelism, especially the "parbegin" sort that I needed to solve my original problem.

Resources