I'm looking to parallelize some code across a Beowulf cluster, such that the CPUs involved don't share address space. I want to parallelize a function call in the outer loop. The function calls do not have any "important" side effects (though they do use a random number generator, allocate memory, etc.).
I've looked at libs like MPI and the problem I see is that they seem to make it very non-trivial to pass complex object graphs between nodes. The input to my function is a this pointer that points to a very complex object graph. The return type of my function is another complex object graph.
At a language-agnostic level (I'm working in the D programming language, and I'm almost sure no canned solution is available here, but I'm willing to create one), is there a "typical" way that passing complex state across nodes is dealt with? Ideally, I want the details of how the state is copied to be completely abstracted away and for the calls to look almost like normal function calls. I don't care that copying this much state over a network isn't particularly efficient, as the level of parallelism in question is so coarse-grained that it probably won't matter.
Edit: If there is no easy way to pass complex state, then how is message passing typically used? It seems to me like anything involving copying data over a network requires coarse grained parallelism, yet coarse grained parallelism usually requires passing complex state so that a lot of work can be done in one work unit.
I do a fair bit of MPI programming but I don't know of any typical way of passing complex state (as you describe it) between processes. Here's how I've been thinking about your problem, it probably matches your own thinking ...
I surmise that your complex object graphs are represented, in memory, by blocks of data and pointers to other blocks of data -- a usual sort of implementation of a graph. How best can you move one of these COGs (to coin an abbreviation) from the address space of one process to the address space of another ? To the extent that a pointer is a memory address, a pointer in one address space is no use in another address space, so you will have to translate it into some neutral form for transport (I think ?).
To send a COG, therefore, it has to be put into some form from which the receiving process can build, in its own address space, a local version of the graph with the pointers pointing to local memory addresses. Do you ever write these COGs to file ? If so, you already have a form in which one could be transported. I hate to suggest it, but you could even use files to communicate between processes -- and that might be easier to handle than the combination of D and MPI. Your choice !
If you don't have a file form for the COGs can you easily represent them as adjacency matrices or lists ? In other words, work out your own representation for transport ?
I'll be very surprised (but pleased to learn) if you can pass a COG between processes without transforming it from pointer-based to some more static structure such as arrays or records.
Edit, in response to OP's edit. MPI does provide easy ways to pass complex state around, provided that the complex state is represented as values not pointers. You can pass complex state around in either the intrinsic or customised MPI datatypes; as one of the other answers shows you these are flexible and capable. If our program does not keep the complex state in a form that MPI custom datatypes can handle, you'll have to write functions to pack/unpack to a message-friendly representation. If you can do that, then your message calls will look (for most purposes) like function calls.
As to the issues surrounding complex state and the graininess of parallelism, I'm not sure I quite follow you. We (include yourself in this sweeping generalisation if you want, or not) typically resort to MPI programming because we can't get enough performance out of a single processor, we know that we'll pay a penalty in terms of computation delayed by waiting for communication, we work hard to minimise that penalty, but in the end we accept the penalty as the cost of parallelisation. Certainly some jobs are too small or too short to benefit from parallelisation, but a lot of what we (parallel computationalists that is) do is just too big and too long-running to avoid parallelisation
You can do marvelous things with custom MPI datatypes. I'm currently working on a project where several MPI processes are tracking particles in a piece of virtual space, and when particles cross over from one process' territory into another one's, their data (position/speed/size/etc) has to be sent over the network.
The way I achieved this is the following:
1) All processes share an MPI Struct datatype for a single particle that contains all its relevant attributes, and their displacement in memory compared to the base address of the particle object.
2) On sending, the process iterates over whatever data structure it stores the particles in, notes down the memory address of each one that needs to be sent, and then builds a Hindexed datatype where each block is 1 long (of the above mentioned particle datatype) and starts at the memory addresses previously noted down. Sending 1 object of the resulting type will send all the necessary data over the network, in a type safe manner.
3) On the receiving end, things are slightly trickier. The receiving process first inserts "blank" particles into its own data structure: "blank" means that all the attributes that will be received from the other process are initialized to some default value. The memory addresses of the freshly inserted particles are noted down, and a datatype similar to that of the sender is created from these addresses. Receiving the sender's message as a single object of this type will automatically unpack all the data into all the right places, again, in a type safe manner.
This example is simpler in the sense that there are no relationships between particles (as there would be between nodes of a graph), but you could transmit that data in a similar way.
If the above description is not clear, I can post the C++ code that implements it.
I'm not sure I understand the question correctly so forgive me if my answer is off. From what I understand you want to send non-POD datatypes using MPI.
A library that can do this is Boost.MPI. It uses a serialization library to send even very complex data structures. There is a catch though: you will have to provide code to serialize the data yourself if you use complicated structures that Boost.Serialize does not already know about.
I believe message passing is typically used to transmit POD datatypes.
I'm not allowed to post more links so here is what I wanted to include:
Explanation of POD: www.fnal.gov/docs/working-groups/fpcltf/Pkg/ISOcxx/doc/POD.html
Serialization Library: www.boost.org/libs/serialization/doc
it depends on organization of your data. If you use pointers or automatic memory inside your objects, it will be difficult. If you can organize your objects to be contiguous in memory, you have two choices: send memory as bytes,cast it back to object type on the receiver or define mpi derived type for your object. If however you use inheritance, things will become complicated due to how objects are laid out in memory.
I do not know your problem, but maybe can take a look at ARMCI if you manage memory manually.
Related
I have a task which consists of 3 concurrent self-defined (recursive to each other) processes. I need somehow to make it execute on computer, but any attempt to convert a requirement to program code with just my brain fails since first iteration produces 3^3 entities with 27^2 cross-relations, but it needs to implement at least several iterations to try if program even works at all.
So I decided to give up on trying to understand the whole system and formalized the problem and now want to map it to hardware to generate an algorithm and run. Language doesn't matter (maybe even directly to machine/assembly one?).
I never did anything like that before, so all topics I searched through like algorithm synthesis, software and hardware co-design, etc. mention hardware model as the second half (in addition to problem model) of solution generation, but I never seen one. The whole work supposed to look like this:
I don't know yet what level hardware model described at, so can't decide how problem model must be formalized to fit hardware model layer.
For example, target system may contain CPU and GPGPU, let's say target solution having 2 concurrent processes. System must decide which process to run on CPU and which on GPGPU. The highest level solution may come from comparing computational intensity of processes with target hardware, which is ~300 for CPUs and ~50 for GPGPUs.
But a normal model gotta be much more complete with at least cache hierarchy, memory access batch size, etc.
Another example is implementing k-ary trees. A synthesized algorithm could address parents and children with computing k * i + c / ( i - 1 ) / k or store direct pointers - depending on computations per memory latency ratio.
Where can I get a hardware model or data to use? Any hardware would suffice for now - to just see how it can look like - later would be awesome to get models of modern processors, GPGPUs and common heterogeneous clusters.
Do manufacturers supply such kinds of models? Description of how their systems work in any formal language.
I'm not pretty sure if it might be the case for you, but as you're mentioning modeling, I just thought about Modelica. It's used to model physical systems and combined with a simulation environment, you can run some simulations on it.
I am new to F# so maybe the solution can be clear to someone, but I can not find it.
Imagine a game world of world chunks (similar to Minecraft), but for more players.
In theory language like C++, java, or C# can modify multiple chunks of world at same time. Two or more players try to place or remove block in different chunks and all these actions can change the state of the world without affecting each other as long as no more than one action in each chunk is happening. Serializing will only happen when multiple players in one chunk perform the modification.
My understanding of F# is I need to serialize these actions on global level and no two actions can happen in same time in entire world, because the update function need actual world state update params(like add/remove blok) and return new world state.
For that example the world state contains chunk list.
Is there a way to do world update in parallel?
Can the world state be stored differently to allow update to multiple chunks at same time?
It sounds like you need to ensure that each chunk has one action run at a time. You can protect pieces of state by storing them inside mailbox processors (often referred to as just "agents"). You can send agents several messages from several threads. They will be queued and processed one at a time.
There is a detailed discussion of this here: https://fsharpforfunandprofit.com/posts/concurrency-actor-model/
Firstly, I don't this really adds any technical detail to a previous answer, so you if you like their solution you should go ahead and mark that as the answer. However, I hope this gives some extra context...
Underlying your problem is the question of how consistent do you require the state of your world to be in order to make decisions about modifying chunks.
Consider a world where I have two chunks, let's call them A and B. Consider the use case where I want to add or remove a block from chunk A. The all important question is:
Do I need to know about the blocks in chunk B in order to validate, and then perform the addition / removal of a block from chunk A.
For example, if I only have finite number of blocks in my world, I may well need this information to validate that I can actually add a block without going over my limit. The key here is that my "consistency boundary" is my entire world - in order to perform the addition of a new block to chunk A I need consistent information about everythign in my world. It is no good if halfway through my decision making another thread jumps in and adds a block to chunk B. If this is a requirement then you have no option - even in the C#/C++ case - you need to lock down access to your world so only one such action can be performed at anyone time.
From the way you phrase the question, I suspect this is not the case. In which case we need to examine exactly what your consistency requirements are. A weaker requirement is that if I'm adding blocks to chunk A, I at least have to have consistent information about the number (and position) of blocks in chunk A. In the C#/C++ case this would mean having to put locks around accessing to individual "chunk data", but not the whole world.
A simple way of modelling this in F# would be (using the suggestion in this answer):
open FSharp.Core
type ChunkMessage =
AddBlock
| RemoveBlock
type MyWorld =
{
Blocks : List<MailboxProcessor<ChunkMessage>>
}
Note that MyWorld is a mutable, but each MailboxProcessor encapsulates state which can only change through processing one message at a time.
The implementation of Blocks doesn't have to be a list of MailboxProcessor's, you could use a thread-safe collection of objects for which you had thread-safe methods on, but the use of them here as suggested by The Quick Brown Fox leads to a particularly nice programming model.
Consider this sequential procedure on a data structure containing collections (for simplicity, call them lists) of Doubles. For as long as I feel like, do:
Select two different lists from the structure at random
Calculate a statistic based on those lists
Flip a coin based on that statistic
Possibly modify one of the lists, based on the outcome of the coin toss
The goal is to eventually achieve convergence to something, so the 'solution' is linear in the number of iterations. An implementation of this procedure can be seen in the SO question here, and here is an intuitive visualization:
It seems that this procedure could be better performed - that is, convergence could be achieved faster - by using several workers executing concurrently on separate OS threads, ex:
I guess a perfectly-realized implementation of this should be able to achieve a solution in O(n/P) time, for P the number of available compute resources.
Reading up on Haskell concurrency has left my head spinning with terms like MVar, TVar, TChan, acid-state, etc. What seems clear is that a concurrent implementation of this procedure would look very different from the one I linked above. But, the procedure itself seems to essentially be a pretty tame algorithm on what is essentially an in-memory database, which is a problem that I'm sure somebody has come across before.
I'm guessing I will have to use some kind of mutable, concurrent data structure that supports decent random access (that is, to random idle elements) & modification. I am getting a bit lost when I try to piece together all the things that this might require with a view towards improving performance (STM seems dubious, for example).
What data structures, concurrency concepts, etc. are suitable for this kind of task, if the goal is a performance boost over a sequential implementation?
Keep it simple:
forkIO for lightweight, super-cheap threads.
MVar, for fast, thread safe shared memory.
and the appropriate sequence type (probably vector, maybe lists if you only prepend)
a good stats package
and a fast random number source (e.g. mersenne-random-pure64)
You can try the fancier stuff later. For raw performance, keep things simple first: keep the number of locks down (e.g. one per buffer); make sure to compile your code and use the threaded runtime (ghc -O2) and you should be off to a great start.
RWH has a intro chapter to cover the basics of concurrent Haskell.
What does the term "Leaky Abstraction" mean? (Please explain with examples. I often have a hard time grokking a mere theory.)
Here's a meatspace example:
Automobiles have abstractions for drivers. In its purest form, there's a steering wheel, accelerator and brake. This abstraction hides a lot of detail about what's under the hood: engine, cams, timing belt, spark plugs, radiator, etc.
The neat thing about this abstraction is that we can replace parts of the implementation with improved parts without retraining the user. Let's say we replace the distributor cap with electronic ignition, and we replace the fixed cam with a variable cam. These changes improve performance but the user still steers with the wheel and uses the pedals to start and stop.
It's actually quite remarkable... a 16 year old or an 80 year old can operate this complicated piece of machinery without really knowing much about how it works inside!
But there are leaks. The transmission is a small leak. In an automatic transmission you can feel the car lose power for a moment as it switches gears, whereas in CVT you feel smooth torque all the way up.
There are bigger leaks, too. If you rev the engine too fast, you may do damage to it. If the engine block is too cold, the car may not start or it may have poor performance. And if you crank the radio, headlights, and AC all at the same time, you'll see your gas mileage go down.
It simply means that your abstraction exposes some of the implementation details, or that you need to be aware of the implementation details when using the abstraction. The term is attributed to Joel Spolsky, circa 2002. See the wikipedia article for more information.
A classic example are network libraries that allow you to treat remote files as local. The developer using this abstraction must be aware that network problems may cause this to fail in ways that local files do not. You then need to develop code to handle specifically errors outside the abstraction that the network library provides.
Wikipedia has a pretty good definition for this
A leaky abstraction refers to any implemented abstraction, intended to reduce (or hide) complexity, where the underlying details are not completely hidden
Or in other words for software it's when you can observe implementation details of a feature via limitations or side effects in the program.
A quick example would be C# / VB.Net closures and their inability to capture ref / out parameters. The reason they cannot be captured is due to an implementation detail of how the lifting process occurs. This is not to say though that there is a better way of doing this.
Here's an example familiar to .NET developers: ASP.NET's Page class attempts to hide the details of HTTP operations, particularly the management of form data, so that developers don't have to deal with posted values (because it automatically maps form values to server controls).
But if you wander beyond the most basic usage scenarios the Page abstraction begins to leak and it becomes hard to work with pages unless you understand the class' implementation details.
One common example is dynamically adding controls to a page - the value of dynamically-added controls won't be mapped for you unless you add them at just the right time: before the underlying engine maps the incoming form values to the appropriate controls. When you have to learn that, the abstraction has leaked.
Well, in a way it is a purely theoretical thing, though not unimportant.
We use abstractions to make things easier to comprehend. I may operate on a string class in some language to hide the fact that I'm dealing with an ordered set of characters that are individual items. I deal with an ordered set of characters to hide the fact that I'm dealing with numbers. I deal with numbers to hide the fact that I'm dealing with 1s and 0s.
A leaky abstraction is one that doesn't hide the details its meant to hide. If call string.Length on a 5-character string in Java or .NET I could get any answer from 5 to 10, because of implementation details where what those languages call characters are really UTF-16 data-points which can represent either 1 or .5 of a character. The abstraction has leaked. Not leaking it though means that finding the length would either require more storage space (to store the real length) or change from being O(1) to O(n) (to work out what the real length is). If I care about the real answer (often you don't really) you need to work on the knowledge of what is really going on.
More debatable cases happen with cases like where a method or property lets you get in at the inner workings, whether they are abstraction leaks, or well-defined ways to move to a lower level of abstraction, can sometimes be a matter people disagree on.
I'll continue in the vein of giving examples by using RPC.
In the ideal world of RPC, a remote procedure call should look like a local procedure call (or so the story goes). It should be completely transparent to the programmer such that when they call SomeObject.someFunction() they have no idea if SomeObject (or just someFunction for that matter) are locally stored and executed or remotely stored and executed. The theory goes that this makes programming simpler.
The reality is different because there's a HUGE difference between making a local function call (even if you're using the world's slowest interpreted language) and:
calling through a proxy object
serializing your parameters
making a network connection (if not already established)
transmitting the data to the remote proxy
having the remote proxy restore the data and call the remote function on your behalf
serializing the return value(s)
transmitting the return values to the local proxy
reassembling the serialized data
returning the response from the remote function
In time alone that's about three orders (or more!) of magnitude difference. Those three+ orders of magnitude are going to make a huge difference in performance that will make your abstraction of a procedure call leak rather obviously the first time you mistakenly treat an RPC as a real function call. Further a real function call, barring serious problems in your code, will have very few failure points outside of implementation bugs. An RPC call has all of the following possible problems that will get slathered on as failure cases over and above what you'd expect from a regular local call:
you might not be able to instantiate your local proxy
you might not be able to instantiate your remote proxy
the proxies may not be able to connect
the parameters you send may not make it intact or at all
the return value the remote sends may not make it intact or at all
So now your RPC call which is "just like a local function call" has a whole buttload of extra failure conditions you don't have to contend with when doing local function calls. The abstraction has leaked again, even harder.
In the end RPC is a bad abstraction because it leaks like a sieve at every level -- when successful and when failing both.
What is abstraction?
Abstraction is a way of simplifying the world.
It means you don't have to worry about what is actually happening under the hood.
Example: Flying a 737/747 is "abstracted" away
Planes are complicated systems: it involves: jet engines, oxygen systems, electrical systems, landing gear systems etc.
...but the pilot doesn't have to worry about it... all of that is "abstracted away". The only thing a pilot needs to focus on is yoke (i.e. steering wheel of the plane).
He pushes the yoke left to go left, and right to go right etc.
....that is in an ideal world. In reality, flying a plane is much more complicated. Because many details ARE NOT "abstracted away".
Leaky Abstractions in 737 Example
Pilots in reality have to worry about a LOT of things: wind speed, thrust, angles of attack, fuel, altitude, weather problems, angles of descent. Computers can help the pilot in these tasks, but not everything is automated / simplified......not everything is "abstracted away".
e.g. If the pilot pulls up too hard on the column - the plane will obey, but then the plane might stall, and that's really bad.
In other words, it is not enough for the pilot to simply control the steering wheel without knowing anything else.........nooooo.......the pilot must know about the underlying risks and limitations of the plane before the pilot flies one.......the pilot must know how the plane works, and how the plane flies; the pilot must know implementation details..... that pulling up too hard will lead to a stall, or that landing too steeply will destroy the plane etc.
Those things are not abstracted away. A lot of things are abstracted, but not everything. The abstraction is "leaky".
Leaky Abstractions in Code
......it's the same thing in your code. If you don't know the underlying implementation details, then you're gonna have problems.
ORMs abstract a lot of the hassle in dealing with database queries, but if you've ever done something like:
User.all.each do |user|
puts user.name # let's print each user's name
end
Then you will realise that's a nice way to kill your app. You need to know that calling User.allwith 25 million users is going to spike your memory usage, and is going to cause problems. You need to know some underlying details. The abstraction is leaky.
An example in the django ORM many-to-many example:
Notice in the Sample API Usage that you need to .save() the base Article object a1 before you can add Publication objects to the many-to-many attribute. And notice that updating the many-to-many attribute saves to the underlying database immediately, whereas updating a singular attribute is not reflected in the db until the .save() is called.
The abstraction is that we are working with an object graph, where single-value attributes and mult-value attributes are just attributes. But the implementation as a relational database backed data store leaks... as the integrity system of the RDBS appears through the thin veneer of an object interface.
The fact that at some point, which will guided by your scale and execution, you will be needed to get familiar with the implementation details of your abstraction framework in order to understand why it behave that way it behave.
For example, consider this SQL query:
SELECT id, first_name, last_name, age, subject FROM student_details;
And its alternative:
SELECT * FROM student_details;
Now, they do look like a logically equivalent solutions, but the performance of the first one is better due the individual column names specification.
It's a trivial example but eventually it comes back to Joel Spolsky quote:
All non-trivial abstractions, to some degree, are leaky.
At some point, when you will reach a certain scale in your operation, you will want to optimize the way your DB (SQL) works. To do it, you will need to know the way relational databases works. It was abstracted to you in the beginning, but it's leaky. You need to learn it at some point.
Assume, we have the following code in a library:
Object[] fetchDeviceColorAndModel(String serialNumberOfDevice)
{
//fetch Device Color and Device Model from DB.
//create new Object[] and set 0th field with color and 1st field with model value.
}
When the consumer calls the API, they get an Object[]. The consumer has to understand that the first field of the object array has color value and second field is the model value. Here the abstraction has leaked from library to the consumer code.
One of the solutions is to return an object which encapsulates Model and Color of the Device. The consumer can call that object to get the model and color value.
DeviceColorAndModel fetchDeviceColorAndModel(String serialNumberOfTheDevice)
{
//fetch Device Color and Device Model from DB.
return new DeviceColorAndModel(color, model);
}
Leaky abstraction is all about encapsulating state. very simple example of leaky abstraction:
$currentTime = new DateTime();
$bankAccount1->setLastRefresh($currentTime);
$bankAccount2->setLastRefresh($currentTime);
$currentTime->setTimestamp($aTimestamp);
class BankAccount {
// ...
public function setLastRefresh(DateTimeImmutable $lastRefresh)
{
$this->lastRefresh = $lastRefresh;
} }
and the right way(not leaky abstraction):
class BankAccount
{
// ...
public function setLastRefresh(DateTime $lastRefresh)
{
$this->lastRefresh = clone $lastRefresh;
}
}
more description here.
It's a well known truth, that you don't can trust user inputs. These inputs can be even an security-problem, if they are used unfiltered. XSS and SQL-injections are possible problems coming from using unfiltered user-input (or input, that can be changed by the user).
To avoid such problems, you have to control all strings, that can be influenced by external resources. Perl supports this with it's 'Taint-mode'.
The problems I know about, all are arising from manipulated Strings. Are you know examples of security-problems coming from ints, floats etc. manipulated by external influences? Or can this datatypes assumed to be safe?
Ultimately, all values are passed as strings to your program, whether you eventually convert them or not. All should be seen as potentially harmful, and not just for this reason.
For example, if you put non-numeric characters in a number field, you can get a parsing error. If you put a zero in where it isn't anticipated, you can get a divide by zero error. If you put a much larger value in than is expected, or a negative when not expected, or any number of other things, you can get an error of some sort. And it is quite possible for system errors to leak more information than you want. For instance, in an ASP.NET application, if custom errors aren't turned on, then database connection information, physical path information or information on third part libraries that your site uses can be leaked in the default error messages.
Strings are probably more likely to be problems than other values, but all user input should be treated as potentially harmful.
No, you can have security problems arising from accepting numbers from an external source. If an external source gives you a number of elements in (say) an array that you will need to process, a large number, trusted blindly, can cause denial of service by allocating enough memory to cause slow-down or memory exhaustion. Conversely, if you accept too-low a number and blindly continue reading more elements than you've allocated storage for, you can cause stack or heap over-writes.
It's not the datatypes that would be safe or not -- it's the code underneath that determines this.
That said, strings generally cause problems because of either buffer overruns or injection-style attacks against some underlying interpreter (SQL or some scripting language). You won't see these kinds of problems from numeric-type variables, obviously.
What you can have happen are bugs relating to bad external values which could lend themselves to something like a denial-of-service attack.
You should never trust any data that crosses a trust boundary.
A trust boundary occurs when one
component doesn't trust the component
on the other side of the boundary.
There is always a trust boundary
between elements running at different
privilege levels, but there sometimes
are trust boundaries between different
components running at the same
privilege level.
Threat Modeling, once again - Larry Osterman
As described in The New Threat Modeling Process on the Microsoft Security Development Lifecycle (SDL) Blog, expanded on by Larry Osterman's series of articles on threat modeling (updated here) and demonstrated by his Threat Modeling Again, Presenting the PlaySound Threat Model post, anywhere data crosses a trust boundary you need to identify the possible threats.