Improving simulation performance via concurrency - haskell

Consider this sequential procedure on a data structure containing collections (for simplicity, call them lists) of Doubles. For as long as I feel like, do:
Select two different lists from the structure at random
Calculate a statistic based on those lists
Flip a coin based on that statistic
Possibly modify one of the lists, based on the outcome of the coin toss
The goal is to eventually achieve convergence to something, so the 'solution' is linear in the number of iterations. An implementation of this procedure can be seen in the SO question here, and here is an intuitive visualization:
It seems that this procedure could be better performed - that is, convergence could be achieved faster - by using several workers executing concurrently on separate OS threads, ex:
I guess a perfectly-realized implementation of this should be able to achieve a solution in O(n/P) time, for P the number of available compute resources.
Reading up on Haskell concurrency has left my head spinning with terms like MVar, TVar, TChan, acid-state, etc. What seems clear is that a concurrent implementation of this procedure would look very different from the one I linked above. But, the procedure itself seems to essentially be a pretty tame algorithm on what is essentially an in-memory database, which is a problem that I'm sure somebody has come across before.
I'm guessing I will have to use some kind of mutable, concurrent data structure that supports decent random access (that is, to random idle elements) & modification. I am getting a bit lost when I try to piece together all the things that this might require with a view towards improving performance (STM seems dubious, for example).
What data structures, concurrency concepts, etc. are suitable for this kind of task, if the goal is a performance boost over a sequential implementation?

Keep it simple:
forkIO for lightweight, super-cheap threads.
MVar, for fast, thread safe shared memory.
and the appropriate sequence type (probably vector, maybe lists if you only prepend)
a good stats package
and a fast random number source (e.g. mersenne-random-pure64)
You can try the fancier stuff later. For raw performance, keep things simple first: keep the number of locks down (e.g. one per buffer); make sure to compile your code and use the threaded runtime (ghc -O2) and you should be off to a great start.
RWH has a intro chapter to cover the basics of concurrent Haskell.

Related

Failing to see the difference between these two lines of thought in dynamic programming

Always there are multiple ways people describe differences in tabulation and memoization in dynamic programming, but I will summarize to what is normally said.
memoization is a where we add caching to a function, to make recursive calls take less computations. typically used on recursive functions for a top down solution that starts with the initial problem and then recursively calls itself to smaller problems
tabulation uses a table to keep track of subproblem results and works up in bottom up manner, solving smallest sub problems before larger ones in a iterative manner.
Well my question is whats the difference? Sometimes I look at different situations and the line is super blurred. Also, with memoization working in a "top down" fashion, its really just referring to the stack nature of it, and in that sense its still going to the base case, aka bottom and then using those results to build up to the final result, so how is that really different from a tabulation going from bottom up until its done? Or is it a situational case where tabulation aproaches don't involve recursion, the fact that a dynmaic programming problem uses it IS what differentiates the two different methods? If someone knowledgable could offer there thoughts it would be much appreciated
You're right that they're just two implementation methods for the same computation. A recursive formulation with memoization will fill in the memo cache with the same entries that an explicit tabular formulation will put in its table.
Explicit tabular formulations are strictly less useful, however. This is because they need more information about the problem in advance. They start by enumerating all possibly useful base cases and putting those in the table. (So what's "possibly useful?" That's the rub!) Then they enumerate the new "layer" of all possible problem versions that can be solved with the base cases. Then a layer of others that can be solved with those, etc. etc. This continues until it the "top level" problem turns up in a layer.
For the kinds of problems typically seen in textbooks and coding interviews, determining all useful base cases is deliberately easy. The problem parameters are 2 or 3 "dense" natural numbers, so the table of solutions can be a 2d or 3d array with all elements containing useful values. In many of these, you can prove that the current layer only depends on a few (possibly one) previous layer, so all the rest can be discarded, which saves memory.
Practical problems aren't often so nice. The parameter sets aren't small or aren't natural numbers, or - even when they are - they're sparse so that filling in all entries of an array would be a waste.
In these cases, memoization is the only reasonable choice. The top-down recursion determines the sub-problems (on down to the base cases) that need solving as they occur. Sparseness doesn't matter because the memo cache can store parameter sets as explicit keys. When the current layer doesn't need more than K previous ones, various strategies can still be applied to discard the others.

Testing for heteroskedasticity and autocorrelation in large unbalanced panel data

I want to test for heteroskedasticity and autocorrelation in a large unbalanced panel dataset.
I do so using the following code:
* Heteroskedasticity test
// iterated GLS with only heteroskedasticity produces
// maximum-likelihood parameter estimates
xtgls adjusted_volume ibn.rounded_time i.id i.TRD_EVENT_DT, igls panels(heteroskedastic)
estimates store hetero
* Autocorrelation
findit xtserial
net sj 3-2 st0039
net install st0039
xtserial adjusted_volume ibn.rounded_time i.id i.TRD_EVENT_DT
Though I use the calculation power of high process center, because of the iteration method, this procedure takes more than 15 hours.
What is the most efficient program to perform these tests using Stata?
This question is borderline off-topic and quite broad, but i suspect still of
considerable interest to new users. As such, here i will try to consolidate our
conversation in the comments as an answer.
I strongly advise in the future to refrain from using highly subjective
words such as 'best', which can mean different things to different people. Or
terms like 'efficient', which can have a different meaning in a different context.
It is also difficult to provide specific advice regarding the use of commands
when we know nothing about what you are trying to do.
In my view, the 'best' choice, is the choice that gets the job done as accurately
as possible given the available data. Speed is an important consideration nowadays, but accuracy is still the most fundamental one. As you continue to use Stata, you will see that it has a considerable number of commands, often with overlapping functionality. Depending on the use case, sometimes opting for one implementation over another can be 'better', in the sense that it may be more practical or faster in achieving the desired end result.
Case in point, your comment in your previous post where the noconstant option is unavailable in rreg. In that particular context you can get a reasonably good alternative using regress with the vce(robust) option. In fact, this alternative may often be adequate for several use cases.
In this particular example, xtgls will be considerably faster if the igls
option is not used. This will be especially true with larger and more 'difficult' datasets. In cases where MLE is necessary, the iterate option will allow you to specify a fixed number of iterations, which could speed things up but can be a recipe for disaster if you don't know what you are doing and is thus not recommended. This option is usually used for other purposes. However, is xtgls the only command you could use? Read here why this may in fact not necessarily be the case.
Regarding speed, Stata in general is slow, at least when the ado language is used. This is because it is an interpreted language. The only realistic option for speed gains here is through parallelisation if you have Stata MP. Even in this case, whether any gains are achieved it will depend on a number of factors,
including which command you use.
Finally, xtserial is a community-contributed command, something which you
fail to make clear in your question. It is customary and useful to provide this
information right from the start, so others know that you do not refer to an
official, built-in command.

Where to get hardware model data?

I have a task which consists of 3 concurrent self-defined (recursive to each other) processes. I need somehow to make it execute on computer, but any attempt to convert a requirement to program code with just my brain fails since first iteration produces 3^3 entities with 27^2 cross-relations, but it needs to implement at least several iterations to try if program even works at all.
So I decided to give up on trying to understand the whole system and formalized the problem and now want to map it to hardware to generate an algorithm and run. Language doesn't matter (maybe even directly to machine/assembly one?).
I never did anything like that before, so all topics I searched through like algorithm synthesis, software and hardware co-design, etc. mention hardware model as the second half (in addition to problem model) of solution generation, but I never seen one. The whole work supposed to look like this:
I don't know yet what level hardware model described at, so can't decide how problem model must be formalized to fit hardware model layer.
For example, target system may contain CPU and GPGPU, let's say target solution having 2 concurrent processes. System must decide which process to run on CPU and which on GPGPU. The highest level solution may come from comparing computational intensity of processes with target hardware, which is ~300 for CPUs and ~50 for GPGPUs.
But a normal model gotta be much more complete with at least cache hierarchy, memory access batch size, etc.
Another example is implementing k-ary trees. A synthesized algorithm could address parents and children with computing k * i + c / ( i - 1 ) / k or store direct pointers - depending on computations per memory latency ratio.
Where can I get a hardware model or data to use? Any hardware would suffice for now - to just see how it can look like - later would be awesome to get models of modern processors, GPGPUs and common heterogeneous clusters.
Do manufacturers supply such kinds of models? Description of how their systems work in any formal language.
I'm not pretty sure if it might be the case for you, but as you're mentioning modeling, I just thought about Modelica. It's used to model physical systems and combined with a simulation environment, you can run some simulations on it.

Java : Use Disruptor or Not . .

Hy,
Currently I am developing a program that takes 2 values from an amq queue and performs a series of mathematical calculations on them. A topic has been created on the amq server to which my program subscribes and receive messages via callbacks (listeners).
Now whenever a message arrives the two values are taken out of and added to the SynchronizedDescriptiveStatistics object. After each addition to the list of values the whole sequence of calculations is performed all over again (this is part of the requirement actually).
The problem I am facing right now is that since I am using listeners, sometimes a single or more messages are received in the middle of calculations. Although SynchronizedDescriptiveStatistics takes care of all the thread related issues it self but it adds all the waiting values in its list of numbers at once when it comes out of lock or something. While my problem was to add one value then perform calcls on it then second value and on and on.
The solution I came up with is to use job queues in my program (not amq queues). In this way whenever calcs are over the program would look for further jobs in the queue and goes on accordingly.
Since I am also looking for efficiency and speed I thought the Disruptor framework might be good for this problem and it is optimized for threaded situations. But I am not sure if its worth the trouble of implementing Disruptor in to my application because regular standard queue might be enough for what I am trying to do.
Let me also tell you that the data on which the calcs need to be performed is a lot and it will keep on coming and the whole calcs will need to be performed all over again for each addition of a single value in a continuous fashion. So keeping in mind the efficiency and the huge volume of data what do you think will be useful in the long run.
Waiting for a reply. . .
Regards.
I'll give our typical answer to this question: test first, and make your decision based on your results.
Although you talk about efficiency, you don't specifically say that performance is a fundamental requirement. If you have an idea of your performance requirements, you could mock up a simple prototype using queues versus a basic implementation of the Disruptor, and take measurements of the performance of both.
If one comes off substantially better than the other, that's your answer. If, however, one is much more effort to implement, especially if it's also not giving you the efficiency you require, or you don't have any hard performance requirements, then that suggests that solution is not the right one.
Measure first, and decide based on your results.

Message Passing Arbitrary Object Graphs?

I'm looking to parallelize some code across a Beowulf cluster, such that the CPUs involved don't share address space. I want to parallelize a function call in the outer loop. The function calls do not have any "important" side effects (though they do use a random number generator, allocate memory, etc.).
I've looked at libs like MPI and the problem I see is that they seem to make it very non-trivial to pass complex object graphs between nodes. The input to my function is a this pointer that points to a very complex object graph. The return type of my function is another complex object graph.
At a language-agnostic level (I'm working in the D programming language, and I'm almost sure no canned solution is available here, but I'm willing to create one), is there a "typical" way that passing complex state across nodes is dealt with? Ideally, I want the details of how the state is copied to be completely abstracted away and for the calls to look almost like normal function calls. I don't care that copying this much state over a network isn't particularly efficient, as the level of parallelism in question is so coarse-grained that it probably won't matter.
Edit: If there is no easy way to pass complex state, then how is message passing typically used? It seems to me like anything involving copying data over a network requires coarse grained parallelism, yet coarse grained parallelism usually requires passing complex state so that a lot of work can be done in one work unit.
I do a fair bit of MPI programming but I don't know of any typical way of passing complex state (as you describe it) between processes. Here's how I've been thinking about your problem, it probably matches your own thinking ...
I surmise that your complex object graphs are represented, in memory, by blocks of data and pointers to other blocks of data -- a usual sort of implementation of a graph. How best can you move one of these COGs (to coin an abbreviation) from the address space of one process to the address space of another ? To the extent that a pointer is a memory address, a pointer in one address space is no use in another address space, so you will have to translate it into some neutral form for transport (I think ?).
To send a COG, therefore, it has to be put into some form from which the receiving process can build, in its own address space, a local version of the graph with the pointers pointing to local memory addresses. Do you ever write these COGs to file ? If so, you already have a form in which one could be transported. I hate to suggest it, but you could even use files to communicate between processes -- and that might be easier to handle than the combination of D and MPI. Your choice !
If you don't have a file form for the COGs can you easily represent them as adjacency matrices or lists ? In other words, work out your own representation for transport ?
I'll be very surprised (but pleased to learn) if you can pass a COG between processes without transforming it from pointer-based to some more static structure such as arrays or records.
Edit, in response to OP's edit. MPI does provide easy ways to pass complex state around, provided that the complex state is represented as values not pointers. You can pass complex state around in either the intrinsic or customised MPI datatypes; as one of the other answers shows you these are flexible and capable. If our program does not keep the complex state in a form that MPI custom datatypes can handle, you'll have to write functions to pack/unpack to a message-friendly representation. If you can do that, then your message calls will look (for most purposes) like function calls.
As to the issues surrounding complex state and the graininess of parallelism, I'm not sure I quite follow you. We (include yourself in this sweeping generalisation if you want, or not) typically resort to MPI programming because we can't get enough performance out of a single processor, we know that we'll pay a penalty in terms of computation delayed by waiting for communication, we work hard to minimise that penalty, but in the end we accept the penalty as the cost of parallelisation. Certainly some jobs are too small or too short to benefit from parallelisation, but a lot of what we (parallel computationalists that is) do is just too big and too long-running to avoid parallelisation
You can do marvelous things with custom MPI datatypes. I'm currently working on a project where several MPI processes are tracking particles in a piece of virtual space, and when particles cross over from one process' territory into another one's, their data (position/speed/size/etc) has to be sent over the network.
The way I achieved this is the following:
1) All processes share an MPI Struct datatype for a single particle that contains all its relevant attributes, and their displacement in memory compared to the base address of the particle object.
2) On sending, the process iterates over whatever data structure it stores the particles in, notes down the memory address of each one that needs to be sent, and then builds a Hindexed datatype where each block is 1 long (of the above mentioned particle datatype) and starts at the memory addresses previously noted down. Sending 1 object of the resulting type will send all the necessary data over the network, in a type safe manner.
3) On the receiving end, things are slightly trickier. The receiving process first inserts "blank" particles into its own data structure: "blank" means that all the attributes that will be received from the other process are initialized to some default value. The memory addresses of the freshly inserted particles are noted down, and a datatype similar to that of the sender is created from these addresses. Receiving the sender's message as a single object of this type will automatically unpack all the data into all the right places, again, in a type safe manner.
This example is simpler in the sense that there are no relationships between particles (as there would be between nodes of a graph), but you could transmit that data in a similar way.
If the above description is not clear, I can post the C++ code that implements it.
I'm not sure I understand the question correctly so forgive me if my answer is off. From what I understand you want to send non-POD datatypes using MPI.
A library that can do this is Boost.MPI. It uses a serialization library to send even very complex data structures. There is a catch though: you will have to provide code to serialize the data yourself if you use complicated structures that Boost.Serialize does not already know about.
I believe message passing is typically used to transmit POD datatypes.
I'm not allowed to post more links so here is what I wanted to include:
Explanation of POD: www.fnal.gov/docs/working-groups/fpcltf/Pkg/ISOcxx/doc/POD.html
Serialization Library: www.boost.org/libs/serialization/doc
it depends on organization of your data. If you use pointers or automatic memory inside your objects, it will be difficult. If you can organize your objects to be contiguous in memory, you have two choices: send memory as bytes,cast it back to object type on the receiver or define mpi derived type for your object. If however you use inheritance, things will become complicated due to how objects are laid out in memory.
I do not know your problem, but maybe can take a look at ARMCI if you manage memory manually.

Resources