I want to create a MATLAB program to simulate the behavior of some agents. Each of these agents is capable of communicating with the others and deciding on its next state. I could implement the program in a traditional language that I am familiar with, like Java, Python, or C++, and use threads to simulate each of the agents.
Now I want to try the implementation in MATLAB to make use of MATLAB's plotting functions and mathematical tools. Is it possible to create such a simulation in MATLAB, and, better yet, is it straightforward? I am aware of the Parallel Computing Toolbox, but I am not sure if MATLAB is a good choice for such an application. I could also make the simulation non-parallel, but that is not as interesting. This is part of an assignment, and I would like to know if it is a good idea to start such a simulation in MATLAB to get more familiar with it. If it is not straightforward, I can easily switch to Python.
As mentioned earlier, you can't really have multiple processes in MATLAB.
You can still model the agents, though: make their classes inherit from handle, so that instances have reference semantics, and give them a method to receive messages.
But keep in mind they will not run in parallel.
Here's what I would do:
Write an agent class in MATLAB with the parameters you need, methods for setting and getting (or write subsref methods), and methods for "decision making"
Fill an array with instances of the class
Either make an array containing each instance index followed by its predecessors. If, say, agent 4 follows agents 1, 2, and 3, and agent 5 follows agents 1, 2, and 4, the vector would look like [4 1 2 3 5 1 2 4], and so on. Or make a parent-child matrix. You could also add a parameter to each instance storing its predecessors. If every agent is connected to every other agent, you don't even need this feature.
Now run the simulation sequentially: first all agents update their inputs, then all agents compute their responses and set their outputs.
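The two-phase update described above (all agents read their inputs from the current states, then all agents commit their new states) can be sketched in Python. The Agent class, the majority-vote decision rule, and the predecessor lists below are invented for illustration and are not part of the original answer.

```python
class Agent:
    """Toy agent: its next state is the majority of its predecessors' states."""
    def __init__(self, state):
        self.state = state
        self.inbox = []

    def receive(self, msg):
        self.inbox.append(msg)

    def decide(self):
        # Majority vote over received states; keep the current state on a tie.
        if self.inbox:
            ones = sum(self.inbox)
            if ones * 2 > len(self.inbox):
                self.state = 1
            elif ones * 2 < len(self.inbox):
                self.state = 0
        self.inbox = []

# predecessors[i] lists the agents that agent i listens to (made-up topology).
agents = [Agent(s) for s in (1, 1, 0, 0, 1)]
predecessors = {0: [1, 2], 1: [0, 3], 2: [0, 1, 4], 3: [2, 4], 4: [0, 1]}

def step(agents, predecessors):
    # Phase 1: every agent reads its inputs, based on the *current* states,
    # so the update order within a step does not matter.
    for i, preds in predecessors.items():
        for p in preds:
            agents[i].receive(agents[p].state)
    # Phase 2: every agent commits its new state.
    for a in agents:
        a.decide()

for _ in range(10):
    step(agents, predecessors)
states = [a.state for a in agents]
```

Separating the "read" phase from the "commit" phase is what makes the sequential loop equivalent to a synchronous parallel update.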
As you can see, this is not parallel but sequential. However, I can't really see the advantage of parallel processing here. The toolbox is no help, since it only allows as many "workers" as you have cores at your disposal. Basically, even if you use the Parallel Computing Toolbox, you won't gain much, since it is meant to parallelize loops. For example, in a genetic algorithm you can compute the cost function for every pool member independently, so you can use the toolbox there. In algorithms where one loop iteration depends on the computations of the previous iteration, you cannot use the toolbox.
Hope this helps.
MATLAB interprets the code sequentially. Hence, in order to solve your problem, you will need a loop that iterates over each sampling time and evaluates the state of all the agents in a pre-defined order.
TimeMax = 10;
TimeStep = 0.1;
time_counter = 0;
while time_counter < TimeMax
    time_counter = time_counter + TimeStep;
    % Update all the agents sequentially
end
This is not very efficient. Thus, I would suggest using Simulink, which supports parallel computations more naturally. You can then export the results to MATLAB and do all the fancy plots that you wish.
Kiba is a very small library, and it is my understanding that most of its value derives from enforcing a modular architecture of small, independent transformations.
However, it seems to me that the model of a series of serial transformations does not fit most of the ETL problems we face. To explain the issue, let me give a contrived example:
A source yields hashes with the following structure:
{ spend: 3, cost: 7, people: 8, hours: 2 ... }
Our preferred output is a list of hashes where some of the keys might be the same as those from the source, though the values might differ:
{ spend: 8, cost: 10, amount: 2 }
Now, calculating the resulting spend requires a series of transformations: ConvertCurrency, MultiplyByPeople, and so on. And so does calculating the cost: ConvertCurrencyDifferently, MultiplyByOriginalSpend, and so on. Notice that the cost calculations depend on the original (untransformed) spend value.
The most natural pattern would be to calculate the spend and cost in two independent pipelines, and merge the final output. A map-reduce pattern if you will. We could even benefit from running the pipelines in parallel.
However, in my case it is not really a question of performance (the transformations are very fast). The issue is that since Kiba applies all transforms as a series of serial steps, the cost calculations will be affected by the spend calculations, and we will end up with the wrong result.
Does Kiba have a way of solving this issue? The only thing I can think of is to make sure that the destination key names are not the same as the source key names, e.g. something like 'originSpend' and 'finalSpend'. It still bothers me, however, that my spend calculation pipeline would have to pass on the full set of keys at each step, rather than just the keys relevant to it, and then merge in the cost keys at the end. Or perhaps one could define two independent Kiba jobs and have a master job call the two and merge their results at the end? What is the most Kiba-idiomatic solution to this?
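The workaround mentioned in the question (give destination keys distinct names so that later transforms can still read the original values) can be sketched with plain dicts in Python. The transform names, rates, and numbers below are made up purely for illustration; they are not Kiba API.

```python
def convert_currency(row, rate=2):
    # Writes the converted spend under a *new* key, leaving the original intact.
    row["final_spend"] = row["spend"] * rate
    return row

def compute_cost(row, rate=3):
    # Reads the *original* spend, which is still available under its own key,
    # so it is unaffected by the spend pipeline running first.
    row["final_cost"] = row["spend"] * rate
    return row

# A serial pipeline, applied in order, as Kiba would apply its transforms.
transforms = [convert_currency, compute_cost]

row = {"spend": 3, "cost": 7}
for t in transforms:
    row = t(row)
```

Because the spend transform writes to `final_spend` instead of overwriting `spend`, the serial ordering no longer corrupts the cost calculation.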
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something Kiba supports.
I think I lack extra details to be able to properly answer your main question. I will get in touch via email for this round, and will maybe comment here later for public visibility.
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something Kiba supports.
The main focus of Kiba ETL today is component reuse, lower maintenance cost, modularity, and the ability to achieve strong data and process quality.
Parallelisation is supported to some extent, though, via different patterns.
Using Kiba Pro parallel transform to run sister jobs
If your main input is something that you can manage to "partition" into a low volume of items (e.g. database id ranges, or a list of files), you can use the Kiba Pro parallel transform like this:
source ... # something that generates the list of work items

parallel_transform(max_threads: 10) do |group_items|
  Kiba.run(...)
end
This works well if there is no output at all, or not much output, coming to the destinations of the sister jobs.
This works with threads, but one can also "fork" here for extra performance.
Using process partitioning
In a similar fashion, one can structure their jobs in a way where each process will only process a subset of the input data.
This way one can start, say, 4 processes (via cron jobs, or monitored via a parent tool) and pass each a SHARD_NUMBER (1, 2, 3, 4), which is then used by the source for input-load partitioning.
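The SHARD_NUMBER idea above amounts to a modulo filter inside the source. Here is a sketch in Python (the partitioning key, shard count, and environment-variable name are taken from the answer's example, but the code itself is illustrative, not Kiba's API):

```python
import os

NUM_SHARDS = 4
# Each process is started with a different SHARD_NUMBER (1..4), e.g. by cron.
shard = int(os.environ.get("SHARD_NUMBER", "1"))

def source(all_ids, shard, num_shards):
    # The source only yields rows whose id falls in this process's partition.
    for row_id in all_ids:
        if row_id % num_shards == (shard - 1):
            yield row_id

all_ids = range(100)
my_rows = list(source(all_ids, shard, NUM_SHARDS))
```

Because the shards partition the id space, the four processes together cover every row exactly once, with no coordination needed between them.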
But!
I'm pretty sure your problem, as you said, is more about workflow control, declarations, and the ability to express what needs to be done, rather than about performance.
I'll reach out and we'll discuss that.
I am a newbie in CP, but I want to solve a problem I got in college.
I have a MiniZinc model which minimizes the number of Machines used to perform some Tasks. Machines have resources, and Tasks have resource requirements. Besides minimizing that number, I am trying to minimize the cost of allocating Tasks to Machines (I have an array with the costs). Is there any way to first minimize the number of machines and then optimize the cost in MiniZinc?
For example, I have 3 Tasks and 2 Machines. Every Machine has enough resources to hold all 3 Tasks, but I want to allocate the Tasks where the cost is lower.
Sorry for my English and thanks for help. If there is such a need I will paste my code.
The technique you are referring to is called lexicographic optimisation/objectives. The idea is to optimise for multiple objectives where there is a clear ordering between them. For example, when optimising (A, B, C), we would optimise B and C subject to A: if we can improve the value of A, then we would allow B and C to worsen. Similarly, C is optimised subject to B.
This technique is often used, but is currently not (yet) natively supported in MiniZinc. There are however a few workarounds:
As shown in the radiation model, we can scale the first objective by a value larger than the maximum possible value of the second objective (and so on). This ensures that any improvement on the first objective trumps any improvement or stagnation on the second objective. The result of the instance should then be the lexicographic optimum.
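The scaling trick can be checked with a tiny brute-force example in Python (the objective values below are invented): if objective B can never exceed B_max, then minimising A * (B_max + 1) + B yields exactly the lexicographic optimum.

```python
# Candidate solutions as (A, B) pairs, A being the more important objective.
candidates = [(2, 9), (1, 7), (1, 3), (3, 0)]

B_max = max(b for _, b in candidates)

def combined(sol):
    a, b = sol
    # Any improvement in A outweighs the whole range of B.
    return a * (B_max + 1) + b

best = min(candidates, key=combined)

# Compare against Python's built-in lexicographic tuple ordering.
lex_best = min(candidates)
```

The same idea carries over to a MiniZinc objective like `solve minimize used_machines * (max_cost + 1) + cost`, provided a sound upper bound on the secondary objective is available.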
We can separate our model into multiple stages. In each stage we concern ourselves with a single objective (working from most important to least important), and each subsequent stage fixes the objective values from the earlier stages. The solution of the final stage gives you the lexicographically optimal solution.
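The staged approach can be illustrated with a brute-force version of the asker's example (3 tasks, 2 machines; the capacities and costs below are invented): stage 1 finds the minimal number of machines, stage 2 minimises cost subject to that number being fixed.

```python
from itertools import product

num_tasks, num_machines = 3, 2
capacity = [3, 3]                    # each machine can hold all 3 tasks
cost = [[1, 4], [2, 2], [5, 1]]      # cost[task][machine], made-up numbers

def feasible(assign):
    # Each task occupies 1 unit of its machine's capacity in this toy model.
    return all(assign.count(m) <= capacity[m] for m in range(num_machines))

def machines_used(assign):
    return len(set(assign))

def total_cost(assign):
    return sum(cost[t][m] for t, m in enumerate(assign))

solutions = [a for a in product(range(num_machines), repeat=num_tasks)
             if feasible(a)]

# Stage 1: minimise the number of machines used.
best_used = min(machines_used(a) for a in solutions)

# Stage 2: among solutions achieving the optimal machine count, minimise cost.
best = min((a for a in solutions if machines_used(a) == best_used),
           key=total_cost)
```

In MiniZinc this corresponds to solving once for the machine count, then re-solving with that count constrained to its optimum and the cost as the new objective.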
Some solvers support lexicographic optimisation natively. There is some experimental support for using these lexicographic objectives in MiniZinc, as found in std/experimental.mzn.
Note that discussions of lexicographic techniques might not always (explicitly) cover both minimisation and maximisation; however, you can always convert one into the other by negating the intended objective value.
We are senior-year students designing an FPGA-based Convolutional Neural Network accelerator.
We built a pipelined architecture (convolution, pooling, convolution, pooling). For these 4 stages of the architecture, we need to multiply a particular window with a filter. In the 2nd convolution layer we have (5*5)*6*16 window and filter coefficients.
Up to here, I accept this is not a clear explanation. But the main problem is that we need to access all 5*5*6*16 filter coefficients, which are stored sequentially in block RAM, at the same time. Yet in each clock cycle I can only reach one particular address of the ROM.
What approach can we take?
What approach can we take?
You don't want to hear this, but the only solution is:
Go back to the start and change your architecture/code (or run very slowly).
You can NOT access 2400 coefficients in a single cycle unless you run the memory at 2400 times the clock frequency of your main system. So, with a 100 MHz RAM/ROM operating frequency, your main system would have to run at ~42 kHz.
This is a recurrent theme I encounter on these forums: you have made a wrong decision and now want a solution, preferably an easy one. Sorry, but there is none.
I am having the same issue. For some layers we want to access multiple kernels for parallel operations; however, a BRAM implementation gives you at most 2 accesses per cycle. So the solution I came up with is to create a ROM array, implemented either in BRAM style or distributed style.
Unlike a RAM array, you can't implement a ROM array as easily, so you need a script/software layer that generates the RTL for your module.
I chose the distributed approach; however, I can't estimate the resources needed, and the utilization reports give me unclear results. I am still investigating this.
For future reference, you could look into HLS pragmas to help use the FPGA's resources. What you could do is use the array partition pragma with the cyclic setting. This makes it so that each subsequent element of an array is stored in a different sub-array.
For example, with a factor of 4 there would be four smaller arrays created from the original array, whose first elements would be arr[0], arr[1], arr[2], and arr[3] respectively.
That's how you distribute an array across block RAMs to get more parallel accesses per cycle.
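What the cyclic partition does to the storage layout can be mimicked in Python: with factor 4, element i lands in bank i mod 4, so four consecutive elements always sit in four different memories and can be fetched in the same cycle. The coefficient values and array size below are placeholders.

```python
FACTOR = 4

coeffs = list(range(16))  # stand-in for the filter coefficients

# Cyclic partition: bank j holds elements j, j + 4, j + 8, ...
banks = [coeffs[j::FACTOR] for j in range(FACTOR)]

def read_parallel(i):
    # Four consecutive coefficients come from four *different* banks,
    # so one read per bank (one BRAM port each) suffices per cycle.
    return [banks[(i + k) % FACTOR][(i + k) // FACTOR] for k in range(FACTOR)]
```

The same index arithmetic is what the HLS tool generates in hardware: the low bits of the address select the bank, the high bits select the word within the bank.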
I have a task which consists of 3 concurrent, self-defined (mutually recursive) processes. I need somehow to make it execute on a computer, but every attempt to convert the requirements into program code with just my brain fails, since the first iteration produces 3^3 entities with 27^2 cross-relations, and I would need to implement at least several iterations to see whether the program works at all.
So I decided to give up on trying to understand the whole system, formalized the problem, and now want to map it to hardware to generate an algorithm and run it. The language doesn't matter (maybe even machine/assembly code directly?).
I have never done anything like this before, so all the topics I searched through, like algorithm synthesis and software/hardware co-design, mention a hardware model as the second half (in addition to the problem model) of solution generation, but I have never seen one. The whole workflow is supposed to combine a problem model with a hardware model to generate a solution.
I don't know yet at what level the hardware model is described, so I can't decide how the problem model must be formalized to fit the hardware model's layer.
For example, the target system may contain a CPU and a GPGPU, and say the target solution has 2 concurrent processes. The system must decide which process to run on the CPU and which on the GPGPU. The highest-level solution may come from comparing the computational intensity of the processes with that of the target hardware, which is ~300 for CPUs and ~50 for GPGPUs.
But a proper model would have to be much more complete, with at least the cache hierarchy, memory access batch size, etc.
Another example is implementing k-ary trees. A synthesized algorithm could address children and parents by computing k * i + c and (i - 1) / k, or store direct pointers, depending on the ratio of computation cost to memory latency.
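The index arithmetic for a pointer-free k-ary tree (root at index 0, the c-th child of node i at k * i + c for c = 1..k, parent at (i - 1) div k) is the standard array-heap layout and can be checked quickly:

```python
K = 3  # ternary tree, as an example

def child(i, c):
    # c-th child (1-based) of node i in an array-stored K-ary tree.
    return K * i + c

def parent(i):
    # Integer division recovers the parent index.
    return (i - 1) // K

# Round-tripping: every child points back to its parent.
ok = all(parent(child(i, c)) == i
         for i in range(50) for c in range(1, K + 1))
```

Whether this arithmetic beats stored pointers is exactly the computation-versus-memory-latency trade-off the question raises.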
Where can I get a hardware model or data to use? Any hardware would suffice for now, just to see what such a model can look like; later it would be awesome to get models of modern processors, GPGPUs, and common heterogeneous clusters.
Do manufacturers supply such models, i.e. descriptions of how their systems work in some formal language?
I'm not sure whether this fits your case, but since you mention modeling, I thought of Modelica. It's used to model physical systems, and combined with a simulation environment you can run simulations on it.
Consider this sequential procedure on a data structure containing collections (for simplicity, call them lists) of Doubles. For as long as I feel like, do:
Select two different lists from the structure at random
Calculate a statistic based on those lists
Flip a coin based on that statistic
Possibly modify one of the lists, based on the outcome of the coin toss
The goal is to eventually achieve convergence to something, so the 'solution' is linear in the number of iterations. An implementation of this procedure can be seen in the linked SO question, along with an intuitive visualization.
It seems that this procedure could be performed better, that is, convergence could be achieved faster, by using several workers executing concurrently on separate OS threads.
I guess a perfectly-realized implementation of this should be able to achieve a solution in O(n/P) time, for P the number of available compute resources.
Reading up on Haskell concurrency has left my head spinning with terms like MVar, TVar, TChan, acid-state, etc. What seems clear is that a concurrent implementation of this procedure would look very different from the one I linked above. But the procedure itself seems to be a pretty tame algorithm on what is essentially an in-memory database, which is a problem that I'm sure somebody has come across before.
I'm guessing I will have to use some kind of mutable, concurrent data structure that supports decent random access (that is, to random idle elements) and modification. I get a bit lost when I try to piece together all the things this might require with a view towards improving performance (STM seems dubious, for example).
What data structures, concurrency concepts, etc. are suitable for this kind of task, if the goal is a performance boost over a sequential implementation?
Keep it simple:
forkIO for lightweight, super-cheap threads.
MVar, for fast, thread safe shared memory.
and the appropriate sequence type (probably vector, maybe lists if you only prepend)
a good stats package
and a fast random number source (e.g. mersenne-random-pure64)
You can try the fancier stuff later. For raw performance, keep things simple first: keep the number of locks down (e.g. one per buffer), make sure to compile your code with optimizations and use the threaded runtime (ghc -O2 -threaded), and you should be off to a great start.
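The "one lock per buffer" advice translates directly to the procedure in the question. Here is a sketch of the worker loop in Python, used here only to make the locking structure concrete (in Haskell, each MVar holding a list plays the role of one lock-protected buffer, and forkIO replaces the threads). The statistic, coin flip, and sizes are invented stand-ins.

```python
import random
import threading

NUM_LISTS, NUM_WORKERS, ITERS = 8, 4, 200

lists = [[random.random() for _ in range(10)] for _ in range(NUM_LISTS)]
locks = [threading.Lock() for _ in range(NUM_LISTS)]  # one lock per buffer

def worker(rng):
    for _ in range(ITERS):
        i, j = rng.sample(range(NUM_LISTS), 2)  # two distinct random lists
        # Acquire locks in index order so two workers can never deadlock.
        a, b = sorted((i, j))
        with locks[a], locks[b]:
            stat = sum(lists[i]) - sum(lists[j])   # placeholder statistic
            if rng.random() < 0.5:                 # the "coin flip"
                lists[i].append(stat)              # possibly modify one list

threads = [threading.Thread(target=worker, args=(random.Random(s),))
           for s in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_len = sum(len(xs) for xs in lists)
```

Since each iteration touches only two buffers, workers operating on disjoint pairs proceed fully in parallel; the ordered lock acquisition is the one subtlety that the MVar version also needs (e.g. always take the lower-indexed MVar first).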
Real World Haskell has an intro chapter covering the basics of concurrent Haskell.