Where to get hardware model data? - modeling

I have a task that consists of 3 concurrent, mutually recursive, self-defined processes. I need to make it execute on a computer somehow, but every attempt to convert the requirements into program code by hand fails: the first iteration already produces 3^3 entities with 27^2 cross-relations, and at least several iterations are needed just to check whether the program works at all.
So I decided to give up on trying to understand the whole system by hand, formalized the problem, and now want to map it onto hardware to generate an algorithm and run it. The language doesn't matter (maybe even machine code or assembly directly?).
I have never done anything like this before. All the topics I've searched through (algorithm synthesis, software/hardware co-design, etc.) mention a hardware model as the second half of solution generation (in addition to the problem model), but I have never actually seen one. The whole workflow is supposed to look roughly like this:
I don't yet know at what level a hardware model is described, so I can't decide how the problem model must be formalized to fit the hardware model's layer.
For example, the target system may contain a CPU and a GPGPU, and say the target solution has 2 concurrent processes. The system must decide which process to run on the CPU and which on the GPGPU. The highest-level solution may come from comparing the computational intensity of the processes with that of the target hardware, say ~300 for the CPU and ~50 for the GPGPU.
But a realistic model would have to be much more complete, covering at least the cache hierarchy, memory-access batch size, etc.
Another example is implementing k-ary trees. A synthesized algorithm could address children and parents by computing indices (child c of node i at k * i + c, parent at ( i - 1 ) / k) or by storing direct pointers, depending on the ratio of computation cost to memory latency.
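To make the trade-off concrete, here is a small sketch (my illustration, not part of the original question; C++ is used only as an example language) of the two addressing schemes for a k-ary tree:

// Implicit layout: the tree lives in one flat, 0-indexed array and relatives are computed.
// Child c (1..k) of node i sits at k*i + c; the parent of node i > 0 sits at (i - 1) / k.
#include <cstddef>
#include <vector>

struct ImplicitKaryTree {
    std::size_t k;
    std::vector<double> nodes;
    std::size_t child(std::size_t i, std::size_t c) const { return k * i + c; }  // c in 1..k
    std::size_t parent(std::size_t i) const { return (i - 1) / k; }              // i > 0
};

// Pointer layout: each node stores direct pointers, trading index arithmetic for extra
// memory traffic and pointer-chasing latency.
struct PointerNode {
    double value;
    PointerNode* parent;
    std::vector<PointerNode*> children;  // up to k entries
};

Which of the two a synthesizer should pick is exactly the kind of decision that needs the computation-vs-memory-latency numbers from a hardware model.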
Where can I get a hardware model, or data to build one? Any hardware would suffice for now, just to see what such a model can look like; later it would be great to get models of modern processors, GPGPUs and common heterogeneous clusters.
Do manufacturers supply such models, i.e. descriptions of how their systems work in some formal language?

I'm not sure whether this fits your case, but since you mention modeling, Modelica comes to mind. It's used to model physical systems and, combined with a simulation environment, lets you run simulations on those models.


Which concurrency models does multi-process/thread programming belong to?

The Wikipedia article on concurrency (computer science) says:
A number of formalisms for modeling and understanding concurrent
systems have been developed, including:[5]
The parallel random-access machine[6]
The actor model
Computational bridging models such as the bulk synchronous parallel (BSP) model
Petri nets
Process calculi
Calculus of communicating systems (CCS)
Communicating sequential processes (CSP) model
π-calculus
Tuple spaces, e.g., Linda
Simple Concurrent Object-Oriented Programming (SCOOP)
Reo Coordination Language
Which model(s) does multi-process programming (as in the Linux API, MPI, Java, Python) belong to?
Which model(s) does multi-threaded programming (as in Pthreads, Java, Python) belong to?
Let me add a few thoughts:
occam-pi is a true-[PARALLEL] language, well fitted onto the parallel Inmos T414 Transputer hardware ( actually a hardware network of Transputers ). Its process flow was based on a lambda-calculus-grounded, guaranteed scheduling strategy, and its coordination, thanks to the seminal work of Hoare's CSP, was not a constraint on achieving true-[PARALLEL] execution, pure-[SERIAL] execution ( where feasible ) and opportunistic "just"-[CONCURRENT] execution, where required.
So a language ( a paradigm ) does not uniquely map onto one of the archetypal forms of parallelism listed on Wikipedia above. The properties of the external code-execution eco-system matter as well.
Python, on the other hand, has remained, by Guido van Rossum's design decision, a purely sequential interpreter process, whatever amount of threads one might have instantiated: the central Global Interpreter Lock ( the GIL ) knowingly chops the flow of time and permits one and only one thread to execute at any moment, all others waiting for the GIL, so the code principally avoids any form of just-[CONCURRENT] collisions ( race conditions on acquiring a resource, read(s) colliding with write(s), et al ).
Python can use message passing via MPI or ZeroMQ, can use a CSP-paradigm module, and has modules that enjoy actor-model behaviour ( for example, mimicking the Xerox PARC invention of Model-View-Controller coordination ), so the language typically does not constrain the paradigm used on a higher layer of abstraction ( while lower-level constraints do limit how hard-real-time any such abstracted form of execution may get, as any low-level limitation extends the latency of all upper abstraction layers and may introduce fine-grained blocking states that are outside the domain of control of the upper-layer, abstracted code-execution behaviour ).
Python can use multiprocessing ( joblib-decorated or not ), which helps to partially escape from that principal constraint; yet, as Guido van Rossum himself has expressed, the GIL-lock will remain a natural part of the Python interpreter unless an immense, total re-design of the whole interpreter concept is undertaken ( which is not, in his view, a probable direction for further efforts in this domain ). Attempts to escape from the otherwise ever-present, GIL-orchestrated re-[SERIAL]-isation of any number of threads have been developed, yet each one comes at a cost - human-related: refactoring the code; system-related: re-spawning full, identical copies of the original Python interpreter state ( the only chance under Windows-class O/S-es; partial or ad-hoc copies under Linux fork or forkserver ) - making trouble for both newbies and practitioners who ignore, or wrongly guess, the Amdahl's-Law add-on costs incurred right at process instantiation ( instantiation TIME + RAM-allocation TIME + RAM-to-RAM copy TIME + parameter / inter-process SER/DES add-on TIME ), the sum of which may easily wipe out any promised or wished-for speedup from going into a "just"-[CONCURRENT] or a true-[PARALLEL] code-execution domain.
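One way to see why those add-on costs matter (my own back-of-the-envelope formulation, not part of the original answer): if a fraction p of the work is parallelisable over N workers and o denotes the sum of all the add-on costs expressed as a fraction of the original pure-[SERIAL] run time, the achievable speedup is roughly

S(N) = \frac{1}{ (1 - p) + \frac{p}{N} + o }

so as soon as o approaches p - p/N, the "parallel" version is no faster than the pure-[SERIAL] one, and beyond that it is slower.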
Python can, like most of the other examples mentioned, participate in a distributed-computing infrastructure, where higher-layer paradigms control the mode of cooperative execution, so the macro-system may exhibit higher levels of concurrency that are not visible from "inside" a Python node.
The above-listed "forms" are somewhat academic ( missing hardware-based ILP parallelism and the AND-based and OR-based fine-grained forms of parallelism ). PRAMs were the subject of computer-science research as far back as the late 60s and early 70s, when it was concluded that even PRAM-based architectures cannot escape the Class-2 computing taxonomy.
"Section 4.3 ( IS THERE ANY CHANCE FOR A GIANT LEAP ) BEYOND THE CLASS-2 COMPUTER BOUNDARIES 2
The main practical - though negative - implication of the previous thoughts is the fact
that within Class-2 computing, no efficient solution is to be expected
for sequentially intractable problems.
Nevertheless, a question arises here: whether some other sort of parallel computers could be imagined,
that would be computationally more efficient than Class-2 computers.
Indications, coming from many known, conceptually different C2 class computer models,
suggest that without adding some other, fundamental computing capability, the parallelism per se
does not suffice to overcome C2 class boundaries,
irrespective of how we try to modify, within all thinkable possibilities,
the architectures of such computers.
As a matter of fact, it turns out that C2 class boundaries would be crossed
if non-determinism were added to an MIMD-type parallelism ( Ref. Section 3.5 ).
Non-deterministic PRAM (+)
can, as an example, solve ( intractable ) problems from NPTIME class
in polylogarithmic time and problems of a demonstrably exponential sequential complexity in polynomial time.
Because, in the context of computers, non-determinism is about as technically feasible to implement
as clairvoyance, the C2 computer class seems to represent, from the efficiency point of view,
the ultimate class of parallel computers, the borders of which will never be crossed.
+) PRAM: a Parallel-RAM, not a SIMD-only processor, demonstrated by Savitch, Stimson in 1979 (1)
(1) SAVITCH, W. J. - STIMSON, M. J.: Time bounded random access machines with parallel processing. J. ACM 26, 1979, Pg. 103-118.
(2) WIEDERMANN, J.: Efficiency boundaries of parallel computing systems. ( Medze efektivnosti paralelných výpočtových systémov ).
Advances in Mathematics, Physics and Astronomy ( Pokroky matematiky, fyziky a astronomie ), Vol. 33 (1988), No. 2, Pg. 81--94"
Both process-based and thread-based code may per se use, or participate in, a gang of coordinated actors in almost any of the above-listed forms of concurrency.
The code implementation plus all the underlying resource-management constraints ( hardware + O/S + the resource-management policy in the respective context of use ) actually decide which forms remain achievable in the field, and when and how any piece of code gets executed. In other words, your code design may be of any level of architectural genius, yet if O/S policy confines your code to one and only one CPU core ( due to affinity-mapping constraints enforced by the user process's effective rights ), any such smart code will end up in a re-[SERIAL]-ised code execution, paying all the add-on overhead costs of the wished-for [CONCURRENT] execution but getting nothing in return for having spent, and continuing to spend, those add-on costs, very much like straightforward, pure-[SERIAL] code does [ which also remains free of any wasted add-on costs, so it delivers results faster, often also enjoying the benefit of non-depleted CPU-core-local L1/L2 cache hierarchies, if HPC-grade computing was carefully designed in :o) ]
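As a small, Linux-specific illustration of how such an affinity constraint comes about (my addition, not part of the original answer), a process pinned like this will have all of its threads competing for a single core, no matter how [CONCURRENT] the code was designed to be:

// Pin the calling process (and thus all of its threads) to CPU core 0.
// Linux-only sketch; sched_setaffinity() with pid 0 targets the caller.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1   // for cpu_set_t / CPU_ZERO / CPU_SET on glibc
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                          // allow core 0 only
    sched_setaffinity(0, sizeof(mask), &mask);  // apply the mask to this process
    std::printf("every thread of this process now time-shares one core\n");
    return 0;
}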

How to fetch coefficients from ROM (actually BlockRAM in FPGA) to use in matrix multiplication?

We are senior-year students designing an FPGA-based convolutional neural network accelerator.
We built a pipelined architecture (Convolution, Pooling, Convolution and Pooling). For these 4 stages of the architecture we need to multiply a particular window by a filter; in the 2nd convolution layer the window and filter amount to (5*5)*6*16 coefficients.
I accept this is not a perfectly clear explanation, but the main problem here is that we need to access all 5*5*6*16 filter coefficients, which are stored sequentially in block RAM, at the same time. Yet at every clock cycle I can only reach one particular address of the ROM.
What approach can we take?
You don't want to hear this but the only solution is:
Go back to the start and change your architecture/code. (or run very slowly)
You can NOT access 2400 coefficients through a single memory port in one cycle of your main system unless you run the memory at 2400 times the clock frequency of that system. So let's say with a 100 MHz RAM/ROM operating frequency, your main system would have to run at ~42 kHz.
This is a recurrent theme I encounter on these forums. You have made a wrong decision and now want a solution. Preferably an easy one. Sorry, but there is none.
I am having the same issue. For some layers we want to access multiple kernels for parallel operations; however, with a BRAM implementation you can get at most 2 accesses per cycle (the block RAMs are dual-ported). So the solution I came up with is to create an array of ROMs, implemented either in BRAM style or in distributed style.
Unlike a RAM array, a ROM array can't be implemented quite as easily, so you need a script/software layer that generates the RTL for your module.
I chose the distributed approach; however, I can't estimate the resources needed, and the utilization reports give me unclear results. Still investigating this.
For future reference, you could look into HLS pragmas to help use the FPGA's resources. What you could do is use the array-partition pragma with the cyclic setting, which stores each subsequent element of an array in a different sub-array.
For example, with a factor of 4 there would be four smaller arrays created from the original array, and the first elements of the sub-arrays would be arr[0], arr[1], arr[2] and arr[3] respectively.
That's how you would distribute an array across block RAMs to get more parallel accesses at a time.
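As a rough illustration of that advice (my sketch, not from the thread; the array names, sizes and partition factor are assumptions, and the exact pragma spelling differs slightly between Vivado HLS and Vitis HLS), cyclic partitioning lets one pipelined loop read several coefficients per cycle:

// HLS-style C++ sketch: partition the coefficient array into 4 interleaved banks so
// that 4 consecutive coefficients can be read in the same clock cycle.
const int N = 2400;  // e.g. 5*5*6*16 filter coefficients

void mac(const float window[N], const float coeffs_in[N], float *acc_out) {
    float coeffs[N];
#pragma HLS ARRAY_PARTITION variable=coeffs cyclic factor=4 dim=1
    float acc = 0.0f;
COPY:
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        coeffs[i] = coeffs_in[i];
    }
MAC:
    for (int i = 0; i < N; i += 4) {
#pragma HLS PIPELINE II=1
        // With the cyclic partition these 4 reads land in 4 different BRAM banks.
        acc += window[i]     * coeffs[i]
             + window[i + 1] * coeffs[i + 1]
             + window[i + 2] * coeffs[i + 2]
             + window[i + 3] * coeffs[i + 3];
    }
    *acc_out = acc;
}

In a real design the window buffer would need the same treatment (partitioning or streaming), otherwise it becomes the next bottleneck.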

Given measurements from an event series as input, how do I generate an infinite input series with the same profile?

I'm currently working with a system that makes scheduling decisions based on a series of requests and the state of the system.
I would like to take the stream of real inputs, mock out some of the components, and run simulations against the rest. The idea is to use it for planning with respect to system capacity (i.e. when to scale certain components), tracking down certain failure modes, and analyzing the effects of changes to the codebase (i.e. simulations with version A compared to simulations with version B).
I can do everything related to this, except generate a suitable input stream. Replaying the exact input from production hasn't been very helpful because it's hard to get a long enough data stream to tease out some of the behavior that I'm trying to find. In other words, if production falls over at 300 days of input, I don't have enough data to find out until after it fell over. Repeating the same input set has been considered; but after a few initial tries, the developers all agree that the simulation seems to "need more random".
About this particular system:
The input is a series of irregularly spaced events (i.e. a stochastic process with discrete time and continuous state space).
Properties are not independent of each other.
Even the more independent of the properties are composites of other properties that will always be, by nature, invisible to me (leading to a multi-modal distribution).
Request interval is not independent of other properties (i.e. lots of requests for small amounts of resources come through in a batch, large requests don't).
There are feedback loops in it.
It's provably chaotic.
So:
Given a stream of input events with a certain distribution of various properties (including interval), how do I generate an infinite stream of events with the same distribution across a number of non-independent properties?
Having looked around, I think I need to do a Markov chain Monte Carlo simulation. My problem is figuring out how to build the Markov chain from the existing input data.
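To make concrete what "building the chain from the data" could mean, here is a minimal empirical sketch (my illustration, not from the thread; it handles a single discretized property, whereas the real input has several correlated ones that would have to be binned jointly into the state):

// Fit a first-order Markov chain over discretized event "states" from a recorded trace,
// then sample an arbitrarily long synthetic trace from it.
#include <cstddef>
#include <map>
#include <random>
#include <vector>

struct EmpiricalMarkovChain {
    // transition_counts[s] maps successor state -> observed count
    std::map<int, std::map<int, double>> transition_counts;

    void fit(const std::vector<int>& observed_states) {
        for (std::size_t i = 0; i + 1 < observed_states.size(); ++i)
            transition_counts[observed_states[i]][observed_states[i + 1]] += 1.0;
    }

    // Sample the next state given the current one, proportionally to the observed counts.
    // (A state never seen as a predecessor would need a fallback in real use.)
    int step(int current, std::mt19937& rng) const {
        const auto& row = transition_counts.at(current);
        std::vector<int> successors;
        std::vector<double> weights;
        for (const auto& kv : row) {
            successors.push_back(kv.first);
            weights.push_back(kv.second);
        }
        std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
        return successors[pick(rng)];
    }
};

// Usage idea: bin each recorded event (size, inter-arrival time, ...) into an integer
// state, call fit() on the trace, then call step() forever to generate synthetic input.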
Maybe it is possible to model the input with a copula. There are tools that help you do so; see, e.g., this paper. Apart from this, I would suggest moving the question to http://stats.stackexchange.com, as this is a statistical problem and will likely draw more attention over there.

Improving simulation performance via concurrency

Consider this sequential procedure on a data structure containing collections (for simplicity, call them lists) of Doubles. For as long as I feel like, do:
Select two different lists from the structure at random
Calculate a statistic based on those lists
Flip a coin based on that statistic
Possibly modify one of the lists, based on the outcome of the coin toss
The goal is to eventually achieve convergence to something, so the 'solution' is linear in the number of iterations. An implementation of this procedure can be seen in the SO question linked here, together with an intuitive visualization.
It seems that this procedure could be performed better - that is, convergence could be achieved faster - by using several workers executing concurrently on separate OS threads.
I guess a perfectly-realized implementation of this should be able to achieve a solution in O(n/P) time, for P the number of available compute resources.
Reading up on Haskell concurrency has left my head spinning with terms like MVar, TVar, TChan, acid-state, etc. What seems clear is that a concurrent implementation of this procedure would look very different from the one I linked above. But the procedure itself is a pretty tame algorithm over what is essentially an in-memory database, which is a problem I'm sure somebody has come across before.
I'm guessing I will have to use some kind of mutable, concurrent data structure that supports decent random access (that is, to random idle elements) & modification. I am getting a bit lost when I try to piece together all the things that this might require with a view towards improving performance (STM seems dubious, for example).
What data structures, concurrency concepts, etc. are suitable for this kind of task, if the goal is a performance boost over a sequential implementation?
Keep it simple:
forkIO for lightweight, super-cheap threads.
MVar, for fast, thread safe shared memory.
and the appropriate sequence type (probably vector, maybe lists if you only prepend)
a good stats package
and a fast random number source (e.g. mersenne-random-pure64)
You can try the fancier stuff later. For raw performance, keep things simple first: keep the number of locks down (e.g. one per buffer), make sure to compile your code with optimizations and the threaded runtime (ghc -O2 -threaded), and you should be off to a great start.
RWH has an intro chapter covering the basics of concurrent Haskell.
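Since the thread is about Haskell, the following is only a language-neutral sketch (written in C++, my illustration, not anyone's answer) of the "one lock per buffer" structure: several workers, each iteration picking two random lists, computing some statistic, flipping a biased coin and possibly moving data between them:

#include <mutex>
#include <random>
#include <thread>
#include <vector>

struct Buffer {
    std::mutex lock;             // one lock per buffer keeps contention local
    std::vector<double> values;
};

void worker(std::vector<Buffer>& buffers, int iterations, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, buffers.size() - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    for (int it = 0; it < iterations; ++it) {
        std::size_t i = pick(rng), j = pick(rng);
        if (i == j) continue;
        std::scoped_lock guard(buffers[i].lock, buffers[j].lock);  // deadlock-free pair lock
        // Placeholder statistic and modification; the real ones come from the problem.
        double p = buffers[i].values.size() > buffers[j].values.size() ? 0.7 : 0.3;
        if (coin(rng) < p && !buffers[i].values.empty()) {
            buffers[j].values.push_back(buffers[i].values.back());
            buffers[i].values.pop_back();
        }
    }
}

int main() {
    std::vector<Buffer> buffers(16);
    for (auto& b : buffers) b.values = {1.0, 2.0, 3.0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < 4; ++t)
        workers.emplace_back(worker, std::ref(buffers), 100000, t);
    for (auto& w : workers) w.join();
}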

Message Passing Arbitrary Object Graphs?

I'm looking to parallelize some code across a Beowulf cluster, such that the CPUs involved don't share address space. I want to parallelize a function call in the outer loop. The function calls do not have any "important" side effects (though they do use a random number generator, allocate memory, etc.).
I've looked at libs like MPI and the problem I see is that they seem to make it very non-trivial to pass complex object graphs between nodes. The input to my function is a this pointer that points to a very complex object graph. The return type of my function is another complex object graph.
At a language-agnostic level (I'm working in the D programming language, and I'm almost sure no canned solution is available here, but I'm willing to create one), is there a "typical" way that passing complex state across nodes is dealt with? Ideally, I want the details of how the state is copied to be completely abstracted away and for the calls to look almost like normal function calls. I don't care that copying this much state over a network isn't particularly efficient, as the level of parallelism in question is so coarse-grained that it probably won't matter.
Edit: If there is no easy way to pass complex state, then how is message passing typically used? It seems to me like anything involving copying data over a network requires coarse grained parallelism, yet coarse grained parallelism usually requires passing complex state so that a lot of work can be done in one work unit.
I do a fair bit of MPI programming but I don't know of any typical way of passing complex state (as you describe it) between processes. Here's how I've been thinking about your problem; it probably matches your own thinking ...
I surmise that your complex object graphs are represented, in memory, by blocks of data and pointers to other blocks of data -- a usual sort of implementation of a graph. How best can you move one of these COGs (to coin an abbreviation) from the address space of one process to the address space of another ? To the extent that a pointer is a memory address, a pointer in one address space is no use in another address space, so you will have to translate it into some neutral form for transport (I think ?).
To send a COG, therefore, it has to be put into some form from which the receiving process can build, in its own address space, a local version of the graph with the pointers pointing to local memory addresses. Do you ever write these COGs to file ? If so, you already have a form in which one could be transported. I hate to suggest it, but you could even use files to communicate between processes -- and that might be easier to handle than the combination of D and MPI. Your choice !
If you don't have a file form for the COGs can you easily represent them as adjacency matrices or lists ? In other words, work out your own representation for transport ?
I'll be very surprised (but pleased to learn) if you can pass a COG between processes without transforming it from pointer-based to some more static structure such as arrays or records.
Edit, in response to OP's edit. MPI does provide easy ways to pass complex state around, provided that the complex state is represented as values, not pointers. You can pass complex state around in either the intrinsic or customised MPI datatypes; as one of the other answers shows, these are flexible and capable. If your program does not keep the complex state in a form that MPI custom datatypes can handle, you'll have to write functions to pack/unpack it into a message-friendly representation. If you can do that, then your message calls will look (for most purposes) like function calls.
As to the issues surrounding complex state and the granularity of parallelism, I'm not sure I quite follow you. We (include yourself in this sweeping generalisation if you want, or not) typically resort to MPI programming because we can't get enough performance out of a single processor. We know that we'll pay a penalty in terms of computation delayed while waiting for communication, we work hard to minimise that penalty, but in the end we accept it as the cost of parallelisation. Certainly some jobs are too small or too short to benefit from parallelisation, but a lot of what we (parallel computationalists, that is) do is just too big and too long-running to avoid parallelisation.
You can do marvelous things with custom MPI datatypes. I'm currently working on a project where several MPI processes are tracking particles in a piece of virtual space, and when particles cross over from one process' territory into another one's, their data (position/speed/size/etc) has to be sent over the network.
The way I achieved this is the following:
1) All processes share an MPI struct datatype for a single particle that contains all its relevant attributes and their displacements in memory relative to the base address of the particle object.
2) On sending, the process iterates over whatever data structure it stores the particles in, notes down the memory address of each one that needs to be sent, and then builds a Hindexed datatype where each block is 1 long (of the above mentioned particle datatype) and starts at the memory addresses previously noted down. Sending 1 object of the resulting type will send all the necessary data over the network, in a type safe manner.
3) On the receiving end, things are slightly trickier. The receiving process first inserts "blank" particles into its own data structure: "blank" means that all the attributes that will be received from the other process are initialized to some default value. The memory addresses of the freshly inserted particles are noted down, and a datatype similar to that of the sender is created from these addresses. Receiving the sender's message as a single object of this type will automatically unpack all the data into all the right places, again, in a type safe manner.
This example is simpler in the sense that there are no relationships between particles (as there would be between nodes of a graph), but you could transmit that data in a similar way.
If the above description is not clear, I can post the C++ code that implements it.
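For illustration only, here is a minimal sketch of that recipe (my code, not the answerer's; the Particle fields and helper names are assumptions):

#include <mpi.h>
#include <cstddef>
#include <vector>

struct Particle {
    double pos[3];
    double vel[3];
    double size;
};

// 1) A struct datatype describing one Particle's layout in memory.
MPI_Datatype make_particle_type() {
    Particle probe;
    MPI_Aint base, addr[3];
    MPI_Get_address(&probe, &base);
    MPI_Get_address(&probe.pos, &addr[0]);
    MPI_Get_address(&probe.vel, &addr[1]);
    MPI_Get_address(&probe.size, &addr[2]);
    int          blocklens[3] = {3, 3, 1};
    MPI_Aint     displs[3]    = {addr[0] - base, addr[1] - base, addr[2] - base};
    MPI_Datatype types[3]     = {MPI_DOUBLE, MPI_DOUBLE, MPI_DOUBLE};
    MPI_Datatype particle_type;
    MPI_Type_create_struct(3, blocklens, displs, types, &particle_type);
    MPI_Type_commit(&particle_type);
    return particle_type;
}

// 2) Sender: an hindexed type whose blocks are the absolute addresses of the particles
//    to ship; sending 1 element of it (relative to MPI_BOTTOM) sends them all.
void send_particles(const std::vector<Particle*>& outgoing, MPI_Datatype particle_type,
                    int dest, MPI_Comm comm) {
    std::vector<int>      blocklens(outgoing.size(), 1);
    std::vector<MPI_Aint> displs(outgoing.size());
    for (std::size_t i = 0; i < outgoing.size(); ++i)
        MPI_Get_address(outgoing[i], &displs[i]);
    MPI_Datatype batch;
    MPI_Type_create_hindexed(static_cast<int>(outgoing.size()), blocklens.data(),
                             displs.data(), particle_type, &batch);
    MPI_Type_commit(&batch);
    MPI_Send(MPI_BOTTOM, 1, batch, dest, /*tag=*/0, comm);
    MPI_Type_free(&batch);
}

// 3) The receiver first inserts "blank" particles, builds the same kind of hindexed type
//    from their addresses, and a single MPI_Recv then unpacks straight into them.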
I'm not sure I understand the question correctly so forgive me if my answer is off. From what I understand you want to send non-POD datatypes using MPI.
A library that can do this is Boost.MPI. It uses a serialization library to send even very complex data structures. There is a catch, though: you will have to provide the serialization code yourself if you use complicated structures that Boost.Serialization does not already know about.
I believe message passing is typically used to transmit POD datatypes.
I'm not allowed to post more links so here is what I wanted to include:
Explanation of POD: www.fnal.gov/docs/working-groups/fpcltf/Pkg/ISOcxx/doc/POD.html
Serialization Library: www.boost.org/libs/serialization/doc
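For orientation, a hedged sketch of what the Boost.MPI route can look like (type and member names are illustrative, not from the answer): a non-POD type becomes sendable once it has a Boost.Serialization serialize() hook, and Boost.Serialization can also track pointers, which is what makes serializing object graphs feasible.

#include <boost/mpi.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>
#include <string>
#include <vector>

struct Node {
    std::string label;
    std::vector<double> payload;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & label & payload;   // Boost.Serialization walks the members for us
    }
};

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;
    if (world.rank() == 0) {
        Node n{"root", {1.0, 2.0, 3.0}};
        world.send(1, /*tag=*/0, n);   // serialized behind the scenes
    } else if (world.rank() == 1) {
        Node n;
        world.recv(0, /*tag=*/0, n);   // deserialized into n
    }
}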
It depends on the organization of your data. If you use pointers or automatic memory inside your objects, it will be difficult. If you can organize your objects to be contiguous in memory, you have two choices: send the memory as bytes and cast it back to the object type on the receiver, or define an MPI derived type for your object. If, however, you use inheritance, things will become complicated due to how objects are laid out in memory.
I do not know your problem in detail, but maybe take a look at ARMCI if you manage memory manually.
