I wrote a complex model in Alloy that contains 100 signatures, and each signature has some relations based on an abstract signature. The problem occurs when I try to solve the model. After about two hours, the Analyzer returns a message saying that the solver ran out of memory and that I should reduce the scope, increase the memory, or simplify the model. The scope was already reduced to 1, and the memory was already increased to the maximum limit. I ran this experiment on two machines (my laptop with 8 GB of memory and my desktop with 4 GB) and I still get the same message.
Has anyone faced this kind of problem? Is Alloy able to handle such a complex model?
Regards
We are senior-year students designing an FPGA-based Convolutional Neural Network accelerator.
We built a pipelined architecture (Convolution, Pooling, Convolution, Pooling). For these 4 stages of the architecture, we need to multiply one particular window with a filter. In the 2nd convolution layer we have a (5*5)*6*16 window and filter.
I accept that this is not a clear explanation so far, but the main problem here is that we need to access 5*5*6*16 filter coefficients, which are stored sequentially in block RAM, all at the same time. However, at every clock cycle I can only reach one particular address of the ROM.
What approach can we take?
You don't want to hear this, but the only solution is:
Go back to the start and change your architecture/code (or run very slowly).
You can NOT sequentially access 2400 coefficients within a single system clock cycle unless you run the memory at 2400 times the clock frequency of your main system. So let's say with a 100 MHz RAM/ROM operating frequency, your main system must run at ~42 kHz.
This is a recurrent theme I encounter on these forums: you have made a wrong decision and now want a solution, preferably an easy one. Sorry, but there is none.
I am having the same issue. For some layers we want to access multiple kernels for parallel operations; however, with a BRAM implementation you can have at most 2 accesses per cycle. So the solution I came up with is to create a ROM array, implemented either in BRAM style or distributed style.
Unlike a RAM array, you can't implement a ROM array quite as easily, so you need a script/software layer that generates the RTL for your module (a sketch is below).
I chose the distributed approach; however, I can't estimate the resources needed, and the utilization reports give me unclear results. I'm still investigating this.
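For what it's worth, here is a minimal Python sketch of what such a generator script could look like, assuming the coefficients have already been quantized to integers; the module and signal names are just illustrative, not from any real design:

# Minimal sketch of an RTL-generating script: one small combinational Verilog ROM
# per coefficient bank, so many coefficients can be read in the same cycle.
def emit_rom(name, coeffs, width=16):
    """Emit a simple combinational Verilog ROM as a case statement."""
    addr_bits = max(1, (len(coeffs) - 1).bit_length())
    lines = [
        f"module {name} (input wire [{addr_bits-1}:0] addr,",
        f"               output reg  [{width-1}:0] data);",
        "  always @(*) begin",
        "    case (addr)",
    ]
    for i, c in enumerate(coeffs):
        lines.append(f"      {addr_bits}'d{i}: data = {width}'d{c & ((1 << width) - 1)};")
    lines.append(f"      default: data = {width}'d0;")
    lines += ["    endcase", "  end", "endmodule", ""]
    return "\n".join(lines)

banks = [[1, 2, 3, 4], [5, 6, 7, 8]]          # toy coefficient banks
with open("coeff_roms.v", "w") as f:
    for b, coeffs in enumerate(banks):
        f.write(emit_rom(f"coeff_rom_bank{b}", coeffs))

The point is only that the splitting into banks is done by the script, so the number of parallel read ports is whatever your dataflow needs.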
For future reference, you could look into HLS pragmas to help use the FPGA's resources. What you could do is use the array partition pragma with the cyclic setting. This makes it so that each subsequent element of an array is stored in a different sub-array.
For example, with a factor of 4 there'd be four smaller arrays created from the original array, and their first elements would be arr[0], arr[1], arr[2], and arr[3] respectively.
That's how you would distribute an array across Block RAMs to get more parallel accesses at a time.
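To make the mapping concrete, here's a small Python sketch of where each element ends up under cyclic partitioning (the HLS tool does this for you; this only illustrates the bank layout, with a made-up factor and array size):

# Conceptual sketch of cyclic array partitioning: element i of the original
# array lands in bank (i % FACTOR) at local offset (i // FACTOR), so FACTOR
# consecutive indices can be read in the same cycle from different banks.
FACTOR = 4
coeffs = list(range(16))                      # stand-in for the coefficient array

banks = [[] for _ in range(FACTOR)]
for i, c in enumerate(coeffs):
    banks[i % FACTOR].append(c)               # bank index = i mod FACTOR

# Reading coeffs[8..11] touches four different banks, hence four parallel reads:
window = [banks[i % FACTOR][i // FACTOR] for i in range(8, 12)]
print(window)                                 # [8, 9, 10, 11]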
I want to use the model to predict scores in a map lambda function in PySpark.
import numpy as np
from keras.models import load_model
from pyspark.sql import Row

def inference(user_embed, item_embed):
    feats = user_embed + item_embed
    dnn_model = load_model("best_model.h5")  # the model is (re)loaded for every record
    infer = dnn_model.predict(np.array([feats]), verbose=0, steps=1)
    return infer

iu_score = iu.map(lambda x: Row(userid=x.userid, entryid=x.entryid,
                                score=inference(x.user_embed, x.item_embed)))
The run is extremely slow, and it gets stuck at the final stage soon after the code starts running.
[Stage 119:==================================================>(4048 + 2) / 4050]
In the htop monitor, only 2 of the 80 cores are at full load; the other cores seem to be idle.
So what should I do to make the model predict in parallel? iu has 300 million rows, so efficiency is important for me.
Thanks.
I have turned on verbose=1 and the predict log appears, but it seems that the prediction runs one record at a time instead of in parallel.
While writing this response I researched a little bit and found this question interesting.
First, if efficiency is really important, invest a little time in recoding the whole thing without Keras. You can still use the high-level API for TensorFlow (Models), and with a little effort extract the parameters and assign them to the new model. Apart from the fact that all the massive wrapper implementations make the framework unclear (is TensorFlow not a rich enough framework?), you will most likely run into backward-compatibility problems when upgrading. Really not recommended for production.
Having said that, can you inspect what the problem is exactly? For instance, are you using GPUs? Maybe they are overloaded? Can you wrap the whole thing so it does not exceed some capacity, and use a prioritizing system? You can use a simple queue if there are no priorities. You can also check whether you really terminate TensorFlow's sessions, or whether the same machine runs many models that interfere with each other. There are many issues that could be the reason for this phenomenon; it would be great to have more details.
Regarding the parallel computation: you didn't implement anything that really opens a thread or a process for these models, so I suspect that PySpark just can't handle the whole thing on its own. Maybe the implementation (honestly, I didn't read the whole PySpark documentation) assumes that the dispatched functions run fast enough and doesn't distribute the work as it should. PySpark is essentially a sophisticated implementation of map-reduce principles: the dispatched function plays the role of a mapping function in a single step, which can be problematic in your case. Even though it is passed as a lambda expression, you should inspect more carefully which instances are slow and on which machines they are running.
I strongly recommend you do as follows:
Go to the official TensorFlow Serving docs and read how to really deploy a model. There is a protocol for communicating with the deployed models (gRPC) and also a RESTful API. Then, from your PySpark code, you can wrap the calls and connect to the served model. You can create a pool of as many model instances as you want, manage it in PySpark, distribute the computations over a network, and from there the sky and the CPUs/GPUs/TPUs are the limits (I'm still skeptical about the sky).
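As a rough sketch of what the PySpark side could look like (assuming a TensorFlow Serving instance reachable over its REST API; the URL, model name, field names and batch size below are illustrative, not taken from your code):

# Rough sketch: call a TensorFlow Serving REST endpoint from mapPartitions,
# batching records so each partition makes few, large requests.
import requests
from pyspark.sql import Row

SERVING_URL = "http://tf-serving-host:8501/v1/models/best_model:predict"  # illustrative

def score_batch(batch):
    payload = {"instances": [r.user_embed + r.item_embed for r in batch]}
    preds = requests.post(SERVING_URL, json=payload).json()["predictions"]
    for r, p in zip(batch, preds):
        yield Row(userid=r.userid, entryid=r.entryid, score=p)

def score_partition(rows, batch_size=256):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from score_batch(batch)
            batch = []
    if batch:
        yield from score_batch(batch)

iu_score = iu.mapPartitions(score_partition)

This way each executor only batches records and does HTTP calls, while the serving processes do the heavy lifting in parallel.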
It would be great to get an update from you about the results :) You've made me curious.
I wish you all the best with this issue. Great question.
I have a task that consists of 3 concurrent self-defined (mutually recursive) processes. I need to somehow make it execute on a computer, but any attempt to convert the requirements to program code with just my brain fails, since the first iteration already produces 3^3 entities with 27^2 cross-relations, and it needs to run at least several iterations to see whether the program works at all.
So I decided to give up on trying to understand the whole system, formalized the problem, and now want to map it to hardware to generate an algorithm and run it. The language doesn't matter (maybe even machine/assembly code directly?).
I have never done anything like this before. All the topics I searched through (algorithm synthesis, software and hardware co-design, etc.) mention a hardware model as the second half of solution generation, in addition to the problem model, but I have never seen one. The whole workflow is supposed to look like this: problem model + hardware model -> generated solution.
I don't know yet at what level the hardware model is described, so I can't decide how the problem model must be formalized to fit the hardware model's layer.
For example, the target system may contain a CPU and a GPGPU, and let's say the target solution has 2 concurrent processes. The system must decide which process to run on the CPU and which on the GPGPU. The highest-level solution may come from comparing the computational intensity of the processes with that of the target hardware, which is ~300 for the CPU and ~50 for the GPGPU.
But a proper model would have to be much more complete, with at least the cache hierarchy, memory access batch size, etc.
Another example is implementing k-ary trees. A synthesized algorithm could address children and parents by computing k * i + c and (i - 1) / k, or store direct pointers, depending on the ratio of computation cost to memory latency.
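For instance, a tiny Python sketch of the arithmetic variant, assuming nodes are stored in a flat array with 0-based indexing:

# Implicit k-ary tree: nodes live in a flat array, and parent/child positions
# are computed instead of stored as pointers (0-based indexing, node 0 is root).
K = 4

def child(i, c):          # c-th child of node i, c in 1..K
    return K * i + c

def parent(i):            # parent of node i
    return (i - 1) // K

assert parent(child(10, 3)) == 10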
Where can I get such a hardware model or data to use? Any hardware would suffice for now, just to see what one can look like; later it would be awesome to get models of modern processors, GPGPUs, and common heterogeneous clusters.
Do manufacturers supply such models, i.e. descriptions of how their systems work in some formal language?
I'm not quite sure whether this fits your case, but since you mention modeling, I immediately thought of Modelica. It's used to model physical systems, and combined with a simulation environment you can run simulations on it.
I'm using an OCR library to extract product specifications from images. I'm focusing on notebooks first. For example:
Processor
Processor model: Intel N3540
Clock speed: 2.16 GHz
Memory
Internal: 4 GB
Hard disk
Capacity: 1 TB
or:
TOSHIBA
SATELLITE C50-5302
PENTIUM
TOSHIBA
DISPLAY 15.6
4GB
DDR3
500
The OCR is not perfect, and sometimes what should be C10 ends up as CIO, and other similar things.
I'd like to extract the attribute-value pairs but I don't know how to approach this problem.
I'm thinking about building a file with all the notebooks and microprocessors I can get (because brands, memory sizes, and hard drive capacities are pretty limited) and then using an NLP library to extract the entities from the text. The problem is also that there are sometimes spelling errors, so it's not as easy as comparing exact values.
How would you approach this problem?
As for spelling errors, I'd suggest obtaining, if possible, an ambiguous and probabilistic output from the OCR system. Considering your CIO example, 'I' is graphically much closer to '1' than to other characters. If no such output is available, you may consider using some sort of weighted edit distance between characters (a rough sketch follows below).
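Here is a rough Python sketch of such a weighted edit distance; the confusable pairs and their costs are made up for illustration:

# Weighted edit distance: substitutions between visually similar characters
# (I/1, O/0, ...) are cheap, everything else costs 1. Costs are illustrative.
CONFUSABLE = {frozenset("I1"): 0.1, frozenset("O0"): 0.1,
              frozenset("S5"): 0.2, frozenset("B8"): 0.2}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return CONFUSABLE.get(frozenset((a, b)), 1.0)

def weighted_edit_distance(s, t):
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = i
    for j in range(1, len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1,                      # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[len(s)][len(t)]

print(weighted_edit_distance("CIO", "C10"))   # low cost (~0.2)
print(weighted_edit_distance("CIO", "C50"))   # much higher cost (~1.1)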
For named entity recognition, work has been done on recognizing named entities in noisy input, mostly for ASR sources (as far as I know). See how word confusion networks handle this, for instance in this article.
Finally, you'll probably need a joint task for OCR correction and named entity recognition. This will require defining which entities are likely for your domain: which tokens are expected to describe CPU speed, storage capacity, computer brands, etc. You may either implement rules manually (a toy example follows below) or mine data from existing databases. In the end you'll probably have to tune how much OCR error correction you expect, so that you extract correct attribute-value pairs without adding false positives.
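A toy example of manually implemented rules: a few regexes for numeric attributes plus a small brand gazetteer matched fuzzily, so OCR errors like TOSH1BA still resolve. The patterns, thresholds, and brand list are illustrative only:

# Rule-based attribute-value extraction with fuzzy gazetteer matching.
import re
import difflib

PATTERNS = {
    "clock_speed": re.compile(r"\d+(?:\.\d+)?\s*GHz", re.I),
    "memory":      re.compile(r"\d{1,2}\s*GB", re.I),
    "storage":     re.compile(r"\d+\s*TB|\d{3,}\s*GB", re.I),
}
BRANDS = ["TOSHIBA", "LENOVO", "HP", "DELL", "ASUS", "ACER"]

def extract(text):
    pairs = {}
    for attr, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            pairs[attr] = match.group(0)
    for token in re.findall(r"[A-Z0-9]{3,}", text.upper()):
        close = difflib.get_close_matches(token, BRANDS, n=1, cutoff=0.75)
        if close:
            pairs["brand"] = close[0]
            break
    return pairs

print(extract("TOSH1BA SATELLITE C50, 4GB DDR3, 2.16 GHz, 500 GB"))
# {'clock_speed': '2.16 GHz', 'memory': '4GB', 'storage': '500 GB', 'brand': 'TOSHIBA'}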
Don't hesitate to keep us informed about the solutions you experiment with!
I'm looking to parallelize some code across a Beowulf cluster, such that the CPUs involved don't share address space. I want to parallelize a function call in the outer loop. The function calls do not have any "important" side effects (though they do use a random number generator, allocate memory, etc.).
I've looked at libs like MPI and the problem I see is that they seem to make it very non-trivial to pass complex object graphs between nodes. The input to my function is a this pointer that points to a very complex object graph. The return type of my function is another complex object graph.
At a language-agnostic level (I'm working in the D programming language, and I'm almost sure no canned solution is available here, but I'm willing to create one), is there a "typical" way that passing complex state across nodes is dealt with? Ideally, I want the details of how the state is copied to be completely abstracted away and for the calls to look almost like normal function calls. I don't care that copying this much state over a network isn't particularly efficient, as the level of parallelism in question is so coarse-grained that it probably won't matter.
Edit: If there is no easy way to pass complex state, then how is message passing typically used? It seems to me like anything involving copying data over a network requires coarse grained parallelism, yet coarse grained parallelism usually requires passing complex state so that a lot of work can be done in one work unit.
I do a fair bit of MPI programming, but I don't know of any typical way of passing complex state (as you describe it) between processes. Here's how I've been thinking about your problem; it probably matches your own thinking ...
I surmise that your complex object graphs are represented, in memory, by blocks of data and pointers to other blocks of data -- a usual sort of implementation of a graph. How best can you move one of these COGs (to coin an abbreviation) from the address space of one process to the address space of another ? To the extent that a pointer is a memory address, a pointer in one address space is no use in another address space, so you will have to translate it into some neutral form for transport (I think ?).
To send a COG, therefore, it has to be put into some form from which the receiving process can build, in its own address space, a local version of the graph with the pointers pointing to local memory addresses. Do you ever write these COGs to file ? If so, you already have a form in which one could be transported. I hate to suggest it, but you could even use files to communicate between processes -- and that might be easier to handle than the combination of D and MPI. Your choice !
If you don't have a file form for the COGs can you easily represent them as adjacency matrices or lists ? In other words, work out your own representation for transport ?
I'll be very surprised (but pleased to learn) if you can pass a COG between processes without transforming it from pointer-based to some more static structure such as arrays or records.
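For what it's worth, in a language with a serializer at hand, the "neutral form" can simply be a byte blob. Here is a rough Python/mpi4py sketch of that transport pattern (you're working in D, so treat this purely as an illustration; the Node class is made up):

# Illustration of "translate the COG to a neutral form": pickle flattens the
# pointer-based graph into bytes, and the receiver rebuilds it with fresh local
# addresses. (mpi4py's send would pickle a Python object for you anyway; doing it
# explicitly just makes the neutral form visible.)
import pickle
from mpi4py import MPI

class Node:
    def __init__(self, payload):
        self.payload = payload
        self.edges = []              # "pointers" to other nodes

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    a, b = Node("a"), Node("b")
    a.edges.append(b)
    b.edges.append(a)                # cycles are fine, the serializer handles them
    blob = pickle.dumps(a)           # pointer graph -> neutral byte form
    comm.send(blob, dest=1, tag=0)
elif comm.Get_rank() == 1:
    blob = comm.recv(source=0, tag=0)
    local_a = pickle.loads(blob)     # rebuilt with local memory addresses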
Edit, in response to OP's edit. MPI does provide easy ways to pass complex state around, provided that the complex state is represented as values, not pointers. You can pass complex state around in either the intrinsic or customised MPI datatypes; as one of the other answers shows, these are flexible and capable. If your program does not keep the complex state in a form that MPI custom datatypes can handle, you'll have to write functions to pack/unpack it to a message-friendly representation. If you can do that, then your message calls will look (for most purposes) like function calls.
As to the issues surrounding complex state and the graininess of parallelism, I'm not sure I quite follow you. We (include yourself in this sweeping generalisation if you want, or not) typically resort to MPI programming because we can't get enough performance out of a single processor; we know that we'll pay a penalty in terms of computation delayed by waiting for communication; we work hard to minimise that penalty; but in the end we accept it as the cost of parallelisation. Certainly some jobs are too small or too short to benefit from parallelisation, but a lot of what we (parallel computationalists, that is) do is just too big and too long-running to avoid parallelisation.
You can do marvelous things with custom MPI datatypes. I'm currently working on a project where several MPI processes are tracking particles in a piece of virtual space, and when particles cross over from one process' territory into another one's, their data (position/speed/size/etc) has to be sent over the network.
The way I achieved this is the following:
1) All processes share an MPI Struct datatype for a single particle that contains all its relevant attributes and their displacements in memory relative to the base address of the particle object.
2) On sending, the process iterates over whatever data structure it stores the particles in, notes down the memory address of each one that needs to be sent, and then builds a Hindexed datatype where each block is 1 element long (of the above-mentioned particle datatype) and starts at one of the memory addresses previously noted down. Sending 1 object of the resulting type will send all the necessary data over the network, in a type-safe manner.
3) On the receiving end, things are slightly trickier. The receiving process first inserts "blank" particles into its own data structure: "blank" means that all the attributes that will be received from the other process are initialized to some default value. The memory addresses of the freshly inserted particles are noted down, and a datatype similar to that of the sender is created from these addresses. Receiving the sender's message as a single object of this type will automatically unpack all the data into all the right places, again, in a type safe manner.
This example is simpler in the sense that there are no relationships between particles (as there would be between nodes of a graph), but you could transmit that data in a similar way.
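To make steps 1) to 3) a bit more concrete, here is a rough Python/mpi4py analogue, where a NumPy structured dtype stands in for the MPI Struct datatype and fancy indexing stands in for the Hindexed gather (field names and counts are made up):

# Rough analogue of the particle exchange: one record per particle, gathered
# into a contiguous buffer on the sender and unpacked into "blank" records on
# the receiver.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

particle_t = np.dtype([("pos", np.float64, 3),      # plays the role of the MPI Struct
                       ("vel", np.float64, 3),
                       ("radius", np.float64)])

if rank == 0:
    particles = np.zeros(100, dtype=particle_t)      # local particle storage
    leaving = [3, 17, 42]                            # indices crossing the boundary
    outgoing = particles[leaving]                    # gather into one contiguous buffer
    comm.send(len(outgoing), dest=1, tag=0)          # tell the receiver how many
    comm.Send([outgoing, MPI.BYTE], dest=1, tag=1)   # one send for all of them
elif rank == 1:
    n = comm.recv(source=0, tag=0)
    incoming = np.zeros(n, dtype=particle_t)         # "blank" particles, default values
    comm.Recv([incoming, MPI.BYTE], source=0, tag=1) # unpacks straight into place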
If the above description is not clear, I can post the C++ code that implements it.
I'm not sure I understand the question correctly, so forgive me if my answer is off. From what I understand, you want to send non-POD datatypes using MPI.
A library that can do this is Boost.MPI. It uses a serialization library to send even very complex data structures. There is a catch though: you will have to provide code to serialize the data yourself if you use complicated structures that Boost.Serialization does not already know about.
I believe message passing is typically used to transmit POD datatypes.
I'm not allowed to post more links so here is what I wanted to include:
Explanation of POD: www.fnal.gov/docs/working-groups/fpcltf/Pkg/ISOcxx/doc/POD.html
Serialization Library: www.boost.org/libs/serialization/doc
It depends on the organization of your data. If you use pointers or automatic memory inside your objects, it will be difficult. If you can organize your objects to be contiguous in memory, you have two choices: send the memory as bytes and cast it back to the object type on the receiver, or define an MPI derived type for your object. If, however, you use inheritance, things will become complicated due to how objects are laid out in memory.
I do not know your problem, but maybe you can take a look at ARMCI if you manage memory manually.