No more satisfying instances - Alloy

No examples are generated for my Alloy model; the error message is 'No more satisfying instances' (see image attached).
I have created the following small model in Alloy:
sig System
{
    subSystem : System
}

// Prevent a subsystem from directly including itself
fact noDirectInclusion
{
    no s : System | s in s.subSystem
}

// Prevent a subsystem from transitively including itself
fact noTransitiveInclusion
{
    no s : System | s in s.^subSystem
}
pred show {}
run show for 5
The fact 'noDirectInclusion' nicely prevents the generation of examples where a subsystem is a subsystem of itself.
I am probably missing something trivial, but when I also add the fact 'noTransitiveInclusion', no examples are generated any more and I get the error message 'No more satisfying instances' (see image attached).
What am I missing?

What am I missing?
Try to make your graph by hand for only 2 System sigs ...
You will see that with the constraints you specified in the System sig you can only make a cycle: you force every System to have one and exactly one subSystem, because the default multiplicity for a field is one. Therefore, with a finite set of objects, the subSystem graph must close into a cycle, and that violates your noTransitiveInclusion fact.
Either declare subSystem with a lone or set multiplicity.
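For example, a minimal sketch with a set multiplicity (lone works the same way) lets Alloy find instances again, because a System may now have zero subsystems and the subSystem relation no longer has to close into a cycle:
sig System
{
    subSystem : set System  // zero or more subsystems
}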

WGSL atomics with multiple compute passes

I'm having an issue with atomics in wgpu / WGSL but I'm not sure if it's due to a fundamental misunderstanding or a bug in my code.
I have an input array declared in WGSL as
struct FourTileUpdate {
    // (u32 = 4 bytes)
    data: array<u32, 9>
};

@group(0) @binding(0) var<storage, read> tile_updates : array<FourTileUpdate>;
I'm limiting the size of this array to around 5MB, but sometimes I need to transfer more than that for a single frame and so use multiple command encoders & compute passes.
Each "tile update" has an associated position (x & y) and a ms_since_epoch property that indicates when the tile update was created. Tile updates get written to a texture.
I don't want to overwrite newer tile updates with older tile updates, so in my shader I have a guard:
storageBarrier();
let previous_timestamp_value = atomicMax(&last_timestamp_for_tile[x + y * r_locals.width], ms_since_epoch);
if (previous_timestamp_value > ms_since_epoch) {
    return;
}
However, something is going wrong and older tile updates are overwriting newer ones. I can't reproduce this on Windows / Vulkan, but it consistently happens on macOS / Metal. Here's an image of the rendered texture; it should be completely green, but there are occasional red and black pixels:
(image: rendered texture)
A few questions:
is execution order guaranteed to be the same as the order of the command encoder constructions?
do storageBarrier() and atomics work across all invocations in a single frame or just the compute pass?
I tried submitting each encoder with queue.submit(Some(encoder.finish())) before creating the next encoder for the frame, and even waiting for the queue to finish processing for each submitted encoder with
let (tx, rx) = mpsc::channel();
queue.on_submitted_work_done(move || {
    tx.send(()).unwrap();
});
device.poll(wgpu::Maintain::Wait);
rx.recv().unwrap();
// ... loop back and create & submit next encoder for current frame
but that didn't work either.
Good questions!
is execution order guaranteed to be the same as the order of the command encoder constructions?
I believe that is the case. But I checked and the spec is actually unclear about this. I filed https://github.com/gpuweb/gpuweb/issues/3809 to fix this.
Further, I believe the intent is that all memory accesses (e.g. to storage buffers) from one GPU command will complete before the next GPU command begins. So the effect of any writes in one command will be visible in the next command (read-after-write hazard). Also, a write in a later command will not be visible in an earlier command (write-after-read hazard).
do storageBarrier() and atomics work across all invocations in a single frame or just the compute pass?
Another good question. storageBarrier() only works within a single workgroup. This may be surprising, but is due to a limitation in some platforms.
For further details, see https://github.com/gpuweb/gpuweb/issues/3774
This will be a FAQ because it is surprising, and subtle!
Update: I suspect the bad behaviour you're seeing is that storageBarrier() does not work across workgroups. It's a limitation in Metal.

How to use child kernels (CUDA dynamic parallelism) using PyCUDA

My Python code has a GPU kernel function which is called multiple times in a for loop from the host, like this:
for i in range(n):
    gpu_kernel_func(blocksize, grid)
Since each call requires communication between the host and the GPU device, which is not efficient, I want to turn it into
gpu_kernel_function() {
    for (...) {
        // computation
    }
}
But this requires an extra step to make sure all the blocks in the grid are in sync. According to dynamic parallelism, launching a dummy child kernel should ensure that every thread (in the whole grid) finishes that child kernel before the code continues running. So I defined another kernel just like gpu_kernel_function and tried this:
GPUcode = '''
__global__ void gpu_kernel_function() { ... }
__global__ void dummy_child_kernel() { ... }
'''
__global__ void gpu_kernel_function() {
    for (...) {
        // computation
    }
    dummy_child_kernel<<<1, 1>>>();  // dummy child kernel launch
}
But I am getting this error " nvcc fatal : Option '--cubin (-cubin)' is not allowed when compiling for a virtual compute architecture "
I am using a Tesla P100 (compute capability 6.0), Python 3.5 and CUDA 8.0.44. I am compiling my SourceModule like this:
mod = SourceModule(GPUcode, options=['-rdc=true', '-lcudart', '-lcudadevrt', '--machine=64'], arch='compute_60')
I tried compute_35 too, which gives the same error.
The error message is explicitly telling you what the issue is: compute_60 is a virtual architecture. You can't statically compile virtual architectures to machine code; they are intended for producing PTX (virtual machine assembler) for JIT translation to machine code by the runtime. PyCUDA compiles code to a binary payload ("cubin") using the CUDA toolchain and then loads it via the driver API into the CUDA context. Thus the error.
You can fix the error by specifying a valid physical GPU target architecture. So you should modify the source module constructor call to something like this:
mod = SourceModule(GPUcode,
                   options=['-rdc=true', '-lcudart', '-lcudadevrt', '--machine=64'],
                   arch='sm_60')
This should fix the compiler error.
However, note that using dynamic parallelism requires device code linkage, and I am 99% sure that PyCUDA still doesn't support this, so you likely won't be able to do what you are asking about via a SourceModule. You could link your own cubin by hand using the compiler outside of PyCUDA and then load that cubin inside PyCUDA. You will find many examples of how to compile dynamic parallelism correctly if you search for them.
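As a rough sketch of that workaround (the file name gpu_kernels.cubin and the block/grid sizes are placeholders; the cubin must already have been built and device-linked with nvcc's separate-compilation flags, which is not shown here):
import pycuda.autoinit            # creates a CUDA context on the default device
import pycuda.driver as cuda

# Load a cubin that was compiled and device-linked outside of PyCUDA.
mod = cuda.module_from_file("gpu_kernels.cubin")
kernel = mod.get_function("gpu_kernel_function")

# Launch it as usual; block and grid sizes here are purely illustrative.
kernel(block=(256, 1, 1), grid=(64, 1))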

How to find TLS segments of the current thread on linux amd64?

I'm looking for a way to find out the memory addresses of the TLS segments of the current thread on Linux, amd64. Bonus point for a solution that works on OSX.
I looked into various language runtimes and GCs (like Boehm), but couldn't get through the multiple layers of abstraction they use to support all kinds of systems. Any help appreciated.
Did you have a look at the solution Martin and I came up with in druntime?
What we do there boils down to scanning the segments in the corresponding dl_phdr_info (obtained by looking for the correct one using dl_iterate_phdr) for the segment with type PT_TLS, and storing its module id and size.
You can then get the start of the address range on the current thread by calling __tls_get_addr for offset 0 and the module id (there is an offset on some archs), and the end by simply adding the size you determined to that. If you do not need to support shared libraries, you can also simply use fs/gs on x86 for that (might be required if you want to link a static executable).
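A minimal C sketch of that approach for Linux/x86-64 with glibc (the tls_index layout and the zero offset are assumptions that happen to hold on this platform; other architectures may need the arch-specific bias mentioned above):
#define _GNU_SOURCE
#include <link.h>   /* dl_iterate_phdr, struct dl_phdr_info, PT_TLS */
#include <stdio.h>

/* glibc's TLS descriptor: module id + offset (assumed layout). */
typedef struct { size_t ti_module; size_t ti_offset; } tls_index;
extern void *__tls_get_addr(tls_index *ti);

static int callback(struct dl_phdr_info *info, size_t size, void *data) {
    for (int i = 0; i < info->dlpi_phnum; ++i) {
        if (info->dlpi_phdr[i].p_type == PT_TLS) {
            tls_index ti = { info->dlpi_tls_modid, 0 };
            void  *start = __tls_get_addr(&ti);        /* this thread's TLS block */
            size_t len   = info->dlpi_phdr[i].p_memsz;
            printf("%s: TLS %p .. %p (%zu bytes)\n",
                   info->dlpi_name[0] ? info->dlpi_name : "main executable",
                   start, (char *)start + len, len);
        }
    }
    return 0;  /* keep iterating over all loaded objects */
}

void printTLSRanges(void) {
    dl_iterate_phdr(callback, NULL);
}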
This works for Linux and FreeBSD (and probably other ELF platforms), but not OS X. There, the best I could come up with so far is this:
void _d_dyld_getTLSRange(void* arbitraryTLSSymbol, void** start, size_t* size) {
    dyld_enumerate_tlv_storage(
        ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
            assert(state == dyld_tlv_state_allocated);
            if (info->tlv_addr <= arbitraryTLSSymbol &&
                arbitraryTLSSymbol < (info->tlv_addr + info->tlv_size)) {
                // Found the range we are looking for.
                *start = info->tlv_addr;
                *size = info->tlv_size;
            }
        }
    );
}
The naive implementation currently used in LDC's druntime does not quite handle shared libraries, though, and dyld_enumerate_tlv_storage is from dyld_priv.h, which might or might not be a problem for App Store publishing.
On Linux, the thread-specific segment is set up via an arch_prctl(ARCH_SET_FS, <addr>) call. You can find out what it was set to in the current thread via arch_prctl(ARCH_GET_FS, ...).
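A quick sketch of the latter (glibc provides no wrapper for arch_prctl, so it goes through syscall(2)):
#define _GNU_SOURCE
#include <asm/prctl.h>    /* ARCH_GET_FS */
#include <sys/syscall.h>  /* SYS_arch_prctl */
#include <unistd.h>
#include <stdio.h>

int main(void) {
    unsigned long fs_base = 0;
    /* On x86-64 the FS base points at the current thread's TCB/TLS area. */
    if (syscall(SYS_arch_prctl, ARCH_GET_FS, &fs_base) != 0) {
        perror("arch_prctl(ARCH_GET_FS)");
        return 1;
    }
    printf("FS base of this thread: %#lx\n", fs_base);
    return 0;
}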
Bonus point for a solution that works on OSX.
OSX is a completely different OS and uses a completely different mechanism for its TLS support.

Function graph (timestamped entry and exit) for user, library and kernel space in Linux?

I'm writing this more or less in frustration - but who knows, maybe there's a way for this too...
I would like to analyze what happens with a function from ALSA, say snd_pcm_readi; for that purpose, let's say I have prepared a small testprogram.c, where I have this:
void doCapture() {
    ret = snd_pcm_readi(handle, buffer, period_size);
}
The problem with this function is that it eventually (should) hook into snd_pcm_readi in the shared system library /usr/lib/libasound.so; from there, I believe via ioctl, it would somehow communicate to snd_pcm_read in the kernel module /lib/modules/$(uname -r)/kernel/sound/core/snd-pcm.ko -- and that should ultimately talk to whatever .ko kernel module which is a driver for a particular soundcard.
Now, with the organization like above, I can do something like:
valgrind --tool=callgrind --toggle-collect=doCapture ./testprogram
... and then kcachegrind callgrind.out.12406 does indeed reveal a relationship between snd_pcm_readi, libasound.so and an ioctl (I cannot get the same information to show with callgrind_annotate), so that somewhat covers userspace; but that is as far as it goes. Moreover, it only produces a call graph, that is to say general caller/callee relationships between functions (possibly with a count of the samples/ticks each function spent working while scheduled).
However, what I would like to get instead, is something like the output of the Linux ftrace tracer called function_graph, which provides a timestamped entry and exit of traced kernel functions... example from ftrace: add documentation for function graph tracer [LWN.net]:
$ cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
#     TIME        CPU  DURATION                  FUNCTION CALLS
#      |          |     |   |                     |   |   |   |
 2105.963678 |   0)               |  mutex_unlock() {
 2105.963682 |   0)   5.715 us    |    __mutex_unlock_slowpath();
 2105.963693 |   0) + 14.700 us   |  }
 2105.963698 |   0)               |  dnotify_parent() {
(NB: newer ftrace documentation seems not to show a timestamp first for function_graph, only the duration - but I think it's still possible to enable that)
With ftrace, one can filter so that only functions in a given kernel module are traced - so in my case, I could add the functions of snd-pcm.ko and of whatever .ko module is the driver for the soundcard, and I'd have whatever I find interesting in kernel space covered. But then I lose the link to the user-space program (unless I explicitly write markers to /sys/kernel/debug/tracing/trace_marker from my user-space .c files - trace_printk itself is only available in kernel code).
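For reference, the kernel-side setup I have in mind would be roughly this (the option and :mod: filter names are taken from the ftrace documentation; I'm assuming snd-pcm.ko registers itself under the module name snd_pcm):
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function_graph > current_tracer
echo ':mod:snd_pcm' > set_ftrace_filter    # only trace functions from snd-pcm.ko
echo funcgraph-abstime > trace_options     # prefix entries with an absolute timestamp
echo 1 > tracing_on
# ... run ./testprogram here ...
echo 0 > tracing_on
cat trace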
Ultimately, what I'd like, is to have the possibility to specify an executable, possibly also library files and kernel modules - and obtain a timestamped function graph (with indented/nested entry and exit per function) like ftrace provides. Are there any alternatives for something like this? (Note I can live without the function exits - but I'd really like to have timestamped function entries)
As a PS: it seems I actually found something that fits the description, which is the fulltrace application/script:
fulltrace [andreoli @ GitHub]
fulltrace traces the execution of an ELF program, providing as output a full trace of its userspace, library and kernel function calls. ...
(prerequisites) the following kernel configuration options and their dependencies must be set as enabled (=y): FTRACE, TRACING_SUPPORT, UPROBES, UPROBE_EVENT, FUNCTION_GRAPH_TRACER.
Sounds perfect - but the problem is, I'm on Ubuntu 11.04, and while its 2.6.38 kernel luckily has CONFIG_FTRACE=y enabled, its /boot/config-`uname -r` doesn't even mention UPROBES :/ And since I'd like to avoid kernel hacking, unfortunately I cannot use this script...
(Btw, if UPROBES were available, then as far as I understand one would set a trace probe on a symbol address (as obtained from, say, objdump -d), and the output would again go to /sys/kernel/debug/tracing/trace - so some custom solution would have been possible using UPROBES even without the fulltrace script.)
So, to narrow down my question a bit - is there a solution, that would allow simultaneous user-space (incl. shared libraries) and kernel-space "function graph" tracing, but where UPROBES are not available in the kernel?

Size of an OpenGL context

Is there a way to get the size of an OpenGL context, or at least to estimate its size? If yes, how?
I have an application in GLUT which creates several windows. Since GLUT doesn't share OpenGL contexts between windows, every window is going to create a new one. Now, I am trying to reduce the memory needed, since this is for an embedded system. But if an OpenGL context is small enough to be negligible, then I am not going to see a big reduction in memory usage.
I have found this patch to create windows with a shared OpenGL context:
A small addendum for Windows users (by Misbah Qidwai): I added this subroutine to glut_win.c. I use this routine to call wglShareLists().
//MQ
/* CENTRY */
GLXContext APIENTRY
glutGetWindowRenderContext(int win)
{
    GLUTwindow *window;

    if (win < 1 || win > __glutWindowListSize) {
        __glutWarning("glutSetWindow attempted on bogus window.");
        return NULL;
    }
    window = __glutWindowList[win - 1];
    if (!window) {
        __glutWarning("glutSetWindow attempted on bogus window.");
        return NULL;
    }
    return window->renderCtx;
}
An OpenGL context is an abstract thing. The amount of data backing a particular context can be as small as a pointer or as big as a few megabytes. The context itself is not some kind of data structure; it's merely a handle shared by your program and the graphics system so that each side knows what the other is talking about.
The only way to know in a particular configuration is to measure it.
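If you are on a Linux-based system, one rough way to measure it is to compare the process memory before and after creating the windows; a minimal sketch, assuming the resident set size from /proc/self/status is a good enough proxy (driver allocations that live only on the GPU or outside your process will not show up here):
#include <GL/glut.h>
#include <stdio.h>
#include <string.h>

/* Read this process's resident set size in kB from /proc/self/status. */
static long rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    if (f) fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);

    long before = rss_kb();
    glutCreateWindow("one");    /* first window, first context */
    long one = rss_kb();
    glutCreateWindow("two");    /* second window, second context */
    long two = rss_kb();

    printf("RSS: %ld kB -> %ld kB -> %ld kB\n", before, one, two);
    return 0;
}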
