I'm currently working on an implementation of the parallel BVH construction by Karras (2012).
The algorithm creates every node in parallel and calculates its children (this is the kernel). The kernel writes the node's children and, in each child node, the parent reference.
My kernel code works with completely random scenes of up to 41 triangles. If I go higher, in some cases a single node (sometimes 2 or 3) doesn't have any values written to it except for the parent value.
I have already double-checked groupSizes and made sure that the kernel only operates on the right number of items.
Does anyone have an idea what could go wrong here? Is there a way to see whether a single thread dies or gets killed?
NB: If requested I will post some code, but it's generated code from an F# to OpenCL compiler.
Let's suppose I have a big memory buffer used as a framebuffer, which is constantly written to by a thread (or even multiple threads; it is guaranteed that no two threads write the same byte concurrently). These writes are nondeterministic in time, scattered through the codebase, and cannot be blocked.
I have another single thread which periodically reads out (copies) the whole buffer to generate a display frame. This read should not be blocked either. Tearing is not a problem in my case. In other words, my only goal is that every change made by the writer thread(s) should eventually appear in the reading thread. The ordering, or some delay (negligible compared to the display refresh rate), does not matter.
Reading and writing the same memory location concurrently is a data race, which results in undefined behavior in C++11, and this article lists some really dreadful examples where the optimizing compiler generates code for a memory read that alters the memory contents in the presence of a data race.
Still, I need a solution that does not require completely redesigning this legacy code. Any advice that is safe from a practical standpoint counts, regardless of whether it is theoretically correct. I am also open to not-fully-portable solutions.
Aside from the data race itself, I can easily force the visibility of the buffer changes in the reading thread by establishing a synchronizes-with relation between the threads (acquire-release on an atomic guard variable, used for nothing else), or by adding platform-specific memory fence calls at key points in the writer thread(s).
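For illustration, here is a minimal sketch of the guard-variable idea only. All names (guard, write_frame, read_frame, demo) are hypothetical, and the demo deliberately makes the reader wait until the writer has finished, so this sketch is race-free by construction. It shows the visibility mechanism and nothing more; it does not address the concurrent read/write race described above.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical guard variable, used for nothing but synchronization.
std::atomic<unsigned> guard{0};

// Writer side: plain stores to the buffer, then a release store on the guard.
void write_frame(std::vector<std::uint8_t>& fb, std::uint8_t value) {
    for (std::size_t i = 0; i < fb.size(); ++i) fb[i] = value;
    guard.store(1, std::memory_order_release);
}

// Reader side: once the acquire load observes the release store, every write
// sequenced before that store is visible to the copy that follows.
std::vector<std::uint8_t> read_frame(const std::vector<std::uint8_t>& fb) {
    while (guard.load(std::memory_order_acquire) == 0) std::this_thread::yield();
    return fb;  // copies the whole buffer
}

// Runs one writer and one reader and returns the reader's snapshot.
std::vector<std::uint8_t> demo(std::size_t size, std::uint8_t value) {
    guard.store(0, std::memory_order_relaxed);
    std::vector<std::uint8_t> fb(size, 0);
    std::vector<std::uint8_t> snapshot;
    std::thread reader([&] { snapshot = read_frame(fb); });
    std::thread writer([&] { write_frame(fb, value); });
    writer.join();
    reader.join();
    return snapshot;
}
```

Calling demo(1024, 0xAB) should hand back a snapshot in which every byte is 0xAB, because the reader copies only after it has seen the writer's release store.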
My ideas to target the data race:
1. Use assembly for the reading thread. I would try to avoid that.
2. Make the memory buffer volatile, thus preventing the compiler from performing the nasty optimizations described in the referenced article.
3. Put the reading thread's code in a separate compilation unit, and compile it with -O0.
+1. Leave everything as is, and cross my fingers (as I currently do not notice any issues) :)
Which is the safest option from the list above? Do you see a better solution?
FYI, the target platform is ARM (with multiple cores) and x86 (for testing).
(This question is a more concrete version of a previous one that was a little too generic.)
I have created a tbb::task_group and added multiple tasks to it. In the end I wait() for the tasks to complete. I was profiling the code and saw that the number of threads used by my application has increased (as visible in the Windows Task Manager). However, when the tbb::task_group object is destructed, the thread count does not decrease.
Additionally if I call the same code block again (without restarting the application), the number of threads sometimes increases and sometimes not.
Is this expected behavior? If yes, how can I make sure the threads created previously are reused?
Yes, this is expected behavior. It is done specifically to reuse threads between parallel algorithms. You can verify it by marking threads with thread-local variables (TBB provides the combinable class) or by looking into the callbacks of task_scheduler_observer.
TBB always, but lazily, creates the number of threads specified at initialization time, even if you run only a single task. By default, the number of TBB worker threads equals the number of hardware threads (cores*HT) minus one for the application thread.
BTW, I would not recommend using tbb::task, which is for advanced cases; check out tbb::parallel_invoke or tbb::task_group first, which are high-level interfaces to tasks. Or even better, see whether your algorithm can be expressed at an even higher level using things like parallel_for, parallel_reduce (possibly with a custom Range), parallel_pipeline, flow::graph, etc.
In Apple's documentation, I read this:
1 — "Shared contexts share all texture objects, display lists, vertex programs, fragment programs, and buffer objects created before and after sharing is initiated."
2 — "Contexts that are on different threads can share object resources. For example, it is acceptable for one context in one thread to modify a texture, and a second context in a second thread to modify the same texture. The shared object handling provided by the Apple APIs automatically protects against thread errors."
So I expected to be able to create my buffer objects once, then use them to render simultaneously on multiple contexts. However if I do that, I get crashes on my NVIDIA GeForce GT 650M with backtraces like this:
Crashed Thread: 10 Dispatch queue: com.apple.root.default-qos
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: EXC_I386_GPFLT
…
Thread 10 Crashed:: Dispatch queue: com.apple.root.default-qos
0 GLEngine 0x00007fff924111d7 gleLookupHashObject + 51
1 GLEngine 0x00007fff925019a9 gleBindBufferObject + 52
2 GLEngine 0x00007fff9243c035 glBindBuffer_Exec + 127
I've posted my complete code at https://gist.github.com/jlstrecker/9df10ef177c2a49bae3e. At the top, there's #define SHARE_BUFFERS — when commented out it works just fine, but uncommented it crashes.
I'm not looking to debate whether I should be using OpenGL 2.1 — it's a requirement of other software I'm interfacing with. Nor am I looking to debate whether I should use GLUT — my example code just uses that since it's included on Mac and doesn't have any external dependencies. Nor am I looking for feedback on performance/optimization.
I'd just like to know if I can expect to be able to simultaneously render from a single shared buffer object on multiple contexts — and if so, why my code is crashing.
We also ran into the 'gleLookupHashObject' crash and made a small repro-case (very similar to yours) which was posted in an 'incident' to Apple support. After investigation, an Apple DTS engineer came back with the following info, quoting:
"It came to my attention that glFlush() is being called on both the main thread and also a secondary thread that binds position data. This would indeed introduce issues and, while subtle, actually does indicate that the constraints we place on threads and GL contexts aren’t being fully respected.
At this point it behoves you to either further investigate your implementation to ensure that such situations are avoided or, better yet, extend your implementation with explicit synchronization mechanisms (such as what we offer with GCD). "
So if you run into this crash you will need to do explicit synchronization on the application side (pending a fix on the driver-side).
Summary of relevant snippets related to "OpenGL, Contexts and Threading" from the official Apple Documentation:
[0] Section: "Use Multiple OpenGL Contexts"
If your application has multiple scenes that can be rendered in parallel, you can use a context for each scene you need to render. Create one context for each scene and assign each context to an operation or task. Because each task has its own context, all can submit rendering commands in parallel.
https://developer.apple.com/library/mac/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_threading/opengl_threading.html#//apple_ref/doc/uid/TP40001987-CH409-SW6
[1] Section: Guidelines for Threading OpenGL Applications
(a) Use only one thread per context. OpenGL commands for a specific context are not thread safe. You should never have more than one thread accessing a single context simultaneously.
(b) Contexts that are on different threads can share object resources. For example, it is acceptable for one context in one thread to modify a texture, and a second context in a second thread to modify the same texture. The shared object handling provided by the Apple APIs automatically protects against thread errors. And, your application is following the "one thread per context" guideline.
https://developer.apple.com/library/mac/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_threading/opengl_threading.html
[2] OpenGL Restricts Each Context to a Single Thread
Each thread in an OS X process has a single current OpenGL rendering context. Every time your application calls an OpenGL function, OpenGL implicitly looks up the context associated with the current thread and modifies the state or objects associated with that context.
OpenGL is not reentrant. If you modify the same context from multiple threads simultaneously, the results are unpredictable. Your application might crash or it might render improperly. If for some reason you decide to set more than one thread to target the same context, then you must synchronize threads by placing a mutex around all OpenGL calls to the context, such as gl* and CGL*. OpenGL commands that block, such as fence commands, do not synchronize threads.
https://developer.apple.com/library/mac/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_threading/opengl_threading.html
I was going through topics in Operating Systems using the textbook by Galvin (the 9th edition). In Chapter 4 on multi-threading, I came across problem 14, which is as follows:
A system with two dual-core processors has four processors available for scheduling. A CPU-intensive application is running on this system. All input is performed at program start-up, when a single file must be opened. Similarly, all output is performed just before the program terminates, when the program results must be written to a single file. Between startup and termination, the program is entirely CPU-bound. Your task is to improve the performance of this application by multithreading it. The application runs on a system that uses the one-to-one threading model (each user thread maps to a kernel thread).
• How many threads will you create to perform the input and output? Explain.
• How many threads will you create for the CPU-intensive portion of the application? Explain.
For the first part, I think we could create 4 threads for reading the input from a file as well as for writing the output to a file. This is because the data is not being updated during either input or output.
For the second part, the nature of the operation to be carried out on the data is not known. For example, it could be (1) printing the average of all the data, or (2) printing the average of the first and last data points, then the average of the second and second-to-last data points, and so on.
Therefore, for the second part, one thread could be employed to handle the operation.
But I am not sure that the answer I gave here is right, so I would be very grateful if you could let me know the right answer.
The question is testing if you understand some principles about parallelizing work to increase speed. Some of these principles are:
In the usual case, reading and writing a single file cannot be sped up using multiple cores. The speed of file I/O is determined by the properties of where and how the file is stored. Throwing more threads at it is not going to help, because those threads are just going to be waiting for the I/O to complete.
How many threads you use for the CPU-intensive portion depends entirely on what is being computed. If the program is generating imagery for a movie, use 4 threads because that is completely parallel. If the workload is entirely serial, use 1 thread because adding more threads won't help (by definition).
Computing the averages in your example is almost completely parallel, so you should use four threads, not one.
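To make that concrete, here is a minimal sketch of computing an average with four threads. The function name parallel_average and the chunking scheme are my own choices for illustration, not something taken from the textbook.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Average of `data`, with the summation split across `num_threads` threads.
// Each thread sums its own contiguous chunk into its own slot, so no locking
// is needed; the partial sums are combined serially at the end.
double parallel_average(const std::vector<double>& data, unsigned num_threads = 4) {
    if (data.empty()) return 0.0;
    num_threads = std::max(1u, num_threads);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();

    const double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    return total / static_cast<double>(data.size());
}
```

Only the final combination of the four partial sums is serial, which is why the work counts as almost completely parallel. The pairwise averages from your second example can be split across threads the same way, since each pair is independent of the others.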
Let me begin by saying I do not have in-depth knowledge of Perl, so please pardon me if there is something obvious that I have missed :)
In the system (running in a Windows environment) that I am looking at, we have a Perl process which has to download ~5000-6000 files. Since each file can be downloaded independently, we forked a separate thread for each file. The thread is supposed to download the file and die. On running the process, I noticed that the memory of the process goes up to ~1.7 GB and then the process dies due to the per-process memory limit.
On searching and asking a few people, I came across this concept of circular referencing due to which the garbage collector will not free up memory. I searched a bit and found the Devel-Cycle package which can find out if there are any cycles in the object. I got this package and added a line to check if the main object in the process has any cycles. find_cycle came back with the following statement for each thread.
DBD::Oracle::db FIRSTKEY failed: handle 2 is owned by thread 256004 not current thread c0ea29c (handles can't be shared between threads and your driver may need a CLONE method added) at C:/Program Files/Perl/site/lib/Devel/Cycle.pm line 151.
I got to know that DB handles cannot be shared between threads. I looked at the code again and realised that after the fork happens, the child process does actually create a new DB handle (which I guess is why the process still continues to run fine till it reaches the memory limit). I guess there might be more db handles from the parent in the object that are not used by the child but are still referenced.
Questions that I have:
Is the circular reference the only reason for the problem or could there be other issues causing the process to use so much memory?
Could the sharing of the handle cause the blow up in memory (in other words is the shared DB handle causing the GC to not free up space)?
If it is indeed the shared DB handle, I guess I can just say $dbHandle = 0 to get rid of the reference (if $dbHandle is referencing that particular handle). Am I correct here?
I am trying to go through the code to see where else there is a reference to the parent DB handle (and found at least one more reference). Is there any other way I can do this? Is there a method to print out all the properties of an object?
EDIT:
Not all threads (due to the Perl fork call on Windows) are spawned at the same time. The process spawns a maximum of n threads (where n is a configurable number). Once a thread has finished its execution, the process spawns another thread. At the moment n is set to 10; however, I had changed n to 1 (so only one extra thread is running at a time), and I still hit the memory limit.
Edit: It turns out this does not solve the OP's problem. It still might be helpful for a future reader.
We do not really know a lot about your situation, and your program sounds quite complex to me to just fork 6000 times. But I will still attempt to answer; please correct me if my assumptions are wrong.
It appears you are on Windows. It is important to note that Windows has no fork() system call. Since you specifically say that you "fork", I assume that you actually use that Perl function. On Windows, Perl emulates fork() as best it can, which basically means that all the forked processes you see are in fact just threads within the original process, pretending to be processes. To do this, they copy the complete interpreter state. See http://perldoc.perl.org/perlfork.html for more information. The following part especially seems to apply to you:
Resource limits
In the eyes of the operating system, pseudo-processes created via the fork() emulation are simply threads in the same process. This means that any process-level limits imposed by the operating system apply to all pseudo-processes taken together. This includes any limits imposed by the operating system on the number of open file, directory and socket handles, limits on disk space usage, limits on memory size, limits on CPU utilization etc.
If you fork that many pseudo-processes, you need a lot of memory, as you also have to copy the interpreter state that many times. Depending on the complexity of your program and how it is structured, that may very well be a non-trivial amount of memory.
And as http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx tells us, the 1.7 GB you mentioned is not far from the 2 GB that some Windows versions impose as the memory limit for a single process.
My wild guess would be that you in fact just hit that limit by spawning all those threads, each with its own copy of the interpreter state and everything else.
You will probably be a lot better off using some threading library instead of asking Perl to emulate individual processes for you. Needless to say (I hope), you do not really gain any advantage by having 6000 threads over, let's say, 16. If you try to have all of them do something at the same time, you will in fact most likely experience slowdowns, depending on how the threading is handled.
In addition to the comments already provided, I want to emphasize the point DeVadder made regarding the behavior of fork on Windows, and that Perl threading is likely a better solution. But are you sure that the DBD module is safe to use from multiple processes / forks / threads, etc. without setting some extra parameters?
I had a similar error when using the DBI module to access a SQLite DB in multi-processed code using the threads module. It was solved by setting the 'use_immediate_transaction' option to 1 on the database handle provided by DBI. If you aren't familiar with how Perl threads work: they aren't really threads; they create a copy of the interpreter and of everything you have in memory at the time of their creation. Even when I created the database handle separately in each "thread", I would get 'database locked' and various other errors. Without some of these extra options, DBD may not function correctly in a multiprocessed environment.
Also, why make 6000 forks? Use Thread::Queue and the threads module, make a pool of a few workers (one per core?), and recycle the workers. You are paying a lot of overhead on every fork for no gain.
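The worker-pool pattern itself is language-independent. Below is a rough sketch of it in C++ rather than Perl, just to show the shape: a fixed number of workers that keep claiming the next job until none are left. The name run_pool is hypothetical, and a shared atomic counter stands in for the queue (in Perl you would use Thread::Queue as suggested above).

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs `job(i)` for every i in [0, job_count) on a fixed pool of `workers`
// threads. The shared atomic counter plays the role of the work queue: each
// worker repeatedly claims the next unprocessed index until none are left.
// Returns how many jobs were executed.
std::size_t run_pool(std::size_t job_count, unsigned workers,
                     const std::function<void(std::size_t)>& job) {
    if (workers == 0) workers = 1;
    std::atomic<std::size_t> next{0};
    std::atomic<std::size_t> done{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= job_count) break;
                job(i);  // e.g. download file number i (hypothetical)
                done.fetch_add(1);
            }
        });
    }
    for (auto& t : pool) t.join();
    return done.load();
}
```

With 6000 jobs and, say, 4 workers, only 4 threads are ever created, and each job still runs exactly once. The per-thread setup cost (in the Perl case, a copy of the interpreter state) is paid a handful of times instead of 6000 times.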