multi-threading not working in aomedia av1 build with visual studio - multithreading

I managed to build the AV1 codec (https://aomedia.googlesource.com/aom/) with Visual Studio thanks to the command
cmake path/to/aom -G "Visual Studio 15 2017 Win64"
however the encoder only uses one thread, even with the --threads option; in the code it seems to use pthreads. Do I need a different build with pthread emulation, or am I missing a flag to enable multi-threading on Windows 10 64-bit for this codec?

I got the AOM AV1 encoder to work in parallel by adding tiles: --tile-columns=1 --tile-rows=0, where --tile-columns and --tile-rows denote the log2 of the number of tile columns and rows.
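For example, reading those flags as base-2 logarithms, --tile-columns=2 --tile-rows=1 would request 2^2 = 4 tile columns and 2^1 = 2 tile rows, i.e. up to 8 tiles the encoder can work on in parallel (assuming the resolution is high enough to be split that way).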

Unlike in x264, the --threads option only limits thread usage. To enable threading, set --row-mt=1, and optionally use tiling, which threads very well but reduces coding efficiency quite a bit at lower resolutions.
Loop restoration hurts threading, so for better threading you should disable it, especially when using many tiles: --enable-restoration=0
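Putting those flags together, an invocation along these lines should get the encoder to thread (illustrative only; the input/output file names are placeholders, and flag spellings may differ between aomenc versions, so check aomenc --help for your build):
aomenc --threads=8 --row-mt=1 --tile-columns=2 --tile-rows=1 --enable-restoration=0 -o output.webm input.y4m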

hmm, well, AV1 is still missing multi-threading; it should arrive once they get around to finishing the spec lol

Related

How to compile boost faster?

I'm using the following command on Win7 x64
.\b2 --cxxflags=/MP --build-type=complete
also tried
.\b2 --cxxflags=-MP --build-type=complete
However, cl.exe is still using only one of the 8 cores of my system. Any suggestions?
Make the compilation parallel at the build-tool level, not per translation unit, with
.\b2 -j8
or similar (if you have n cores, -j(n+1) is often used).
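Combined with the original invocation, that would look something like the following (illustrative; -j8 assumes 8 cores):
.\b2 -j8 --build-type=complete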
Turns out Malwarebytes was the culprit. It was slowing down the compilation by scanning newly generated files and memory. I turned it off, and now I'm sometimes seeing 50% utilization (4 cores). It's still between 5% and 14% most of the time, though.

Writing a CUDA program for more than one GPU

I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically, utilizing the resources of all available GPUs for the program?
Ideally there would be a utility that periodically reports the available resources, and my program would launch as many threads onto the GPUs as those resources allow.
Secondly, I am using Windows + Visual Studio for my development. I have read that CUDA is supported on Linux; what changes do I need to make in my program?
I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically?
For arbitrary kernels that you write, there is no API that I am aware of (certainly no CUDA API) that "automatically" makes use of multiple GPUs. Today's multi-GPU aware programs often use a strategy like this:
detect how many GPUs are available
partition the data set into chunks based on the number of GPUs available
successively transfer the chunks to each GPU, and launch the computation kernel on each GPU, switching GPUs using cudaSetDevice().
A program that follows approximately the above approach is the CUDA simpleMultiGPU sample code. Once you have worked out the methodology for 2 GPUs, it's not much additional effort to go to 4 or 8 GPUs. This of course assumes your work is already separable and the data/algorithm partitioning work is "done".
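A minimal host-side sketch of that pattern follows; myKernel, runOnAllGpus and the doubling of the data are hypothetical placeholders, the GPUs are processed one after another for clarity (the simpleMultiGPU sample additionally overlaps the transfers and kernels with streams and cudaMemcpyAsync), and error checking is omitted:
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)      // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                   // placeholder work
}

void runOnAllGpus(float *hostData, int n)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);             // 1. detect available GPUs
    if (deviceCount == 0) return;

    int chunk = n / deviceCount;                  // 2. partition the data set
    for (int d = 0; d < deviceCount; ++d) {
        int offset = d * chunk;
        int count  = (d == deviceCount - 1) ? n - offset : chunk;

        cudaSetDevice(d);                         // 3. switch to GPU d
        float *devData = 0;
        cudaMalloc((void **)&devData, count * sizeof(float));
        cudaMemcpy(devData, hostData + offset, count * sizeof(float),
                   cudaMemcpyHostToDevice);
        myKernel<<<(count + 255) / 256, 256>>>(devData, count);
        cudaMemcpy(hostData + offset, devData, count * sizeof(float),
                   cudaMemcpyDeviceToHost);       // also synchronizes on the kernel
        cudaFree(devData);
    }
}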
I think this is an area of active research in many places, so if you do a google search you may turn up papers like this one or this one. Whether these are of interest to you will probably depend on your exact needs.
There are some new developments with CUDA libraries available with CUDA 6 that can perform certain specific operations (e.g. BLAS, FFT) "automatically" using multiple GPUs. To investigate this further, review the relevant CUBLAS XT documentation and CUFFT XT multi-GPU documentation and sample code. As far as I know, at the current time, these operations are limited to 2 GPUs for automatic work distribution. And these allow for automatic distribution of specific workloads (BLAS, FFT) not arbitrary kernels.
Secondly, I am using Windows + Visual Studio for my development. I have read that CUDA is supported on Linux; what changes do I need to make in my program?
With the exception of the OGL/DX interop APIs, CUDA is mostly orthogonal to the choice of Windows or Linux as a platform. The typical IDEs are different (Windows: Nsight Visual Studio Edition, Linux: Nsight Eclipse Edition), but your code changes will mostly consist of ordinary porting differences between Windows and Linux. If you want to get started with Linux, follow the getting started document.

GCC vs. Visual Studio run time differences

I have written C++ code for a vehicle routing project. On my Dell laptop I have both Ubuntu and Windows 7 installed. When I run my code built with gcc on the Ubuntu side, it runs at least 10x faster than the exact same code built with Visual C++ 2010 on Windows (both on the same machine). This is not just one particular program; it happens for almost every C++ program I have been using.
I am assuming there is an explanation for such a large difference in runtimes and for why gcc outperforms Visual C++ here. Could anyone enlighten me?
Thanks.
In my experience, both compilers are fairly equal, but you have to watch out for a few things:
1. Visual Studio defaults to stack checking on, which means that every function starts with a small amount of "memset" and ends with a small amount of "memcmp". Turn that off if you want performance - it's great for catching when you write to the 11th element of a ten-element array.
2. Visual Studio does buffer overflow checking. Again, this can add a significant amount of time to the execution.
See: Visual Studio Runtime Checks
I believe these are normally enabled in debug mode, but not in release builds, so you should get similar results from release builds and -O2 or -O3 optimized builds on gcc.
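To compare like with like, build both sides with optimizations on, for example with command lines along these lines (illustrative; routing.cpp stands in for your source file, and the same effect is achieved in the IDE by selecting the Release configuration):
cl /EHsc /O2 routing.cpp
g++ -O3 -o routing routing.cpp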
If this doesn't help, then perhaps you can give us a small (compilable) example, and the respective timings.

Visual C++ Express 10 using too much memory

I use Process Explorer (a Microsoft tool) on Windows XP, and the "physical memory" is often filled to its maximum (3 GB) while I use Visual C++. At that point all my programs become slow and unresponsive, and when things return to normal, nearly half of the available memory comes back! What is wrong?
I'm programming a project with Ogre3D; maybe I can deactivate some options in Visual C++. What exactly is it caching that eats that much memory?
Apparently MSVC is designed to work on big machines; there are many settings under Text Editor -> C++ to trim some of that weight, but my guess is that Windows XP plus recent Microsoft apps simply don't play nicely together.

Tips to reduce OpenGL 3 & 4 frame-rate stuttering under Linux

In recent years I've developed several small games and applications for OpenGL 2 and ES.
I'm now trying to build a scene graph based on OpenGL 3+ for casual "3D" graphics on desktop systems. (Nothing as complex as the Unreal or Crytek engine in mind.)
I started my development on OS X 10.7 and was impressed by Apple's recent OpenGL 3.2 release, which achieves results equivalent to Windows systems.
The results on Linux, however, are a disappointment. Even the most basic animation stutters and destroys the impression of reality. The results did not differ between the windowing toolkits freeglut and GLFW. (Extensions are loaded with GLEW 1.7.)
I would like to mention that I'm talking about the new OpenGL core profile, not the old OpenGL 2 render path, which works fine under Linux but uses the CPU instead of the GPU for complex operations.
After watching professional demos like the "Unigine Heaven" demo, I think there is a general problem with using modern real-time 3D graphics on Linux.
Any suggestions to overcome this problem are very welcome.
UPDATE:
I'm using:
AMD Phenom II X6, Radeon HD 57xx with the latest proprietary drivers (11.8) and Unity (64-bit).
My render loop is taken from the toolkit documentation:
do {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    ...
} while (!glfwGetKey(GLFW_KEY_ESC) && glfwGetWindowParam(GLFW_OPENED));
I'm using VBOs, and all transformation stages are done with shaders. Animation timing is done with glfwGetTime(). The problem occurs in both windowed and full-screen mode. I don't know whether a compositing manager interferes with full-screen applications, but it is also impossible to demand that users disable it.
Update 2:
Typo: I'm using a HD57XX card.
GLX_Extension: http://snipt.org/xnKg
PCI Dump: 01:00.0 VGA compatible controller: ATI Technologies Inc Juniper [Radeon HD 5700 Series]
X Info: http://pastie.org/2507935
Update 3:
Disabling the compositing manager reduced, but did not completely remove, the stuttering.
(I replaced the standard window manager with "Ubuntu Classic without extensions".)
Once a second the animation freezes and ugly distortions appear:
(Image removed - Not allowed to post Images.)
This happens even though vertical synchronisation is enabled in the driver and verified in the application.
Since you're running Linux, we need a bit of detailed information:
Which hardware do you use?
Only NVidia, AMD/ATI and Intel offer 3D acceleration so far.
Which drivers?
For NVidia and AMD/ATI there are proprietary drivers (nvidia-glx, fglrx) and open source drivers (nouveau, radeon). For Intel there are only the open source drivers.
Of all the open source 3D drivers, the Intel drivers offer the best quality.
The open source AMD/ATI drivers, "radeon", have reached an acceptable state, but are still not on par performance-wise.
For NVidia GPUs, the only drivers that make sense to use productively are the proprietary ones. The open source "nouveau" drivers simply don't cut it yet.
Do you run a compositing window manager?
Compositing creates a whole bunch of synchronization and timing issues. Also, some of the OpenGL code you can find in compositing WMs brings tears to the eyes of a seasoned OpenGL coder, especially one with experience writing real-time 3D (game) engines.
KDE4 and GNOME3 use compositing by default, if available. The same holds for the Ubuntu Unity desktop shell. Also, for some non-compositing WMs the default scripts start xcompmgr for transparency and shadow effects.
And last but not least: How did you implement your rendering loop?
A mistake often found is using a timer to issue redisplay events at "regular" intervals. This is not how it's done properly: timer events can be delayed arbitrarily, and the standard timers are not very accurate by themselves either.
The proper way is to call the display function in a tight loop, measure the time between rendering iterations, and use that timing to advance the animation accordingly. A truly elegant method is to use one of the VSync extensions that delivers the display refresh frequency and a refresh counter; that way, instead of using a timer, you are told exactly how much time has advanced between frames, in display refresh cycle periods.
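A minimal sketch of such a loop in the GLFW 2 style of the snippet above, measuring the frame-to-frame delta with glfwGetTime(); advanceAnimation() and renderScene() are hypothetical placeholders for your own animation and drawing code:
double previous = glfwGetTime();
do {
    double now = glfwGetTime();
    double dt  = now - previous;   // seconds elapsed since the last iteration
    previous   = now;

    advanceAnimation(dt);          // hypothetical: step the animation state by dt
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    renderScene();                 // hypothetical: issue the draw calls
    glfwSwapBuffers();             // blocks on the display refresh when the swap interval is 1
} while (!glfwGetKey(GLFW_KEY_ESC) && glfwGetWindowParam(GLFW_OPENED));
With a swap interval of 1, dt settles at roughly whole display refresh periods, which is essentially the information the VSync extensions hand you directly.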
