DirectX 11 CreateVertexShader memory leak

Hi, I have a memory leak when creating and releasing a vertex shader.
Here is my compiled shader https://pastebin.com/raw/4w8tyY1n
And here is my pretty simple code: I just create a device and context, then a vertex shader, and then release everything, all in a loop.
HRESULT hr;
while (true)
{
    ID3D11Device* device;
    ID3D11DeviceContext* deviceCtx;
    ID3D11VertexShader* vertexShader;
    hr = D3D11CreateDevice(
        nullptr,
        D3D_DRIVER_TYPE_HARDWARE,
        nullptr,
        D3D11_CREATE_DEVICE_BGRA_SUPPORT,
        nullptr,
        0,
        D3D11_SDK_VERSION,
        &device,
        nullptr,
        &deviceCtx);
    if (SUCCEEDED(hr))
    {
        UINT Size = ARRAYSIZE(g_VS);
        hr = device->CreateVertexShader(g_VS, Size, nullptr, &vertexShader);
        if (SUCCEEDED(hr))
        {
            vertexShader->Release();
        }
        deviceCtx->Release();
        device->Release();
    }
}
I'm stuck on this; I have read every piece of MSDN documentation on it and I just don't know what the problem might be.

OK, so the problem was with the Intel® HD Graphics 620 drivers; a driver update fixed everything for me.
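(Unrelated to the driver fix, but for anyone reducing a similar leak: below is a minimal sketch of the same loop using Microsoft::WRL::ComPtr, so that Release() happens automatically at the end of each iteration and manual release mistakes can be ruled out. g_VS_size is a stand-in for ARRAYSIZE(g_VS), since the compiled bytecode array normally lives in the compiler-generated header.)

#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Stand-ins for the compiled shader bytecode from the question.
extern const BYTE g_VS[];
extern const UINT g_VS_size;

void leak_test()
{
    for (;;)
    {
        ComPtr<ID3D11Device> device;
        ComPtr<ID3D11DeviceContext> deviceCtx;
        ComPtr<ID3D11VertexShader> vertexShader;

        HRESULT hr = D3D11CreateDevice(
            nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr,
            D3D11_CREATE_DEVICE_BGRA_SUPPORT, nullptr, 0,
            D3D11_SDK_VERSION, &device, nullptr, &deviceCtx);

        if (SUCCEEDED(hr))
            device->CreateVertexShader(g_VS, g_VS_size, nullptr, &vertexShader);

        // All three interfaces are released here automatically when the
        // ComPtr objects go out of scope.
    }
}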

Related

C++ multi-thread memory leak issue

Once the following code is running, it eats all my memory and causes an OOM issue on my ARM 64-bit processor. It seems that some 'delete' operations do not work... it confuses me. Could any experts help analyze the root cause? Is it the toolchain or libc?
I ran the same code on two other platforms, one x64 and one ARM 32-bit chip; both are fine. Only my IPQ ARM chip has the problem.
environment:
libc version: musl libc 1.1.16
gcc version: 5.2.0
cpu chip: IPQ ARM 64bit
os: linux 4.1
#include <iostream>
#include <cstring>
#include <list>
#include <memory>
#include <pthread.h>

void *h(void *param)
{
    while(1)
    {
        char *p = new char[4096];
        delete[] p;
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t th[50];
    int thread_cnt = 2;
    for(int i = 0; i < thread_cnt; i++)
        pthread_create(th+i, NULL, h, NULL);
    for(int i = 0; i < thread_cnt; i++)
        pthread_join(th[i], NULL);
    return 0;
}
If I guard the allocation with a mutex in the thread callback, the issue disappears; there is no memory leak.
#include <mutex>

std::mutex thlock;

void *h(void *param)
{
    while(1)
    {
        std::lock_guard<std::mutex> lk(thlock);
        char *p = new char[4096];
        delete[] p;
    }
    return NULL;
}
If I change the allocation size to 1024, then a weird thing happens: 2 threads do not leak memory, but 20 threads do. Whatever the allocation size, there is no issue with only 1 thread.
void *h(void *param)
{
    while(1)
    {
        //char *p = new char[4096];
        char *p = new char[1024];
        delete[] p;
    }
    return NULL;
}
If I replace new/delete with std::malloc/std::free, the issue disappears; there is no memory leak.
void *h(void *param)
{
    while(1)
    {
        //char *p = new char[4096];
        char *p = (char *)std::malloc(4096);
        std::free(p);
        //delete[] p;
    }
    return NULL;
}
So I thought that if I replaced operator new/delete with versions that allocate memory using std::malloc/free, the issue might be solved. But the weird thing happened again: the memory leak is still there.
#include <cstdio>
#include <cstdlib>
#include <new>

void* operator new(std::size_t sz) // no inline, required by [replacement.functions]/3
{
    std::printf("global op new called, size = %zu\n", sz);
    if (sz == 0)
        ++sz; // avoid std::malloc(0) which may return nullptr on success
    if (void *ptr = std::malloc(sz))
        return ptr;
    throw std::bad_alloc{}; // required by [new.delete.single]/3
}

void operator delete(void* ptr) noexcept
{
    std::puts("global op delete called");
    std::free(ptr);
}

void *h(void *param)
{
    while(1)
    {
        char *p = new char[4096];
        delete[] p;
    }
    return NULL;
}
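For completeness, the array forms can be replaced in the same way; by default operator new[] forwards to operator new, but replacing them explicitly removes any doubt about which allocator new char[4096] ends up in. A minimal sketch, using the same malloc/free backing (and the same includes) as above:

// Same <cstdlib>/<new> includes as the snippet above.
void* operator new[](std::size_t sz)
{
    if (sz == 0)
        ++sz; // avoid std::malloc(0) which may return nullptr on success
    if (void *ptr = std::malloc(sz))
        return ptr;
    throw std::bad_alloc{};
}

void operator delete[](void* ptr) noexcept
{
    std::free(ptr);
}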
So I thought the memory leak might be caused by my operator new/delete rather than by malloc/free, so I changed the code to what is shown below: I don't actually call malloc/free, but return a pre-allocated buffer. This time the memory leak issue disappeared, so it seems the issue does not come only from new/delete...
char pp[4096];

void* operator new(std::size_t sz) // no inline, required by [replacement.functions]/3
{
    std::printf("global op new called, size = %zu\n", sz);
    if (sz == 0)
        ++sz; // avoid std::malloc(0) which may return nullptr on success
    return (void *)pp;
    //if (void *ptr = std::malloc(sz))
    //    return ptr;
    throw std::bad_alloc{}; // required by [new.delete.single]/3
}

void operator delete(void* ptr) noexcept
{
    std::puts("global op delete called");
    //std::free(ptr);
}
So, what is happening? Could any experts give me some ideas? I guess it may be caused by the musl libc threading library; it seems that thread synchronization does not work properly... or is there a patch that already fixes this issue?
This issue is caused by musl libc, and has been fixed by the following patch:
https://git.musl-libc.org/cgit/musl/commit/src/malloc?id=3e16313f8fe2ed143ae0267fd79d63014c24779f
The above patch introduced a buffer overflow issue in realloc, which has been fixed by the two patches below:
https://git.musl-libc.org/cgit/musl/commit/src/malloc?id=cb5babdc8d624a3e3e7bea0b4e28a677a2f2fc46
https://git.musl-libc.org/cgit/musl/commit/src/malloc?id=fca7428c096066482d8c3f52450810288e27515c

Multiple threads in CMSIS RTOS - STM32 Nucleo L053R8

I am developing with an RTOS (CMSIS-RTOS) on the STM32 Nucleo L053R8 kit, and I have an issue related to multiple tasks.
I create 4 tasks (task_1, task_2, task_3, task_4), however only 3 tasks run.
This is part of my code:
#include "main.h"
#include "stm32l0xx_hal.h"
#include "cmsis_os.h"
osMutexId stdio_mutex;
osMutexDef(stdio_mutex);
int main(void){
.....
stdio_mutex = osMutexCreate(osMutex(stdio_mutex));
osThreadDef(defaultTask_1, StartDefaultTask_1, osPriorityNormal, 0, 128);
defaultTaskHandle = osThreadCreate(osThread(defaultTask_1), NULL);
osThreadDef(defaultTask_2, StartDefaultTask_2, osPriorityNormal, 0, 128);
defaultTaskHandle = osThreadCreate(osThread(defaultTask_2), NULL);
osThreadDef(defaultTask_3, StartDefaultTask_3, osPriorityNormal, 0, 128);
defaultTaskHandle = osThreadCreate(osThread(defaultTask_3), NULL);
osThreadDef(defaultTask_4, StartDefaultTask_4, osPriorityNormal, 0, 600);
defaultTaskHandle = osThreadCreate(osThread(defaultTask_4), NULL);
}
void StartDefaultTask_1(void const * argument){
    for(;;){
        osMutexWait(stdio_mutex, osWaitForever);
        printf("%s\n\r", __func__);
        osMutexRelease(stdio_mutex);
        osDelay(1000);
    }
}

void StartDefaultTask_2(void const * argument){
    for(;;){
        osMutexWait(stdio_mutex, osWaitForever);
        printf("%s\n\r", __func__);
        osMutexRelease(stdio_mutex);
        osDelay(1000);
    }
}

void StartDefaultTask_3(void const * argument){
    for(;;){
        osMutexWait(stdio_mutex, osWaitForever);
        printf("%s\n\r", __func__);
        osMutexRelease(stdio_mutex);
        osDelay(1000);
    }
}

void StartDefaultTask_4(void const * argument){
    for(;;){
        osMutexWait(stdio_mutex, osWaitForever);
        printf("%s\n\r", __func__);
        osMutexRelease(stdio_mutex);
        osDelay(1000);
    }
}
This is the result in the console (UART):
When I change the stack size for task 4 from 600 to 128 as below:
osThreadDef(defaultTask_4, StartDefaultTask_4, osPriorityNormal, 0, 128);
defaultTaskHandle = osThreadCreate(osThread(defaultTask_4), NULL);
then no task runs at all.
Actually I want to create many threads for my application, but this issue makes that difficult to implement.
Could you let me know the root cause of the problem, and how to resolve it?
Thank you in advance!!
There is no common, easy method of stack calculation; it depends on many factors.
I would suggest avoiding stack-greedy functions like printf, scanf, etc. Write your own versions: not as "smart" and universal, but less resource-hungry.
Avoid large local variables, and be very careful when you allocate memory.
Following your suggestions, I checked with the debugger and saw that the root cause is that the heap size is too small.
I resolved it with 2 methods:
increase heap size: #define configTOTAL_HEAP_SIZE ((size_t)5120)
decrease stack size: #define configMINIMAL_STACK_SIZE ((uint16_t)64)
osThreadDef(defaultTask_6, StartDefaultTask_6, osPriorityNormal, 0, 64);
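If the CMSIS-RTOS layer wraps FreeRTOS (as it does in the STM32Cube packages), a rough way to see how much stack and heap the tasks actually use is the FreeRTOS introspection calls. The sketch below assumes that backend, and that INCLUDE_uxTaskGetStackHighWaterMark is set to 1 in FreeRTOSConfig.h:

#include <stdio.h>
#include "FreeRTOS.h"
#include "task.h"

void report_usage(void){
    /* smallest amount of stack (in words) the calling task has ever had free */
    UBaseType_t stack_left = uxTaskGetStackHighWaterMark(NULL);
    /* bytes still unallocated out of configTOTAL_HEAP_SIZE */
    size_t heap_left = xPortGetFreeHeapSize();
    /* printf is used here only for illustration; it is itself stack-hungry */
    printf("stack high water mark: %lu words, free heap: %u bytes\n\r",
           (unsigned long)stack_left, (unsigned)heap_left);
}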
Do you know how to determine the maximum heap size? Please let me know.
Thank you so much

How to capture a certain region of the screen

I have to record a certain area of the screen and save the frames (20-30 fps) of the video being played. Currently I'm able to capture the screen at the required frame rate using the following code. However, I'm not really sure how I can capture only a certain region.
HRESULT SavePixelsToFile32bppPBGRA(UINT width, UINT height, UINT stride, BYTE *pixels, LPWSTR filePath, const GUID &format)
{
    if (!filePath || !pixels)
        return E_INVALIDARG;
    HRESULT hr = S_OK;
    IWICImagingFactory *factory = nullptr;
    IWICBitmapEncoder *encoder = nullptr;
    IWICBitmapFrameEncode *frame = nullptr;
    IWICStream *stream = nullptr;
    GUID pf = GUID_WICPixelFormat32bppPBGRA;
    BOOL coInit = CoInitialize(nullptr);
    CoCreateInstance(CLSID_WICImagingFactory, nullptr, CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&factory));
    factory->CreateStream(&stream);
    stream->InitializeFromFilename(filePath, GENERIC_WRITE);
    factory->CreateEncoder(format, nullptr, &encoder);
    encoder->Initialize(stream, WICBitmapEncoderNoCache);
    encoder->CreateNewFrame(&frame, nullptr); // we don't use options here
    frame->Initialize(nullptr); // we don't use any options here
    frame->SetSize(width, height);
    frame->SetPixelFormat(&pf);
    frame->WritePixels(height, stride, stride * height, pixels);
    frame->Commit();
    encoder->Commit();
    RELEASE(stream);
    RELEASE(frame);
    RELEASE(encoder);
    RELEASE(factory);
    if (coInit) CoUninitialize();
    return hr;
}
HRESULT Direct3D9TakeScreenshots(UINT adapter, int count)
{
    HRESULT hr = S_OK;
    IDirect3D9 *d3d = nullptr;
    IDirect3DDevice9 *device = nullptr;
    IDirect3DSurface9 *surface = nullptr;
    D3DPRESENT_PARAMETERS parameters = { 0 };
    D3DDISPLAYMODE mode;
    D3DLOCKED_RECT rc;
    UINT pitch;
    SYSTEMTIME st;
    BYTE *shot = nullptr;

    // init D3D and get screen size
    d3d = Direct3DCreate9(D3D_SDK_VERSION);
    d3d->GetAdapterDisplayMode(adapter, &mode);
    parameters.Windowed = TRUE;
    parameters.BackBufferCount = 1;
    parameters.BackBufferHeight = mode.Height;
    parameters.BackBufferWidth = mode.Width;
    parameters.SwapEffect = D3DSWAPEFFECT_DISCARD;
    parameters.hDeviceWindow = NULL;

    // create device & capture surface
    d3d->CreateDevice(adapter, D3DDEVTYPE_HAL, NULL, D3DCREATE_SOFTWARE_VERTEXPROCESSING, &parameters, &device);
    device->CreateOffscreenPlainSurface(mode.Width, mode.Height, D3DFMT_A8R8G8B8, D3DPOOL_SYSTEMMEM, &surface, nullptr);

    // compute the required buffer size
    surface->LockRect(&rc, NULL, 0);
    pitch = rc.Pitch;
    surface->UnlockRect();

    // allocate screenshot buffer
    shot = new BYTE[pitch * mode.Height];

    // get the data
    device->GetFrontBufferData(0, surface);

    // copy it into our buffer
    surface->LockRect(&rc, NULL, 0);
    CopyMemory(shot, rc.pBits, rc.Pitch * mode.Height);
    surface->UnlockRect();

    WCHAR file[100];
    wsprintf(file, L"C:/Frames/cap%i.png", count);
    SavePixelsToFile32bppPBGRA(mode.Width, mode.Height, pitch, shot, file, GUID_ContainerFormatPng);

    if (shot != nullptr)
    {
        delete[] shot;
    }
    RELEASE(surface);
    RELEASE(device);
    RELEASE(d3d);
    return hr;
}
I know I can open the saved images, crop them, and save them again, but that will take extra time. Is there a way to select a desired region?
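One way to avoid the extra decode/encode pass (a sketch, not a tested solution): the full frame is already in the shot buffer, so the desired rectangle can be copied out in memory and then passed to SavePixelsToFile32bppPBGRA with the region's width, height, and packed pitch. The left/top/regionWidth/regionHeight parameters and the CropRegion name below are illustrative:

// Sketch: copy a sub-rectangle out of the full-screen 32bpp buffer before saving.
BYTE* CropRegion(const BYTE* src, UINT srcPitch,
                 UINT left, UINT top, UINT regionWidth, UINT regionHeight)
{
    const UINT bytesPerPixel = 4;                      // 32bpp BGRA
    const UINT dstPitch = regionWidth * bytesPerPixel; // tightly packed rows
    BYTE* dst = new BYTE[dstPitch * regionHeight];
    for (UINT y = 0; y < regionHeight; ++y)
    {
        const BYTE* srcRow = src + (top + y) * srcPitch + left * bytesPerPixel;
        CopyMemory(dst + y * dstPitch, srcRow, dstPitch);
    }
    return dst; // caller releases with delete[]
}

The result could then be saved with SavePixelsToFile32bppPBGRA(regionWidth, regionHeight, dstPitch, cropped, file, GUID_ContainerFormatPng) and released with delete[].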

What could be causing this OpenGL segfault in glBufferSubData?

I've been whittling down this segfault for a while, and here's a pretty minimal reproducible example on my machine (below). I have the sinking feeling that it's a driver bug, but I'm very unfamiliar with OpenGL, so it's more likely I'm just doing something wrong.
Is this correct OpenGL 3.3 code? Should be fine regardless of platform and compiler and all that?
Here's the code, compiled with gcc -ggdb -lGL -lSDL2
#include <stdio.h>
#include <stdlib.h> // for malloc
#include <string.h> // for memset
#include "GL/gl.h"
#include "GL/glext.h"
#include "SDL2/SDL.h"
// this section is for loading OpenGL things from later versions.
typedef void (APIENTRY *GLGenVertexArrays) (GLsizei n, GLuint *arrays);
typedef void (APIENTRY *GLGenBuffers) (GLsizei n, GLuint *buffers);
typedef void (APIENTRY *GLBindVertexArray) (GLuint array);
typedef void (APIENTRY *GLBindBuffer) (GLenum target, GLuint buffer);
typedef void (APIENTRY *GLBufferData) (GLenum target, GLsizeiptr size, const GLvoid* data, GLenum usage);
typedef void (APIENTRY *GLBufferSubData) (GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid* data);
typedef void (APIENTRY *GLGetBufferSubData) (GLenum target, GLintptr offset, GLsizeiptr size, GLvoid* data);
typedef void (APIENTRY *GLFlush) (void);
typedef void (APIENTRY *GLFinish) (void);
GLGenVertexArrays glGenVertexArrays = NULL;
GLGenBuffers glGenBuffers = NULL;
GLBindVertexArray glBindVertexArray = NULL;
GLBindBuffer glBindBuffer = NULL;
GLBufferData glBufferData = NULL;
GLBufferSubData glBufferSubData = NULL;
GLGetBufferSubData glGetBufferSubData = NULL;
void load_gl_pointers() {
glGenVertexArrays = (GLGenVertexArrays)SDL_GL_GetProcAddress("glGenVertexArrays");
glGenBuffers = (GLGenBuffers)SDL_GL_GetProcAddress("glGenBuffers");
glBindVertexArray = (GLBindVertexArray)SDL_GL_GetProcAddress("glBindVertexArray");
glBindBuffer = (GLBindBuffer)SDL_GL_GetProcAddress("glBindBuffer");
glBufferData = (GLBufferData)SDL_GL_GetProcAddress("glBufferData");
glBufferSubData = (GLBufferSubData)SDL_GL_GetProcAddress("glBufferSubData");
glGetBufferSubData = (GLGetBufferSubData)SDL_GL_GetProcAddress("glGetBufferSubData");
}
// end OpenGL loading stuff
#define CAPACITY (1 << 8)
// return nonzero if an OpenGL error has occurred.
int opengl_checkerr(const char* const label) {
GLenum err;
switch(err = glGetError()) {
case GL_INVALID_ENUM:
printf("GL_INVALID_ENUM");
break;
case GL_INVALID_VALUE:
printf("GL_INVALID_VALUE");
break;
case GL_INVALID_OPERATION:
printf("GL_INVALID_OPERATION");
break;
case GL_INVALID_FRAMEBUFFER_OPERATION:
printf("GL_INVALID_FRAMEBUFFER_OPERATION");
break;
case GL_OUT_OF_MEMORY:
printf("GL_OUT_OF_MEMORY");
break;
case GL_STACK_UNDERFLOW:
printf("GL_STACK_UNDERFLOW");
break;
case GL_STACK_OVERFLOW:
printf("GL_STACK_OVERFLOW");
break;
default: return 0;
}
printf(" %s\n", label);
return 1;
}
int main(int nargs, const char* args[]) {
printf("initializing..\n");
SDL_Init(SDL_INIT_EVERYTHING);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_MAJOR_VERSION, 3);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_MINOR_VERSION, 3);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_PROFILE_MASK, SDL_GL_CONTEXT_PROFILE_CORE);
SDL_Window* const w =
SDL_CreateWindow(
"broken",
SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED,
1, 1,
SDL_WINDOW_OPENGL
);
if(w == NULL) {
printf("window was null\n");
return 0;
}
SDL_GLContext context = SDL_GL_CreateContext(w);
if(context == NULL) {
printf("context was null\n");
return 0;
}
load_gl_pointers();
if(opengl_checkerr("init")) {
return 1;
}
printf("GL_VENDOR: %s\n", glGetString(GL_VENDOR));
printf("GL_RENDERER: %s\n", glGetString(GL_RENDERER));
float* const vs = malloc(CAPACITY * sizeof(float));
memset(vs, 0, CAPACITY * sizeof(float));
unsigned int i = 0;
while(i < 128000) {
GLuint vertex_array;
GLuint vertex_buffer;
glGenVertexArrays(1, &vertex_array);
glBindVertexArray(vertex_array);
glGenBuffers(1, &vertex_buffer);
glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer);
if(opengl_checkerr("gen/binding")) {
return 1;
}
glBufferData(
GL_ARRAY_BUFFER,
CAPACITY * sizeof(float),
vs, // initialize with `vs` just to make sure it's allocated.
GL_DYNAMIC_DRAW
);
// verify that the memory is allocated by reading it back into `vs`.
glGetBufferSubData(
GL_ARRAY_BUFFER,
0,
CAPACITY * sizeof(float),
vs
);
if(opengl_checkerr("creating buffer")) {
return 1;
}
glFlush();
glFinish();
// segfault occurs here..
glBufferSubData(
GL_ARRAY_BUFFER,
0,
CAPACITY * sizeof(float),
vs
);
glFlush();
glFinish();
++i;
}
return 0;
}
When I bump the iterations from 64k to 128k, I start getting:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff754c859 in __memcpy_sse2_unaligned () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff754c859 in __memcpy_sse2_unaligned () from /usr/lib/libc.so.6
#1 0x00007ffff2ea154d in ?? () from /usr/lib/xorg/modules/dri/i965_dri.so
#2 0x0000000000400e5c in main (nargs=1, args=0x7fffffffe8d8) at opengl-segfault.c:145
However, I can more than double the capacity (keeping the number of iterations at 64k) without segfaulting.
GL_VENDOR: Intel Open Source Technology Center
GL_RENDERER: Mesa DRI Intel(R) Haswell Mobile
I had a very similar issue when calling glGenTextures and glBindTexture. I tried debugging, and when I would try to step through those lines I would get something like:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff26eaaa8 in ?? () from /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
Note that prior to adding textures, I could successfully run programs with VBOs and VAOs and generate meshes fine. After looking into the answer suggesting switching from the xf86-video-intel driver to the xf86-video-fbdev driver, I would advise against it. (There really isn't that much information on this issue or on users facing segfaults on Linux with integrated Intel graphics cards; perhaps a good question to ask the folks over at Intel Open Source.)
The solution I found was to stop using freeglut and switch to GLFW instead. Whether there actually is some problem with the Intel Linux graphics stack is beside the point; the solvable problem seems to be freeglut. If you want to use GLFW with your machine's most recent OpenGL core profile, you need just the following:
glfwWindowHint (GLFW_CONTEXT_VERSION_MAJOR, 3);
glfwWindowHint (GLFW_CONTEXT_VERSION_MINOR, 0);
glfwWindowHint (GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE);
Setting forward compatibility (although I've seen lots of posts arguing you shouldn't do this) means Mesa is free to select a core context, provided the minimum context is set to 3.0 or higher. I guess freeglut must be going wrong somewhere in its interaction with Mesa; if anyone can shed some light on this, that would be great!
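For reference, a minimal self-contained sketch of that setup (assuming GLFW 3.x; the window size and the printed string are just for illustration):

#include <GLFW/glfw3.h>
#include <cstdio>

int main() {
    if (!glfwInit()) return 1;
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 0);
    glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE);
    GLFWwindow* win = glfwCreateWindow(640, 480, "core profile test", nullptr, nullptr);
    if (!win) { glfwTerminate(); return 1; }
    glfwMakeContextCurrent(win);
    // With the forward-compatible hint, Mesa may promote this to its highest core version.
    std::printf("GL_VERSION: %s\n", glGetString(GL_VERSION));
    glfwDestroyWindow(win);
    glfwTerminate();
    return 0;
}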
This is a bug in the Intel graphics drivers for Linux. Switching from the xf86-video-intel driver to the xf86-video-fbdev driver solves the problem.
Edit: I'm not recommending switching to fbdev, just using it as an experiment to see whether the segfault goes away.

Slow OpenGL context access on Ubuntu 12.04

I have an application which accesses an OpenGL context. I run it on 2 OSs:
1. Kubuntu 13.04
2. Ubuntu 12.04
I am experiencing the following issue: on OS 1 it takes around 60 ms to set up the context, while on OS 2 it takes 10 times longer. Both OSs use Nvidia GPUs with driver version 319. It also seems like OpenGL API calls are slower in general on OS 2. The contexts are offscreen. Currently I have no clue what could cause this. My question is: what are the possible sources of such overhead? X11 setup? Or maybe something at the OS level?
Another difference is that OS 1 uses an Nvidia GTX 680 while OS 2 uses an Nvidia GRID K1 card. Also, OS 2 resides on a server and the latency tests are run locally on that machine.
UPDATE:
This is the part which causes most of the overhead:
typedef GLXContext (*glXCreateContextAttribsARBProc)(Display*, GLXFBConfig, GLXContext, Bool, const int*);
typedef Bool (*glXMakeContextCurrentARBProc)(Display*, GLXDrawable, GLXDrawable, GLXContext);
static glXCreateContextAttribsARBProc glXCreateContextAttribsARB = 0;
static glXMakeContextCurrentARBProc glXMakeContextCurrentARB = 0;
int main(int argc, const char* argv[]){
static int visual_attribs[] = {
None
};
int context_attribs[] = {
GLX_CONTEXT_MAJOR_VERSION_ARB, 3,
GLX_CONTEXT_MINOR_VERSION_ARB, 0,
None
};
Display* dpy = XOpenDisplay(0);
int fbcount = 0;
GLXFBConfig* fbc = NULL;
GLXContext ctx;
GLXPbuffer pbuf;
/* open display */
if ( ! (dpy = XOpenDisplay(0)) ){
fprintf(stderr, "Failed to open display\n");
exit(1);
}
/* get framebuffer configs, any is usable (might want to add proper attribs) */
if ( !(fbc = glXChooseFBConfig(dpy, DefaultScreen(dpy), visual_attribs, &fbcount) ) ){
fprintf(stderr, "Failed to get FBConfig\n");
exit(1);
}
/* get the required extensions */
glXCreateContextAttribsARB = (glXCreateContextAttribsARBProc)glXGetProcAddressARB( (const GLubyte *) "glXCreateContextAttribsARB");
glXMakeContextCurrentARB = (glXMakeContextCurrentARBProc)glXGetProcAddressARB( (const GLubyte *) "glXMakeContextCurrent");
if ( !(glXCreateContextAttribsARB && glXMakeContextCurrentARB) ){
fprintf(stderr, "missing support for GLX_ARB_create_context\n");
XFree(fbc);
exit(1);
}
/* create a context using glXCreateContextAttribsARB */
if ( !( ctx = glXCreateContextAttribsARB(dpy, fbc[0], 0, True, context_attribs)) ){
fprintf(stderr, "Failed to create opengl context\n");
XFree(fbc);
exit(1);
}
/* create temporary pbuffer */
int pbuffer_attribs[] = {
GLX_PBUFFER_WIDTH, 800,
GLX_PBUFFER_HEIGHT, 600,
None
};
pbuf = glXCreatePbuffer(dpy, fbc[0], pbuffer_attribs);
XFree(fbc);
XSync(dpy, False);
/* try to make it the current context */
if ( !glXMakeContextCurrent(dpy, pbuf, pbuf, ctx) ){
/* some drivers does not support context without default framebuffer, so fallback on
* using the default window.
*/
if ( !glXMakeContextCurrent(dpy, DefaultRootWindow(dpy), DefaultRootWindow(dpy), ctx) ){
fprintf(stderr, "failed to make current\n");
exit(1);
}
}
/* try it out */
printf("vendor: %s\n", (const char*)glGetString(GL_VENDOR));
return 0;
}
Specifically, the line:
pbuf = glXCreatePbuffer(dpy, fbc[0], pbuffer_attribs);
where the dummy pbuffer is created is the slowest. While the rest of the function calls take on average 2-4 ms, this call takes 40 ms on OS 1. On OS 2 (which is slow) the pbuffer creation takes 700 ms! I hope my problem looks clearer now.
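For comparing the two machines, a small timing wrapper around that call makes the numbers easy to reproduce. A sketch (the helper name is illustrative; dpy, fbc[0] and pbuffer_attribs are the variables from the listing above):

#include <chrono>
#include <cstdio>
#include <GL/glx.h>

// Sketch: time one glXCreatePbuffer call so the two systems can be compared directly.
GLXPbuffer create_pbuffer_timed(Display* dpy, GLXFBConfig config, const int* attribs)
{
    const auto t0 = std::chrono::steady_clock::now();
    GLXPbuffer pbuf = glXCreatePbuffer(dpy, config, attribs);
    const auto t1 = std::chrono::steady_clock::now();
    const long long ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("glXCreatePbuffer took %lld ms\n", ms);
    return pbuf;
}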
Are you absolutely sure "OS2" has correctly set up drivers and isn't falling back on SW OpenGL (Mesa) rendering? What framerate does glxgears report on each system?
I note that Ubuntu 12.04 was released in April 2012, while I believe Nvidia's "GRID" tech wasn't even announced until GTC in May 2012, and I think the cards didn't turn up until 2013 (see the relevant Nvidia press releases). Therefore it seems very unlikely that Nvidia's drivers as supplied with Ubuntu 12.04 support the GRID card (unless you've made some effort to upgrade using more recent driver releases from Nvidia?).
You may be able to check the list of supported hardware in /usr/share/doc/nvidia-glx/README.txt.gz's Appendix A "Supported NVIDIA GPU Products" (at least that's where this useful information lives on my Debian machines).
