I have vector a(8000000000);
So 8000000000*8/1024/1024/1024 = 7.45 GB, so I need 7.45 GB of RAM for my program with this huge vector to work. I have that much RAM in my computer, but it doesn't work. Why?
It compiles, but when I run it, it gives this error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Thanks in advance
Two things MUST be true for your program to work correctly:
The OS has to be 64-bit (which is presumably what you are running, otherwise you'd be limited to roughly 3 GB of usable RAM)
Your program must be built as a 64-bit application so it can actually address that much memory
For a Windows MSVC solution/project, make sure to follow the steps in this article to set it up for 64 bits: https://msdn.microsoft.com/en-us/library/h2k70f3s.aspx
For SunOS, follow this article: http://www.well.com/~jax/rcfb/solaris_tips/build_gcc_3.0_64bit.html
For any other platform/compiler, you should be able to pull up the documentation via Google, of course.
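To make the arithmetic and the 64-bit requirement concrete, here is a minimal sketch (the element type Elem is a made-up 1-byte placeholder, since the original element type isn't shown): it prints the pointer size, the memory the requested vector would need, and the library's own limit, before you ever attempt the allocation.

#include <iostream>
#include <vector>

int main() {
    // 4-byte pointers mean a 32-bit build, which cannot address ~7.45 GB.
    std::cout << "Pointer size: " << sizeof(void*) << " bytes ("
              << sizeof(void*) * 8 << "-bit build)\n";

    using Elem = unsigned char;                  // hypothetical element type
    const unsigned long long n = 8000000000ULL;  // requested element count
    const double gib = double(n) * sizeof(Elem) / (1024.0 * 1024 * 1024);
    std::cout << "Requested: " << gib << " GiB\n";

    // The container itself also reports an upper bound for this build.
    std::cout << "vector max_size: " << std::vector<Elem>{}.max_size() << " elements\n";
    return 0;
}

On a 32-bit build the allocation cannot succeed regardless of how much physical RAM is installed, which matches the error you are seeing.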
Related
Hope someone knows the answer to this...
I have code that compiles perfectly well with OpenMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all of the 8 or 10 cores I have.
I am setting the thread count correctly with export OMP_NUM_THREADS=10, and the code correctly identifies that it's supposed to be running with 10 threads (see the Activity Monitor screenshot below):
[Activity Monitor screenshot]
The screenshot shows that the code is compiled for Apple Silicon and uses 10 threads, but not much of the available CPU.
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial in x86 architectures.
Not really an answer, but too long for a comment...
If both LLVM and GCC behave the same, then it's not an OpenMP runtime issue. (And your monitor output shows that the correct number of threads has been created.) I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface to bind threads to logical CPUs.
As well as that, there's the issue of having some fast and some slow cores. That could mean that statically scheduled loops are inefficient.
I'm also confused by the fact that you seem to show multiple instances of your code running at the same time, so you are explicitly causing over-subscription of the logical CPUs...
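To illustrate the scheduling point above, here is a minimal sketch (the loop body is a made-up stand-in for the real work): with the default static schedule every thread gets an equal slice up front, so the whole loop waits for the efficiency cores, whereas a dynamic schedule lets the performance cores keep taking chunks.

#include <omp.h>
#include <cmath>
#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;

    // schedule(dynamic, chunk): idle (fast) cores keep grabbing new chunks,
    // so performance cores are not left waiting on efficiency cores the way
    // they can be with a static schedule on a heterogeneous chip.
    #pragma omp parallel for schedule(dynamic, 1024) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += std::sin(i) * std::cos(i);  // stand-in for the real work
    }

    std::printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}

This doesn't address the over-subscription point: if several instances of the program each start 10 threads, no schedule will let a single instance saturate all cores.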
First and foremost: I am completely unable to create an MCVE, as I can only reproduce this when running the full code; any attempt to measure or replicate the error in a simpler environment makes it disappear. TL;DR: I suspect it's not a code problem, but a configuration problem.
I have a piece of code that does some mathematics in CUDA kernels. I have a Windows machine (Win10 x64, GTX 1050, CUDA 9.2) and an Ubuntu 17.04 machine (2x GTX 1080 Ti, CUDA 9.1).
My code runs fine on the Windows machine. It is long-running (~700 ms per kernel call for big samples), so I needed to increase the TDR value in Windows. The code also (for now) forces execution on a single GPU, the first one, selected with cudaSetDevice(0).
When I copy the same input data and code to the Linux machine (I am using git; it is the same code), I get either
an illegal memory access was encountered
or
unspecified launch failure
in my error checking after the GPU call.
If I change the kernel to just write a number to the output instead of doing the math, the kernel executes properly. Other CUDA code (different functions that I have) works fine too. All of this leads me to think that there is a problem outside the code, not with the code itself, nor with the general configuration of the drivers/environment variables.
I read that xorg.conf can have an effect on kernel timeouts. I generated an xorg.conf (I had none) and removed the devices from it, as suggested here. I am connecting to the server remotely and have no monitor plugged in. This changed nothing in the behavior; my kernels still error out.
My question is: what else should I look at? What Linux-specific configuration should I examine to pinpoint the cause of the kernel failures?
The error ended up indeed being an illegal memory access.
It was caused by the fact that sizeof(unsigned long) is machine specific: my Linux machine returns 8 while my Windows machine returns 4. Because this code is called from MATLAB, and MATLAB (like some other high-level languages such as Python) defines the sizes of variables in bits (such as uint32(1)), there was a size mismatch on the Linux machine when doing memcpys. It turns out this happened in a variable used as an index, so the kernels were reading garbage (due to the bad memcpy) and then trying to access another array at that location, creating an illegal memory access error.
Too specific? yeah.
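For anyone hitting the same thing, a small illustration of the pitfall (the variable names are made up): unsigned long is 4 bytes on 64-bit Windows (LLP64) but 8 bytes on 64-bit Linux (LP64), so sizing a memcpy with it against data that MATLAB wrote as uint32 only matches on one of the two platforms. Fixed-width types keep the byte counts in sync.

#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    // MATLAB-side value declared as uint32(...): always exactly 4 bytes.
    const std::uint32_t matlab_index = 42;

    // Platform dependent: 4 bytes on 64-bit Windows (LLP64),
    // 8 bytes on 64-bit Linux (LP64) -- the source of the mismatch.
    std::cout << "sizeof(unsigned long) = " << sizeof(unsigned long) << "\n";
    std::cout << "sizeof(std::uint32_t) = " << sizeof(std::uint32_t) << "\n";

    // Portable: size the host variable and the memcpy with a fixed-width type,
    // so the byte count matches what MATLAB actually wrote.
    std::uint32_t host_index = 0;
    std::memcpy(&host_index, &matlab_index, sizeof(host_index));
    std::cout << "index = " << host_index << "\n";
    return 0;
}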
Whatever program I run on the GPU, even programs that ran successfully before, throws this error: CL_OUT_OF_RESOURCES from the clEnqueueReadBuffer function.
Then I remembered that I ran a deep learning framework last night, which crashed and may have eaten up all the memory on the GPU. I tried restarting the computer, but that didn't help.
Is it possible that my GPU ran out of memory due to the DL framework's crash?
If so, how should I solve this problem?
CL_OUT_OF_RESOURCES is a generic error given by the NVIDIA implementation at clEnqueueReadBuffer; it more or less means:
Something went out of bounds (resources) when trying to write to this buffer
Most probably the kernel you launched before that, which writes to that buffer, went out of bounds of the buffer.
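If it helps, here is a hedged host-side sketch of how to rule out the read itself (it assumes a command queue and buffer that already exist; the real fix is usually in the kernel's indexing): it compares the requested read size against the buffer's actual size and reports the error code instead of letting a mismatch surface later as CL_OUT_OF_RESOURCES.

#include <CL/cl.h>
#include <cstdio>

// Hypothetical helper: reads `bytes` from `buf` into `dst`, but first checks
// that the request fits inside the buffer. A kernel writing past the end of
// `buf` earlier in the queue is the usual cause of CL_OUT_OF_RESOURCES on
// NVIDIA, so the same size check is worth applying to the kernel's arguments.
cl_int checked_read(cl_command_queue queue, cl_mem buf, void* dst, size_t bytes) {
    size_t buf_size = 0;
    cl_int err = clGetMemObjectInfo(buf, CL_MEM_SIZE, sizeof(buf_size), &buf_size, nullptr);
    if (err != CL_SUCCESS) return err;

    if (bytes > buf_size) {
        std::fprintf(stderr, "read of %zu bytes exceeds buffer of %zu bytes\n",
                     bytes, buf_size);
        return CL_INVALID_VALUE;
    }

    err = clEnqueueReadBuffer(queue, buf, CL_TRUE /* blocking */, 0, bytes, dst,
                              0, nullptr, nullptr);
    if (err != CL_SUCCESS)
        std::fprintf(stderr, "clEnqueueReadBuffer failed with error %d\n", err);
    return err;
}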
I'm running a small CUDA application: the QuickSort benchmark algorithm (see here). I have a dual-GPU system with an NVIDIA 660GTX (device 0) and an 8600GTS (device 1).
Under Windows 8 and Visual Studio, the application compiles and runs flawlessly on device 0. Under Linux (Ubuntu 12.04 LTS), the app compiles with nvcc and gcc but suddenly stops in its tracks, returning an (unspecified launch failure).
I have two issues:
After this error, my GPU cannot perform some other operations; e.g., running the SDK example bandwidthTest blocks when it performs the first data transfer, while running deviceQuery continues to work fine. How can I reset my GPU? I've already tried the cudaDeviceReset() method, but it doesn't help.
How can I find out what is going wrong under Linux? Does anyone have a clue or has anyone seen this before?
Thanks in advance for your help!
Using the nvidia-smi utility, you can reset the GPU if it is compatible.
To my knowledge and experience, (unspecified launch failure) usually refers to a segmentation fault. Have you specified the right GPU to use? Try using cuda-memcheck to see if there is any out-of-bounds memory access.
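A minimal sketch of that kind of check (the kernel and its sizes are placeholders, not your code): checking both the launch and the following synchronization makes an unspecified launch failure show up at the call that caused it, and running the same binary under cuda-memcheck then points at the offending access.

#include <cstdio>
#include <cuda_runtime.h>

// Report any CUDA error with file/line context.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess)                                        \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                     \
    } while (0)

__global__ void placeholder_kernel(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;  // the bounds check guards against out-of-range writes
}

int main() {
    const int n = 1024;
    int* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(int)));

    placeholder_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());  // "unspecified launch failure" appears here

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}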
From my experience, XID 31 was always caused by accessing a bad pointer (i.e., a memory access violation).
I'd pursue this trail first. Run your application with cuda-memcheck, like this: cuda-memcheck your_app args_to_your_app, and see if it finds any bad memory accesses.
Also try stepping through the code with cuda-gdb or Nsight Eclipse Edition.
I've found that using
cuda-memcheck -b ...
prevents the device from locking up.
I am an intern who was offered the task of porting a test application from Solaris to Red Hat. The application is written in Ada. It works just fine on the Unix side. I compiled it on the Linux side, but now it is giving me a segfault. I ran the debugger to see where the fault was and got this:
Warning: In non-Ada task, selecting an Ada task.
=> runtime tasking structures have not yet been initialized.
<non-Ada task> with thread id 0b7fe46c0
process received signal "Segmentation fault" [11]
task #1 stopped in _dl_allocate_tls
at 0870b71b: mov edx, [edi] ;edx := [edi]
This segfault happens before any calls are made or anything is initialized. I have been told that 'tasks' in Ada get started before the rest of the program, and the problem could be with a task that is running.
But here is the kicker: this program just generates some code for another program to use. The OTHER program, when compiled under Linux, gives me the same kind of segfault with the same kind of error message. This leads me to believe there might be some little tweak I can use to fix all of this, but I just don't have enough knowledge about Unix, Linux, and Ada to figure this one out all by myself.
This is a total shot in the dark, but tasks can blow up like this at startup if they are trying to allocate too much local memory on the stack. Your main program can safely use the system stack, but tasks have to have their stacks allocated at startup from dynamic memory, so typically your runtime has a default stack size for tasks. If your task tries to allocate a large array, it can easily blow past that limit. I've had it happen to me before.
There are multiple ways to fix this. One way is to move all your task-local data into package global areas. Another is to dynamically allocate it all.
If you can figure out how much memory would be enough, you have a couple more options. You can make the task a task type, and then use a
for My_Task_Type_Name'Storage_Size use Some_Huge_Number;
statement. You can also put a "pragma Storage_Size" inside the task definition, but I think the "for" clause is preferred.
Lastly, with GNAT you can also change the default task stack size with the -d flag to gnatbind.
Off the top of my head, if the code was used on SPARC machines and you're now running on an x86 machine, you may be running into endianness problems.
It's not much help, but it is a common gotcha when going multi-platform.
Hunch: the linking step didn't go right. Perhaps the wrong run-time startup library got linked in?
(How likely are we to find out what the real trouble was, months after the question was asked?)