Is there any CUDA library that performs comparison/search operations?
CUDA is an API for writing code that runs on NVIDIA GPUs. Consequently, any operation you want to perform must be custom programmed, and there is not a wide range of open-source libraries available.
Programmers use 'C for CUDA' (C with NVIDIA extensions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU.
http://en.wikipedia.org/wiki/CUDA
You could look at Thrust, which includes a binary_search operation and is very easy to use.
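To give a feel for the API, here is a minimal sketch (assuming the standard Thrust headers that ship with the CUDA toolkit, compiled with nvcc) that sorts a device vector and then runs a binary search on it:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <iostream>

int main()
{
    // Put some unsorted data on the GPU.
    thrust::device_vector<int> d(4);
    d[0] = 42; d[1] = 7; d[2] = 13; d[3] = 99;

    // binary_search requires a sorted range, so sort on the device first.
    thrust::sort(d.begin(), d.end());

    // Returns true if the value is present in the sorted range.
    bool found = thrust::binary_search(d.begin(), d.end(), 13);
    std::cout << (found ? "found" : "not found") << std::endl;
    return 0;
}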
dlopen uses libdl.so, but I am working on a standalone application that does not use OS support, so my idea is to implement dlopen's functionality directly in my own code. Is there any way to do this?
Loading shared libraries is intrinsically dependent on the operating system's runtime loader, and in turn on the operating system's executable file format and its process construction model. There is no OS-independent way to do it.
The GNU source code of dlopen is of course freely available, but that does not make it independent of an operating system.
The maximum degree of OS independence you can achieve in C is obtained by restricting yourself to software that you can write entirely with the resources of the Standard C Library. The Standard C Library does not contain dlopen or any equivalent functionality, because such functionality is intrinsically OS-dependent.
As your question is tagged Linux, it is not quite clear why you would want your application
to be independent of OS support that is provided by Linux.
I need a parallel linear algebra and factorization library on OSX, installable as painlessly as possible (i.e., at most I can ask my colleagues to use Homebrew), because of the number of DOFs I have in my problems.
I've tried Armadillo: it supports sparse algebra, which is what I need, and I can link it with the Accelerate framework, but it only solves linear systems; it doesn't support factorization, AFAIK.
Next I tried MKL, but nothing I do seems to trigger threading, even with TBB:
#include <mkl.h>
#include <tbb/task_scheduler_init.h>

tbb::task_scheduler_init scheduler(4);
mkl_set_dynamic(true);
mkl_set_num_threads(4);
mkl_set_num_threads_local(4);
Eigen could be cool, but it seems that, like MKL, it won't run in parallel.
Do you have any suggestions?
OS X's clang does not support OpenMP, which multi-threaded Eigen and MKL require.
According to the Intel® Math Kernel Library Link Line Advisor, MKL does not support TBB threading with clang.
But it does seem to support the Intel OpenMP library with the extra link option -liomp5. You could try whether that works. If not, you may have to use another compiler such as gcc, which you can find in Homebrew.
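To make that concrete, here is a minimal sketch of a test program; the build line in the comment is only an assumption of what the Link Line Advisor typically suggests for this setup, so verify it against your own MKL version and install path:

// Hypothetical build line for OS X clang with the Intel OpenMP runtime
// (check the Link Line Advisor for your MKL version before relying on it):
//   clang++ mkl_test.cpp -I${MKLROOT}/include -L${MKLROOT}/lib \
//     -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
#include <mkl.h>
#include <vector>
#include <cstdio>

int main()
{
    mkl_set_num_threads(4);   // request 4 MKL threads
    std::printf("MKL max threads: %d\n", mkl_get_max_threads());

    // A large dgemm should run multi-threaded if threading is actually active.
    const int n = 2000;
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, &a[0], n, &b[0], n, 0.0, &c[0], n);
    std::printf("c[0] = %f\n", c[0]);
    return 0;
}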
I have to install, without root access, some software (the GROMACS simulation package) on a cluster server to which jobs are submitted through SLURM. I only have direct access to the front-end machine, and the home directory is shared among all the servers and the front-end. I had to manually build and install locally:
gcc 4.8
automake, autoconf, cmake
openmpi
lapack libs
gromacs
Right now, I have installed all of this only on the front-end, which is an older Intel Xeon machine. The production servers have newer AMD processors instead. This is my question: in order to achieve optimal performance, which parts of the aforementioned stack should be recompiled on the production servers? I guess it would make sense to rebuild the final software (GROMACS) and maybe the LAPACK libs, because of the different instruction sets and processor architecture, but I'm not sure whether it would make any sense to rebuild the compiler or other parts of the system. Hence the question: does using a compiler (and the associated libraries) that was built on a different machine result in higher execution times for the generated binaries?
In general, I'd expect a compiler to produce the same binaries if given the same input, so the answer would be no; but what about the libraries (such as libstdc++) that were compiled together with the compiler on the other machine?
Thank you.
In order to optimize GROMACS (a parallel molecular dynamics code), you can forget about recompiling the compiler or the compilation tools: that's useless.
What you should go after instead are optimizations of the stack itself. For Intel CPUs, using the Intel C compiler makes a difference, and you may see some gains with AMD as well.
Another alternative is to use the Portland Group compiler.
Regarding MPI, you need to be sure it's built for your interconnect (for example, if you have InfiniBand, avoid using the standard TCP version).
Regarding the LAPACK libraries, you need to install an optimized LAPACK (ACML for AMD, MKL for Intel; GotoBLAS or ATLAS also give very good performance and are included in many Linux distros).
You have not mentioned FFTs: they are important for the electrostatics (Ewald summation) in the simulations, and FFTW is a good choice here. You need to install the correct version for the processor, or compile it on the target processor, because it performs a sort of "auto-tuning" for the hardware it is built and run on.
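For context, part of that tuning also happens at run time when an FFTW plan is created; a minimal sketch (assuming the standard fftw3 API, single process, complex 1-D transform) looks like:

#include <fftw3.h>

int main()
{
    const int n = 1024;
    fftw_complex* in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex* out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
    for (int i = 0; i < n; ++i) { in[i][0] = i; in[i][1] = 0.0; }

    // FFTW_MEASURE makes the planner time several strategies on this machine
    // and pick the fastest one; this is the run-time part of the auto-tuning.
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}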
Going lower than this (tools, compilers) makes no difference to the produced executables.
Building the GCC compiler already involves a three-stage bootstrap process, one of whose purposes is to QA the compiler by ensuring the last two stages produce the same output. So there is no reason to believe that an extra stage on another machine would have any effect at all.
I am working with OpenCV, an open source image processing library, and due to complexities in my algorithm I need to use multiple threads for video processing.
How is multi-threading done in C++98? I know that C++11 has built-in library support for threading (std::thread), but my platform (MSVC++ 2010) does not have it. I have also read that the Boost library, a general-purpose extension to the C++ standard library, has facilities for multi-threading. I also know that with the Windows API (windows.h, documented on MSDN) I can create and manage threads for Windows applications. Finally, I found out that the Qt library, a cross-platform GUI framework, has support for threading.
Is there a native way (without any 3rd-party libraries) to create a cross-platform multi-threaded application?
C++98 does not have any support for threading, neither in the language nor the standard library. You need to use a third party library and you have already listed a number of the main candidates.
OpenCV relies on different external systems for multithreading (or more accurately parallel processing).
Possible options are:
OpenMP (handled at the compiler level);
Intel's TBB (external library);
libdispatch (on systems that support it, like MacOS, iOS, *BSD);
GPGPU approaches with CUDA and OpenCL.
In recent versions of OpenCV these systems are "hidden" behind a parallel_for construct.
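As a rough sketch of that construct (header locations vary between OpenCV versions, and InvertRows/invertParallel are just illustrative names, not OpenCV API), a C++98-compatible body class could look like this:

#include <opencv2/core/core.hpp>

// The library splits the Range across its worker threads and calls
// operator() on each sub-range.
class InvertRows : public cv::ParallelLoopBody
{
public:
    explicit InvertRows(cv::Mat& img) : img_(img) {}

    virtual void operator()(const cv::Range& range) const
    {
        for (int r = range.start; r < range.end; ++r)
        {
            uchar* row = img_.ptr<uchar>(r);
            for (int c = 0; c < img_.cols * img_.channels(); ++c)
                row[c] = 255 - row[c];   // invert every byte of the row
        }
    }

private:
    cv::Mat& img_;
};

void invertParallel(cv::Mat& img)
{
    cv::parallel_for_(cv::Range(0, img.rows), InvertRows(img));
}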
All this applies to parallel processing, i.e., data-parallel tasks (roughly speaking, processing each pixel or row of the input in parallel). If you need application-level multithreading (for example a master thread plus workers), then you need a framework such as POSIX threads or Qt.
I recommend boost::thread which is (mostly) compatible with std::thread in C++11. It is cross-platform and very mature.
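To give an idea of what that looks like in C++98, here is a minimal sketch (processFrames is a placeholder for your own per-frame work, not an OpenCV or Boost function):

#include <boost/thread.hpp>
#include <iostream>

// Plain function run by a worker thread; no lambdas needed in C++98.
void processFrames(int firstFrame, int lastFrame)
{
    for (int i = firstFrame; i < lastFrame; ++i)
    {
        // ... per-frame processing would go here ...
    }
    std::cout << "processed " << firstFrame << " to " << lastFrame << std::endl;
}

int main()
{
    // Split the work between two threads.
    boost::thread worker1(processFrames, 0, 50);
    boost::thread worker2(processFrames, 50, 100);

    worker1.join();   // wait for both workers to finish
    worker2.join();
    return 0;
}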
OpenCV's parallelism is internal and does not directly mix with your code; be aware, though, that it may use more resources and cores than you expect (by design), possibly at the expense of other processes.
There are a number of new features for parallel programming in VS 2012 C++ compiler:
Auto-parallelizer
C++ Accelerated Massive Parallelism (AMP)
Task Parallelism
and more ...
I would like to see these applied to matrix multiplication, eigenvalue decomposition, etc. I mean everything that can benefit from executing in parallel.
Is there such a C++ library?
Vectorization is applied by default, so we skip this part.
The library does not need to be portable, so we skip this as well.
Look into Intel MKL. I don't know the specifics of the library, but it doesn't hurt to investigate it.
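As a hedged illustration of the kind of interface MKL offers for the operations mentioned above (the exact header and link flags should be checked against the MKL documentation for your version), a symmetric eigenvalue decomposition is a single call that MKL threads internally:

#include <mkl.h>      // should pull in the LAPACKE declarations; verify for your MKL version
#include <cstdio>

int main()
{
    // 3x3 symmetric matrix, row-major; dsyev reads the upper triangle ('U').
    double a[9] = { 4.0, 1.0, 2.0,
                    1.0, 3.0, 0.0,
                    2.0, 0.0, 5.0 };
    double w[3];  // eigenvalues, returned in ascending order

    // 'V' also computes eigenvectors, which overwrite the input matrix.
    lapack_int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', 3, a, 3, w);
    if (info == 0)
        std::printf("eigenvalues: %f %f %f\n", w[0], w[1], w[2]);
    return 0;
}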