Currently, I have an Android project in which I am using JNI (the Java Native Interface) to call LAPACK C functions.
My JNI source is:
native-lib.c
#include <jni.h>
#include "lapack/LAPACKE/include/lapack.h"
#include "lapack/LAPACKE/example/lapacke_example_aux.h"
JNIEXPORT jstring JNICALL
Java_com_example_lapacktest_MainActivity_stringFromJNI(
JNIEnv *env,
jobject obj /* this */) {
char hello[20] = "Hello from C++";
double A[5][3] = {{1,2,3},{4,5,1},{3,5,2},{4,1,4},{2,5,3}};
double b[5][2] = {{-10,12},{14,16},{18,-3},{14,12},{16,16}};
lapack_int info,m,n,lda,ldb,nrhs;
/* Initialization */
m = 5;
n = 3;
nrhs = 2;
lda = 5;
ldb = 5;
/* Print Entry Matrix */
print_matrix_colmajor( "Entry Matrix A", m, n, *A, lda );
return (*env)->NewStringUTF(env,hello);
}
CMakeLists.txt:
cmake_minimum_required(VERSION 3.4.1)
project(LapackTest)
add_library( native-lib
SHARED
native-lib.c
#lapack/LAPACKE/example/lapacke_example_aux.c
)
# Add dependent libraries
add_library(blas STATIC IMPORTED)
set_property(TARGET blas PROPERTY IMPORTED_LOCATION ${CMAKE_SOURCE_DIR}/libs/libblas.a)
add_library(lapack STATIC IMPORTED)
set_property(TARGET lapack PROPERTY IMPORTED_LOCATION ${CMAKE_SOURCE_DIR}/libs/liblapack.a)
# Location of header files
include_directories(
${CMAKE_SOURCE_DIR}/lapack/BLAS
${CMAKE_SOURCE_DIR}/lapack/CBLAS
${CMAKE_SOURCE_DIR}/lapack/LAPACKE
${CMAKE_SOURCE_DIR}/lapack/SRC
${CMAKE_SOURCE_DIR}/lapack/TESTING
${CMAKE_SOURCE_DIR}/lapack/CBLAS
)
add_subdirectory(lapack)
The problem is that when I try to call the function print_matrix_colmajor(), which lives inside the lapack directory, I get an undefined reference error. I want to write my CMakeLists.txt in such a way that I can call any LAPACK function from my JNI code.
Can someone help me build the LAPACK module using the CMakeLists.txt file?
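A minimal sketch of the two missing pieces, assuming the directory layout shown above: print_matrix_colmajor() is defined in lapack/LAPACKE/example/lapacke_example_aux.c, so that file must actually be compiled (it is commented out in the add_library() call), and native-lib must be linked against the imported LAPACK/BLAS archives:
add_library( native-lib
SHARED
native-lib.c
lapack/LAPACKE/example/lapacke_example_aux.c   # defines print_matrix_colmajor()
)
# Link the JNI library against the imported static archives;
# order matters, since LAPACK depends on BLAS:
target_link_libraries( native-lib lapack blas )
Be aware that a liblapack.a built from the Fortran sources may additionally require a Fortran runtime library, which the stock Android NDK does not ship, so an f2c-converted (CLAPACK-style) build may be needed.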
How can I make the dynamic loader load a library with no versioning information for a library/executable that requires versioning information?
For example, say I am trying to run /bin/bash which requires symbol S with version X.Y.Z and libtinfo.so.6 provides symbol S but due to being built with a musl toolchain has no versioning information. Currently, this gives me the following error:
/bin/bash: /usr/local/x86_64-linux-musl/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Inconsistency detected by ld.so: dl-lookup.c: 112: check_match: Assertion `version->filename == NULL || ! _dl_name_match_p (version->filename, map)' failed!
I am trying to avoid the process described here where I make a custom DSO that essentially maps all symbols (i.e. I would have to write out each symbol) to the appropriate symbol in the musl library. I have seen a lot of discussion about loading older versions of symbols in a DSO, but nothing about NO symbol versions.
Does this require me to recompile all binaries that use versioned symbols so they don't include versioning information?
Thanks for your help!
Update
After some investigation, I found that /bin/bash has a handful of symbols that it gets from libtinfo.so.6, such as tgoto, tgetstr, tputs, tgetent, tgetflag, tgetnum, UP, BC, and PC. When the dynamic loader tries to find the correct version of these symbols (for example, tputs@NCURSES6_TINFO_5.0.19991023) in the musl-built libtinfo.so.6, it fails as there is no versioning information in that file.
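One way to see the difference (a suggestion; the paths are taken from the error above and the ldd output below) is to dump the dynamic symbol tables. The GNU-built library prints a version tag such as NCURSES6_TINFO_5.0.19991023 next to each symbol, while the musl-built one prints none:
$ objdump -T /lib/x86_64-linux-gnu/libtinfo.so.6 | grep tputs
$ objdump -T /usr/local/x86_64-linux-musl/lib/libtinfo.so.6 | grep tputs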
I think I have the beginnings of a hacky solution (hopefully there is a better one out there). Essentially, I make a DSO that I compile with a GNU toolchain and load with LD_PRELOAD. In this DSO, I open the musl-built libtinfo.so.6.1 with dlopen and use dlsym to get the needed symbols. These symbols are then made globally available. While there is still no version information for libtinfo.so.6, the DSO has version sections (.gnu.version and .gnu.version_r), and I am able to execute bash without any errors/warnings. The DSO source is below:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
/* Functions */
static char *(*tgoto_internal)(const char *string, int x, int y);
static char *(*tgetstr_internal)(const char * id, char **area);
static int (*tputs_internal)(const char *string, int affcnt, int (*outc)(int));
static int (*tgetent_internal)(char *bufp, const char *name);
static int (*tgetflag_internal)(const char *id);
static int (*tgetnum_internal)(const char *id);
void __attribute__ ((constructor)) init(void);
/* Library Constructor */
void
init(void)
{
void *handle = dlopen("/usr/local/x86_64-linux-musl/lib/libtinfo.so.6.1", RTLD_LAZY);
tgoto_internal = dlsym(handle, "tgoto");
tgetstr_internal = dlsym(handle, "tgetstr");
tputs_internal = dlsym(handle, "tputs");
tgetent_internal = dlsym(handle, "tgetent");
tgetflag_internal = dlsym(handle, "tgetflag");
tgetnum_internal = dlsym(handle, "tgetnum");
}
char *
tgoto(const char *string, int x, int y)
{
return tgoto_internal(string, x, y);
}
char *
tgetstr(const char * id, char **area)
{
return tgetstr_internal(id, area);
}
int
tputs(const char *string, int affcnt, int (*outc)(int))
{
return tputs_internal(string, affcnt, outc);
}
int
tgetent(char *bufp, const char *name)
{
return tgetent_internal(bufp, name);
}
int
tgetflag(const char *id)
{
return tgetflag_internal(id);
}
int
tgetnum(const char *id)
{
return tgetnum_internal(id);
}
/* Objects */
char * UP = 0;
char * BC = 0;
char PC = 0;
However, this solution doesn't seem to work all the time: I still see the same warning as above when testing musl-built binaries, but this time they don't crash the tests and just print a warning.
It should also be noted that I encountered a similar versioning error before with libreadline.so looking for versioning information in libtinfo.so. This seemed to have stemmed from my musl-built libreadline.so being the wrong version (8 instead of 7) and thus my configuration script went to the GNU libreadline.so which was version 7 and this tried to pull in the musl libtinfo.so which raised the error. Building libreadline.so.7 with the musl toolchain resolved this error perfectly.
Thanks to @LorinczyZsigmond for helping me arrive at the solution! Since they don't want to post a complete answer, I will, to close the question.
The error:
/bin/bash: /usr/local/x86_64-linux-musl/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Inconsistency detected by ld.so: dl-lookup.c: 112: check_match: Assertion `version->filename == NULL || ! _dl_name_match_p (version->filename, map)' failed!
tells us that /bin/bash is looking for libtinfo.so.6 in the musl lib directory. However, if we look at /bin/bash under ldd, we see that in general it looks for DSOs in GNU's lib directory:
$ ldd /bin/bash
linux-vdso.so.1 (0x00007ffd485f7000)
libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007f58ad8ba000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f58ad8b5000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f58ad6f4000)
/lib64/ld-linux-x86-64.so.2 => //lib64/ld-linux-x86-64.so.2 (0x00007f58ada22000)
When /bin/bash is run and the LD_LIBRARY_PATH environment variable points to the musl lib directory, the loader will try to resolve the libtinfo.so.6 dependency with musl's libtinfo.so.6, not GNU's. This causes a conflict since /bin/bash was linked against GNU's libtinfo.so.6 which has symbol versioning and perhaps more.
The fix, as said by @LorinczyZsigmond, is:
locally compiled shared objects should be searched first by locally compiled programs, but be hidden from the 'default' programs.
So essentially I needed to stop mixing the GNU and musl libraries, which I had been doing by heavy-handedly setting LD_LIBRARY_PATH=/usr/local/x86_64-linux-musl/lib.
Instead of using LD_LIBRARY_PATH, I used the rpath linker option (-L/usr/local/x86_64-linux-musl/lib -Wl,-rpath,/usr/local/x86_64-linux-musl/lib) to hard-code the path to my musl libraries into the executable. This allows musl-built binaries to link against the DSOs they need while also allowing GNU-built binaries to link against GNU-built DSOs (both of which are required when doing something like testing vim built from source).
As an aside: The rpath entries in an ELF's dynamic section are searched first.
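As an illustration of that link line (the source and output names here are hypothetical, and the toolchain driver name depends on how your musl cross toolchain is installed):
x86_64-linux-musl-gcc -o myprog myprog.c \
    -L/usr/local/x86_64-linux-musl/lib \
    -Wl,-rpath,/usr/local/x86_64-linux-musl/lib -ltinfo
With the rpath baked in, the loader resolves libtinfo.so.6 from the musl lib directory for this binary only, without LD_LIBRARY_PATH leaking into every process.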
I am new to Cython (as well as Python), and I am trying to understand what I am doing wrong when I try to expose the C API of an external multithreaded library to Python. To illustrate my problem, I will go over a hypothetical MWE.
Let's assume that I have the following directory structure
.
├── app.py
├── c_mylib.pxd
├── cxx
│ ├── CMakeLists.txt
│ ├── include
│ │ └── mylib.h
│ └── src
│ └── reduce_cb.cpp
├── mylib.pyx
└── setup.py
Here, cxx contains the external multithreaded library as follows (header and implementation files are concatenated):
/* cxx/include/mylib.h */
#ifndef MYLIB_H_
#define MYLIB_H_
#ifdef __cplusplus
extern "C" {
#endif
typedef double (*func_t)(const double *, const double *, void *);
double reduce_cb(const double *, const double *, func_t, void *);
#ifdef __cplusplus
}
#endif
#endif
/* cxx/src/reduce_cb.cpp */
#include <iterator>
#include <mutex>
#include <thread>
#include <vector>
#include "mylib.h"
extern "C" {
double reduce_cb(const double *xb, const double *xe, func_t func, void *data) {
const auto d = std::distance(xb, xe);
const auto W = std::thread::hardware_concurrency();
const auto split = d / W;
const auto remain = d % W;
std::vector<std::thread> workers(W);
double res{0};
std::mutex lock;
const double *xb_w{xb};
const double *xe_w;
for (unsigned int widx = 0; widx < W; widx++) {
xe_w = widx < remain ? xb_w + split + 1 : xb_w + split;
workers[widx] = std::thread(
[&lock, &res, func, data](const double *xb, const double *xe) {
const double partial = func(xb, xe, data);
std::lock_guard<std::mutex> guard(lock);
res += partial;
},
xb_w, xe_w);
xb_w = xe_w;
}
for (auto &worker : workers)
worker.join();
return res;
}
}
with the accompanying cxx/CMakeLists.txt file as follows:
cmake_minimum_required(VERSION 3.9)
project(dummy LANGUAGES CXX)
add_library(mylib
include/mylib.h
src/reduce_cb.cpp
)
target_compile_features(mylib
PRIVATE
cxx_std_11
)
target_include_directories(mylib
PUBLIC
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
$<INSTALL_INTERFACE:include>
)
set_target_properties(mylib
PROPERTIES PUBLIC_HEADER include/mylib.h
)
install(TARGETS mylib
ARCHIVE DESTINATION lib
LIBRARY DESTINATION lib
PUBLIC_HEADER DESTINATION include
)
The corresponding Cython files are as follows (this time definition and implementation files are concatenated):
# c_mylib.pxd
cdef extern from "include/mylib.h":
ctypedef double (*func_t)(const double *, const double *, void *)
double reduce_cb(const double *, const double *, func_t, void *)
# mylib.pyx
# cython: language_level = 3
cimport c_mylib
cdef double func(const double *xb, const double *xe, void *data):
cdef int d = (xe - xb)
func = <object>data
return func(<double[:d]>xb)
def reduce_cb(double [::1] arr not None, f):
cdef int d = arr.shape[0]
data = <void*>f
return c_mylib.reduce_cb(&arr[0], &arr[0] + d, func, data)
# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
setup(
ext_modules=cythonize([
Extension("mylib", ["mylib.pyx"], libraries=["mylib"])
])
)
After building the C++ library, then building the Cython extension module and linking it against the C++ library following the instructions, I get undefined behavior when I try to run
import mylib
from numpy import array
def cb(x):
partial = 0
for idx in range(x.shape[0]):
partial += x[idx]
return partial
arr = array([val + 1 for val in range(100)], "d")
print("sum(arr): ", mylib.reduce_cb(arr, cb))
By undefined behavior, I mean that I get one of
SIGSEGV (address boundary error),
"Fatal Python error: GC object already tracked" with SIGABRT, or
(rarely) the correct result.
I have checked the documentation of Cython thoroughly (I guess), and I have searched both SO and Google, but I could not find a proper solution to this problem.
Basically, I would like to have a C library, which is unaware of Python and which uses callback functions from multiple threads, integrated into Python. Is this at all possible? I tried nogil signatures and with gil: blocks as discussed in Cython's documentation, but I got compilation errors. Moreover, the gc-related functionality in Cython seems to be valid only for extension types, which does not apply to my case.
I am stuck and I would appreciate any hint/help.
That happens when you use Python objects/functionality without holding the lock. Your critical section is not only the summation but also the call to the function func, i.e.:
workers[widx] = std::thread(
[&lock, &res, func, data](const double *xb, const double *xe) {
std::lock_guard<std::mutex> guard(lock);
const double partial = func(xb, xe, data); // must be guarded
res += partial;
},
xb_w, xe_w);
which makes the parallelization senseless in the first place, doesn't it? Probably, from a software-engineering point of view, a better place for the guard would be in the wrapper function func, but I have put it into the worker because the consequences can be seen much better this way.
Python uses reference counting for memory management, similar to std::shared_ptr. However, it doesn't lock with fine granularity like shared_ptr, which locks only when changing the reference counter; instead it uses a coarser lock: the global interpreter lock (GIL). As a consequence, when one changes the reference count of a Python object from an OpenMP thread, or from any other thread not registered with the Python interpreter, the reference counter is not protected/guarded and race conditions arise. What you are observing are the possible results of such race conditions.
The GIL makes your endeavor more or less impossible: you need to lock around every call that might touch Python, but then you serialize the calls to this functionality!
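For completeness, a sketch (untested) of how the locking can be expressed on the Cython side instead of in C++: declare the callback nogil and re-acquire the GIL inside it. Cython's with gil block uses PyGILState_Ensure under the hood, which also registers threads the interpreter has never seen. The call into reduce_cb itself must be made with the GIL released, otherwise the main thread holds the GIL while join()ing and the workers deadlock waiting for it. As said above, this still serializes the Python callbacks:
# c_mylib.pxd (sketch) -- the extern function must be declared nogil
# to be callable with the GIL released:
cdef extern from "include/mylib.h":
    ctypedef double (*func_t)(const double *, const double *, void *)
    double reduce_cb(const double *, const double *, func_t, void *) nogil

# mylib.pyx (sketch)
# cython: language_level = 3
cimport c_mylib

cdef double func(const double *xb, const double *xe, void *data) nogil:
    cdef double result
    cdef int d = <int>(xe - xb)
    with gil:  # registers the foreign thread with Python and takes the GIL
        result = (<object>data)(<double[:d]>xb)
    return result

def reduce_cb(double [::1] arr not None, f):
    cdef int d = arr.shape[0]
    cdef double *p = &arr[0]
    cdef void *data = <void*>f
    cdef double res
    with nogil:  # release the GIL so the worker threads can take it in func
        res = c_mylib.reduce_cb(p, p + d, func, data)
    return res
Depending on your Cython version, the callback may also need to be marked noexcept (Cython 3) so that its signature matches the plain C func_t typedef.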
I wish to generate arrays in a C extension module and pass them back to Python.
The following code works for python2:
C_generate_array.c:
#include "Python.h"
#include "arrayobject.h"
#include "C_generate_array.h"
#include <assert.h>
static PyMethodDef C_generate_arrayMethods[] = {
{"get_array", get_array, METH_VARARGS},
{NULL, NULL} /* Sentinel - marks the end of this structure */
};
#if PY_MAJOR_VERSION >= 3
static struct PyModuleDef cModPyDem =
{
PyModuleDef_HEAD_INIT,
"C_generate_array", /* name of module */
"", /* module documentation, may be NULL */
-1, /* size of per-interpreter state of the module, or -1 if the module keeps state in global variables. */
C_generate_arrayMethods
};
PyMODINIT_FUNC PyInit_C_generate_array(void)
{
return PyModule_Create(&cModPyDem);
}
#else
void initC_generate_array() {
(void) Py_InitModule("C_generate_array", C_generate_arrayMethods);
import_array(); // Must be present for NumPy. Called first after above line.
}
#endif
static PyObject *get_array(PyObject *self, PyObject *args)
{
int dims[2];
dims[0]=dims[1]=2;
PyArrayObject *matout;
#if PY_MAJOR_VERSION >= 3
//what to do here?
return PyLong_FromLong(1);
#else
matout = (PyArrayObject *) PyArray_FromDims(2,dims,NPY_DOUBLE);
return PyArray_Return(matout);
#endif
}
C_generate_array.h:
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#if PY_MAJOR_VERSION >= 3
#define IS_PY3K
#endif
typedef int bool;
#define true 1
#define false 0
static PyObject *get_array(PyObject *self, PyObject *args);
C_generate_array_setup.py:
from distutils.core import setup, Extension
module1 = Extension('C_generate_array',
include_dirs = ['path_to_python/lib/python3.5/','path_to_python/lib/python3.5/site-packages/numpy/core/include/numpy/'],
sources = ['C_generate_array.c'])
setup (name = 'C_generate_array',
version = '1.0',
description = 'Example',
ext_modules = [module1])
Then building and installing:
>sudo python2.7 C_generate_array_setup.py build
>sudo python2.7 C_generate_array_setup.py install
>python2.7
>>> import C_generate_array
>>> C_generate_array.get_array()
array([[0., 0.],
[0., 0.]])
However, what is the equivalent of this for Python 3? I only found a way to return scalar variables:
>sudo python3.5 C_generate_array_setup.py build
>sudo python3.5 C_generate_array_setup.py install
>python3.5
>>> import C_generate_array
>>> C_generate_array.get_array()
1
How can I return arrays?
I think the issue is that PyArray_FromDims is a very old API function that is no longer recommended and may have been removed from the NumPy headers. I don't know why it seems to work for Python 2, but it's possible that you have an older version of NumPy installed there.
I suggest you use PyArray_ZEROS instead, which has basically the same interface plus an additional parameter that marks whether the array should be Fortran-contiguous (you probably want to set this to 0). If you want to fill the array with something other than zeros, pick a different function (read the documentation I have linked). A sketch is below.
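A sketch (untested) of what the Python 3 branch of get_array() could look like under that suggestion. Note that the Python 3 module-init function in the question never calls import_array(), which NumPy requires before any PyArray_* call, so it would have to be added to PyInit_C_generate_array() as well:
npy_intp dims[2] = {2, 2};  /* PyArray_ZEROS takes npy_intp, not int */
PyArrayObject *matout =
    (PyArrayObject *) PyArray_ZEROS(2, dims, NPY_DOUBLE, 0); /* 0 = C order */
if (matout == NULL)
    return NULL;
return PyArray_Return(matout);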
I have a GNU/Linux application which uses a number of shared memory objects. It could, potentially, be run a number of times on the same system. To keep things tidy, I first create a directory in /dev/shm for each set of shared memory objects.
The problem is that on newer GNU/Linux distributions, I no longer seem to be able to create these in a subdirectory of /dev/shm.
The following is a minimal C program which illustrates what I'm talking about:
/*****************************************************************************
* shm_minimal.c
*
* Test shm_open()
*
* Expect to create shared memory file in:
* /dev/shm/
* └── my_dir
* └── shm_name
*
* NOTE: Only visible on filesystem during execution. I try to be nice, and
* clean up after myself.
*
* Compile with:
* $ gcc shm_minimal.c -o shm_minimal -lrt
*
******************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
int main(int argc, const char* argv[]) {
int shm_fd = -1;
char* shm_dir = "/dev/shm/my_dir";
char* shm_file = "/my_dir/shm_name"; /* does NOT work */
//char* shm_file = "/my_dir_shm_name"; /* works */
// Create directory in /dev/shm
mkdir(shm_dir, 0777);
// make shared memory segment
shm_fd = shm_open(shm_file, O_RDWR | O_CREAT, 0600);
if (-1 == shm_fd) {
switch (errno) {
case EINVAL:
/* Confirmed on:
* kernel v3.14, GNU libc v2.19 (ArchLinux)
* kernel v3.13, GNU libc v2.19 (Ubuntu 14.04 Beta 2)
*/
perror("FAIL - EINVAL");
return 1;
default:
printf("Some other problem not being tested\n");
return 2;
}
} else {
/* Confirmed on:
* kernel v3.8, GNU libc v2.17 (Mint 15)
* kernel v3.2, GNU libc v2.15 (Xubuntu 12.04 LTS)
* kernel v3.1, GNU libc v2.13 (Debian 6.0)
* kernel v2.6.32, GNU libc v2.12 (RHEL 6.4)
*/
printf("Success !!!\n");
}
// clean up
close(shm_fd);
shm_unlink(shm_file);
rmdir(shm_dir);
return 0;
}
/* vi: set ts=2 sw=2 ai expandtab:
*/
When I run this program on a fairly new distribution, the call to shm_open() returns -1, and errno is set to EINVAL. However, when I run on something a little older, it creates the shared memory object in /dev/shm/my_dir as expected.
For the larger application, the solution is simple. I can use a common prefix instead of a directory.
If you could help enlighten me about this apparent change in behavior, it would be very helpful. I suspect someone else out there might be trying to do something similar.
So it turns out the issue stems from how GNU libc validates the shared memory name. Specifically, the shared memory object MUST now be at the root of the shmfs mount point.
This was changed in glibc git commit b20de2c3d9 as the result of bug BZ #16274.
Specifically, the change is the line:
if (name[0] == '\0' || namelen > NAME_MAX || strchr (name, '/') != NULL)
which now disallows '/' anywhere in the filename (not counting the leading '/').
If you have a third-party tool that was broken by this shm_open change, a brilliant coworker found a workaround: preload a library that overrides the shm_open call and swaps slashes for underscores. It does the same for shm_unlink as well, so the application can properly free shared memory when needed.
deslash_shm.cc :
#include <dlfcn.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <algorithm>
#include <string>
// function used in place of the standard shm_open() function
extern "C" int shm_open(const char *name, int oflag, mode_t mode)
{
// keep a function pointer to the real shm_open() function
static int (*real_open)(const char *, int, mode_t) = NULL;
// the first time in, ask the dynamic linker to find the real shm_open() function
if (!real_open) real_open = (int (*)(const char *, int, mode_t)) dlsym(RTLD_NEXT,"shm_open");
// take the name we were given and replace all slashes with underscores instead
std::string n = name;
std::replace(n.begin(), n.end(), '/', '_');
// call the real open function with the patched path name
return real_open(n.c_str(), oflag, mode);
}
// function used in place of the standard shm_unlink() function
extern "C" int shm_unlink(const char *name)
{
// keep a function pointer to the real shm_unlink() function
static int (*real_unlink)(const char *) = NULL;
// the first time in, ask the dynamic linker to find the real shm_unlink() function
if (!real_unlink) real_unlink = (int (*)(const char *)) dlsym(RTLD_NEXT, "shm_unlink");
// take the name we were given and replace all slashes with underscores instead
std::string n = name;
std::replace(n.begin(), n.end(), '/', '_');
// call the real unlink function with the patched path name
return real_unlink(n.c_str());
}
To compile this file:
c++ -fPIC -shared -o deslash_shm.so deslash_shm.cc -ldl
And preload it before starting a process that tries to use non-standard slash characters in shm_open:
in bash:
export LD_PRELOAD=/path/to/deslash_shm.so
in tcsh:
setenv LD_PRELOAD /path/to/deslash_shm.so
I am working through a sample program that uses both C++ source code and CUDA. This is the essential content from my four source files.
matrixmul.cu (main CUDA source code):
#include <stdlib.h>
#include <cutil.h>
#include "assist.h"
#include "matrixmul.h"
int main (int argc, char ** argv)
{
...
computeGold(reference, hostM, hostN, Mh, Mw, Nw); //reference to .cpp file
...
}
matrixmul_gold.cpp (C++ source code, single function, no main method):
void computeGold(float * P, const float * M, const float * N, int Mh, int Mw, int Nw)
{
...
}
matrixmul.h (header for matrixmul_gold.cpp file)
#ifndef matrixmul_h
#define matrixmul_h
extern "C"
void computeGold(float * P, const float * M, const float * N, int Mh, int Mw, int Nw);
#endif
assist.h (helper functions)
I am trying to compile and link these files so that they, well, work. So far I can get matrixmul_gold.cpp compiled using:
g++ -c matrixmul_gold.cpp
And I can compile the CUDA source code without errors using:
nvcc -I/home/sbu/NVIDIA_GPU_Computing_SDK/C/common/inc -L/home/sbu/NVIDIA_GPU_Computing_SDK/C/lib matrixmul.cu -c -lcutil_x86_64
But I just end up with two .o files. I've tried a lot of different ways to link the two .o files, but so far it's a no-go. What's the proper approach?
UPDATE: As requested, here is the output of:
nm matrixmul_gold.o matrixmul.o | grep computeGold
nm: 'matrixmul.o': No such file
0000000000000000 T _Z11computeGoldPfPKfS1_iii
I think the missing 'matrixmul.o' error is because I am not actually getting a successful compile when running the suggested compile command:
nvcc -I/home/sbu/NVIDIA_GPU_Computing_SDK/C/common/inc -L/home/sbu/NVIDIA_GPU_Computing_SDK/C/lib -o matrixmul matrixmul.cu matrixmul_gold.o -lcutil_x86_64
UPDATE 2: I was missing an extern "C" from the beginning of matrixmul_gold.cpp. I added that and the suggested compilation command works great. Thank you!
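For reference, a sketch of that fix (the question elides the function body, so it stays elided here): giving the definition C linkage makes nm report computeGold instead of the mangled _Z11computeGoldPfPKfS1_iii, so the reference from matrixmul.cu resolves. Including the header that already carries the extern "C" declaration achieves the same thing:
// matrixmul_gold.cpp (sketch)
#include "matrixmul.h"   // declares computeGold with extern "C"

extern "C" void computeGold(float * P, const float * M, const float * N,
                            int Mh, int Mw, int Nw)
{
    // ... reference (gold) implementation as in the question ...
}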
Conventionally, you would use whichever compiler compiles the code containing the main routine to link the application. In this case you have main in the .cu file, so use nvcc to do the linking. Something like this:
$ g++ -c matrixmul_gold.cpp
$ nvcc -I/home/sbu/NVIDIA_GPU_Computing_SDK/C/common/inc \
-L/home/sbu/NVIDIA_GPU_Computing_SDK/C/lib \
-o matrixmul matrixmul.cu matrixmul_gold.o -lcutil_x86_64
This will link an executable binary called matrixmul from matrixmul.cu, matrixmul_gold.o, and the cutil library (implicitly, nvcc will link the CUDA runtime library and CUDA driver library as well).