SCons: checking for a compiler option

I want to check in SCons whether my compiler supports some option (for example
-fcilkplus). The only way I have managed to do it is the following sequence of
operations:
    env.Prepend(CXXFLAGS = ['-fcilkplus'], LIBS = ['cilkrts'])
Then I launch my custom checker:
    def CheckCilkPlusCompiler(context):
        test = """
        #include <cilk/cilk.h>
        #include <assert.h>
        int fib(int n) {
            if (n < 2) return n;
            int a = cilk_spawn fib(n-1);
            int b = fib(n-2);
            cilk_sync;
            return a + b;
        }
        int main() {
            int result = fib(30);
            assert(result == 832040);
            return 0;
        }
        """
        context.Message('Checking Cilk++ compiler ... ')
        result = context.TryRun(test, '.cpp')
        context.Result(result[0])
        return result[0]
Now if it fails, I have to remove the two extra options (-fcilkplus and
cilkrts) from the environment variables. Is there a better way to do that?
The problem is that I can't manage to access the env from the context, and
therefore I can't make a clone of it.

You can check the availability of a library with SCons as follows:
    env = Environment()
    conf = Configure(env)
    if not conf.CheckLib('cilkrts'):
        print('Did not find libcilkrts.a, exiting!')
        Exit(1)
    else:
        env.Prepend(CXXFLAGS = ['-fcilkplus'], LIBS = ['cilkrts'])
    env = conf.Finish()
You could also check the availability of a header as follows:
    env = Environment()
    conf = Configure(env)
    if conf.CheckCHeader('cilk/cilk.h'):
        env.Prepend(CXXFLAGS = ['-fcilkplus'], LIBS = ['cilkrts'])
    env = conf.Finish()
Update:
I just realized you can access the environment on the Configure object as follows:
    conf.env.Append(...)

Related

Assign memory blob to PyTorch output tensor (C++ API)

I am training a linear model using PyTorch and saving it to a file with the "save" function call. I have other code that loads the model in C++ and performs inference.
I would like to instruct the Torch C++ library to use a specific memory blob for the final output tensor. Is this even possible? If yes, how? Below is a small example of what I am trying to achieve.
    #include <iostream>
    #include <memory>
    #include <torch/script.h>

    int main(int argc, const char* argv[]) {
        if (argc != 3) {
            std::cerr << "usage: example-app <path-to-exported-script-module> <size-in-MB>\n";
            return -1;
        }
        long numElements = (1024*1024)/sizeof(float) * atoi(argv[2]);
        float *a = new float[numElements];
        float *b = new float[numElements];
        float *c = new float[numElements*4];
        for (int i = 0; i < numElements; i++){
            a[i] = i;
            b[i] = -i;
        }
        //auto options = torch::TensorOptions().dtype(torch::kFloat64);
        at::Tensor a_t = torch::from_blob((float*) a, {numElements,1});
        at::Tensor b_t = torch::from_blob((float*) b, {numElements,1});
        at::Tensor out = torch::from_blob((float*) c, {numElements,4});
        at::Tensor c_t = at::cat({a_t,b_t}, 1);
        at::Tensor d_t = at::reshape(c_t, {numElements,2});
        torch::jit::script::Module module;
        try {
            module = torch::jit::load(argv[1]);
        }
        catch (const c10::Error& e) {
            return -1;
        }
        out = module.forward({d_t}).toTensor();
        std::cout << out.sizes() << "\n";
        delete [] a;
        delete [] b;
        delete [] c;
        return 0;
    }
So, I am allocating memory into "c" and then creating a tensor out of this memory, which I store in a tensor named "out". I load the model and call the forward method. I observe that the resulting data are copied/moved into the "out" tensor. However, I would like to instruct Torch to store the result directly into the "out" memory. Is this possible?
Somewhere in the libtorch source code (I don't remember where; I'll try to find the file), there is an operator which looks something like the one below (notice the trailing &&):
    Tensor& operator=(const Tensor& rhs) &&;
and which does what you need, if I remember correctly. Basically, torch assumes that if you assign a tensor rhs to an rvalue-reference tensor, then you actually mean to copy rhs into the underlying storage.
So in your case, that would be
    std::move(out) = module.forward({d_t}).toTensor();
or
    torch::from_blob((float*) c, {numElements,4}) = module.forward({d_t}).toTensor();

Call a function from a given package using RcppArmadillo (Rcpp)

I am learning to use RcppArmadillo (Rcpp) to make my code run faster.
In practice, I often find that I want to call functions from existing packages.
Below is a small example.
I want to calculate the thresholds of the lasso (or SCAD, MCP). In R we can use
the thresh_est function from library(HSDiC):
    library(HSDiC) # acquire the thresh_est() function
    # usage: thresh_est(z, lambda, tau, a = 3, penalty = c("MCP", "SCAD", "lasso"))
    # help: ?thresh_est
    z = seq(-5,5,length=500)
    thresh = thresh_est(z,lambda=1,tau=1,penalty="lasso")
    # thresh = thresh_est(z,lambda=1,tau=1,a=3,penalty="MCP")
    # thresh = thresh_est(z,lambda=1,tau=1,a=3.7,penalty="SCAD")
    plot(z,thresh)
Then I try to implement the above via RcppArmadillo (Rcpp).
According to the answers of
teuder as well as
coatless
and coatless,
I wrote the code below (saved as thresh_arma.cpp):
    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp; // with this, Rcpp::Function can be Function for short, etc.
    using namespace arma; // with this, arma::vec can be vec for short, etc.
    // using namespace std; // std::string

    // [[Rcpp::export]]
    vec thresh_arma(vec z, double lambda, double a=3, double tau=1, std::string penalty) {
        // Obtain the namespace of the package
        Environment pkg = namespace_env("HSDiC");
        // Environment HSDiC("package:HSDiC");
        // Pick up the function from the package
        Function f = pkg["thresh_est"];
        // Function f = HSDiC["thresh_est"];
        vec thresh;
        // In R: thresh_est(z, lambda, tau, a = 3, penalty = c("MCP", "SCAD", "lasso"))
        if (penalty == "lasso") {
            thresh = f(_["z"]=z,_["lambda"]=lambda,_["a"]=a,_["tau"]=tau,_["penalty"]="lasso");
        } else if (penalty == "scad") {
            thresh = f(_["z"]=z,_["lambda"]=lambda,_["a"]=a,_["tau"]=tau,_["penalty"]="SCAD");
        } else if (penalty == "mcp") {
            thresh = f(_["z"]=z,_["lambda"]=lambda,_["a"]=a,_["tau"]=tau,_["penalty"]="MCP");
        }
        return thresh;
    }
Then I compile it with
    library(Rcpp)
    library(RcppArmadillo)
    sourceCpp("thresh_arma.cpp")
    # z = seq(-5,5,length=500)
    # thresh = thresh_arma(z,lambda=1,tau=1,penalty="lasso")
    # # thresh = thresh_arma(z,lambda=1,a=3,tau=1,penalty="mcp")
    # # thresh = thresh_arma(z,lambda=1,a=3.7,tau=1,penalty="scad")
    # plot(z,thresh)
However, the compilation fails and I have no idea why.

Check BIC via RcppArmadillo(Rcpp)

I want to check the BIC usage in statistics.
My little example, saved as check_bic.cpp, is as follows:
    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp;
    using namespace arma;

    // [[Rcpp::export]]
    List check_bic(const int N = 10, const int p = 20, const double seed = 0){
        arma_rng::set_seed(seed); // for reproducibility
        arma::mat Beta = randu(p,N); // randu/randn: random values (uniform/normal distributions)
        arma::vec Bic = randu(N);
        uvec ii = find(Bic == min(Bic)); // may be just one or several elements
        int id = ii(ii.n_elem); // fetch the last element
        vec behat = Beta.col(id); // fetch the id-th column of matrix Beta
        List ret;
        ret["Bic"] = Bic;
        ret["ii"] = ii;
        ret["id"] = id;
        ret["Beta"] = Beta;
        ret["behat"] = behat;
        return ret;
    }
Then I compile check_bic.cpp in R with
    library(Rcpp)
    library(RcppArmadillo)
    sourceCpp("check_bic.cpp")
and the compilation passes successfully.
However, when I run
    check_bic(10,20,0)
in R, it shows errors:
    error: Mat::operator(): index out of bounds
    Error in check_bic(10, 20, 0) : Mat::operator(): index out of bounds
I checked the .cpp code line by line, and I guess the problem probably happens at
    uvec ii = find(Bic == min(Bic)); // may be just one or several elements
    int id = ii(ii.n_elem); // fetch the last element
since even if uvec ii has only one element, ii(ii.n_elem) may be out of range in
Rcpp (whereas in Matlab, with 1-based indexing, x(numel(x)) is fine), but I don't
know how to deal with this case. Any help?

How to properly pass arguments as structs to NVRTC?

    let prog =
        """//Kernel code:
        extern "C" {
            #pragma pack(1)
            typedef struct {
                int length;
                float *pointer;
            } global_array_float;

            __global__ void kernel_main(global_array_float x){
                printf("(on device) x.length=%d\n",x.length); // prints: (on device) x.length=10
                printf("(on device) x.pointer=%lld\n",x.pointer); // prints: (on device) x.pointer=0
                printf("sizeof(global_array_float)=%d", sizeof(global_array_float)); // 12 bytes just as expected
            }
        ;}"""
    printfn "%s" prog
    let cuda_kernel = compile_kernel prog "kernel_main"

    let test_launcher(str: CudaStream, kernel: CudaKernel, x: CudaGlobalArray<float32>, o: CudaGlobalArray<float32>) =
        let block_size = 1
        kernel.GridDimensions <- dim3(1)
        kernel.BlockDimensions <- dim3(block_size)
        printfn "(on host) x.length=%i" x.length // prints: (on host) x.length=10
        printfn "(on host) x.pointer=%i" x.pointer // prints: (on host) x.pointer=21535919104
        let args: obj [] = [|x.length;x.pointer|]
        kernel.RunAsync(str.Stream, args)

    let cols, rows = 10, 1
    let a = d2M.create((rows,cols))
            |> fun x -> fillRandomUniformMatrix ctx.Str x 1.0f 0.0f; x
    let a' = d2MtoCudaArray a
    //printfn "%A" (getd2M a)
    let o = d2M.create((rows,cols)) // o does nothing here as this is a minimalist example.
    let o' = d2MtoCudaArray o

    test_launcher(ctx.Str,cuda_kernel,a',o')
    cuda_context.Synchronize()
    //printfn "%A" (getd2M o)
Here is an excerpt from the main repo that I am currently working on. I am very close to having a working F#-quotations-to-CUDA-C compiler, but I can't figure out how to pass the arguments into the function properly from the host side.
Despite the pack pragma, the NVRTC 7.5 CUDA compiler is doing some other optimization, and I have no idea what it is.
Because I am working from F# quotations, I need to pass the arguments as a single struct for this to work. If I change the function from kernel_main(global_array_float x) to something like kernel_main(int x_length, float *x_pointer) then it works, but that is not the form the quotations system gives me up front, and I would like to avoid doing extra work to make F# more like C.
Any idea what I could try?
I've made two mistaken assumptions.
The first error was assuming that let args: obj [] = [|x.length;x.pointer|] would get neatly placed on the stack next to each other. In actuality these are two different arguments, and the second one gets lost somewhere when passed along as above.
It can be fixed by making a custom struct type and rewriting the expression like so: let args: obj [] = [|CudaLocalArray(x.length,x.pointer)|].
The other mistaken assumption, which I found when I rewrote it as above, is that using [<StructLayout(LayoutKind.Sequential)>] does not mean the fields will be packed together. Instead, as in C, pack is an argument, so it needs to be used like so: [<StructLayout(LayoutKind.Sequential,Pack=1)>].

Why does patching a string using .ptr fail under Linux64 but not under Win32?

Why does the small sample below fail under Linux64 but not under Windows32?
    module test;

    import std.string, std.stdio;

    void main(string[] args)
    {
        string a = "abcd=1234";
        auto b = &a;
        auto Index = indexOf(*b, '=');
        if (Index != -1)
            *cast(char*) (b.ptr + Index) = '#';
        writeln(*b);
        readln;
    }
One thing to remember is that string is an alias for immutable(char)[], which means that trying to write to its elements is undefined behavior.
One reason I can think of for the differing behavior is that under Linux64 the compiler puts the string data in write-protected memory, which means that *cast(char*) (b.ptr + Index) = '#'; fails (either silently or with a segfault).
