Can someone recommend a good reference or tutorial for the cblas interface? Nothing comes up on Google, all the man pages I've found are for the Fortran BLAS interface, and the PDF that came with MKL literally took ten seconds to search and wasn't helpful.
In particular, I'm curious why there is an extra parameter for row- vs. column-major order; can't the same operations already be achieved with the transpose flags? It seems like the extra parameter only adds complexity to an already error-prone interface.
This article shows how to use cblas (and others) in C with a simple example: http://www.seehuhn.de/pages/linear
I have quoted the relevant part below in case the site goes down.
Using BLAS
To test the BLAS routines we want to perform a simple matrix-vector multiplication. Reading the file blas2-paper.ps.gz we find that the name of the corresponding Fortran function is DGEMV. The text blas2-paper.ps.gz also explains the meaning of the arguments to this function. In cblas.ps.gz we find that the corresponding C function name is cblas_dgemv. The following example uses this function to calculate the matrix-vector product
/ 3 1 3 \ / -1 \
| 1 5 9 | * | -1 |.
\ 2 6 5 / \ 1 /
Example file testblas.c:
#include <stdio.h>
#include <cblas.h>
double m[] = {
3, 1, 3,
1, 5, 9,
2, 6, 5
};
double x[] = {
-1, -1, 1
};
double y[] = {
0, 0, 0
};
int
main()
{
    int i, j;

    for (i=0; i<3; ++i) {
        for (j=0; j<3; ++j) printf("%5.1f", m[i*3+j]);
        putchar('\n');
    }
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 3, 1.0, m, 3,
                x, 1, 0.0, y, 1);
    for (i=0; i<3; ++i) printf("%5.1f\n", y[i]);
    return 0;
}
To compile this program we use the following command.
cc testblas.c -o testblas -lblas -lm
The output of this test program is
3.0 1.0 3.0
1.0 5.0 9.0
2.0 6.0 5.0
-1.0
3.0
-3.0
which shows that everything worked fine and that we did not even use the transposed matrix by mistake.
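As for the original question about the row/column-major parameter: for this square example the same product can indeed be obtained in column-major mode by flipping the transpose flag, because the row-major storage of m is byte-for-byte the column-major storage of its transpose. A minimal sketch (my own addition, not from the quoted article; for non-square matrices the row and column counts would also have to be swapped):

/* Same y = m * x as above, expressed in column-major terms: the buffer m,
 * read column-major, holds the transpose of the intended matrix, so we
 * request the transposed operation to undo that. */
cblas_dgemv(CblasColMajor, CblasTrans, 3, 3, 1.0, m, 3,
            x, 1, 0.0, y, 1);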
The IRIX man page for intro_cblas is pretty good:
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?cmd=getdoc&coll=0650&db=man&fname=3%20INTRO_CBLAS
Before I open a bug report, I want to check what is going on here.
I'm porting this C code to Rust:
unsigned __int128 r = (unsigned __int128)a * (unsigned __int128)b;
easy enough (I thought):
let r = (a as u128) * (b as u128);
Now, with these input parameters, I get a different multiplication result in C than in Rust:
(0x56eaa5f5f650a9e3 as u128) * (0xa0cf24341e75bda9 as u128)
The results are different in Rust and C:
Rust: 0x3698fbc09d2c5b15e8889b1b676bbddb
C:    0x3698fbc0f417010bded944fe676bbddb
                ^^^^^^^^^^^^^^^^
I cross-checked the result, and got the same result as the C code.
Am I missing something?
=== context information added:
It is this function from xmr-stak (https://github.com/fireice-uk/xmr-stak) that is behaving differently:
static inline uint64_t _umul128(uint64_t a, uint64_t b, uint64_t* hi)
{
    unsigned __int128 r = (unsigned __int128)a * (unsigned __int128)b;
    *hi = r >> 64;
    return (uint64_t)r;
}
Regardless of whether the C implementation is wrong, I have to recreate the exact computation in Rust, because it is needed for a hash computation.
It looks like you must have made a typo in one of the two languages:
>>> hex(0x3698fbc09d2c5b15e8889b1b676bbddb//0x56eaa5f5f650a9e3)
'0xa0cf24341e75bda9' # what your Rust code uses
>>> hex(0x3698fbc0f417010bded944fe676bbddb//0x56eaa5f5f650a9e3)
'0xa0cf24351e75bda9' # what your online calculator uses
          ^
A classic case of an off-by-0x100000000 error :)
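For completeness, the product can be reproduced in C with a few lines (a sketch of my own, not from the thread; it assumes a compiler providing the unsigned __int128 extension, such as GCC or Clang on a 64-bit target, and should print the same value as the Rust result above):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t a = 0x56eaa5f5f650a9e3ULL;
    uint64_t b = 0xa0cf24341e75bda9ULL;  /* the operand the Rust code used */
    unsigned __int128 r = (unsigned __int128)a * (unsigned __int128)b;

    /* Print the 128-bit product as two 64-bit halves: high, then low. */
    printf("0x%016llx%016llx\n",
           (unsigned long long)(r >> 64),
           (unsigned long long)(uint64_t)r);
    return 0;
}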
Good afternoon,
I have been trying to use a method similar to R's subsetting x[200:300] while using Rcpp. (Note: this is not the exact problem I am trying to solve, but I need to subset many ranges within the functions I am trying to write in C++, and I found that this was the bottleneck of my performance.)
However, although I have tried using the methods in Rcpp, with iterators or other approaches, I just don't seem to find a solution that is even minimally fast. Most of the solutions I find are very slow.
Looking at the Rcpp reference, I can't seem to find anything, nor can I find it by searching StackExchange.
I know this code is pretty ugly right now... but I am just clueless:
// [[Rcpp::export]]
StringVector range_test_(StringVector& x, int i, int j) {
    StringVector vect(x.begin() + i, x.begin() + j);
    return vect;
}
And then it is like 800 times slower. I have been trying to find the same x[i:j] function that R has, which is very fast, within the Rcpp base... but I can't find it.
tests_range <- rbenchmark::benchmark(
x[200:3000],
range_test_(x, 200, 3000),
order = NULL,
replications = 80
)[,1:4]
This gives as a result:
test replications elapsed relative
1 x[200:3000] 80 0.001 1
3 range_test_(x, 200, 3000) 80 0.822 822
If anybody knows how to access the subsetting function x[i:j] or something as fast within Rcpp I would really appreciate it. I just can't seem to find the tool I am missing.
The issue is that the iterator constructor makes a copy. See this page
Copy the data between iterators first and last to the created vector
However, you can try this instead
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector in_range(Rcpp::StringVector &x, int i, int j) {
    return x[Rcpp::Range(i - 1, j - 1)]; // zero indexed
}
The time taken is a lot closer
> set.seed(20597458)
> x <- replicate(1e3, paste0(sample(LETTERS, 5), collapse = ""))
> head(x)
[1] "NHVFQ" "XMEOF" "DABUT" "XKTAZ" "NQXZL" "NPJLM"
>
> stopifnot(all.equal(in_range(x, 100, 200), x[100:200]))
>
> library(microbenchmark)
> microbenchmark(in_range(x, 100, 200), x[100:200], times = 1e4)
Unit: nanoseconds
expr min lq mean median uq max neval
in_range(x, 100, 200) 1185 1580 3669.780 1581 1976 3263205 10000
x[100:200] 790 790 1658.571 1185 1186 2331256 10000
Note that there is a page here on subsetting. I could not find a relevant example there, though.
I am looking for a definitive specification describing the expected arguments and behavior of ioctl 0x1268 (BLKSSZGET).
This number is declared in many places (none of which contain a definitive reference source), such as linux/fs.h, but I can find no specification for it.
Surely, somebody at some point in the past decided that 0x1268 would get the physical sector size of a device and documented that somewhere. Where does this information come from and where can I find it?
Edit: I am not asking what BLKSSZGET does in general, nor am I asking what header it is defined in. I am looking for a definitive, standardized source that states what argument types it should take and what its behavior should be for any driver that implements it.
Specifically, I am asking because there appears to be a bug in blkdiscard in util-linux 2.23 (and 2.24) where the sector size is queried into a uint64_t, but the high 32 bits are left untouched since BLKSSZGET appears to expect a 32-bit integer; this leads to an incorrect sector size, incorrect alignment calculations, and failures in blkdiscard when it should succeed. So before I submit a patch, I need to determine, with absolute certainty, whether the problem is that blkdiscard should be using a 32-bit integer, or that the driver implementation in my kernel should be using a 64-bit integer.
Edit 2: Since we're on the topic, the proposed patch presuming blkdiscard is incorrect is:
--- sys-utils/blkdiscard.c-2.23 2013-11-01 18:28:19.270004947 -0400
+++ sys-utils/blkdiscard.c 2013-11-01 18:29:07.334002382 -0400
@@ -71,7 +71,8 @@
{
char *path;
int c, fd, verbose = 0, secure = 0;
- uint64_t end, blksize, secsize, range[2];
+ uint64_t end, blksize, range[2];
+ uint32_t secsize;
struct stat sb;
static const struct option longopts[] = {
@@ -146,8 +147,8 @@
err(EXIT_FAILURE, _("%s: BLKSSZGET ioctl failed"), path);
/* align range to the sector size */
- range[0] = (range[0] + secsize - 1) & ~(secsize - 1);
- range[1] &= ~(secsize - 1);
+ range[0] = (range[0] + (uint64_t)secsize - 1) & ~((uint64_t)secsize - 1);
+ range[1] &= ~((uint64_t)secsize - 1);
/* is the range end behind the end of the device ?*/
end = range[0] + range[1];
Applied to e.g. https://www.kernel.org/pub/linux/utils/util-linux/v2.23/.
The answer to "where is this specified?" does seem to be the kernel source.
I asked the question on the kernel mailing list here: https://lkml.org/lkml/2013/11/1/620
In response, Theodore Ts'o wrote (note: he mistakenly identified sys-utils/blkdiscard.c in his list but it's inconsequential):
BLKSSZGET returns an int. If you look at the sources of util-linux
v2.23, you'll see it passes an int to BLKSSZGET in
sys-utils/blkdiscard.c
lib/blkdev.c
E2fsprogs also expects BLKSSZGET to return an int, and if you look at
the kernel sources, it very clearly returns an int.
The one place it doesn't is in sys-utils/blkdiscard.c, where as you
have noted, it is passing in a uint64 to BLKSSZGET. This looks like
it's a bug in sys-util/blkdiscard.c.
He then went on to submit a patch¹ to blkdiscard at util-linux:
--- a/sys-utils/blkdiscard.c
+++ b/sys-utils/blkdiscard.c
@@ -70,8 +70,8 @@ static void __attribute__((__noreturn__)) usage(FILE *out)
int main(int argc, char **argv)
{
char *path;
- int c, fd, verbose = 0, secure = 0;
- uint64_t end, blksize, secsize, range[2];
+ int c, fd, verbose = 0, secure = 0, secsize;
+ uint64_t end, blksize, range[2];
struct stat sb;
static const struct option longopts[] = {
I had been hesitant to mention the blkdiscard tool in both my mailing list post and the original version of this SO question specifically for this reason: I know what's in my kernel's source, it's already easy enough to modify blkdiscard to agree with the source, and this ended up distracting from the real question of "where is this documented?".
So, as for the specifics, somebody more official than me has also stated that the BLKSSZGET ioctl takes an int, but the general question regarding documentation remained. I then followed up with https://lkml.org/lkml/2013/11/3/125 and received another reply from Theodore Ts'o (wiki for credibility) answering the question. He wrote:
> There was a bigger question hidden behind the context there that I'm
> still wondering about: Are these ioctl interfaces specified and
> documented somewhere? From what I've seen, and from your response, the
> implication is that the kernel source *is* the specification, and not
> document exists that the kernel is expected to comply with; is this
> the case?
The kernel source is the specification. Some of these ioctl are
documented as part of the linux man pages, for which the project home
page is here:
https://www.kernel.org/doc/man-pages/
However, these document existing practice; if there is a discrepancy
between what the kernel has implemented and the Linux man pages,
it is the Linux man pages which are buggy and which will be changed.
That is, man pages are descriptive, not prescriptive.
I also asked about the use of "int" in general for public kernel APIs, his response is there although that is off-topic here.
Answer: So, there you have it: the ioctl interfaces are specified by the kernel source itself; there is no separate document that the kernel adheres to. There is documentation describing the kernel's implementations of various ioctls, but if there is a mismatch, it is an error in the documentation, not in the kernel.
¹ With all the above in mind, I want to point out that an important difference in the patch Theodore Ts'o submitted, compared to mine, is the use of "int" rather than "uint32_t" -- BLKSSZGET, as per kernel source, does indeed expect an argument that is whatever size "int" is on the platform, not a forced 32-bit value.
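For reference, a minimal usage sketch consistent with the above (my own example, not from the thread; the device path is arbitrary and error handling is kept short):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKSSZGET */

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);  /* example device path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int secsize = 0;  /* BLKSSZGET expects a plain int, per the kernel source */
    if (ioctl(fd, BLKSSZGET, &secsize) < 0) {
        perror("BLKSSZGET");
        close(fd);
        return 1;
    }

    printf("logical sector size: %d\n", secsize);
    close(fd);
    return 0;
}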
I implemented a small program in C to calculate pi using a Monte Carlo method (mainly out of personal interest and for training). After having implemented the basic code structure, I added a command-line option allowing the calculations to be executed threaded.
I expected major speed-ups, but I was disappointed. The command-line synopsis should be clear. The final number of iterations made to approximate pi is the product of the number of -iterations and -threads passed via the command line. Leaving -threads blank defaults it to 1 thread, resulting in execution in the main thread.
The tests below were run with 80 million iterations in total.
On Windows 7 64Bit (Intel Core2Duo Machine):
Compiled using Cygwin GCC 4.5.3: gcc-4 pi.c -o pi.exe -O3
On Ubuntu/Linaro 12.04 (8Core AMD):
Compiled using GCC 4.6.3: gcc pi.c -lm -lpthread -O3 -o pi
Performance
On Windows, the threaded version is a few milliseconds faster than the unthreaded one. I expected better performance, to be honest. On Linux, ew! What the heck? Why does it take even 2000% longer? Of course this depends a lot on the implementation, so here it is. An excerpt after the command-line argument parsing is done and the calculation is started:
// Begin computation.
clock_t t_start, t_delta;
double pi = 0;

if (args.threads == 1) {
    t_start = clock();
    pi = pi_mc(args.iterations);
    t_delta = clock() - t_start;
}
else {
    pthread_t* threads = malloc(sizeof(pthread_t) * args.threads);
    if (!threads) {
        return alloc_failed();
    }

    struct PIThreadData* values = malloc(sizeof(struct PIThreadData) * args.threads);
    if (!values) {
        free(threads);
        return alloc_failed();
    }

    t_start = clock();
    for (i=0; i < args.threads; i++) {
        values[i].iterations = args.iterations;
        values[i].out = 0.0;
        pthread_create(threads + i, NULL, pi_mc_threaded, values + i);
    }
    for (i=0; i < args.threads; i++) {
        pthread_join(threads[i], NULL);
        pi += values[i].out;
    }
    t_delta = clock() - t_start;

    free(threads);
    threads = NULL;
    free(values);
    values = NULL;

    pi /= (double) args.threads;
}
While pi_mc_threaded() is implemented as:
struct PIThreadData {
    int iterations;
    double out;
};

void* pi_mc_threaded(void* ptr) {
    struct PIThreadData* data = ptr;
    data->out = pi_mc(data->iterations);
    return NULL;
}
You can find the full source code at http://pastebin.com/jptBTgwr.
Question
Why is this? Why this extreme difference on Linux? I expected the amount of time taken to calculate to be at least 3/4 of the original time. It is of course possible that I simply misused the pthread library. A clarification on how to do this correctly would be very nice.
The problem is that in glibc's implementation, rand() calls __random(), and that
long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);

  (void) __random_r (&unsafe_state, &retval);

  __libc_lock_unlock (lock);

  return retval;
}
locks around each call to the function __random_r that does the actual work.
Thus, as soon as you have more than one thread using rand(), you make each thread wait for the other(s) on almost every call to rand(). Directly using random_r() with your own buffers in each thread should be much faster.
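A minimal sketch of what a per-thread generator could look like (my own illustration, not the original poster's code; the seed field and the 64-byte state size are arbitrary choices, and random_r()/initstate_r() are glibc-specific):

#define _GNU_SOURCE          /* glibc: exposes random_r() / initstate_r() */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct PIThreadData {
    int iterations;
    unsigned seed;           /* set per thread before pthread_create(), e.g. i + 1 */
    double out;
};

void* pi_mc_threaded(void* ptr) {
    struct PIThreadData* data = ptr;
    struct random_data rng;
    char state[64];
    int32_t r;
    long long inner = 0;
    int i;

    /* glibc requires the random_data struct to be zeroed before initstate_r(). */
    memset(&rng, 0, sizeof rng);
    initstate_r(data->seed, state, sizeof state, &rng);

    for (i = 0; i < data->iterations; i++) {
        double x, y;
        random_r(&rng, &r); x = r / (double) RAND_MAX;
        random_r(&rng, &r); y = r / (double) RAND_MAX;
        if (x * x + y * y <= 1.0)
            inner++;
    }

    data->out = 4.0 * (double) inner / (double) data->iterations;
    return NULL;
}

No lock is taken anywhere in the hot loop, so the threads no longer serialize on every random number.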
Performance and threading is a black art. The answer depends on the specifics of the compiler and libraries used to do threading, how well the kernel handles it, etc. Basically, if your libraries for *nix are not efficient in switching, moving objects around, etc., threading will in fact be slower. This is one of the reasons a lot of us doing thread work now use the JVM or JVM-like languages. We can trust the JVM runtime's behavior -- its overall speed may vary with platform, but it's consistent on that platform. In addition, you may have some hidden wait/race conditions that you uncovered just due to timing and that may not show up on Windows.
If you are in a position to change your language, consider Scala or D. Scala is the actor-model successor to Java, and D is the successor to C. Both languages show their roots -- if you can write in C, D should be no problem. Both languages, however, implement the actor model. NO MORE THREAD POOLS, NO MORE RACE CONDITIONS, ETC!
For comparison, I just tried your app on Windows Vista, compiled with Borland C++, and the 2 thread version performed nearly twice as fast as the single thread.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 12.511000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.142397
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.584000 sec
Threads: 2
That's compiled against the thread-safe run-time library. Using the single thread library, both versions run at twice their thread-safe speed.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.458000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.141314
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 3.978000 sec
Threads: 2
So the 2-thread version is still twice as fast, but the 1-thread version with the single-threaded library is actually faster than the 2-thread version on the thread-safe library.
Looking at Borland's rand implementation, they use thread local storage for the seed in the thread-safe implementation, so it's not going to have the same negative impact on threaded code as glibc's lock, but the thread-safe implementation will obviously be slower than the single thread implementation.
The bottom line though, is that your compiler's rand implementation is probably the main performance issue in both cases.
Update
I've just tried replacing your rand_01 calls with an inline implementation of Borland's rand function using a local variable for the seed, and the results are consistently twice as fast in the 2-thread case.
The updated code looks like this:
#define MULTIPLIER 0x015a4e35L
#define INCREMENT 1
double pi_mc(int iterations) {
    unsigned seed = 1;
    long long inner = 0;
    long long outer = 0;
    int i;

    for (i=0; i < iterations; i++) {
        seed = MULTIPLIER * seed + INCREMENT;
        double x = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        seed = MULTIPLIER * seed + INCREMENT;
        double y = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        double d = sqrt(pow(x, 2.0) + pow(y, 2.0));
        if (d <= 1.0) {
            inner++;
        }
        else {
            outer++;
        }
    }
    return ((double) inner / (double) iterations) * 4;
}
I don't know how good that is as rand implementations go, but it's worth at least trying on Linux to see whether it makes a difference to the performance.
Does anyone know what kind of language is used below?
String^ fileName = "C:\\Test1.txt";
array<Byte>^ Array = gcnew array<Byte>(512);
try
{
    FileStream^ fs = File::OpenRead(fileName);
    fs->Read(Array, 0, 512);
    fs->Close();
}
catch (...)
{
    MessageBox::Show("Disk error");
    Application::Exit();
}
and another example of that language:
int RotateLeft3(int number)
{
    if ((number & 0x20000000) == 0x20000000)
    {
        number <<= 3;
        number |= 1;
    }
    else
        number <<= 3;
    return number;
}
It's C++ in .NET. You can tell by the use of ^ instead of * for pointers (handles).
This is C++/CLI, in other words the C++ variant that runs on top of the .Net CLR.
On no account should this be confused with native C++.
It looks like Managed C++ from Microsoft.