Good afternoon,
I have been trying to use a similar method to subsetting x[200:300] in R while using Rcpp. (Note, this is not the problem I am trying to solve, but I need to subset many ranges within the functions I am trying to write in C++, and I found that this was the bottleneck of my performance)
However, although I have tried using the methods in Rcpp, iterators, and other approaches, I can't find a solution that is even minimally fast. Most of the solutions I find are very slow.
Looking through the Rcpp reference, I can't find anything, nor can I find it on StackExchange.
I know this code is pretty ugly right now... But I am just clueless
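#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
StringVector range_test_(StringVector& x, int i, int j) {
    // Copies the elements in [i, j) (zero-based) into a new vector
    StringVector vect(x.begin() + i, x.begin() + j);
    return vect;
}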
And then, it is about 800 times slower. I have been trying to find an equivalent of R's x[i:j], which is very fast, within base Rcpp, but I can't find it.
tests_range <- rbenchmark::benchmark(
x[200:3000],
range_test_(x, 200, 3000),
order = NULL,
replications = 80
)[,1:4]
Gives as result
test replications elapsed relative
1 x[200:3000] 80 0.001 1
3 range_test_(x, 200, 3000) 80 0.822 822
If anybody knows how to access the subsetting function x[i:j] or something as fast within Rcpp I would really appreciate it. I just can't seem to find the tool I am missing.
The issue is that the iterator constructor makes a copy. See this page
Copy the data between iterators first and last to the created vector
However, you can try this instead
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector in_range(Rcpp::StringVector &x, int i, int j) {
    return x[Rcpp::Range(i - 1, j - 1)]; // zero indexed
}
The time taken is a lot closer
> set.seed(20597458)
> x <- replicate(1e3, paste0(sample(LETTERS, 5), collapse = ""))
> head(x)
[1] "NHVFQ" "XMEOF" "DABUT" "XKTAZ" "NQXZL" "NPJLM"
>
> stopifnot(all.equal(in_range(x, 100, 200), x[100:200]))
>
> library(microbenchmark)
> microbenchmark(in_range(x, 100, 200), x[100:200], times = 1e4)
Unit: nanoseconds
expr min lq mean median uq max neval
in_range(x, 100, 200) 1185 1580 3669.780 1581 1976 3263205 10000
x[100:200] 790 790 1658.571 1185 1186 2331256 10000
Note that there is a page here on subsetting. I could not find a relevant example there, though.
I'm trying to generate multinomial random variables as fast as possible. And I learned that gsl_ran_multinomial could be a good choice. However, I tried to use it based on the answer in this post: https://stackoverflow.com/a/23100665/21039115, and the results were always wrong.
In detail, my code is
// [[Rcpp::export]]
arma::ivec rmvn_gsl(int K, arma::vec prob) {
    gsl_rng *s = gsl_rng_alloc(gsl_rng_mt19937); // Create RNG instance (no seed is set here)
    arma::ivec temp(K);
    gsl_ran_multinomial(s, K, 1, prob.begin(), (unsigned int *) temp.begin());
    gsl_rng_free(s); // Free memory
    return temp;
}
And the result was something like
rmvn_gsl(3, c(0.2, 0.7, 0.1))
[,1]
[1,] 1
[2,] 0
[3,] 0
which is ridiculous.
I was wondering if there is any problem in the code... I couldn't find any other examples to compare against. I appreciate any help!
UPDATE:
I found that the primary problem is that I didn't set a random seed, and it seems that GSL has its own default seed (FYI: https://stackoverflow.com/a/32939816/21039115). Once I seeded the generator from the current time, the code worked. But I will go with rmultinom, since according to microbenchmark it can be even faster than gsl_ran_multinomial.
Anyway, @Dirk Eddelbuettel provided a great example of implementing gsl_ran_multinomial below. Just pay attention to the random seed issue if you run into the same problem I did.
Here is a complete example taking a double vector of probabilities and returning an unsigned integer vector (at the compiled level) that is mapped to an integer vector by the time we are back in R:
Code
#include <RcppGSL.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
// [[Rcpp::depends(RcppGSL)]]
// [[Rcpp::export]]
std::vector<unsigned int> foo(int n, const std::vector <double> p) {
    int k = p.size();
    std::vector<unsigned int> nv(k);
    gsl_rng_env_setup();
    gsl_rng *s = gsl_rng_alloc(gsl_rng_mt19937); // Create RNG instance
    gsl_ran_multinomial(s, k, n, &(p[0]), &(nv[0]));
    gsl_rng_free(s);
    return nv;
}
/*** R
foo(400, c(0.1, 0.2, 0.3, 0.4))
*/
Output
> Rcpp::sourceCpp("~/git/stackoverflow/75165241/answer.cpp")
> foo(400, c(0.1, 0.2, 0.3, 0.4))
[1] 37 80 138 145
>
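The seeding issue mentioned in the update is not specific to GSL, so here is a minimal sketch of it using the C++ standard library's <random> instead of GSL (a swapped-in technique, not what either post uses). The function name multinomial_draw is hypothetical. The point it illustrates: a freshly constructed generator that is never seeded always starts from the same state, so a single draw from it is always the same.

```cpp
#include <random>
#include <vector>

// Draw n items into p.size() categories with probabilities p.
// The generator is passed in, so the caller decides how it is seeded.
// multinomial_draw is a hypothetical name used for illustration only.
std::vector<unsigned int> multinomial_draw(std::mt19937 &rng, unsigned int n,
                                           const std::vector<double> &p) {
    std::discrete_distribution<int> pick(p.begin(), p.end());
    std::vector<unsigned int> counts(p.size(), 0);
    for (unsigned int i = 0; i < n; ++i) {
        ++counts[pick(rng)];  // one categorical draw per item
    }
    return counts;
}
```

Two generators built with the same seed (or both default-constructed) give identical draws; seeding once from something like std::random_device and reusing that generator across calls avoids the repeated result. In GSL the corresponding step would be a call to gsl_rng_set after gsl_rng_alloc.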
Why is the speed of nodejs array shift/push operations not linear in the size of the array? There is a dramatic knee at 87370 that completely crushes the system.
Try this, first with 87369 elements in q, then with 87370. (Or, on a 64-bit system, try 85983 and 85984.) For me, the former runs in .05 seconds; the latter, in 80 seconds -- 1600 times slower. (observed on 32-bit debian linux with node v0.10.29)
q = [];
// preload the queue with some data
for (i=0; i<87369; i++) q.push({});
// fetch oldest waiting item and push new item
for (i=0; i<100000; i++) {
    q.shift();
    q.push({});
    if (i%10000 === 0) process.stdout.write(".");
}
64-bit debian linux v0.10.29 crawls starting at 85984 and runs in .06 / 56 seconds. Node v0.11.13 has similar breakpoints, but at different array sizes.
Shift is a very slow operation for arrays because all the elements have to be moved, but V8 can use a trick to perform it quickly when the array contents fit in a page (1 MB).
Empty arrays start with 4 slots, and as you keep pushing, V8 resizes the array using the formula (old length + 1) + floor(old length / 2) + 16, i.e. roughly 1.5 times the old length.
var j = 4;
while (j < 87369) {
    j = (j + 1) + Math.floor(j / 2) + 16;
    console.log(j);
}
Prints:
23
51
93
156
251
393
606
926
1406
2126
3206
4826
7256
10901
16368
24569
36870
55322
83000
124517
So your array size ends up actually being 124517 items which makes it too large.
You can actually preallocate your array just to the right size and it should be able to fast shift again:
var q = new Array(87369); // Fits in a page so fast shift is possible
// preload the queue with some data
for (i=0; i<87369; i++) q[i] = {};
If you need a larger queue than that, use a data structure suited to it.
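As a rough illustration of what a suitable data structure could look like, here is a minimal sketch of a queue that advances a head index instead of shifting elements. It is written in C++ only to keep it compact and checkable; the same idea carries over to a JavaScript object wrapping an array. The class name HeadIndexQueue and the "compact at half" policy are assumptions made for this example, not anything V8 or Node provides.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// FIFO queue that never moves elements on an ordinary dequeue: it advances
// a head index and only reclaims the dead prefix once that prefix makes up
// at least half of the buffer, which keeps dequeue amortized O(1).
template <typename T>
class HeadIndexQueue {
public:
    void enqueue(T value) { items_.push_back(std::move(value)); }

    T dequeue() {
        if (empty()) throw std::out_of_range("dequeue on empty queue");
        T value = std::move(items_[head_]);
        ++head_;
        if (head_ * 2 >= items_.size()) {
            // Occasional compaction: drop the already-consumed prefix.
            items_.erase(items_.begin(),
                         items_.begin() + static_cast<std::ptrdiff_t>(head_));
            head_ = 0;
        }
        return value;
    }

    bool empty() const { return head_ == items_.size(); }
    std::size_t size() const { return items_.size() - head_; }

private:
    std::vector<T> items_;
    std::size_t head_ = 0;
};
```

With this layout, the shift/push loop from the question does a constant amount of work per iteration on average, regardless of how many items are waiting.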
I started digging into the v8 sources, but I still don't understand it.
I instrumented deps/v8/src/builtins.cc:MoveElements (called from Builtin_ArrayShift, which implements the shift with a memmove), and it clearly shows the slowdown: only 1000 shifts per second because each one takes 1 ms:
AR: at 1417982255.050970: MoveElements sec = 0.000809
AR: at 1417982255.052314: MoveElements sec = 0.001341
AR: at 1417982255.053542: MoveElements sec = 0.001224
AR: at 1417982255.054360: MoveElements sec = 0.000815
AR: at 1417982255.055684: MoveElements sec = 0.001321
AR: at 1417982255.056501: MoveElements sec = 0.000814
of which the memmove is 0.000040 seconds, the bulk is the heap->RecordWrites (deps/v8/src/heap-inl.h):
void Heap::RecordWrites(Address address, int start, int len) {
  if (!InNewSpace(address)) {
    for (int i = 0; i < len; i++) {
      store_buffer_.Mark(address + start + i * kPointerSize);
    }
  }
}
which is (store-buffer-inl.h)
void StoreBuffer::Mark(Address addr) {
  ASSERT(!heap_->cell_space()->Contains(addr));
  ASSERT(!heap_->code_space()->Contains(addr));
  Address* top = reinterpret_cast<Address*>(heap_->store_buffer_top());
  *top++ = addr;
  heap_->public_set_store_buffer_top(top);
  if ((reinterpret_cast<uintptr_t>(top) & kStoreBufferOverflowBit) != 0) {
    ASSERT(top == limit_);
    Compact();
  } else {
    ASSERT(top < limit_);
  }
}
When the code is running slowly, there are runs of shift/push ops followed by runs of 5-6 calls to Compact() for every MoveElements. When it's running fast, MoveElements isn't called until a handful of times at the end, and there is just a single compaction when it finishes.
I'm guessing memory compaction might be thrashing, but it's not falling in place for me yet.
Edit: forget that last edit about output buffering artifacts, I was filtering duplicates.
This bug was reported to Google, who closed it without studying the issue.
https://code.google.com/p/v8/issues/detail?id=3059
When shifting out and calling tasks (functions) from a queue (array)
the GC(?) is stalling for an inordinate length of time.
114467 shifts is OK
114468 shifts is problematic, symptoms occur
the response:
The GC has nothing to do with this, and nothing is stalling either.
Array.shift() is an expensive operation, as it requires all array
elements to be moved. For most areas of the heap, V8 has implemented a
special trick to hide this cost: it simply bumps the pointer to the
beginning of the object by one, effectively cutting off the first
element. However, when an array is so large that it must be placed in
"large object space", this trick cannot be applied as object starts
must be aligned, so on every .shift() operation all elements must
actually be moved in memory.
I'm not sure there's a whole lot we can do about this. If you want a
"Queue" object in JavaScript with guaranteed O(1) complexity for
.enqueue() and .dequeue() operations, you may want to implement your
own.
Edit: I just caught the subtle "all elements must be moved" part -- is RecordWrites not GC but an actual element copy then? The memmove of the array contents is 0.04 milliseconds. The RecordWrites loop is 96% of the 1.1 ms runtime.
Edit: if "aligned" means the first object must be at first address, that's what memmove does. What is RecordWrites?
Is there an R function for changing a text to lowercase, except for the first letter of each word, i.e. to change
"You live NEAR Chicago"
to
"You live Near Chicago"
The point is to benefit from a quite efficient implementation, if possible.
Could this be integrated to the tm R package (or is already available there), so that it could be applied to a corpus directly?
(The goal is to build a simple location detector in text, cross-referencing with geonames.)
If you're handling the bit where the word(s) (like "near") are next to the geographic location(s), then there are existing code snippets for something like a ucfirst bit of functionality. However, you mentioned speed, so here's a comparison between an Rcpp implementation and a basic/straight R implementation (both are vectorized):
library(Rcpp)
library(microbenchmark)
# pure Rcpp/C++ implementation
sourceCpp(code = "
#include <Rcpp.h>
#include <algorithm>
#include <cctype>
using namespace Rcpp;
// [[Rcpp::export]]
std::vector< std::string > ucfirst( std::vector< std::string > strings ) {
  int len = strings.size();
  for( int i=0; i < len; i++ ) {
    // lowercase the whole string, then uppercase its first character
    std::transform(strings[i].begin(), strings[i].end(), strings[i].begin(), ::tolower);
    strings[i][0] = toupper( strings[i][0] );
  }
  return strings;
}")
r_ucfirst <- function (str) {
  paste(toupper(substring(str, 1, 1)), tolower(substring(str, 2)), sep = "")
}
print(ucfirst("hello"))
## [1] "Hello"
print(r_ucfirst("hello"))
## [1] "Hello"
mb <- microbenchmark(ucfirst("hello"), r_ucfirst("hello"), times=1000)
print(mb)
## Unit: microseconds
## expr min lq median uq max neval
## ucfirst("hello") 1.925 2.123 2.2765 2.4025 20.844 1000
## r_ucfirst("hello") 6.199 7.059 7.5285 7.9555 41.473 1000
Both should work across platforms. You can get even faster in C++ with some C hacks, but a median of 2.27 μs per call isn't exactly bad (neither is 7.5 μs for the pure-R version :-)
Having said that, you could try implementing the "pure R" version with the stringi package, which uses C/C++-backed functions.
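Note that ucfirst above capitalizes only the first character of each string, while the question asks about the first letter of each word. A minimal sketch of a per-word variant follows. The name title_case is hypothetical, and the sketch assumes ASCII text with whitespace-separated words. It capitalizes every word, so "live" becomes "Live", which goes further than the literal example in the question.

```cpp
#include <cctype>
#include <string>

// Lowercase every character, except that the first character of each
// whitespace-separated word is uppercased. ASCII only; this sketch is not
// locale-aware and does not understand multi-byte UTF-8 characters.
std::string title_case(std::string text) {
    bool at_word_start = true;
    for (char &ch : text) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (std::isspace(c)) {
            at_word_start = true;  // next non-space character starts a word
        } else {
            ch = static_cast<char>(at_word_start ? std::toupper(c)
                                                 : std::tolower(c));
            at_word_start = false;
        }
    }
    return text;
}
```

Applied to "You live NEAR Chicago" this returns "You Live Near Chicago". It could presumably be exported to R the same way as ucfirst, by looping over a vector of strings.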
If I wanted to reduce a WAV file's amplitude by 25%, I would write something like this:
for (int i = 0; i < data.Length; i++)
{
    data[i] *= 0.75;
}
A lot of the articles I read on audio techniques, however, discuss amplitude in terms of decibels. I understand the logarithmic nature of decibel units in principle, but not so much in terms of actual code.
My question is: if I wanted to attenuate the volume of a WAV file by, say, 20 decibels, how would I do this in code like my above example?
Update: formula (based on Nils Pipenbrinck's answer) for attenuating by a given number of decibels (entered as a positive number e.g. 10, 20 etc.):
public void AttenuateAudio(float[] data, int decibels)
{
    float gain = (float)Math.Pow(10, (double)-decibels / 20.0);
    for (int i = 0; i < data.Length; i++)
    {
        data[i] *= gain;
    }
}
So, if I want to attenuate by 20 decibels, the gain factor is .1.
I think you want to convert from decibel to gain.
The equations for audio are:
decibel to gain:
gain = 10 ^ (attenuation in db / 20)
or in C:
gain = powf(10, attenuation / 20.0f);
The equations to convert from gain to db are:
attenuation_in_db = 20 * log10 (gain)
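The two conversions above can be sketched as a pair of small functions. This is only an illustration in C++; the names db_to_gain, gain_to_db and apply_db are made up for the example, and the samples are assumed to already be floating-point values in memory. A negative dB value means attenuation.

```cpp
#include <cmath>
#include <vector>

// Amplitude gain factor for a level change in decibels (negative = quieter).
double db_to_gain(double db) { return std::pow(10.0, db / 20.0); }

// Inverse: level change in decibels for a positive amplitude gain factor.
double gain_to_db(double gain) { return 20.0 * std::log10(gain); }

// Scale floating-point samples in place by the given level change.
void apply_db(std::vector<float> &samples, double db) {
    const float gain = static_cast<float>(db_to_gain(db));
    for (float &s : samples) {
        s *= gain;
    }
}
```

For example, db_to_gain(-20.0) is 0.1, matching the update in the question, and the 0.75 multiplier from the original loop corresponds to roughly -2.5 dB.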
If you just want to adjust some audio, I've had good results with the normalize package from nongnu.org. If you want to study how it's done, the source code is freely available. I've also used wavnorm, whose home page seems to be down at the moment.
One thing to consider: .WAV files have MANY different formats. The code above only works for WAVE_FORMAT_IEEE_FLOAT. If you're dealing with PCM files, then your samples are going to be 8, 16, 24 or 32 bit integers. 8 bit PCM uses unsigned integers from 0..255, and 24 bit PCM can be packed or unpacked (packed == 3 byte values packed next to each other, unpacked == 3 byte values in a 4 byte package).
And then there's the issue of alternate encodings. For instance, in Win7, all the Windows sounds are actually MP3 files in a WAV container.
It's unfortunately not as simple as it sounds :(.
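To make the integer case a little more concrete, here is a minimal sketch for signed 16-bit PCM only. It assumes the samples have already been read out of the file into a buffer. Header parsing, 8-bit unsigned data, 24-bit packing and compressed encodings are all out of scope. The function name scale_pcm16 is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scale signed 16-bit PCM samples in place. Results are rounded to the
// nearest integer and clamped to the int16 range, so a gain above 1.0
// saturates instead of wrapping around.
void scale_pcm16(std::vector<std::int16_t> &samples, double gain) {
    for (std::int16_t &s : samples) {
        const double scaled = std::round(s * gain);
        const double clamped = std::min(32767.0, std::max(-32768.0, scaled));
        s = static_cast<std::int16_t>(clamped);
    }
}
```

The gain argument is the same linear factor discussed in the other answers, for instance 0.1 for a 20 dB attenuation.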
Oops, I misunderstood the question… You can see my Python implementations of converting from dB to a float (which you can use as a multiplier on the amplitude like you show above) and vice versa:
https://github.com/jiaaro/pydub/blob/master/pydub/utils.py
In a nutshell, for amplitude it's:
10 ^ (db_gain / 20)
(dividing by 10 instead of 20 gives the ratio of powers rather than amplitudes), so to reduce the volume by 6 dB you would multiply the amplitude of each sample by:
10 ^ (-6 / 20) == 10 ^ (-0.3) == 0.5012
The Perl module Proc::ProcessTable occasionally reports the pctcpu attribute as 'inf', 'nan', or a value greater than 100. Why does it do this? And are there any guidelines on how to deal with this kind of information?
We have observed this on various platforms including Linux 2.4 running on 8 logical processors.
I would guess that 'inf' or 'nan' is the result of some impossibly large value or a divide by zero.
For values greater than 100, could this possibly mean that more than one processor was used?
And for dealing with this information, is the best practice merely marking the data point as untrustworthy and normalizing to 100%?
I do not know why that happens, and I cannot stress test the module right now to try to generate such cases.
However, a principle I have followed throughout my research is not to replace data I know to be nonsense with something that looks reasonable. You basically have missing observations and you should treat them as such. I would not attach a numerical value at all so as not to pretend I have information when I in fact do not.
Then, your statistics for the non-missing points will be meaningful and you can look at any patterns in the missing observations separately.
UPDATE: Looking at the calc_prec() function in the source code:
/* calc_prec()
*
* calculate the two cpu/memory precentage values
*/
static void calc_prec(char *format_str, struct procstat *prs, struct obstack *mem_pool)
{
    float pctcpu = 100.0f * (prs->utime / 1e6) / (time(NULL) - prs->start_time);
    /* calculate pctcpu - NOTE: This assumes the cpu time is in microsecond units! */
    sprintf(prs->pctcpu, "%3.2f", pctcpu);
    field_enable(format_str, F_PCTCPU);
    /* calculate pctmem */
    if (system_memory > 0) {
        sprintf(prs->pctmem, "%3.2f", (float) prs->rss / system_memory * 100.f);
        field_enable(format_str, F_PCTMEM);
    }
}
First, IMHO, it would be better to just divide by 1e4 rather than multiplying by 100.0f after the division. Second, it is possible (if polled immediately after process start) for the time delta to be 0. Third, I would have just done the whole thing in double.
As an aside, this function looks like a good example of why you should not have comments in code.
#include <stdio.h>
#include <time.h>
volatile float calc_percent(
    unsigned long utime,
    time_t now,
    time_t start
) {
    return 100.0f * ( utime / 1e6) / (now - start);
}
int main(void) {
    printf("%3.2f\n", calc_percent(1e6, time(NULL), time(NULL)));
    printf("%3.2f\n", calc_percent(0, time(NULL), time(NULL)));
    return 0;
}
This outputs inf in the first case and nan in the second case when compiled with Cygwin gcc-4 on Windows. I do not know if this behavior is standard or just what happens with this particular combination of OS+compiler.
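One way to avoid producing inf or nan in the first place is to refuse to compute a percentage when no time has elapsed, and to let the caller treat that as a missing observation. The following is a minimal sketch of that idea in C++, done entirely in double. The function name cpu_percent and its signature are assumptions for illustration and are not part of Proc::ProcessTable.

```cpp
#include <ctime>

// Compute CPU percentage from CPU time in microseconds and the wall-clock
// start/now times. Returns false, leaving `percent` untouched, when the
// elapsed time is zero or negative, instead of dividing by zero.
bool cpu_percent(unsigned long utime_us, std::time_t now, std::time_t start,
                 double &percent) {
    const double elapsed = std::difftime(now, start);  // seconds
    if (elapsed <= 0.0) {
        return false;  // treat as a missing observation
    }
    percent = 100.0 * (static_cast<double>(utime_us) / 1e6) / elapsed;
    return true;
}
```

This guards only the division by zero. A result above 100 is still arithmetically possible whenever the recorded CPU time exceeds the elapsed wall-clock time, so such values would need a separate decision.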