Convert a text to lowercase but keep uppercase for first letter in word (with R, if possible in tm package) - string

Is there an R function for changing a text to lowercase, except for the first letter of each word, i.e. to change
"You live NEAR Chicago"
to
"You live Near Chicago"
The point is to benefit from a quite efficient implementation, if possible.
Could this be integrated to the tm R package (or is already available there), so that it could be applied to a corpus directly?
(The goal is to build a simple location detector in text, cross-referencing with geonames.)

If you're handling the bit where the word(s) (like "near") are next to the geographic location(s), then there are existing code snippets for something like a ucfirst bit of functionality. However, you mentioned speed, so here's a comparison between an Rcpp implementation and a basic/straight R implementation (both are vectorized):
library(Rcpp)
library(microbenchmark)
# pure Rcpp/C++ implementation
sourceCpp(code = "
#include <Rcpp.h>
#include <algorithm>
#include <cctype>
using namespace Rcpp;
// [[Rcpp::export]]
std::vector< std::string > ucfirst( std::vector< std::string > strings ) {
int len = strings.size();
for( int i=0; i < len; i++ ) {
// lowercase the whole string, then uppercase its first character
std::transform(strings[i].begin(), strings[i].end(), strings[i].begin(), ::tolower);
if( !strings[i].empty() ) strings[i][0] = toupper( strings[i][0] );
}
return strings;
}")
r_ucfirst <- function (str) {
paste(toupper(substring(str, 1, 1)), tolower(substring(str, 2)), sep = "")
}
print(ucfirst("hello"))
## [1] "Hello"
print(r_ucfirst("hello"))
## [1] "Hello"
mb <- microbenchmark(ucfirst("hello"), r_ucfirst("hello"), times=1000)
print(mb)
## Unit: microseconds
## expr min lq median uq max neval
## ucfirst("hello") 1.925 2.123 2.2765 2.4025 20.844 1000
## r_ucfirst("hello") 6.199 7.059 7.5285 7.9555 41.473 1000
Both should work across platforms. You can get even faster in C++ with some C-hacks, but a median of roughly 2.3μs per call (measured over 1,000 runs) isn't exactly bad (neither is roughly 7.5μs per call for the pure-R version :-)
Having said that, you could try implementing the "pure R" version with the stringi package, which uses Rcpp/C++/C-backed functions.
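Note that the example in the question leaves "live" in lowercase, so the transformation asked for is really "keep each word's first letter as it is and lowercase the rest", applied word by word. A minimal sketch of that in plain C++ (no Rcpp wrapper), assuming ASCII text and whitespace as the only word boundary; the function name keep_initial_lower_rest is made up for illustration:

```cpp
#include <cctype>
#include <string>

// Keep the first character of every word untouched and lowercase the rest.
// Assumptions: ASCII input, words are separated by whitespace only.
std::string keep_initial_lower_rest(std::string text) {
    bool at_word_start = true;
    for (char& c : text) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isspace(uc)) {
            at_word_start = true;           // next non-space starts a word
        } else if (at_word_start) {
            at_word_start = false;          // first letter: leave as is
        } else {
            c = static_cast<char>(std::tolower(uc));
        }
    }
    return text;
}
```

Called on "You live NEAR Chicago" this returns "You live Near Chicago". The same loop could be dropped into the Rcpp function above in place of the transform call if per-word handling is wanted.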

Related

Problem when using gsl_ran_multinomial in Rcpp

I'm trying to generate multinomial random variables as fast as possible. And I learned that gsl_ran_multinomial could be a good choice. However, I tried to use it based on the answer in this post: https://stackoverflow.com/a/23100665/21039115, and the results were always wrong.
In detail, my code is
// [[Rcpp::export]]
arma::ivec rmvn_gsl(int K, arma::vec prob) {
gsl_rng *s = gsl_rng_alloc(gsl_rng_mt19937); // Create RNG seed
arma::ivec temp(K);
gsl_ran_multinomial(s, K, 1, prob.begin(), (unsigned int *) temp.begin());
gsl_rng_free(s); // Free memory
return temp;
}
And the result was something like
rmvn_gsl(3, c(0.2, 0.7, 0.1))
[,1]
[1,] 1
[2,] 0
[3,] 0
which is ridiculous.
I was wondering if there is a problem in the code. I couldn't find any other examples to compare against. I appreciate any help!
UPDATE:
I found that the primary problem here is that I didn't set a random seed, and it seems that gsl has its own default seed (FYI: https://stackoverflow.com/a/32939816/21039115). Once I seeded the generator from the current time, the code worked. But I will go with rmultinom, since it can even be faster than gsl_ran_multinomial according to microbenchmark.
Anyway, @Dirk Eddelbuettel provided a great example of implementing gsl_ran_multinomial below. Just pay attention to the random seed issue if you run into the same problem as me.
Here is a complete example taking a double vector of probabilities and returning an unsigned integer vector (at the compiled level) that is mapped to an integer vector by the time we are back in R:
Code
#include <RcppGSL.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
// [[Rcpp::depends(RcppGSL)]]
// [[Rcpp::export]]
std::vector<unsigned int> foo(int n, const std::vector <double> p) {
int k = p.size();
std::vector<unsigned int> nv(k);
gsl_rng_env_setup();
gsl_rng *s = gsl_rng_alloc(gsl_rng_mt19937); // Create RNG instance
gsl_ran_multinomial(s, k, n, &(p[0]), &(nv[0]));
gsl_rng_free(s);
return nv;
}
/*** R
foo(400, c(0.1, 0.2, 0.3, 0.4))
*/
Output
> Rcpp::sourceCpp("~/git/stackoverflow/75165241/answer.cpp")
> foo(400, c(0.1, 0.2, 0.3, 0.4))
[1] 37 80 138 145
>
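The seeding issue from the update can be reproduced without GSL. The sketch below uses the C++ standard library's <random> instead of gsl_ran_multinomial (a swapped-in technique, not the GSL API), and the function name multinomial_counts is hypothetical. It shows that a generator created with a fixed seed returns the same counts on every call, which is what makes a freshly allocated, never-seeded generator look "stuck":

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Draw n multinomial trials over the weights in p using a generator that is
// created and seeded inside the call (as in the question's code).
std::vector<unsigned int> multinomial_counts(unsigned int n,
                                             const std::vector<double>& p,
                                             std::uint32_t seed) {
    std::mt19937 rng(seed);                           // explicit seed
    std::discrete_distribution<int> pick(p.begin(), p.end());
    std::vector<unsigned int> counts(p.size(), 0);
    for (unsigned int t = 0; t < n; ++t) {
        ++counts[pick(rng)];                          // one categorical draw per trial
    }
    return counts;
}
```

Two calls with the same seed give identical vectors; varying the seed between calls (or, better, creating the generator once and reusing it) gives different draws. With n = 1 the result is always a one-hot vector, so the output in the question is a valid draw; the problem was only that it never changed.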

Rcpp subsetting contiguous StringVector

Good afternoon,
I have been trying to use a similar method to subsetting x[200:300] in R while using Rcpp. (Note, this is not the problem I am trying to solve, but I need to subset many ranges within the functions I am trying to write in C++, and I found that this was the bottleneck of my performance)
However, although I have tried the methods in Rcpp, using iterators and other things, I just can't seem to find a solution that is even minimally "fast." Most of the solutions I find are very slow.
And looking at the Rcpp reference, I can't seem to find anything, nor can I find it on StackExchange.
I know this code is pretty ugly right now... But I am just clueless
// [[Rcpp::export]]
StringVector range_test_( StringVector& x, int i, int j){
StringVector vect(x.begin()+i, x.begin()+j);
return vect;
}
And then, it is about 800 times slower. I have been trying to find an equivalent of R's x[i:j], which is very fast, within Rcpp itself... but I can't find it.
tests_range <- rbenchmark::benchmark(
x[200:3000],
range_test_(x, 200, 3000),
order = NULL,
replications = 80
)[,1:4]
Gives as result
test replications elapsed relative
1 x[200:3000] 80 0.001 1
3 range_test_(x, 200, 3000) 80 0.822 822
If anybody knows how to access the subsetting function x[i:j] or something as fast within Rcpp I would really appreciate it. I just can't seem to find the tool I am missing.
The issue is that the iterator constructor makes a copy. See this page, which says:
"Copy the data between iterators first and last to the created vector"
However, you can try this instead
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector in_range(Rcpp::StringVector &x, int i, int j) {
return x[Rcpp::Range(i - 1, j - 1)]; // zero indexed
}
The time taken is a lot closer
> set.seed(20597458)
> x <- replicate(1e3, paste0(sample(LETTERS, 5), collapse = ""))
> head(x)
[1] "NHVFQ" "XMEOF" "DABUT" "XKTAZ" "NQXZL" "NPJLM"
>
> stopifnot(all.equal(in_range(x, 100, 200), x[100:200]))
>
> library(microbenchmark)
> microbenchmark(in_range(x, 100, 200), x[100:200], times = 1e4)
Unit: nanoseconds
expr min lq mean median uq max neval
in_range(x, 100, 200) 1185 1580 3669.780 1581 1976 3263205 10000
x[100:200] 790 790 1658.571 1185 1186 2331256 10000
Note that there is a page here on subsetting. I could not find a relevant example there though.

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudocode, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added an extra kernel argument for the output. Added a dot product to sum over the components of the float4 vector.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}
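To make the "partial sums per work-group, final sum on the host" structure concrete, here is a CPU-only model in C++. It is a sketch, not OpenCL: the inner loop over lid stands in for the work-items of one group, the local vector stands in for __local memory, and the function names are invented for illustration. Like the kernel above, it assumes the group size is a power of two (and greater than zero):

```cpp
#include <cstddef>
#include <vector>

// One partial sum per work-group, computed by halving the active range.
// Assumption: group_size is a power of two and > 0; a short last group is zero-padded.
std::vector<float> group_partial_sums(const std::vector<float>& data,
                                      std::size_t group_size) {
    std::vector<float> partial;
    for (std::size_t base = 0; base < data.size(); base += group_size) {
        std::vector<float> local(group_size, 0.0f);   // models __local memory
        for (std::size_t lid = 0; lid < group_size && base + lid < data.size(); ++lid) {
            local[lid] = data[base + lid];
        }
        for (std::size_t i = group_size / 2; i > 0; i >>= 1) {
            for (std::size_t lid = 0; lid < i; ++lid) {   // work-items with lid < i
                local[lid] += local[lid + i];
            }
            // on the device, a barrier would sit here
        }
        partial.push_back(local[0]);
    }
    return partial;
}

// The step the host has to do: add up the per-group results.
float host_final_sum(const std::vector<float>& partial) {
    float total = 0.0f;
    for (float v : partial) total += v;
    return total;
}
```

For eight values 1..8 and a group size of 4, the partial sums are 10 and 26, and the host adds them to 36.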
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing as many items as there are work-groups (so that very large reductions can be broken down). These must then be summed by the CPU, or else passed to another kernel call so that this last step also runs on the GPU.
This part is a little over my head, but I believe this code also avoids bank conflicts by reading from local memory essentially sequentially. Would love confirmation on that from anyone who knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N
You can use the new work_group_reduce_add() function for a sum reduction inside a single work-group, if your platform supports OpenCL C 2.0 features.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops about log2(totalDataSize) times, so it finishes quickly even with huge arrays. The juggling of "m" and "n" is there to handle array sizes that are not a power of two.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
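The m/n bookkeeping of the host loop is easy to check on the CPU. The sketch below is plain C++ rather than JOCL, with the inner loop standing in for one kernel launch over m work-items; fold_sum is a made-up name:

```cpp
#include <cstddef>
#include <vector>

// In-place fold: each pass adds the top half onto the bottom half.
float fold_sum(std::vector<float> values) {
    std::size_t t = values.size();
    if (t == 0) return 0.0f;
    while (t > 1) {
        const std::size_t m = t / 2;        // work-items in this pass
        const std::size_t n = (t + 1) / 2;  // offset, and the size left afterwards
        for (std::size_t gid = 0; gid < m; ++gid) {
            values[gid] += values[gid + n]; // body of foldKernel
        }
        t = n;                              // for odd t, the middle element is carried over
    }
    return values[0];
}
```

With five elements the passes use (m, n) = (2, 3), (1, 2), (1, 1): element 2 is left untouched in the first pass and folded in during the second, which is how odd sizes are handled without padding.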

Longest Repeated Substring Better Complexity

I have implemented a solution by comparing the suffixes of a string after sorting the suffix list. Is there any linear time algorithm that performs better than this piece of code?
#include <iostream>
#include <string>
#include <algorithm>
using namespace std;
void preCompute(string input[],string s)
{
int n = s.length();
for(int i=0; i<n; i++)
input[i] = s.substr(i,n);
}
string LongestCommonSubString(string first,string second)
{
int n = min(first.length(),second.length());
for(int i=0; i<n; i++)
if(first[i]!=second[i])
return first.substr(0,i);
return first.substr(0,n);
}
string lrs(string s)
{
int n = s.length();
string input[n];
preCompute(input,s);
sort(input, input+n);
string lrs = "";
for(int i=0; i<n-1; i++)
{
string x = LongestCommonSubString(input[i],input[i+1]);
if(x.length()>lrs.length())
{
lrs = x;
}
}
return lrs;
}
int main()
{
string input[2] = {"banana","missisipi"};
for(int i=0;i<2;i++)
cout<<lrs(input[i])<<endl;
return 0;
}
I found a really good resource for this question. See here
You can build a suffix tree in linear time (see this). The longest repeated substring corresponds to the deepest internal node (when I say deepest I mean the path from the root has the maximum number of characters, not the maximum number of edges). The reason for that is simple. Internal nodes correspond to prefixes of suffixes (ie substrings) that occur in multiple suffixes.
In reality, this is fairly complex. So the approach you are taking is good enough. I have a few modifications that I can suggest:
Do not create substrings; a substring can be denoted by a pair of numbers. When you need the actual characters, look them up in the original string. In fact, each suffix corresponds to a single index (its start index).
The longest common prefix of every pair of consecutive suffixes can be computed while constructing your suffix array in linear time (but O(n log n) algorithms are much easier). Consult the references of this.
If you really insist on running the whole thing in linear time, then you can construct the suffix array in linear time. I am sure if you search around a bit, you can easily find pointers.
There are very elegant (but not linear) implementations described here.
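The first two suggestions can be sketched as follows: suffixes are stored as start indices only, and the longest common prefixes of neighbouring suffixes are computed in one linear pass (Kasai's algorithm). This is an illustration under stated assumptions, not a linear-time solution overall: the suffix array is built with a plain comparison sort, which is O(n^2 log n) in the worst case, and all function names are invented here.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Suffix array: start indices of all suffixes, sorted lexicographically.
std::vector<int> build_suffix_array(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // compare suffixes in place, without creating substrings
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Kasai: lcp[r] is the common prefix length of the suffixes at sa[r-1] and sa[r].
std::vector<int> build_lcp(const std::string& s, const std::vector<int>& sa) {
    const int n = static_cast<int>(s.size());
    std::vector<int> rank(n), lcp(n, 0);
    for (int r = 0; r < n; ++r) rank[sa[r]] = r;
    int h = 0;
    for (int i = 0; i < n; ++i) {
        if (rank[i] == 0) { h = 0; continue; }
        const int j = sa[rank[i] - 1];      // predecessor in sorted order
        while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h;
        lcp[rank[i]] = h;
        if (h > 0) --h;                     // next suffix loses at most one character
    }
    return lcp;
}

std::string longest_repeated_substring(const std::string& s) {
    const std::vector<int> sa = build_suffix_array(s);
    const std::vector<int> lcp = build_lcp(s, sa);
    int best_len = 0, best_start = 0;
    for (std::size_t r = 1; r < sa.size(); ++r) {
        if (lcp[r] > best_len) { best_len = lcp[r]; best_start = sa[r]; }
    }
    return s.substr(best_start, best_len);  // the only substring ever created
}
```

For "banana" this returns "ana". Swapping build_suffix_array for a linear-time construction would make the whole pipeline linear, since the LCP pass and the final scan already are.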

Thread-safe random number generation for Monte-Carlo integration

I'm trying to write something which generates random numbers very quickly and can be used from multiple threads. My current code is:
/* Approximating PI using a Monte-Carlo method. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#define N 1000000000 /* As large as possible for increased accuracy */
double random_function(void);
int main(void)
{
int i = 0;
double X, Y;
double count_inside_temp = 0.0, count_inside = 0.0;
unsigned int th_id = omp_get_thread_num();
#pragma omp parallel private(i, X, Y) firstprivate(count_inside_temp)
{
srand(th_id);
#pragma omp for schedule(static)
for (i = 0; i <= N; i++) {
X = 2.0 * random_function() - 1.0;
Y = 2.0 * random_function() - 1.0;
if ((X * X) + (Y * Y) < 1.0) {
count_inside_temp += 1.0;
}
}
#pragma omp atomic
count_inside += count_inside_temp;
}
printf("Approximation to PI is = %.10lf\n", (count_inside * 4.0)/ N);
return 0;
}
double random_function(void)
{
return ((double) rand() / (double) RAND_MAX);
}
This works, but from watching a resource monitor I know it's not using all the threads. Does rand() work in multithreaded code? And if not, is there a good alternative? Many thanks. Jack
Is rand() thread safe? Maybe, maybe not:
"The rand() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe."
One test and good learning exercise would be to replace the call to rand() with, say, a fixed integer and see what happens.
The way I think of pseudo-random number generators is as a black box which takes an integer as input and returns an integer as output. For any given input the output is always the same, but there is no pattern in the sequence of numbers and the sequence is uniformly distributed over the range of possible outputs. (This model isn't entirely accurate, but it'll do.) The way you use this black box is to choose a starting number (the seed), then use the output value both in your application and as the input for the next call to the random number generator. There are two common approaches to designing an API:
Two functions, one to set the initial seed (e.g. srand(seed)) and one to retrieve the next value from the sequence (e.g. rand()). The state of the PRNG is stored internally in some sort of global variable. Generating a new random number either will not be thread safe (hard to tell, but the output stream won't be reproducible) or will be slow in multithreaded code (you end up with some serialization around the state value).
An interface where the PRNG state is exposed to the application programmer. Here you typically have three functions: init_prng(seed), which returns some opaque representation of the PRNG state, get_prng(state), which returns a random number and changes the state variable, and destroy_prng(state), which just cleans up allocated memory and so on. PRNGs with this type of API should all be thread safe and run in parallel with no locking (because you are in charge of managing the (now thread-local) state variable).
I generally write in Fortran and use Ladd's implementation of the Mersenne Twister PRNG (that link is worth reading). There are lots of suitable PRNG's in C which expose the state to your control. PRNG looks good and using this (with initialization and destroy calls inside the parallel region and private state variables) should give you a decent speedup.
Finally, it's often the case that PRNGs can be made to perform better if you ask for a whole sequence of random numbers in one go (e.g. the compiler can vectorize the PRNG internals). Because of this, libraries often have something like get_prng_array(state) functions, which give you back an array full of random numbers as if you had put get_prng in a loop filling the array elements; they just do it more quickly. This would be a second optimization (and would need an added for loop inside the parallel for loop). Obviously, you don't want to run out of per-thread stack space doing this!
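The second API style (state owned by the caller) can be sketched with the C++ standard library, where each std::mt19937_64 object is its own PRNG state. This is a stand-in for the init_prng/get_prng pattern described above, not the Fortran or C libraries mentioned; estimate_pi and its parameters are hypothetical, and seeding the streams with consecutive integers is a simplification that does not guarantee statistically independent streams.

```cpp
#include <cstdint>
#include <random>

// Monte Carlo estimate of pi where every stream owns its PRNG state.
// No state is shared between iterations of the outer loop, so with OpenMP
// that loop could carry: #pragma omp parallel for reduction(+:inside)
double estimate_pi(std::uint64_t samples_per_stream, int streams,
                   std::uint64_t base_seed) {
    std::uint64_t inside = 0;
    for (int s = 0; s < streams; ++s) {
        std::mt19937_64 rng(base_seed + s);                 // private state ("init_prng")
        std::uniform_real_distribution<double> coord(-1.0, 1.0);
        std::uint64_t local = 0;
        for (std::uint64_t k = 0; k < samples_per_stream; ++k) {
            const double x = coord(rng);                    // "get_prng": advances rng only
            const double y = coord(rng);
            if (x * x + y * y < 1.0) ++local;
        }
        inside += local;                                    // one combine per stream
    }   // rng is destroyed here ("destroy_prng")
    return 4.0 * static_cast<double>(inside) /
           static_cast<double>(samples_per_stream * streams);
}
```

Because the hit counts are integers and each stream's seed is fixed, the result is the same however the streams are distributed over threads, which also makes runs reproducible.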
