I am trying to create a new variable in a data table using an if statement: if a string variable contains a substring, then the new variable equals a numerical value.
My data:
N X
1 aa1aa
2 bb2bb
3 cc-1bb
...
The data frame contains several thousand rows.
The result needed is a new column containing the numerical value found within the string (column X):
N X Y
1 aa1aa 1
2 bb2bb 2
3 cc-1bb -1
I was trying with
for (i in 1:length(mydata)){
if (grep('1', mydata$X) == TRUE) {
mydata$Y <- 1 }
but I'm not sure if I'm even on the right track... Any help please?
This should work on more of your extended samples. Basically it takes out everything that's not a letter from the middle of the string.
X <- c("aa1aa", "bb2bb", "cc-1bb","aa+0.5b","fg-0.25h")
gsub("^[a-z]+([^a-z]*)[a-z]+$","\\1",X,perl=T)
#[1] "1" "2" "-1" "+0.5" "-0.25"
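For comparison, here is a minimal Python sketch of the same idea (capture whatever sits between the leading and trailing letters, then convert it to a number). The function name extract_middle_number is made up for illustration, and it assumes lowercase letters on both sides, like the R pattern above.

```python
import re

def extract_middle_number(s):
    """Return the non-letter run between leading and trailing letters as a float.

    Returns None when there is no such run or it is not a valid number.
    """
    m = re.fullmatch(r"[a-z]+([^a-z]*)[a-z]+", s)
    if m is None or m.group(1) == "":
        return None
    try:
        return float(m.group(1))
    except ValueError:
        return None

samples = ["aa1aa", "bb2bb", "cc-1bb", "aa+0.5b", "fg-0.25h"]
values = [extract_middle_number(s) for s in samples]
print(values)  # [1.0, 2.0, -1.0, 0.5, -0.25]
```

Unlike a pattern that keeps only digits, this keeps the sign and the decimal point, so "cc-1bb" comes out as -1 rather than 1.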
Using the example data from @Paulo you can use gsub from base R...
d$Y <- gsub( "[^0-9]" , "" , d$X )
something like this?
d <- data.frame(N = 1:3,
X = c('aa1aa', 'bb2bb', 'cc-1bb'),
stringsAsFactors = FALSE)
library(stringr)
d$Y <- as.numeric(str_extract_all(d$X,"\\(?[0-9,.]+\\)?"))
d
N X Y
1 1 aa1aa 1
2 2 bb2bb 2
3 3 cc-1bb 1
EDIT - Speed test
The gsub approach provided by @Simon is much faster than stringr
library(microbenchmark)
# 30000 lines data.frame
d1 <- data.frame(N = 1:30000,
X = rep(c('aa1aa', 'bb2bb', 'cc-1bb'), 10000),
stringsAsFactors = FALSE)
stringr
microbenchmark(as.numeric(str_extract_all(d1$X,"\\(?[0-9,.]+\\)?")),
times = 10L)
Unit: seconds
expr min lq median uq max neval
as.numeric(str_extract_all(d1$X, "\\\\(?[0-9,.]+\\\\)?")) 2.677408 2.75283 2.76473 2.781083 2.796648 10
base gsub
microbenchmark(gsub( "[^0-9]" , "" , d1$X ), times = 10L)
Unit: milliseconds
expr min lq median uq max neval
gsub("[^0-9]", "", d1$X) 44.95564 45.05358 45.07238 45.10201 45.23645 10
A = {'a','b','c','b','a',...}
A is a <1X400> cell array and I want to create a matrix from A such that if the cell is a, the matrix shows 1, if it is b, it shows as 2 in the matrix and 3 for c.
Thank you.
Specific Case
For a simple specific case as listed in the question, you can use char to convert all the cell elements to characters and then subtract 96, which is the ASCII code of 'a' minus 1 -
A_numeric = char(A)-96
Sample run -
>> A
A =
'a' 'b' 'c' 'b' 'a'
>> A_numeric = char(A)-96
A_numeric =
1
2
3
2
1
Generic Case
For a generic substitution case, you need to do a bit more work, like so -
%// Inputs
A = {'correct','boss','cat','boss','correct','cat'}
newcellval = {'correct','cat','boss'}
newnumval = [8,2,5]
[unqcell,~,idx] = unique(A,'stable')
[~,newcell_idx,unqcell_idx] = intersect(newcellval,unqcell,'stable')
A_numeric = newnumval(changem(idx,newcell_idx,unqcell_idx))
Sample input-output -
>> A,newcellval,newnumval
A =
'correct' 'boss' 'cat' 'boss' 'correct' 'cat'
newcellval =
'correct' 'cat' 'boss'
newnumval =
8 2 5
>> A_numeric
A_numeric =
8 5 2 5 8 2
That's easy:
result = cell2mat(A)-'a'+1
For a generic association of letters to numbers 1,2,3...:
letters2numbers = 'abc'; %// 'a'->1, 'b'->2 etc.
[~, result] = ismember(cell2mat(A), letters2numbers)
For a generic association of strings to numbers 1,2,3...:
strings2numbers = {'hi', 'hello', 'hey', 'good morning', 'howdy'};
A = {'hello', 'hi', 'hello', 'howdy', 'bye'};
[~, result] = ismember(A, strings2numbers)
In this example,
result =
2 1 2 5 0
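To make the lookup logic explicit, here is a small Python sketch that mimics the second output of ismember as used above: the 1-based position of each element in the lookup list, or 0 when it is missing. The helper name ismember_index is an assumption for illustration only, not a real library function.

```python
def ismember_index(values, lookup):
    """1-based position of each value in lookup, or 0 if it is absent."""
    position = {}
    for i, key in enumerate(lookup, start=1):
        position.setdefault(key, i)  # keep the first occurrence only
    return [position.get(v, 0) for v in values]

strings2numbers = ["hi", "hello", "hey", "good morning", "howdy"]
A = ["hello", "hi", "hello", "howdy", "bye"]
result = ismember_index(A, strings2numbers)
print(result)  # [2, 1, 2, 5, 0]
```

Building the dictionary once keeps each lookup constant-time, which matters little for 400 cells but helps for long arrays.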
Use a for loop that iterates over A and converts each character to a number:
for loop = 1:length(A)
    outMat(loop) = char(A(loop)) - 96;
end
I hope it works.
For example, s1='abc', s2='kokoabckokabckoab'.
Output should be 3. (number of times s1 appears in s2).
Not allowed to use for or strfind. Can only use reshape,repmat,size.
I thought of reshaping s2, so it would contain all of the possible strings of 3s:
s2 =
kok
oko
koa
oab
.... etc
But I'm having trouble from here.
Assuming you have your matrix reshaped into the format you have in your post, you can replicate s1 and stack the string such that it has as many rows as there are in the reshaped s2 matrix, then do an equality operator. Rows that consist of all 1s means that we have found a match and so you would simply search for those rows where the total sum is equal to the total length of s1. Referring back to my post on dividing up a string into overlapping substrings, we can decompose your string into what you have posted in your question like so:
%// Define s1 and s2 here
s1 = 'abc';
len = length(s1);
s2 = 'kokoabckokabckoab';
%// Hankel starts here
c = (1 : len).';
r = (len : length(s2)).';
nr = length(r);
nc = length(c);
x = [ c; r((2:nr)') ]; %-- build vector of user data
cidx = (1:nc)';
ridx = 0:(nr-1);
H = cidx(:,ones(nr,1)) + ridx(ones(nc,1),:); % Hankel subscripts
ind = x(H); % actual data
%// End Hankel script
%// Now get our data
subseqs = s2(ind.');
%// Case where string length is 1
if len == 1
subseqs = subseqs.';
end
subseqs contains the matrix of overlapping characters that you have alluded to in your post. You've noticed a small bug where if the length of the string is 1, then the algorithm won't work. You need to make sure that the reshaped substring matrix consists of a single column vector. If we ran the above code without checking the length of s1, we would get a row vector, and so simply transpose the result if this is the case.
Now, simply replicate s1 for as many times as we have rows in subseqs so that all of these strings get stacked into a 2D matrix. After, do an equality operator.
eqs = subseqs == repmat(s1, size(subseqs,1), 1);
Now, sum each row (that is, sum across the columns) and see which sums are equal to the length of your string. This will produce a single column vector where 1 indicates that we have found a match, and zero otherwise:
sum(eqs, 2) == len
ans =
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
Finally, to add up how many times the substring matched, you just have to add up all elements in this vector:
out = sum(sum(eqs, 2) == len)
out =
2
As such, we have two instances where abc is found in your string.
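The same logic (build all overlapping windows, compare each one element-wise with s1, count the rows that match everywhere) can be sketched in Python. This only illustrates the idea: the comprehensions are loops in disguise, so it would not satisfy the "no for" restriction of the original exercise, and the function name count_occurrences is made up.

```python
def count_occurrences(s1, s2):
    """Count (possibly overlapping) occurrences of s1 in s2 via sliding windows."""
    n = len(s1)
    if n == 0:
        return 0
    # One window per starting position, like the rows of subseqs.
    windows = [s2[i:i + n] for i in range(len(s2) - n + 1)]
    # A window matches when every character equals the one in s1.
    matches = [all(a == b for a, b in zip(w, s1)) for w in windows]
    return sum(matches)

print(count_occurrences("abc", "kokoabckokabckoab"))  # 2
```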
Here is another one,
s1='abc';
s2='bkcokbacaabcsoabckokabckoabc';
[a,b] = ismember(s2,s1);
b = [0 0 b 0 0];
a1=circshift(b,[0 -1]);
a2=circshift(b,[0 -2]);
sum((b==1)&(a1==2)&(a2==3))
It gives 3 for your input and 4 for my example, and it seems to work well if ismember is allowed.
Just for the fun of it: this can be done with nlfilter from the Image Processing Toolbox (I just discovered this function today and am eager to apply it!):
ds1 = double(s1);
ds2 = double(s2);
result = sum(nlfilter(ds2, [1 numel(ds1)], @(x) all(x==ds1)));
I am on the lookout for a gsub based function which would enable me to do combinatorial string replacement, so that if I would have an arbitrary number of string replacement rules
replrules=list("<x>"=c(3,5),"<ALK>"=c("hept","oct","non"),"<END>"=c("ane","ene"))
and a target string
string="<x>-methyl<ALK><END>"
it would give me a dataframe with the final string name and the substitutions that were made as in
name x ALK END
3-methylheptane 3 hept ane
5-methylheptane 5 hept ane
3-methyloctane 3 oct ane
5-methyloctane 5 ... ...
3-methylnonane 3
5-methylnonane 5
3-methylheptene 3
5-methylheptene 5
3-methyloctene 3
5-methyloctene 5
3-methylnonene 3
5-methylnonene 5
The target string would be of arbitrary structure, e.g. it could also be string="1-<ALK>anol" or each pattern could occur several times, as in string="<ALK>anedioic acid, di<ALK>yl ester"
What would be the most elegant way to do this kind of thing in R?
How about
d <- do.call(expand.grid, replrules)
d$name <- paste0(d$'<x>', "-", "methyl", d$'<ALK>', d$'<END>')
EDIT
This seems to work (substituting each of these into the strsplit)
string = "<x>-methyl<ALK><END>"
string2 = "<x>-ethyl<ALK>acosane"
string3 = "1-<ALK>anol"
Using Richard's regex
d <- do.call(expand.grid, list(replrules, stringsAsFactors=FALSE))
names(d) <- gsub("<|>","",names(d))
s <- strsplit(string3, "(<|>)", perl = TRUE)[[1]]
out <- list()
for (i in s) {
  out[[i]] <- ifelse(i %in% names(d), d[i], i)
}
d$name <- do.call(paste0, unlist(out, recursive=F))
EDIT
This should work for repeat items
d <- do.call(expand.grid, list(replrules, stringsAsFactors=FALSE))
names(d) <- gsub("<|>","",names(d))
string4 = "<x>-methyl<ALK><END>oate<ALK>"
s <- strsplit(string4, "(<|>)", perl = TRUE)[[1]]
out <- list()
for (i in seq_along(s)) {
  out[[i]] <- ifelse(s[i] %in% names(d), d[s[i]], s[i])
}
d$name <- do.call(paste0, unlist(out, recursive=F))
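As a cross-check of the idea (take every combination of the replacement values, then substitute each placeholder wherever it occurs), here is a hedged Python sketch. The function name expand_template and the row layout are assumptions for illustration. Note that itertools.product varies the last rule fastest, so the row order differs from expand.grid.

```python
import re
from itertools import product

def expand_template(template, rules):
    """Return one dict per combination: the filled-in name plus the values used."""
    # Only expand placeholders that actually occur in the template.
    keys = [k for k in rules if k in template]
    if not keys:
        return [{"name": template}]
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    rows = []
    for combo in product(*(rules[k] for k in keys)):
        mapping = dict(zip(keys, (str(v) for v in combo)))
        # Replace every occurrence, so repeated placeholders are handled too.
        name = pattern.sub(lambda m: mapping[m.group(0)], template)
        row = {"name": name}
        row.update({k.strip("<>"): v for k, v in mapping.items()})
        rows.append(row)
    return rows

replrules = {"<x>": [3, 5], "<ALK>": ["hept", "oct", "non"], "<END>": ["ane", "ene"]}
rows = expand_template("<x>-methyl<ALK><END>", replrules)
for row in rows[:3]:
    print(row)
```

Because unused placeholders are dropped before taking the product, a template such as "1-<ALK>anol" yields three rows rather than twelve duplicates.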
Well, I'm not exactly sure we can even produce a "correct" answer to your question, but hopefully this helps give you some ideas.
Okay, so in s, I just split the string where it might be of most importance. Then g gets the first value in each element of replrules. Then I constructed a data frame as an example. So then dat is a one-row example of how it would look.
> (s <- strsplit(string, "(?<=l|\\>)", perl = TRUE)[[1]])
# [1] "<x>" "-methyl" "<ALK>" "<END>"
> g <- sapply(replrules, "[", 1)
> dat <- data.frame(name = paste(append(g, s[2], after = 1), collapse = ""))
> dat[2:4] <- g
> names(dat)[2:4] <- sapply(strsplit(names(g), "<|>"), "[", -1)
> dat
# name x ALK END
# 1 3-methylheptane 3 hept ane
I have an R data frame with 2 fields:
ID WORD
1 AAAAABBBBB
2 ABCAAABBBDDD
3 ...
I'd like to simplify the words with repeating letters by keeping only the letter and not the duplicates in a repetition:
e.g.: AAAAABBBBB should give me AB
and ABCAAABBBDDD should give me ABCABD
Does anyone have an idea of how to do this?
Here's a solution with regex:
x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
gsub("([A-Za-z])\\1+","\\1",x)
EDIT: By request, some benchmarking. I added Matthew Lundberg's pattern in the comment, matching any character. It appears that gsub is faster by an order of magnitude, and matching any character is faster than matching letters.
library(microbenchmark)
set.seed(1)
##create sample dataset
x <- apply(
replicate(100,sample(c(LETTERS[1:3],""),10,replace=TRUE))
,2,paste0,collapse="")
##benchmark
xm <- microbenchmark(
SAPPLY = sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
,GSUB.LETTER = gsub("([A-Za-z])\\1+","\\1",x)
,GSUB.ANY = gsub("(.)\\1+","\\1",x)
)
##print results
print(xm)
# Unit: milliseconds
# expr min lq median uq max
# 1 GSUB.ANY 1.433873 1.509215 1.562193 1.664664 3.324195
# 2 GSUB.LETTER 1.940916 2.059521 2.108831 2.227435 3.118152
# 3 SAPPLY 64.786782 67.519976 68.929285 71.164052 77.261952
##boxplot of times
boxplot(xm)
##plot with ggplot2
library(ggplot2)
qplot(y=time, data=xm, colour=expr) + scale_y_log10()
x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
## [1] "AB" "ABCABD"
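The run-length idea behind rle (keep one value per run of identical characters) looks like this in Python, as a small sketch using itertools.groupby. The function name collapse_runs is made up for illustration.

```python
from itertools import groupby

def collapse_runs(word):
    """Keep a single character for every run of identical consecutive characters."""
    return "".join(key for key, _ in groupby(word))

words = ["AAAAABBBBB", "ABCAAABBBDDD"]
collapsed = [collapse_runs(w) for w in words]
print(collapsed)  # ['AB', 'ABCABD']
```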
Consider the following function on a string:
int F(string S)
{
    int N = S.size();
    int T = 0;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if (S[i] > S[j])
                T++;
    return T;
}
A string S0 of length N with all pairwise distinct characters has a total of N! unique permutations.
For example "bac" has the following 6 permutations:
bac
abc
cba
bca
acb
cab
Consider these N! strings in lexicographical order:
abc
acb
bac
bca
cab
cba
Now consider the application of F to each of these strings:
F("abc") = 0
F("acb") = 1
F("bac") = 1
F("bca") = 2
F("cab") = 2
F("cba") = 3
Given some string S1 of this set of permutations, we want to find the next string S2 in the set, that has the following relationship to S1:
F(S2) == F(S1) + 1
For example, if S1 == "acb" (F = 1) then S2 == "bca" (F = 1 + 1 = 2)
One way to do this would be to start at one past S1 and iterate through the list of permutations looking for F(S) = F(S1)+1. This is unfortunately O(N!).
By what O(N) function on S1 can we calculate S2 directly?
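For reference, the brute-force O(N!) search described above can be sketched in Python. It is only useful as a baseline to test faster candidates against. The function name next_with_one_more_inversion is made up, and the sketch assumes all characters of S1 are distinct.

```python
from itertools import permutations

def F(s):
    """Count pairs i < j with s[i] > s[j], as in the C++ function above."""
    n = len(s)
    return sum(1 for i in range(n) for j in range(i + 1, n) if s[i] > s[j])

def next_with_one_more_inversion(s1):
    """First permutation after s1 (lexicographically) with F equal to F(s1) + 1."""
    target = F(s1) + 1
    # permutations() of a sorted input are generated in lexicographic order.
    for p in permutations(sorted(s1)):
        candidate = "".join(p)
        if candidate > s1 and F(candidate) == target:
            return candidate
    return None  # no such string exists

print(next_with_one_more_inversion("acb"))  # bca
```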
Suppose the length of S1 is n. The largest possible value of F(S1) is n(n-1)/2. If F(S1) = n(n-1)/2, S1 is the last string and there is no next one. If F(S1) < n(n-1)/2, there is at least one char x that is bigger than the char y immediately before it. Find such an x with the lowest index and swap x and y. Let's see it by example:
S1 == "acb" (F = 1): 1 < 3, so there is a char x that is bigger than another char y and has a bigger index than y. Here the x with the smallest index is c, and on the first try you swap it with a (which is smaller than x, so the algorithm finishes here) ==> S2 = "cab", F(S2) = 2.
Now let's test it with S2, "cab": x = b, y = a ==> S3 = "cba".
Finding x is not hard: iterate over the input and keep a variable named min. While the currently visited character is smaller than min, set min to that character and visit the next one. The first time you visit a character that is bigger than min, stop iterating; this is x:
This is the idea in C#:
string NextString(string input)
{
    var min = input[0];
    int i = 1;
    while (i < input.Length && input[i] < min)
    {
        min = input[i];
        i++;
    }
    if (i == input.Length) return "There isn't next item";
    var x = input[i], y = input[i - 1];
    // Keep everything before y, swap y and x, then keep everything after x.
    return input.Substring(0, i - 1) + x + y + input.Substring(i + 1);
}
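Here is the same swap as a short Python sketch, with an inversion counter so the effect on F can be checked. The names swap_first_ascent and count_inversions are made up for illustration. Swapping the first adjacent ascending pair does raise F by exactly one, but for "acb" it produces "cab" rather than the "bca" given in the question, so it is not guaranteed to be the lexicographically next string with F + 1.

```python
def count_inversions(s):
    """Count pairs i < j with s[i] > s[j] (the F from the question)."""
    return sum(1 for i in range(len(s)) for j in range(i + 1, len(s)) if s[i] > s[j])

def swap_first_ascent(s):
    """Swap the first adjacent pair that is in ascending order; None if there is none."""
    for i in range(1, len(s)):
        if s[i] > s[i - 1]:
            return s[:i - 1] + s[i] + s[i - 1] + s[i + 1:]
    return None  # s is fully descending, so F is already maximal

for word in ["acb", "cab", "cba"]:
    print(word, "->", swap_first_ascent(word))
```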
Here's the outline of an algorithm for a solution to your problem.
I'll assume that you have a function to directly return the n-th permutation (given n) and its inverse, ie a function to return n given a permutation. Let these be perm(n) and perm'(n) respectively.
If I've figured it correctly, when you have a 4-letter string to permute the function F goes like this:
F("abcd") = 0
F("abdc") = 1
F(perm(3)) = 1
F(...) = 2
F(...) = 2
F(...) = 3
F(perm(7)) = 1
F(...) = 2
F(...) = 2
F(...) = 3
F(...) = 3
F(...) = 4
F(perm(13)) = 2
F(...) = 3
F(...) = 3
F(...) = 4
F(...) = 4
F(...) = 5
F(perm(19)) = 3
F(...) = 4
F(...) = 4
F(...) = 5
F(...) = 5
F(perm(24)) = 6
In words, when you go from 3 letters to 4 you get 4 copies of the table of values of F, adding (0,1,2,3) to the (1st,2nd,3rd,4th) copy respectively. In the 2nd case, for example, you already have one derangement by putting the 2nd letter in the 1st place; this simply gets added to the other derangements in the same pattern as would be true for the original 3-letter strings.
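The pattern described here (N shifted copies of the table for N-1 letters) can be checked with a short Python sketch. The function name f_table is made up, and the sketch assumes permutations are listed in lexicographic order with distinct letters.

```python
from itertools import permutations

def F(s):
    """Count pairs i < j with s[i] > s[j]."""
    n = len(s)
    return sum(1 for i in range(n) for j in range(i + 1, n) if s[i] > s[j])

def f_table(n):
    """F values of all permutations of n distinct letters, in lexicographic order."""
    if n <= 1:
        return [0]
    smaller = f_table(n - 1)
    # The k-th copy (k = 0 .. n-1) adds k: the first letter has k smaller letters after it.
    return [k + value for k in range(n) for value in smaller]

direct = [F("".join(p)) for p in permutations("abcd")]
print(f_table(4) == direct)  # True
```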
From this outline it shouldn't be too difficult (but I haven't got time right now) to write the function F. Strictly speaking, the inverse of F isn't a function, as it would be multi-valued, but given n and F(n) there are only a few cases for finding m such that F(m) == F(n) + 1. These cases are:
n == N!, where N is the number of letters in the string: there is no next permutation;
F(n+1) < F(n): the sought-for solution is perm(n+(N-1)!);
F(n+1) == F(n): the solution is perm(n+2);
F(n+1) > F(n): the solution is perm(n+1).
I suspect that some of this might only work for 4 letter strings, that some of these terms will have to be adjusted for K-letter permutations.
This is not O(n), but at least it is O(n²) (where n is the number of elements in the permutation, 3 in your example).
First, notice that whenever you place a character in your string, you already know how much of an increase in F that's going to mean -- it's however many characters smaller than that one that haven't been added to the string yet.
This gives us another algorithm to calculate F(n):
used = set()
def get_inversions(S1):
    inv = 0
    for index, ch in enumerate(S1):
        character = ord(ch)-ord('a')
        cnt = sum(1 for x in range(character) if x not in used)
        inv += cnt
        used.add(character)
    return inv
This is not much better than the original version, but it is useful when inverting F. You want to know the first string that is lexicographically smaller -- therefore, it makes sense to copy your original string and only change it whenever mandatory. When such changes are required, we should also change the string by the least amount possible.
To do so, let's use the information that the biggest value of F for a string with n letters is n(n-1)/2. Whenever the number of required inversions would be bigger than this amount if we didn't change the original string, this means we must swap a letter at that point. Code in Python:
used = set()
def get_inversions(S1):
    inv = 0
    for index, ch in enumerate(S1):
        character = ord(ch)-ord('a')
        cnt = sum(1 for x in range(character) if x not in used)
        inv += cnt
        used.add(character)
    return inv
def f_recursive(n, S1, inv, ign):
    if n == 0: return ""
    # inversions that the remaining n-1 letters cannot supply on their own
    delta = inv - (n-1)*(n-2)//2
    if ign:
        cnt = 0
        ch = 0
    else:
        ch = ord(S1[len(S1)-n])-ord('a')
        cnt = sum(1 for x in range(ch) if x not in used)
    for letter in range(ch, len(S1)):
        if letter not in used:
            if cnt < delta:
                cnt += 1
                continue
            used.add(letter)
            if letter != ch: ign = True
            return chr(letter+ord('a'))+f_recursive(n-1, S1, inv-cnt, ign)
def F_inv(S1):
    used.clear()
    inv = get_inversions(S1)
    used.clear()
    return f_recursive(len(S1), S1, inv+1, False)
print(F_inv("acb"))
It can also be made to run in O(n log n) by replacing the innermost loop with a data structure such as a binary indexed tree.
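As a rough illustration of the binary indexed tree idea, here is a sketch that counts inversions in O(n log n). It is my own minimal version, not the exact replacement for the inner loop of f_recursive: it counts, for each character, how many earlier characters are larger, which gives the same total as F. The function name count_inversions is made up.

```python
def count_inversions(s):
    """Count pairs i < j with s[i] > s[j] using a Fenwick (binary indexed) tree."""
    # Map each distinct character to a rank 1..size.
    ranks = {ch: r for r, ch in enumerate(sorted(set(s)), start=1)}
    size = len(ranks)
    tree = [0] * (size + 1)

    def add(pos):
        # Record one occurrence at rank pos.
        while pos <= size:
            tree[pos] += 1
            pos += pos & -pos

    def prefix(pos):
        # Number of recorded occurrences with rank <= pos.
        total = 0
        while pos > 0:
            total += tree[pos]
            pos -= pos & -pos
        return total

    inversions = 0
    for seen, ch in enumerate(s):
        r = ranks[ch]
        inversions += seen - prefix(r)  # earlier characters strictly greater than ch
        add(r)
    return inversions

print(count_inversions("cba"))  # 3
```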
Did you try swapping two neighboring characters in the string? It seems that this can help solve the problem. If you swap adjacent characters S[i] and S[i+1], where S[i] < S[i+1], then F(S) increases by exactly one, because no other pair of indices is affected by this swap.
If I'm not mistaken, F calculates the number of inversions of the permutation.