Count the number of overlapping substrings within a string

example:
s <- "aaabaabaa"
p <- "aa"
I want to return 4, not 3 (i.e. counting the number of "aa" instances in the initial "aaa" as 2, not 1).
Is there any package to solve it? Or is there any way to count in R?

I believe that
find_overlaps <- function(p, s) {
  gg <- gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)[[1]]
  if (length(gg) == 1 && gg == -1) 0 else length(gg)
}
find_overlaps("aa","aaabaabaa") ## 4
find_overlaps("not_there","aaabaabaa") ## 0
find_overlaps("aa","aaaaaaaa") ## 7
will do what you want, which would be more clearly expressed as "finding the number of overlapping substrings within a string".
This is a minor variation on "Finding the indexes of multiple/overlapping matching substrings".
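For comparison, the same zero-width lookahead idea can be sketched in Python. A lookahead (?=...) matches at a position without consuming any characters, so the regex engine reports a match at every starting position where the pattern occurs, including overlapping ones. The function name count_overlaps is made up for this illustration, and re.escape is added on the assumption that p should be treated as a literal string rather than as a regex.

```python
import re

def count_overlaps(p, s):
    """Count occurrences of the literal string p in s, overlaps included."""
    # (?=...) is a zero-width lookahead: it matches without consuming
    # characters, so every start position is tested independently.
    return len(re.findall(f"(?={re.escape(p)})", s))

print(count_overlaps("aa", "aaabaabaa"))
print(count_overlaps("not_there", "aaabaabaa"))
print(count_overlaps("aa", "aaaaaaaa"))
```

The three calls mirror the R examples above and give 4, 0 and 7.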

substring might be useful here, by taking every successive pair of characters.
( ss <- sapply(2:nchar(s), function(i) substring(s, i-1, i)) )
## [1] "aa" "aa" "ab" "ba" "aa" "ab" "ba" "aa"
sum(ss %in% p)
## [1] 4

I needed the answer to a related more-general question. Here is what I came up with generalizing Ben Bolker's solution:
my.data <- read.table(text = '
my.string my.cov
1.2... 1
.21111 2
..2122 3
...211 2
112111 4
212222 1
', header = TRUE, stringsAsFactors = FALSE)
desired.result.2ch <- read.table(text = '
my.string my.cov n.11 n.12 n.21 n.22
1.2... 1 0 0 0 0
.21111 2 3 0 1 0
..2122 3 0 1 1 1
...211 2 1 0 1 0
112111 4 3 1 1 0
212222 1 0 1 1 3
', header = TRUE, stringsAsFactors = FALSE)
desired.result.3ch <- read.table(text = '
my.string my.cov n.111 n.112 n.121 n.122 n.222 n.221 n.212 n.211
1.2... 1 0 0 0 0 0 0 0 0
.21111 2 2 0 0 0 0 0 0 1
..2122 3 0 0 0 1 0 0 1 0
...211 2 0 0 0 0 0 0 0 1
112111 4 1 1 1 0 0 0 0 1
212222 1 0 0 0 1 2 0 1 0
', header = TRUE, stringsAsFactors = FALSE)
find_overlaps <- function(s, my.cov, p) {
  gg <- gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)[[1]]
  if (length(gg) == 1 && gg == -1) 0 else length(gg)
}
p <- c('11', '12', '21', '22', '111', '112', '121', '122', '222', '221', '212', '211')
my.output <- matrix(0, ncol = (nrow(my.data)+1), nrow = length(p))
for (i in seq_along(p)) {
  my.data$p <- p[i]
  my.output[i, 1] <- p[i]
  my.output[i, 2:(nrow(my.data) + 1)] <- apply(my.data, 1, function(x) find_overlaps(x[1], x[2], x[3]))
}
my.output
desired.result.2ch
desired.result.3ch
pre.final.output <- matrix(t(my.output[,2:7]), ncol=length(p), nrow=nrow(my.data))
final.output <- data.frame(my.data[,1:2], t(apply(pre.final.output, 1, as.numeric)))
colnames(final.output) <- c(colnames(my.data[,1:2]), paste0('x', p))
final.output
# my.string my.cov x11 x12 x21 x22 x111 x112 x121 x122 x222 x221 x212 x211
#1 1.2... 1 0 0 0 0 0 0 0 0 0 0 0 0
#2 .21111 2 3 0 1 0 2 0 0 0 0 0 0 1
#3 ..2122 3 0 1 1 1 0 0 0 1 0 0 1 0
#4 ...211 2 1 0 1 0 0 0 0 0 0 0 0 1
#5 112111 4 3 1 1 0 1 1 1 0 0 0 0 1
#6 212222 1 0 1 1 3 0 0 0 1 2 0 1 0

A tidy and, I think, more readable solution is
library(tidyverse)
PatternCount <- function(text, pattern) {
  # Generate all sliding substrings
  map(seq_len(nchar(text) - nchar(pattern) + 1),
      function(x) str_sub(text, x, x + nchar(pattern) - 1)) %>%
    # Test them against the pattern
    map_lgl(function(x) x == pattern) %>%
    # Count the number of matches
    sum
}
PatternCount("aaabaabaa", "aa")
# 4

Related

PYEDA truthtable of functions

I hope someone here is comfortable with PyEDA.
I want to add fictitious (dummy) variables to a function.
Say I have f = x1. How can I get a truth table for this function that also includes x2?
For example, the truth table for f(x1) = x1 is:
x1 f
0 0
1 1
But the truth table for f(x1, x2) = x1 is:
x1 x2 f
0 0 0
0 1 0
1 0 1
1 1 1
But I only get the first table, because PyEDA automatically simplifies x1&(x2|~x2) to x1. How can I add this x2?
def calcFunction(function, i):
    # i is a point with 4 coordinates: the values for x1..x4
    function = function.restrict({x4: i[3]})
    function = function.restrict({x3: i[2]})
    function = function.restrict({x2: i[1]})
    function = function.restrict({x1: i[0]})
    if function.satisfy_one() is not None:
        return 1
    return 0
This is my workaround: I evaluate the function manually at each point. The function can contain 1 to 4 variables, and I evaluate it at every combination of values of x1...x4.
I'm not sure I understand the question as asked, but you might want to try the expression simplify method.
For example:
In [1]: f = (X[1] & X[2]) | (X[3] | X[4] | ~X[3])
In [2]: expr2truthtable(f)
Out[2]:
x[4] x[3] x[2] x[1]
0 0 0 0 : 1
0 0 0 1 : 1
0 0 1 0 : 1
0 0 1 1 : 1
0 1 0 0 : 1
0 1 0 1 : 1
0 1 1 0 : 1
0 1 1 1 : 1
1 0 0 0 : 1
1 0 0 1 : 1
1 0 1 0 : 1
1 0 1 1 : 1
1 1 0 0 : 1
1 1 0 1 : 1
1 1 1 0 : 1
1 1 1 1 : 1
In [3]: f = f.simplify()
In [4]: f
Out[4]: 1
In [5]: expr2truthtable(f)
Out[5]: 1
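The point-by-point evaluation described in the question can be sketched in plain Python, independent of PyEDA: enumerate every assignment of the listed variables and evaluate the function at each one. Variables the function ignores (here x2) still get their own column. The helper truth_table and the use of an ordinary Python callable in place of a PyEDA expression are assumptions for illustration, not PyEDA API.

```python
from itertools import product

def truth_table(func, names):
    """Return rows (value of each name..., f) for every 0/1 assignment."""
    rows = []
    for values in product((0, 1), repeat=len(names)):
        # Map each variable name to its value at this point
        point = dict(zip(names, values))
        rows.append(values + (int(bool(func(point))),))
    return rows

# f depends only on x1, but the table is built over both x1 and x2
f = lambda point: point["x1"]

print("x1 x2 f")
for row in truth_table(f, ["x1", "x2"]):
    print(" ".join(str(v) for v in row))
```

Because the variable list is passed explicitly, the table size is fixed by the caller and not by whatever variables survive simplification.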

Pattern identification and sequence detection

I have a dataset 'df' that looks something like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 1 1 0 0 1
D 0 0 1 0 0 1
As you can see, there are several rows of ones and zeros. Can anyone suggest Python code that counts the length of the run of consecutive 1s ending at the first occurrence of the sequence 1, 0, 0? For example, for member A the first double-zero event occurs at seen_2 and seen_3, so the event value is 1. Similarly, for member B the first double-zero event occurs at seen_3 and seen_4, and two 1s occur before it, so the event value is 2. The resulting table should have a new column 'event', like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 1 1 0 0 1 3
D 0 0 1 0 0 1 1
My approach:
df = df.set_index('MEMBER')
# count 1 on each rows since the last 0
s = (df.stack()
.groupby(['MEMBER', df.eq(0).cumsum(1).stack()])
.cumsum().unstack()
)
# mask of the zeros:
u = s.eq(0)
# look for the first 1 0 0
idx = (~u &
u.shift(-1, axis=1, fill_value=False) &
u.shift(-2, axis=1, fill_value=False) ).idxmax(1)
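# look up (DataFrame.lookup was removed in pandas 2.0, so index each cell directly)
df['event'] = [s.at[member, col] for member, col in idx.items()]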
Test data:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
0 A 1 0 1 0 0 1
1 B 1 1 0 0 1 0
2 C 1 1 1 0 0 1
3 D 0 0 1 0 0 1
4 E 1 0 1 1 0 0
Output:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
0 A 1 0 1 0 0 1 1
1 B 1 1 0 0 1 0 2
2 C 1 1 1 0 0 1 3
3 D 0 0 1 0 0 1 1
4 E 1 0 1 1 0 0 2
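As a cross-check on the vectorized version, the rule can also be written as a plain loop over one row at a time. This is only a sketch: the function name count_ones_before_double_zero is made up here, and returning 0 when no 1, 0, 0 pattern exists is an assumption, since the question does not cover that case.

```python
def count_ones_before_double_zero(row):
    """Length of the run of 1s that ends at the first 1, 0, 0 pattern.

    Returns 0 if the pattern never occurs (an assumption; the question
    does not say what should happen in that case).
    """
    for i in range(len(row) - 2):
        if row[i] == 1 and row[i + 1] == 0 and row[i + 2] == 0:
            # Walk backwards from position i while the values are still 1
            run = 0
            j = i
            while j >= 0 and row[j] == 1:
                run += 1
                j -= 1
            return run
    return 0

# Same rows as the test data above
test_rows = {
    "A": [1, 0, 1, 0, 0, 1],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [1, 1, 1, 0, 0, 1],
    "D": [0, 0, 1, 0, 0, 1],
    "E": [1, 0, 1, 1, 0, 0],
}
for member, row in test_rows.items():
    print(member, count_ones_before_double_zero(row))
```

On the test data this reproduces the event column shown above (1, 2, 3, 1, 2).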

Set to 0 x% of non zero values in numpy 2d array

I tried different ways but it seems impossible for me to do it efficiently without looping through.
Input is an array y and a percentage x.
e.g. input is
y=np.random.binomial(1,1,[10,10])
x=0.5
output
[[0 0 0 0 1 1 1 1 0 1]
[1 0 1 0 0 1 0 1 0 1]
[1 0 1 1 1 1 0 0 0 1]
[0 1 0 1 1 0 1 0 1 1]
[0 1 1 0 0 1 1 1 0 0]
[0 0 1 1 1 0 1 1 0 1]
[0 1 0 0 0 0 1 0 1 1]
[0 0 0 1 1 1 1 1 0 0]
[0 1 1 1 1 0 0 1 0 0]
[1 0 1 0 1 0 0 0 0 0]]
Here's one based on masking -
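def set_nonzeros_to_zeros(a, setz_ratio):
    # Note: modifies a in place and also returns it
    nz_mask = a!=0  # True where a is non-zero
    nz_count = nz_mask.sum()
    z_set_count = int(np.round(setz_ratio*nz_count))
    # Choose which of the non-zero positions to zero out
    idx = np.random.choice(nz_count,z_set_count,replace=False)
    mask0 = np.ones(nz_count,dtype=bool)
    mask0.flat[idx] = 0
    nz_mask[nz_mask] = mask0  # now False at the chosen non-zero positions
    a[~nz_mask] = 0
    return a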
This skips generating all the indices with np.argwhere/np.nonzero and uses a mask-based approach instead, for performance.
Sample run -
In [154]: np.random.seed(0)
...: a = np.random.randint(0,3,(5000,5000))
# number of non-0s before using solution
In [155]: (a!=0).sum()
Out[155]: 16670017
In [156]: a_out = set_nonzeros_to_zeros(a, setz_ratio=0.2) #set 20% of non-0s to 0s
# number of non-0s after using solution
In [157]: (a_out!=0).sum()
Out[157]: 13336014
# Verify
In [158]: 16670017 - 0.2*16670017
Out[158]: 13336013.6
There are a few vectorized methods that might help you, depending on what you want to do:
# Flatten the 2D array and get the indices of the non-zero elements
c = y.flatten()
d = c.nonzero()[0]
# Shuffle the indices and set the first 100x % to zero
np.random.shuffle(d)
x = 0.5
c[d[:int(x*len(d))]] = 0
# reshape to the original 2D shape
y = c.reshape(y.shape)
No doubt there are some efficiency improvements to be made here.

how do you replace only a certain number of items in a list randomly?

board = []
for x in range(0,8):
    board.append(["0"] * 8)
def print_board(board):
    for row in board:
        print(" ".join(row))
This code creates a grid of zeros, but I wish to replace 5 of them with ones and another 5 with twos.
Does anyone know a way to do this?
If you want to randomly set some coordinates with "1" and "2", you can do it like this:
import random
board = []
for x in range(0, 8):
    board.append(["0"] * 8)
def print_board(board):
    for row in board:
        print(" ".join(row))
def generate_coordinates(x, y, k):
    coordinates = [(i, j) for i in range(x) for j in range(y)]
    random.shuffle(coordinates)
    return coordinates[:k]
coo = generate_coordinates(8, 8, 10)
ones = coo[:5]
twos = coo[5:]
for i, j in ones:
    board[i][j] = "1"
for i, j in twos:
    board[i][j] = "2"
print_board(board)
Output
0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0
0 0 2 0 0 0 0 0
1 0 0 0 2 0 0 0
0 0 0 0 2 0 0 2
2 0 0 0 0 0 0 1
Notes:
The code above generates a new random sample on every run, so the output will differ each time. To get the same result every run, call random.seed(42) first; you can replace 42 with any number you want.
The function generate_coordinates receives x (the number of rows), y (the number of columns) and k (the number of coordinates to pick). It generates the list of all x*y coordinates, shuffles it and returns the first k.
In your specific case x = 8, y = 8 and k = 10 (5 for the ones and 5 for the twos).
Finally, this picks the positions for the ones and twos and changes the values:
ones = coo[:5]
twos = coo[5:]
for i, j in ones:
    board[i][j] = "1"
for i, j in twos:
    board[i][j] = "2"
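A possible variation, sketched under a few assumptions: random.sample can draw the distinct cells directly, which avoids building and shuffling the full list of coordinates. The function name place_pieces, its counts dictionary and the seed parameter are all hypothetical names introduced for this example.

```python
import random

def place_pieces(rows, cols, counts, seed=None):
    """Build a rows x cols board of "0" and place counts[label] copies of each label."""
    rng = random.Random(seed)  # a fixed seed makes the layout reproducible
    board = [["0"] * cols for _ in range(rows)]
    total = sum(counts.values())
    # random.sample draws distinct cell numbers, so no cell is picked twice
    cells = rng.sample(range(rows * cols), total)
    start = 0
    for label, n in counts.items():
        for cell in cells[start:start + n]:
            # Convert the flat cell number back to (row, column)
            board[cell // cols][cell % cols] = label
        start += n
    return board

board = place_pieces(8, 8, {"1": 5, "2": 5}, seed=42)
for row in board:
    print(" ".join(row))
```

Numbering the cells from 0 to rows*cols - 1 and sampling from that range keeps the selection to a single call, and the same function works for any number of labels.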

Matlab string operation

I have converted a string to binary as follows
message='hello my name is kamran';
messagebin=dec2bin(message);
Is there any method for storing it in an array?
I am not really sure of what you want to do here, but if you need to concatenate the rows of the binary representation (which is a matrix of numchars times bits_per_char), this is the code:
message = 'hello my name is kamran';
messagebin = dec2bin(double(message));
linearmessagebin = reshape(messagebin',1,numel(messagebin));
Please note that the double conversion returns your ASCII code. I do not have access to a Matlab installation here, but for example octave complains about the code you provided in the original question.
NOTE
As it was kindly pointed out to me, you have to transpose the messagebin before "serializing" it, in order to have the correct result.
If you want the result as numeric matrix, try:
>> str = 'hello world';
>> b = dec2bin(double(str),8) - '0'
b =
0 1 1 0 1 0 0 0
0 1 1 0 0 1 0 1
0 1 1 0 1 1 0 0
0 1 1 0 1 1 0 0
0 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0
0 1 1 1 0 1 1 1
0 1 1 0 1 1 1 1
0 1 1 1 0 0 1 0
0 1 1 0 1 1 0 0
0 1 1 0 0 1 0 0
Each row corresponds to a character. You can easily reshape it into a sequence of 0s and 1s.

Resources