Finding sequences of nucleotides probability in multiple strings - string

This is my first post here so please be patient with me.
I am stuck on a problem and don't know what to do. I am given a
non-overlapping substring of length 'm' that appears twice in a string of length 'n'. What is the probability of finding the substring of length 'm' in both that string and another string of length '2n'?
For argument's sake let's say that m = 4 and n = 33. I have tried to use Independent Event probability as well as Markov Chain Models, but my answer never seems to be correct.
What would be the chance that the same 2 non-overlapping substrings of length 4 that are found in the string of length 33 will be found in a string of length 66?

Related

what will be the dp and transitions in this problem

Vasya has a string s of length n consisting only of digits 0 and 1. Also he has an array a of length n.
Vasya performs the following operation until the string becomes empty: choose some consecutive substring of equal characters, erase it from the string and glue together the remaining parts (any of them can be empty). For example, if he erases substring 111 from string 111110 he will get the string 110. Vasya gets ax points for erasing substring of length x.
Vasya wants to maximize his total points, so help him with this!
https://codeforces.com/problemset/problem/1107/E
i was trying to get my head around the editorial,but couldn't understand it... can anyone tell an easy way to do it?
input:
7
1101001
3 4 9 100 1 2 3
output:
109
Explanation
the optimal sequence of erasings is: 1101001 → 111001 → 11101 → 1111 → ∅.
Here, we consider removing prefixes instead of substrings. Why?
We try to remove a consecutive prefix of a particular state which is actually a substring in the main string. So, our DP states will be start index, end index, prefix length.
Let's consider an example str = "1010110". Here, initially start=0, end=7, and prefix=1(the first '1' will be the only prefix now). we iterate over all the indices in the current state except the starting index and check if str[i]==str[start]. Here, for example, str[4]==str[0]. Now we divide the string into "010" with prefix=1(010) && "110" with prefix=2(1010110). These two are now two individual subproblems. So, when there remains a string with length 1, we return aprefix.
Here is my code.

Generate partial strings which have predefined minimum lengths (Matlab)

I have an initial string Init={ABCDEFGH}. How can I generate 100 partial strings (randomly) from Init string which have these conditions:
A pre-defined minimum lengths.
The order of elements in each partial string should be from 'A' to 'Z'.
No repeated characters in each partial strings
The expected output should be as follows: 100 partial strings, minimum length of each partial string is 5
Output = {'BCEGH';'ACEFG';'ABCDEF';'BCFGH';'BCDEG';....;'ABEFH';'ABCEGH'}
numel(Output) = 100
To do this, I started by generating random numbers for the length of each partial string. Then I generated random numbers corresponding to each letter in each string. Then I transferred those numbers into their corresponding letters. The comments should explain the rest.
n=100 %// how many samples to take
C='ABCDEFGH' %// take samples from these letters
maxL=numel(C) %// the longest string
minL=5 %// the shortest string
len=randi([minL maxL],[n 1]) %// generate length of each partial string
arrayfun(#(l) C(randsample(1:8,l)),len,'uni',0) %// randomly sample letters to give strings of correct length
and n=4 gives, for example
ans =
'CFHABEDG'
'CFHABE'
'FAHBE'
'DGHFABE'
I'm not sure this is truly random because it assumes that there are the same number of strings of each length, but I don't think this is true. I think len should be weighted with respect to the number of strings of each length. I think (but I'm not sure) that this should fix that:
for i=1:(maxL-minL+1)
w(i)=factorial(minL-1+i)*nchoosek(maxL,minL-1+i);
end
len=minL-1+randsample(1:(maxL-minL+1),n,true,w./sum(w))

How to find all strings that do not contain substring palindromes

Disclaimer: This is a problem lifted from HackerRank, but their editorial answer wasn't sufficient so I hoped to get better answers. If it's against any policy, please let me know and I'll take this down.
Problem:
You are given two integers, N and M. Count the number of strings of length N under the alphabet set of size M that doesn't contain any palindromic string of the length greater than 1 as a consecutive substring.
N=2,M=2 -> 2 :: AA, AB, BA, BB
N=2,M=3 -> 6 :: AA, AB, AC, BA, BB, BC, CA, CB, CC
ABCDE counts as it does not contain any palindromic substrings.
ABCCC does not count as it does contain "CCC", a palindrome of length >1.
Editorial
Here is the provided answer which I think is wrong:
For N>=3, there are (M−2) ways to choose any next symbol (after the first two) - basically it should not coincide with the previous and pred-previous symbols, that aren't equal.
If N=1, return M
If N=2, return M * (M-1)
If N>=3, return M * (M-1) * (M-2)^(N-2)
counterexample: N=4, M=3, "ABCC"
My Solution Try
When I was working on this problem, I tried to find all the strings that contained palindromic substrings and subtracting that from the total, M^N. I ran into a lot of problems with over counting. For example, "ABABA" has "ABA","BAB","ABA" of n=3, and "ABABA" of n=5.
Thanks for any help in elucidating this problem. I really hope for a good answer to figure this out!
Suppose you build up palindrome-free strings one letter at a time. For the first letter, you have M choices, and for the second, you have M-1, since you can't use the first letter. This much is obvious.
For every letter after the first two, you can't use the previous letter, and you can't use the letter before that, so that's two choices eliminated. What about the other letters? Well, if using one of those creates a palindrome, it would have to be a palindrome of length at least 4 - but if adding a letter creates a palindrome of length K+2 for K>=2, the string must already have had a palindrome of length K for the new palindrome to build off of. (For K<2, this is okay.) Since the string didn't have any palindromes of length >=2, we can conclude that adding any letter other than the previous two letters is fine.
Thus, we have M choices for the first letter, M-1 choices for the second, and M-2 for every letter after that.

Deterministic automata to find number of subsequence in string of another string

Deterministic automata to find number of subsequences in string ?
How can I construct a DFA to find number of occurence string as a subsequence in another string?
eg. In "ssstttrrriiinnngggg" we have 3 subsequences which form string "string" ?
also both string to be found and to be searched only contain characters from specific character Set .
I have some idea about storing characters in stack poping them accordingly till we match , if dont match push again .
Please tell DFA solution ?
OVERLAPPING MATCHES
If you wish to count the number of overlapping sequences then you simply construct a DFA that matches the string, e.g.
1 -(if see s)-> 2 -(if see t)-> 3 -(if see r)-> 4 -(if see i)-> 5 -(if see n)-> 6 -(if see g)-> 7
and then compute the number of ways of being in each state after seeing each character using dynamic programming. See the answers to this question for more details.
DP[a][b] = number of ways of being in state b after seeing the first a characters
= DP[a-1][b] + DP[a-1][b-1] if character at position a is the one needed to take state b-1 to b
= DP[a-1][b] otherwise
Start with DP[0][b]=0 for b>1 and DP[0][1]=1.
Then the total number of overlapping strings is DP[len(string)][7]
NON-OVERLAPPING MATCHES
If you are counting the number of non-overlapping sequences, then if we assume that the characters in the pattern to be matched are distinct, we can use a slight modification:
DP[a][b] = number of strings being in state b after seeing the first a characters
= DP[a-1][b] + 1 if character at position a is the one needed to take state b-1 to b and DP[a-1][b-1]>0
= DP[a-1][b] - 1 if character at position a is the one needed to take state b to b+1 and DP[a-1][b]>0
= DP[a-1][b] otherwise
Start with DP[0][b]=0 for b>1 and DP[0][1]=infinity.
Then the total number of non-overlapping strings is DP[len(string)][7]
This approach will not necessarily give the correct answer if the pattern to be matched contains repeated characters (e.g. 'strings').

Given string s, find the shortest string t, such that, t^m=s

Given string s, find the shortest string t, such that, t^m=s.
Examples:
s="aabbb" => t="aabbb"
s="abab" => t = "ab"
How fast can it be done?
Of course naively, for every m divides |s|, I can try if substring(s,0,|s|/m)^m = s.
One can figure out the solution in O(d(|s|)n) time, where d(x) is the number of divisors of s. Can it be done more efficiently?
This is the problem of computing the period of a string. Knuth, Morris and Pratt's sequential string matching algorithm is a good place to get started. This is in a paper entitled "Fast Pattern Matching in Strings" from 1977.
If you want to get fancy with it, then check out the paper "Finding All Periods and Initial Palindromes of a String in Parallel" by Breslauer and Galil in 1991. From their abstract:
An optimal O(log log n) time CRCW-PRAM algorithm for computing all
periods of a string is presented. Previous parallel algorithms compute
the period only if it is shorter than half of the length of the
string. This algorithm can be used to find all initial palindromes of
a string in the same time and processor bounds. Both algorithms are
the fastest possible over a general alphabet. We derive a lower bound
for finding palindromes by a modification of a previously known lower
bound for finding the period of a string [3]. When p processors are
available the bounds become \Theta(d n p e + log log d1+p=ne 2p).
I really like this thing called the z-algorithm: http://www.utdallas.edu/~besp/demo/John2010/z-algorithm.htm
For every position it calculates the longest substring starting from there, that is also a prefix of the whole string. (in linear time of course).
a a b c a a b x a a a z
1 0 0 3 1 0 0 2 2 1 0
Given this "z-table" it is easy to find all strings that can be exponentiated to the whole thing. Just check for all positions if pos+z[pos] = n.
In our case:
a b a b
0 2 0
Here pos = 2 gives you 2+z[2] = 4 = n hence the shortest string you can use is the one of length 2.
This will even allow you to find cases where only a prefix of the exponentiated string matches, like:
a b c a
0 0 1
Here (abc)^2 can be cut down to your original string. But of course, if you don't want matches like this, just go over the divisors only.
Yes you can do it in O(|s|) time.
You can search for a "target" string of length n in a "source" string of length m in O(n+m) time. Build a solution based on that.
Let both source and target be s. An additional constraint is that 1 and any positions in the source that do not divide |s| are not valid starting positions for the match. Of course the search per se will always fail. But if there's a partial match and you have reached the end of the sourse string, then you have a solution to the original problem.
a modification to Boyer-Moore could possibly handle this in O(n) where n is length of s
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

Resources