See if a string is embedded in a larger string

I have data that looks like this using R.
> hits
Views on a 51-letter DNAString subject
subject: TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA
views:
    start end width
[1]     1  10    10 [TCAGAAACAA]
[2]    14  23    10 [CCAAAATCAG]
[3]    19  28    10 [ATCAGTAAGG]
[4]    20  29    10 [TCAGTAAGGA]
[5]    21  30    10 [CAGTAAGGAG]
So I have a 51 length string called
subject = TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA.
5 substrings are extracted from this subject. You can see them above. I'd like to see if the 5 substrings are in my area of interest. This area of interest is from position 14 - 27.
subject = TCAGAAACAAAAC |-> CCAAAATCAGTAAG <-| GAGGAGAAAGAAACCTAGGGAGAA.
In other words, I have 5 substrings from the subject string. Out of these 5 strings, I am only looking for strings that lie between position 14 - 27 of the subject string. This is my area of interest.
The first [1] substring [TCAGAAACAA] is not that important since it is embedded right at the start (given by the coordinates 1 - 10) and is outside my area of interest.
The second [2] string, given by the coordinates 14 - 23, is entirely embedded in my area of interest (which again is 14 - 27).
The third [3] string is given by the coordinates 19 - 28. This is important to me as the majority of the string is embedded in my area of interest.
The fourth [4] string is given by the coordinates 20 - 29. Again this is important to me since the majority of the string is embedded in my area of interest, except the last two characters.
The story is the same for the fifth substring.
Basically if 60% of the string is embedded in my area of interest I'd like to count it.
Can someone give me an algorithm in pseudocode that can do this? I have been thinking about this for a while, drawing diagrams, but I can't seem to implement it. I am doing this in R so I will convert the pseudocode to R. Also, the number 60% is arbitrary. I'll have to confirm this with my supervisor but I am sure this is irrelevant.

If I understood well, you need to
Define an 'area of interest' given by a start position and an end position.
Find a string or an accepted portion of a string in the area of interest of the larger string.
So this is what I would do in javascript
var fractionIsInString = function (areaOfInterest, stringToBeFound, acceptedFraction) {
    // Length of the window that must match, e.g. 6 of 10 characters for 0.6.
    var fractionLength = Math.floor(stringToBeFound.length * acceptedFraction),
        startPosition = 0,
        endPosition = fractionLength,
        fraction,
        keepSearching = true;
    do {
        fraction = stringToBeFound.substring(startPosition, endPosition);
        if (areaOfInterest.indexOf(fraction) > -1) {
            return true;
        }
        startPosition++;
        endPosition++;
        // substring() excludes endPosition, so the last window ends at length.
        keepSearching = endPosition <= stringToBeFound.length;
    } while (keepSearching);
    return false;
};
To call it you simply say
fractionIsInString('CCAAAATCAGTAAG', 'TCAGAAACAA', 0.6);
The first parameter is your area of interest. Since substring uses 0-based indices and excludes the end index, positions 14 - 27 can be obtained like this
subject.substring(13, 27);
The second parameter is the first of the strings you get from your subject. The one that goes from 1 to 10.
The third parameter is the portion of the second parameter that you want to be found. 60% in this case.
How the function works is that it looks for the fraction of the string in the larger string and if the fraction is not found, it moves to the next fraction of the string and so on until it finds a fraction that is found or it reaches the end of the string.

def substring_index(longstring, substring):
    """Return the 1-based index of the substring in longstring (0 if absent)."""
    # Python has a built-in method for this; str.find is 0-based, so add 1.
    return longstring.find(substring) + 1

def is_interesting(index, length, interesting_start, interesting_end, percentage):
    """Return True if enough of the substring lies in the interesting range."""
    interesting = 0
    # Check whether each position from index to index + length - 1
    # is in the interesting range (both ends inclusive).
    for x in range(index, index + length):
        if interesting_start <= x <= interesting_end:
            interesting += 1
    # The substring counts if the interesting share reaches the percentage.
    return interesting >= percentage * length
Use the substring_index function to see if and where the substring lies in the longstring.
Use the is_interesting function to return a boolean based on whether the substring is interesting.
So, for the first substring, you could call it like this:
longstring = "TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA"
substring = "TCAGAAACAA"
is_interesting(substring_index(longstring, substring), len(substring), 14, 27, 0.6)
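Because each view already comes with start and end coordinates, the question can also be answered from positions alone, without searching any text. A minimal sketch, assuming 1-based inclusive coordinates as in the R output; the names overlap_fraction and VIEWS are made up for illustration, and 0.6 is the asker's provisional threshold:

```python
def overlap_fraction(start, end, roi_start, roi_end):
    """Fraction of the 1-based inclusive span [start, end] inside [roi_start, roi_end]."""
    overlap = min(end, roi_end) - max(start, roi_start) + 1
    return max(0, overlap) / (end - start + 1)

# (start, end) of the five views from the question.
VIEWS = [(1, 10), (14, 23), (19, 28), (20, 29), (21, 30)]

for start, end in VIEWS:
    frac = overlap_fraction(start, end, 14, 27)
    print(start, end, frac, frac >= 0.6)
```

Views 2 to 5 pass the 60% test and view 1 does not. The same arithmetic translates to R with pmin and pmax on the start and end vectors.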


How to remove kth element in O(1) time complexity

Given a string, I need to repeatedly remove the smallest character and return the sum of the indices of the removed characters.
Suppose the string is 'abcab'. I need to remove the first a, at index 1.
We are left with 'bcab'. Now remove a again, which is the smallest in the remaining string and is at index 3.
We are left with 'bcb'.
In the same way remove b at index 1, then remove b again from 'cb' at index 2, and finally remove c.
The total of all indices is 1+3+1+2+1=8
The question is simple but we need to do it in O(n). For that I need to remove the kth element in O(1). In Python, del list[index] has time complexity O(n).
How can I delete in constant time using Python?
Edit
This is the exact question
You are given a string S of size N. Assume that count is equal to 0.
Your task is the remove all the N elements of string S by performing the following operation N times
• In a single operation, select an alphabetically smallest character in S, for example, c. Remove c from S and add its index to count. If multiple characters such as c exist, then select the one that has the smallest index.
Print the value of count.
Note Consider 1-based indexing
Solve the problem for T test cases
Input format
• The first line of the input contains an integer T denoting the number of test cases
• The first line of each test case contains an integer N denoting the size of string S
• The second line of each test case contains a string S
Output format
For each test case print a single line containing one integer denoting the value of count
1<T, N < 10^5
• S contains only lowercase English alphabets
Sum of N over all test cases does not exceed 10
Sample input 1
5
abcab
Sample Output1
8
Explanation
The operations occur in the following order
Current string S = 'abcab'. The alphabetically smallest character of S is 'a'. As there are 2 occurrences of a, we choose the first occurrence. Its index 1 will be added to count and a will be removed. Therefore, S becomes bcab.
a will be removed from S (bcab) and 3 will be added to count.
The first occurrence of b will be removed from S (bcb) and 1 will be added to count.
b will be removed from S (cb) and 2 will be added to count.
c will be removed from S (c) and 1 will be added to count.
If you follow your procedure of repeatedly removing the first occurrence of the smallest character, then each character's index -- when you remove it -- is the number of preceding larger characters in the original string plus one.
So what you really need to do is find, for each character, the number of preceding larger characters, and then add up all those counts.
There are only 26 characters, so you can do this as you go with 26 counters.
Please link to the original problem statement, or copy/paste exactly what it says, without trying to explain it. As is, what you're asking for is impossible.
Forget deleting: if what you're asking for was possible, sorting would be worst-case O(n) (remove the minimum remaining n times, at O(1) cost for each), but it's well known that comparison-based sorting cannot do better than worst-case O(n log n).
One bet: the original problem statement doesn't require that you delete anything - but instead that you return the result as if you had deleted.
With one pass over the input
Putting together various ideas, the final index of a character is one more than the number of larger characters seen before it. So it's possible to do this in one left-to-right pass over the input, using O(1) storage and O(n) time, while deleting nothing:
def crunch(s):
    neq = [0] * 26
    result = 0
    orda = ord('a')
    for ch in map(ord, s):
        ch -= orda
        result += sum(neq[i] for i in range(ch + 1, 26)) + 1
        neq[ch] += 1
    return result
For your original:
>>> crunch('abcab')
8
But it's also possible to process arbitrary iterables one character at a time:
>>> from itertools import repeat, chain
>>> crunch(chain(repeat('y', 1000000), 'xz'))
2000002
x is originally at (1-based) index 1000001, which accounts for half the result. Then each of a million 'y's is conceptually deleted, each at index 1. Finally 'z' is at index 1, for a grand total of 2000002.
Looks like you're only interested in the resulting sum of indices and don't need to simulate this algorithm step by step.
In which case you could compute the result in the following way:
For each letter from a to z:
1. Have a counter of already removed letters set to 0.
2. Iterate over the string and if you encounter the current letter, add current_index - already_removed_counter to the result.
2a. If you encounter the current or an earlier (smaller) letter, increase the counter, as that letter has already been removed.
The time complexity is 26 * O(n), which is O(n).
Since there are only 26 distinct characters in the string, we can take each character separately and linearly traverse the string to find all its occurrences. Keep a counter of how many characters were found. Each time an occurrence of a given character is found, display its index decreased by the counter. Before switching to a new character, remove all the occurrences of the previous one - this can be done in linear time.
res = 0
for c in 'a' .. 'z'
    cnt = 0
    for idx = 1 .. len(s)
        if s[idx] = c
            print idx - cnt
            res += idx - cnt
            cnt++
    removeAll(s, c)
return res
where
removeAll(s,c):
    i = 1
    cnt = 0
    n = len(s)
    while (i <= n) // indices are 1-based, so position n must be processed too
        if s[i + cnt] = c
            cnt++
            n--
        else
            s[i] = s[i + cnt]
            i++
    len(s) = n
It prints the elements of the sum to better illustrate what's going on.
Edit:
An updated version based on Igor's answer, that does not require actually removing elements. The complexity is the same i.e. O(n).
res = 0
for c in 'a' .. 'z'
    cnt = 0
    for idx = 1 .. len(s)
        if s[idx] <= c
            if s[idx] = c
                print idx - cnt
                res += idx - cnt
            cnt++
return res
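To sanity-check the counting idea against the literal procedure, here is a small Python sketch. The function names are made up for illustration: simulate performs the deletions exactly as the problem describes (quadratic, only for testing), and count_without_deleting follows the delete-free pseudocode above.

```python
import string

def simulate(s):
    """Literally remove the smallest character n times; O(n^2), for checking only."""
    chars = list(s)
    total = 0
    while chars:
        i = chars.index(min(chars))  # first occurrence of the smallest character
        total += i + 1               # 1-based index
        del chars[i]
    return total

def count_without_deleting(s):
    """One pass per letter; cnt counts characters <= c seen so far."""
    total = 0
    for c in string.ascii_lowercase:
        cnt = 0
        for idx, ch in enumerate(s, start=1):
            if ch <= c:
                if ch == c:
                    total += idx - cnt
                cnt += 1
    return total

print(simulate("abcab"), count_without_deleting("abcab"))  # 8 8
```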

Efficiently counting the number of substrings of a digit string that are divisible by k?

We are given a string which consists of digits 0-9. We have to count number of sub-strings divisible by a number k. One way is to generate all the sub-strings and check if it is divisible by k but this will take O(n^2) time. I want to solve this problem in O(n*k) time.
1 <= n <= 100000 and 2 <= k <= 1000.
I saw a similar question here. But k was fixed as 4 in that question. So, I used the property of divisibility by 4 to solve the problem.
Here is my solution to that problem:
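#include <iostream>
#include <string>
using namespace std;

int main()
{
    string s;
    long long int cnt = 0;
    cin >> s;
    // A single digit divisible by 4 is a valid substring on its own.
    for (size_t i = 0; i < s.size(); i++) {
        if ((s[i] - '0') % 4 == 0) {
            cnt++;
        }
    }
    // A number is divisible by 4 when its last two digits are, so every
    // substring of length >= 2 ending at i qualifies; there are i of them.
    for (size_t i = 1; i < s.size(); i++) {
        int f = s[i-1] - '0';
        int s1 = s[i] - '0';
        if ((10*f + s1) % 4 == 0) {
            cnt = cnt + (long long)(i);
        }
    }
    cout << cnt;
}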
But I wanted a general algorithm for any value of k.
This is a really interesting problem. Rather than jumping into the final overall algorithm, I thought I'd start with a reasonable algorithm that doesn't quite cut it, then make a series of modifications to it to end up with the final, O(nk)-time algorithm.
This approach combines together a number of different techniques. The major technique is the idea of computing a rolling remainder over the digits. For example, let's suppose we want to find all prefixes of the string that are multiples of k. We could do this by listing off all the prefixes and checking whether each one is a multiple of k, but that would take time Θ(n^2), since there are n prefixes and reducing each one from scratch means reading up to n digits. However, we can do this in time Θ(n) by being a bit more clever. Suppose we know that we've read the first h characters of the string and we know the remainder of the number formed that way. We can use this to say something about the remainder of the first h+1 characters of the string as well, since by appending that digit we're taking the existing number, multiplying it by ten, and then adding in the next digit. This means that if we had a remainder of r, then our new remainder is (10r + d) mod k, where d is the digit that we uncovered.
Here's quick pseudocode to count up the number of prefixes of a string that are multiples of k. It runs in time Θ(n):
remainder = 0
numMultiples = 0
for i = 1 to n: // n is the length of the string
    remainder = (10 * remainder + str[i]) % k
    if remainder == 0
        numMultiples++
return numMultiples
We're going to use this initial approach as a building block for the overall algorithm.
So right now we have an algorithm that can find the number of prefixes of our string that are multiples of k. How might we convert this into an algorithm that finds the number of substrings that are multiples of k? Let's start with an approach that doesn't quite work. What if we count all the prefixes of the original string that are multiples of k, then drop off the first character of the string and count the prefixes of what's left, then drop off the second character and count the prefixes of what's left, etc? This will eventually find every substring, since each substring of the original string is a prefix of some suffix of the string.
Here's some rough pseudocode:
numMultiples = 0
for i = 1 to n:
    remainder = 0
    for j = i to n:
        remainder = (10 * remainder + str[j]) % k
        if remainder == 0
            numMultiples++
return numMultiples
For example, running this approach on the string 14917 looking for multiples of 7 will turn up these strings:
String 14917: Finds 14, 1491, 14917
String 4917: Finds 49
String 917: Finds 91, 917
String 17: Finds nothing
String 7: Finds 7
The good news about this approach is that it will find all the substrings that work. The bad news is that it runs in time Θ(n^2).
But let's take a look at the strings we're seeing in this example. Look, for example, at the substrings found by searching for prefixes of the entire string. We found three of them: 14, 1491, and 14917. Now, look at the "differences" between those strings:
The difference between 14 and 14917 is 917.
The difference between 14 and 1491 is 91.
The difference between 1491 and 14917 is 7.
Notice that the difference of each of these strings is itself a substring of 14917 that's a multiple of 7, and indeed if you look at the other strings that we've matched later on in the run of the algorithm we'll find these other strings as well.
This isn't a coincidence. If you have two numbers with a common prefix that are multiples of the same number k, then the "difference" between them will also be a multiple of k. (It's a good exercise to check the math on this.)
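For reference, the check takes one line. Write the shorter prefix as the number A and the longer prefix as A followed by an m-digit block B (the symbols A, B and m are introduced here only for illustration):

```latex
A \cdot 10^{m} + B \equiv 0 \pmod{k}
\quad\text{and}\quad
A \equiv 0 \pmod{k}
\;\Longrightarrow\;
B \equiv -A \cdot 10^{m} \equiv 0 \pmod{k}.
```

With A = 14 and the longer prefix 14917, we get m = 3 and B = 917 = 7 · 131.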
So this suggests another route we can take. Suppose that we find all prefixes of the original string that are multiples of k. If we can find all of them, we can then figure out how many pairwise differences there are among those prefixes and potentially avoid rescanning things multiple times. This won't find everything, necessarily, but it will find all substrings that can be formed by computing the difference of two prefixes. Repeating this over all suffixes - and being careful not to double-count things - could really speed things up.
First, let's imagine that we find r different prefixes of the string that are multiples of k. How many total substrings did we just find if we include differences? Well, we've found r strings, plus one extra string for each (unordered) pair of elements, which works out to r + r(r-1)/2 = r(r+1)/2 total substrings discovered. We still need to make sure we don't double-count things, though.
To see whether we're double-counting something, we can use the following technique. As we compute the rolling remainders along the string, we'll store the remainders we find after each entry. If in the course of computing a rolling remainder we rediscover a remainder we've already computed at some point, we know that the work we're doing is redundant; some previous scan over the string will have already computed this remainder and anything we've discovered from this point forward will have already been found.
Putting these ideas together gives us this pseudocode:
numMultiples = 0
seenRemainders = array of n sets, all initially empty
for i = 1 to n:
    remainder = 0
    prefixesFound = 0
    for j = i to n:
        remainder = (10 * remainder + str[j]) % k
        if seenRemainders[j] contains remainder:
            break
        add remainder to seenRemainders[j]
        if remainder == 0
            prefixesFound++
    numMultiples += prefixesFound * (prefixesFound + 1) / 2
return numMultiples
So how efficient is this? At first glance, this looks like it runs in time O(n^2) because of the nested loops, but that's not a tight bound. Notice that each element can only be passed over in the inner loop at most k times, since after that there aren't any remainders that are still free. Therefore, since each element is visited at most O(k) times and there are n total elements, the runtime is O(nk), which meets your runtime requirements.
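One caveat worth testing before relying on the early exit: when k shares a factor with 10, a scan can stop on an already-seen remainder even though its substring was never counted. Tracing the pseudocode by hand on the string 24 with k = 4 gives 1, while both 4 and 24 qualify. As a cross-check, here is a sketch of a different, standard technique that also runs in O(nk): a remainder-count table, where counts[r] holds how many substrings ending at the current digit leave remainder r. The function names are illustrative, and the brute-force version is only there for comparison.

```python
def count_divisible_substrings(digits, k):
    """Count substrings of a digit string whose value is divisible by k, in O(n*k)."""
    counts = [0] * k  # counts[r]: substrings ending at the previous digit with remainder r
    total = 0
    for ch in digits:
        d = int(ch)
        new_counts = [0] * k
        for r, c in enumerate(counts):
            if c:
                # Appending d turns remainder r into (10*r + d) % k.
                new_counts[(10 * r + d) % k] += c
        new_counts[d % k] += 1  # the one-digit substring starting here
        total += new_counts[0]
        counts = new_counts
    return total

def brute_force(digits, k):
    """Quadratic reference: test every substring directly."""
    n = len(digits)
    return sum(int(digits[i:j]) % k == 0
               for i in range(n) for j in range(i + 1, n + 1))

print(count_divisible_substrings("14917", 7))  # 7
print(count_divisible_substrings("24", 4))     # 2
```

Like the question's own divisible-by-4 code, this counts every substring by position, so repeated values and substrings with leading zeros are counted separately.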

How to compute word scores in Scrabble using MATLAB

I have a homework program I have run into a problem with. We basically have to take a word (such as MATLAB) and have the function give us the correct score value for it using the rules of Scrabble. There are other things involved such as double word and double point values, but what I'm struggling with is converting to ASCII. I need to get my string into ASCII form and then sum up those values. We only know the bare basics of strings and our teacher is pretty useless. I've tried converting the string into numbers, but that's not exactly working out. Any suggestions?
function [score] = scrabble(word, letterPoints)
    doubleword = '#';
    doubleletter = '!';
    doublew = [findstr(word, doubleword)]
    trouble = [findstr(word, doubleletter)]
    word = char(word)
    gameplay = word;
    ASCII = double(gameplay)
    score = lower(sum(ASCII));
Building on Francis's post, what I would recommend you do is create a lookup array. You can certainly convert each character into its ASCII equivalent, but then what I would do is have an array where the input is the ASCII code of the character you want (with a bit of modification), and the output will be the point value of the character. Once you find this, you can sum over the points to get your final point score.
I'm going to leave out double points, double letters, blank tiles and that whole gamut of fun stuff in Scrabble for now in order to get what you want working. By consulting Wikipedia, this is the point distribution for each letter encountered in Scrabble.
1 point: A, E, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z
What we're going to do is convert your word into lower case to ensure consistency. Now, if you take a look at the letter a, this corresponds to ASCII code 97. You can verify that by using the double function we talked about earlier:
>> double('a')
97
As there are 26 letters in the alphabet, this means that going from a to z should go from 97 to 122. Because MATLAB starts indexing arrays at 1, what we can do is subtract each of our characters by 96 so that we'll be able to figure out the numerical position of these characters from 1 to 26.
Let's start by building our lookup table. First, I'm going to define a whole bunch of strings. Each string denotes the letters that are associated with each point in Scrabble:
string1point = 'aeionrtlsu';
string2point = 'dg';
string3point = 'bcmp';
string4point = 'fhvwy';
string5point = 'k';
string8point = 'jx';
string10point = 'qz';
Now, we can use each of the strings, convert to double, subtract by 96 then assign each of the corresponding locations to the points for each letter. Let's create our lookup table like so:
lookup = zeros(1,26);
lookup(double(string1point) - 96) = 1;
lookup(double(string2point) - 96) = 2;
lookup(double(string3point) - 96) = 3;
lookup(double(string4point) - 96) = 4;
lookup(double(string5point) - 96) = 5;
lookup(double(string8point) - 96) = 8;
lookup(double(string10point) - 96) = 10;
I first create an array of length 26 through the zeros function. I then figure out where each letter goes and assign to each letter their point values.
Now, the last thing you need to do is take a string, take the lower case to be sure, then convert each character into its ASCII equivalent, subtract by 96, then sum up the values. If we are given... say... MATLAB:
stringToConvert = 'MATLAB';
stringToConvert = lower(stringToConvert);
ASCII = double(stringToConvert) - 96;
value = sum(lookup(ASCII));
Lo and behold... we get:
value =
10
The last line of the above code is crucial. Basically, ASCII will contain a bunch of indexing locations where each number corresponds to the numerical position of where the letter occurs in the alphabet. We use these positions to look up what point / score each letter gives us, and we sum over all of these values.
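The same lookup-table idea, sketched in Python for comparison. A dictionary replaces the shifted ASCII index; LETTER_GROUPS, POINTS and scrabble_score are names made up here, and the letter values are the ones listed above:

```python
# Letters grouped by point value, as in the list above.
LETTER_GROUPS = {
    1: "aeionrtlsu", 2: "dg", 3: "bcmp", 4: "fhvwy",
    5: "k", 8: "jx", 10: "qz",
}

# Invert the groups into a per-letter lookup table.
POINTS = {letter: value
          for value, letters in LETTER_GROUPS.items()
          for letter in letters}

def scrabble_score(word):
    """Sum the face value of each letter; ignores double letter/word squares."""
    return sum(POINTS[ch] for ch in word.lower())

print(scrabble_score("MATLAB"))  # 10
```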
Part #2
The next part where double point values and double words come to play can be found in my other StackOverflow post here:
Calculate Scrabble word scores for double letters and double words MATLAB
Convert from string to ASCII:
>> myString = 'hello, world';
>> ASCII = double(myString)
ASCII =
104 101 108 108 111 44 32 119 111 114 108 100
Sum up the values:
>> total = sum(ASCII)
total =
1160
The MATLAB help for char() says (emphasis added):
S = char(X) converts array X of nonnegative integer codes into a character array. Valid codes range from 0 to 65535, where codes 0 through 127 correspond to 7-bit ASCII characters. The characters that MATLAB® can process (other than 7-bit ASCII characters) depend upon your current locale setting. To convert characters into a numeric array, use the double function.
ASCII chart here.

Mapping unique combinations to numbers

I am trying to come up with a solution to a problem I thought of. I have the number of permutations of 26 characters with 6 possible spots as 26^6 = 308 915 776. I was trying to make a way so that I could map each number to a unique combination and be able to go back and forth from combination to number.
An example:
1 = aaaaaa
2 = aaaaab
27 = aaaaba
Is it possible to write a polynomial time algorithm that would convert between the two and/or are there any efficient examples of what I am trying to do.
This is just base conversion my friend.
Since you didn't specify a language, the following is pseudo-code with array indexing and string indexing starting at 0 and assignment is :=.
If you let 'a' be 0, and 'z' be 25, then to convert from base 26 to base 10:
total:= 0
loop index from 0 to 5
    temp:= input[index] - 'a' // Left to right. Single base 26 digit to base 10
    total:= 26 * total + temp // Shift left and add the converted digit
    increment index and goto loop start
To go back to letters (base 26) is also easy:
result:= ''
loop index from 0 to 5
    temp:= 'a' + input mod 26 // Input modulus 26 is the base 26 digit to add next
    result:= temp + result // Append current result to the new base 26 digit
    input:= input div 26 // Divide input by 26, throw away the remainder
    increment index and goto loop start
If you want all a's to be 1, then add one after converting from base 26 to base 10 and subtract 1 before converting from base 10 to base 26. Personally, I'd let all a's be 0.
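A runnable sketch of both directions in Python, using the question's convention that aaaaaa maps to 1 (so 1 is added after encoding and subtracted before decoding). The names to_number and to_letters and the fixed width of 6 are illustrative assumptions:

```python
WIDTH = 6  # six letter positions, as in the question

def to_number(word):
    """Read the word as a base-26 numeral ('a' = 0 ... 'z' = 25), then add 1."""
    total = 0
    for ch in word:
        total = 26 * total + (ord(ch) - ord('a'))
    return total + 1

def to_letters(number):
    """Inverse of to_number: peel off base-26 digits from the right."""
    number -= 1
    letters = []
    for _ in range(WIDTH):
        number, digit = divmod(number, 26)
        letters.append(chr(ord('a') + digit))
    return ''.join(reversed(letters))

print(to_number("aaaaab"), to_letters(27))  # 2 aaaaba
```

Both directions do a fixed amount of work per letter, so polynomial time is not a concern here.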
You could map it via pointers into a double:
char *example = "abcdef";
double d = 0;
char *p = (char *)&d;
for (int i=0; i<6; i++)
    p[i] = example[i];
// d is your code
It's not so beautiful and not 100% allowed, but it works.

How compiler is converting integer to string and vice versa

Many languages have functions for converting string to integer and vice versa. So what happens there? What algorithm is being executed during conversion?
I don't ask in specific language because I think it should be similar in all of them.
To convert a string to an integer, take each character in turn and if it's in the range '0' through '9', convert it to its decimal equivalent. Usually that's simply subtracting the character value of '0'. Now multiply any previous results by 10 and add the new value. Repeat until there are no digits left. If there was a leading '-' minus sign, invert the result.
To convert an integer to a string, start by inverting the number if it is negative. Divide the integer by 10 and save the remainder. Convert the remainder to a character by adding the character value of '0'. Push this to the beginning of the string; now repeat with the value that you obtained from the division. Repeat until the divided value is zero. Put out a leading '-' minus sign if the number started out negative.
Here are concrete implementations in Python, which in my opinion is the language closest to pseudo-code.
def string_to_int(s):
    i = 0
    sign = 1
    if s[0] == '-':
        sign = -1
        s = s[1:]
    for c in s:
        if not ('0' <= c <= '9'):
            raise ValueError
        i = 10 * i + ord(c) - ord('0')
    return sign * i

def int_to_string(i):
    s = ''
    sign = ''
    if i < 0:
        sign = '-'
        i = -i
    while True:
        remainder = i % 10
        i = i // 10  # integer division; plain / yields a float in Python 3
        s = chr(ord('0') + remainder) + s
        if i == 0:
            break
    return sign + s
I wouldn't call it an algorithm per se, but depending on the language it will involve the conversion of characters into their integral equivalent. Many languages will either stop on the first character that cannot be represented as an integer (e.g. the letter a), will blindly convert all characters into their ASCII value (e.g. the letter a becomes 97), or will ignore characters that cannot be represented as integers and only convert the ones that can - or return 0 / empty. You have to get more specific on the framework/language to provide more information.
String to integer:
Many (most) languages represent strings, on some level or another, as an array (or list) of characters, which are also short integers. Map the ones corresponding to digit characters to their numeric value. For example, '0' in ASCII is represented by 48. So you map 48 to 0, 49 to 1, and so on up to 57 for 9.
Starting from the left, you multiply your current total by 10, add the next character's value, and move on. (You can make a larger or smaller map, change the number you multiply by at each step, and convert strings of any base you like.)
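The parenthetical about other bases can be sketched like this in Python. The function name parse_in_base and the digit alphabet DIGITS are illustrative assumptions; Python's built-in int(text, base) does the same job and is used here only as a cross-check:

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def parse_in_base(text, base=10):
    """Multiply-and-add parsing for bases 2 through 36."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    sign = 1
    if text.startswith('-'):
        sign, text = -1, text[1:]
    if not text:
        raise ValueError("no digits to parse")
    total = 0
    for ch in text.lower():
        value = DIGITS.find(ch)  # map the character to its digit value
        if value < 0 or value >= base:
            raise ValueError("invalid digit %r for base %d" % (ch, base))
        total = base * total + value
    return sign * total

print(parse_in_base("-1f", 16), int("-1f", 16))  # -31 -31
```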
Integer to string is a longer process involving base conversion to 10. I suppose that since most integers have limited bits (32 or 64, usually), you know that it will come to a certain number of characters at most in a string (20?). So you can set up your own adder and iterate through each place for each bit after calculating its value (2^place).
