finding the shortest subsegment - string

I want to know which algorithm I should apply.
A sentence and a list of words are given. We have to find the first shortest subsegment of the sentence that contains all the words in the list.
E.g.:
Sentence - this is the best problem i have ever solved
List of words -
is
best
this
The answer should be:
this is the best
If there are many such sub segments then we have to print the one that contains the smallest number of words and appears first in sentence.

Here is my approach to solve the above problem.
1. Take two pointers, head and tail, both pointing to position 0.
Move the head pointer until the word it points to is a valid keyword, and mark that position as head.
2. Move the tail pointer until the segment from head to tail contains all the given keywords at least once, and mark that position as tail.
This is the first valid subsegment containing all the keywords; calculate its length.
3. Check the frequency of the word at head. If it is greater than 1, move the head pointer forward to the next word that is a valid keyword and whose frequency in the segment is 1.
4. Check whether all keywords are still present. If yes, calculate the length and store the segment as the minimum subsegment.
5. If the segment does not contain all valid keywords, move the tail pointer until all keywords are found and calculate the length as (tail-head+1); if it is greater than the current minimum, ignore it.
6. Continue this process until the last keyword of the given sentence.
The complexity of the above approach is O(n).
For example, let us take this sentence:
Hi this is a funny world this is a good experience with this world
and I need to find 3 keywords:
this
is
world
First, consider two hash tables, required and obtained, and store all the required keywords in the required table.
Start with head and tail at 0. "Hi" is not a valid keyword, so move head.
The next word, "this", is a valid keyword, so set its count to 1 and store its position as head. Now head is 1.
Now move the tail pointer. The next word is "is", which is a valid keyword, so increment the count.
"a" and "funny" are not valid keywords, so move tail on to "world".
"world" is a valid keyword, the count is now 3, and tail is 5. Whenever count == number of required keywords (3 in our case), the segment contains all the valid keywords.
Its length is (5-1+1)=5.
Now check the frequency of the word at head. It is 1, so if we move the head pointer we won't have a valid segment any more.
So move the tail pointer to the next word, "this", and update the frequency of "this" from 1 to 2; the counter becomes 4.
Now we can move the head pointer. Move it to the next keyword, "is", and update the counter to 3, because the segment no longer contains the "this" that head was pointing to.
The count is 3 again, so calculate the length again; it is (6-2+1)=5.
The frequency of the head keyword "is" is 1, so move the tail pointer to the next keyword, which is "is". Now the frequency of "is" is more than 1, so move the head pointer until we reach a valid keyword with frequency 1. That keyword is "world", the head position is 5 and the tail position is 7.
The counter is 3, so calculate the length as 7-5+1, which is 3. This is the minimum length found so far.
Now move tail until the frequency of the keyword at head is more than 1; tail finally becomes 13.
Move head from 5 to 6 and calculate the length: 13-6+1 is 8, so ignore it.
We cannot move tail any further, so print the words from min_head to min_tail as the final result.
In our case the answer is
world this is
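
The two-pointer procedure walked through above can be sketched roughly as follows. This is a minimal sketch rather than the poster's exact bookkeeping: it assumes words are separated by whitespace and compared case-sensitively, and it tracks the number of keywords still missing instead of a running counter. The function name shortest_subsegment is made up for illustration.

```python
from collections import Counter

def shortest_subsegment(sentence, keywords):
    """Return the first shortest run of words containing every keyword, or None."""
    words = sentence.split()
    required = set(keywords)
    obtained = Counter()       # keyword frequencies inside the current window
    missing = len(required)    # keywords not yet present in the window
    best = None                # (head, tail) of the best window so far
    head = 0
    for tail, word in enumerate(words):
        if word in required:
            obtained[word] += 1
            if obtained[word] == 1:
                missing -= 1
        # While the window is valid, record it and try to shrink it from the left.
        while missing == 0:
            # Strict '<' keeps the earliest window among equally short ones.
            if best is None or tail - head < best[1] - best[0]:
                best = (head, tail)
            left = words[head]
            if left in required:
                obtained[left] -= 1
                if obtained[left] == 0:
                    missing += 1
            head += 1
    if best is None:
        return None
    return " ".join(words[best[0]:best[1] + 1])
```

Each word is visited at most once by tail and at most once by head, which is where the O(n) bound comes from.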

Consider the following simple approach:
Build a dictionary mapping (enumeration) from each word in the sentence to its position, like this:
this[1] is[2] the[3] best[4] problem[5] i[6] have[7] ever[8] solved[9]
This assumes that all the words in the sentence are distinct.
Now take the listed words one at a time, look up the position of each, and keep track of the maximum and minimum position seen. In this case they are 4 and 1, respectively.
Return the part of the sentence within those limits.
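
A minimal sketch of this min/max idea, valid only under the stated assumption that every word in the sentence is distinct and that every keyword occurs in it. The function name span_by_positions is hypothetical, and positions here are 0-based rather than 1-based as in the enumeration above.

```python
def span_by_positions(sentence, keywords):
    words = sentence.split()
    # Map each word to its position; correct only if all words are distinct.
    position = {word: index for index, word in enumerate(words)}
    indices = [position[keyword] for keyword in keywords]
    # The answer spans from the smallest to the largest keyword position.
    return " ".join(words[min(indices):max(indices) + 1])
```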

Related

Why am I getting a change in list elements but not the subtracted value of that element, when using for loop and print(my_list[i-1])

I have recently started learning python and am currently on fundamentals so please accept my excuse if this question sounds silly. I am a little confused with the indexing behavior of the list while I was learning the bubble sort algorithm.
For example:
my_list = [8,10,6,2,4]
for i in range(len(my_list)):
    print(my_list[i])
for i in range(len(my_list)):
    print(i)
Result:
8
10
6
2
4
0
1
2
3
4
The first for loop printed the elements of the list (using indexing) while the second printed their positions, which is understandable. But when I experimented with adding -1, i.e. print(my_list[i-1]) and print(i-1) in the two for loops, I expected -1 to behave like a simple negative number and subtract 1 from the indexed element in the first for loop, i.e. 8-1=7.
Instead, it acts as a positional indicator into the list and gives the last element, 4.
I was expecting that kind of result only from the second loop. Can someone please explain why print(my_list[i-1]) changes which list element is selected instead of subtracting 1 from the list elements themselves, i.e. [8(-1), 10(-1), 6(-1)...
Thank you in advance.
The list index in the expression my_list[i-1] is the part between the brackets, i.e. i-1. So by subtracting in there, you are indeed modifying the index. If instead you want to modify the value in the list, that is, what the index is pointing at, you would use my_list[i] - 1. Now, the subtraction comes after the retrieval of the list value.
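
A short side-by-side of the two expressions on the list from the question may make the difference concrete. The variable names shifted and decremented are only for illustration.

```python
my_list = [8, 10, 6, 2, 4]

# Subtracting inside the brackets changes WHICH element is read.
# For i = 0 the index is -1, which Python treats as "last element".
shifted = [my_list[i - 1] for i in range(len(my_list))]

# Subtracting outside the brackets changes the VALUE that was read.
decremented = [my_list[i] - 1 for i in range(len(my_list))]

print(shifted)      # [4, 8, 10, 6, 2]
print(decremented)  # [7, 9, 5, 1, 3]
```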
When you run the first for loop:
my_list = [8,10,6,2,4]
for i in range(len(my_list)):
    print(my_list[i-1])
you are subtracting from the index, not from the integer stored at that index. To subtract from the value, write:
for i in range(len(my_list)):
    print(my_list[i]-1)
You got the last element of the list first because the loop starts with i = 0, so i-1 is -1, and my_list[-1] always returns the last element of the list.
Note: iterating over a list by index as above is not good practice when you only need the values. You can simply write:
for i in my_list:
    print(i-1)
The result is the same, and the code is more concise.

what will be the dp and transitions in this problem

Vasya has a string s of length n consisting only of digits 0 and 1. Also he has an array a of length n.
Vasya performs the following operation until the string becomes empty: choose some consecutive substring of equal characters, erase it from the string and glue together the remaining parts (any of them can be empty). For example, if he erases substring 111 from string 111110 he will get the string 110. Vasya gets a_x points for erasing a substring of length x.
Vasya wants to maximize his total points, so help him with this!
https://codeforces.com/problemset/problem/1107/E
I was trying to get my head around the editorial, but couldn't understand it. Can anyone explain an easy way to do it?
input:
7
1101001
3 4 9 100 1 2 3
output:
109
Explanation
the optimal sequence of erasings is: 1101001 → 111001 → 11101 → 1111 → ∅.
Here, we consider removing prefixes instead of substrings. Why?
We remove a consecutive prefix of a particular state, which is actually a substring of the main string. So our DP states are: start index, end index, prefix length.
Let's consider an example, str = "1010110". Initially start=0, end=7, and prefix=1 (the first '1' is the only prefix character for now). We iterate over all the indices in the current state except the starting index and check whether str[i]==str[start]. Here, for example, str[4]==str[0]. Now we divide the string into "010" with prefix=1 and "110" with prefix=2 (the '1' at index 0 joins the '1' at index 4). These are now two independent subproblems. When a string of length 1 remains, we return a_prefix.
Here is my code.
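
The answerer's code is not reproduced in this copy. What follows is only a sketch of the (start, end, prefix) states described above, not the original code. It assumes end is exclusive, that a is stored 0-based so a_x is a[x - 1], and that prefix counts the character at start itself. The names max_points and solve are made up for illustration.

```python
from functools import lru_cache

def max_points(s, a):
    @lru_cache(maxsize=None)
    def solve(start, end, prefix):
        # Best score for s[start:end], where s[start] is the last of
        # `prefix` equal characters that have been glued together.
        if start >= end:
            return 0
        # Option 1: erase the glued block of length `prefix` right now.
        best = a[prefix - 1] + solve(start + 1, end, 1)
        # Option 2: first clear everything between start and some later
        # equal character i, so that the block grows by one.
        for i in range(start + 1, end):
            if s[i] == s[start]:
                best = max(best, solve(start + 1, i, 1) + solve(i, end, prefix + 1))
        return best

    return solve(0, len(s), 1)

print(max_points("1101001", [3, 4, 9, 100, 1, 2, 3]))  # 109
```

There are O(n^3) states with O(n) transitions each, so this is O(n^4) overall; that is fine as an illustration for short strings, though plain Python may be slow near the problem's upper limits.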

Binary search - worst/avg case

I'm finding it difficult to understand why/how the worst and average case for searching for a key in an array/list using binary search is O(log(n)).
log(1,000,000) is only 6. log(1,000,000,000) is only 9 - I get that, but I don't understand the explanation. If one did not test it, how do we know that the avg/worst case is actually log(n)?
I hope you guys understand what I'm trying to say. If not, please let me know and I'll try to explain it differently.
Worst case
Every time the binary search code makes a decision, it eliminates half of the remaining elements from consideration. So you're dividing the number of elements by 2 with each decision.
How many times can you divide by 2 before you are down to only a single element? If n is the starting number of elements and x is the number of times you divide by 2, we can write this as:
n / (2 * 2 * 2 * ... * 2) = 1 [the '2' is repeated x times]
or, equivalently,
n / 2^x = 1
or, equivalently,
n = 2^x
So log base 2 of n gives you x, which is the number of decisions being made.
Finally, you might ask, if I used log base 2, why is it also OK to write it as log base 10, as you have done? The base does not matter because the difference is only a constant factor which is "ignored" by Big O notation.
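
To see the "how many times can you divide by 2" argument numerically, here is a small sketch that simply counts the halvings. The helper name halvings is hypothetical; integer division stands in for "keep one half".

```python
def halvings(n):
    """Count how many times n can be halved before only one element is left."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps

print(halvings(8))              # 3, since 8 = 2^3
print(halvings(1_000_000))      # 19
print(halvings(1_000_000_000))  # 29
```

The counts 19 and 29 are about 3.3 times the base-10 values 6 and 9 from the question, which is the constant factor between log base 2 and log base 10.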
Average case
I see that you also asked about the average case. Consider:
There is only one element in the array that can be found on the first try.
There are only two elements that can be found on the second try. (Because after the first try, we chose either the right half or the left half.)
There are only four elements that can be found on the third try.
You can see the pattern: 1, 2, 4, 8, ... , n/2. To express the same pattern going in the other direction:
Half the elements take the maximum number of decisions to find.
A quarter of the elements take one fewer decision to find.
etc.
Since half of the elements take the maximum amount of time, it doesn't matter how much less time the other elements take. We could assume that all elements take the maximum amount of time, and even if half of them actually take 0 time, our assumption would not be more than double whatever the true average is. We can ignore "double" since it is a constant factor. So the average case is the same as the worst case, as far as Big O notation is concerned.
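
The claim that the average is within a constant factor of the worst case can be checked empirically. The sketch below is an illustration, not a proof: it counts the probes needed to find every element of a sorted list of 1023 items, a size chosen so that the halving is perfectly even. The function name probes is made up.

```python
def probes(sorted_list, key):
    """Return how many middle elements are examined while searching for key."""
    low, high, count = 0, len(sorted_list) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        count += 1
        if sorted_list[mid] == key:
            return count
        if sorted_list[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return count

data = list(range(1023))                  # 2^10 - 1 elements
counts = [probes(data, x) for x in data]  # one successful search per element
worst = max(counts)
average = sum(counts) / len(counts)
print(worst, round(average, 2))           # 10 9.01
```

About half of the elements (512 of 1023) need the full 10 probes, so the average stays close to the worst case.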
For binary search, the array should be arranged in ascending or descending order.
In each step, the algorithm compares the search key value with the key value of the middle element of the array.
If the keys match, then a matching element has been found and its index, or position, is returned.
Otherwise, if the search key is less than the middle element's key, then the algorithm repeats its action on the sub-array to the left of the middle element.
Or, if the search key is greater,then the algorithm repeats its action on the sub-array to the right.
If the remaining array to be searched is empty, then the key cannot be found in the array and a special "not found" indication is returned.
So, a binary search is a dichotomic divide and conquer search algorithm. Thereby it takes logarithmic time for performing the search operation as the elements are reduced by half in each of the iteration.
For sorted lists, on which we can do a binary search, each "decision" compares your key to the middle element. If the key is greater, the search continues in the right half of the list; if it is less, it continues in the left half; if it matches, the element at that position is returned. You effectively cut the list in half with every decision, which yields O(log n).
Binary search, however, only works for sorted lists. For unsorted lists you can do a linear search starting from the first element, which has a complexity of O(n).
O(log n) < O(n)
That said, which approach is best depends on how many searches you will be doing, on your inputs, and so on.
For Binary search the prerequisite is a sorted array as input.
• As the list is sorted:
• Certainly we don't have to check every word in the dictionary to look up a word.
• A basic strategy is to repeatedly halve our search range until we find the value.
• For example, look for 5 in the list of 9 numbers below: v = 1 1 3 5 8 10 18 33 42
• We would first start in the middle: 8
• Since 5<8, we know we can look at just the first half: 1 1 3 5
• Looking at the middle # again, narrow down to 3 5
• Then we stop when we're down to one #: 5
How many comparisons are needed: at most 4 = floor(log2(9)) + 1, which is O(log2(n)).
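#include <vector>

// Returns the index of val in the sorted vector v, or -1 if it is absent.
int binary_search(const std::vector<int>& v, int val) {
    int from = 0;
    int to = static_cast<int>(v.size()) - 1;
    while (from <= to) {
        int mid = from + (to - from) / 2;  // avoids overflow of from + to
        if (val == v[mid])
            return mid;
        else if (val > v[mid])
            from = mid + 1;  // continue in the right half
        else
            to = mid - 1;    // continue in the left half
    }
    return -1;
}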

Deterministic automata to find number of subsequence in string of another string

Deterministic automata to find the number of subsequences in a string?
How can I construct a DFA to find the number of occurrences of a string as a subsequence in another string?
E.g. in "ssstttrrriiinnngggg" we have 3 subsequences which form the string "string".
Also, both the string to be found and the string to be searched contain only characters from a specific character set.
I have some idea about storing characters in a stack and popping them accordingly until we match, and pushing again if they don't match.
Please tell me the DFA solution.
OVERLAPPING MATCHES
If you wish to count the number of overlapping sequences then you simply construct a DFA that matches the string, e.g.
1 -(if see s)-> 2 -(if see t)-> 3 -(if see r)-> 4 -(if see i)-> 5 -(if see n)-> 6 -(if see g)-> 7
and then compute the number of ways of being in each state after seeing each character using dynamic programming. See the answers to this question for more details.
DP[a][b] = number of ways of being in state b after seeing the first a characters
= DP[a-1][b] + DP[a-1][b-1] if character at position a is the one needed to take state b-1 to b
= DP[a-1][b] otherwise
Start with DP[0][b]=0 for b>1 and DP[0][1]=1.
Then the total number of overlapping strings is DP[len(string)][7]
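
The recurrence above can be sketched with a single rolling array in place of the full DP table. This is a minimal sketch with one change of convention: states are numbered 0..len(pattern) here rather than 1..7, so state b means "b pattern characters matched". The function name count_subsequences is hypothetical.

```python
def count_subsequences(text, pattern):
    m = len(pattern)
    ways = [0] * (m + 1)  # ways[b] = number of ways of being in state b
    ways[0] = 1           # the start state is reachable in exactly one way
    for ch in text:
        # Go from high states to low so each character is used once per path.
        for b in range(m, 0, -1):
            if pattern[b - 1] == ch:
                ways[b] += ways[b - 1]
    return ways[m]

print(count_subsequences("ssstttrrriiinnngggg", "string"))  # 972
```

For the example string the overlapping count is 3 * 3 * 3 * 3 * 3 * 4 = 972, one independent choice per letter.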
NON-OVERLAPPING MATCHES
If you are counting the number of non-overlapping sequences, then if we assume that the characters in the pattern to be matched are distinct, we can use a slight modification:
DP[a][b] = number of strings being in state b after seeing the first a characters
= DP[a-1][b] + 1 if character at position a is the one needed to take state b-1 to b and DP[a-1][b-1]>0
= DP[a-1][b] - 1 if character at position a is the one needed to take state b to b+1 and DP[a-1][b]>0
= DP[a-1][b] otherwise
Start with DP[0][b]=0 for b>1 and DP[0][1]=infinity.
Then the total number of non-overlapping strings is DP[len(string)][7]
This approach will not necessarily give the correct answer if the pattern to be matched contains repeated characters (e.g. 'strings').
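
A sketch of the non-overlapping variant, under the same assumption that all pattern characters are distinct. Instead of setting the start state to infinity, this version treats the first pattern character as always able to open a new partial match, which has the same effect. States are numbered from 0, and the function name count_disjoint is made up for illustration.

```python
def count_disjoint(text, pattern):
    m = len(pattern)
    # Position of each pattern character; only meaningful if all are distinct.
    position = {ch: i for i, ch in enumerate(pattern)}
    in_state = [0] * (m + 1)  # in_state[b] = partial matches with b chars matched
    for ch in text:
        b = position.get(ch)
        if b is None:
            continue               # character does not occur in the pattern
        if b == 0:
            in_state[1] += 1       # the start state is an unlimited supply
        elif in_state[b] > 0:
            in_state[b] -= 1       # one partial match leaves state b
            in_state[b + 1] += 1   # and advances to state b + 1
    return in_state[m]

print(count_disjoint("ssstttrrriiinnngggg", "string"))  # 3
```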

Understanding the Knuth Morris Pratt(KMP) Failure Function

I've been reading the Wikipedia article about the Knuth-Morris-Pratt algorithm and I'm confused about how the values are found in the jump/partial match table.
i | 0 1 2 3 4 5 6
W[i] | A B C D A B D
T[i] | -1 0 0 0 0 1 2
Could someone explain the shortcut rule more clearly? The sentence
"let us say that we discovered a proper suffix which is a proper prefix and ending at W[2] with length 2 (the maximum possible)"
is confusing. If the proper suffix ends at W[2] wouldn't it be size of 3?
Also I'm wondering why T[4] isn't 1 when there is a prefix and suffix of size 1: The A.
Thanks for any help that can be offered.
Notice that the failure function T[i] does not use i as an index, but rather as a length. Therefore, T[2] represents the length of the longest proper border (a string that is both a prefix and suffix) of the string formed from the first two characters of W, rather than the longest proper border formed by the string ending at character 2. This is why the maximum possible value of T[2] is 2 rather than 3 - the substring formed from the first two characters of W can't have length any greater than 2.
Using this interpretation, it's also easier to see why T[4] is 0 rather than 1. The substring of W formed from the first four characters of W is ABCD, which has no proper prefix that is also a proper suffix.
Hope this helps!
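
To make the "length, not index" reading concrete, here is a sketch that builds the table in the convention used in the question: T[0] = -1, and for i >= 1, T[i] is the length of the longest proper border of the first i characters of W. This is one common way to compute it and is not taken from the Wikipedia pseudocode; the function name failure_table is hypothetical.

```python
def failure_table(w):
    if not w:
        return []
    table = [-1] + [0] * (len(w) - 1)
    k = 0  # length of the longest proper border of the prefix handled so far
    for i in range(2, len(w)):
        # Extend the border of w[:i-1] by the character w[i-1], falling back
        # to shorter borders while the next character does not match.
        while k > 0 and w[i - 1] != w[k]:
            k = table[k]
        if w[i - 1] == w[k]:
            k += 1
        table[i] = k
    return table

print(failure_table("ABCDABD"))  # [-1, 0, 0, 0, 0, 1, 2]
```

Here T[4] is 0 because it describes "ABCD", while the border "A" of "ABCDA" shows up one position later, at T[5].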
"let us say that we discovered a proper suffix which is a proper prefix and ending at W[2] with length 2 (the maximum possible)"
Okay, the length can be at most 2. That is correct, and here is why.
One fact: a "proper" prefix can't be the whole string, and the same goes for a "proper" suffix (like a proper subset).
Let W[0]=A, W[1]=A, W[2]=A, i.e. the pattern is "AAA". Then the longest proper prefix is "AA" (reading left to right) and the longest proper suffix is "AA" (reading right to left).
(Yes, the prefix and suffix overlap in the middle "A".)
So the value would be 2 rather than 3; it would have been 3 only if the prefix were not required to be proper.
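
The "AAA" example can be checked with a brute-force sketch that tries every proper length from longest to shortest. The function name longest_proper_border is made up for illustration.

```python
def longest_proper_border(s):
    """Length of the longest string that is both a proper prefix and a proper suffix of s."""
    # Proper means shorter than s itself, so start at len(s) - 1.
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0

print(longest_proper_border("AAA"))     # 2
print(longest_proper_border("ABCD"))    # 0
print(longest_proper_border("ABCDAB"))  # 2
```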

Resources