Detect all k-length words of string in prolog - string

Words are runs of symbol characters separated by white space or by the start/end of the string, e.g. [w,o,r,d,1,' ',w,o,r,d,2].
I need to find all words of length k in a given string and append them to the result string (separated by white space).
This is what I expect, for example, for k = 5:
?- kthWords([w,o,r,d,1,' ',w,r,d,' ',w,o,r,d,2], 5, X).
X = [w,o,r,d,1,' ',w,o,r,d,2].

You could write:
final_kthWords(L,K,Outlist):-
    kthWords(L,K,L1),
    reverse(L1,[_|T]),
    reverse(T,Outlist).
kthWords([],_,[]):-!.
kthWords(L,K,L1):-
    find_word(L,Word,L2),
    length(Word,N),
    (   N=:=K
    ->  append(Word,[' '|T],L1),
        kthWords(L2,K,T)
    ;   kthWords(L2,K,L1)
    ).
find_word([],[],[]).
find_word([H|T],[H|T1],L):- dif(H,' '), find_word(T,T1,L).
find_word([H|T],[],T):- H = ' '.
Here kthWords/3 calls find_word/3, which splits off the next word, and returns the output list with an extra ' ' at the end. All final_kthWords/3 does is remove that trailing ' ' and return the correct list:
?- final_kthWords([w,o,r,d,1,' ',w,r,d,' ',w,o,r,d,2], 5, X).
X = [w, o, r, d, 1, ' ', w, o, r, d, 2] ;
false.
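For readers who want to cross-check the expected output outside Prolog, here is a minimal Python sketch of the same split, filter and join idea. The name kth_words is hypothetical (it is not part of the answer above), and the sketch assumes that words are separated by single ' ' characters, as in the question.

```python
def kth_words(chars, k):
    """Keep only the words of length k from a list of characters."""
    # Split the character list into words on ' '.
    words, current = [], []
    for c in chars:
        if c == ' ':
            words.append(current)
            current = []
        else:
            current.append(c)
    words.append(current)  # the last word ends at the end of the list

    # Join the matching words with a single ' ' between them,
    # so no trailing separator has to be removed afterwards.
    out = []
    for w in words:
        if len(w) == k:
            if out:
                out.append(' ')
            out.extend(w)
    return out
```

Called as kth_words(list("word1 wrd word2"), 5), it returns the characters of "word1 word2", matching the query above.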

Hoping that someone else can propose a simpler solution... the following seems to work:
kthWordsH([], 0, _, R0, R0).
kthWordsH([], N, _, _, []) :-
    N \= 0.
kthWordsH([' ' | Tl], 0, Len, W, Revult) :-
    kthWordsH(Tl, Len, Len, [], Res0),
    append(Res0, [' ' | W], Revult).
kthWordsH([' ' | Tl], N, Len, _, Revult) :-
    N \= 0,
    kthWordsH(Tl, Len, Len, [], Revult).
% word longer than Len: set the counter to -1 so that the rest of the
% word is discarded instead of being counted as a new word
kthWordsH([H | Tl], 0, Len, _, Revult) :-
    H \= ' ',
    kthWordsH(Tl, -1, Len, [], Revult).
kthWordsH([H | Tl], N, Len, Tw, Revult) :-
    H \= ' ',
    N \= 0,
    Nm1 is N-1,
    kthWordsH(Tl, Nm1, Len, [H | Tw], Revult).
kthWords(List, Len, Result) :-
    kthWordsH(List, Len, Len, [], Revult),
    reverse(Revult, Result).

Solution without reverse.
% return a word of k length, or return [] otherwise
kword(K,L,W):-
    length(L,K) -> append(L,[' '],W); W=[].
% if no more chars, then check final word in L and
% append to word list Ls to return Lw
kwords(K,[],L,Ls,Lw):-
    kword(K,L,W),
    append(Ls,W,Lw).
% if char is space, then append to Ls if word of length K
% if not space, append char to "in progress" word list L
kwords(K,[C|Cs],L,Ls,Lw):-
    (   C=' '
    ->  kword(K,L,W),
        append(Ls,W,Ls0),
        L2 = []
    ;   append(L,[C],L2),
        Ls0 = Ls
    ),
    kwords(K,Cs,L2,Ls0,Lw).
% initialise predicate call with empty word and empty result
kthWords(Cs,K,L):- kwords(K,Cs,[],[],L).

Related

Smallest window (substring) that has both uppercase and corresponding lowercase characters

I was asked the following question in an onsite interview:
A string is considered "balanced" when every letter in the string appears in both uppercase and lowercase. For example, CATattac is balanced (a, c, t occur in both cases), while Madam is not (a, d only appear in lowercase). Write a function that, given a string, returns the shortest balanced substring of that string. For example:
“azABaabza” should return “ABaab”
“TacoCat” should return -1 (not balanced)
“AcZCbaBz” should return the entire string
Doing it with the brute force approach is trivial - calculating all the pairs of substrings and then checking if they are balanced, while keeping track of the size and starting index of the smallest one.
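As a baseline for checking any faster solution, the brute-force idea described above could be sketched as follows. The function names are hypothetical, and the sketch assumes the input contains only letters.

```python
def is_balanced(t):
    """True if every letter of t occurs in both cases (t is assumed non-empty)."""
    chars = set(t)
    return all(c.swapcase() in chars for c in chars)


def shortest_balanced_bruteforce(s):
    """Try every substring, shortest first; return the first balanced one or -1."""
    n = len(s)
    # A balanced substring needs at least two characters (one letter, two cases).
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            candidate = s[start:start + length]
            if is_balanced(candidate):
                return candidate
    return -1
```

Because lengths are tried in increasing order, the first hit is a shortest balanced substring; the cost is cubic in the length of the string, which is what the question wants to improve on.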
How do I optimize? I have a strong feeling it can be done with a sliding-window/two-pointer approach, but I am not sure how. When to update the pointers of the sliding window?
Edit: Removing the sliding-window tag since this is not a sliding-window problem (as discussed in the comments).
Strings have a special property here: there are only 26 uppercase and 26 lowercase letters.
For each of the 26 letters j and each position i, let len[i][j] be the minimum length of a substring starting at position i that contains both the uppercase and the lowercase form of letter j.
Demo C++ code:
string s = "CATattac";
// if len[i] >= s.size() + 1, it denotes there is no matching
vector<vector<int>> len(s.size(), vector<int>(26, 0));
for (int i = 0; i < 26; ++i) {
    int upperPos = s.size() * 2;
    int lowerPos = s.size() * 2;
    for (int j = s.size() - 1; j >= 0; --j) {
        if (s[j] == 'A' + i) {
            upperPos = j;
        } else if (s[j] == 'a' + i) {
            lowerPos = j;
        }
        len[j][i] = max(lowerPos - j + 1, upperPos - j + 1);
    }
}
We also keep track of the count of characters.
// cnt[i][j] denotes the number of characters j in substring s[0..i-1]
// cnt[0][j] is always 0
vector<vector<int>> cnt(s.size() + 1, vector<int>(26, 0));
for (int i = 0; i < s.size(); ++i) {
    for (int j = 0; j < 26; ++j) {
        cnt[i + 1][j] = cnt[i][j];
        if (s[i] == 'A' + j || s[i] == 'a' + j) {
            ++cnt[i + 1][j];
        }
    }
}
Then we can loop over s.
int m = s.size() + 1;
for (int i = 0; i < s.size(); ++i) {
    bool done = false;
    int minLen = 1;
    while (!done && i + minLen <= s.size()) {
        // execute at most 26 times, a new character must be added to change minLen
        int prevMinLen = minLen;
        done = true;
        for (int j = 0; j < 26 && i + minLen <= s.size(); ++j) {
            if (cnt[i + minLen][j] - cnt[i][j] > 0) {
                // character j exists in the substring, have to find pair of it
                minLen = max(minLen, len[i][j]);
            }
        }
        if (prevMinLen != minLen) done = false;
    }
    // find overall minLen
    if (i + minLen <= s.size())
        m = min(m, minLen);
    cout << minLen << '\n';
}
Output (a value is valid if i + minLen <= s.size(); otherwise no balanced substring starts at that position):
The exact invalid values are a by-product of how the array len is generated (a missing letter gets position s.size() * 2).
8
4
15
14
13
12
11
10
I'm not sure whether there is a simpler solution but it is the best I could think of right now.
Time complexity: O(N) with a constant of 26 * 26
Edit: I previously had O(n log(n)) due to an unnecessary binary search.
I thought of a solution, which is technically O(n), where n is the length of the string, but the constant is pretty large.
For simplicity's sake, let's consider an analogous situation with only two letters, A and B (and their lowercase counterparts), and let l be the size of the alphabet for future reference. I worked on an example string ABabBaaA.
We start by computing the prefix counts of the number of occurrences of each letter. In this case, we get
i: 0, 1, 2, 3, 4, 5, 6, 7, 8
----------------------------
A: 0, 1, 1, 1, 1, 1, 1, 1, 2
a: 0, 0, 0, 1, 1, 1, 2, 3, 3
B: 0, 0, 1, 1, 1, 2, 2, 2, 2
b: 0, 0, 0, 0, 1, 1, 1, 1, 1
This way, assuming we are indexing the string starting from 1 (for implementation's sake you can add an extra character to the beginning, like a dollar sign $), we can get the number of occurrences of each letter on any substring in constant time (or rather -- in O(l), but in my case l is set to 2 and in your case l = 26 so technically this is constant time).
OK now we prepare arrays / vectors / queues of character indices, so if the character A appears on indices 1 and 8, the structure will consist of 1 and 8. We get
A: 1, 8
a: 3, 6, 7
B: 2, 5
b: 4
Importantly, in these arrays and vectors we can look up the lowest index greater than a given value in amortized constant time, by discarding, one by one, the indices that are smaller than the current left index.
Now, the algorithm. Starting at each (left) index greater than 0, we will find the earliest right index for which the substring bound by [left_index, right_index] is balanced. We do that as follows:
0. Start with left_index = right_index = i for i = 1, ..., n.
1. Read the array of prefix counts for right_index and subtract the prefix counts for left_index - 1, receiving the counts for the substring [left_index, right_index]. Find any letter which fails the "balance" check. If there is none, you found the shortest balanced substring starting at left_index.
2. Find the first occurrence of the "missing" letter greater than left_index. Set right_index to the index of that occurrence. Go to step 1 keeping the modified right_index.
For example: starting with left_index = right_index = 1 we see that the number of occurrences of each letter in the substring is 1, 0, 0, 0, so a fails the check. The earliest occurrence of a is 3, so we set right_index = 3. We go back to step 1 receiving a new array of occurrences: 1, 1, 1, 0. Now b fails the check, and its earliest occurrence greater than 1 is 4, so we set right_index to 4. We go to step 1 receiving an array of occurrences 1, 1, 1, 1, which passes the balance check.
Another example: starting with left_index = right_index = 2 we get in step 1 an array of occurrences 0, 0, 1, 0. Now b fails the check. The earliest occurrence of b greater than left_index is 4, so we set right_index to 4. Now we get an array of occurrences 0, 1, 1, 1, so A fails the check. The earliest occurrence of A greater than left_index is 8, so we set right_index to that. Now, the array of occurrences is 2-1, 3-0, 2-0, 1-0, which is 1, 3, 2, 1 and it passes the balance check.
Ultimately we will find the shortest balanced substring to be bB with left_index = 4.
The complexity of this algorithm is O(nl^2) because: we start at n different indices and we perform a maximum of l lookups (for l different letters which can fail the check) in O(1). For each lookup, we have to calculate l differences of prefix sums. But as l is constant (albeit it may be large, like 26), this simplifies to O(n).
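The procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions: the input contains ASCII letters only, indices are 0-based, and a precomputed "next occurrence" table replaces the per-letter index queues described above (a simplification that costs O(n * 52) memory). All names are hypothetical.

```python
import string

LETTERS = string.ascii_letters                     # 'a'..'z' followed by 'A'..'Z'
INDEX = {c: k for k, c in enumerate(LETTERS)}      # letter -> slot 0..51
PARTNER = [INDEX[c.swapcase()] for c in LETTERS]   # slot of the other case


def shortest_balanced(s):
    """Return the shortest balanced substring of s, or -1 if there is none.

    Assumes s contains only ASCII letters.
    """
    n = len(s)
    # prefix[i][k] = number of occurrences of LETTERS[k] in s[:i]
    prefix = [[0] * 52]
    for ch in s:
        row = prefix[-1][:]
        row[INDEX[ch]] += 1
        prefix.append(row)
    # nxt[i][k] = smallest j >= i with s[j] == LETTERS[k], or n if there is none
    nxt = [[n] * 52 for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        nxt[i] = nxt[i + 1][:]
        nxt[i][INDEX[s[i]]] = i

    best = -1
    for left in range(n):
        right = left  # inclusive right end of the current window
        while right < n:
            counts = [prefix[right + 1][k] - prefix[left][k] for k in range(52)]
            # A letter fails the check if it is present but its other case is not.
            missing = next((PARTNER[k] for k in range(52)
                            if counts[k] > 0 and counts[PARTNER[k]] == 0), None)
            if missing is None:
                break  # s[left..right] is balanced
            # Jump to the first occurrence of the missing letter (n if none).
            right = nxt[left][missing]
        if right < n and (best == -1 or right - left + 1 < len(best)):
            best = s[left:right + 1]
    return best
```

On the example string ABabBaaA this returns bB, and on the strings from the question it returns ABaab, -1 and the whole string respectively.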
I'm using a recursive approach to this; I'm not sure what its time complexity is, though.
The idea is to check which characters in the string are present in both their lowercase and uppercase forms. Any character that doesn't appear in both forms is replaced with a space ' '. We then split the remaining string on ' ' into a list.
In the first case, if only one string is left after the split, we return its length.
In the second case, if no characters are left, we return -1.
In the third case, if more than one string is left, we recursively evaluate each of them and return the largest result. Note that taking the maximum yields the length of the longest balanced piece rather than the shortest one: for 'AaBb' the function returns 4, although 'Aa' is balanced and shorter.
from collections import Counter
def findMutual(s):
    lower = dict(Counter( [x for x in s if x.lower() == x] ))
    upper = dict(Counter( [x for x in s if x.upper() == x] ))
    mutual = {}
    for charr in lower:
        if charr.upper() in upper:
            mutual[charr] = upper[charr.upper()] + lower[charr]
    matching_charrs = ''.join([x if x.lower() in mutual else ' ' for x in s ]).split()
    print(s)
    print(matching_charrs)
    return matching_charrs
def smallestSubstring(s):
    matching_charrs = findMutual(s)
    if len(matching_charrs) == 1:
        return(len(matching_charrs[0]))
    elif len(matching_charrs) == 0:
        return(-1)
    else:
        list_lens = []
        for i in matching_charrs:
            list_lens.append(smallestSubstring(i))
        return max(list_lens)
print(smallestSubstring('azABaabza'))
print(smallestSubstring('dAcZCbaBz'))
print(smallestSubstring('TacoCat'))
print(smallestSubstring('Tt'))
print(smallestSubstring('T'))
print(smallestSubstring('TaCc'))

Swap two characters in the cell array of strings

I have a cell array of strings and I want to swap A and B in a percentage of the cell array, like 20% or 30% of the total number of strings in the cell array.
For example :
A_in={ 'ABCDE'
'ACD'
'ABCDE'
'ABCD'
'CDE' };
Now, we need to swap A and B in 40% of the sequences in A_in (2/5 sequences). Some sequences do not contain both A and B, so we just skip them and swap only in sequences that contain AB. The sequences to swap are chosen randomly. I would appreciate it if someone could tell me how to do this. The expected output is:
A_out={ 'ABCDE'
'ACD'
'BACDE'
'BACD'
'CDE' }
Get a random percentage of the indices with randsample and swap with strrep:
% Input
swapStr = 'AB';
swapPerc = 0.4; % 40%
% Get index to swap
hasPair = find(~cellfun('isempty', regexp(A_in, swapStr)));
swapIdx = randsample(hasPair, ceil(numel(hasPair) * swapPerc));
% Swap char pair
A_out = A_in;
A_out(swapIdx) = strrep(A_out(swapIdx), swapStr, fliplr(swapStr));
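For comparison, the same select-then-swap logic could be sketched in Python as below. The function name and the optional rng parameter are hypothetical and only there to make the random choice reproducible; like the snippet above, the sketch takes the percentage of the strings that contain the pair and rounds up.

```python
import math
import random


def swap_in_fraction(strings, pair="AB", fraction=0.4, rng=None):
    """Reverse `pair` in a random fraction of the strings that contain it."""
    rng = rng or random.Random()
    # Indices of the strings that contain the pair at all.
    candidates = [i for i, s in enumerate(strings) if pair in s]
    # Number of strings to modify, rounded up as in the snippet above.
    k = math.ceil(len(candidates) * fraction)
    chosen = set(rng.sample(candidates, k))
    # Replace every occurrence of the pair with its reverse in the chosen strings.
    return [s.replace(pair, pair[::-1]) if i in chosen else s
            for i, s in enumerate(strings)]
```

With the five example strings and fraction 0.4, three strings contain AB, so two of them are swapped and the other strings are returned unchanged.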
You can use strfind, like:
A_in={ 'ABCDE';
'ACD';
'ABCDE';
'ABCD';
'CDE' };
ABcells = strfind(A_in,'AB');
idxs = find(~cellfun(@isempty,ABcells));
n = numel(idxs);
perc = 0.6;
k = round(n*perc);
idxs = randsample(idxs,k);
A_out = A_in;
A_out(idxs) = cellfun(@(a,idx) [a(1:idx-1) 'BA' a(idx+2:end)],A_in(idxs),ABcells(idxs),'UniformOutput',false);

Finding maximum substring that is cyclic equivalent

This is a problem from a programming contest that was held recently.
Two strings a[0..n-1] and b[0..n-1] are called cyclic equivalent if and only if there exists an offset d, such that for all 0 <= i < n, a[i] = b[(i + d) mod n].
Given two strings s[0..L-1] and t[0..L-1] with the same length L, you need to find the maximum p such that s[0..p-1] and t[0..p-1] are cyclic equivalent. Print 0 if no such valid p exists.
Input
The first line contains an integer T indicating the number of test cases.
For each test case, there are two lines in total. The first line contains s. The second line contains t.
All strings contain only lower case alphabets.
Output
Output T lines in total. Each line should start with "Case #: " and followed by the maximum p. Here "#" is the number of the test case starting from 1.
Constraints
1 ≤ T ≤ 10
1 ≤ L ≤ 1000000
Example
Input:
2
abab
baba
abab
baac
Output:
Case 1: 4
Case 2: 3
Explanation
Case 1, d can be 1.
Case 2, d can be 2.
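The definition and the two sample cases can be checked directly with a small Python sketch. It is a literal, slow translation of the definition, not the O(n) solution being asked for, and the function names are hypothetical.

```python
def cyclic_equivalent(a, b):
    """True if some offset d gives a[i] == b[(i + d) % n] for all i."""
    n = len(a)
    if n != len(b):
        return False
    return any(all(a[i] == b[(i + d) % n] for i in range(n))
               for d in range(n))


def max_cyclic_prefix(s, t):
    """Largest p such that s[:p] and t[:p] are cyclic equivalent, else 0."""
    best = 0
    for p in range(1, min(len(s), len(t)) + 1):
        if cyclic_equivalent(s[:p], t[:p]):
            best = p
    return best
```

For the sample input, max_cyclic_prefix("abab", "baba") gives 4 and max_cyclic_prefix("abab", "baac") gives 3, in line with the expected output.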
My approach :
Generate all prefixes of S and T of the same length i, concatenate the prefix of S with itself, and check whether the prefix of T is a substring of the result. If it is, then the maximum P is i.
#include <algorithm>
#include <cstdio>
#include <iostream>
#include <string>
using namespace std;
// t is a rotation of s exactly when t occurs in s + s (equal lengths assumed)
bool isCyclic( string s, string t ){
    string str = s;
    str.append(s);
    if( str.find(t) != string::npos )
        return true;
    return false;
}
int main(){
    string s, t;
    int t1,l, o=1;
    scanf("%d", &t1);
    while( t1-- ){
        cin>>s>>t;
        l = min( s.length(), t.length());
        int i, maxP = 0;
        for( i=1; i<=l; i++ ){
            if( isCyclic(s.substr(0,i), t.substr(0,i)) ){
                maxP = i;
            }
        }
        printf("Case %d: %d\n", o++, maxP);
    }
    return 0;
}
I know this is not the most optimized approach for this problem, since I got Time Limit Exceeded. I learned that the prefix function can be used to get an O(n) algorithm, but I don't know about the prefix function. Could someone explain the O(n) approach?
Contest link http://www.codechef.com/ACMKGP14/problems/ACM14KP3

Modified longest common substring

Given two strings, what is an efficient algorithm to find the number and length of the longest common substrings, where two substrings are called common if:
1) at least x% of their characters are the same and at the same position;
2) their start and end indexes are the same.
Ex :
String 1 -> abedefkhj
String 2 -> kbfdfjhlo
Suppose the requested x is 40%; then the answer is
5 1
where 5 is the longest length and 1 is the number of sub-strings in each string satisfying the given property. Sub-String is "abede" in string 1 and "kbfdf" in string 2.
You can use something like the Levenshtein distance without deletions and insertions.
Build a table where every element [i, j] is the error (number of mismatched positions) for the substring from position i to position j.
foo(string a, string b, int x):
    n = min(a.length, b.length)
    // error[start][end] = number of mismatched positions in [start, end]
    for (end: [0 -> n-1]):
        mismatch = 0 if a[end] == b[end] else 1
        for (start: [end -> 0]):
            if start == end:
                error[start][end] = mismatch
            else:
                error[start][end] = error[start][end - 1] + mismatch
    best_len = 0;
    best_pos = 0;
    for (end: [0 -> n-1]):
        for (start: [end -> 0]):
            cur_len = end - start + 1
            // at least x% equal characters means at most (100 - x)% errors
            if (100 * error[start][end] <= (100 - x) * cur_len and cur_len > best_len):
                best_len = cur_len
                best_pos = start
    return (best_len, best_pos)
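A runnable variant of the idea could look like the Python sketch below. It is an illustration under assumptions, not the answer's exact method: it reads "at least x% equal" as "at most (100 - x)% mismatches", and it replaces the two-dimensional error table with prefix sums of mismatches, which gives the same error counts with O(n) memory. The function name is hypothetical.

```python
def longest_similar_window(a, b, x):
    """Return (length, start) of the longest aligned window in which
    at least x percent of the positions hold equal characters."""
    n = min(len(a), len(b))
    # mism[i] = number of mismatched positions among the first i characters
    mism = [0]
    for i in range(n):
        mism.append(mism[-1] + (a[i] != b[i]))

    best_len, best_pos = 0, 0
    for start in range(n):
        for end in range(start, n):
            length = end - start + 1
            errors = mism[end + 1] - mism[start]
            # Compare in integers to avoid rounding the percentage.
            if 100 * errors <= (100 - x) * length and length > best_len:
                best_len, best_pos = length, start
    return best_len, best_pos
```

For the strings from the question and x = 40, this reports length 5 starting at position 0, i.e. "abede" against "kbfdf". It still inspects all O(n^2) windows, as the table-based version does.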

How do you sort and efficiently find elements in a cell array (of strings) in Octave?

Is there built-in functionality for this?
Searching a cell array of strings in GNU Octave in linear time, O(n):
(The 15-year-old code in this answer was tested and found correct on GNU Octave 3.8.2, 5.2.0 and 7.1.0.)
The other answer uses cellidx, which Octave has deprecated; it still runs, but Octave recommends using ismember instead, like this:
%linear time string index search.
a = ["hello"; "unsorted"; "world"; "moobar"]
b = cellstr(a)
%b =
%{
% [1,1] = hello
% [2,1] = unsorted
% [3,1] = world
% [4,1] = moobar
%}
find(ismember(b, 'world')) %returns 3
ismember finds 'world' in index slot 3. This is an expensive linear-time O(n) operation because it has to iterate through all elements whether or not the item is found.
To achieve a logarithmic-time O(log n) solution, your list needs to come pre-sorted; then you can use binary search.
If your array is already sorted, you can do O(log n) worst case:
function i = binsearch(array, val, low, high)
%binary search algorithm for numerics, Usage:
%myarray = [ 30, 40, 50.15 ]; %already sorted list
%binsearch(myarray, 30, 1, 3) %item 30 is in slot 1
if ( high < low )
i = 0;
else
mid = floor((low + high) / 2);
if ( array(mid) > val )
i = binsearch(array, val, low, mid-1);
elseif ( array(mid) < val )
i = binsearch(array, val, mid+1, high);
else
i = mid;
endif
endif
endfunction
function i = binsearch_str(array, val, low, high)
% binary search for strings, usage:
%myarray2 = [ "abc"; "def"; "ghi"]; #already sorted list
%binsearch_str(myarray2, "abc", 1, 3) #item abc is in slot 1
if ( high < low )
i = 0;
else
mid = floor((low + high) / 2);
if ( mystrcmp(array(mid, [1:end]), val) == 1 )
i = binsearch_str(array, val, low, mid-1);
elseif ( mystrcmp(array(mid, [1:end]), val) == -1 )
i = binsearch_str(array, val, mid+1, high);
else
i = mid;
endif
endif
endfunction
function ret = mystrcmp(a, b)
%this function is just an octave string compare, its behavior follows the
%strcmp(str1,str2)'s in C and java.lang.String.compareTo(...)'s in Java,
%that is:
% -returns 1 if string a > b
% -returns 0 if string a == b
% -return -1 if string a < b
% The gt() operator does not support cell array. If the single word
% is passed as an one-element cell array, converts it to a string.
a_as_string = a;
if iscellstr( a )
a_as_string = a{1}; %a was passed as a single-element cell array.
endif
% The gt() operator does not support cell array. If the single word
% is passed as an one-element cell array, converts it to a string.
b_as_string = b;
if iscellstr( b )
b_as_string = b{1}; %b was passed as a single-element cell array.
endif
% Space-pad the shortest word so as they can be used with gt() and lt() operators.
if length(a_as_string) > length( b_as_string )
b_as_string( length( b_as_string ) + 1 : length( a_as_string ) ) = " ";
elseif length(a_as_string) < length( b_as_string )
a_as_string( length( a_as_string ) + 1 : length( b_as_string ) ) = " ";
endif
letters_gt = gt(a_as_string, b_as_string); %list of boolean a > b
letters_lt = lt(a_as_string, b_as_string); %list of boolean a < b
ret = 0;
%octave makes us roll our own string compare because
%strings are arrays of numerics
len = length(letters_gt);
for i = 1:len
if letters_gt(i) > letters_lt(i)
ret = 1;
return
elseif letters_gt(i) < letters_lt(i)
ret = -1;
return
endif
end;
endfunction
%Assuming that myarray is already sorted (it must be for binary
%search to finish in logarithmic time, O(log n) worst case), then do
myarray = [ 30, 40, 50.15 ]; %already sorted list
binsearch(myarray, 30, 1, 3) %item 30 is in slot 1
binsearch(myarray, 40, 1, 3) %item 40 is in slot 2
binsearch(myarray, 50, 1, 3) %50 does not exist so return 0
binsearch(myarray, 50.15, 1, 3) %50.15 is in slot 3
%same but for strings:
myarray2 = [ "abc"; "def"; "ghi"]; %already sorted list
binsearch_str(myarray2, "abc", 1, 3) %item abc is in slot 1
binsearch_str(myarray2, "def", 1, 3) %item def is in slot 2
binsearch_str(myarray2, "zzz", 1, 3) %zzz does not exist so return 0
binsearch_str(myarray2, "ghi", 1, 3) %item ghi is in slot 3
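To illustrate the same lookup on a pre-sorted list of strings outside Octave, here is a minimal Python sketch built on the standard bisect module. The function name mirrors the Octave one only for readability, and the 1-based slot numbering with 0 for "not found" is an assumption chosen to match the calls above.

```python
from bisect import bisect_left


def binsearch_str(sorted_items, val):
    """Return the 1-based slot of val in sorted_items, or 0 if it is absent.

    sorted_items must already be sorted in ascending order.
    """
    pos = bisect_left(sorted_items, val)  # O(log n) comparisons
    if pos < len(sorted_items) and sorted_items[pos] == val:
        return pos + 1
    return 0
```

For example, binsearch_str(["abc", "def", "ghi"], "def") gives 2 and binsearch_str(["abc", "def", "ghi"], "zzz") gives 0.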
To sort your array if it isn't already:
The complexity of sorting depends on the kind of data you have and on whichever sorting algorithm the GNU Octave developers selected; it's somewhere between O(n*log(n)) and O(n*n).
myarray = [ 9, 40, -3, 3.14, 20 ]; %not sorted list
myarray = sort(myarray)
myarray2 = [ "the"; "cat"; "sat"; "on"; "the"; "mat"]; %not sorted list
myarray2 = sortrows(myarray2)
Credit for the code changes that make this compatible with GNU Octave 3, 5 and 7 goes to @Paulo Carvalho in the other answer here.
Yes check this: http://www.obihiro.ac.jp/~suzukim/masuda/octave/html3/octave_36.html#SEC75
a = ["hello"; "world"];
c = cellstr (a)
⇒ c =
{
[1,1] = hello
[2,1] = world
}
>>> cellidx(c, 'hello')
ans = 1
>>> cellidx(c, 'world')
ans = 2
The cellidx solution does not meet the OP's efficiency requirement, and is deprecated (as noted by help cellidx).
Håvard Geithus in a comment suggested using the lookup() function on a sorted cell array of strings, which is significantly more efficient than cellidx. It's still a binary search though, whereas most modern languages (and even many 20 year old ones) give us easy access to associative arrays, which would be a much better approach.
While Octave doesn't obviously have associative arrays, that's effectively what the interpreter is using for Octave's variables, including structs, so you can make use of that, as described here:
http://math-blog.com/2011/05/09/associative-arrays-and-cellular-automata-in-octave/
Built-in Function: struct ("field", value, "field", value,...)
Built-in Function: isstruct (expr)
Built-in Function: rmfield (s, f)
Function File: [k1,..., v1] = setfield (s, k1, v1,...)
Function File: [t, p] = orderfields (s1, s2)
Built-in Function: fieldnames (struct)
Built-in Function: isfield (expr, name)
Function File: [v1,...] = getfield (s, key,...)
Function File: substruct (type, subs,...)
The question "Converting Matlab to Octave is there a containers.Map equivalent?" suggests using javaObject("java.util.Hashtable"). That would come with some setup overhead, but would be a performance win if you're using it a lot. It may even be viable to link in a library written in C or C++, though do think about whether that is a maintainable option.
Caveat: I'm relatively new to Octave, and writing this up as I research it myself (which is how I wound up here). I haven't yet run tests on the efficiency of these techniques, and while I've got a fair knowledge of the underlying algorithms, I may be making unreasonable assumptions about what's actually efficient in Octave.
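To make the associative-array idea concrete, here is a language-neutral illustration in Python, where a dict plays the role of the hash table. The function name is hypothetical, and slots are numbered from 1 to match Octave's indexing.

```python
def build_index(cells):
    """Map each string to the 1-based slot of its first occurrence."""
    index = {}
    for slot, s in enumerate(cells, start=1):
        index.setdefault(s, slot)  # keep the first slot for duplicates
    return index
```

Building the index is a single O(n) pass; afterwards each lookup such as build_index(["hello", "unsorted", "world", "moobar"]).get("world", 0) takes expected constant time, with 0 returned for strings that are not present.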
This is a version of mystrcmp() that works in recent versions of Octave (7.1.0):
function ret = mystrcmp(a, b)
%this function is just an octave string compare, its behavior follows the
%strcmp(str1,str2)'s in C and java.lang.String.compareTo(...)'s in Java,
%that is:
% -returns 1 if string a > b
% -returns 0 if string a == b
% -return -1 if string a < b
% The gt() operator does not support cell array. If the single word
% is passed as an one-element cell array, converts it to a string.
a_as_string = a;
if iscellstr( a )
a_as_string = a{1}; %a was passed as a single-element cell array.
endif
% The gt() operator does not support cell array. If the single word
% is passed as an one-element cell array, converts it to a string.
b_as_string = b;
if iscellstr( b )
b_as_string = b{1}; %b was passed as a single-element cell array.
endif
% Space-pad the shortest word so as they can be used with gt() and lt() operators.
if length(a_as_string) > length( b_as_string )
b_as_string( length( b_as_string ) + 1 : length( a_as_string ) ) = " ";
elseif length(a_as_string) < length( b_as_string )
a_as_string( length( a_as_string ) + 1 : length( b_as_string ) ) = " ";
endif
letters_gt = gt(a_as_string, b_as_string); %list of boolean a > b
letters_lt = lt(a_as_string, b_as_string); %list of boolean a < b
ret = 0;
%octave makes us roll our own string compare because
%strings are arrays of numerics
len = length(letters_gt);
for i = 1:len
if letters_gt(i) > letters_lt(i)
ret = 1;
return
elseif letters_gt(i) < letters_lt(i)
ret = -1;
return
endif
end;
endfunction

Resources