Counting cyclic shifts of a string - string

I need to write a function that will return the number of possible different cyclic shifts of an input string.
Could you please give me some tips on where should I start in order to create an efficient (in terms of time complexity) algorithm? Should I begin with "preprocessing" the string and creating some data structure to help count the shifts later?

Simply store the string consecutively twice somewhere e.g.:
"this_is_my_longish_stringthis_is_my_longish_string"
^ ^
| |
|<-string length apart->|
and then just move two index's from beginning to end of the first string "string length apart" along that double string, returning the interim string each time. Alternatively, you can work on your homework problems yourself.

Related

Is there anything else used instead of slicing the String?

This is one of the practice problems from Problem solving section of Hackerrank. The problem statement says
Steve has a string of lowercase characters in range ascii[ā€˜aā€™..ā€™zā€™]. He wants to reduce the string to its shortest length by doing a series of operations. In each operation he selects a pair of adjacent lowercase letters that match, and he deletes them.
For example : 'aaabbccc' -> 'ac' , 'abba' -> ''
I have tried solving this using slicing of strings but this gives me timeout runtime error on larger strings. Is there anything else to be used?
My code:
s = list(input())
i=1
while i<len(s):
if s[i]==s[i-1]:
s = s[:i-1]+s[i+1:]
i = i-2
i+=1
if len(s)==0:
print("Empty String")
else:
print(''.join(s))
This gives me terminated due to timeout message.
Thanks for your time :)
Interning each new immutable string can be expensive,
as it has O(N) linear cost with the length of the string.
Consider processing "aa" * int(1e6).
You will write on the order of 1e12 characters to memory
by the time you're finished.
Take a moment (well, take linear time) to
copy each character over to a mutable list element:
[c for c in giant_string]
Then you can perform dup processing by writing a tombstone
of "" to each character you wish to delete,
using just constant time.
Finally, in linear time you can scan through the survivors using "".join( ... )
One other possible solution is to use regex. The pattern ([a-z])\1 matches a duplicate lowercase letter. The implementation would involve something like this:
import re
pattern = re.compile(r'([a-z])\1')
while pattern.search(s): # While match is found
s = pattern.sub('', s) # Remove all matches from "s"
I'm not an expert at efficiency, but this seems to write fewer strings to memory than your solution. For the case of "aa" * int(1e6) that J_H mentioned, it will only write one, thanks to pattern.sub replacing all occurances at once.

Find the minimal lexographical string formed by merging two strings

Suppose we are given two strings s1 and s2(both lowercase). We have two find the minimal lexographic string that can be formed by merging two strings.
At the beginning , it looks prettty simple as merge of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
Now if we perform merge on these two we must decide which z to pick as they are equal, clearly if we pick z of s2 first then the string formed will be:
zyzyy
If we pick z of s1 first, the string formed will be:
zyyzy which is correct.
As we can see the merge of mergesort can lead to wrong answer.
Here's another example:
s1:zyy
s2:zyb
Now the correct answer will be zybzyy which will be got only if pick z of s2 first.
There are plenty of other cases in which the simple merge will fail. My question is Is there any standard algorithm out there used to perform merge for such output.
You could use dynamic programming. In f[x][y] store the minimal lexicographical string such that you've taken x charecters from the first string s1 and y characters from the second s2. You can calculate f in bottom-top manner using the update:
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y]) \\ the '+' here represents
\\ the concatenation of a
\\ string and a character
You start with f[0][0] = "" (empty string).
For efficiency you can store the strings in f as references. That is, you can store in f the objects
class StringRef {
StringRef prev;
char c;
}
To extract what string you have at certain f[x][y] you just follow the references. To udapate you point back to either f[x-1][y] or f[x][y-1] depending on what your update step says.
It seems that the solution can be almost the same as you described (the "mergesort"-like approach), except that with special handling of equality. So long as the first characters of both strings are equal, you look ahead at the second character, 3rd, etc. If the end is reached for some string, consider the first character of the other string as the next character in the string for which the end is reached, etc. for the 2nd character, etc. If the ends for both strings are reached, then it doesn't matter from which string to take the first character. Note that this algorithm is O(N) because after a look-ahead on equal prefixes you know the whole look-ahead sequence (i.e. string prefix) to include, not just one first character.
EDIT: you look ahead so long as the current i-th characters from both strings are equal and alphabetically not larger than the first character in the current prefix.

String matching without using builtin functions

I want to search for a query (a string) in a subject (another string).
The query may appear in whole or in parts, but will not be rearranged. For instance, if the query is 'da', and the subject is 'dura', it is still a match.
I am not allowed to use string functions like strfind or find.
The constraints make this actually quite straightforward with a single loop. Imagine you have two indices initially pointing at the first character of both strings, now compare them - if they don't match, increment the subject index and try again. If they do, increment both. If you've reached the end of the query at that point, you've found it. The actual implementation should be simple enough, and I don't want to do all the work for you ;)
If this is homework, I suggest you look at the explanation which precedes the code and then try for yourself, before looking at the actual code.
The code below looks for all occurrences of chars of the query string within the subject string (variables m; and related ii, jj). It then tests all possible orders of those occurrences (variable test). An order is "acceptable" if it contains all desired chars (cond1) in increasing positions (cond2). The result (variable result) is affirmative if there is at least one acceptable order.
subject = 'this is a test string';
query = 'ten';
m = bsxfun(#eq, subject.', query);
%'// m: test if each char of query equals each char of subject
[ii jj] = find(m);
jj = jj.'; %'// ii: which char of query is found within subject...
ii = ii.'; %'// jj: ... and at which position
test = nchoosek(1:numel(jj),numel(query)).'; %'// test all possible orders
cond1 = all(jj(test) == repmat((1:numel(query)).',1,size(test,2)));
%'// cond1: for each order, are all chars of query found in subject?
cond2 = all(diff(ii(test))>0);
%// cond2: for each order, are the found chars in increasing positions?
result = any(cond1 & cond2); %// final result: 1 or 0
The code could be improved by using a better approach as regards to test, i.e. not testing all possible orders given by nchoosek.
Matlab allows you to view the source of built-in functions, so you could always try reading the code to see how the Matlab developers did it (although it will probably be very complex). (thanks Luis for the correction)
Finding a string in another string is a basic computer science problem. You can read up on it in any number of resources, such as Wikipedia.
Your requirement of non-rearranging partial matches recalls the bioinformatics problem of mapping splice variants to a genomic sequence.
You may solve your problem by using a sequence alignment algorithm such as Smith-Waterman, modified to work with all English characters and not just DNA bases.
Is this question actually from bioinformatics? If so, you should tag it as such.

Ada Return Concatenated String of Strings

I am trying to finish up a homework assignment, and am down to the last part. First, I'll show you the type that I am dealing with:
TYPE Book_Collection IS
RECORD
Books : Book_Collection_Array;
Max_Size : Integer;
Size : Integer;
END RECORD;
TYPE Book_Type IS
RECORD
Title,
Author,
Publisher : Title_Str;
Year : Year_Type;
Edition : Natural;
Isbn : Isbn_Type;
Price : Dollars;
Stock : Natural;
Format : Format_Type;
END RECORD;
Book_Collection_Array is an array of book_type. These are private types, so the array is bounded (1..200).
There is a function called ToString in a separate package that was provided to us, that takes a book_type as input, and returns a string of all the elements of book_type. What I need to create is a function that takes book_collection is a parameter, and returns a string concatenating all of the strings that are returned by the ToString function that was provided, for the book_types that exist in that book_collection. I have made multiple attempts, but am constantly getting range check failures. Can anyone point me in the right direction?
*Edit:
Thank you to both of you for your help. I went the route of using an unbounded string, and appending each string to it, then declaring an output string and setting it as a constant string equal to the the To_String of the unbounded_string.*
I'll give you a hint.
Ada strings ideally aren't treated or handled much like C or Java strings at all. C strings count on a trailing nul (0) character to designate the end of data in a buffer. Java strings keep track of their own length, and will dynamically reallocate themselves to keep to the proper length if need be. So typical string-handling idioms in those languages think nothing of progressively modifying a string variable.
Ada strings instead are expected to be perfectly-sized when created. Most routines will assume that every element in a string array contains valid character data, and any destination string you assign data into will be perfectly sized to hold it. If that isn't the case, usually an exception is raised (and most likely your program crashes).
There are several ways to deal with this when you are building a string. One way is to create a really big string object as a buffer, and keep a separate length variable to tell your code how much data is really in there at all times. Then when you call Ada string routines you can feed them just the slice of data from the string that is valid. eg: Put_line (My_New_String(1..My_String_Length));
A better way is to just deal with perfectly-sized constant strings. For example, if you want to tack String1 and String2 together, the safe Ada way to do this is:
My_New_String : constant String := String1 & String2;
Then if you later want a string that is this string with String3 tacked on:
My_New_New_String : constant String := My_New_String & String3;
For more info on this, I suggest you look at some of the links over on the right side of this browser window under the heading "Related". I see a lot of good stuff in there.

How to find all cyclic shifted strings in a given input?

This is a coding exercise. Suppose I have to decide if one string is created by a cyclic shift of another. For example: cab is a cyclic shift of abc but cba is not.
Given two strings s1 and s2 we can do that as follows:
if (s1.length != s2.length)
return false
for(int i = 0; i < s1.length(); i++)
if ((s1.substring(i) + s1.substring(0, i)).equals(s2))
return true
return false
Now what if I have an array of strings and want to find all strings that are cyclic shift of one another? For example: ["abc", "xyz", "yzx", "cab", "xxx"] -> ["abc", "cab"], ["xyz", "yzx"], ["xxx"]
It looks like I have to check all pairs of the strings. Is there a "better" (more efficient) way to do that?
As a start, you can know if a string s1 is a rotation of a string s2 with a single call to contains(), like this:
public boolean isRotation(String s1, String s2){
String s2twice = s2+s2;
return s2twice.contains(s1);
}
Namely, if s1 is "rotation" and s2 is "otationr", the concat gives you "otationrotationr", which contains s1 indeed.
Now, even if we assume this is linear, or close to it (which is not impossible using Rabin-Karp, for instance), you are still left with O(n^2) pair comparisons, which may be too much.
What you could do is build an hashtable where the sorted word is the key, and the posting list contains all the words from your list that, if sorted, give the key (ie. key("bca") and key("cab") both should return "abc"):
private Map<String, List<String>> index;
/* ... */
public void buildIndex(String[] words){
for(String word : words){
String sortedWord = sortWord(word);
if(!index.containsKey(sortedWord)){
index.put(sortedWord, new ArrayList<String>());
}
index.get(sortedWord).add(word);
}
}
CAVEAT: The hashtable will contain, for each key, all the words that have exactly the same letters occurring the same amount of times (not just the rotations, ie. "abba" and "baba" will have the same key but isRotation("abba", "baba") will return false).
But once you have built this index, you can significantly reduce the number of pairs you need to consider: if you want all the rotations for "bca" you just need to sort("bca"), look it up in the hashtable, and check (using the isRotation method above, if you want) if the words in the posting list are the result of a rotation or not.
If strings are short compared to the number of strings in the list, you can do significantly better by rotating all strings to some normal form (lexicographic minimum, for example). Then sort lexicographically and find runs of the same string. That's O(n log n), I think... neglecting string lengths. Something to try, maybe.
Concerning the way to find the pairs in the table, there could be many better way, but what I came up as a first thought is to sort the table and apply the check per adjacent pair.
This is much better and simpler that checking every string with every other string in the table
Consider building an automaton for each string against which you wish to test.
Each automaton should have one entry point for each possible character in the string, and transitions for each character, plus an extra transition from the end to the start.
You could improve performance even further if you amalgated the automata.
I think a combination of the answers by Patrick87 and savinos would make a fair amount of sense. Specifically, in a Java-esque pseudo-code:
List<String> inputs = ["abc", "xyz", "yzx", "cab", "xxx"];
Map<String,List<String>> uniques = new Map<String,List<String>>();
for(String value : inputs) {
String normalized = normalize(value);
if(!uniques.contains(normalized)) {
unqiues.put(normalized, new List<String>());
}
uniques.get(normalized).add(value);
}
// you now have a Map of normalized strings to every string in the input
// that is "equal to" that normalized version
Normalizing the string, as stated by Patrick87 might be best done by picking the rotation of the string that results in the lowest lexographic ordering.
It's worth noting, however, that the "best" algorithm probably relies heavily on the inputs... the number of strings, the length of those string, how many duplicates there are, etc.
You can rotate all the strings to a normalized form using Booth's algorithm (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation) in O(s) time, where s is the length of the string.
You can then use the normalized form as a key in a HashMap (where the value is the set of rotations seen in the input). You can populate this HashMap in a single pass over the data. i.e., for each string
calculate the normalized form
check if the HashMap contains the normalized form as a key - if not insert the empty Set at this key
add the string to the Set in the HashMap
You then just need to output the values of the HashMap. This makes the total runtime of the algorithm O(n * s) - where n is the number of words and s is the average word length. The total space usage is also O(n * s).

Resources