Go's LeftStr, RightStr, SubStr - string

I believe there are no LeftStr(str,n) (take at most n first characters), RightStr(str,n) (take at most n last characters) and SubStr(str,pos,n) (take first n characters after pos) function in Go, so I tried to make one
// take at most n first characters
func Left(str string, num int) string {
if num <= 0 {
return ``
}
if num > len(str) {
num = len(str)
}
return str[:num]
}
// take at most last n characters
func Right(str string, num int) string {
if num <= 0 {
return ``
}
max := len(str)
if num > max {
num = max
}
num = max - num
return str[num:]
}
But I believe those functions will give incorrect output when the string contains unicode characters. What's the fastest solution for those function, is using for range loop is the only way?

As mentioned in already in comments,
combining characters, modifying runes, and other multi-rune
"characters"
can cause difficulties.
Anyone interested in Unicode handling in Go should probably read the Go Blog articles
"Strings, bytes, runes and characters in Go"
and "Text normalization in Go".
In particular, the later talks about the golang.org/x/text/unicode/norm package which can help in handling some of this.
You can consider several levels increasingly of more accurate (or increasingly more Unicode aware) spiting the first (or last) "n characters" from a string.
Just use n bytes.
This may split in the middle of a rune but is O(1), is very simple, and in many cases you know the input consists of only single byte runes.
E.g. str[:n].
Split after n runes.
This may split in the middle of a character. This can be done easily, but at the expense of copying and converting with just string([]rune(str)[:n]).
You can avoid the conversion and copying by using the unicode/utf8 package's DecodeRuneInString (and DecodeLastRuneInString) functions to get the length of each of the first n runes in turn and then return str[:sum] (O(n), no allocation).
Split after the n'th "boundary".
One way to do this is to use
norm.NFC.FirstBoundaryInString(str) repeatedly
or norm.Iter to find the byte position to split at and then return str[:pos].
Consider the displayed string "cafés" which could be represented in Go code as: "cafés", "caf\u00E9s", or "caf\xc3\xa9s" which all result in the identical six bytes. Alternative it could represented as "cafe\u0301s" or "cafe\xcc\x81s" which both result in the identical seven bytes.
The first "method" above may split those into "caf\xc3"+"\xa9s" and cafe\xcc"+"\x81s".
The second may split them into "caf\u00E9"+"s" ("café"+"s") and "cafe"+"\u0301s" ("cafe"+"́s").
The third should split them into "caf\u00E9"+"s" and "cafe\u0301"+"s" (both shown as "café"+"s").

Related

Ocaml String.sub: How to get all characters until end of string

So I'm using the String.sub function and I'm wondering if there's any way to get all the characters in a string until the end of the string. Since String.sub takes in a string, int (the index of where to start getting chars), and then a number of how many chars, I'm not sure what the easiest way of doing all chars since we want a possibly positive infinite amount
For String.sub you just have to subtract from the length of the string. There's no simpler way, i.e., there's no value for length that has a special meaning. A value larger than the remainder of the string is an error in OCaml (which tends to be strict when checking parameters).
Assume i is >= 0 and < length of the string:
String.sub s i (String.length s - i)
You can use Str.last_chars, but you still need to know how many characters you want. I.e., you still have to subtract from the length of the string.
Str.last_chars s (String.length s - i)

Easy way to get a sub-string/sub-slice of up to N characters/elements in Go

In Python I can slice a string to get a sub-string of up to N characters and if the string is too short it will simply return the rest of the string, e.g.
"mystring"[:100] # Returns "mystring"
What's the easiest way to do the same in Go? Trying the same thing panics:
"mystring"[:100] // panic: runtime error: slice bounds out of range
Of course, I can write it all manually:
func Substring(s string, startIndex int, count int) string {
maxCount := len(s) - startIndex
if count > maxCount {
count = maxCount
}
return s[startIndex:count]
}
fmt.Println(Substring("mystring", 0, n))
But that's rather a lot of work for something so simple and (I would have thought) common. What's more, I don't know how to generalise this function to slices of other types, since Go doesn't support generics. I'm hoping there is a better way. Even Math.Min() doesn't easily work here, because it expects and returns float64.
Note that while a function remains the recommended solution (even if it has to be implemented for slices with different type), it wouldn't work well with string.
fmt.Println(Substring("世界mystring", 0, 5)) would actually print 世�� instead of 世界mys.
See "Code points, characters, and runes": a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.
And in Go, a "code point" is a rune (as seen here).
Using rune would be more robust (again, in case of strings)
func SubstringRunes(s string, startIndex int, count int) string {
runes := []rune(s)
length := len(runes)
maxCount := length - startIndex
if count > maxCount {
count = maxCount
}
return string(runes[startIndex:count])
}
See it in action in this playground.

find number of repeating substrings in a string

I am looking for an algorithm that will find the number of repeating substrings in a single string.
For this, I was looking for some dynamic programming algorithms but didn't find any that would help me. I just want some tutorial on how to do this.
Let's say I have a string ABCDABCDABCD. The expected output for this would be 3, because there is ABCD 3 times.
For input AAAA, output would be 4, since A is repeated 4 times.
For input ASDF, output would be 1, since every individual character is repeated 1 time only.
I hope that someone can point me in the right direction. Thank you.
I am taking the following assumptions:
The repeating substrings must be consecutive. That is, in case of ABCDABC, ABC would not count as a repeating substring, but it would in case of ABCABC.
The repeating substrings must be non-overalpping. That is, in case of ABCABC, ABC would not count as a repeating substring.
In case of multiple possible answers, we want the one with the maximum value. That is, in the case of AAAA, the answer should be 4 (a is the substring) rather than 2 (aa is the substring).
Under these assumptions, the algorithm is as follows:
Let the input string be denoted as inputString.
Calculate the KMP failure function array for the input string. Let this array be denoted as failure[]. This operation if of linear time complexity with respect to the length of the string. So, by definition, failure[i] denotes the length of the longest proper-prefix of the substring inputString[0....i] that is also a proper-suffix of the same substring.
Let len = inputString.length - failure.lastIndexValue. At this point, we know that if there is any repeating string at all, then it has to be of this length len. But we'll need to check for that; First, just check if len perfectly divides inputString.length (that is, inputString.length % len == 0). If yes, then check if every consecutive (non-overlapping) substring of len characters is the same or not; this operation is again of linear time complexity with respect to the length of the input string.
If it turns out that every consecutive non-overlapping substring is the same, then the answer would be = inputString.length/ len. Otherwise, the answer is simply inputString.length, as there is no such repeating substring present.
The overall time complexity would be O(n), where n is the number of characters in the input string.
A sample code for calculating the KMP failure array is given here.
For example,
Let the input string be abcaabcaabca.
Its KMP failure array would be - [0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8].
So, our len = (12 - 8) = 4.
And every consecutive non-overlapping substring of length 4 is the same (abca).
Therefore the answer is 12/4 = 3. That is, abca is repeated 3 times repeatedly.
The solution for this with C# is:
class Program
{
public static string CountOfRepeatedSubstring(string str)
{
if (str.Length < 2)
{
return "-1";
}
StringBuilder substr = new StringBuilder();
// Length of the substring cannot be greater than half of the actual string
for (int i = 0; i < str.Length / 2; i++)
{
// We will iterate through half of the actual string and
// create a new string by appending the current character to the previous character
substr.Append(str[i]);
String clearedOfNewSubstrings = str.Replace(substr.ToString(), "");
// We will remove the newly created substring from the actual string and
// check if the length of the actual string, cleared of the newly created substring, is 0.
// If 0 it tells us that it is only made of its substring
if (clearedOfNewSubstrings.Length == 0)
{
// Next we will return the count of the newly created substring in the actual string.
var countOccurences = Regex.Matches(str, substr.ToString()).Count;
return countOccurences.ToString();
}
}
return "-1";
}
static void Main(string[] args)
{
// Input: {"abcdaabcdaabcda"}
// Output: 3
// Input: { "abcdaabcdaabcda" }
// Output: -1
// Input: {"barrybarrybarry"}
// Output: 3
var s = "asdf"; // Output will be -1
Console.WriteLine(CountOfRepeatedSubstring(s));
}
}
How do you want to specify the "repeating string"? Is it simply the first group of characters up until either a) the first character is found again, b) the pattern begins to repeat, or c) some other criteria?
So, if your string is "ABBAABBA", is that a 2 because "ABBA" repeats twice or is it 1 because you have "ABB" followed by "AAB"? What about "ABCDABCE" -- does "ABC" count (despite the "D" in between repetitions?) In "ABCDABCABCDABC", is the repeating string "ABCD" (1) or "ABCDABC" (2)?
What about "AAABBAAABB" -- is that 3 ("AAA") or 2 ("AAABB")?
If the end of the repeating string is another instance of the first letter, it's pretty simple:
Work your way through the string character by character, putting each character into another variable as you go, until the next character matches the first one. Then, given the length of the substring in your second variable, check the next bit of your string to see if it matches. Continue until it doesn't match or you hit the end of the string.
If you just want to find any length pattern that repeats regardless of whether the first character is repeated within the pattern, it gets more complicated (but, fortunately, it's the sort of thing computers are good at).
You'll need to go character by character building a pattern in another variable as above, but you'll also have to watch for the first character to reappear and start building a second substring as you go, to see if it matches the first. This should probably go in an array as you might encounter a third (or more) instance of the first character which would trigger the need to track yet another possible match.
It's not difficult but there is a lot to keep track of and it's a rather annoying problem. Is there a particular reason you're doing this?

Unique Substrings in wrap around strings

I have been given an infinite wrap around of the string str="abcdefghijklmnopqrstuvwxyz" so it looks like
"..zabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcd...." and another string p.
I need to find out how many unique non-empty substrings of p are present in the infinite wraparound string str?
For example: "zab"
There are 6 substrings "z", "a", "b", "za", "ab", "zab" of string "zab" in str.
I tried finding all suffixes of p in a particular concatenation of the string str say for example: "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
and as soon as i get a suffix which is a part of the above i add all its substrings to my result, as:
for (int i=0;i<length;i++) {
String suffix = p.substring(i,length);
if(isPresent(suffix)) {
sum += (suffix.length()*(suffix.length()+1))/2;
break;
} else {
sum++;
}
}
And my isPresent function is:
private boolean isPresent(String s) {
if(s.length()==1) {
return true;
}
String main = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcde
fghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
return main.contains(s);
}
If the length of p is greater than my assumed concatenated string assumed in isPresent function, my algorithm fails!!
So how should i find the substrings irrespective of the the wrap around string str? Is there a better approach for this problem?
Some ideas/suggestions (not a full algo)
you don't need to consider an infinite repetition of the wrap around string but only len(p)/len(repeating-fragment) + 1 (integral division) repetitions. Let's denote this string with S **
if a substring sp of p is a substring of S, than any substrings of sp will be substrings of S
So the problem seems to reduce to:
find sp (substring of both p and S) with the maximal length. This is called longest common substring and admits a dynamic programming solution with the complexity of O(n*m) (lengths of the two strings). The cited has a pseudo-code algo.
repeat the above recursively with the 'remnants' of p after eliminating the longest common substring.
Now, you have a sequence of "longest common substrings". How many do you need to retain? I feel that the "longest common substring" may be used to trim down the need of brute-forcing every substring of any and all the above, but I'd need more time than I have available now.
I hope the sketch above helps.
** I might be wrong on the number of repetitions which need to be considered. If I am, then in any case there will be a maximal number of repetitions to be considered and there will be an S of minimal length that is sufficient for the purpose.

How do I find the largest sequence in a string that is repeated at least once?

Trying to solve the following problem:
Given a string of arbitrary length, find the longest substring that occurs more than one time within the string, with no overlaps.
For example, if the input string was ABCABCAB, the correct output would be ABC. You couldn't say ABCAB, because that only occurs twice where the two substrings overlap, which is not allowed.
Is there any way to solve this reasonably quickly for strings containing a few thousand characters?
(And before anyone asks, this is not homework. I'm looking at ways to optimize the rendering of Lindenmayer fractals, because they tend to take excessive amounts of time to draw at high iteration levels with a naive turtle graphics system.)
Here's an example for a string of length 11, which you can generalize
Set chunk length to floor(11/2) = 5
Scan the string in chunks of 5 characters left to looking for repeats. There will be 3 comparisons
Left Right
Offset Offset
0 5
0 6
1 5
If you found a duplicate you're done. Otherwise reduce the chunk length to 4 and repeat until chunk length goes to zero.
Here's some (obviously untested) pseudocode:
String s
int len = floor(s.length/2)
for int i=len; i>0; i--
for j=0; j<=len-(2*i); j++
for k=j+i; k<=len-i; k++
if s.substr(j,j+i) == s.substr(k,k+i)
return s.substr(j,j+i)
return null
There may be an off-by-one error in there, but the approach should be sound (and minimal).
it looks like a suffix tree problem. Create the suffix tree, then find the biggest compressed branch with more than one child (occurs more than once in the original string). The number of letters in that compressed branch should be the size of the biggest subsequence.
i found something similar here: http://www.coderanch.com/t/370396/java/java/Algorithm-wanted-longest-repeating-substring
Looks like it can be done in O(n).
First we need to define the start symbol of our substring and define the length. Iterate all possible start positions then figure out the length doing binary search for the length (if you can find substr with lenght a, you may find with the longer length, function looks monotonous so bin search should be fine). Then find equal substring is N, using KMP or Rabin-Karp any linear algo is fine. Total N*N*log(N). Is that too much complexity?
The code is something like:
for(int i=0;i<input.length();++i)
{
int l = i;
int r = input.length();
while(l <= r)
{
int middle = l + ((r - l) >> 1);
Check if string [i;middle] can be found in initial string. Should be done in O(n); You need to check parts of initial string [0,i-1], [middle+1;length()-1];
if (found)
l = middle + 1;
else
r = middle - 1;
}
}
Make sense?
This type of analysis is often done in genome sequences. have a look at this paper. it has an efficient implemention (c++) for solving repeats: http://www.complex-systems.com/pdf/17-4-4.pdf
might be what you are looking for

Resources