Match first letter of a string in Tcl - string

I want to compare the first letter of a string with a known character. For example, I want to check if the string "example"'s first letter matches with "e" or not. I'm sure there must be a very simple way to do it, but I could not find it.

One way is to get the first character with string index:
if {[string index $yourstring 0] eq "e"} {
...

I think it's a good idea to collect the different methods in a single answer.
Assume
set mystring example
set mychar e
The goal is to test whether the first character in $mystring is equal to $mychar.
My suggestion was (slightly edited):
if {[string match $mychar* $mystring]} {
...
This invocation does a glob-style match, comparing $mystring to the character $mychar followed by a sequence of zero or more arbitrary characters. Due to shortcuts in the algorithm, the comparison stops after the first character and is quite efficient.
Donal Fellows:
if {[string index $mystring 0] eq $mychar} {
...
This invocation compares the first character of $mystring, taken as a one-character string, with the string $mychar. It uses the eq operator, which always compares its operands as strings, whereas == first tries a numeric comparison. eq was added in Tcl 8.4; older versions of Tcl only have ==.
Another way to construct a string consisting of the first character in $mystring is by invoking string range $mystring 0 0.
Mark Kadlec:
if {[string first $mychar $mystring] == 0} {
...
This invocation searches the string $mystring for the first occurrence of the character $mychar. If it finds any, it returns with the index where the character was found. This index number is then compared to 0. If they are equal the first character of $mystring was $mychar.
This solution is rather inefficient in the worst case, where $mystring is long and $mychar does not occur in it. The command will then examine the whole string even though only the first character is of interest.
One more string-based solution:
if {[string compare -length 1 $mychar $mystring] == 0} {
...
This invocation compares the first n characters of both strings (n being hardcoded to 1 here): if there is a difference the command will return -1 or 1 (depending on alphabetical order), and if they are equal 0 will be returned.
Another solution is to use a regular expression match:
if {[regexp -- ^$mychar.* $mystring]} {
...
This solution is similar to the string match solution above, but uses regular expression syntax rather than glob syntax. Don't forget the ^ anchor, otherwise the invocation will return true if $mychar occurs anywhere in $mystring.
Documentation: eq and ==, regexp, string
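For comparison, here is a rough Python analogue of the same four ideas (index/slice, prefix test, search-and-compare-to-0, anchored regex). This is only a sketch to show how the approaches differ, not Tcl code; the function name first_char_checks is made up for the example, and c is assumed to be a single character.

```python
import re

def first_char_checks(s, c):
    """Test whether s starts with the single character c, four different ways."""
    return {
        # like [string index $mystring 0] eq $mychar; s[:1] is "" for an empty s
        "slice": s[:1] == c,
        # like a prefix match: only the start of s is examined
        "startswith": s.startswith(c),
        # like [string first ...] == 0: may scan all of s before giving up
        "find": s.find(c) == 0,
        # like the anchored regexp; re.match anchors at the start,
        # and re.escape keeps a special character such as "*" literal
        "regex": re.match(re.escape(c), s) is not None,
    }
```

All four agree on the result; the difference is in how much of the string may be inspected and in whether special characters in c need escaping.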

if { [string first e $yourString] == 0 } {
...

set mychar "e"
if { [string first $mychar $myString] == 0 } {
....

Related

What is the difference between string "match" and string "equal" in TCL

In Tcl, what is the difference between string "match" and string "equal"?
They look almost the same to me, so I cannot tell what the difference is.
string equal compares two strings character by character and returns 1 if they both contain the same characters (case sensitive: can be overridden).
string match compares a string against a glob-style pattern and returns 1 if the string matches the pattern.
In a degenerate case, a string match with only non-special characters in the pattern is equivalent to a string equal.
Documentation:
string
Syntax of Tcl string matching:
* matches a sequence of zero or more characters
? matches a single character
[chars] matches a single character in the set given by chars (^ does not negate; a range can be given as a-z)
\x matches the character x, even if that character is special (one of *?[]\)
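The difference between exact comparison and glob matching can be sketched in Python with fnmatch.fnmatchcase. Note the assumption: Python's glob dialect is only similar to Tcl's (for instance, Python negates a set with [!chars] and has no backslash escape), so this illustrates the idea rather than Tcl's exact rules. The helper name compare is made up for the example.

```python
from fnmatch import fnmatchcase

def compare(string, pattern):
    """Return (exact_equality, glob_match) for a string and a pattern."""
    return string == pattern, fnmatchcase(string, pattern)

# Degenerate case: no special characters, so both tests agree.
print(compare("example", "example"))
# Patterns with wildcards match without being equal to the string.
print(compare("example", "e*"))
print(compare("example", "ex?mple"))
print(compare("example", "[a-f]xample"))
```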
This was already answered in:
TCL string match vs regexps
Regular expressions are slower than the basic string commands, so you should avoid regexp for a plain equality check.

find number of repeating substrings in a string

I am looking for an algorithm that will find the number of repeating substrings in a single string.
For this, I was looking for some dynamic programming algorithms but didn't find any that would help me. I just want some tutorial on how to do this.
Let's say I have a string ABCDABCDABCD. The expected output for this would be 3, because there is ABCD 3 times.
For input AAAA, output would be 4, since A is repeated 4 times.
For input ASDF, output would be 1, since every individual character is repeated 1 time only.
I hope that someone can point me in the right direction. Thank you.
I am taking the following assumptions:
The repeating substrings must be consecutive. That is, in case of ABCDABC, ABC would not count as a repeating substring, but it would in case of ABCABC.
The repeating substrings must be non-overlapping. That is, in case of ABABA, ABA would not count as a repeating substring, because its two occurrences share the middle A.
In case of multiple possible answers, we want the one with the maximum value. That is, in the case of AAAA, the answer should be 4 (A is the substring) rather than 2 (AA is the substring).
Under these assumptions, the algorithm is as follows:
Let the input string be denoted as inputString.
Calculate the KMP failure function array for the input string. Let this array be denoted as failure[]. This operation is of linear time complexity with respect to the length of the string. By definition, failure[i] denotes the length of the longest proper prefix of the substring inputString[0....i] that is also a proper suffix of the same substring.
Let len = inputString.length - failure[inputString.length - 1], that is, the string length minus the last value in the failure array. At this point, we know that if there is any repeating string at all, then it has to be of this length len. But we'll need to check for that. First, check whether len perfectly divides inputString.length (that is, inputString.length % len == 0). If yes, then check whether every consecutive (non-overlapping) substring of len characters is the same; this operation is again of linear time complexity with respect to the length of the input string.
If it turns out that every consecutive non-overlapping substring is the same, then the answer is inputString.length / len. Otherwise, there is no such repeating substring and the answer is simply 1 (the whole string occurs once), which matches the expected output of 1 for ASDF in the question.
The overall time complexity would be O(n), where n is the number of characters in the input string.
A sample code for calculating the KMP failure array is given here.
For example,
Let the input string be abcaabcaabca.
Its KMP failure array would be - [0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8].
So, our len = (12 - 8) = 4.
And every consecutive non-overlapping substring of length 4 is the same (abca).
Therefore the answer is 12/4 = 3. That is, abca occurs 3 times in a row.
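The steps above can be sketched in Python as follows. This is a minimal sketch, not the code the answer links to; the names failure_function and repeat_count are made up here, and returning 1 when no shorter block tiles the string is an assumption chosen to match the ASDF case from the question.

```python
def failure_function(s):
    """failure[i] = length of the longest proper prefix of s[:i+1]
    that is also a suffix of s[:i+1]."""
    failure = [0] * len(s)
    k = 0  # length of the prefix matched so far
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = failure[k - 1]  # fall back to the next shorter candidate
        if s[i] == s[k]:
            k += 1
        failure[i] = k
    return failure


def repeat_count(s):
    """How many times the shortest repeating block tiles s (1 if none does)."""
    n = len(s)
    if n == 0:
        return 0
    length = n - failure_function(s)[-1]  # candidate block length
    if n % length != 0:
        return 1
    block = s[:length]
    # Explicit linear check, as described in the answer.
    if all(s[i:i + length] == block for i in range(0, n, length)):
        return n // length
    return 1


print(failure_function("abcaabcaabca"))  # [0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8]
print(repeat_count("abcaabcaabca"))      # 3
```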
The solution for this with C# is:
using System;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    public static string CountOfRepeatedSubstring(string str)
    {
        if (str.Length < 2)
        {
            return "-1";
        }
        StringBuilder substr = new StringBuilder();
        // The repeating substring cannot be longer than half of the whole string
        for (int i = 0; i < str.Length / 2; i++)
        {
            // Grow the candidate substring by one character per iteration
            substr.Append(str[i]);
            string candidate = substr.ToString();
            // Remove every occurrence of the candidate from the whole string.
            // If nothing is left, the string is made only of that substring.
            string clearedOfNewSubstrings = str.Replace(candidate, "");
            if (clearedOfNewSubstrings.Length == 0)
            {
                // Count the occurrences; Regex.Escape keeps characters such as '.' or '*' literal
                var countOccurrences = Regex.Matches(str, Regex.Escape(candidate)).Count;
                return countOccurrences.ToString();
            }
        }
        return "-1";
    }

    static void Main(string[] args)
    {
        // Input: "abcdaabcdaabcda"
        // Output: 3
        // Input: "barrybarrybarry"
        // Output: 3
        var s = "asdf"; // Output will be -1
        Console.WriteLine(CountOfRepeatedSubstring(s));
    }
}
How do you want to specify the "repeating string"? Is it simply the first group of characters up until either a) the first character is found again, b) the pattern begins to repeat, or c) some other criteria?
So, if your string is "ABBAABBA", is that a 2 because "ABBA" repeats twice or is it 1 because you have "ABB" followed by "AAB"? What about "ABCDABCE" -- does "ABC" count (despite the "D" in between repetitions?) In "ABCDABCABCDABC", is the repeating string "ABCD" (1) or "ABCDABC" (2)?
What about "AAABBAAABB" -- is that 3 ("AAA") or 2 ("AAABB")?
If the end of the repeating string is another instance of the first letter, it's pretty simple:
Work your way through the string character by character, putting each character into another variable as you go, until the next character matches the first one. Then, given the length of the substring in your second variable, check the next bit of your string to see if it matches. Continue until it doesn't match or you hit the end of the string.
If you just want to find any length pattern that repeats regardless of whether the first character is repeated within the pattern, it gets more complicated (but, fortunately, it's the sort of thing computers are good at).
You'll need to go character by character building a pattern in another variable as above, but you'll also have to watch for the first character to reappear and start building a second substring as you go, to see if it matches the first. This should probably go in an array as you might encounter a third (or more) instance of the first character which would trigger the need to track yet another possible match.
It's not difficult but there is a lot to keep track of and it's a rather annoying problem. Is there a particular reason you're doing this?

split string by char

Scala has a standard way of splitting a string: StringOps.split.
Its behaviour somewhat surprised me, though.
To demonstrate, using the quick convenience function
def sp(str: String) = str.split('.').toList
the following expressions all evaluate to true
(sp("") == List("")) //expected
(sp(".") == List()) //I would have expected List("", "")
(sp("a.b") == List("a", "b")) //expected
(sp(".b") == List("", "b")) //expected
(sp("a.") == List("a")) //I would have expected List("a", "")
(sp("..") == List()) // I would have expected List("", "", "")
(sp(".a.") == List("", "a")) // I would have expected List("", "a", "")
So I expected that split would return an array with (the number of separator occurrences) + 1 elements, but that's clearly not the case.
The behaviour is almost that, with all trailing empty strings removed, but even this does not hold for splitting the empty string.
I'm failing to identify the pattern here. What rules does StringOps.split follow?
For bonus points, is there a good way (without too much copying/string appending) to get the split I'm expecting?
For the curious, you can find the code here: https://github.com/scala/scala/blob/v2.12.0-M1/src/library/scala/collection/immutable/StringLike.scala
See the split function with the character as an argument (line 206).
I think the general pattern here is that all trailing empty split results are ignored.
The exception is the first case, where the "if no separator char is found then just return the whole string" logic is applied.
I am trying to find out whether there is any design documentation around this.
Also, if you use a string instead of a char for the separator, it falls back to the Java regex split. As mentioned by @LRLucena, if you provide the limit parameter with a value larger than the size, you will get your trailing empty results. See http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String,%20int)
You can use split with a regular expression. The second parameter (limit) caps the size of the resulting array; with a positive limit, trailing empty strings are kept.
def sp(str: String) = str.split("\\.", str.length+1).toList
Seems to be consistent with these three rules:
1) Trailing empty substrings are dropped.
2) An empty substring is considered trailing before it is considered leading, if applicable.
3) First case, with no separators is an exception.
split follows the behaviour of http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
That is split "around" the separator character, with the following exceptions:
Regardless of anything else, splitting the empty string will always give Array("")
Any trailing empty substrings are removed
Surrogate characters only match if the matched character is not part of a surrogate pair.
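The first two exceptions are easy to model. Below is a small Python sketch (Python rather than Scala, so it is a model of the observed behaviour and not the library code) for a single-character separator; the function name java_style_split is made up for the example. Python's own str.split keeps every piece, which is the "(separators + 1) elements" behaviour the question expected.

```python
def java_style_split(s, sep):
    """Model of the observed split behaviour for a one-character separator."""
    if sep not in s:
        return [s]          # no separator found: the whole string, even if empty
    parts = s.split(sep)    # keeps all pieces, including empty ones
    while parts and parts[-1] == "":
        parts.pop()         # drop trailing empty strings
    return parts


for text in ["", ".", "a.b", ".b", "a.", "..", ".a."]:
    print(repr(text), java_style_split(text, "."), text.split("."))
```

The second column reproduces the results listed in the question; the third shows the split the question was hoping for.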

What does this code do? (awk)

Here is a part of my awk code that parses a file, but the output is not 100% what I want.
match($0,/root=[^,]*/){
n=split(substr($0,RSTART+5,RLENGTH-5),N,/:/)
My problem is that I cannot tell with 100% certainty what this piece of code is doing.
Can someone tell me exactly what these two lines do?
EDIT:
I just want to know what the code does so I can fix it myself, so please do not ask things like "what does the file you parse look like?".
match(s, r [, a])
Returns the position in s where the regular expression r occurs, or 0
if r is not present, and sets the values of RSTART and RLENGTH. Note
that the argument order is the same as for the ~ operator: str ~ re.
If array a is provided, a is cleared and then elements 1 through n are
filled with the portions of s that match the corresponding
parenthesized subexpression in r. The 0'th element of a contains the
portion of s matched by the entire regular expression r. Subscripts
a[n, "start"], and a[n, "length"] provide the starting index in the
string and length respectively, of each matching substring.
substr(s, i [, n])
Returns the at most n-character substring of s starting at i. If n is
omitted, the rest of s is used.
split(s, a [, r])
Splits the string s into the array a on the regular expression r, and
returns the number of fields. If r is omitted, FS is used instead. The
array a is cleared first. Splitting behaves identically to field
splitting, described above.
So when match finds something that matches /root=[^,]*/ in the line ($0) it will return that position (non-zero integers are truth-y for awk) and the action will execute.
The action then uses RSTART and RLENGTH as set by match to get the substring of the line that matched (minus root= because of the +5/-5) and then splits that into the array N on : and saves the number of fields split into n.
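That could probably be changed to match($0, /root=([^,]*)/, M) as the pattern (the three-argument form of match is a gawk extension) and then use M[1], which holds the text captured after root=, in the action instead of the substr call, if you wanted.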
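To make the index arithmetic concrete, here is a Python sketch that mimics the two awk lines step by step. The sample input line and the function name root_fields are invented for illustration; the real file format is not known.

```python
import re

def root_fields(line):
    """Mimic: match($0, /root=[^,]*/) { n = split(substr($0, RSTART+5, RLENGTH-5), N, /:/) }"""
    m = re.search(r"root=[^,]*", line)
    if m is None:
        return None                    # match() returned 0: the action does not run
    rstart = m.start() + 1             # awk positions are 1-based
    rlength = m.end() - m.start()
    begin = (rstart + 5) - 1           # skip the 5 characters of "root=", back to 0-based
    value = line[begin:begin + (rlength - 5)]
    # awk's split returns 0 fields for an empty string
    return value.split(":") if value else []


# Hypothetical input line, made up for this example
print(root_fields("console=tty0,root=/dev/sda1:ext4:rw,quiet"))
```

The length of the returned list corresponds to n, and its elements to N[1], N[2], and so on.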

Check whether a string contains a substring

How can I check whether a given string contains a certain substring, using Perl?
More specifically, I want to see whether s1.domain.example is present in the given string variable.
To find out if a string contains substring you can use the index function:
if (index($str, $substr) != -1) {
print "$str contains $substr\n";
}
It will return the position of the first occurrence of $substr in $str, or -1 if the substring is not found.
Another possibility is to use regular expressions which is what Perl is famous for:
if ($mystring =~ /s1\.domain\.example/) {
print qq("$mystring" contains "s1.domain.example"\n);
}
The backslashes are needed because a . can match any character. You can get around this by using the \Q and \E operators.
my $substring = "s1.domain.example";
if ($mystring =~ /\Q$substring\E/) {
print qq("$mystring" contains "$substring"\n);
}
Or, you can do as eugene y stated and use the index function.
Just a word of warning: index returns -1 when it can't find a match, instead of undef or 0.
Thus, this is an error:
my $substring = "s1.domain.example";
if (not index($mystring, $substring)) {
print qq("$mystring" doesn't contain "$substring"\n);
}
This will be wrong if s1.domain.example is at the beginning of your string: index then returns 0, which is false, so the message is printed even though the substring is there. When the substring is absent, index returns -1, which is true, so nothing is printed. I've personally been burned by this more than once.
Case Insensitive Substring Example
This is an extension of Eugene's answer, which converts the strings to lower case before checking for the substring:
if (index(lc($str), lc($substr)) != -1) {
print "$str contains $substr\n";
}
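The same -1 pitfall exists in other languages. As an illustration only (Python rather than Perl, with made-up function names), str.find also returns -1 when nothing is found and 0 for a match at the very beginning, so testing its result for truth goes wrong in the same way:

```python
def contains(haystack, needle):
    """Correct: compare the position against -1 (same as `needle in haystack`)."""
    return haystack.find(needle) != -1

def contains_ci(haystack, needle):
    """Case-insensitive variant: lower-case both sides first."""
    return haystack.lower().find(needle.lower()) != -1

def contains_buggy(haystack, needle):
    """Wrong on purpose: 0 (match at the start) is falsy and -1 (no match) is truthy."""
    return bool(haystack.find(needle))


print(contains("s1.domain.example/index.html", "s1.domain.example"))        # True
print(contains_buggy("s1.domain.example/index.html", "s1.domain.example"))  # False
print(contains_buggy("other.example", "s1.domain.example"))                 # True
```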

Resources