Use scanf to split a string on a non-whitespace separator - string

I want to scan a string that uses a colon as a separator and save the two parts in a tuple.
For example:
input: "a:b"
output: ("a", "b")
My approach so far keeps getting the error message:
"scanf: bad input at char number 9: looking for ':', found '\n'".
Scanf.bscanf Scanf.Scanning.stdin "%s:%s" (fun x y -> (x,y));;
Additionally, my approach works with integers, I'm confused why it is not working with strings.
Scanf.bscanf Scanf.Scanning.stdin "%d:%d" (fun x y -> (x,y));;
4:3
- : int * int = (4, 3)

The reason for the issue you're seeing is that the first %s is going to keep consuming input until one of the following conditions holds:
a whitespace has been found,
a scanning indication has been encountered,
the end-of-input has been reached.
Note that seeing a colon isn't going to satisfy any of these (if you don't use a scanning indication). This means that the first %s is going to consume everything up to, in your case, the newline character in the input buffer, and then the : is going to fail.
You don't have this same issue for %d:%d because %d isn't going to consume the colon as part of matching an integer.
You can fix this by instead using a format string which will not consume the colon, e.g., %[^:]:%s. You could also use a scanning indication, like so: %s@:%s.
Additionally, your current method won't consume any trailing whitespace in the buffer, which might result in newlines being added to the first element on subsequent use of this, so you might prefer %s@:%s\n to consume the newline.
So, in all,
Scanf.bscanf Scanf.Scanning.stdin "%s@:%s\n" (fun x y -> (x,y));;
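The stopping rule described above can be sketched as a toy model in Python. This is only an illustration of the three stop conditions, not how OCaml's Scanf is implemented; the names scan_token and split_pair are hypothetical, and the optional stop character stands in for the scanning indication.

```python
def scan_token(text, start=0, stop_at=None):
    """Toy model of %s: read until whitespace, end of input, or `stop_at`."""
    i = start
    while i < len(text) and not text[i].isspace() and text[i] != stop_at:
        i += 1
    return text[start:i], i


def split_pair(text, use_indication):
    """Model '<string>:<string>' with or without a stop character on the first token."""
    first, pos = scan_token(text, 0, ":" if use_indication else None)
    # After the first token, the format requires a literal ':'.
    if pos >= len(text) or text[pos] != ":":
        found = repr(text[pos]) if pos < len(text) else "end of input"
        raise ValueError(f"looking for ':', found {found}")
    second, _ = scan_token(text, pos + 1)
    return first, second
```

With use_indication=False the first token swallows "a:b" and the literal colon is then compared against the newline, which mirrors the reported error; with use_indication=True the first token stops just before the colon.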

The %s specifier is greedy and will read the string up to whitespace or a scanning indication. The indication is specified using @<indicator> just after the %s specifier, where <indicator> is a single character, e.g.,
let split str =
  Scanf.sscanf str "%s@:%s" (fun x y -> x,y)
This will instruct scanf to read everything up to : into the first string, drop : and then read the rest into the second string.

The string specifier %s is eager by default and will swallow all your content until the next space. You need to add a scanning indication (https://ocaml.org/api/Scanf.html#indication) to tell Scanf.sscanf that you expect the first string to end at the first ':'.
For instance,
Scanf.sscanf "a:b"
  "%s@:%s"
  (fun x y -> x,y)
returns ("a", "b"). Here the scanning indication is the @: just after the first %s specifier. In general, scanning indications are written @c for a character c.

Related

return only chars from the string in python

I want to extract only the alphabetic characters from a given string, but my code does exactly the opposite:
s= "A man, a plan, a canal: Panama"
newS = ''.join(re.findall("[^a-zA-Z]*", s))
print(newS)  # my output: , , :
Expected output:
"A man a plan a canal Panama"
Your regular expression is inverting the match - that's what the caret symbol (^) does inside square brackets (negated character class). You first need to remove that.
Next, you should be matching a sequence of one or more characters (+) rather than zero or more characters (*) -- using * will match the empty string, which you don't want in this case.
Finally your join should join with a space to get the intended output, rather than an empty string -- which won't retain the spaces between the words.
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
Though not essential in this case, it's advised to use raw strings (the r prefix) for regular expressions. More in this post.
Full working code:
import re
s = 'A man, a plan, a canal: Panama'
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
print(newS)
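To see the three points of the explanation side by side, here is a small sketch on a shorter, made-up sample string ("ab, c" is an assumption chosen for brevity, not taken from the question).

```python
import re

sample = "ab, c"

# Zero-or-more: the pattern can match the empty string, so empty
# matches show up in the result alongside the letter runs.
star_matches = re.findall(r"[a-zA-Z]*", sample)

# One-or-more: only non-empty runs of letters are returned.
plus_matches = re.findall(r"[a-zA-Z]+", sample)

# Negated class: returns the runs that are NOT letters.
negated = re.findall(r"[^a-zA-Z]+", sample)

print(star_matches)
print(plus_matches)
print(negated)
```

The exact list produced by the `*` variant can differ between Python versions in how empty matches are reported, which is one more reason to prefer `+` here.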

Find the minimal lexicographical string formed by merging two strings

Suppose we are given two strings s1 and s2 (both lowercase). We have to find the lexicographically minimal string that can be formed by merging the two strings.
At first it looks pretty simple, like the merge step of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
Now if we perform a merge on these two, we must decide which z to pick, as they are equal. Clearly, if we pick the z of s2 first, the string formed will be:
zyzyy
If we pick z of s1 first, the string formed will be:
zyyzy which is correct.
As we can see, the merge of mergesort can lead to a wrong answer.
Here's another example:
s1:zyy
s2:zyb
Now the correct answer is zybzyy, which is obtained only if we pick the z of s2 first.
There are plenty of other cases in which the simple merge will fail. My question is: is there any standard algorithm for performing a merge that produces such output?
You could use dynamic programming. In f[x][y] store the minimal lexicographical string such that you've taken x characters from the first string s1 and y characters from the second string s2. You can calculate f in a bottom-up manner using the update (with s1 and s2 indexed from 1):
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y]) // the '+' here represents
// the concatenation of a
// string and a character
You start with f[0][0] = "" (empty string).
For efficiency you can store the strings in f as references. That is, you can store in f the objects
class StringRef {
StringRef prev;
char c;
}
To extract the string at a certain f[x][y] you just follow the references. To update, you point back to either f[x-1][y] or f[x][y-1], depending on what your update step says.
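A minimal sketch of this recurrence in Python, storing plain strings in the table rather than the reference objects suggested above (a simplification that costs extra memory and copying). The function name merge_min_dp is hypothetical, and the strings are indexed from 0 here.

```python
def merge_min_dp(s1, s2):
    """Lexicographically smallest merge of s1 and s2 via the f[x][y] table."""
    n, m = len(s1), len(s2)
    # f[x][y] = smallest string using the first x chars of s1 and first y of s2.
    f = [[""] * (m + 1) for _ in range(n + 1)]
    for x in range(n + 1):
        for y in range(m + 1):
            if x == 0 and y == 0:
                continue  # f[0][0] stays the empty string
            candidates = []
            if x > 0:
                candidates.append(f[x - 1][y] + s1[x - 1])
            if y > 0:
                candidates.append(f[x][y - 1] + s2[y - 1])
            f[x][y] = min(candidates)
    return f[n][m]


print(merge_min_dp("zyy", "zy"))   # zyyzy
print(merge_min_dp("zyy", "zyb"))  # zybzyy
```

Both candidates for a cell have the same length, so taking the smaller of the two is a fair comparison at every step.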
It seems that the solution can be almost the same as you described (the "mergesort"-like approach), except that with special handling of equality. So long as the first characters of both strings are equal, you look ahead at the second character, 3rd, etc. If the end is reached for some string, consider the first character of the other string as the next character in the string for which the end is reached, etc. for the 2nd character, etc. If the ends for both strings are reached, then it doesn't matter from which string to take the first character. Note that this algorithm is O(N) because after a look-ahead on equal prefixes you know the whole look-ahead sequence (i.e. string prefix) to include, not just one first character.
EDIT: you look ahead so long as the current i-th characters from both strings are equal and alphabetically not larger than the first character in the current prefix.
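For comparison, here is a sketch of a related greedy merge. It is not the exact look-ahead procedure described in this answer but a commonly used variant, swapped in because it is short to state: append a sentinel that sorts after every letter to both strings, then always take the next character from whichever remaining suffix compares smaller. The name merge_min_greedy is hypothetical, and the sentinel choice assumes the inputs contain only lowercase a-z. With naive suffix comparison this is quadratic in the worst case, not O(N).

```python
def merge_min_greedy(s1, s2):
    """Greedy merge: take from the string whose remaining suffix is smaller."""
    # Assumption: inputs are lowercase a-z, so '{' (the character right
    # after 'z') sorts after every letter and marks "end of string".
    sentinel = "{"
    a, b = s1 + sentinel, s2 + sentinel
    i = j = 0
    out = []
    while i < len(s1) or j < len(s2):
        if a[i:] <= b[j:]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return "".join(out)


print(merge_min_greedy("zyy", "zy"))   # zyyzy
print(merge_min_greedy("zyy", "zyb"))  # zybzyy
```

The sentinel is what handles the tie in the first example: "zyy{" compares smaller than "zy{", so the z of s1 is taken first.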

split string by char

Scala has a standard way of splitting a string in StringOps.split.
Its behaviour somewhat surprised me, though.
To demonstrate, using the quick convenience function
def sp(str: String) = str.split('.').toList
the following expressions all evaluate to true
(sp("") == List("")) //expected
(sp(".") == List()) //I would have expected List("", "")
(sp("a.b") == List("a", "b")) //expected
(sp(".b") == List("", "b")) //expected
(sp("a.") == List("a")) //I would have expected List("a", "")
(sp("..") == List()) // I would have expected List("", "", "")
(sp(".a.") == List("", "a")) // I would have expected List("", "a", "")
So I expected that split would return an array with (the number of separator occurrences) + 1 elements, but that's clearly not the case.
It is almost that, with all trailing empty strings removed, but that doesn't hold for splitting the empty string.
I'm failing to identify the pattern here. What rules does StringOps.split follow?
For bonus points, is there a good way (without too much copying/string appending) to get the split I'm expecting?
For the curious, you can find the code here: https://github.com/scala/scala/blob/v2.12.0-M1/src/library/scala/collection/immutable/StringLike.scala
See the split function that takes a character as its argument (line 206).
I think the general pattern here is that all trailing empty split results are ignored.
The exception is the first case, where the "if no separator char is found then just return the whole string" logic applies.
I am trying to find out whether there is any design documentation around this.
Also, if you use a string instead of a char as the separator, it falls back to Java's regex split. As mentioned by @LRLucena, if you provide the limit parameter with a value larger than the size, you will get your trailing empty results. See http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String,%20int)
You can use split with a regular expression. I'm not sure, but I guess that the second parameter is the largest size of the resulting array.
def sp(str: String) = str.split("\\.", str.length+1).toList
Seems to be consistent with these three rules:
1) Trailing empty substrings are dropped.
2) An empty substring is considered trailing before it is considered leading, if applicable.
3) First case, with no separators is an exception.
split follows the behaviour of http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
That is split "around" the separator character, with the following exceptions:
Regardless of anything else, splitting the empty string will always give Array("")
Any trailing empty substrings are removed
Surrogate characters only match if the matched character is not part of a surrogate pair.
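The first two rules can be checked against the question's examples with a small emulation in Python. This is a sketch of the rules as stated in these answers, not a port of the Scala or Java implementation; split_like_scala is a hypothetical name, and only a single-character separator is modelled.

```python
def split_like_scala(s, sep):
    """Emulate the observed rules for a single-character separator."""
    parts = s.split(sep)  # Python keeps every empty piece
    if len(parts) == 1:
        # No separator found: the whole string comes back, even if empty.
        return parts
    while parts and parts[-1] == "":
        parts.pop()  # drop trailing empty substrings
    return parts


for text in ["", ".", "a.b", ".b", "a.", "..", ".a."]:
    print(repr(text), split_like_scala(text, "."), text.split("."))
```

Python's own str.split with an explicit separator happens to return the (number of separators) + 1 pieces the question expected, which is why it is printed next to the emulated result.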

What does this code do? (awk)

Here I have part of my awk code for parsing a file, but the output is not 100% what I want.
match($0,/root=[^,]*/){
n=split(substr($0,RSTART+5,RLENGTH-5),N,/:/)
My problem is that I can't tell with 100% certainty what this piece of code is doing...
Can someone tell me exactly what these two lines do?
EDIT:
I just want to know what the code does so I can fix it myself, so please don't ask things like "what does the file you parse look like?"
match(s, r [, a])
Returns the position in s where the regular expression r occurs, or 0
if r is not present, and sets the values of RSTART and RLENGTH. Note
that the argument order is the same as for the ~ operator: str ~ re.
If array a is provided, a is cleared and then elements 1 through n are
filled with the portions of s that match the corresponding
parenthesized subexpression in r. The 0'th element of a contains the
portion of s matched by the entire regular expression r. Subscripts
a[n, "start"], and a[n, "length"] provide the starting index in the
string and length respectively, of each matching substring.
substr(s, i [, n])
Returns the at most n-character substring of s starting at i. If n is
omitted, the rest of s is used.
split(s, a [, r])
Splits the string s into the array a on the regular expression r, and
returns the number of fields. If r is omitted, FS is used instead. The
array a is cleared first. Splitting behaves identically to field
splitting, described above.
So when match finds something that matches /root=[^,]*/ in the line ($0) it will return that position (non-zero integers are truth-y for awk) and the action will execute.
The action then uses RSTART and RLENGTH as set by match to get the substring of the line that matched (minus root= because of the +5/-5) and then splits that into the array N on : and saves the number of fields split into n.
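That could probably be changed to match($0, /root=([^,]*)/, M) as the pattern (the three-argument form is a gawk extension) and then use M[1], the captured text, in the action instead of substr if you wanted. A separate array name is used for the captures because split clears its target array N.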
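The same three steps (match, strip the root= prefix, split on colons) can be mirrored in Python as a rough sketch. The function name root_fields and the sample line are made up for illustration; the original input file is not shown in the question.

```python
import re


def root_fields(line):
    """Mirror the awk snippet: find root=..., drop 'root=', split on ':'."""
    m = re.search(r"root=[^,]*", line)  # same pattern as in match()
    if m is None:
        return None  # awk would simply skip the action for this line
    value = m.group(0)[len("root="):]  # what substr(..., RSTART+5, RLENGTH-5) yields
    # awk's split returns 0 fields for an empty string; Python's would give [''].
    fields = value.split(":") if value else []
    return len(fields), fields


# Hypothetical input line, only to show the shape of the result.
print(root_fields("console=tty0,root=/dev/sda1:ext4:rw,quiet"))
```

Unlike awk's array N, the Python list is indexed from 0.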

Error when take two numbers out of a string

I'm just playing around with Lua trying to make a calculator that uses string manipulation. Basically I take two numbers out of a string, then do something to them (+ - * /). I can successfully take a number out of x, but taking a number out of y always returns nil. Can anyone help?
local x = "5 * 75"
function calculate(s)
local x, y =
tonumber(s:sub(1, string.find(s," ")-1)),
tonumber(s:sub(string.find(s," ")+3), string.len(s))
return x * y
end
print(calculate(x))
You have a simple misplaced parenthesis, sending string.len to tonumber instead of sub.
local x, y =
tonumber(s:sub(1, string.find(s," ")-1)),
tonumber(s:sub(string.find(s," ")+3, string.len(s)))
You actually don't need the string.len, as end of string is the default value for sub if nothing is given.
EDIT:
You can actually do what you want to do way shorter by using string.match instead.
local x,y = string.match(s,"(%d+).-(%d+)")
string.match tries to match the string against the given pattern and returns the captured values, in this case the numbers. This pattern translates to "one or more digits, then as few as possible of any character, then one or more digits". %d is one digit and + means one or more; . means any character and - means as few as possible. The values within the parentheses are captured, which means that they are returned.
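For readers more used to regular expressions, the same capture idea looks roughly like this in Python, where \d plays the role of %d and the lazy .*? plays the role of Lua's .- (a sketch for comparison only; the multiplication is hard-coded as in the original calculate function).

```python
import re


def calculate(s):
    """Capture two integers separated by anything, then multiply them."""
    m = re.search(r"(\d+).*?(\d+)", s)
    if m is None:
        raise ValueError("expected two numbers in the input")
    x, y = int(m.group(1)), int(m.group(2))
    return x * y


print(calculate("5 * 75"))  # 375
```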

Resources