Use of "("?) in Lua with string.find and the value that it returns - string

a, i, c = string.find(s, '"("?)', i + 1)
What is the role of ? here? I believe it is checking for double quotes, but I do not really understand the exact use of "("?).
I read that string.find returns the starting and ending index of the matched pattern. But in the line of code above, three values (a, i and c) are being returned. What is the third value being returned here?

? matches an optional character, i.e., zero or one occurrence of a character. So the pattern "("?) matches a ", followed by an optional ", i.e., it matches either " or "". Note that the match for "? (zero or one ") is captured.
As for the return value of string.find(), the reference manual entry for string.find() says:
If the pattern has captures, then in a successful match the captured values are also returned, after the two indices.
The capture is the third return value, when there is a successful match.
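The three return values can also be illustrated outside Lua. The sketch below uses Python's re module, where the same pattern text happens to be valid. find_quote is a hypothetical helper, not part of any library; it converts Python's 0-based, end-exclusive offsets into Lua-style 1-based, inclusive indices, and its start argument is a 0-based offset, unlike Lua's init.

```python
import re

# Same pattern text as in the Lua call: a quote, then an optional captured quote.
QUOTE = re.compile(r'"("?)')

def find_quote(s, start=0):
    """Hypothetical helper mimicking the three values Lua's string.find returns."""
    m = QUOTE.search(s, start)
    if m is None:
        return None
    # Lua's start index is 1-based; its inclusive end index equals m.end().
    return m.start() + 1, m.end(), m.group(1)

print(find_quote('say "hi"'))  # single quote: the capture is the empty string
print(find_quote('a""b'))      # doubled quote: the capture is one quote character
```

In both cases the third value is the capture, which is empty for a lone quote and a quote character for a doubled one.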

Related

Why doesn't this RegEx match anything?

I've been trying for about two hours now to write a regular expression which matches a single character that's not preceded or followed by the same character.
This is what I've got: (\d)(?<!\1)\1(?!\1); but it doesn't seem to work! (testing at https://regex101.com/r/whnj5M/6)
For example:
In 1111223 I would expect to match the 3 at the end, since it's not preceded or followed by another 3.
In 1151223 I would expect to match the 5 in the middle, and the 3 at the end for the same reasons as above.
The end goal for this is to be able to find pairs (and only pairs) of characters in strings (e.g. to find 11 in 112223 or 44 in 123544) and I was going to try and match single isolated characters, and then add a {2} to it to find pairs, but I can't even seem to get isolated characters to match!
Any help would be much appreciated, I thought I knew RegEx pretty well!
P.S. I'm testing in JS on regex101.com because it wouldn't let me use variable length lookbacks in Python on there, and I'm using the regex library to allow for this in my actual implementation.
Your regex is close, but by using simply (\d) you are consuming characters, which prevents the other match from occurring. Instead, you can use a positive lookahead to set the capture group and then test for any occurrences of the captured digit not being surrounded by copies of itself:
(?=.*?(.))(?<!\1)\1(?!\1)
By using a lookahead you avoid consuming any characters and so the regex can match anywhere in the string.
Note that in 1151223 this returns 5, 1 and 3 because the third 1 is not adjacent to any other 1s.
Demo on regex101 (requires JS that supports variable width lookbehinds)
The pattern you tried does not match because this part (\d)(?<!\1) can not match.
It reads as:
Capture a digit in group 1. Then, on the position after that captured
digit, assert what is captured should not be on the left.
You could make the pattern work by adding, for example, a dot after the backreference, (?<!\1.), to assert that the character before the one you have just matched is not the same as group 1.
Pattern
(\d)(?<!\1.)\1(?!\1)
Regex demo | Python demo
Note that you have selected ECMAScript on regex101.
Python's re module does not support variable-width lookbehind.
To make this work in Python, you need the regex module from PyPI.
Example code
import regex  # third-party module from PyPI, not the standard re

pattern = r"(\d)(?<!\1.)\1(?!\1)"
test_str = ("1111223\n"
            "1151223\n\n"
            "112223\n"
            "123544")

# Print every pair of identical digits that is not part of a longer run.
for match in regex.finditer(pattern, test_str):
    print(match.group())
Output
22
11
22
11
44
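The output above can be cross-checked without the third-party module. The following is a sketch using itertools.groupby from the standard library, which is a different technique from the lookbehind pattern: it groups each line into runs of identical characters and keeps the runs of length exactly two. The name exact_pairs is made up for illustration, and unlike (\d) it accepts any character, not just digits.

```python
from itertools import groupby

def exact_pairs(text):
    """Return runs of exactly two identical characters, line by line."""
    pairs = []
    for line in text.splitlines():
        for char, run in groupby(line):
            if len(list(run)) == 2:
                pairs.append(char * 2)
    return pairs

test_str = "1111223\n1151223\n\n112223\n123544"
print(exact_pairs(test_str))  # ['22', '11', '22', '11', '44']
```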
@Theforthbird has provided a good explanation for why your regular expression does not match the characters of interest.
Each character matched by the following regular expression is neither preceded nor followed by the same character (including characters at the beginning and end of the string).
r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)'
Demo
Python's re regex engine performs the following operations.
^.$ match the first char if it is the only char in the line
| or
^ match beginning of line
(.) match a char in capture group 1...
(?!\1) ...that is not followed by the same character
| or
(?<=(.)) save the previous char in capture group 2...
(?!\2) ...that is not equal to the next char
(.) match a character and save to capture group 3...
(?!\3) ...that is not equal to the following char
Suppose the string were "cat".
The internal string pointer is initially at the beginning of the line.
"c" is not at the end of the line so the first part of the alternation fails and the second part is considered.
"c" is matched and saved to capture group 1.
The negative lookahead asserting that "c" is not followed by the content of capture group 1 succeeds, so "c" is matched and the internal string pointer is advanced to a position between "c" and "a".
"a" fails the first two parts of the alternation so the third part is considered.
The positive lookbehind (?<=(.)) saves the preceding character ("c") in capture group 2.
The negative lookahead (?!\2), which asserts that the next character ("a") is not equal to the content of capture group 2, succeeds. The string pointer remains just before "a".
The next character ("a") is matched and saved in capture group 3.
The negative lookahead (?!\3), which asserts that the following character ("t") does not equal the content of capture group 3, succeeds, so "a" is matched and the string pointer advances to just before "t".
The same steps are performed when evaluating "t" as were performed when evaluating "a". Here the last token ((?!\3)) succeeds, however, because no characters follow "t".
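The walkthrough above can be tried directly with Python's standard re module. This is a minimal sketch; isolated_chars is a name chosen here for illustration, and because the pattern is compiled without re.MULTILINE it treats its input as a single line.

```python
import re

# Pattern from the answer above: three alternatives for an isolated character.
ISOLATED = re.compile(r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)')

def isolated_chars(s):
    """Return every character that is not adjacent to a copy of itself."""
    return [m.group() for m in ISOLATED.finditer(s)]

print(isolated_chars("cat"))      # ['c', 'a', 't']
print(isolated_chars("1151223"))  # ['5', '1', '3']
print(isolated_chars("1111223"))  # ['3']
```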

Lua -- match strings including non-letter classes

I'm trying to find exact matches of strings in Lua, including special characters. I want the example below to report an exact match, but because of the - character it returns nil.
index = string.find("test-string", "test-string")
returns nil
index = string.find("test-string", "test-")
returns 1
index = string.find("test-string", "test")
also returns 1
How can I get it to do full matching?
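- is a pattern operator in a Lua string pattern: it applies to the single character before it and matches zero or more repetitions of that character, as few as possible. So when you say test-string, you're telling find() to match tes, then as few t characters as possible, then string. Since - isn't an actual minus sign in this case, the pattern is really looking for something like tesstring or teststring, and the literal - in the subject string makes the match fail.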
Do as Mike has said and escape it with the % character.
I found this helpful for better understanding patterns.
You can also ask for a plain substring match that ignores magic characters:
string.find("test-string", "test-string",1,true)
You need to escape special characters in the pattern with the % character.
So in this case you are looking for
local index = string.find('test-string', 'test%-string')
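For readers more familiar with regular expressions, Lua's - behaves like the lazy quantifier *? applied to the preceding character. The sketch below is an analogy in Python's re module, not Lua code: it assumes test*?string as a rough equivalent of the unescaped Lua pattern, and uses re.escape and str.find as counterparts of the % escape and the plain-search flag.

```python
import re

text = "test-string"

# Rough regex equivalent of the unescaped Lua pattern "test-string".
lazy = re.compile(r"test*?string")
print(lazy.search(text))            # None: the literal hyphen is in the way
print(lazy.search("tesstring"))     # matches with zero repetitions of "t"

# Escaping the hyphen, analogous to "test%-string" in Lua.
escaped = re.compile(re.escape("test-string"))
print(escaped.search(text).span())  # (0, 11)

# Plain substring search, analogous to string.find(s, p, 1, true).
print(text.find("test-string"))     # 0 (Python offsets are 0-based)
```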

split string by char

Scala has a standard way of splitting a string in StringOps.split.
Its behaviour somewhat surprised me, though.
To demonstrate, using the quick convenience function
def sp(str: String) = str.split('.').toList
the following expressions all evaluate to true
(sp("") == List("")) //expected
(sp(".") == List()) //I would have expected List("", "")
(sp("a.b") == List("a", "b")) //expected
(sp(".b") == List("", "b")) //expected
(sp("a.") == List("a")) //I would have expected List("a", "")
(sp("..") == List()) // I would have expected List("", "", "")
(sp(".a.") == List("", "a")) // I would have expected List("", "a", "")
So I expected that split would return an array with (the number of separator occurrences) + 1 elements, but that's clearly not the case.
The behaviour is almost that, with all trailing empty strings removed, but that doesn't hold for splitting the empty string.
I'm failing to identify the pattern here. What rules does StringOps.split follow?
For bonus points, is there a good way (without too much copying/string appending) to get the split I'm expecting?
For the curious, you can find the code here: https://github.com/scala/scala/blob/v2.12.0-M1/src/library/scala/collection/immutable/StringLike.scala
See the split function that takes a character as an argument (line 206).
I think the general pattern here is that all trailing empty split results are ignored.
The exception is the first case, where the "if no separator char is found then just return the whole string" logic is applied.
I am trying to find if there is any design documentation around these.
Also, if you use a string instead of a char as the separator, it falls back to the Java regex split. As mentioned by @LRLucena, if you provide the limit parameter with a value larger than the size of the result, you will get your trailing empty results. See http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String,%20int)
You can use split with a regular expression. I'm not sure, but I guess that the second parameter is the largest size of the resulting array.
def sp(str: String) = str.split("\\.", str.length+1).toList
Seems to be consistent with these three rules:
1) Trailing empty substrings are dropped.
2) An empty substring is considered trailing before it is considered leading, if applicable.
3) First case, with no separators is an exception.
split follows the behaviour of http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
That is split "around" the separator character, with the following exceptions:
Regardless of anything else, splitting the empty string will always give Array("")
Any trailing empty substrings are removed
Surrogate characters only match if the matched character is not part of a surrogate pair.
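The rules listed in the answers can be checked against the examples from the question. The sketch below models them in Python rather than Scala; java_style_split is a hypothetical helper written from the observed behaviour, not the library's implementation, and it ignores the surrogate-pair rule. Python's own str.split keeps every empty substring, which is the result the question expected.

```python
def java_style_split(s, sep):
    """Model of the observed rules; not the actual library implementation."""
    parts = s.split(sep)       # Python keeps every empty substring
    if len(parts) == 1:        # no separator found: return the whole string
        return parts
    while parts and parts[-1] == "":
        parts.pop()            # drop trailing empty substrings
    return parts

for sample in ["", ".", "a.b", ".b", "a.", "..", ".a."]:
    # Left: modelled Scala/Java result. Right: separator count + 1 elements.
    print(repr(sample), java_style_split(sample, "."), sample.split("."))
```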

What does this code do? (awk)

here I have a part of my awk code to parse a file but the output is not 100% what I want.
match($0,/root=[^,]*/){
n=split(substr($0,RSTART+5,RLENGTH-5),N,/:/)
My problem is that I cannot tell with 100% certainty what this piece of code is doing ...
Can someone just tell me exactly what these two lines do?
EDIT:
I just want to know what the code does so I can fix it myself, so please do not ask something like "what does the file you parse look like?"
match(s, r [, a])
Returns the position in s where the regular expression r occurs, or 0
if r is not present, and sets the values of RSTART and RLENGTH. Note
that the argument order is the same as for the ~ operator: str ~ re.
If array a is provided, a is cleared and then elements 1 through n are
filled with the portions of s that match the corresponding
parenthesized subexpression in r. The 0'th element of a contains the
portion of s matched by the entire regular expression r. Subscripts
a[n, "start"], and a[n, "length"] provide the starting index in the
string and length respectively, of each matching substring.
substr(s, i [, n])
Returns the at most n-character substring of s starting at i. If n is
omitted, the rest of s is used.
split(s, a [, r])
Splits the string s into the array a on the regular expression r, and
returns the number of fields. If r is omitted, FS is used instead. The
array a is cleared first. Splitting behaves identically to field
splitting, described above.
So when match finds something that matches /root=[^,]*/ in the line ($0) it will return that position (non-zero integers are truth-y for awk) and the action will execute.
The action then uses RSTART and RLENGTH as set by match to get the substring of the line that matched (minus root= because of the +5/-5) and then splits that into the array N on : and saves the number of fields split into n.
That could probably be changed to match($0, /root=([^,]*)/, N) as the pattern, and then the action could use N[1] (the text captured by the parentheses) instead of substr if you wanted. Note that the three-argument form of match is a gawk extension.
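The same steps can be traced in Python to see what each awk variable holds. This is a sketch only: parse_root and the sample line are invented for illustration, and the code mirrors awk's 1-based RSTART and the +5/-5 arithmetic rather than writing idiomatic Python.

```python
import re

def parse_root(line):
    """Mimic match(), substr() and split() from the awk fragment."""
    m = re.search(r"root=[^,]*", line)
    if m is None:
        return None                    # match() returned 0: the action is skipped
    rstart = m.start() + 1             # awk's RSTART is 1-based
    rlength = m.end() - m.start()      # awk's RLENGTH
    begin = (rstart + 5) - 1           # skip the 5 characters of "root="
    value = line[begin:begin + (rlength - 5)]
    fields = value.split(":") if value else []  # awk's split returns 0 for ""
    return len(fields), fields         # n and the array N

# Hypothetical input line, not taken from the question.
print(parse_root("console=tty0,root=/dev/sda1:ext4:ro,quiet"))
```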

Lua pattern for parsing strings with optional part

I have to parse a string in the form value, value, value, value, value. The two last values are optional. This is my code, but it works only for the required arguments:
Regex = "([^,])+, ([^,])+, ([^,])+"
I'm using string.match to get the value into variables.
Since you're splitting the string by a comma, use gmatch:
local tParts = {}
for sMatch in str:gmatch "([^,]+)" do
    table.insert( tParts, sMatch )
end
Now that the parts are stored inside the table, you can check whether it contains matches at indexes 4 and 5:
if tParts[4] and tParts[5] then
    -- do your job
elseif tParts[3] then
    -- only first three matches were there
end
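The split-then-check approach is not specific to Lua. As a sketch in Python, with parse_values as a made-up name: split on commas, trim the whitespace that follows each comma, and separate the three required values from the optional ones. Unlike the check above, this version also accepts exactly four values.

```python
def parse_values(text):
    """Split 'v1, v2, v3[, v4[, v5]]' into required and optional parts."""
    parts = [part.strip() for part in text.split(",")]
    if not 3 <= len(parts) <= 5:
        raise ValueError("expected between 3 and 5 comma-separated values")
    return parts[:3], parts[3:]

print(parse_values("a, b, c"))        # (['a', 'b', 'c'], [])
print(parse_values("a, b, c, d, e"))  # (['a', 'b', 'c'], ['d', 'e'])
```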
In a Lua pattern you can't make a capturing group optional (? applies only to a single character class), and there is no alternation (logical OR) operator. So the answer is: it isn't possible with a single pattern.
