Lua string manipulation: pattern-matching alternative to "|"

Is there a way I can write a string pattern like "ab|cd" so that it matches either "ab" or "cd" in the input string? I know you can use something like "[ab]" as a pattern and it will match either "a" or "b", but that only works for single characters.
Note that my actual problem is a lot more complicated, but essentially I just need to know whether there is an OR construct in Lua's string manipulation. I would actually want to put other patterns on each side of the OR, and so on. But if it works with something like "hello|world" and matches "hello, world!" on both "hello" and "world", then it's great!

Using Lua's logical operators together with patterns can solve most problems. For instance, for the regular expression (hello|world)%d+, you can use
string.match(str, "hello%d+") or string.match(str, "world%d+")
The short-circuit behavior of the or operator makes sure the string is matched against hello%d+ first; if that fails, it is matched against world%d+.
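For example, a minimal sketch of this or-chaining (the input strings here are made up):
local function match_either(str)
    -- try the first pattern; fall back to the second only if it fails
    return string.match(str, "hello%d+") or string.match(str, "world%d+")
end

print(match_either("say hello42 to everyone"))  --> hello42
print(match_either("world7 is enough"))         --> world7
print(match_either("no digits here"))           --> nil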

Unfortunately Lua patterns are not regular expressions and are less powerful. In particular they don't support alternation (that vertical bar | operator of Java or Perl regular expressions), which is what you want to do.
A simple workaround could be the following:
local function MatchAny( str, pattern_list )
    for _, pattern in ipairs( pattern_list ) do
        local w = string.match( str, pattern )
        if w then return w end
    end
end
s = "hello dolly!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "cruel world!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "hello world!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "got 1000 bucks"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
Output:
hello
world
hello
1000
The function MatchAny will match its first argument (a string) against a list of Lua patterns and return the result of the first successful match.

Just to expand on peterm's suggestion, lpeg also provides a re module that exposes a similar interface to Lua's standard string library while still preserving the extra power and flexibility offered by lpeg.
I would say try out the re module first since its syntax is a bit less esoteric compared to lpeg. Here's an example usage that can match your hello world example:
dump = require 'pl.pretty'.dump
re = require 're'
local subj = "hello, world! padding world1 !hello hello hellonomatch nohello"
pat = re.compile [[
    toks <- tok (%W+ tok)*
    tok <- {'hello' / 'world'} !%w / %w+
]]
res = { re.match(subj, pat) }
dump(res)
which would output:
{
  "hello",
  "world",
  "hello",
  "hello"
}
If you're interested in capturing the position of the matches, just modify the grammar slightly for a positional capture:
tok <- {}('hello' / 'world') !%w / %w+
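Putting the positional variant into the same example (reusing subj and dump from above; the offsets in the comment are what the grammar should yield for that subject):
pat = re.compile [[
    toks <- tok (%W+ tok)*
    tok <- {}('hello' / 'world') !%w / %w+
]]
res = { re.match(subj, pat) }
dump(res)  -- should list the start positions of the four tokens: 1, 8, 31, 37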

Related

How to create a word set/class in Lua pattern matching?

I am trying to create a word-set (class) instead of the char-set (class) in Lua.
For Example:
local text = "hello world, hi world, hola world"
print(string.find(text, "[^hello] world"))
In this example, the program will try to match any part of the string that doesn't start with h or e or l or o characters and has a space and world next to it. But I want to make a word-set similar to this that can match the entire word and find a part of the string that doesn't start with the word hello and has the space and world next to it.
What I've tried:
local text = "hello world, hi world, hola world"
print(string.find(text, "[^h][^e][^l][^l][^o] world"))
It didn't work for some reason.
I am trying to create a word-set (class) instead of the char-set (class) in Lua.
This is not possible in the general case. Lua patterns operate at a character level: Quantifiers can only be applied to characters or character sets (and some special pattern items), but there exists no alternation, no "subexpressions" etc. Patterns don't have the expressive power required for this.
local text = "hello world, hi world, hola world"
print(string.find(text, "[^h][^e][^l][^l][^o] world"))
What this pattern translates to is: "find world preceded by a space and 5 characters, where each character may not be the respective character of hello". This means all of the following won't match:
hi world: Only three characters before world
hxxxx world: First character is the same as the first character of hello
... hola world: The l from hola is at the same position as the second l from hello
To find world not preceded by hello, I would combine multiple calls to string.find to search through the string, always looking for a preceding hello:
-- str: Subject string to search
-- needle: String to search for
-- disallowed_prefix: String that may not immediately precede the needle
-- plain: Disable pattern matching
-- init: Start at a certain position
local function string_find_prefix(str, needle, disallowed_prefix, plain, init)
    local needle_start, needle_end = str:find(needle, init or 1, plain)
    if not needle_start then return end -- needle not found
    local _, prefix_end = str:find(disallowed_prefix, init or 1, plain)
    -- needle may not be prefixed with the disallowed prefix
    if (not prefix_end) or needle_start > prefix_end + 1 then
        -- prefix not found or needle starts after prefix, return match
        return needle_start, needle_end
    end
    return string_find_prefix(str, needle, disallowed_prefix, plain, prefix_end + 2)
end
print(string_find_prefix("hello world, hi world, hola world", "world", "hello ")) -- 17 21: Inclusive indices of the `world` after `hi`
See string.find (s, pattern [, init [, plain]]).

Looking for an efficient string replacement algorithm

I'm trying to create a string replacer that accepts multiple replacements.
The idea is that it would scan the string for substrings and replace each of those substrings with another substring.
For example, I should be able to ask it to replace every "foo" with "bar". Doing that is trivial.
The issue starts when I try to add multiple replacements to this function, because if I ask it to replace "foo" with "bar" and "bar" with "biz", running those replacements in sequence would turn "foo" into "biz", and that behavior is unintended.
I tried splitting the string into words and running each replacement on each word. However, that's not bulletproof either, because it still results in unintended behavior, since you can ask it to replace substrings that are not whole words. I also find that very inefficient.
I'm thinking of some way of running each replacer once over the whole string, storing those changes, and merging them. However, I think I'm overengineering.
Searching the web gives me trivial results on how to use string.replace with regular expressions; it doesn't solve my problem.
Is this a problem already solved? Is there an algorithm that can be used here for this string manipulation efficiently?
If you modify your string while searching for all occurrences of substrings to be replaced, you'll end up modifying incorrect states of the string. An easy way out could be to get a list of all indexes to update first, then iterate over the indexes and make the replacements. That way, the indexes for "bar" would already have been computed and won't be affected even if you replace some other substring with "bar" later.
Adding a rough Python implementation to give you an idea:
import re

string = "foo bar biz"
replacements = [("foo", "bar"), ("bar", "biz")]
replacement_indexes = []
offset = 0

for item in replacements:
    replacement_indexes.append([m.start() for m in re.finditer(item[0], string)])

temp = list(string)
for i in range(len(replacement_indexes)):
    old, new, indexes = replacements[i][0], replacements[i][1], replacement_indexes[i]
    for index in indexes:
        temp[offset+index:offset+index+len(old)] = list(new)
        offset += len(new)-len(old)

print(''.join(temp))  # "bar biz biz"
Here's the approach I would take.
I start with my text and the set of replacements:
string text = "alpha foo beta bar delta";
Dictionary<string, string> replacements = new()
{
    { "foo", "bar" },
    { "bar", "biz" },
};
Now I create an array of parts that are either "open" or not. Open parts can have their text replaced.
var parts = new List<(string text, bool open)>
{
    (text: text, open: true)
};
Now I run through each replacement and build a new parts list. If the part is open I can do the replacements; if it's closed, it is just added in untouched. It's this last bit that prevents double mapping of replacements.
Here's the main logic:
foreach (var replacement in replacements)
{
    var parts2 = new List<(string text, bool open)>();
    foreach (var part in parts)
    {
        if (part.open)
        {
            bool skip = true;
            foreach (var split in part.text.Split(new[] { replacement.Key }, StringSplitOptions.None))
            {
                if (skip)
                {
                    skip = false;
                }
                else
                {
                    parts2.Add((text: replacement.Value, open: false));
                }
                parts2.Add((text: split, open: true));
            }
        }
        else
        {
            parts2.Add(part);
        }
    }
    parts = parts2;
}
That produces the following parts list:
    ("alpha ", open: true)
    ("bar", open: false)
    (" beta ", open: true)
    ("biz", open: false)
    (" delta", open: true)
Now it just needs to be joined back up again:
string result = String.Concat(parts.Select(p => p.text));
That gives:
alpha bar beta biz delta
As requested.
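For readers coming from the Lua side of this thread, here is a rough sketch of the same open/closed-parts idea; the function name and the from/to field names are made up for illustration, and the matching is plain (non-pattern):
local function replace_all(text, replacements)
    local parts = { { text = text, open = true } }
    for _, r in ipairs(replacements) do
        local next_parts = {}
        for _, part in ipairs(parts) do
            if part.open then
                local pos = 1
                while true do
                    -- plain find (4th argument true): keys are taken literally, not as patterns
                    local s, e = string.find(part.text, r.from, pos, true)
                    if not s then
                        table.insert(next_parts, { text = string.sub(part.text, pos), open = true })
                        break
                    end
                    table.insert(next_parts, { text = string.sub(part.text, pos, s - 1), open = true })
                    -- replaced text is closed, so later replacements never touch it
                    table.insert(next_parts, { text = r.to, open = false })
                    pos = e + 1
                end
            else
                table.insert(next_parts, part)
            end
        end
        parts = next_parts
    end
    local out = {}
    for _, part in ipairs(parts) do out[#out + 1] = part.text end
    return table.concat(out)
end

print(replace_all("alpha foo beta bar delta",
                  { { from = "foo", to = "bar" }, { from = "bar", to = "biz" } }))
--> alpha bar beta biz delta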
Let's suppose your given string were
str = "Mary had fourteen little lambs"
and the desired replacements were given by the following hash (aka hashmap):
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
indicating that we want to replace "Mary" (wherever it appears in the string, if at all) with "Butch", and so on. We therefore want to return the following string:
"Butch had fourteen wee hippos"
Notice that we do not want 'fourteen' to be replaced with 'threeteen' and we want the extra spaces between 'fourteen' and 'wee' to be preserved.
First collect the keys of the hash h into an array (or list):
keys = h.keys
#=> ["Mary", "four", "little", "lambs"]
Most languages have a method or function sub or gsub that works something like the following:
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch had fourteen wee hippos"
The regular expression /\w+/ (r'\w+' in Python, for example) matches one or more word characters, as many as possible (i.e., a greedy match). Word characters are letters, digits and the underscore ('_'). It therefore will sequentially match 'Mary', 'had', 'fourteen', 'little' and 'lambs'.
Each matched word is passed to the "block" do |word| ... end and is held by the variable word. The block then computes and returns the string that is to replace the value of word in a duplicate of the original string. Different languages use different structures and formats to do this, of course.
The first word passed to the block by gsub is 'Mary'. The following calculation is then performed:
if keys.include?("Mary") # true
# so replace "Mary" with:
h[word] #=> "Butch
else # not executed
# not executed
end
Next, gsub passes the word 'had' to the block and assigns that string to the variable word. The following calculation is then performed:
if keys.include?("had") # false
# not executed
else
# so replace "had" with:
"had"
# that is, leave "had" unchanged
end
Similar calculations are made for each word matched by the regular expression.
We see that punctuation and other non-word characters are not a problem:
str = "Mary, had fourteen little lambs!"
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch, had fourteen wee hippos!"
We can see that gsub does not perform replacements sequentially:
h = { "foo"=>"bar", "bar"=>"baz" }
keys = h.keys
#=> ["foo", "bar"]
"foo bar".gsub(/\w+/) do |word|
if keys.include?(word)
h[word]
else
word
end
end
#=> "bar baz"
Note that a linear search of keys is required to evaluate
keys.include?("Mary")
This could be relatively time-consuming if keys has many elements.
In most languages this can be sped up by making keys a set (an unordered collection of unique elements). Determining whether a set contains a given element is quite fast, comparable to determining if a hash has a given key.
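A Lua-flavoured aside, since the thread started there: Lua has no built-in set type, but a table keyed by the words behaves like one, so the membership test becomes a (roughly constant-time) key lookup. A small illustrative sketch:
local keys = { "Mary", "four", "little", "lambs" }

-- build a set: the words become table keys
local key_set = {}
for _, k in ipairs(keys) do key_set[k] = true end

print(key_set["Mary"])      --> true
print(key_set["fourteen"])  --> nil (treated as false)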
An alternative formulation is to write
str.gsub(/\b(?:Mary|four|little|lambs)\b/) { |word| h[word] }
#=> "Butch had fourteen wee hippos"
where the regular expression is constructed programmatically from h.keys. This regular expression reads, "match one of the four words indicated, preceded and followed by a word boundary (\b)". The trailing word boundary prevents 'four' from matching 'fourteen'. Since gsub is now only considering the replacement of those four words, the block can be simplified to { |word| h[word] }.
Again, this preserves punctuation and extra spaces.
If for some reason we wanted to be able to replace parts of words (e.g., to replace 'fourteen' with 'threeteen'), simply remove the word boundaries from the regular expression:
str.gsub(/Mary|four|little|lambs/) { |word| h[word] }
#=> "Butch had threeteen wee hippos"
Naturally, different languages provide variations of this approach. In Ruby, for example, one could write:
g = Hash.new { |h,k| k }.merge(h)
This creates a hash g that has the same key-value pairs as h but has the additional property that if g does not have a key k, g[k] (the value of key k) returns k. That allows us to write simply:
str.gsub(/\w+/, g)
#=> "Butch had fourteen wee hippos"
See the second version of String#gsub.
A different approach (which I will show is problematic) is to construct an array (or list) of words from the string, replace those words as appropriate and then rejoin the resulting words to form a string. For example,
words = str.split
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
arr.join(' ')
#=> "Butch had fourteen wee hippos"
This produces similar results except the extra spaces have been removed.
Now suppose the string contained punctuation:
str = "Mary, had fourteen little lambs!"
words = str.split
#=> ["Mary,", "had", "fourteen", "little", "lambs!"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Mary,", "had", "fourteen", "wee", "lambs!"]
arr.join(' ')
#=> "Mary, had fourteen wee lambs!"
We could deal with punctuation by writing
words = str.scan(/\w+/)
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
Here str.scan returns an array of all matches of the regular expression /\w+/ (one or more word characters). The obvious problem is that all punctuation is lost when the words are rejoined with arr.join(' ').
You can achieve this in a simple way by using regular expressions:
import re
replaces = {'foo' : 'bar', 'alfa' : 'beta', 'bar': 'biz'}
original_string = 'foo bar, alfa foo. bar other.'
expected_string = 'bar biz, beta bar. biz other.'
replaced = re.compile(r'\w+').sub(lambda m: replaces[m.group()] if m.group() in replaces else m.group(), original_string)
assert replaced == expected_string
I haven't checked the performance, but I believe it is probably faster than using "nested for loops".
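For completeness in Lua terms (the thread's original language): string.gsub accepts a table as its replacement argument, which gives the same single-pass behaviour; any match not found in the table is left unchanged. A minimal sketch with the same data as above:
local replaces = { foo = "bar", alfa = "beta", bar = "biz" }
local original = "foo bar, alfa foo. bar other."

-- each whole %w+ match is looked up in the table; a nil result keeps the match as-is
local replaced = original:gsub("%w+", replaces)

print(replaced)  --> bar biz, beta bar. biz other.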

How to write the result of a function in a slice

In this example everything works fine, but they do not store the result in the variable a and instead display it immediately: https://play.golang.org/p/O0XwtQJRej
But I have a problem:
package main

import (
    "fmt"
    "strings"
)

func main() {
    str := "fulltext"
    var slice []string
    slice = strings.Split(str, "")
    fmt.Printf("anwer: ", slice)
}
Hence the output contains superfluous characters, for example:
%! (EXTRA [] string =
P.S. I know that I need to use append to add elements to the slice, but now I do not understand how to apply append here.
UP:
Now I have the answer:
anwer: %!(EXTRA []string=[f u l l t e x t])
But I need just:
[f u l l t e x t]
But I do not understand how I should change my code?
The problem is not with the assignment of the return value of strings.Split() to the local variable slice, which is totally fine.
The problem is that you used fmt.Printf(), which expects a format string, and based on that format string it formats / substitutes the expected parameters. Since your format string does not contain any verbs, that fmt.Printf() call expects no parameters, yet you pass it one, so it signals this with those extra characters (a kind of error string).
Provide a valid format string where you indicate you will supply 1 parameter, a slice:
fmt.Printf("answer: %v", slice)
With this, the output is:
answer: [f u l l t e x t]
Or alternatively use fmt.Println(), which does not expect a format string:
fmt.Println("answer:", slice)
(Note that there is no space after the colon, as fmt.Println() always adds a space between its operands.)
Output is the same. Try the examples on the Go Playground.
Staying with fmt.Printf(), when the parameter involves string values, the %q verb is often more useful, as that will print quoted string values, much easier to spot certain mistakes (e.g. invisible characters, or if a string contains spaces, it will become obvious):
fmt.Printf("answer: %q\n", slice)
Output of this (try it on the Go Playground):
answer: ["f" "u" "l" "l" "t" "e" "x" "t"]
If you'd wanted to append the result of a function call, this is how it could look like:
slice := []string{"initial", "content"}
slice = append(slice, strings.Split(str, "")...)
fmt.Printf("answer: %q\n", slice)
And now the output (try it on the Go Playground):
answer: ["initial" "content" "f" "u" "l" "l" "t" "e" "x" "t"]
Give Printf the expected format; in most cases, %v is fine.
package main

import (
    "fmt"
    "strings"
)

func main() {
    str := "fulltext"
    var slice []string
    slice = strings.Split(str, "")
    fmt.Printf("anwer: %v", slice)
}
see https://golang.org/pkg/fmt/ for more info.

Get words from a file using groovy

Using Groovy, how can I get the words/text from a file which are enclosed in parentheses?
Example:
George (a programmer) used to think much.
words to get: a programmer
Here you have an example program solving the issue:
String inp = 'George (a programmer) used to think much.'
def matcher = inp =~ /\(([^\)]+)\)/ // Try to find a match
if (matcher) { // Something found
    String str = matcher[0][1] // Get the 1st capture group
    printf("Found: %s.\n", str)
    def words = str.tokenize() // Create a list of words
    words.eachWithIndex{ it, i -> printf("%d: %s.\n", i, it)}
} else {
    print("Not found")
}
Note the meaning of parentheses in the regular expression:
Outer (backslash quoted) parentheses are literal parentheses (we are looking for these chars).
Unquoted parentheses (between them) are delimiters of the capture group.
The remaining (quoted) closing parenthesis between them is the char that should not be present within the capture group.

Lua: how do I split a string (of a varying length) into multiple parts?

I have a string, starting with a number, then a space, then a word with an unknown number of letters, a space again, and then sometimes another piece of text (which may or may not contain more than one word).
EDIT: the last piece of text is sometimes left out (see example #2)
Using the methods mentioned in the comments, str:find(...) on #2 would return nil.
Example:
"(number) (text) [more text]"
1: "10 HELLO This is a string"
2: "88 BYE"
What I want is to split these strings into a table, inside a table containing more of these split strings, like this:
{
[(number)] = { [1] = (text), [2] = (more text) }
[10] = { [1] = "HELLO", [2] = "This is a string" }
}
I have tried several methods, but none of them give me the desired result.
One of the methods I tried, for example, was splitting the string on whitespaces. But that resulted in:
{
[10] = { [1] = "HELLO", [2] = "This", ... [4] = "string" }
}
Thanks in advance.
Using various Lua string patterns, achieving the desired result is quite easy.
For example:
function CustomMatching( sVar )
    local tReturn = {}
    local _, _, iNumber, sWord, sRemain = sVar:find( "^(%d+)%s(%a+)%s(.+)" )
    tReturn[tonumber(iNumber)] = { sWord, sRemain }
    return tReturn
end
And to call it:
local sVar = "10 HELLO This is a string"
local tMyTable = CustomMatching( sVar )
In the find() method the pattern "^(%d+)%s(%a+)%s(.+)" means:
Find and store all digits (%d+) until a space is encountered.
Find and store all letters (%a+) until a space is encountered.
Find and store all characters until the end of string is reached.
EDIT
Changed tReturn[iNumber] to tReturn[tonumber(iNumber)] as per the discussion in comments.
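To see what the pattern captures (and why tonumber is needed, since the captures come back as strings), a quick check with the example input:
print(("10 HELLO This is a string"):find("^(%d+)%s(%a+)%s(.+)"))
--> 1   25   10   HELLO   This is a string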
You can use the string.match method with an appropriate pattern:
local n, w, str = ('10 HELLO This is a string'):match'^(%d+)%s+(%S+)%s+(.*)$'
your_table[tonumber(n)] = {w, str}
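Since example #2 ("88 BYE") has no trailing text and would make this match return nil, a minimal sketch of one possible adjustment is to use %s* before the last capture so the third part may be empty; the table layout follows the question:
local lines = { "10 HELLO This is a string", "88 BYE" }
local result = {}

for _, line in ipairs(lines) do
    -- %s* (instead of %s+) lets the last capture be empty when there is no trailing text
    local n, word, rest = line:match("^(%d+)%s+(%S+)%s*(.*)$")
    if n then
        result[tonumber(n)] = { word, rest }
    end
end

-- result[10] --> { "HELLO", "This is a string" }
-- result[88] --> { "BYE", "" }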
