How to create a word set/class in Lua pattern matching? - string

I am trying to create a word-set (class) instead of the char-set (class) in Lua.
For Example:
local text = "hello world, hi world, hola world"
print(string.find(text, "[^hello] world"))
In this example, the program will try to match any part of the string that doesn't start with h or e or l or o characters and has a space and world next to it. But I want to make a word-set similar to this that can match the entire word and find a part of the string that doesn't start with the word hello and has the space and world next to it.
What I've tried:
local text = "hello world, hi world, hola world"
print(string.find(text, "[^h][^e][^l][^l][^o] world"))
It didn't work for some reason.

I am trying to create a word-set (class) instead of the char-set (class) in Lua.
This is not possible in the general case. Lua patterns operate at the character level: quantifiers can only be applied to characters or character sets (and some special pattern items), and there is no alternation, no "subexpressions", and so on. Patterns don't have the expressive power required for this.
local text = "hello world, hi world, hola world"
print(string.find(text, "[^h][^e][^l][^l][^o] world"))
What this pattern translates to is: "find world preceded by a space and five characters, where each character may not be the corresponding character of hello". This means that none of the following will match:
hi world: Only three characters before world
hxxxx world: First character is the same as the first character of hello
... hola world: The l from hola is at the same position as the second l from hello
To find world not preceded by hello, I would combine multiple calls to string.find to search through the string, always checking for a preceding hello:
-- str: Subject string to search
-- needle: String to search for
-- disallowed_prefix: String that may not immediately precede the needle
-- plain: Disable pattern matching
-- init: Start at a certain position
local function string_find_prefix(str, needle, disallowed_prefix, plain, init)
    local needle_start, needle_end = str:find(needle, init or 1, plain)
    if not needle_start then return end -- needle not found
    -- everything in front of the needle; the prefix is only disallowed
    -- if it ends exactly where the needle starts
    local before = str:sub(1, needle_start - 1)
    local prefixed
    if plain then
        prefixed = before:sub(-#disallowed_prefix) == disallowed_prefix
    else
        prefixed = before:find(disallowed_prefix .. "$") ~= nil
    end
    if not prefixed then
        -- needle is not immediately preceded by the prefix, return match
        return needle_start, needle_end
    end
    -- this occurrence is prefixed, continue searching after it
    return string_find_prefix(str, needle, disallowed_prefix, plain, needle_end + 1)
end
print(string_find_prefix("hello world, hi world, hola world", "world", "hello ")) -- 17 21: Inclusive indices of the `world` after `hi`
See string.find (s, pattern [, init [, plain]]).

Related

Matching and replacing newline character using Lua pattern match

This is a follow-up to a question I asked in the LaTeX community about how to format items in the itemize environment. I got a response to that question that uses Lua, and now I want to extend the Lua code, so this is a more Lua-programming-centered question.
The answer proposes using string.gsub to replace pattern-matched parts of the string to something else. For example, the below code:
s = string.gsub ( s , '\\sitem%s+(.+)' , '\\item\\makefirstuc{%1},' )
will turn \sitem hello world into \item\makefirstuc{hello world}, (including the trailing comma).
Here's the problem though, sometimes I have new lines in the string after item, for instance:
\item hello
world
I would like to replace that with:
\item\makefirstuc hello world
Does anyone know how I can do that?
Edit
I just tried the solution proposed by Wiktor but it wouldn't work for the case:
\item hello
world
\end{itemize}
Here's a full script to demonstrate:
-- test.lua
s = "\\sitem Hello\n\\end{itemize}"
print(s)
result = string.gsub ( s, '\\item%s+(.+)' , function(x) return
'\\item\\makefirstuc{' .. string.gsub(x, '\n', ' ') .. '},'
end )
print("\nAfter gsub")
print(result)
The above script outputs
\sitem Hello
\end{itemize}
After gsub
\sitem Hello
\end{itemize}
But I want it to output:
\sitem Hello
\end{itemize}
After gsub
\item\makefirstuc {Hello},
\end{itemize}
No need for complicated Lua constructs, you can simply use the getitems package:
\documentclass{article}
\usepackage{getitems}
\usepackage{mfirstuc}
% borrowed from biblatex
\makeatletter
\newcommand{\unspace}{%
\ifbool{hmode}
{\ifdimgreater\lastskip\z@
{\unskip\unspace}
{\ifnumgreater\lastpenalty\z@
{\unpenalty\unspace}
{}}}
{}}
\makeatother
\def\doitem#1{\item \makefirstuc{#1}\unspace\ifnum\thecurrentitemnumber=\thenumgathereditems.\else,\fi}%
\let\origitemize\itemize
\let\origenditemize\enditemize
\usepackage{environ}
\RenewEnviron{itemize}{%
\expandafter\gatheritems\expandafter{\BODY}%
\gathereditem{0}%
\origitemize%
\loopthroughitemswithcommand{\doitem}%
\origenditemize%
}
\begin{document}
\begin{itemize}
\item test
\item test
\item test
\end{itemize}
\end{document}
You can make the capture non-greedy and stop at the next newline that is followed by a backslash:
result = string.gsub ( s, '\\sitem%s+(.-)(\n\\)' , '\\item\\makefirstuc {%1},%2')
Details:
\\sitem - a \sitem fixed string
%s+ - one or more whitespaces
(.-) - Group 1 (%1): any zero or more chars as few as possible
(\n\\) - Group 2 (%2): a newline and a \.
We don't want the . match to be greedy, so this uses a non-greedy capture together with a frontier pattern to stop at the next backslash or at the end of the string. It can therefore also match multiple items, since the backslash isn't contained in the match (see Lua Manual § 6.4.1 - Patterns):
%f[set], a frontier pattern; such item matches an empty string at any position such that the next character belongs to set and the previous character does not belong to set. The set set is interpreted as previously described. The beginning and the end of the subject are handled as if they were the character '\0'.
s = s:gsub(
    "\\sitem%s+(.-)\n?%f[\\\0]",
    function(it)
        return "\\item\\makefirstuc{" .. it:gsub("\n", " ") .. "},\n"
    end
)
Input:
\sitem Hello
World
\sitem Something else
\end{itemize}
Output:
\item\makefirstuc{Hello World},
\item\makefirstuc{Something else},
\end{itemize}

return only chars from the string in python

I am looking to extract only the letters from a given string, but my code does exactly the opposite:
import re
s = "A man, a plan, a canal: Panama"
newS = ''.join(re.findall("[^a-zA-Z]*", s))
print(newS)  # my o/p: , , :
expected o/p string is:
"A man a plan a canal Panama"
Your regular expression is inverting the match - that's what the caret symbol (^) does inside square brackets (negated character class). You first need to remove that.
Next, you should be matching a sequence of one or more characters (+) rather than zero or more characters (*) -- using * will match the empty string, which you don't want in this case.
Finally your join should join with a space to get the intended output, rather than an empty string -- which won't retain the spaces between the words.
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
Though not essential in this case, it's advisable to use raw strings (the r prefix) for regular expressions.
Full working code:
import re
s = 'A man, a plan, a canal: Panama'
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
print(newS)
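To make the three points above concrete, here is a small side-by-side sketch. The variable names (negated, star, plus) are only illustrative and not part of the question or the answer.

```python
import re

s = "A man, a plan, a canal: Panama"

# Negated class: collects the runs of NON-letters (plus empty strings).
negated = re.findall(r"[^a-zA-Z]*", s)

# Positive class with *: finds the words, but also empty matches
# at every position where no letter starts.
star = re.findall(r"[a-zA-Z]*", s)

# Positive class with +: only non-empty runs of letters.
plus = re.findall(r"[a-zA-Z]+", s)

print(repr("".join(negated)))  # only spaces and punctuation survive
print(star)                    # words mixed with '' entries
print(" ".join(plus))          # A man a plan a canal Panama
```

Joining with ' ' only gives a clean result for the + version; with * the empty matches would add extra spaces.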

Python - Replacing repeated consonants with other values in a string

I want to write a function that, given a string, returns a new string in which each occurrence of a sequence of the same consonant with 2 or more elements is replaced with the same sequence, except that the first consonant is replaced with the character 'm'.
The explanation was probably very confusing, so here are some examples:
"hello world" should return "hemlo world"
"Hannibal" should return "Hamnibal"
"error" should return "emror"
"although" should return "although" (returns the same string because none of the characters are repeated in a sequence)
"bbb" should return "mbb"
I looked into using regex but wasn't able to achieve what I wanted. Any help is appreciated.
Thank you in advance!
Regex is probably the best tool for the job here. The 'correct' expression is
import re
test = """
hello world
Hannibal
error
although
bbb
"""
output = re.sub(r'(.)\1+', lambda g:f'm{g.group(0)[1:]}', test)
# '''
# hemlo world
# Hamnibal
# emror
# although
# mbb
# '''
The only really complicated part of this is the lambda that we give as an argument. re.sub() can accept a function as its replacement: it gets passed a match object (on which we call .group(0) to get the full match, i.e. all of the repeated letters) and should return the string that replaces whatever was matched. Here, we use it to return the character 'm' followed by the match from its second character onwards, built with an f-string.
The regex itself is pretty straightforward as well. Any character (.), then the same character (\1) again one or more times (+). If you wanted just alphanumerics (i.e. not to replace duplicate whitespace characters), you could use (\w) instead of (.)
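Note that (.) and (\w) also match repeated vowels, so "book" would become "bmok". If only consonants should be affected, as the question asks, one possible variant is sketched below. The helper name replace_first_of_run and the choice to treat only ASCII letters (with "y" counted as a consonant) are assumptions made for this example.

```python
import re

# Assumption: ASCII letters only, and "y" counts as a consonant.
# The class lists every letter except a, e, i, o, u.
CONSONANT_RUN = re.compile(r"([b-df-hj-np-tv-z])\1+", re.IGNORECASE)

def replace_first_of_run(text):
    """Replace the first letter of each run of 2+ identical consonants with 'm'."""
    return CONSONANT_RUN.sub(lambda m: "m" + m.group(0)[1:], text)

for word in ["hello world", "Hannibal", "error", "although", "bbb", "book"]:
    print(word, "->", replace_first_of_run(word))
```

With re.IGNORECASE the backreference also matches a run that mixes cases, such as "Nn"; drop the flag and add the uppercase ranges to the class if that is not wanted.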

Lua string manipulation pattern matching alternative "|"

Is there a way I can do a string pattern that will match "ab|cd" so it matches for either "ab" or "cd" in the input string. I know you use something like "[ab]" as a pattern and it will match for either "a" or "b", but that only works for one letter stuff.
Note that my actual problem is a lot more complicated, but essentially I just need to know if there is an OR thing in Lua's string manipulation. I would actually want to put other patterns on each sides of the OR thing, and etc. But if it works with something like "hello|world" and matches "hello, world!" with both "hello" and "world" then it's great!
Combining Lua patterns with the logical operators can solve most problems. For instance, for the regular expression (hello|world)\d+, you can use
string.match(str, "hello%d+") or string.match(str, "world%d+")
The short-circuit evaluation of the or operator makes sure the string is matched against hello%d+ first; only if that fails is it matched against world%d+.
Unfortunately Lua patterns are not regular expressions and are less powerful. In particular they don't support alternation (that vertical bar | operator of Java or Perl regular expressions), which is what you want to do.
A simple workaround could be the following:
local function MatchAny( str, pattern_list )
    for _, pattern in ipairs( pattern_list ) do
        local w = string.match( str, pattern )
        if w then return w end
    end
end
s = "hello dolly!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "cruel world!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "hello world!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "got 1000 bucks"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
Output:
hello
world
hello
1000
The function MatchAny will match its first argument (a string) against a list of Lua patterns and return the result of the first successful match.
Just to expand on peterm's suggestion, lpeg also provides a re module that exposes a similar interface to lua's standard string library while still preserving the extra power and flexibility offered by lpeg.
I would say try out the re module first since its syntax is a bit less esoteric compared to lpeg. Here's an example usage that can match your hello world example:
dump = require 'pl.pretty'.dump
re = require 're'
local subj = "hello, world! padding world1 !hello hello hellonomatch nohello"
pat = re.compile [[
toks <- tok (%W+ tok)*
tok <- {'hello' / 'world'} !%w / %w+
]]
res = { re.match(subj, pat) }
dump(res)
which would output:
{
"hello",
"world",
"hello",
"hello"
}
If you're interested in capturing the position of the matches just modify the grammar slightly for positional capture:
tok <- {}('hello' / 'world') !%w / %w+

Lua frontier pattern match (whole word search)

can someone help me with this please:
s_test = "this is a test string this is a test string "
function String.Wholefind(Search_string, Word)
_, F_result = string.gsub(Search_string, '%f[%a]'..Word..'%f[%A]',"")
return F_result
end
A_test = String.Wholefind(s_test,"string")
output: A_test = 2
So the frontier pattern finds the whole word no problem and gsub counts the whole words no problem but what if the search string has numbers?
s_test = " 123test 123test 123"
B_test = String.Wholefind(s_test,"123test")
output: B_test = 0
It seems to work only if the numbers aren't at the start or end of the search word.
Your pattern doesn't match because you are trying to do the impossible.
After including your variable value, the pattern looks like this: %f[%a]123test%f[%A]. Which means:
%f[%a] - find a transition from a non letter to a letter
123 - find 123 at the position after transition from a non letter to a letter. This itself is a logical impossibility as you can't match a transition to a letter when a non-letter follows it.
Your pattern (as written) will not work for any word that starts or ends with a non-letter.
If you need to search for fragments that include letters and numbers, then your pattern needs to be changed to something like '%f[%S]'..Word..'%f[%s]'.
