Lua - match only words outside {} braces in string and replace or append the words with substring - string

I have various strings with forms similar to:
This is a sentence outside braces{sentence{} with some words. {This is a
sentence inside braces with some words.}{This is a second sentence
inside braces.} Maybe some more words here for another sentence.
With Lua, I want to match specific words only where they occur outside the "{}" braces. For example, I might want to match the word "sentence" outside the braces but not the occurrences of "sentence" inside the braces.
How can I do this?
EDIT: What if I want to append to or replace the matched words while keeping the substrings inside the braces intact?
Example: append "word" to sentence:
This is a sentenceword outside braces{sentence{} with some words. {This is a
sentence inside braces with some words.}{This is a second sentence
inside braces.} Maybe some more words here for another sentenceword.

The simplest way to do this is to replace every brace group with a zero-length string in a temporary variable, which you can then search for whatever you like.
You can easily do this using Lua's pattern matching and the following simple gsub code:
local tempStr = startStr:gsub("{.-}","")
The .- is the part that makes it grab everything between the { and } and gsub then replaces it all with a blank string.
Edit: The issue with the above method, as DarkWiiPlayer has pointed out, is that the first open brace is matched with the first close brace, which is incorrect when braces are nested.
The way around that is to use balanced braces (%b) as DarkWiiPlayer has recommended in his answer, like so:
local tempStr = startStr:gsub("%b{}","")

local function weird_match(word, str)
  return str:gsub("%b{}", ''):match(word)
end
Replace balanced pairs of { and } with the empty string
Find the desired pattern (word) in the resulting string
Return the matched word (or its captures, if it has any)
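Stripping the braces discards their contents, so the snippets above do not cover the follow-up about appending to or replacing the matched words while keeping the brace contents intact. One possible approach, sketched here in Python rather than Lua, is to walk the string while tracking brace depth and apply the substitution only to text at depth zero. The function name sub_outside_braces and its parameters are hypothetical, and the sketch assumes the braces are balanced.

```python
import re

def sub_outside_braces(text, word, replacement):
    """Replace whole-word occurrences of `word` only outside {...} groups.

    Hypothetical helper; assumes braces in `text` are balanced.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    out = []    # finished pieces of the result
    buf = []    # characters of the piece currently being collected
    depth = 0   # current brace nesting depth

    for ch in text:
        if ch == "{":
            if depth == 0:
                # Leaving outside text: substitute in it, then start a brace group
                out.append(pattern.sub(lambda m: replacement, "".join(buf)))
                buf = []
            depth += 1
            buf.append(ch)
        elif ch == "}" and depth > 0:
            buf.append(ch)
            depth -= 1
            if depth == 0:
                # Brace group closed: copy it through unchanged
                out.append("".join(buf))
                buf = []
        else:
            buf.append(ch)

    tail = "".join(buf)
    # Only substitute in the tail if we are outside any brace group
    out.append(pattern.sub(lambda m: replacement, tail) if depth == 0 else tail)
    return "".join(out)

text = "a sentence {sentence {nested sentence}} and sentence."
print(sub_outside_braces(text, "sentence", "sentenceword"))
# a sentenceword {sentence {nested sentence}} and sentenceword.
```

Passing the replacement through a lambda keeps any backslashes in it literal, and nested groups are handled by the depth counter rather than by the pattern.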

Related

How do i find/count number of variable in string using Python

Here is example of string
Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management.
Expected Result:
Variables Count = 3
I tried Python's count() with if/else, but I'm looking for a more sustainable solution.
You can use regular expressions:
import re
PATTERN = re.compile(r'\{\{\d+\}\}')
def count_vars(text: str) -> int:
    # Count every {{N}} placeholder found in the text
    return sum(1 for _ in PATTERN.finditer(text))
PATTERN defines the regular expression. It matches one or more digits (\d+) enclosed in double curly brackets (\{\{ and \}\}). Curly brackets are special characters in regular expressions, so we must escape them with \. No flag such as re.DOTALL is needed: DOTALL only changes how . matches, and finditer scans across new lines (\n) regardless. The finditer method iterates over all matches in the text and we simply count them.
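As a quick check, the sketch below applies the same kind of pattern to the sample message from the question. The helper count_distinct_vars is an assumption added for illustration and is not part of the answer above; it counts each placeholder number once, which matters if a template repeats a placeholder such as {{1}}.

```python
import re

# Same idea as the answer's pattern, with a capture group around the digits
PATTERN = re.compile(r"\{\{(\d+)\}\}")

def count_vars(text: str) -> int:
    # findall returns the captured digit strings, one per placeholder
    return len(PATTERN.findall(text))

def count_distinct_vars(text: str) -> int:
    # Hypothetical variant: repeated placeholders are counted once
    return len(set(PATTERN.findall(text)))

message = """Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management."""

print("Variables Count =", count_vars(message))  # Variables Count = 3
```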

Regex Pattern Matching -a substring in words in CSV File

'Neighborhood,eattend10,eattend11,eattend12,eattend13,mattend10,mattend11,mattend12,mattend13,
hsattend10,hsattend11,hsattend12,hsattend13,eenrol11,eenrol12,eenrol13,menrol11,menrol12,
menrol13,hsenrol11,hsenrol12,hsenrol13,aastud10,aastud11,aastud12,aastud13,wstud10,wstud11,
wstud12,wstud13,hstud10,hstud11,hstud12,hstud13,abse10,abse11,abse12,abse13,absmd10,absmd11,
absmd12,absmd13,abshs10,abshs11,abshs12,abshs13,susp10,susp11,susp12,susp13,farms10,farms11,
farms12,farms13,sped10,sped11,sped12,sped13,ready11,ready12,ready13,math310,math311,math312,
math313,read310,read311,read312,read313,math510,math511,math512,math513,read510,read511,read512,
read513,math810,math811,math812,math813,read810,read811,read812,read813,hsaeng10,hsaeng11,
hsaeng12,hsaeng13,hsabio10,hsabio11,hsabio12,hsabio13,hsagov10,hsagov11,hsagov13,hsaalg10,
hsaalg11,hsaalg12,hsaalg13,drop10,drop11,drop12,drop13,compl10,compl11,compl12,compl13,
sclsw11,sclsw12,sclsw13,sclemp13\
I have this data set. I need to know how many "drop" words there are and print them.
Or similarly for any word like "mattend", and print those.
I tried using findall but I think that's not correct
I assume we can use re.search or re.match.
How can I do it in RegEx?
You can use len() on re.findall() to get the length of the returned list:
import re
with open('example.csv') as f:
    data = f.read().strip()
print(len(re.findall('drop',data)))
I think re.findall should be correct.
From the Python re module documentation:
Search:
Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object.
Match:
If zero or more characters at the beginning of string match this regular expression, return a corresponding match object.
Findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
I tried it on your example and it worked for me:
re.findall("drop", str)
If you want to see the digits after it, you can try something like:
re.findall(r"drop\d*", str)
If you want to count the words, you can use:
len(re.findall(r"drop\d*", str))
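To both print and count the matching column names, and to avoid a short word matching inside a longer one, the pattern can be anchored with word boundaries. The following is a minimal sketch; the function name columns_starting_with and the shortened header string are assumptions made for illustration.

```python
import re
from typing import List

# Shortened stand-in for the CSV header in the question
header = "Neighborhood,susp13,drop10,drop11,drop12,drop13,compl10,mattend10,mattend11"

def columns_starting_with(prefix: str, text: str) -> List[str]:
    # \b pins the prefix to the start of a name; \d*\b takes the trailing digits
    return re.findall(r"\b" + re.escape(prefix) + r"\d*\b", text)

drops = columns_starting_with("drop", header)
print(drops)       # ['drop10', 'drop11', 'drop12', 'drop13']
print(len(drops))  # 4
```

The same call with "mattend" returns only the mattend columns, since the leading \b requires the match to start at the beginning of a column name.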

How to match a part of string before a character into one variable and all after it into another

I have a problem with splitting a string into two parts on a special character.
For example:
12345#data
or
1234567#data
The first part has 5-7 characters and is separated by "#" from the second part, which holds other data (characters, numbers, it doesn't matter what).
I need to store two parts on each side of # in two variables:
x = 12345
y = data
without "#" character.
I was looking for some Lua string function like splitOn("#") or substring until character, but I haven't found that.
Use string.match and captures.
Try this:
s = "12345#data"
a,b = s:match("(.+)#(.+)")
print(a,b)
See this documentation:
First of all, although Lua does not have a split function in its standard library, it does have string.gmatch, which can be used instead of a split function in many cases. Unlike a split function, string.gmatch takes a pattern to match the non-delimiter text, instead of the delimiters themselves.
It is easily achievable with the help of a negated character class with string.gmatch:
local example = "12345#data"
for i in string.gmatch(example, "[^#]+") do
  print(i)
end
The [^#]+ pattern matches one or more characters other than #, so it effectively "splits" the string on that single-character delimiter.

Matlab: How to delete prefix from strings

Problem: From TrajCompact, I find all the prefixes and the values after each prefix using regexp, with this code:
[digits{1:2}] = ndgrid(0:4);
for k=1:25
    matches(:,k)=regexp(TrajCompact(:,1),sprintf('%d%d.*',digits{1}(k),digits{2}(k)),'match','once');
end
I want only the postfix of matches. How can I delete the prefix from matches?
Method using regular expressions
You can put the .* section in a group by enclosing it in parentheses (i.e. (.*)). Matlab has some peculiar 'token' nomenclature for this. In any case, an example of how it works:
[match, group] = regexp('25blah',sprintf('%d%d(.*)',2,5),'match','once','tokens');
Then:
match would be a char array containing '25blah'
group would be a 1x1 cell array containing the string 'blah'.
That is, the variable group would hold what you're looking for.
Hack method
Since your prefix is always two digits, you could also just take everything from the 3rd character of the match onwards:
my_string = match(3:end);
Other comments
You may want to require the prefix to occur at the beginning of the string by adding ^ to the beginning of your regular expression. E.g., make the line:
[match, group] = regexp('25blah',sprintf('^%d%d(.*)',2,5),'match','once','tokens');
As it is, your current regular expression would match strings like zzzzzzzzz25stuff. I'm not sure if you want that (assuming it can occur in your data).

How to split a string into a list of words in TCL, ignoring multiple spaces?

Basically, I have a string that consists of multiple, space-separated words. The thing is, however, that there can be multiple spaces instead of just one separating the words. This is why [split] does not do what I want:
split "a    b"
gives me this:
{a {} {} {} b}
instead of this:
{a b}
Searching Google, I found a page on the Tcler's wiki, where a user asked more or less the same question.
One proposed solution would look like this:
split [regsub -all {\s+} "a    b" " "]
which seems to work for simple strings. But a test string such as [string repeat " " 4] (used string repeat because StackOverflow strips multiple spaces) will result in regsub returning " ", which split would again split up into {{} {}} instead of an empty list.
Another proposed solution was this one, to force a reinterpretation of the given string as a list:
lreplace "a list with many spaces" 0 -1
But if there's one thing I've learned about TCL, it is that you should never use list functions (starting with l) on strings. And indeed, this one will choke on strings containing special characters (namely { and }):
lreplace "test \{a b\}"
returns test {a b} instead of test \{a b\} (which would be what I want, every space-separated word split up into a single element of the resulting list).
Yet another solution was to use a 'filter':
proc filter {cond list} {
    set res {}
    foreach element $list {if [$cond $element] {lappend res $element}}
    set res
}
You'd then use it like this:
filter llength [split "a list with many spaces"]
Again, same problem. This would call llength on a string, which might contain special characters (again, { and }) - passing it "\{a b\}" would result in TCL complaining about an "unmatched open brace in list".
I managed to get it to work by modifying the given filter function, adding a {*} in front of $cond in the if, so I could use it with string length instead of llength, which seemed to work for every possible input I've tried to use it on so far.
Is this solution safe to use as it is now? Would it choke on some special input I didn't test so far? Or, is it possible to do this right in a simpler way?
The easiest way is to use regexp -all -inline to select and return all words. For example:
# The RE matches any non-empty sequence of non-whitespace characters
set theWords [regexp -all -inline {\S+} $theString]
If you instead define words to be sequences of alphanumerics, use this for the regular expression term: {\w+}
You can use regexp instead:
From tcl wiki split:
Splitting by whitespace: the pitfalls
split { abc def  ghi}
{} abc def {} ghi
Usually, if you are splitting by whitespace and do not want those blank fields, you are better off doing:
regexp -all -inline {\S+} { abc def  ghi}
abc def ghi
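The same pitfall and fix can be illustrated outside Tcl. The sketch below is in Python, purely as a comparison: splitting on a single space yields empty fields, whereas matching runs of non-whitespace (the analogue of regexp -all -inline {\S+}) returns only the words and treats braces as ordinary characters. The function name words is hypothetical.

```python
import re

def words(text):
    # Collect every maximal run of non-whitespace characters
    return re.findall(r"\S+", text)

print(" abc def  ghi".split(" "))  # ['', 'abc', 'def', '', 'ghi']
print(words(" abc def  ghi"))      # ['abc', 'def', 'ghi']
print(words("test {a b}"))         # ['test', '{a', 'b}']
print(words("    "))               # []
```

Because the pattern selects the words instead of splitting on the separators, an all-whitespace input naturally produces an empty list.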

Resources