Lua frontier pattern match (whole word search) - string

Can someone help me with this, please?
s_test = "this is a test string this is a test string "
function String.Wholefind(Search_string, Word)
    -- gsub returns the new string and the number of substitutions; only the count is needed
    local _, F_result = string.gsub(Search_string, '%f[%a]'..Word..'%f[%A]', "")
    return F_result
end
A_test = String.Wholefind(s_test,"string")
output: A_test = 2
So the frontier pattern finds whole words without a problem, and gsub counts them correctly, but what if the search string has numbers?
s_test = " 123test 123test 123"
B_test = String.Wholefind(s_test,"123test")
output: B_test = 0
It seems to work if the numbers aren't at the start or end of the search string.

Your pattern doesn't match because you are trying to do the impossible.
After including your variable value, the pattern looks like this: %f[%a]123test%f[%A]. Which means:
%f[%a] - find a transition from a non letter to a letter
123 - find 123 at the position after transition from a non letter to a letter. This itself is a logical impossibility as you can't match a transition to a letter when a non-letter follows it.
Your pattern (as written) will not work for any word that starts or ends with a non-letter.
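If you need to search for fragments that include both letters and numbers, the pattern needs to be changed to something like '%f[%S]'..Word..'%f[%s]'. Note that this version requires whitespace on both sides of the word: the frontier treats the beginning and the end of the subject as the character '\0', which is not whitespace, so a word at the very start or very end of the string is not counted unless you pad the string with a space on each side.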
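For comparison, the same whitespace-delimited whole-word count can be sketched in Python's re module. This is only an illustration, not the Lua fix itself: the function name count_whole_word is made up here, and the lookarounds (?<!\S) and (?!\S) stand in for the two frontier patterns. Unlike the Lua frontier version, they also accept a word at the very start or end of the string.

```python
import re


def count_whole_word(text: str, word: str) -> int:
    """Count occurrences of `word` delimited by whitespace or the string edges."""
    # re.escape keeps characters such as '.' or '+' in `word` literal.
    # (?<!\S) means "not preceded by a non-space character";
    # (?!\S) means "not followed by a non-space character".
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    return len(re.findall(pattern, text))


print(count_whole_word(" 123test 123test 123", "123test"))
print(count_whole_word("this is a test string this is a test string ", "string"))
print(count_whole_word("123testing 123test", "123test"))
```

The third call shows why the trailing check matters: "123testing" is not counted, only the standalone "123test".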

Related

Wrong matching regex

So I'm using re module to compile my regex, and my regex looks like this:
"(^~\w+?[ & ~\w+?]*?$)"
So I compile it using pattern = re.compile(regex) and then I use re.findall(pattern, string) to find if the given string is matching and to give me the group if it is.
String that I'm matching is "v1 V ~v2_ V ~~v3".
I'd expect no match, but it reports that the string matches the regular expression. I suspect that \w+ matches whitespace, so that it matches the whole string, but I could not find anything in the documentation confirming that. What am I missing?
Here is a minimal reproducible example:
import re

test_string = "v1 V ~v2_ V ~~v3"
# Raw string, so that \w reaches the regex engine unchanged
regex = r"(^~*\w+?[ & ~*\w+?]*?$)"
pattern = re.compile(regex)
for elem in pattern.findall(test_string):
    print(elem)
If you expect no match, I think the problem is the [ & ~*\w+?]* part.
Characters between square brackets form a character class, which matches one occurrence of any single character listed: in this case a space, &, ~, *, +, ?, or a word character. The asterisk (*) after the closing bracket then allows zero or more occurrences of any of those characters.
If what you wanted is to match the sub-regex & ~*\w+? zero or more times, use parentheses.
So I would say that you wanted this regex: (^~*\w+?( & ~*\w+?)*?$) (just replace the brackets with parentheses).
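A small sketch of the difference, using the string from the question. The variable names are illustrative, and the group is written as non-capturing (?:...) so that only the match itself matters:

```python
import re

text = "v1 V ~v2_ V ~~v3"

# Character class: [...] matches ONE character from the set, repeated by *?
class_pattern = re.compile(r"^~*\w+?[ & ~*\w+?]*?$")
# Group: (?:...) matches the whole sequence " & ~*\w+?", repeated by *?
group_pattern = re.compile(r"^~*\w+?(?: & ~*\w+?)*?$")

print(bool(class_pattern.search(text)))                 # True
print(bool(group_pattern.search(text)))                 # False
print(bool(group_pattern.search("v1 & ~v2_ & ~~v3")))   # True
```

The character class accepts the spaces and the letter V one character at a time, which is why the original pattern matched the whole string.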

java String.format - how to put a space between two characters

I am searching for a way to use a formatter to put a space between two characters. I thought it would be easy with a string formatter.
Here is what I am trying to accomplish:
given: "AB" it will produce "A B"
Here is what I have tried so far:
"AB".format("%#s")
but this keeps returning "AB"; I want "A B". I thought the number sign could be used for a space.
I also tried this:
"26".format("%#d") but it still prints "26".
Is there any way to do this with the string formatter?
It is kind of possible with the string formatter although not directly with a pattern.
jshell> String.format("%1$c %2$c", "AB".chars().boxed().toArray())
$10 ==> "A B"
We need to turn the string into an object array so it can be passed in as varargs and the formatter pattern can extract characters based on index (1$ and 2$) and format them as characters (c).
A much simpler regex solution is the following which scales to any number of characters:
jshell> "ABC^&*123".replaceAll(".", "$0 ").trim()
$3 ==> "A B C ^ & * 1 2 3"
Every single character is replaced with itself ($0) followed by a space. The extra trailing space is then removed by the trim() call.
I could not find a way to do this using String#format. But here is a way to accomplish it using regex replacement:
String input = "AB";
String output = input.replaceAll("(?<=[A-Z])(?=[A-Z])", " ");
System.out.println(output);
The regex pattern (?<=[A-Z])(?=[A-Z]) matches every position between two capital letters, and a space is inserted at that point. The above code prints:
A B

How to split sentence including punctuation

If I had the sentence sentence = 'There is light!' and split it with mysentence = sentence.split(), how would I get 'There, is, light, !' as the output of print(mysentence)? Specifically, I want to split the sentence so that all punctuation, or just a list of selected punctuation, comes out as separate items. I have some code, but the program works on the characters in each word, not on the words themselves.
out = "".join(c for c in punct1 if c not in ('!','.',':'))
out2 = "".join(c for c in punct2 if c not in ('!','.',':'))
out3 = "".join(c for c in punct3 if c not in ('!','.',':'))
How would I do this so that each word is treated as a unit, rather than each character? The output for "Hello how are you?" should become "Hello, how, are, you, ?". Is there any way of doing this?
You may use the \w+|[^\w\s] regex with re.findall to get those chunks:
\w+|[^\w\s]
Pattern details:
\w+ - 1 or more word chars (letters, digits or underscores)
| - or
[^\w\s] - 1 char other than word / whitespace
Python demo:
import re
p = re.compile(r'\w+|[^\w\s]')
s = "There is light!"
print(p.findall(s))
NOTE: If you want to treat an underscore as punctuation, you need to use a pattern like [a-zA-Z0-9]+|[^A-Za-z0-9\s].
UPDATE (after comments)
To make sure an apostrophe is matched as part of a word, add (?:'\w+)* or (?:'\w+)? after the \w+ in the pattern above:
import re
p = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
s = "There is light!? I'm a human"
print(p.findall(s))
The (?:'\w+)* matches zero or more occurrences of an apostrophe followed by one or more word characters (with ? instead of *, it matches zero or one occurrence).
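The patterns above can be wrapped in a small helper to produce the comma-separated output the question asks for. This is a minimal sketch; the function name tokenize and the underscore_is_punct flag are hypothetical names chosen for illustration.

```python
import re

# Words (with optional apostrophe parts) or single punctuation characters.
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
# Variant from the NOTE above: underscore is treated as punctuation.
TOKEN_UNDERSCORE = re.compile(r"[a-zA-Z0-9]+|[^A-Za-z0-9\s]")


def tokenize(sentence: str, underscore_is_punct: bool = False) -> list:
    """Split a sentence into words and individual punctuation marks."""
    pattern = TOKEN_UNDERSCORE if underscore_is_punct else TOKEN
    return pattern.findall(sentence)


print(", ".join(tokenize("Hello how are you?")))       # Hello, how, are, you, ?
print(tokenize("There is light!? I'm a human"))
print(tokenize("snake_case!", underscore_is_punct=True))
```

Note that the underscore variant only covers ASCII letters and digits, whereas \w in Python 3 also matches non-ASCII word characters.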

SAS finding an uppercase word within a string

I have a string which contains one word in uppercase somewhere within it. I want to extract that one word into a new variable using SAS.
I think I need to find a way to code up finding a word which contains two or more uppercase letters (as the start of a sentence would begin with an uppercase letter).
i.e. How do I create the variable 'word':
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
Hope someone can help and the question is clear
Thanks in advance
Consider a regex match/replace with a negative lookahead that covers two types of matches:
consecutive uppercase letters, at least two characters long, followed by a space (the two-character minimum avoids matching the capital letter at the beginning of a sentence): (([A-Z ]){2,})
consecutive uppercase letters, at least two characters long, followed by a period: (([A-Z.]){2,})
CAVEAT: This solution works, except that the pronoun I is also matched, which is technically a valid match since it is an all-uppercase one-letter word. As it is the only such case in English, consider a tranwrd() replacement for that special case. Relatedly, note that this solution matches ALL uppercase words in the string, not just one.
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
data example;
set example;
pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
wordextract = prxchange(pattern_num, -1, txtString);
wordextract = tranwrd(wordextract, " I ", "");
drop pattern_num;
run;
txtString                              word      wordextract
This is one EXAMPLE. Of what I need.   EXAMPLE   EXAMPLE
THIS is another.                       THIS      THIS
etc ETC                                ETC       ETC
SAS has a prxsubstr() function call that finds the starting position and length of a substring that matches a given regex pattern within a given string. Here's a sample solution using the prxsubstr() function call:
data solution;
    set example;
    /* Build a regex pattern of the word to search for, and hang on to it */
    /* (The regex below means: word boundary, then two or more capital letters,
       then word boundary. Word boundary here means the start or the end of a string
       of letters, digits and/or underscores.) */
    if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");
    retain pattern_num;
    /* Get the starting position and the length of the word to extract */
    call prxsubstr(pattern_num, txtString, mypos, mylength);
    /* If a word matching the regex pattern is found, extract it */
    if mypos ^= 0 then word = substr(txtString, mypos, mylength);
run;
SAS prxsubstr() documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295971.htm
Regex word boundary info: http://www.regular-expressions.info/wordboundaries.html
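To see what the \b[A-Z]{2,}\b pattern picks out of the sample rows without running SAS, here is a rough Python analogue. It is only a sketch for checking the regex: the function name first_upper_word is made up, and Python's re engine stands in for the SAS PRX functions.

```python
import re

# Word boundary, two or more capital letters, word boundary.
UPPER_WORD = re.compile(r"\b[A-Z]{2,}\b")


def first_upper_word(text: str) -> str:
    """Return the first all-uppercase word of length >= 2, or '' if none."""
    match = UPPER_WORD.search(text)
    return match.group(0) if match else ""


rows = [
    "This is one EXAMPLE. Of what I need.",
    "THIS is another.",
    "etc ETC",
]
for row in rows:
    print(first_upper_word(row))
```

The two-letter minimum is what keeps "This" and the pronoun "I" from being matched.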

Compare two strings and extract common word out

In MATLAB, how can we compare two strings and print the common word? For example, with string1 = "hello my name is bob"; and string2 = "today bob went to the park"; the word bob is common to both. What is the structure to follow?
Use intersect with strsplit for a one-liner -
common_word = intersect(strsplit(string1),strsplit(string2))
strsplit splits each string into a cell array of words, and intersect then finds the words common to both.
If you would like to avoid strsplit, you can use regexp instead -
common_word =intersect(regexp(string1,'\s','Split'),regexp(string2,'\s','Split'))
Bonus: Removing stop-words from the common words
Let's add some stop-words that are common to these two strings -
string1 = 'hello my name is bob and I am going to the zoo'
string2 = 'today bob went to the park'
Using the solution presented earlier, you would get -
common_word =
'bob' 'the' 'to'
Now, the words 'the' and 'to' are stop-words. If you would like to have them removed, let me suggest the question "Removing stop words from single string" and its accepted solution.
The final output would be 'bob', whom you were looking for!
If you are looking for matching words only, that are separated by spaces, you can use strsplit to change each string into cell arrays of words, then loop through and search for each one.
str1 = 'test if this works';
str2 = 'does this work?';
cell1 = strsplit(str1);
cell2 = strsplit(str2);
for n = 1:length(cell1)
    for m = 1:length(cell2)
        if strcmp(cell1{n},cell2{m})
            disp(cell1{n});
        end
    end
end
Notice that in my example the last member of cell2 is 'work?' so if you have punctuation in your strings, you'll have to do a check for that (isletter might help).
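The punctuation check mentioned above can be sketched as follows, in Python rather than MATLAB and purely as an illustration of the idea. The function name common_words is hypothetical, and stripping string.punctuation from both ends of each word is one possible choice, not the only one.

```python
import string


def common_words(a: str, b: str) -> list:
    """Return the sorted words shared by two strings, ignoring edge punctuation."""
    def words(s: str) -> set:
        # Split on whitespace, strip punctuation from both ends of each word,
        # and drop anything that ends up empty.
        return {w.strip(string.punctuation) for w in s.split()} - {""}

    return sorted(words(a) & words(b))


print(common_words("hello my name is bob", "today bob went to the park"))
print(common_words("test if this works", "does this work?"))
```

Set intersection plays the role of intersect here, so the nested loop is not needed; "works" and "work" are still treated as different words.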
