Design pattern to evaluate a boolean expression - string

Is there a common or well-defined design pattern that helps in programming an evaluator for boolean expressions?
I am writing a string matching algorithm for such expressions and am looking for a design pattern that will help structure the algorithm.
Sample expected string:
"nike AND (tshirt OR jerseys OR jersey OR tshirts OR (t AND shirt)) AND black"

Your expression is in the infix notation. To evaluate it, convert it to the postfix notation.
Infix expression looks like:
<operand><operator><operand>
Postfix expression looks like:
<operand><operand><operator>
You can convert your expression using the shunting-yard algorithm.
Once the expression is converted, evaluate it using this approach (pseudocode):
Begin
    for each token tok in the postfix expression, do
        if tok is an operator ⨀, then
            a := pop the top element from the stack
            b := pop the next element from the stack
            res := b ⨀ a
            push res onto the stack
        else if tok is an operand, then
            push tok onto the stack
    done
    return the element at the top of the stack
End
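As a rough illustration of the two steps, here is a minimal Python sketch. The function names, the tokenizer, the precedence table (AND binding tighter than OR) and the rule that an operand is true when it occurs as a whole word in the text are all assumptions made for this example, not part of the original answer.

```python
import re

# Assumed precedence: AND binds tighter than OR; both are left-associative.
PRECEDENCE = {"OR": 1, "AND": 2}

def tokenize(expression):
    # Split into parentheses and words; the words AND / OR are the operators.
    return re.findall(r"\(|\)|[^\s()]+", expression)

def to_postfix(tokens):
    """Shunting-yard: convert a list of infix tokens to postfix order."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators of greater or equal precedence before pushing.
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        else:
            output.append(tok)  # operand
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output

def evaluate_postfix(postfix, words):
    """Evaluate postfix tokens; an operand is true if it occurs in `words`."""
    stack = []
    for tok in postfix:
        if tok in PRECEDENCE:
            a = stack.pop()
            b = stack.pop()
            stack.append((b and a) if tok == "AND" else (b or a))
        else:
            stack.append(tok.lower() in words)
    return stack.pop()

def matches(expression, text):
    # Assumption: the text is matched word by word, case-insensitively.
    words = set(text.lower().split())
    return evaluate_postfix(to_postfix(tokenize(expression)), words)
```

With the sample expression from the question, `matches(expr, "black nike t shirt")` evaluates to true and `matches(expr, "nike black shoes")` to false.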

I don't know of a design pattern per se which would fit your problem, but if your programming language has regex support, we can easily enough write a pattern such as this:
(?=.*\bnike\b)(?=.*\b(?:tshirts?|jerseys?|t\b.*\bshirt|shirt\b.*\bt))(?=.*\bblack\b).*
The pattern can be explained as:
(?=.*\bnike\b) match "nike" AND
(?=.*\b(?:tshirts?|jerseys?|t\b.*\bshirt|shirt\b.*\bt))
match tshirt(s), jersey(s) or "t" and "shirt" AND
(?=.*\bblack\b) match "black"
.* then consume the entire line
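For instance, the pattern could be applied from Python roughly as follows. The helper name `matches_query` and the `re.IGNORECASE` flag are assumptions added for this sketch; the pattern itself is the one given above.

```python
import re

# The lookahead pattern from the answer, split over lines for readability.
# re.IGNORECASE is an assumption of this sketch, not part of the original pattern.
QUERY = re.compile(
    r"(?=.*\bnike\b)"
    r"(?=.*\b(?:tshirts?|jerseys?|t\b.*\bshirt|shirt\b.*\bt))"
    r"(?=.*\bblack\b).*",
    re.IGNORECASE,
)

def matches_query(text):
    """Return True if a single line of text satisfies all three lookaheads."""
    return QUERY.match(text) is not None
```

Since `.` does not match a newline by default, this sketch is meant for single-line input such as a product title.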

Related

How do you evaluate a mathematical expression given as a list of strings in Python?

s = ['2','+','3']
for i in s:
    if i == '+':
        w = int(i+1)+int(i-1)
        print(w)
This raises an error, which is to be expected.
To evaluate a mathematical expression given as a string or a list, you should first understand how infix and postfix notations are evaluated using a stack. If you don't want to use them, just scan the list from left to right and apply the operations as below; you can extend it to more operators according to your requirements. Note that this applies the operators strictly from left to right, without operator precedence.
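s = ['2','+','3','-','5']
ans = int(s[0])   # start with the first operand
i = 1
while i < len(s) - 1:
    # s[i] is an operator when i is odd; operands match none of the branches
    if s[i] == '+':
        ans += int(s[i+1])
    elif s[i] == '-':
        ans -= int(s[i+1])
    elif s[i] == '*':
        ans *= int(s[i+1])
    elif s[i] == '/':
        ans /= int(s[i+1])   # true division, so ans becomes a float
    i += 1
print(ans)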
If you are confident that the string is safe to evaluate, one possibility is to concatenate the list back into a string with ''.join() and evaluate the result:
eval(''.join(s))
That will do. But if you have more complicated expressions, e.g.
s = ['e','^','2']
then you'll need rules to interpret ^ (say, with np.power) and e (say, as np.e, or the whole e^x as np.exp(x)).
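One way to supply such rules without calling eval() is to parse the joined string with the standard ast module and walk only the node types you allow. The sketch below is one possible approach under stated assumptions: the names `evaluate_tokens`, `_BINARY_OPS` and `_CONSTANTS` are made up for this example, and it uses `math` and `operator` from the standard library instead of NumPy.

```python
import ast
import math
import operator

# Assumed whitelist of binary operators; extend as needed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
# Assumed named constants that may appear as tokens.
_CONSTANTS = {"e": math.e, "pi": math.pi}

def _eval_node(node):
    """Recursively evaluate a whitelisted subset of Python expression nodes."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name) and node.id in _CONSTANTS:
        return _CONSTANTS[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    raise ValueError(f"unsupported expression element: {ast.dump(node)}")

def evaluate_tokens(tokens):
    """Evaluate tokens such as ['2', '+', '3'] with normal operator precedence."""
    source = "".join(tokens).replace("^", "**")  # interpret ^ as exponentiation
    return _eval_node(ast.parse(source, mode="eval").body)
```

Unlike the left-to-right loop, this follows Python's usual precedence rules, and anything outside the whitelist raises ValueError instead of being executed.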

Tag operations for tagged transitions in TDFA(tagged DFA) for sub regex matching

I am a beginner at Haskell programming. I was trying to match strings against a regex in Haskell, using TDFA as the regex backend. I got sub-regex matching to work, but without using the concept of tags proposed in TDFA. For example, please see the following code.
{-# LANGUAGE MagicHash #-}
import Control.Monad
import Data.Array
import qualified Data.Text as T
import Text.Regex.TDFA
import Text.Regex.TDFA.Common
str = "bbbb" :: String
regex = "(b|bb|bbb|bbbb)*" :: String
-- regex = "(tag1 b|tag2 bb|tag3 bbb|tag4 bbbb)*" :: String  -- what I am interested in now
main = do
  if str =~ regex then putStrLn "matched" else putStrLn "no matches"
  let matches = getAllTextMatches (str =~ regex) :: Array Int (Array Int String)
  print matches
  -- Output: array (0,1) [(0,array (0,1) [(0,"bbbb"),(1,"bbbb")]),(1,array (0,1) [(0,""),(1,"")])]
  let matches = getAllTextMatches $ str =~ regex :: [Array Int String]
  print matches
  -- Output: [array (0,1) [(0,"bbbb"),(1,"bbbb")],array (0,1) [(0,""),(1,"")]]
  let matches = getAllTextMatches $ str =~ regex :: Array Int [String]
  print matches
  -- Output: array (0,1) [(0,["bbbb","bbbb"]),(1,["",""])]
  let matches = getAllTextMatches $ str =~ regex :: Array Int String
  print matches
  -- Output: array (0,1) [(0,"bbbb"),(1,"")]
  let matches = getAllTextMatches $ str =~ regex :: [String]
  print matches
  -- Output: ["bbbb",""]
  let matches = str =~ regex :: [[String]]
  print matches
  -- Output: [["bbbb","bbbb"],["",""]]
  let matches = getAllTextMatches $ str =~ regex :: Array Int (MatchText String)
  print matches
  -- Output: array (0,1) [(0,array (0,1) [(0,("bbbb",(0,4))),(1,("bbbb",(0,4)))]),(1,array (0,1) [(0,("",(4,0))),(1,("",(-1,0)))])]
  -- Using getAllMatches
  let matches = getAllMatches $ str =~ regex :: Array Int MatchArray
  print matches
  -- Output: array (0,1) [(0,array (0,1) [(0,(0,4)),(1,(0,4))]),(1,array (0,1) [(0,(4,0)),(1,(-1,0))])]
  let matches = getAllMatches $ str =~ regex :: Array Int (MatchOffset,MatchLength)
  print matches
  -- Output: array (0,1) [(0,(0,4)),(1,(4,0))]
  let matches = getAllMatches $ str =~ regex :: [(MatchOffset,MatchLength)]
  print matches
  -- Output: [(0,4),(4,0)]
Now I am very much interested in seeing how tag operations in a TDFA behave while matching subexpressions of an accepted input string. For example:
Let's say we have a regex R = (b|bb|bbb|bbbb)* and the input string "bbbb". Using the TDFA concept, I rewrite the regex as R = (tag1 b|tag2 bb|tag3 bbb|tag4 bbbb)* and match it against "bbbb". My interest is to see how many times the tags {1,2,3,4} are updated while matching R. On this input, the first "b" is the extent of tag1, the substring "bb" is the extent of tag2, and so on. For the full input "bbbb", tag4 gives its extent, while "bbb" matches the extent of tag3, "bb" the extent of tag2, and "b" the extent of tag1. I want to observe these operations using the TDFA module, that is, how many times these tags have to be updated in order to match the sub-regex within the regex for the given accepted input string.
Any kind of help would mean a lot to me. :)
P.S.: It's a challenge for a Haskell beginner, so I'm looking for a Haskell hacker, a sagacious one. :) Anyway, hoping for the best. :)
It is not possible to write explicit tags into the regexp passed to Regex-TDFA. Regex-TDFA supports POSIX regular expressions, and in POSIX submatch extraction concerns capturing groups (that is, parenthesized subexpressions). You can use capturing groups with Regex-TDFA as follows:
Prelude Text.Regex.TDFA> "bbbb" =~ "((b)|(bb)|(bbb)|(bbbb))*" :: MatchArray
array (0,5) [(0,(0,4)),(1,(0,4)),(2,(-1,0)),(3,(-1,0)),(4,(-1,0)),(5,(0,4))]
Here you see that your expression has 6 capturing groups: (b), (bb), (bbb), (bbbb), ((b)|(bb)|(bbb)|(bbbb)) and the whole regexp, which in POSIX is always the implicit first group.
- the whole regexp matches at offset 0 and spans 4 symbols
- ((b)|(bb)|(bbb)|(bbbb)) matches at offset 0 and spans 4 symbols
- (b) does not match -- hence the starting offset -1 and match length 0
- likewise (bb) does not match
- likewise (bbb) does not match
- finally, (bbbb) matches at offset 0 and spans 4 symbols.
You can use submatch extraction with other interfaces as well:
Prelude Text.Regex.TDFA> "bbbb" =~ "((b)|(bb)|(bbb)|(bbbb))*" :: [[String]]
[["bbbb","bbbb","","","","bbbb"],["","","","","",""]]
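For contrast, here is a small Python sketch of the same expression under a backtracking engine. Python's re module does not implement POSIX leftmost-longest submatch rules: it tries the alternatives in order, so the star iterates four times over (b) and the reported group offsets differ from the Regex-TDFA result above. The helper name `group_spans` is made up for this example.

```python
import re

def group_spans(pattern, text):
    """Return the (start, end) span of every group, group 0 being the whole match."""
    m = re.match(pattern, text)
    return [m.span(i) for i in range(m.re.groups + 1)]

spans = group_spans(r"((b)|(bb)|(bbb)|(bbbb))*", "bbbb")
# spans[0] is (0, 4) for the whole match; the outer group and (b) both report
# (3, 4), the last iteration of the star; unmatched groups report (-1, -1).
print(spans)
```

Note that Python reports (start, end) pairs, whereas the MatchArray above holds (offset, length) pairs.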
The tags are added implicitly by Regex-TDFA's internal algorithm -- they are an implementation detail hidden from the user. If what you want is submatch extraction with Haskell regexps, then you should stop reading at this point. If, however, you are interested in Regex-TDFA's theory of operation, then the answer to your question is much more involved.
Regex-TDFA is based on the concept of Tagged DFA, invented by Ville Laurikari in 2000.
Chris Kuklewicz, the author of Regex-TDFA, extended Ville's algorithm to support POSIX disambiguation semantics. He informally described his disambiguation algorithm on the Haskell wiki in 2007, and has recently shown little interest in its formalization or development.
The Kuklewicz disambiguation algorithm was adopted in the lexer generator RE2C and formalized in this unpublished paper (of which I happen to be the author). RE2C also supports leftmost greedy disambiguation semantics and allows you to use explicit tags. See also the simple example of parsing an IPv4 address or the more complex URI RFC-3986 parsing example to get the idea.
Back to your question:
how many times these tags have to be updated in order to match the sub-regex within the regex for the given accepted input string
The answer is: it depends on the degree of non-determinism of the given regular expression (see the paper, page 16, for the explanation). For a simple tag-deterministic regexp it is an insignificant constant-time overhead. For pathological cases with bounded repetition, see example 5 in the paper (page 21). See also the benchmarks (pages 27-29): they show that in real-world tests the overhead of submatch extraction is quite modest.
Note also that Regex-TDFA uses lazy determinization, that is, all the overhead of determinization and POSIX disambiguation is incurred at run time; therefore the overall overhead of submatch extraction is greater than in the RE2C case.
Finally, you can explore Regex-TDFA internals by using the examineDFA debug function defined in /Text/Regex/TDFA/TDFA.hs and further tweaking it to print the information you need.

Can SLR grammar have empty productions?

I've written the following grammar:
S->S ( S ) S
S->e
e stands for "empty string"
So the language this grammar recognizes includes all strings with matching left and right parentheses, like (), (()), (()()), etc.
This grammar is not SLR. Here is how I construct the SLR parse table:
Augment the grammar:
S1->S
S->S(S)S
S->e
Then construct LR(0) automaton for it:
I0:
S1->.S
S->.S(S)S
S->.e
I1:
S1->S.
S->S.(S)S
...
Please note that in I0 there is no shift or reduce action for the input symbol '(', which is the first token of any non-empty string this grammar generates.
So the SLR parser will report an error: in state I0, it doesn't know what to do when parsing the string (()).
My question is:
What is the culprit that makes this grammar NOT SLR? Is it the empty-string production, that is, S->e?
And in a general sense, can an SLR grammar have empty productions, like S->e in this example?
Thanks.
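The answer is yes. If no shift or reduce action is available for the current input and the state contains an item with the dot in front of the empty string, the parser takes that item: in effect it shifts on the empty terminal without consuming any input. (The usual textbook construction reaches much the same result by writing S->e as the already completed item S->. and reducing by it on the symbols in FOLLOW(S).)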
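To check this on the grammar from the question, here is a small Python sketch that builds the LR(0) states and the SLR actions using the textbook treatment, where the empty production is the completed item S->. and is reduced on FOLLOW(S). All names are made up for this example, and FOLLOW is written out by hand for this one grammar rather than computed. Under these assumptions, state I0 does get an action on '(' (a reduce by S->e), while the state containing S->S(S)S. gets both a shift and a reduce on '(', which suggests that the conflict comes from the ambiguity of S->S(S)S and not from the empty production.

```python
# Productions as (lhs, rhs); the empty production has an empty right-hand side.
GRAMMAR = [("S'", ("S",)), ("S", ("S", "(", "S", ")", "S")), ("S", ())]
NONTERMINALS = {"S'", "S"}
# FOLLOW sets written out by hand for this grammar (an assumption of the sketch).
FOLLOW = {"S'": {"$"}, "S": {"(", ")", "$"}}

def closure(items):
    """LR(0) closure; an item is (production index, dot position)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for prod, dot in list(items):
            rhs = GRAMMAR[prod][1]
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                for index, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[dot] and (index, 0) not in items:
                        items.add((index, 0))
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    moved = {(p, d + 1) for p, d in items
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved) if moved else None

def build_states():
    start = closure({(0, 0)})
    states, todo = [start], [start]
    while todo:
        state = todo.pop()
        for symbol in ("S", "(", ")"):
            nxt = goto(state, symbol)
            if nxt is not None and nxt not in states:
                states.append(nxt)
                todo.append(nxt)
    return states

def actions(state):
    """Return {terminal: set of SLR actions} for one state."""
    table = {}
    for prod, dot in state:
        lhs, rhs = GRAMMAR[prod]
        if dot < len(rhs):
            if rhs[dot] not in NONTERMINALS:
                table.setdefault(rhs[dot], set()).add("shift")
        elif lhs == "S'":
            table.setdefault("$", set()).add("accept")
        else:  # completed item, including S -> . for the empty production
            for terminal in FOLLOW[lhs]:
                table.setdefault(terminal, set()).add(f"reduce {prod}")
    return table

states = build_states()
for number, state in enumerate(states):
    print(number, {t: sorted(a) for t, a in actions(state).items()})
```

Any terminal that maps to more than one action in the printed table marks a conflict.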

attribute grammar for regular expression

How do I write this attribute grammar?
I am not sure about the production with the star.
Design a context-free grammar for regular expressions. Make this an attribute grammar with a set-valued attribute attached to the start symbol that is the language (set of strings) denoted by the regular expression. A regular expression can be empty, a symbol, the concatenation of two regular expressions, two regular expressions separated by a vertical bar, a regular expression followed by a star, or a regular expression in parentheses. E.g., for the regular expression ‘l(l|d)*’ your attribute grammar should construct the (infinite) set of all strings consisting of an l followed by zero or more occurrences of either l or d.
Thanks.
Hint: there's a generalized form of set union involving an index set and a set-valued expression involving the index. It's written something like the following:
⋃_{i ∈ I} f(i)
For example, the set of rational numbers is equal to
⋃_{i ∈ Z} { i / j | j ∈ Z, j ≠ 0 }
(Z, usually written in "blackboard bold", is the set of all integers.)
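To make the hint a little more concrete, here is a Python sketch that computes the set-valued attribute bottom-up over a regular-expression tree, reading the star production as an indexed union of the powers of the sub-language. Since the real set is infinite, the sketch only keeps strings up to a length bound. The tuple encoding of the tree, the function names and the `max_len` cut-off are all assumptions made for this example, not part of the exercise.

```python
def concat(left, right, max_len):
    """Concatenate two languages, keeping only strings of length <= max_len."""
    return {x + y for x in left for y in right if len(x + y) <= max_len}

def language(node, max_len):
    """Synthesized attribute: the strings of length <= max_len denoted by node."""
    kind = node[0]
    if kind == "empty":
        return {""}
    if kind == "sym":
        return {node[1]} if max_len >= 1 else set()
    if kind == "cat":
        return concat(language(node[1], max_len), language(node[2], max_len), max_len)
    if kind == "alt":
        return language(node[1], max_len) | language(node[2], max_len)
    if kind == "star":
        base = language(node[1], max_len)
        result = {""}   # the zeroth power contains only the empty string
        power = {""}
        while True:
            power = concat(power, base, max_len)   # next power of the base language
            if power <= result:                    # nothing new within the bound
                return result
            result |= power
    raise ValueError(f"unknown node kind: {kind}")

# l(l|d)* from the exercise, written as a tree
regex = ("cat", ("sym", "l"), ("star", ("alt", ("sym", "l"), ("sym", "d"))))
print(sorted(language(regex, 3)))
```

Parentheses need no case of their own here because they only shape the tree.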

Representing the strings we use in programming in math notation

Now I'm a programmer who's recently discovered how bad he is when it comes to mathematics and decided to focus a bit on it from that point forward, so I apologize if my question insults your intelligence.
In mathematics, is there a concept of strings like the one used in programming, i.e. a sequence of characters?
As an example, say I wanted to translate the following into mathematical notation:
let s be a string of n number of characters.
The reason is that I want to use that representation to find other things about the string s, such as its length: len(s).
How do you formally represent such a thing in mathematics?
Talking more practically, so to speak, let's say I wanted to mathematically explain such a function:
fitness(s,n) = 1 / |n - len(s)|
Or written in more "programming-friendly" sort of way:
fitness(s,n) = 1 / abs(n - len(s))
I used this function to explain how a fitness function for a given GA works; the question was about finding strings with 5 characters, and I needed the solutions to be sorted in ascending order according to their fitness score, given by the above function.
So my question is, how do you represent the above pseudo-code in mathematical notation?
You can use the notation of language theory, which is used to discuss things like regular languages, context free grammars, compiler theory, etc. A quick overview:
A set of characters is known as an alphabet. You could write: "Let A be the ASCII alphabet, a set containing the 128 ASCII characters."
A string is a sequence of characters. ε is the empty string.
A set of strings is formally known as a language. A common statement is, "Let s ∈ L be a string in language L."
Concatenating alphabets produces sets of strings (languages). A represents all 1-character strings; AA, also written A², is the set of all two-character strings. A⁰ is the set of all zero-length strings and is precisely A⁰ = {ε}. (It contains exactly one string, the empty string.)
A* is special notation and represents the set of all strings over the alphabet A, of any length. That is, A* = A⁰ ∪ A¹ ∪ A² ∪ A³ ∪ ... . You may recognize this notation from regular expressions.
For length use absolute value bars. The length of a string s is |s|.
So for your statement:
let s be a string of n number of characters.
You could write:
Let A be a set of characters and s ∈ Aⁿ be a string of n characters. The length of s is |s| = n.
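With that notation, the fitness function from the question could be typeset along the following lines. This is only one possible rendering: the side condition |s| ≠ n is an addition made here, since the quotient is undefined when the string already has the target length.

```latex
\[
  \operatorname{fitness}(s, n) = \frac{1}{\bigl|\, n - |s| \,\bigr|},
  \qquad s \in A^{*},\quad n \in \mathbb{N},\quad |s| \neq n .
\]
```

Here the outer bars are the absolute value of a number and the inner bars are the length of the string s.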
Mathematically, you have explained fitness(s, n) just fine as long as len(s) is well-defined.
In CS texts, a string s over a set S is defined as a finite ordered list of elements of S and its length is often written as |s| - but this is only notation, and doesn't change the (mathematical) meaning behind your definition of fitness, which is pretty clear just how you've written it.

Resources