Python Search for Pattern in list of string elements - python-3.x

I'm searching for a pattern in a list of string elements.
So far my code runs, but for some data it does not produce the required result.
Code
ss = '''
X A
B A
A C
A D
E A
A F
'''.strip()
lst = []
for r in ss.split('\n'):
    lst.append(r.split())
paths = []
for e in lst:
    # each row in source data
    pnew = []  # new path
    for p in paths:
        if e[0] in p:  # if start in existing path
            if p.index(e[0]) == len(p)-1:  # if end of path
                p.append(e[1])  # add to path
            else:
                pnew.append(p[:p.index(e[0])+1]+[e[1]])  # copy path then add
            break
    else:  # loop completed, not found
        paths.append(list(e))  # create new path
    if len(pnew):  # copied path
        paths.extend(pnew)  # add copied path
print('\n'.join([' -> '.join(e) for e in paths]))
What I'm getting is:
X -> A -> C
B -> A
X -> A -> D
E -> A
X -> A -> F
What my required result is:
B -> A -> C
X -> A -> D
E -> A -> F
X -> A -> C
B -> A -> D
B -> A -> F
X -> A -> F
I'm trying to get the pattern based on Cr and Dr (the Cr and Dr columns are optional):
X A Cr
B A Cr
A C Dr
A D Dr
E A Cr
A F Dr

It's easier to handle this with pandas:
import pandas as pd
from io import StringIO
ss = '''
X A
B A
A C
A D
E A
A F
'''.strip()
df = pd.read_csv(StringIO(ss), sep=' ', names=['source', 'target'])
df = df.merge(df, how='inner', left_on='target', right_on='source')
df = df[['source_x', 'target_x', 'target_y']]
df.apply(lambda x: ' -> '.join(x), axis=1).sort_values()
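The merge above pairs each row with every row whose source equals its target. For readers without pandas, here is a minimal plain-Python sketch of the same self-join. It assumes only two-hop paths are wanted, and the names edges, targets_by_source and two_hop_paths are illustrative, not part of the original answer.

```python
from collections import defaultdict

# Same source/target rows as in the question.
edges = [("X", "A"), ("B", "A"), ("A", "C"), ("A", "D"), ("E", "A"), ("A", "F")]

# Index the rows by source so the join is a dictionary lookup.
targets_by_source = defaultdict(list)
for source, target in edges:
    targets_by_source[source].append(target)

# Join each row with every row that starts where it ends (two hops only).
two_hop_paths = sorted(
    (source, middle, target)
    for source, middle in edges
    for target in targets_by_source.get(middle, [])
)

for path in two_hop_paths:
    print(" -> ".join(path))
```

Like the merge, this yields every combination of a row ending in A with a row starting at A, so longer chains would need a repeated join or a graph traversal.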

Related

python dictionary and deque to print required output based on some condition

I have a CSV file that contains data produced from mining.
I want to print it in the required format shown below.
Required Format
A -> B -> C -> D -> E -> F
A -> B -> C -> I
X -> Y -> Z
X -> Y -> P -> Q
A -> B -> K -> L
a.csv File
## code
from collections import deque
import pandas as pd
data = pd.read_csv("a.csv")
data['Start'] = data['Start'].str.replace(' ','_')
data['End'] = data['End'].str.replace(' ','_')
fronts = dict()
backs = dict()
sequences = []
position_counter = 0
selector = data.apply(lambda row: row.str.extractall(r"([\w+\d]+)"), axis=1)
for relation in selector:
    front, back = relation[0]
    llist = deque((front, back))
    finb = front in backs.keys()
    if finb:
        position = backs[front]
        llist2 = sequences[position]
        back_llist2 = llist2.pop()
        llist = llist2 + llist
        sequences[position] = llist
        backs[llist[-1]] = position
        if front in fronts.keys():
            del fronts[front]
        if back_llist2 in backs.keys():
            del backs[back_llist2]
    if not finb:
        sequences.append(llist)
        fronts[front] = position_counter
        backs[back] = position_counter
        position_counter += 1
data = []
for s in sequences:
    data.append(' -> '.join(str(el) for el in s))
data
What I'm Getting is:
'A -> B -> C -> D -> E -> F'
'C -> I'
'A -> N -> A'
'X -> Y -> Z'
'Y -> P -> Q'
'B -> K -> L'
'X1 -> Y1'
You need to search the existing paths for the starting element of the new row. If found, append to the existing path or copy the path and append the new end element.
Try this code:
ss = '''
A B
B C
C D
D E
E F
C I
A N
N A
X Y
Y Z
Y P
P Q
B K
K L
X1 Y1
'''.strip()
lst = []
for r in ss.split('\n'):
    lst.append(r.split())
################
paths = []
for e in lst:  # each row in source data
    pnew = []  # new path
    for p in paths:
        if e[0] in p:  # if start in existing path
            if p.index(e[0]) == len(p)-1:  # if end of path
                p.append(e[1])  # add to path
            else:
                pnew.append(p[:p.index(e[0])+1]+[e[1]])  # copy path then add
            break
    else:  # loop completed, not found
        paths.append(list(e))  # create new path
    if len(pnew):  # copied path
        paths.extend(pnew)  # add copied path
print('\n'.join([' => '.join(e) for e in paths]))
Output
A => B => C => D => E => F
A => B => C => I
A => N => A
X => Y => Z
X => Y => P => Q
A => B => K => L
X1 => Y1
The A->N->A and X1->Y1 are correct based on the source data. I don't know why they would be excluded in the desired output.

Search for Pattern in list : python Regex

After the data analysis, I append the required result to a list.
Now I need to retrieve or separate the result (search for a pattern and obtain it).
Code:
data = []
data.append('\n'.join([' -> '.join(e) for e in paths]))
The list contains this data:
CH_Trans -> St_1 -> WDL
TRANSFER_Trn -> St_1
Access_Ltd -> MPL_Limited
IPIPI -> TLC_Pvt_Ltd
234er -> Three_Star_Services -> Asian_Pharmas -> PPP_Channel
Sonata_Ltd -> Three_Star_Services
Arc_Estates -> Russian_Hosp
A -> B -> C -> D -> E -> F
G -> H
ZN_INTBNKOUT_SET -> -2008_1 -> X
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS
Rock_8CC -> Russ -> By_sus -> Rock_8CC
Note: display or retrieve only the patterns that contain two or more "->" symbols
(Txt -> txt -> txt)
I'm trying to get it done with a regex:
import re
for i in data:
    regex = r"\w+\s->\s\w+\s->\s\w+"
    match = re.findall(regex, i, re.MULTILINE)
    print(match)
Regex expressions I tried that did not give the required result:
#\w+\s->\s\w+\s->\s\w+
#\w+\s[-][>]\s\w+\s[-][>]\s\w+
#\w+\s[-][>]\s\w+\s[-][>]\s\w+\s[-][>]\s\w+
Result I Got
['CH_Trans-> St_1-> WDL', '234er -> Three_Star_Services -> Asian_Pharmas',
'A -> B -> C', 'D -> E -> F', 'ZZ_1_ -> AA_2 -> AA_3',
'SSS -> BBB -> SSS', 'Rock_8CC -> Russ -> By_sus']
The required result I want to obtain is:
----Pattern I------
CH_Trans -> St_1 -> WDL
234er -> Three_Star_Services -> Asian_Pharmas -> PPP_Channel
A -> B -> C -> D -> E -> F
ZN_INTBNKOUT_SET -> -2008_1 -> X
# Pattern II consists of the paths whose first and last elements are the same
----Pattern II------
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS
Rock_8CC -> Russ -> By_sus -> Rock_8CC
Would you please try the following as a starting point:
import re
regex = r'^\S+(?:\s->\s\S+){2,}$'
for i in data:
    m = re.match(regex, i)
    if m:
        print(m.group())
Results (Pattern I + Pattern II):
CH_Trans -> St_1 -> WDL
234er -> Three_Star_Services -> Asian_Pharmas -> PPP_Channel
A -> B -> C -> D -> E -> F
ZN_INTBNKOUT_SET -> -2008_1 -> X
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS
Rock_8CC -> Russ -> By_sus -> Rock_8CC
Explanation of the regex ^\S+(?:\s->\s\S+){2,}$:
^\S+ start with non-blank string
(?: ... ) grouping
\s->\s\S+ a blank followed by "->" followed by a blank and non-blank string
{2,} repeats the previous pattern (or group) two or more times
$ end of the string
For pattern II, use:
regex = r'^(\S+)(?:\s->\s\S+){1,}\s->\s\1$'
for i in data:
    m = re.match(regex, i)
    if m:
        print(m.group())
Results:
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS
Rock_8CC -> Russ -> By_sus -> Rock_8CC
Explanation of regex r'^(\S+)(?:\s->\s\S+){1,}\s->\s\1$':
- ^(\S+) captures the 1st element and assigns \1 to it
- (?: ... ) grouping
- \s->\s\S+ a blank followed by "->" followed by a blank and non-blank string
- {1,} repeats the previous pattern (or group) one or more times
- \s->\s\1 a blank followed by "->" followed by a blank and the 1st element \1
- $ end of the string
In order to obtain the result of pattern I, we may need to subtract the list of pattern II from the 1st results.
If we could say:
regex = r'^(\S+)(?:\s->\s\S+){2,}(?<!\1)$'
it would exclude the strings whose last element equals the 1st element, and we could obtain the result of pattern I directly. So far, however, that regex raises an error saying "group references in lookbehind assertions".
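A minimal sketch of that subtraction, assuming each path is its own string (if data holds one newline-joined string, as in the question, split it with splitlines() first). The names ANY_PATH, CYCLE and NO_CYCLE are only illustrative. NO_CYCLE is one possible single-regex alternative to the failing lookbehind: it puts a negative lookahead right after the first element, which Python's re module accepts.

```python
import re

text = """CH_Trans -> St_1 -> WDL
TRANSFER_Trn -> St_1
Access_Ltd -> MPL_Limited
234er -> Three_Star_Services -> Asian_Pharmas -> PPP_Channel
A -> B -> C -> D -> E -> F
G -> H
ZN_INTBNKOUT_SET -> -2008_1 -> X
ZZ_1_ -> AA_2 -> AA_3 -> ZZ_1_
XYZ- -> ABC -> XYZ-
SSS -> BBB -> SSS"""
lines = text.splitlines()  # one path per list element

# Three or more elements (Pattern I + Pattern II).
ANY_PATH = re.compile(r'^\S+(?:\s->\s\S+){2,}$')
# First element repeated as the last element (Pattern II).
CYCLE = re.compile(r'^(\S+)(?:\s->\s\S+)+\s->\s\1$')
# Three or more elements, rejecting lines that end in " -> <first element>".
NO_CYCLE = re.compile(r'^(\S+)(?!.*\s->\s\1$)(?:\s->\s\S+){2,}$')

pattern_two = [s for s in lines if CYCLE.match(s)]
# Subtraction: everything long enough that is not a cycle.
pattern_one = [s for s in lines if ANY_PATH.match(s) and not CYCLE.match(s)]
# The same selection with a single regex.
pattern_one_alt = [s for s in lines if NO_CYCLE.match(s)]

print('----Pattern I------', *pattern_one, sep='\n')
print('----Pattern II------', *pattern_two, sep='\n')
```

Splitting each line on " -> " and comparing the first and last items would work as well and may be easier to maintain than the lookahead.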

How to add line in code in python console

How do I add a new line in Python? For example, I would like to print the rest of a sentence on a new line, but instead of inserting "\n" by hand, I want to start a new line automatically after every six words.
Morse code translator
Something like:
def wrapper(words, n):
    to_print = ''
    for i in range(0, len(words.split()), n):
        to_print += ' '.join(words.split()[i:i+n]) + '\n'
    return to_print
and result is:
print(wrapper('a b c d e f g h i j k l m n o p r s t u w x y z', 6))
a b c d e f
g h i j k l
m n o p r s
t u w x y z

Identifying input values for which a function does NOT generate a specific output

I built a data structure in form of a function that outputs certain strings in response to certain input strings like this:
type MyDict = String -> String
emptydict :: MyDict
emptydict _ = "not found"
Now I can add entries into this dictionary by doing the following:
addentry :: String -> String -> MyDict -> MyDict
addentry s1 s2 d s
  | s1 == s = s2
  | otherwise = d s
To look for s2's I can simply enter s1 and look in my dictionary
looky :: String -> MyDict -> String
looky s1 d = d s1 -- gives s2
My goal is now to create another function patternmatch in which I can check which s1's are associated with an s2 that starts with a certain pattern. Now the pattern matching itself isn't the problem, but I am not sure how can I keep track of the entries I entered, i.e. for which input is the output not "not found" ?
My idea was to try to keep track of all the s1's I entered in the addentry function and add them to a separate list. In patternmatch I would feed the list elements to looky, such that I can get back the associated s2's and check whether they match the pattern.
So my questions:
1) Is this list building approach good or is there a better way of identifying the inputs for which a function is defined as something other than "not found"?
2) If it is the right approach, how would I keep track of the s1's? I was thinking something like:
addentry s1 s2 d s
  | last (save s1) == s = s2
  | otherwise = d s1
And then save s1 being a function generating the list with all s1's. last (save s1) would then return the most recent s1. Would appreciate any help on implementing save s1 or other directions going from here. Thanks a lot.
Your design is hard-coded such that the only criterion for finding a key is presenting the exact same key. What you need is a more flexible approach that lets you provide a criterion other than equality. I took the liberty of making your code more general and using more conventional names for the functions:
import Prelude hiding (lookup)
-- instead of k -> Maybe v, we represent the dictionary as
-- (k -> Bool) -> Maybe v where k -> Bool is the criteria
-- on which to match the key. by using Maybe v we can signal
-- that no qualifying key was found by returning Nothing
-- instead of "not found"
newtype Dict k v = Dict ((k -> Bool) -> Maybe v)
empty :: Dict k v
empty = Dict $ const Nothing
-- insert a new key/value pair
insert :: k -> v -> Dict k v -> Dict k v
insert k v d = Dict $ \f -> if f k then Just v else lookupBy f d
-- lookup using the given criteria
lookupBy :: (k -> Bool) -> Dict k v -> Maybe v
lookupBy f (Dict d) = d f
-- lookup using the default criteria (equality with some given key)
lookup :: Eq k => k -> Dict k v -> Maybe v
lookup k = lookupBy (k==)
-- your criteria
startsWith :: String -> String -> Bool
startsWith s = undefined -- TODO
lookupByPrefix :: String -> Dict String v -> Maybe v
lookupByPrefix = lookupBy . startsWith
I should mention that while this is a great exercise for functional programming practice and general brain-expansion, it's a terrible way to implement a map. A list of pairs is equivalent and easier to understand.
As a side note, we can easily define an instance of Functor for this type:
instance Functor (Dict k) where
  fmap f d = Dict $ \g -> fmap f (lookupBy g d)

Simple grammar give ValueError in Python

I'm new to Python, NLTK and NLP. I have written a simple grammar, but running the program gives the error below. Please help me solve it.
Grammar:-
S -> NP
NP -> PN|PRO|D[NUM=?n] N[NUM=?n]|D[NUM=?n] A N[NUM=?n]|D[NUM=?n] N[NUM=?n] PP|QP N[NUM=?n]|A N[NUM=?n]|D[NUM=?n] NOM PP|D[NUM=?n] NOM
PP -> P NP
D[NUM=sg] -> 'a'
D -> 'the'
N[NUM=sg] -> 'boy'|'girl'|'room'|'garden'|'hair'
N[NUM=pl] -> 'dogs'|'cats'
PN -> 'saumya'|'dinesh'
PRO -> 'she'|'he'|'we'
A -> 'tall'|'naughty'|'long'|'three'|'black'
P -> 'with'|'in'|'from'|'at'
QP -> 'some'
NOM -> A NOM|N[NUM=?n]
Code:-
import nltk
grammar = nltk.data.load('file:english_grammer.cfg')
rdparser = nltk.RecursiveDescentParser(grammar)
sent = "a dogs".split()
trees = rdparser.parse(sent)
for tree in trees: print (tree)
Error:-
ValueError: Expected a nonterminal, found: [NUM=?n] N[NUM=?n]|D[NUM=?n] A N[NUM=?n]|D[NUM=?n] N[NUM=?n] PP|QP N[NUM=?n]|A N[NUM=?n]|D[NUM=?n] NOM PP|D[NUM=?n] NOM
I don't think NLTK CFG grammar readers can read the format of your CFG with square brackets.
First let's try a CFG grammar without the square brackets:
from nltk.grammar import CFG
grammar_string = '''
S -> NP
PP -> P NP
D -> 'the'
PN -> 'saumya'|'dinesh'
PRO -> 'she'|'he'|'we'
A -> 'tall'|'naughty'|'long'|'three'|'black'
P -> 'with'|'in'|'from'|'at'
QP -> 'some'
'''
grammar = CFG.fromstring(grammar_string)
print(grammar)
[out]:
Grammar with 18 productions (start state = S)
S -> NP
PP -> P NP
D -> 'the'
PN -> 'saumya'
PN -> 'dinesh'
PRO -> 'she'
PRO -> 'he'
PRO -> 'we'
A -> 'tall'
A -> 'naughty'
A -> 'long'
A -> 'three'
A -> 'black'
P -> 'with'
P -> 'in'
P -> 'from'
P -> 'at'
QP -> 'some'
Now let's put the square brackets in:
from nltk.grammar import CFG
grammar_string = '''
S -> NP
PP -> P NP
D -> 'the'
PN -> 'saumya'|'dinesh'
PRO -> 'she'|'he'|'we'
A -> 'tall'|'naughty'|'long'|'three'|'black'
P -> 'with'|'in'|'from'|'at'
QP -> 'some'
N[NUM=sg] -> 'boy'|'girl'|'room'|'garden'|'hair'
N[NUM=pl] -> 'dogs'|'cats'
'''
grammar = CFG.fromstring(grammar_string)
print(grammar)
[out]:
Traceback (most recent call last):
  File "test.py", line 33, in <module>
    grammar = CFG.fromstring(grammar_string)
  File "/usr/local/lib/python2.7/dist-packages/nltk/grammar.py", line 519, in fromstring
    encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/nltk/grammar.py", line 1273, in read_grammar
    (linenum+1, line, e))
ValueError: Unable to parse line 10: N[NUM=sg] -> 'boy'|'girl'|'room'|'garden'|'hair'
Expected an arrow
Going back to your grammar, it seems like you're using the square brackets to mark non-terminals as constrained or unconstrained, so the solution would be:
- use an underscore suffix for the constrained non-terminals, and
- add a rule that expands each unconstrained non-terminal into its constrained variants.
So your CFG rules will look like this:
from nltk.parse import RecursiveDescentParser
from nltk.grammar import CFG
grammar_string = '''
S -> NP
NP -> PN | PRO | D N | D A N | D N PP | QP N | A N | D NOM PP | D NOM
PP -> P NP
PN -> 'saumya'|'dinesh'
PRO -> 'she'|'he'|'we'
A -> 'tall'|'naughty'|'long'|'three'|'black'
P -> 'with'|'in'|'from'|'at'
QP -> 'some'
D -> D_def | D_sg
D_def -> 'the'
D_sg -> 'a'
N -> N_sg | N_pl
N_sg -> 'boy'|'girl'|'room'|'garden'|'hair'
N_pl -> 'dogs'|'cats'
'''
grammar = CFG.fromstring(grammar_string)
rdparser = RecursiveDescentParser(grammar)
sent = "a dogs".split()
trees = rdparser.parse(sent)
for tree in trees:
    print(tree)
[out]:
(S (NP (D (D_sg a)) (N (N_pl dogs))))
It looks like you're trying to use NLTK's feature grammars, which do use the square bracket syntax to denote features and feature agreement. The NLTK parser for feature grammars is FeatureEarleyChartParser (as opposed to RecursiveDescentParser).
From the NLTK documentation:
>>> from __future__ import print_function
>>> import nltk
>>> from nltk import grammar, parse
>>> g = """
... % start DP
... DP[AGR=?a] -> D[AGR=?a] N[AGR=?a]
... D[AGR=[NUM='sg', PERS=3]] -> 'this' | 'that'
... D[AGR=[NUM='pl', PERS=3]] -> 'these' | 'those'
... D[AGR=[NUM='pl', PERS=1]] -> 'we'
... D[AGR=[PERS=2]] -> 'you'
... N[AGR=[NUM='sg', GND='m']] -> 'boy'
... N[AGR=[NUM='pl', GND='m']] -> 'boys'
... N[AGR=[NUM='sg', GND='f']] -> 'girl'
... N[AGR=[NUM='pl', GND='f']] -> 'girls'
... N[AGR=[NUM='sg']] -> 'student'
... N[AGR=[NUM='pl']] -> 'students'
... """
>>> grammar = grammar.FeatureGrammar.fromstring(g)
>>> tokens = 'these girls'.split()
>>> parser = parse.FeatureEarleyChartParser(grammar)
>>> trees = parser.parse(tokens)
>>> for tree in trees: print(tree)
(DP[AGR=[GND='f', NUM='pl', PERS=3]]
  (D[AGR=[NUM='pl', PERS=3]] these)
  (N[AGR=[GND='f', NUM='pl']] girls))
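As a rough illustration of what agreement through a shared variable such as ?a or ?n checks, here is a toy plain-Python sketch. It is not NLTK's implementation: the LEXICON table and the agrees helper are hypothetical, and only a single NUM feature is modelled.

```python
# Hypothetical toy lexicon: word -> (category, NUM value or None if unspecified).
LEXICON = {
    'a': ('D', 'sg'),
    'the': ('D', None),   # 'the' carries no NUM constraint
    'boy': ('N', 'sg'),
    'dogs': ('N', 'pl'),
}

def agrees(det, noun):
    """Return True if det and noun can share one value for NUM."""
    det_cat, det_num = LEXICON[det]
    noun_cat, noun_num = LEXICON[noun]
    if (det_cat, noun_cat) != ('D', 'N'):
        return False
    # An unspecified feature is compatible with any value.
    return det_num is None or noun_num is None or det_num == noun_num

for det, noun in [('a', 'boy'), ('a', 'dogs'), ('the', 'dogs')]:
    print(det, noun, agrees(det, noun))
```

Under these assumptions "a dogs" is rejected while "the dogs" is accepted, which is the kind of constraint the plain CFG rewrite with underscores does not enforce.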
Store the grammar in a file with the .fcfg extension and load it with load_parser from the nltk package,
e.g. english_grammer.fcfg.
I used the following code to load it:
import nltk
from nltk import load_parser
chart = load_parser('file:english_grammer.fcfg')
sent = 'the girl gave the dog a bone'.split()
trees = chart.nbest_parse(sent)  # NLTK 2 name; NLTK 3 parsers expose chart.parse(sent) instead
for tree in trees: print(tree)
That solved the issue for me.
