LookBehind Regex - python-3.x

I am having trouble creating a regex pattern in Python that looks behind a keyword and finds a certain character.
Ex. x = " ? asdasdjkh khdsjkhas What???<!##%^&*()ROOT"
In the string above, I am trying to find the double quote ("). The pattern should start looking backwards from ROOT, stop at the first ", ' or > it finds, and print what it found.
Example Test:
x = " ? asdasdjkh khdsjkhas What???<!##%^&*()ROOT" ==> OUTPUT = "
x = " ? asdasdjkh ' khdsjkhas What???<!##%^&*()ROOT" ==> OUTPUT = '
x = " ? asdasdjkh > khdsjkhas What???<!##%^&*()ROOT" ==> OUTPUT = >
ANOTHER EXAMPLE:
Input
<select id="idAddCommune" name="idAddCommune"
data-rule-required="true"
data-msg="[Key delivery.pickup.front.select.commune.required.message Not Found]"
aria-required="true"
class="cform-control d-block-import idAddCommune"
onchange="selectCommune(this, 'HELLO? ROOT')" shippingGroup-id=""><option value="">Selecciona Comuna</option><option value="19">ARICA</option><option value="2023">BELEN</option><option value="2039">CAMARONES</option><option value="2046">CAQUENA</option><option value="2080">CODPA</option><option value="2092">COSAPILLA</option><option value="2107">CUYA</option><option value="2134">ESQUINA</option><option value="2137">GENERAL LAGOS</option><option value="2251">MOLINOS</option><option value="2272">PACHAMA</option><option value="2308">POCONCHILE</option><option value="2342">PUTRE</option><option value="2411">SOCOROMA</option><option value="2414">SORA</option><option value="2421">TIGNAMAR</option><option value="2447">VISVIRI</option></select><input type="hidden" value="" id="mapcityCommuneSelected" />
Output
' because if you search for the ROOT keyword, you will see that it sits inside a single-quoted (') string.

I think it can be done much more easily with different logic: you want to match the last occurrence of one of the characters ', " or > before ROOT. So I suggest this pattern: ['">](?=[^'">]*ROOT)
Explanation:
['">] = match one of characters inside character class ' or " or >
(?=...) - positive lookahead
[^'">]* - match zero or more of any character other than ' or " or >
ROOT - match ROOT literally
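A minimal Python sketch of the pattern above, run against the three test strings from the question. The helper name last_delimiter_before_root is hypothetical, and the samples are written so that the leading double quote is part of the data rather than Python's own string delimiter.

```python
import re

# Pattern from the answer above: one delimiter, followed only by
# non-delimiter characters up to the literal ROOT.
PATTERN = re.compile(r"""['">](?=[^'">]*ROOT)""")


def last_delimiter_before_root(text):
    """Return the quote or > character nearest before the first ROOT, else None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


samples = [
    '" ? asdasdjkh khdsjkhas What???<!##%^&*()ROOT"',
    '" ? asdasdjkh \' khdsjkhas What???<!##%^&*()ROOT"',
    '" ? asdasdjkh > khdsjkhas What???<!##%^&*()ROOT"',
]
for sample in samples:
    print(last_delimiter_before_root(sample))
```

Because the lookahead forbids any delimiter between the matched character and ROOT, the first match found by search is already the closest one, so no lookbehind is needed.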

You don't need or want a lookbehind here.
>>> import re
>>> p = re.compile(r"('|>|\")[^'\">]+?ROOT")
>>> p.search("foo ' bar ROOT").group(1)
"'"
>>> p.search('foo " bar ROOT').group(1)
'"'
>>> p.search("foo > bar ROOT").group(1)
'>'

Related

Unable to update ' and "" in the stop_word list

I tried to update ' and " in my stop_word list.
> stop_words.update(["'","""])
> stop_words
I got the following error.
> File "<ipython-input-85-54a2b8b08201>", line 2
> stop_words
> SyntaxError: EOF while scanning triple-quoted string literal
How can I add those characters to stop_words?
A literal " can e.g. be denoted as '"' or "\"". Your """ syntactically starts a triple-quoted string (which is never ended, causing the SyntaxError).
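A short sketch of the fix described above. The name stop_words comes from the question; its initial contents here are made up for illustration.

```python
# stop_words stands in for the asker's set; the two starting words are invented.
stop_words = {"the", "and"}

# A single quote written inside double quotes, and a double quote
# written inside single quotes.
stop_words.update(["'", '"'])

# Escaping works too: "\"" is the same one-character string as '"'.
assert "\"" == '"'

print(sorted(stop_words))
```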

I need to clean seismological events from a text file

The question here is related to the same type of file I asked another question about, almost one month ago (I need to split a seismological file so that I have multiple subfiles).
My goal now is to delete events which in their first line contain the string 'RSN 3'. So far I have tried editing the aforementioned question's best answer code like this:
with open(refcatname) as fileContent:
    for l in fileContent:
        check_rsn_3 = l[45:51]
        if check_rsn_3 == "RSN 3":
            line = l[:-1]
            check_event = line[1:15]
            print(check_event, check_rsn_3)
        if not check_rsn_3 == "RSN 3":
            # Strip white spaces to make sure is an empty line
            if not l.strip():
                subFile.write(
                    eventInfo + "\n"
                )  # Add event to the subfile
                eventInfo = ""  # Reinit event info
                eventCounter += 1
                if eventCounter == 700:
                    subFile.close()
                    fileId += 1
                    subFile = open(
                        os.path.join(
                            catdir,
                            "Paquete_Continental_"
                            + str(fileId)
                            + ".out",
                        ),
                        "w+",
                    )
                    eventCounter = 0
            else:
                eventInfo += l
    subFile.close()
Expected results: Event info of earthquakes with 'RSN N' (where N≠3)
Actual results: First line of events with 'RSN 3' is deleted but not the remaining event info.
Thanks in advance for your help :)
I'd advise against checking whether the string is at an exact location (e.g. l[45:51]), since a single shifted character breaks the comparison. Instead, you can check whether the line contains "RSN 3" anywhere with if "RSN 3" in l.
Note that line = l[:-1] keeps everything except the last character of the line (normally the trailing newline); the last character alone would be l[-1]. The result is still a string, so line[1:15] is a valid slice.
But if you need to delete several lines, you could just check if the current line contains "RSN 3", and then read line after line until one contains "RSN " while skipping the ones in between.
skip = False
for line in fileContent:
    # First line of an event to drop: start skipping
    if "RSN 3" in line:
        skip = True
        continue
    # First line of any other event: stop skipping
    if "RSN " in line and "RSN 3" not in line:
        skip = False
    if skip:
        continue
    # rest of the logic
This way you don't even parse the blocks whose first line contains "RSN 3".
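The skipping idea can also be sketched as a self-contained function. This assumes, as the comment in the question's code suggests, that each event ends with a blank line and that the RSN marker appears on the event's first line; the function name and the sample lines are hypothetical.

```python
def drop_events(lines, marker="RSN 3"):
    """Return the lines of every event whose first line lacks marker.

    Assumes events are separated by blank lines.
    """
    kept = []
    event = []
    for line in lines:
        if line.strip():
            # Still inside an event: buffer the line.
            event.append(line)
            continue
        # A blank line closes the current event.
        if event and marker not in event[0]:
            kept.extend(event)
            kept.append(line)
        event = []
    # Flush a final event that is not followed by a blank line.
    if event and marker not in event[0]:
        kept.extend(event)
    return kept


# Invented sample: two events, the first one flagged with RSN 3.
sample = [
    "1996 0101 RSN 3\n",
    "data a\n",
    "\n",
    "1996 0102 RSN 2\n",
    "data b\n",
    "\n",
]
print("".join(drop_events(sample)), end="")
```

Buffering a whole event before deciding keeps the decision in one place, at the cost of holding one event in memory at a time.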

Python- Trying to print a variable and using (str) in front of number, but still getting a syntax error

I created a variable called "age = 23". I try and add it to another variable called "message" and specify it as a string using the (str) tag in front of the "age" variable.
age = 23
message = "Happy " + (str)age + "rd birthday!"
print(message)
But whenever I try to run the code, it comes back with a syntax error that looks like this:
line 7
message = "Happy " + (str)age + "rd birthday!"
^
SyntaxError: invalid syntax
1|potter:/ $
You have the parentheses in the wrong place. I'm not going to use your example because, as AChampion mentioned, you added it as an image, but here is an example:
number = 34
message = "The number is " + str(number)
print(message)
I'd recommend taking some time to read the Python documentation, as this can help get your head around the language and its more basic uses.
message = "Happy " + str(age) + "rd birthday!"
print(message)
If you are learning Python from other languages such as C, you might know this method:
age = 23
print ('Happy %srd birthday!' % (age))
print ('Happy {}rd birthday!'.format(age)) # Use this
Might help: https://pyformat.info/
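For comparison, a small sketch that puts the approaches from the answers side by side and adds an f-string, which is available from Python 3.6 onward. The variable names are illustrative only.

```python
age = 23

# Three equivalent ways to build the message; age is converted to text each time.
by_concatenation = "Happy " + str(age) + "rd birthday!"
by_format = "Happy {}rd birthday!".format(age)
by_fstring = f"Happy {age}rd birthday!"  # f-strings need Python 3.6 or later

print(by_fstring)
```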

How to convert cmudict-0.7b or cmudict-0.7b.dict in to FST format to use it with phonetisaurus?

I am looking for a simple procedure to generate FST (finite state transducer) from cmudict-0.7b or cmudict-0.7b.dict, which will be used with phonetisaurus.
I tried the following set of commands (phonetisaurus aligner, Google NGram library and phonetisaurus arpa2wfst) and was able to generate an FST, but it didn't work. I am not sure where I made a mistake or missed a step. I suspect the very first command, phonetisaurus-align, is not correct.
phonetisaurus-align --input=cmudict.dict --ofile=cmudict/cmudict.corpus --seq1_del=false
ngramsymbols < cmudict/cmudict.corpus > cmudict/cmudict.syms
/usr/local/bin/farcompilestrings --symbols=cmudict/cmudict.syms --keep_symbols=1 cmudict/cmudict.corpus > cmudict/cmudict.far
ngramcount --order=8 cmudict/cmudict.far > cmudict/cmudict.cnts
ngrammake --v=2 --bins=3 --method=kneser_ney cmudict/cmudict.cnts > cmudict/cmudict.mod
ngramprint --ARPA cmudict/cmudict.mod > cmudict/cmudict.arpa
phonetisaurus-arpa2wfst-omega --lm=cmudict/cmudict.arpa > cmudict/cmudict.fst
I tried fst with phonetisaurus-g2p as follows:
phonetisaurus-g2p --model=cmudict/cmudict.fst --nbest=3 --input=HELLO --words
But it didn't return anything....
Appreciate any help on this matter.
It is very important to keep the dictionary in the right format. Phonetisaurus is very sensitive about that: it requires the word and its phonemes to be tab-separated; spaces will not work. It also does not allow the pronunciation variant numbers that CMUSphinx uses, like (2) or (3). You need to clean up the dictionary, for example with a simple Python script, before feeding it into Phonetisaurus. Here is the one I use:
#!/usr/bin/env python3
import sys
if len(sys.argv) != 3:
    print("Split the list on train and test sets")
    print()
    print("Usage: traintest.py file split_count")
    sys.exit()
infile = open(sys.argv[1], "r")
outtrain = open(sys.argv[1] + ".train", "w")
outtest = open(sys.argv[1] + ".test", "w")
cnt = 0
split_count = int(sys.argv[2])
for line in infile:
    items = line.split()
    # Drop a trailing variant marker such as "(2)" from the word
    if items[0][-1] == ')':
        items[0] = items[0][:-3]
    # Skip words that contain an underscore after the first character
    if items[0].find("_") > 0:
        continue
    # Tab between word and phonemes, single spaces between phonemes
    line = items[0] + '\t' + " ".join(items[1:]) + '\n'
    if cnt % split_count == 3:
        outtest.write(line)
    else:
        outtrain.write(line)
    cnt = cnt + 1
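The cleanup step on its own can be sketched as a small function, separate from the train/test split. The helper name normalize_entry is hypothetical, and skipping lines that start with ";;;" assumes that is how comments are marked in the dictionary file.

```python
import re

# Matches a trailing variant marker such as "(2)" or "(13)" on the word.
VARIANT = re.compile(r"\(\d+\)$")


def normalize_entry(line):
    """Return 'WORD<TAB>PH1 PH2 ...', or None for blank and comment lines."""
    items = line.split()
    # Assumption: comment lines begin with ";;;".
    if not items or items[0].startswith(";;;"):
        return None
    word = VARIANT.sub("", items[0])
    return word + "\t" + " ".join(items[1:])


print(repr(normalize_entry("HELLO(2)  HH EH0 L OW1\n")))
```

Using a regex for the variant marker also covers two-digit variants, which the fixed three-character slice in the script above would not strip cleanly.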

is there a way to define auto-escaped string in lua (raw)?

The following lines are arbitrary regexp which I need to use in lua.
['\";=]
!^(?:(?:[a-z]{3,10}\s+(?:\w{3,7}?://[\w\-\./]*(?::\d+)?)?/[^?#]*(?:\?[^#\s]*)?(?:#[\S]*)?|connect (?:\d{1,3}\.){3}\d{1,3}\.?(?::\d+)?|options \*)\s+[\w\./]+|get /[^?#]*(?:\?[^#\s]*)?(?:#[\S]*)?)$
'(?i:(?:c(?:o(?:n(?:t(?:entsmartz|actbot/)|cealed defense|veracrawler)|mpatible(?: ;(?: msie|\.)|-)|py(?:rightcheck|guard)|re-project/1.0)|h(?:ina(?: local browse 2\.|claw)|e(?:rrypicker|esebot))|rescent internet toolpak)|w(?:e(?:b(?: (?:downloader|by mail)|(?:(?:altb|ro)o|bandi)t|emailextract?|vulnscan|mole)|lls search ii|p Search 00)|i(?:ndows(?:-update-agent| xp 5)|se(?:nut)?bot)|ordpress(?: hash grabber|\/4\.01)|3mir)|m(?:o(?:r(?:feus fucking scanner|zilla)|zilla\/3\.mozilla\/2\.01$|siac 1.)|i(?:crosoft (?:internet explorer\/5\.0$|url control)|ssigua)|ailto:craftbot\#yahoo\.com|urzillo compatible)|p(?:ro(?:gram shareware 1\.0\.|duction bot|webwalker)|a(?:nscient\.com|ckrat)|oe-component-client|s(?:ycheclone|urf)|leasecrawl\/1\.|cbrowser|e 1\.4|mafind)|e(?:mail(?:(?:collec|harves|magne)t|(?: extracto|reape)r|(siphon|spider)|siphon|wolf)|(?:collecto|irgrabbe)r|ducate search vxb|xtractorpro|o browse)|t(?:(?: ?h ?a ?t ?' ?s g ?o ?t ?t ?a ? h ?u ?r ?|his is an exploi|akeou)t|oata dragostea mea pentru diavola|ele(?:port pro|soft)|uring machine)|a(?:t(?:(?:omic_email_hunt|spid)er|tache|hens)|d(?:vanced email extractor|sarobot)|gdm79\#mail\.ru|miga-aweb\/3\.4|utoemailspider| href=)|^(?:(google|i?explorer?\.exe|(ms)?ie( [0-9.]+)?\ ?(compatible( browser)?)?)$|www\.weblogs\.com|(?:jakart|vi)a|microsoft url|user-Agent)|s(?:e(?:archbot admin#google.com|curity scan)|(?:tress tes|urveybo)t|\.t\.a\.l\.k\.e\.r\.|afexplorer tl|itesnagger|hai)|n(?:o(?:kia-waptoolkit.* googlebot.*googlebot| browser)|e(?:(?:wt activeX; win3|uralbot\/0\.)2|ssus)|ameofagent|ikto)|f(?:a(?:(?:ntombrows|stlwspid)er|xobot)|(?:ranklin locato|iddle)r|ull web bot|loodgate|oobar/)|i(?:n(?:ternet(?: (?:exploiter sux|ninja)|-exprorer)|dy library)|sc systems irc search 2\.1)|g(?:ameBoy, powered by nintendo|rub(?: crawler|-client)|ecko\/25)|(myie2|libwen-us|murzillo compatible|webaltbot|wisenutbot)|b(?:wh3_user_agent|utch__2\.1\.1|lack hole|ackdoor)|d(?:ig(?:imarc webreader|out4uagent)|ts agent)|(?:(script|sql) 
inject|$botname/$botvers)ion|(msie .+; .*windows xp|compatible \; msie)|h(?:l_ftien_spider|hjhj#yahoo|anzoweb)|(?:8484 boston projec|xmlrpc exploi)t|u(?:nder the rainbow 2\.|ser-agent:)|(sogou develop spider|sohu agent)|(?:(?:d|e)browse|demo bot)|zeus(?: .*webster pro)?|[a-z]surf[0-9][0-9]|v(?:adixbot|oideye)|larbin#unspecified|\bdatacha0s\b|kenjin spider|; widows|rsync|\\\r))'
And there are many others where these came from.....
The point, as you might have noticed, is that in the first case only the " is escaped (as \"), but not the '.
Hence,
rex_pcre.new('['\";=]')
Won't work.
rex_pcre.new("['\";=]")
Should work; however, inside a quoted Lua string, parts of the regex such as \- are treated as escape sequences.
I also cannot use
[[ ]]
as there are regexp which ends with ] (first example)
breaking the lines as in
rex_pcre.new( [[
['\";=]
]])
won't work for me in cases such as the third one which ends with ) and also raised an error of unexpected symbol.
In sum, I am looking for an equivalent of Python's r"UNESCAPED STRING" or C#'s @"UNESCAPED STRING".
I assume there isn't such, but wonder what is the way to get a similar functionality, given the fact, I only consume those value (regular expression) and have no control on how to compose them originally..
Here is my current solution
I simply try to compile the line with [[ ]]; if that fails, I move on to " and then to '.
EscapeRegEx = function (xp)
    -- try with [[ ]]
    local opening = '[['
    local closing = ']]'
    local codeline = "rex_pcre.new(" .. opening .. xp .. closing .. ")"
    _, err = loadstring(codeline)
    if not err then return codeline end
    -- then try with "
    opening = '"'
    closing = '"'
    codeline = "rex_pcre.new(" .. opening .. xp .. closing .. ")"
    _, err = loadstring(codeline)
    if not err then return codeline end
    -- then try with '
    opening = "'"
    closing = "'"
    codeline = "rex_pcre.new(" .. opening .. xp .. closing .. ")"
    _, err = loadstring(codeline)
    if not err then return codeline end
end
You can use longer versions of the long brackets:
[=========[the regex goes in here]=========]
The opening long bracket will only be matched by a closing long bracket of the same length.
See the lexical conventions section of the Lua reference manual for more details; you can also do a similar thing to get nested multi-line comments.
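Since the regexes are consumed as text and turned into Lua source, the bracket level can be picked automatically. Below is a sketch in Python (the language used by the other examples on this page) of a hypothetical generator helper, lua_long_string, that chooses the smallest level the regex cannot close early. It is not part of Lua or of any regex library.

```python
import re


def lua_long_string(text):
    """Wrap text in a Lua long-bracket literal that text cannot close early."""
    # Collect every level n for which "]" + "=" * n + "]" occurs in text, or
    # would occur once the closing bracket is appended (hence the extra "]").
    used = {len(m.group(1)) for m in re.finditer(r"\](?=(=*)\])", text + "]")}
    level = 0
    while level in used:
        level += 1
    eq = "=" * level
    # Lua drops a newline that directly follows the opening bracket, so adding
    # one here keeps text intact even if it starts with a newline itself.
    return "[" + eq + "[\n" + text + "]" + eq + "]"


# First regex from the question; it ends with "]", so level 0 is not safe.
print("rex_pcre.new(" + lua_long_string(r"""['\";=]""") + ")")
```

Scanning for used levels up front avoids the compile-and-retry loop from the question, because a level that never appears in the text cannot terminate the literal prematurely.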
