Unable to update ' and "" in the stop_word list - nlp

I tried to add ' and " to my stop_word list.
> stop_words.update(["'","""])
> stop_words
I got the following error.
> File "<ipython-input-85-54a2b8b08201>", line 2
> stop_words
>
>
> SyntaxError: EOF while scanning triple-quoted string literal
How can I add those characters to the stop_word list?

A literal " can e.g. be denoted as '"' or "\"". Your """ syntactically starts a triple-quoted string (which is never ended, causing the SyntaxError).
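As a minimal sketch of that fix, assuming stop_words is a plain Python set (the small starting set below is hypothetical; a real one might come from an NLP library), each quote character can be wrapped in the other kind of quote:

```python
# Hypothetical starting set; a real one might come from an NLP library.
stop_words = {"the", "a", "an"}

# Wrap each quote character in the other kind of quote,
# so no backslash escaping is needed.
stop_words.update(["'", '"'])

print(sorted(stop_words))
```

Writing "\"" instead of '"' would work equally well; the point is only that three double quotes in a row never appear in the source.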

LookBehind Regex

I am having trouble creating a regex pattern in Python that looks behind and finds a certain character.
Ex. x = " ? asdasdjkh khdsjkhas What???<!##%^&*()ROOT"
In the string above, I am trying to find the double quote ("). The pattern should start at ROOT, look backwards, stop at the first ", ' or > it finds, and print that character.
Example Test:
x = " ? asdasdjkh khdsjkhas What???<!##%^&*()ROOT" ==> OUTPUT = "
x = " ? asdasdjkh ' khdsjkhas What???<!##%^&*()ROOT" ==> OUTPUT = '
x = " ? asdasdjkh > khdsjkhas What???<!##%^&*()ROOT" ==> OUTPUT = >
ANOTHER EXAMPLE:
Input
<select id="idAddCommune" name="idAddCommune"
data-rule-required="true"
data-msg="[Key delivery.pickup.front.select.commune.required.message Not Found]"
aria-required="true"
class="cform-control d-block-import idAddCommune"
onchange="selectCommune(this, 'HELLO? ROOT')" shippingGroup-id=""><option value="">Selecciona Comuna</option><option value="19">ARICA</option><option value="2023">BELEN</option><option value="2039">CAMARONES</option><option value="2046">CAQUENA</option><option value="2080">CODPA</option><option value="2092">COSAPILLA</option><option value="2107">CUYA</option><option value="2134">ESQUINA</option><option value="2137">GENERAL LAGOS</option><option value="2251">MOLINOS</option><option value="2272">PACHAMA</option><option value="2308">POCONCHILE</option><option value="2342">PUTRE</option><option value="2411">SOCOROMA</option><option value="2414">SORA</option><option value="2421">TIGNAMAR</option><option value="2447">VISVIRI</option></select><input type="hidden" value="" id="mapcityCommuneSelected" />
Output
' (because the ROOT keyword sits inside single quotes)
I think this is much easier with different logic: you want to match the last occurrence of one of the characters ', " or > before ROOT. So I suggest this pattern: ['">](?=[^'">]*ROOT)
Explanation:
['">] - match one of the characters in the character class: ' or " or >
(?=...) - positive lookahead
[^'">]* - match zero or more of any character other than ' or " or >
ROOT - match ROOT literally
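A short sketch of how this pattern could be used from Python. The helper name delimiter_before_root is made up for illustration, and the sample strings are the ones from the question with the surrounding quotes treated as part of the text:

```python
import re

# One of ' " > that is followed only by non-delimiter characters up to ROOT.
PATTERN = re.compile(r"""['">](?=[^'">]*ROOT)""")


def delimiter_before_root(text):
    """Return the delimiter found just before ROOT, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


samples = [
    '" ? asdasdjkh khdsjkhas What???<!##%^&*()ROOT"',
    '" ? asdasdjkh \' khdsjkhas What???<!##%^&*()ROOT"',
    '" ? asdasdjkh > khdsjkhas What???<!##%^&*()ROOT"',
    """onchange="selectCommune(this, 'HELLO? ROOT')" """,
]
for sample in samples:
    print(delimiter_before_root(sample))
```

Because the lookahead refuses to cross another delimiter, earlier quotes in the string fail to match and the search settles on the one nearest to ROOT.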
You don't need or want a lookbehind here.
>>> import re
>>> p = re.compile(r"('|>|\")[^'\">]+?ROOT")
>>> p.search("foo ' bar ROOT").group(1)
"'"
>>> p.search('foo " bar ROOT').group(1)
'"'
>>> p.search("foo > bar ROOT").group(1)
'>'

System.JSONException: Unexpected character ('i' (code 105)): was expecting comma to separate OBJECT entries at [line:1, column:18]

#SalesforceChallenge
I'm trying to escape a string but have had no success so far.
This is the response body I'm getting:
{"text":"this \"is something\" I wrote"}
Please note that there are 2 backslashes, one escaping each of the double-quote characters. (This is a sample. Actually I have a big payload to escape, with lots of "text" elements.)
When I try to deserialize it I get the following error:
System.JSONException: Unexpected character ('i' (code 105)): was expecting comma to separate OBJECT entries at [line:1, column:18]
I've tried to escape by using:
String my = '{"text":"this \"is something\" I wrote"}';
System.debug('test 0: ' + my);
System.debug('test 1: ' + my.replace('\"', '-'));
System.debug('test 2: ' + my.replace('\\"', '-'));
System.debug('test 3: ' + my.replace('\\\"', '-'));
System.debug('test 4: ' + my.replace('\\\\"', '-'));
--- Results:
[22]|DEBUG|test 0: {"text":"this "is something" I wrote"}
[23]|DEBUG|test 1: {-text-:-this -is something- I wrote-}
[24]|DEBUG|test 2: {"text":"this "is something" I wrote"}
[25]|DEBUG|test 3: {"text":"this "is something" I wrote"}
[26]|DEBUG|test 4: {"text":"this "is something" I wrote"}
--- What I need as result:
{"text":"this -is something- I wrote"}
Does anyone have a fix to share?
Thanks a lot.
This is the problem with your test runs in Anonymous Apex:
String my = '{"text":"this \"is something\" I wrote"}';
Because \ is an escape character, you need two backslashes in an Apex string literal to produce a backslash in the actual output:
String my = '{"text":"this \\"is something\\" I wrote"}';
Since Apex quotes strings with ', you don't have to escape the quotes for Apex; you're escaping them for the JSON parser.
The same principle applies to the strings you're trying to use to do replacements: you must escape the \ for Apex.
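The effect can be reproduced outside Apex. The sketch below is Python rather than Apex, on the assumption that the two languages treat \" and \\ the same way inside a single-quoted literal; the variable names lost and kept are made up for illustration:

```python
import json

# '\"' in the literal is just a double quote: the backslash is lost.
lost = '{"text":"this \"is something\" I wrote"}'
# '\\"' keeps one real backslash in the string, which the JSON parser needs.
kept = '{"text":"this \\"is something\\" I wrote"}'

print(lost)
print(kept)

try:
    json.loads(lost)
except json.JSONDecodeError as err:
    print("invalid JSON:", err)

print(json.loads(kept)["text"])
```

The first string reaches the parser with bare inner quotes, which is the situation the reported error describes; the second one parses cleanly.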
All that said, it's unclear why you are trying to manually alter this string. The payload
{"text":"this \"is something\" I wrote"}
is valid JSON. In general, you should not perform string replacement on inbound JSON structures in Apex unless you're attempting to compensate for a payload that contains an Apex reserved word as a key so that you can use typed deserialization.

I need to clean seismological events from a text file

The question here is related to the same type of file I asked another question about, almost one month ago (I need to split a seismological file so that I have multiple subfiles).
My goal now is to delete the events whose first line contains the string 'RSN 3'. So far I have tried editing the code from the best answer to that question like this:
with open(refcatname) as fileContent:
    for l in fileContent:
        check_rsn_3 = l[45:51]
        if check_rsn_3 == "RSN 3":
            line = l[:-1]
            check_event = line[1:15]
            print(check_event, check_rsn_3)
        if not check_rsn_3 == "RSN 3":
            # Strip white spaces to make sure is an empty line
            if not l.strip():
                subFile.write(
                    eventInfo + "\n"
                )  # Add event to the subfile
                eventInfo = ""  # Reinit event info
                eventCounter += 1
                if eventCounter == 700:
                    subFile.close()
                    fileId += 1
                    subFile = open(
                        os.path.join(
                            catdir,
                            "Paquete_Continental_"
                            + str(fileId)
                            + ".out",
                        ),
                        "w+",
                    )
                    eventCounter = 0
            else:
                eventInfo += l
    subFile.close()
Expected results: Event info of earthquakes with 'RSN N' (where N≠3)
Actual results: First line of events with 'RSN 3' is deleted but not the remaining event info.
Thanks in advance for your help :)
I'd advise against checking whether the string is at an exact position (e.g. l[45:51]), since a single shifted character breaks that. Instead, you can check whether the line contains "RSN 3" anywhere with if "RSN 3" in l.
Note that line = l[:-1] only drops the last character of the line (normally the newline), so line[1:15] still slices the rest of the line; l.rstrip("\n") is a safer way to remove the newline.
But if you need to delete several lines, you could just check if the current line contains "RSN 3", and then read line after line until one contains "RSN " while skipping the ones in between.
skip = False
for line in fileContent:
    if "RSN 3" in line:
        # First line of an unwanted event: start skipping
        skip = True
        continue
    if "RSN " in line and "RSN 3" not in line:
        # First line of a wanted event: stop skipping
        skip = False
    if skip:
        continue
    # rest of the logic
This way you don't even parse the blocks whose first line contains "RSN 3".
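An alternative sketch groups the lines into whole events first and then decides per event. It assumes, as the question's code does, that a blank line ends an event; the function name drop_events and the sample lines are made up for illustration and are not real catalogue data:

```python
def drop_events(lines, marker="RSN 3"):
    """Return the blank-line-separated events whose first line lacks `marker`."""
    kept, event = [], []
    for line in lines:
        if line.strip():
            event.append(line)
            continue
        # A blank line closes the current event.
        if event and marker not in event[0]:
            kept.append(event)
        event = []
    # The last event may not be followed by a blank line.
    if event and marker not in event[0]:
        kept.append(event)
    return kept


# Hypothetical input, only to show the shape of the data.
sample = [
    "event A header RSN 1\n",
    "event A detail\n",
    "\n",
    "event B header RSN 3\n",
    "event B detail\n",
    "\n",
    "event C header RSN 2\n",
    "event C detail\n",
]
for event in drop_events(sample):
    print("".join(event))
```

Each kept event is a list of its original lines, so it can be joined and written to the sub-files exactly as before.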

Python- Trying to print a variable and using (str) in front of number, but still getting a syntax error

I created a variable, age = 23. I am trying to add it to another variable called message, converting it to a string by putting (str) in front of age.
age = 23
message = "Happy " + (str)age + "rd birthday!"
print(message)
But whenever I try to run the code, it comes back with a syntax error that looks like this:
line 7
message = "Happy " + (str)age + "rd
birthday!"
^
SyntaxError: invalid syntax
1|potter:/ $
You have the parentheses in the wrong place: str is a function that you call as str(age), not a cast written as (str)age. I'm not going to use your example since, as AChampion mentioned, you added it as an image, but here is a similar one:
number = 34
message = "The number is " + str(number)
print(message)
I'd recommend taking some time to read the Python documentation, as this can help get your head around the language and its more basic uses.
message = "Happy " + str(age) + "rd birthday!"
print(message)
If you are learning Python from other languages such as C, you might know this method:
age = 23
print('Happy %srd birthday!' % age)
print('Happy {}rd birthday!'.format(age))  # Use this
Might help: https://pyformat.info/
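For comparison, here is a small sketch that puts these approaches side by side and adds an f-string, which is available from Python 3.6 onwards. The variable names are made up for illustration:

```python
age = 23

with_str = "Happy " + str(age) + "rd birthday!"    # explicit conversion
with_format = "Happy {}rd birthday!".format(age)   # str.format
with_fstring = f"Happy {age}rd birthday!"          # f-string, Python 3.6+

print(with_fstring)
```

All three build the same text; the formatting variants convert the number for you, so no str() call is needed.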

How to convert cmudict-0.7b or cmudict-0.7b.dict in to FST format to use it with phonetisaurus?

I am looking for a simple procedure to generate FST (finite state transducer) from cmudict-0.7b or cmudict-0.7b.dict, which will be used with phonetisaurus.
I tried the following set of commands (phonetisaurus aligner, Google NGram Library and phonetisaurus arpa2wfst) and was able to generate an FST, but it didn't work. I am not sure where I made a mistake or missed a step. I suspect the very first command, phonetisaurus-align, is not correct.
phonetisaurus-align --input=cmudict.dict --ofile=cmudict/cmudict.corpus --seq1_del=false
ngramsymbols < cmudict/cmudict.corpus > cmudict/cmudict.syms
/usr/local/bin/farcompilestrings --symbols=cmudict/cmudict.syms --keep_symbols=1 cmudict/cmudict.corpus > cmudict/cmudict.far
ngramcount --order=8 cmudict/cmudict.far > cmudict/cmudict.cnts
ngrammake --v=2 --bins=3 --method=kneser_ney cmudict/cmudict.cnts > cmudict/cmudict.mod
ngramprint --ARPA cmudict/cmudict.mod > cmudict/cmudict.arpa
phonetisaurus-arpa2wfst-omega --lm=cmudict/cmudict.arpa > cmudict/cmudict.fst
I tried fst with phonetisaurus-g2p as follows:
phonetisaurus-g2p --model=cmudict/cmudict.fst --nbest=3 --input=HELLO --words
But it didn't return anything....
Appreciate any help on this matter.
It is very important to keep the dictionary in the right format. Phonetisaurus is very sensitive about that: it requires the word and its phonemes to be tab-separated, and spaces will not work. It also does not allow the pronunciation variant numbers that CMUSphinx uses, such as (2) or (3). You need to clean up the dictionary before feeding it into phonetisaurus, for example with a simple Python script. Here is the one I use:
#!/usr/bin/env python3
import sys
if len(sys.argv) != 3:
    print("Split the list on train and test sets")
    print()
    print("Usage: traintest.py file split_count")
    sys.exit()
infile = open(sys.argv[1], "r")
outtrain = open(sys.argv[1] + ".train", "w")
outtest = open(sys.argv[1] + ".test", "w")
cnt = 0
split_count = int(sys.argv[2])
for line in infile:
    items = line.split()
    # Strip a pronunciation variant marker such as (2) from the word
    if items[0][-1] == ')':
        items[0] = items[0][:-3]
    # Skip entries with an underscore after the first character of the word
    if items[0].find("_") > 0:
        continue
    # Tab between the word and its phonemes
    line = items[0] + '\t' + " ".join(items[1:]) + '\n'
    # Lines whose counter has remainder 3 go to the test set
    if cnt % split_count == 3:
        outtest.write(line)
    else:
        outtrain.write(line)
    cnt = cnt + 1
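The clean-up step on its own, without the train/test split, might look like the sketch below. The function name normalize_entry is made up for illustration, and treating lines that start with ;;; as comments is an assumption about the cmudict-0.7b layout. Unlike the fixed [:-3] slice, the regular expression also handles variant numbers with more than one digit:

```python
import re

VARIANT = re.compile(r"\(\d+\)$")  # trailing variant marker such as (2)


def normalize_entry(line):
    """Return 'WORD<TAB>PHONEMES', or None for blank and comment lines."""
    # Assumption: comment lines in the dictionary start with ';;;'.
    if not line.strip() or line.startswith(";;;"):
        return None
    word, *phonemes = line.split()
    word = VARIANT.sub("", word)
    return word + "\t" + " ".join(phonemes)


print(repr(normalize_entry("HELLO(2)  HH EH0 L OW1")))
```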
