How to check for differences between two spaCy Doc objects?


I have two lists containing the same strings, except that the strings in the second list vary slightly, e.g. no capitalization, spelling errors, etc.
I want to check whether spaCy treats the two versions differently. That is, even if the strings aren't identical, I want to know whether the tagging and parsing differ.
I tried the following:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("foo")
doc2 = nlp("foo")
print(doc == doc2)
This prints False so == is not the way to go.
Ideally, I would want my code to find where potential differences are, but checking if anything at all is different would be a very helpful first step.
EDIT:
== was changed to work in newer spaCy versions. However, it only compares at the text level. Comparing dependencies is an entirely different story, and it has not been answered for spaCy yet, apart from this thread.

Token-Level Comparison
If you want to know whether the annotation is different, you'll have to go through the documents token by token to compare POS tags, dependency labels, etc. Assuming the tokenization is the same for both versions of the text, you can compare:
import spacy
# 'en' is a spaCy v2 shortcut; in v3 use the full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')
doc1 = nlp("What's wrong with my NLP?")
doc2 = nlp("What's wring wit my nlp?")
for token1, token2 in zip(doc1, doc2):
    print(token1.pos_, token2.pos_, token1.pos_ == token2.pos_)
Output:
NOUN NOUN True
VERB VERB True
ADJ VERB False
ADP NOUN False
ADJ ADJ True
NOUN NOUN True
PUNCT PUNCT True
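To check more than the coarse POS tag, the same loop can be generalized. The following is a minimal sketch, not a spaCy feature: the helper name annotation_diffs and its attrs parameter are made up for illustration, and it assumes both documents are tokenized identically. It only reads the token attributes pos_, tag_ and dep_, so the demo uses simple stand-in objects instead of loading a model.

```python
from types import SimpleNamespace


def annotation_diffs(doc1, doc2, attrs=("pos_", "tag_", "dep_")):
    """Return (index, attribute, value1, value2) for every mismatch.

    Hypothetical helper; assumes doc1 and doc2 are tokenized identically.
    """
    if len(doc1) != len(doc2):
        raise ValueError("tokenization differs; tokens cannot be paired up")
    diffs = []
    for i, (t1, t2) in enumerate(zip(doc1, doc2)):
        for attr in attrs:
            v1, v2 = getattr(t1, attr), getattr(t2, attr)
            if v1 != v2:
                diffs.append((i, attr, v1, v2))
    return diffs


# Stand-ins for spaCy tokens, used here only so the sketch runs without a model.
same = SimpleNamespace(pos_="PUNCT", tag_=".", dep_="punct")
tok_a = SimpleNamespace(pos_="ADJ", tag_="JJ", dep_="acomp")
tok_b = SimpleNamespace(pos_="VERB", tag_="VBG", dep_="csubj")

print(annotation_diffs([same, tok_a], [same, tok_b]))
# [(1, 'pos_', 'ADJ', 'VERB'), (1, 'tag_', 'JJ', 'VBG'), (1, 'dep_', 'acomp', 'csubj')]
```

An empty list answers the "is anything different at all?" question. With spaCy, the call would presumably look like annotation_diffs(nlp(text1), nlp(text2)).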
Visualization for Parse Comparison
If you want to visually inspect the differences, you might be looking for something like What's Wrong With My NLP?. If the tokenization is the same for both versions of the document, then I think you can do something like this to compare the parses:
First, you'd need to export your annotation into a supported format (some version of CoNLL for dependency parses), which is something textacy can do. (See: https://www.pydoc.io/pypi/textacy-0.4.0/autoapi/export/index.html#export.export.doc_to_conll)
from textacy import export
export.doc_to_conll(nlp("What's wrong with my NLP?"))
Output:
# sent_id 1
1 What what NOUN WP _ 2 nsubj _ SpaceAfter=No
2 's be VERB VBZ _ 0 root _ _
3 wrong wrong ADJ JJ _ 2 acomp _ _
4 with with ADP IN _ 3 prep _ _
5 my -PRON- ADJ PRP$ _ 6 poss _ _
6 NLP nlp NOUN NN _ 4 pobj _ SpaceAfter=No
7 ? ? PUNCT . _ 2 punct _ SpaceAfter=No
Then you need to decide how to modify things so you can see both versions of the token in the analysis. I'd suggest concatenating the tokens where there are variations, say:
1 What what NOUN WP _ 2 nsubj _ SpaceAfter=No
2 's be VERB VBZ _ 0 root _ _
3 wrong_wring wrong ADJ JJ _ 2 acomp _ _
4 with_wit with ADP IN _ 3 prep _ _
5 my -PRON- ADJ PRP$ _ 6 poss _ _
6 NLP_nlp nlp NOUN NN _ 4 pobj _ SpaceAfter=No
7 ? ? PUNCT . _ 2 punct _ SpaceAfter=No
vs. the annotation for What's wring wit my nlp?:
1 What what NOUN WP _ 3 nsubj _ SpaceAfter=No
2 's be VERB VBZ _ 3 aux _ _
3 wrong_wring wr VERB VBG _ 4 csubj _ _
4 with_wit wit NOUN NN _ 0 root _ _
5 my -PRON- ADJ PRP$ _ 6 poss _ _
6 NLP_nlp nlp NOUN NN _ 4 dobj _ SpaceAfter=No
7 ? ? PUNCT . _ 4 punct _ SpaceAfter=No
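The token concatenation can be scripted rather than done by hand. A minimal sketch, assuming both exports are tab-separated CoNLL text with the same number of lines and the word form in the second column; merge_forms is a hypothetical helper, not part of textacy or whatswrong:

```python
def merge_forms(conll_a, conll_b, sep="_"):
    """Give both CoNLL strings the same FORM column, joining forms that differ.

    Hypothetical helper: assumes tab-separated columns, identical
    tokenization, and the word form in the second column.
    """
    lines_a, lines_b = conll_a.splitlines(), conll_b.splitlines()
    if len(lines_a) != len(lines_b):
        raise ValueError("the two exports have different numbers of lines")
    out_a, out_b = [], []
    for line_a, line_b in zip(lines_a, lines_b):
        cols_a, cols_b = line_a.split("\t"), line_b.split("\t")
        # Only touch token rows; comments and blank lines pass through unchanged.
        is_token_row = (
            not line_a.startswith("#")
            and not line_b.startswith("#")
            and len(cols_a) > 1
            and len(cols_b) > 1
        )
        if is_token_row and cols_a[1] != cols_b[1]:
            merged = cols_a[1] + sep + cols_b[1]
            cols_a[1] = cols_b[1] = merged
        out_a.append("\t".join(cols_a))
        out_b.append("\t".join(cols_b))
    return "\n".join(out_a), "\n".join(out_b)


# Shortened rows (only the first four columns) to keep the demo readable.
a = "# sent_id 1\n3\twrong\twrong\tADJ\n5\tmy\t-PRON-\tADJ"
b = "# sent_id 1\n3\twring\twr\tVERB\n5\tmy\t-PRON-\tADJ"
merged_a, merged_b = merge_forms(a, b)
print(merged_a)
print(merged_b)
```

Rows whose forms already agree are left alone, so only the varying tokens show up with the combined form.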
Then you need to convert both files to an older version of CoNLL supported by whatswrong. (The main issue is just removing the commented lines starting with #.) One existing option is the UD tools CoNLL-U to CoNLL-X converter: https://github.com/UniversalDependencies/tools/blob/master/conllu_to_conllx.pl, and then you have:
1 What what NOUN NOUN_WP _ 2 nsubj _ _
2 's be VERB VERB_VBZ _ 0 root _ _
3 wrong_wring wrong ADJ ADJ_JJ _ 2 acomp _ _
4 with_wit with ADP ADP_IN _ 3 prep _ _
5 my -PRON- ADJ ADJ_PRP$ _ 6 poss _ _
6 NLP_nlp nlp NOUN NOUN_NN _ 4 pobj _ _
7 ? ? PUNCT PUNCT_. _ 2 punct _ _
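If running the Perl converter is not an option, the comment-stripping step mentioned above could be approximated in Python. This sketch only drops lines starting with #; it is not a replacement for conllu_to_conllx.pl, which, as the output above shows, also rewrites other columns. strip_conll_comments is a hypothetical name, and the sample rows are shortened.

```python
def strip_conll_comments(conll_text):
    """Drop comment lines (starting with '#'), keeping token rows and the
    blank lines that separate sentences."""
    kept = [line for line in conll_text.splitlines() if not line.startswith("#")]
    return "\n".join(kept)


sample = (
    "# sent_id 1\n"
    "1\tWhat\twhat\tNOUN\n"
    "2\t's\tbe\tVERB\n"
    "\n"
    "# sent_id 2\n"
    "1\tOk\tok\tINTJ"
)
print(strip_conll_comments(sample))
```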
You can load these files (one as gold and one as guess) and compare them using whatswrong. Choose the format CoNLL 2006 (CoNLL 2006 is the same as CoNLL-X).
This python port of whatswrong is a little unstable, but also basically seems to work: https://github.com/ppke-nlpg/whats-wrong-python
Both of them seem to assume that we have gold POS tags, though, so that comparison isn't shown automatically. You could also concatenate the POS columns to be able to see both (just like with the tokens) since you really need the POS tags to understand why the parses are different.
For both the token pairs and the POS pairs, I think it would be easy to modify either the original implementation or the python port to show both alternatives separately in additional rows so you don't have to do the hacky concatenation.

Try using spaCy's similarity() method, which is available on Doc and Token objects.
For example:
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
This prints a similarity score for every pair of tokens.
Reference: https://spacy.io

Related

regex | extract numbers preceded by defined strings

I have strings like:
Bla bla 0.75 oz. Bottle
Mugs, 8oz. White
Bowls, 4.4" dia x 2.5", 12ml. Natural
Ala bala 3.3" 30ml Bottle
I want to extract the numeric value which occurs before my pre-defined lookaheads, in this case [oz, ml]
0.75 oz
8 oz
12 ml
30 ml
I have the below code:
import re
import pandas as pd
look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
'Mugs, 8oz. White',
'Bowls, 4.4" dia x 2.5", 12ml. Natural',
'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
rf"((?!,)[0-9]+.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
print(size_and_units)
Which outputs this:
0 [0.75 oz]
1 [8oz]
2 [4.4" dia x 2.5", 12ml]
3 [3.3" 30ml]
You can see there is a mismatch between what I want as output and what I am getting from my script. I think my regex is picking up everything between the first numeric value and my defined lookahead; however, I only want the last numeric value before the lookahead.
I am out of my depth with regex. Can someone help fix this?
Thank you!

Making as few changes to your regex as possible, so you can see what went wrong:
In [0-9]+.*[0-9]*, replace . with \. because . matches any character, while \. matches a literal period.
s = pd.Series(['Bla bla 0.75 oz. Bottle',
'Mugs, 8oz. White',
'Bowls, 4.4" dia x 2.5", 12ml. Natural',
'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
rf"((?!,)[0-9]+\.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
gives:
0 [0.75 oz]
1 [8oz]
2 [12ml]
3 [30ml]
You don't need to use a lookahead at all though, since you also want to match the units. Just do
\d+\.*\d*\s*(?:oz|ml)
This gives the same result:
size_and_units = s.str.findall(
rf"\d+\.*\d*\s*(?:{look_ahead})")

Some notes about the pattern that you tried:
You can omit the lookahead (?!,): it is always true, because the next thing the pattern matches is a digit.
In the part .*[0-9]* *(?=oz|ml)[a-zA-Z]+) everything in .*[0-9]* * is optional, and .* first matches until the end of the string. The engine then backtracks until it can match either oz or ml, and [a-zA-Z]+ matches 1 or more letters, so it could also match 0.75 ozaaaaaaa.
If you want the matches, you don't need a capture group or lookarounds. You can match:
\b\d+(?:\.\d+)*\s*(?:oz|ml)\b
\b A word boundary to prevent a partial word match
\d+(?:\.\d+)* Match 1+ digits followed by zero or more decimal parts (use ? instead of * to allow at most one)
\s*(?:oz|ml) Match optional whitespace chars and either oz or ml
\b A word boundary
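The over-matching described in these notes can be reproduced with the standard re module. This is only a small demonstration using the question's original pattern and the word-boundary pattern; the variable names are illustrative:

```python
import re

original = r"((?!,)[0-9]+.*[0-9]* *(?=oz|ml)[a-zA-Z]+)"
bounded = r"\b\d+(?:\.\d+)*\s*(?:oz|ml)\b"

text = 'Bowls, 4.4" dia x 2.5", 12ml. Natural'
print(re.findall(original, text))  # ['4.4" dia x 2.5", 12ml']
print(re.findall(bounded, text))   # ['12ml']

# The trailing [a-zA-Z]+ in the original pattern also swallows extra letters.
odd = "0.75 ozaaaaaaa"
print(re.findall(original, odd))   # ['0.75 ozaaaaaaa']
print(re.findall(bounded, odd))    # []
```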
import pandas as pd
look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
'Mugs, 8oz. White',
'Bowls, 4.4" dia x 2.5", 12ml. Natural',
'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
rf"\b\d+(?:\.\d+)*\s*(?:{look_ahead})\b")
print(size_and_units)
Output
0 [0.75 oz]
1 [8oz]
2 [12ml]
3 [30ml]
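The question's desired output writes the number and the unit with a single space between them (8 oz rather than 8oz). One way to get that is sketched here with plain re and two capture groups; extract_sizes and its units parameter are illustrative names, and this variant allows at most one decimal part:

```python
import re


def extract_sizes(text, units=("oz", "ml")):
    """Return the sizes found in text, normalized to '<number> <unit>'."""
    unit_alternation = "|".join(re.escape(u) for u in units)
    pattern = rf"\b(\d+(?:\.\d+)?)\s*({unit_alternation})\b"
    return [f"{number} {unit}" for number, unit in re.findall(pattern, text)]


samples = [
    'Bla bla 0.75 oz. Bottle',
    'Mugs, 8oz. White',
    'Bowls, 4.4" dia x 2.5", 12ml. Natural',
    'Ala bala 3.3" 30ml Bottle',
]
for sample in samples:
    print(extract_sizes(sample))
# ['0.75 oz']
# ['8 oz']
# ['12 ml']
# ['30 ml']
```

With the pandas Series from the question, the same function could be applied with s.apply(extract_sizes).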

I think this regular expression will work for you.
[0-9]+\.*[0-9]* *(oz|ml)

Substituting all letters from a string with a single character

I am writing a hangman game in Python and I'm stuck at the part where I have a randomly generated word and I'm trying to hide it by replacing all characters with underscores, like this:
generated word -> 'abcd'
hide word -> _ _ _ _
I have done the following:
string = 'luis'
print (string.replace ((string[i]) for i in range (0, len (string)), '_'))
And it gives me the following error:
^
SyntaxError: Generator expression must be parenthesized
Please give me some tips.

You could try a very simple approach, like this:
word = "luis"
print("_" * len(word))
Output would be:
>>> word = "luis"
>>> print("_" * len(word))
____
>>> word = "hi"
>>> print("_" * len(word))
__
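If the secret can be a phrase, multiplying by len() would also hide spaces and punctuation. A small sketch that replaces only letters, using str.isalpha(); hide_letters and its placeholder parameter are illustrative names:

```python
def hide_letters(text, placeholder="_"):
    """Replace every letter with the placeholder, keeping other characters."""
    return "".join(placeholder if ch.isalpha() else ch for ch in text)


print(hide_letters("luis"))       # ____
print(hide_letters("ice cream"))  # ___ _____
```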

The simplest is:
string = "luis"
"_" * len(string)
# '____'
If you want spaces in between:
" ".join("_" * len(string))
# '_ _ _ _'
However, since you will need to show guessed chars later on, you are better off starting with a generator expression in the first place:
" ".join("_" for char in string)
# '_ _ _ _'
So that you can easily insert guessed characters:
guessed = set("eis")
" ".join(char if char in guessed else "_" for char in string)
# '_ _ i s'
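Wrapped in functions, the last expression can be reused after every guess, and a similar check tells you when the word is complete. A sketch with made-up helper names, mask_word and is_solved:

```python
def mask_word(word, guessed):
    """Show guessed characters and hide the rest behind underscores."""
    return " ".join(ch if ch in guessed else "_" for ch in word)


def is_solved(word, guessed):
    """True once every character of the word has been guessed."""
    return all(ch in guessed for ch in word)


guessed = set("eis")
print(mask_word("luis", guessed), is_solved("luis", guessed))  # _ _ i s False

guessed |= {"l", "u"}
print(mask_word("luis", guessed), is_solved("luis", guessed))  # l u i s True
```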

Amend with multiple indices per substitution in J

In J, how do you idiomatically amend an array when you have:
substitution0 multipleIndices0
...
substitutionN multipleIndicesN
(not to be confused with:
substitution0 multipartIndex0
...
substitutionN multipartIndexN
)
For example, my attempt at the classic fizzbuzz problem looks like this:
i=.3 :'<#I.(,*./)(=<.)3 5%~"(0 1)y'
d=.1+i.20
'fizz';'buzz';'fizzbuzz' (i d)};/d
|length error
| 'fizz';'buzz';'fizzbuzz' (i d)};/d
I have created the verb m} where m is i d which is 3 boxes containing different-sized lists of 1-dimensional indices, whereas I think } expects m to be boxes containing single lists that each represent a single index with dimensions at least as few as the rank of the right argument.
How is this generally solved?

'fizz';'buzz';'fizzbuzz' (i d)};/d
This has a few problems:
the x param of } is 'fizzbuzz', not a list of boxes, as the } happens before the ;s on the left. You mean
('fizz';'buzz';'fizzbuzz') (i d)};/d
the boxed numbers in the m param of } are not interpreted as you expect:
_ _ (1 1;2 2) } i.3 3
0 1 2
3 _ 5
6 7 _
_ _ (1 1;2 2) } ,i.3 3 NB. your length error
|length error
| _ _ (1 1;2 2)},i.3 3
If you raze the m param you get the right kind of indices, but still don't have enough members in the x list to go around:
_ _ (;1 3;5 7) } i.9
|length error
| _ _ (;1 3;5 7)}i.9
_ _ _ _ (;1 3;5 7) } i.9
0 _ 2 _ 4 _ 6 _ 8
These work:
NB. raze m and extend x (fixed)
i=.3 :'<#I.(,*./)(=<.)3 5%~"(0 1)y'
d=.1+i.20
((;# each i d)#'fizz';'buzz';'fizzbuzz') (;i d)};/d
+-+-+----+-+----+----+-+-+--------+----+--+----+--+--+----+--+--+--------+--+----+
|1|2|fizz|4|fizz|buzz|7|8|fizzbuzz|buzz|11|fizz|13|14|buzz|16|17|fizzbuzz|19|fizz|
+-+-+----+-+----+----+-+-+--------+----+--+----+--+--+----+--+--+--------+--+----+
NB. repeatedly mutate d
i=.3 :'<#I.(,*./)(=<.)3 5%~"(0 1)y'
d=.;/1+i.20
('fizz';'buzz';'fizzbuzz'),.(i ;d)
+--------+--------------+
|fizz |2 5 8 11 14 17|
+--------+--------------+
|buzz |4 9 14 19 |
+--------+--------------+
|fizzbuzz|14 |
+--------+--------------+
0$(3 : 'd =: ({.y) (1{::y) }d')"1 ('fizz';'buzz';'fizzbuzz'),.(i ;d)
d
+-+-+----+-+----+----+-+-+----+----+--+----+--+--+--------+--+--+----+--+----+
|1|2|fizz|4|buzz|fizz|7|8|fizz|buzz|11|fizz|13|14|fizzbuzz|16|17|fizz|19|buzz|
+-+-+----+-+----+----+-+-+----+----+--+----+--+--+--------+--+--+----+--+----+

Hangman program: incorrect use of global variable in loop

I'm writing a program to play the game hangman, and I don't think I'm using my global variable correctly.
Once the first iteration of the program concludes after a correct guess, any successive iteration with a correct guess prints the word and all of its past values.
How can I only print the most current value of word? This chunk of code is within a while loop where each iteration gets user input. Thanks!
Code:
word=''
#lettersGuessed is a list of string values of letters guessed
def getGuessedWord(secretWord, lettersGuessed):
    global word
    for letter in secretWord:
        if letter not in lettersGuessed:
            word=word+' _'
        elif letter in lettersGuessed:
            word=word+' '+letter
    return print(word)
The Output:
#first iteration if 'a' was guessed:
a _ _ _ _
#second iteration if 'l' was guessed:
a _ _ _ _ a _ _ l _
#third iteration if 'e' was guessed:
a _ _ _ _ a _ _ l _ a _ _ l e
#Assuming the above, for the third iteration I want:
a _ _ l e
Note: This is only a short section of my code, but I don't feel like the other chunks are relevant.

The main problem is that you append to your global variable every time you call the function. You don't need a global variable here, and relying on one is generally bad practice. Based on what you describe in your question, you can simply use:
def getGuessedWord(secretWord, lettersGuessed):
    return ' '.join(letter if letter in lettersGuessed else '_'
                    for letter in secretWord)
I also think it is better to use a Python comprehension (here, a generator expression), which keeps the code concise and avoids repeated string concatenation.
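Because this version builds a fresh string on every call, nothing accumulates between iterations. A short check, assuming the secret word is 'apple' (a guess that fits the output in the question) and adding one guessed letter per iteration:

```python
def getGuessedWord(secretWord, lettersGuessed):
    return ' '.join(letter if letter in lettersGuessed else '_'
                    for letter in secretWord)


lettersGuessed = []
for guess in ['a', 'l', 'e']:
    lettersGuessed.append(guess)
    print(getGuessedWord('apple', lettersGuessed))
# a _ _ _ _
# a _ _ l _
# a _ _ l e
```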

Every time you call getGuessedWord you append to word. You can avoid the global entirely:
secretWord = "myword"
def getGuessedWord(secretWord, lettersGuessed):
    word = ""
    for letter in secretWord:
        if letter not in lettersGuessed:
            word=word+' _'
        elif letter in lettersGuessed:
            word=word+' '+letter
    return print(word)
getGuessedWord(secretWord,"")
getGuessedWord(secretWord,"m")
getGuessedWord(secretWord,"mwd")
Or you can solve this by keeping word at a constant length (not as nice and harder to follow), e.g. word='_ '*len(secretWord); then, instead of appending to it, replace the letter in place: word=word[:2*i]+letter +word[2*i+1:]
Example here:
secretWord = "myword"
word='_ '*len(secretWord)
def getGuessedWord(secretWord, lettersGuessed):
    global word
    for i, letter in enumerate(secretWord):
        if letter in lettersGuessed:
            word=word[:2*i]+letter +word[2*i+1:]
    return print(word)
getGuessedWord(secretWord,"")
getGuessedWord(secretWord,"m")
getGuessedWord(secretWord,"w")
getGuessedWord(secretWord,"d")

Print on one line in python Window

I'm not sure how to get multiple outputs from a for loop to print on the same line in a window. I'm using the built in Window function from uagame with python3.x. Here's what the code looks like:
for char in a_word:
    if char in user_guess:
        window.draw_string(char+" ",x, y)
    else:
        window.draw_string('_ ',x, y)
    y = y + font_height
This keeps displaying as:
_
_
_
_
And I want it to print as
_ _ _ _
Any idea how to get each character or _ to display on one line? This is for a WordPuzzle/Hangman type game.

Use this as an example; hopefully you can apply the same idea to your code.
for i in range(1,10):
    print(i,end=",")
print()
The output looks like
1,2,3,4,5,6,7,8,9,

Define an empty list, append your characters, then print them all at once:
a=[]
for char in a_word:
    if char in user_guess:
        a.append(char)
    else:
        a.append('_')  # placeholder for letters not guessed yet
print(a,end=",")
y = y + font_height
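Applied to the window code from the question, the same idea is to build the whole line first and draw it once, so that y never changes inside the loop. In this sketch build_display is a hypothetical helper; the final draw call is left as a comment because it needs the question's window object, and it assumes draw_string(text, x, y) behaves as it is used in the question.

```python
def build_display(a_word, user_guess):
    """Join guessed letters and '_' placeholders into one line of text."""
    return " ".join(char if char in user_guess else "_" for char in a_word)


line = build_display("word", "o")
print(line)  # _ o _ _

# In the game, draw the finished line with a single call:
# window.draw_string(line, x, y)
```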
