What is nlp in spacy? - python-3.x

Usually we start from:
nlp = spacy.load('en_core_web_sm')  # or medium, or large
or
nlp = English()
then:
doc = nlp('my text')
Then we can do a lot of fun things with that, even without knowing the nature of the first line.
But what exactly is 'nlp'? What is going on under the hood? Is "nlp" a pretrained model, as understood in machine learning, and therefore some big file located somewhere on disk?
I came across an explanation that 'nlp' is an 'object containing the processing pipeline', but that only explains a little.

You can always check the type of any Python object:
nlp = spacy.load('en_core_web_sm')  # or medium, or large
print(type(nlp))
print(dir(nlp))  # view a list of attributes
You will get something like this (depending on the passed arguments):
<class 'spacy.lang.en.English'>
You are right, it is something like a 'pretrained' model, as it contains the vocabulary, binary weights, etc.
Please check the official documentation:
https://spacy.io/api/language
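If you want to dig a little deeper, the loaded Language object also exposes its pipeline and metadata. A minimal sketch of how to inspect it (assuming en_core_web_sm is installed; the exact component names depend on the model and spaCy version):
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)                         # e.g. ['tok2vec', 'tagger', 'parser', 'ner', ...]
print(nlp.meta['name'], nlp.meta['version'])  # the packaged model's name and version
print(len(nlp.vocab))                         # number of lexemes currently in the vocabulary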

You could infer what nlp() is by exploring it. For example:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
text = "Elon Musk 889-888-8888 elonpie#tessa.net Jeff Bezos (345)123-1234 bezzi#zonbi.com Reshma Saujani example.email#email.com 888-888-8888 Barkevious Mingo"
text = nlp(text)
print(text)
This will print the exact same text. On the other hand, if you do:
for word in text.ents:
    print(word.text, word.label_)
you will get the entities of the string:
Elon Musk PERSON
889-888 CARDINAL
Jeff Bezos PERSON
345)123 CARDINAL
Reshma Saujani PERSON
It is indeed a large pre-trained model for the English language, and it has many components (parser, lemmatizer, tagger, ...) like the one demonstrated above. Hope this helps to clarify your question.

Related

Translating text from English to Italian using Hugging Face Helsinki models not fully translating

I'm a newbie going through the Hugging Face library, trying out the translation models for a data-entry task: translating text from English to Italian.
The code I tried based on the documentation:
from transformers import MarianTokenizer, MarianMTModel
from typing import List
#src = 'en' # source language
#trg = 'it' # target language
#saved the model locally.
#model_name = f'Helsinki-NLP/opus-mt-{src}-{trg}'
#model.save_pretrained("./model_en_to_it")
#tokenizer.save_pretrained("./tokenizer_en_to_it")
model = MarianMTModel.from_pretrained('./model_en_to_it')
tokenizer = MarianTokenizer.from_pretrained('./tokenizer_en_to_it')
# Next, iterate over the 'english_text' column of the dataset, translate each
# entry from English to Italian, and append the translated text to the list 'italian'.
italian = []
for i in range(len(dataset)):
    batch = tokenizer(dataset['english_text'][i],
                      return_tensors="pt", truncation=True,
                      padding=True)
    gen = model.generate(**batch)
    italian.append(tokenizer.batch_decode(gen, skip_special_tokens=True))
Two concerns here:
It translates and appends only partial text, i.e. it truncates the paragraph if it exceeds a certain length. How can I translate text of any length?
I have about 10k rows of data and it is taking a very long time.
Even if only one of these problems can be solved, that would be helpful. I would love to learn.
Virtually all current MT systems are trained using single sentences, not paragraphs. If your input text is in paragraphs, you need to do sentence splitting first. Any NLP library will do (e.g., NLTK, Spacy, Stanza). Having multiple sentences in a single input will lead to worse translation quality (because this is not what the model was trained for). Moreover, the complexity of the Transformer model is quadratic with respect to the input length (it does not fully hold when everything is parallelized on a GPU), so it gets very slow with very long inputs.
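A minimal sketch of that approach, splitting each paragraph into sentences with NLTK and translating them as one batch per paragraph (the dataset variable and the english_text column are assumptions carried over from the question):
import nltk
from transformers import MarianMTModel, MarianTokenizer

nltk.download('punkt', quiet=True)

model = MarianMTModel.from_pretrained('./model_en_to_it')
tokenizer = MarianTokenizer.from_pretrained('./tokenizer_en_to_it')

italian = []
for paragraph in dataset['english_text']:      # assumed column name from the question
    sentences = nltk.sent_tokenize(paragraph)  # one sentence per model input
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    gen = model.generate(**batch)
    italian.append(" ".join(tokenizer.batch_decode(gen, skip_special_tokens=True)))
Batching the sentences of each paragraph together also helps with speed; running the model on a GPU (moving both model and batches there) speeds it up further.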

Finding semantic similarity between 2 statements

I am currently working on a small application in Python. My application has search functionality (currently using difflib), but I want to create a semantic search that can return the top 5 or 10 results from my database based on user-entered text, much like a search engine such as Google. I found some solutions here.
But the problem is that the two statements below, from one of those solutions, are treated as semantically different, and I don't care about that distinction; those solutions make things harder than I need. I would also prefer the solution to be a pretrained neural network model or a library that I can use easily.
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station
I also found some solutions that use gensim and GloVe embeddings, but they find similarity between words, not sentences.
What I want
Suppose my DB has the statement display classes; user inputs such as show, showed, displayed, displayed class, show types, etc. should all be treated as the same. And if the above two statements are treated as the same, I don't mind either. displayed and displayed class are already matched by difflib.
Points to be noted
It should find matches from a fixed set of statements, but the user-entered statements can differ.
It must work for whole statements.
I think it is not a gensim embedding but a word2vec embedding. Whatever it is.
You need tensorflow_hub
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
I believe what you need here is Text Classification or Semantic Similarity, because you want to find the top 5 or 10 statements nearest to a statement given by the user.
It is easy to use, but the size of the model is ≈ 1 GB. It works with words, sentences, phrases or short paragraphs. The input is variable-length English text and the output is a 512-dimensional vector. You can find more information about it here.
Code
import tensorflow_hub as hub
import numpy as np
# Load model. It will download first time.
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)
# first data[0] is your actual value
data = ["display classes", "show", "showed" ,"displayed class", "show types"]
# find high-dimensional vectors.
vecs = model(data)
# compute similarity between statements using the inner product
dists = np.inner(vecs[0], vecs)
# print similarity scores
print(dists)
Output
array([0.9999999 , 0.5633253 , 0.46475542, 0.85303843, 0.61701006],dtype=float32)
Conclusion
The first value, 0.9999999, is the similarity between display classes and itself. The second, 0.5633253, is the similarity between display classes and show, and the last, 0.61701006, is the similarity between display classes and show types.
Using this, you can compute the similarity between the given input and the statements in your DB, then rank them by similarity.
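A rough sketch of that ranking step (db_statements and user_query are hypothetical placeholders for your data; the model URL is the same one used above):
import numpy as np
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

db_statements = ["display classes", "list students", "delete record"]  # hypothetical DB contents
user_query = "show types"

# encode the DB statements and the query together, then compare the query to each statement
vecs = model(db_statements + [user_query]).numpy()
scores = np.inner(vecs[-1], vecs[:-1])
top = np.argsort(scores)[::-1][:5]   # indices of the top 5 matches
for i in top:
    print(db_statements[i], scores[i])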
You can use WordNet to find synonyms and then use these synonyms to find similar statements.
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
def get_syn_list(gword):
    syn_list = []
    try:
        syn_list.extend(wn.synsets(gword, pos=wn.NOUN))
        syn_list.extend(wn.synsets(gword, pos=wn.VERB))
        syn_list.extend(wn.synsets(gword, pos=wn.ADJ))
        syn_list.extend(wn.synsets(gword, pos=wn.ADV))
    except:
        print("Something Wrong Happened")
    syn_words = []
    for i in syn_list:
        syn_words.append(i.lemmas()[0].name())
    return syn_words
Now split each statement in your DB into words, like this:
stat = ["display classes"]
syn_dict = {}
for i in stat:
    tmp = []
    for x in i.split(" "):
        tmp.extend(get_syn_list(x))
    syn_dict[i] = set(tmp)
Now that you have synonyms, just compare them with the input text. Use a lemmatizer before comparing words so that displayed becomes display.
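A minimal sketch of that comparison step, reusing syn_dict from the snippet above; the overlap count is just one simple way to rank matches:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)
wnl = WordNetLemmatizer()

user_input = "displayed class"
# lemmatize with the verb POS so that "displayed" becomes "display"
input_words = {wnl.lemmatize(w, pos="v") for w in user_input.lower().split()}

# rank DB statements by how many input lemmas appear in the statement or its synonym set
for statement, synonyms in syn_dict.items():
    candidates = {s.lower() for s in synonyms} | set(statement.lower().split())
    print(statement, len(input_words & candidates))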
Hey, you can use spaCy.
This answer is from https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c
import spacy
nlp = spacy.load("en_core_web_lg")
doc1 = nlp("display classes")
doc2 = nlp("show types")
print(doc1.similarity(doc2))
Output
0.6277548513279427
Edit
Run the following command first; it will download the model.
!python -m spacy download en_core_web_lg

Handling compound words (2-grams) using NLTK

I'm trying to identify user similarities by comparing the keywords used in their profiles (from a website). For example, Alice = pizza, music, movies; Bob = cooking, guitar, movie; and Eve = knitting, running, gym. Ideally, Alice and Bob are the most similar. I put down some simple code to calculate the similarity. To account for possible plural/singular versions of the keywords, I use something like:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
wnl = WordNetLemmatizer()
w1 = ["movies", "movie"]
tokens = [token.lower() for token in word_tokenize(" ".join(w1))]
lemmatized_words = [wnl.lemmatize(token) for token in tokens]
So that lemmatized_words = ["movie", "movie"].
Afterwards, I do some pairwise keyword comparisons using spaCy, such as:
import spacy
nlp = spacy.load('en')
t1 = nlp(u"pizza")
t2 = nlp(u"food")
sim = t1.similarity(t2)
Now, the problem starts when I have to deal with compound words such as: artificial intelligence, data science, whole food, etc. By tokenizing, I would simply split those into two words (e.g. artificial and intelligence), but this would affect my similarity measure. What would be the best approach to take these types of words into account?
There are many ways to achieve this. One way would be to create the embeddings (vectors) yourself. This would have two advantages: first, you would be able to use bi-, tri-, and beyond (n-) grams as your tokens, and second, you are able to define the space that is best suited for your needs: Wikipedia data is general, but, say, children's stories would be a more niche dataset (and more appropriate/"accurate" if you were solving problems to do with children and/or stories). There are several methods, word2vec of course being the most popular, and several packages to help you (e.g. gensim).
However, my guess is you would like something that's already out there. The best word embeddings right now are:
Numberbatch ('classic' best-in-class ensemble);
fastText, by Facebook Research (created at the character level, so some words that are out of vocabulary can be "understood" as a result);
sense2vec, by the same guys behind Spacy (created using parts-of-speech (POS) as additional information, with the objective to disambiguate).
The one we are interested in for a quick resolve of your problem is sense2vec. You should read the paper, but essentially these word embeddings were created using Reddit with additional POS information, and (thus) able to discriminate entities (e.g. nouns) that span multiple words. This blog post describes sense2vec very well. Here's some code to help you get started (taken from the prior links):
Install:
git clone https://github.com/explosion/sense2vec
cd sense2vec
pip install -r requirements.txt
pip install -e .
sputnik --name sense2vec --repository-url http://index.spacy.io install reddit_vectors
Example usage:
import sense2vec
model = sense2vec.load()
freq, query_vector = model["onion_rings|NOUN"]
freq2, query_vector2 = model["chicken_nuggets|NOUN"]
print(model.most_similar(query_vector, n=5)[0])
print(model.data.similarity(query_vector, query_vector2))
Important note, sense2vec requires spacy>=0.100,<0.101, meaning it will downgrade your current spacy install, not too much of a problem if you are only loading the en model. Also, here are the POS tags used:
ADJ ADP ADV AUX CONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X
You could use spacy for POS and dependency tagging, and then sense2vec to determine the similarity of resulting entities. Or, depending on the frequency of your dataset (not too large), you could grab n-grams in descending (n) order, and sequentially check to see if each one is an entity in the sense2vec model.
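A rough sketch of that descending n-gram idea, using NLTK to generate the candidates; is_known_phrase is a hypothetical placeholder for whatever vocabulary lookup your embedding model provides (e.g. a key check against the sense2vec vectors):
from nltk import ngrams

def is_known_phrase(phrase):
    # hypothetical placeholder: replace with a lookup against your embedding model's vocabulary
    return phrase in {"artificial_intelligence", "data_science"}

def extract_phrases(tokens, max_n=3):
    # keep the n-grams the model knows about, checking longer ones first
    found = []
    for n in range(max_n, 1, -1):
        for gram in ngrams(tokens, n):
            candidate = "_".join(gram)
            if is_known_phrase(candidate):
                found.append(candidate)
    return found

print(extract_phrases("I love artificial intelligence and data science".split()))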
Hope this helps!
There is an approach using nltk:
from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer([("artificial","intelligence"), ("data","science")], separator=' ')
tokens = tokenizer.tokenize("I am really interested in data science and artificial intelligence".split())
print(tokens)
The output is given as:
['I', 'am', 'really', 'interested', 'in', 'data science', 'and', 'artificial intelligence']
For more reference you can read here.

NLP Extracting Related Phrases

Using NLP, from a given sentence I am able to extract all adjectives and nouns easily using CoreNLP.
But what I'm struggling to do is actually extract phrases out of the sentence.
For example, I have the following sentences:
This person is trust worthy.
This person is non judgemental.
This person is well spoken.
For all these sentences, using NLP I want to extract the phrases trust worthy, non judgemental, well spoken, and so forth. I want to extract all these related words.
How do I do this?
Thanks,
I think that you first need to think past these specific examples, and think about the structure of exactly what you want to extract. For example, in your cases you can use some simple heuristics to find any instance of a copular child, and all of its modifiers.
If the scope of what you need to extract is larger than that, you can go back to the drawing board and rethink some rules based on basic linguistic features that are available in e.g. Stanford CoreNLP, or, as another poster has linked, spaCy.
Finally, if you need the ability to generalize to other unknown examples, you may want to train a classifier (maybe start with a simple logistic regression classifier), by feeding it relevant linguistic features and tagging each token in a sentence as relevant or not relevant.
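As a rough illustration of the copular-child heuristic, here is a sketch using spaCy's dependency parse (dependency labels and parses vary between models and versions, so treat it as a starting point rather than a definitive rule set):
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

doc = nlp("This person is trust worthy. This person is non judgemental. This person is well spoken.")
for token in doc:
    # predicative complements of the copula "is", together with all of their modifiers
    if token.dep_ in ("acomp", "attr", "oprd"):
        print(" ".join(t.text for t in token.subtree))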
For your specific use case, Open Information Extraction seems to be a suitable solution. It extracts triples containing a subject, a relation and an object. Your relation always seems to be be (the infinitive of is) and your subject always seems to be person, so we are only interested in the object.
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.Collection;
import java.util.Properties;

public class OpenIE {
    public static void main(String[] args) {
        // Create the Stanford CoreNLP pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate your sentences
        Annotation doc = new Annotation("This person is trust worthy. This person is non judgemental. This person is well spoken.");
        pipeline.annotate(doc);

        // Loop over sentences in the document
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Get the OpenIE triples for the sentence
            Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
            // Print the triples
            for (RelationTriple triple : triples) {
                triple.object.forEach(object -> System.out.print(object.get(TextAnnotation.class) + " "));
                System.out.println();
            }
        }
    }
}
The output would be the following:
trust
worthy
non judgemental
judgemental
well spoken
spoken
The OpenIE algorithm possibly extracts multiple triples per sentence. For your use case the solution might be to just take the triple with the largest number of words in the object.
Another thing to mention is that the object of your first sentence is not extracted "correctly", at least not in the way you want it. This is happening because trust is a noun and worthy is an adjective.
The easiest solution would be to write it with a hyphen (trust-worthy).
Another possible solution is to check the Part of Speech Tags and perform some additional steps when you encounter a noun followed by an adjective.
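A small sketch of that last suggestion using NLTK's POS tagger as a Python stand-in for the CoreNLP tags used above (tags can vary, so treat it as a heuristic):
import nltk

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

tagged = nltk.pos_tag(nltk.word_tokenize("This person is trust worthy."))

# hyphenate a noun that is immediately followed by an adjective, e.g. "trust worthy" -> "trust-worthy"
merged, i = [], 0
while i < len(tagged):
    if i + 1 < len(tagged) and tagged[i][1].startswith('NN') and tagged[i + 1][1].startswith('JJ'):
        merged.append(tagged[i][0] + '-' + tagged[i + 1][0])
        i += 2
    else:
        merged.append(tagged[i][0])
        i += 1
print(merged)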
To check similarity between similar phrases, you could use word embeddings such as GloVe. Some NLP libraries come with the embeddings, such as spaCy. https://spacy.io/usage/vectors-similarity
Note: spaCy uses cosine similarity on both the token level and the phrase level, and spaCy also offers a convenience similarity function for larger phrases/sentences.
For example:
(using spaCy and Python)
import spacy

nlp = spacy.load("en_core_web_lg")  # a model that ships with word vectors
doc1 = nlp(u"The person is trustworthy.")
doc2 = nlp(u"The person is non judgemental.")
cosine_similarity = doc1.similarity(doc2)
And cosine_similarity shows how similar the two phrases/words/sentences are. Cosine similarity ranges from -1 to 1 (in practice roughly 0 to 1 for these vectors), where values close to 1 mean very similar.

Gensim train word2vec on wikipedia - preprocessing and parameters

I am trying to train the word2vec model from gensim using the Italian wikipedia
"http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2"
However, I am not sure what is the best preprocessing for this corpus.
The gensim model accepts a list of tokenized sentences.
My first try is to just use the standard WikiCorpus preprocessor from gensim. This extracts each article, removes punctuation and splits words on spaces. With this tool, each 'sentence' fed to the model corresponds to an entire article, and I am not sure of the impact of this on the model.
After this I train the model with default parameters. Unfortunately, after training it seems that I do not manage to obtain very meaningful similarities.
What is the most appropriate preprocessing of the Wikipedia corpus for this task? (If this question is too broad, please help me by pointing to a relevant tutorial/article.)
This is the code of my first trial:
from gensim.corpora import WikiCorpus
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2',dictionary=False)
max_sentence = -1
def generate_lines():
    for index, text in enumerate(corpus.get_texts()):
        if index < max_sentence or max_sentence == -1:
            yield text
        else:
            break
from gensim.models.word2vec import BrownCorpus, Word2Vec
model = Word2Vec()
model.build_vocab(generate_lines()) #This strangely builds a vocab of "only" 747904 words which is << than those reported in the literature 10M words
model.train(generate_lines(),chunksize=500)
Your approach is fine.
model.build_vocab(generate_lines()) #This strangely builds a vocab of "only" 747904 words which is << than those reported in the literature 10M words
This could be because of pruning infrequent words (the default is min_count=5).
To speed up computation, you can consider "caching" the preprocessed articles as a plain .txt.gz file, one sentence (document) per line, and then simply using word2vec.LineSentence corpus. This saves parsing the bzipped wiki XML on every iteration.
Why word2vec doesn't produce "meaningful similarities" for Italian wiki, I don't know. English wiki seems to work fine. See also here.
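A minimal sketch of the caching idea described above (file names are assumptions; depending on your gensim version, get_texts() may yield bytes that need decoding):
import gzip
import logging

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import LineSentence, Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# one-off pass: dump each preprocessed article as one space-separated line
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
with gzip.open('itwiki-articles.txt.gz', 'wt', encoding='utf-8') as out:
    for tokens in corpus.get_texts():
        out.write(' '.join(tokens) + '\n')

# later runs: train directly from the cached plain-text corpus, no XML parsing needed
sentences = LineSentence('itwiki-articles.txt.gz')
model = Word2Vec(sentences, min_count=5, workers=4)
model.save('itwiki.word2vec')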
I've been working on a project to massage the wikipedia corpus and get vectors out of it.
I might generate the Italian vectors soon but in case you want to do it on your own take a look at:
https://github.com/idio/wiki2vec
