I need to make a movie recommendation from text using spacy

I need to make a movie recommendation from text using spacy - python-3.x

Thank you for taking the time to read this
I AM NOT ASKING TO DO MY HOMEWORK FOR ME... Just need guidance
I have a homework problem that I can't figure out. I need to do the following with the spacy library in python.
The Homework Question
Read in the movies.txt file. Each separate line is a description of a different movie.
Your task is to create a function to return which movies a user would watch
next if they have watched Planet Hulk with the description “Will he save
their world or destroy it? When the Hulk becomes too dangerous for the
Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator.”
The function should take in the description as a parameter and return the
title of the most similar movie.
The movie.txt file contains the following:
Movie A :When Hiccup discovers Toothless isn't the only Night Fury, he must seek "The Hidden World", a secret Dragon Utopia before a hired tyrant named Grimmel finds it first.
Movie B :After the death of Superman, several new people present themselves as possible successors.
Movie C :A darkness swirls at the center of a world-renowned dance company, one that will engulf the artistic director, an ambitious young dancer, and a grieving psychotherapist. Some will succumb to the nightmare. Others will finally wake up.
Movie D :A humorous take on Sir Arthur Conan Doyle's classic mysteries featuring Sherlock Holmes and Doctor Watson.
Movie E :A 16-year-old girl and her extended family are left reeling after her calculating grandmother unveils an array of secrets on her deathbed.
Movie F :In the last moments of World War II, a young German soldier fighting for survival finds a Nazi captain's uniform. Impersonating an officer, the man quickly takes on the monstrous identity of the perpetrators he is trying to escape from.
Movie G :The world at an end, a dying mother sends her young son on a quest to find the place that grants wishes.
Movie H :A musician helps a young singer and actress find fame, even as age and alcoholism send his own career into a downward spiral.
Movie I :Corporate analyst and single mom, Jen, tackles Christmas with a business-like approach until her uncle arrives with a handsome stranger in tow.
Movie J :Adapted from the bestselling novel by Madeleine St John, Ladies in Black is an alluring and tender-hearted comedy drama about the lives of a group of department store employees in 1959 Sydney.
Things that I have tried:
I have tried looking for for a feature in spacy that does something like this but the only thing I can come across is the similarity function but that only checks if the sentence has similar values...
Yes I am new to Spacy
My code so far
from __future__ import unicode_literals
import spacy
nlp = spacy.load("en_core_web_md")
myfile = open("movies.txt").read()
NlpRead = nlp(myfile)
sentence_to_compare = "Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator"
model_sentences = nlp(sentence_to_compare)
for sentence in myfile:
similarity = nlp(sentence).similarity(model_sentences)
print(sentence + "-" + str(similarity))

Spacy has several available pre-trained models. You are using "en_core_web_md" which includes word vectors. According to the documentation these included word vectors are 'GloVe vectors trained on Common Crawl'.
As shown in the code and heatmap below, these word vectors capture semantic similarity, and can help you cluster topics.
Naturally, this is not a solution to your homework problem, but a hint about a technique which you may find useful.
import spacy
nlp = spacy.load("en_core_web_md")
tokens = nlp(u'Hulk Superman Batman dragon elf dance musical handsome romance war soldier')
for token in tokens:
print(token.text, token.has_vector, token.vector_norm, token.is_oov)
labels = [a.text for a in tokens]
print(labels)
M = np.zeros((len(tokens), len(tokens)))
for idx, token1 in enumerate(tokens):
for idy, token2 in enumerate(tokens):
M[idx, idy] = token1.similarity(token2)
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pylab as plt
ax = sns.heatmap(M, cmap = "RdBu_r", xticklabels=labels, yticklabels=labels)
plt.show()
Also, Spacy also provides Part-of-speech tagging with which you can extract proper nouns and common nouns from sentences:
doc = nlp("Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator")
properNouns = [token.text for token in doc if token.pos_ =='PROPN']
commonNouns = [token.text for token in doc if token.pos_ =='NOUN']
print(properNouns)
# ['Hulk', 'Earth', 'Illuminati', 'Hulk', 'Hulk', 'Hulk', 'Sakaar']
print(commonNouns)
# ['world', 'shuttle', 'space', 'planet', 'peace', 'land', 'planet', 'slavery', 'gladiator']

Related

Information extracting from plain text using NLP

Me and my friends working on a hobby project and trying to extract data from plain text. Not something too complicated, just trying to extract name, birth date or somethings like that.
Let's say that we have a text file like this,
"Hello my name is John and I'm 22 years old. I'm living in USA and I like playing video games"
We want to fill a table like this
Name: John
Age: 22
From: USA
Looking for NLP since like last week and I don't even know where to start. Every kind of help appreciated.

It looks like NER (Named Entity Recognition) is what you are looking for.
Here a link that explains what NER is.
For the operative part, I suggest you have a look at this, but you can find a lot of free guides on the Internet.
Basically, you will have a code that looks like this, more or less:
import spacy # spaCy is a python module to work with NLP
nlp = spacy.load('en_core_web_sm') # loads english NLP model (small)
sentence = "Apple is looking at buying U.K. startup for $1 billion" # here you will type your sentence
doc = nlp(sentence) # process the sentence with the nlp model and retrieve entities
for ent in doc.ents: # for every entity, print text, start index, end index, label (what type of entity it is)
print(ent.text, ent.start_char, ent.end_char, ent.label_)

Is there a bi gram or tri gram feature in Spacy?

The below code breaks the sentence into individual tokens and the output is as below
"cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies"
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
print(token.text)
What I would ideally want is, to read 'cloud computing' together as it is technically one word.
Basically I am looking for a bi gram. Is there any feature in Spacy that allows Bi gram or Tri grams ?

Spacy allows the detection of noun chunks. So to parse your noun phrases as single entities do this:
Detect the noun chunks
https://spacy.io/usage/linguistic-features#noun-chunks
Merge the noun chunks
Do dependency parsing again, it would parse "cloud computing" as single entity now.
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> for noun_phrase in list(doc.noun_chunks):
... noun_phrase.merge(noun_phrase.root.tag_, noun_phrase.root.lemma_, noun_phrase.root.ent_type_)
...
Cloud computing
major manufacturing companies
>>> [(token.text,token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]

If you have a spacy doc, you can pass it to textacy:
ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))

Warning: This is just an extension of the right answer made by Zuzana.
My reputation does not allow me to comment so I am making this answer just to answer the question of Adit Sanghvi above: "How do you do it when you have a list of documents?"
First you need to create a list with the text of the documents
Then you join the text lists in just one document
now you use the spacy parser to transform the text document in a Spacy document
You use the Zuzana's answer's to create de bigrams
This is the example code:
Step 1
doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love make bigrams']
listOfDocuments = [doc1,doc2,doc3]
textList = [''.join(textList) for text in listOfDocuments for textList in text]
print(textList)
This will print this text:
['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']
then step 2 and 3:
doc = ' '.join(textList)
spacy_doc = parser(doc)
print(spacy_doc)
and will print this:
all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams
Finally step 4 (Zuzana's answer)
ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)
will print this:
[make bigrams, make bigrams, make bigrams]

I had a similar problem (bigrams, trigrams, like your "cloud computing"). I made a simple list of the n-grams, word_3gram, word_2grams etc., with the gram as basic unit (cloud_computing).
Assume I have the sentence "I like cloud computing because it's cheap". The sentence_2gram is: "I_like", "like_cloud", "cloud_computing", "computing_because" ... Comparing that your bigram list only "cloud_computing" is recognized as a valid bigram; all other bigrams in the sentence are artificial. To recover all other words you just take the first part of the other words,
"I_like".split("_")[0] -> I;
"like_cloud".split("_")[0] -> like
"cloud_computing" -> in bigram list, keep it.
skip next bi-gram "computing_because" ("computing" is already used)
"because_it's".split("_")[0]" -> "because" etc.
To also capture the last word in the sentence ("cheap") I added the token "EOL". I implemented this in python, and the speed was OK (500k words in 3min), i5 processor with 8G. Anyway, you have to do it only once. I find this more intuitive than the official (spacy-style) chunk approach. It also works for non-spacy frameworks.
I do this before the official tokenization/lemmatization, as you would get "cloud compute" as possible bigram. But I'm not certain if this is the best/right approach.

Python 3: Saving API Results into CSV

I'm writing a script which requires a daily updated CSV source file which lists many movie details and have decided to use Python3 to create and update it even though I don't know too much about it.
I believe I've got the code down to pull the information via TheMovieDB.org's API that I need, but currently can only get it to echo the results and not save in a CSV. Below are a couple of questions I have, the code that I currently have, and an example of it's current output.
Questions:
1. What do I need to do add to get the resulting data into a CSV? I've tried many things but so far haven't gotten anything to work
2. What would I need to add so that rerunning the script would completely overwrite the CSV produced from the last run? (not append or error out)
3. Optional: Unless tedious or a pain, it would be nice to have a column for each of the values provided per title within the CSV.
Thanks!!
Current Code
import http.client
import requests
import csv
conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Result That's Echoed from the above Current Code
{"page":20,"total_results":360846,"total_pages":18043,"results":[{"vote_count":0,"id":521662,"video":false,"vote_average":0,"title":"森のかたみ","popularity":1.098018,"poster_path":"/qmj1gJ33lF7BhEOWAvK0mt6hRGH.jpg","original_language":"ja","original_title":"森のかたみ","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":518636,"video":false,"vote_average":0,"title":"Stadtkomödie:
Geschenkt","popularity":1.189812,"poster_path":null,"original_language":"de","original_title":"Stadtkomödie:
Geschenkt","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":520720,"video":false,"vote_average":0,"title":"Kim
Possible","popularity":1.188148,"poster_path":"/3QGHTLgNKRphu3bLvGpoTZ1Ce9U.jpg","original_language":"en","original_title":"Kim
Possible","genre_ids":[10751,28,12],"backdrop_path":null,"adult":false,"overview":"Live-action
film adaptation of the Disney Channel original series Kim
Possible.","release_date":"2019-01-01"},{"vote_count":0,"id":521660,"video":false,"vote_average":0,"title":"Speak
Low","popularity":1.098125,"poster_path":"/qYQQlizCTfD5km7GIrTWrBb4E9b.jpg","original_language":"ja","original_title":"小さな声で囁いて","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":497834,"video":false,"vote_average":0,"title":"Saturday Fiction","popularity":1.148142,"poster_path":null,"original_language":"zh","original_title":"兰心大剧院","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"An
actress working undercover for the Allies in 1941 Shanghai discovers
the Japanese plan to attack Pearl
Harbor.","release_date":"2019-01-01"},{"vote_count":0,"id":523461,"video":false,"vote_average":0,"title":"Wie
gut ist deine
Beziehung?","popularity":1.188171,"poster_path":null,"original_language":"de","original_title":"Wie
gut ist deine
Beziehung?","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":507118,"video":false,"vote_average":0,"title":"Schwartz &
Schwartz","popularity":1.345715,"poster_path":null,"original_language":"de","original_title":"Schwartz
&
Schwartz","genre_ids":[80],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":505916,"video":false,"vote_average":0,"title":"Kuru","popularity":1.107158,"poster_path":null,"original_language":"ja","original_title":"来る","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"After
the inexplicable message, at his workplace, of a mysterious death, a
man is introduced to a freelance writer and his
girlfriend.","release_date":"2019-01-01"},{"vote_count":0,"id":521028,"video":false,"vote_average":0,"title":"Tsokos:
Zersetzt","popularity":1.115739,"poster_path":null,"original_language":"de","original_title":"Tsokos:
Zersetzt","genre_ids":[53],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":516910,"video":false,"vote_average":0,"title":"Rufmord","popularity":1.658291,"poster_path":null,"original_language":"de","original_title":"Rufmord","genre_ids":[18],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":514224,"video":false,"vote_average":0,"title":"Shadows","popularity":1.289124,"poster_path":null,"original_language":"en","original_title":"Shadows","genre_ids":[16],"backdrop_path":null,"adult":false,"overview":"Plot
kept under
wraps.","release_date":"2019-01-01"},{"vote_count":0,"id":483202,"video":false,"vote_average":0,"title":"Eli","popularity":1.118757,"poster_path":null,"original_language":"en","original_title":"Eli","genre_ids":[27],"backdrop_path":null,"adult":false,"overview":"A
boy receiving treatment for his auto-immune disorder discovers that
the house he's living isn't as safe as he
thought.","release_date":"2019-01-01"},{"vote_count":0,"id":491287,"video":false,"vote_average":0,"title":"Untitled Lani Pixels
Project","popularity":1.951231,"poster_path":null,"original_language":"en","original_title":"Untitled
Lani Pixels
Project","genre_ids":[10751,16,12,35],"backdrop_path":null,"adult":false,"overview":"Evil
forces have invaded an isolated island and have targeted Patrick and
Susan's grandfather, Mr. Campbell. Guided by Jack, a charming Irish
rogue, the siblings end up on a dangerous journey filled with magic
and
mystery.","release_date":"2019-01-01"},{"vote_count":2,"id":49046,"video":false,"vote_average":0,"title":"All
Quiet on the Western
Front","popularity":6.197559,"poster_path":"/jZWVtbxyztDTSM0LXDcE6vdVTVC.jpg","original_language":"en","original_title":"All
Quiet on the Western
Front","genre_ids":[28,12,18,10752],"backdrop_path":null,"adult":false,"overview":"A
young German soldier's terrifying experiences and distress on the
western front during World War
I.","release_date":"2018-12-31"},{"vote_count":1,"id":299782,"video":false,"vote_average":0,"title":"The
Other Side of the
Wind","popularity":4.561363,"poster_path":"/vnfNbuyPqo5zJavqlgI3J50xJSi.jpg","original_language":"en","original_title":"The
Other Side of the
Wind","genre_ids":[35,18],"backdrop_path":null,"adult":false,"overview":"Orson
Welles' unfinished masterpiece, restored and assembled based on
Welles' own notes. During the last 15 years of his life, Welles, who
died in 1985, worked obsessively on the film, which chronicles a
temperamental film director—much like him—who is battling with the
Hollywood establishment to finish an iconoclastic
work.","release_date":"2018-12-31"},{"vote_count":0,"id":289600,"video":false,"vote_average":0,"title":"The
Sandman","popularity":3.329464,"poster_path":"/eju4vLNx9sSvscowmnKNLi3sFVe.jpg","original_language":"en","original_title":"The
Sandman","genre_ids":[27],"backdrop_path":"/zo67d5klQiFR3PCyvER39IMwZ73.jpg","adult":false,"overview":"THE
SANDMAN tells the story of Nathan, a young student in the city who
struggles to forget his childhood trauma at the hands of the serial
killer dubbed \"The Sandman.\" Nathan killed The Sandman years ago, on
Christmas Eve, after he witnessed the murder of his mother... until he
sees the beautiful woman who lives in the apartment across the way
dying at the hands of that same masked killer. This brutal murder
plunges Nathan into an odyssey into the night country of his past, his
dreams... and the buried secrets of The
Sandman.","release_date":"2018-12-31"},{"vote_count":0,"id":378177,"video":false,"vote_average":0,"title":"Luxembourg","popularity":1.179703,"poster_path":null,"original_language":"en","original_title":"Luxembourg","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"The
story of a group of people living in a permanent nuclear winter in the
ruins of the old civilisation destroyed by an atomic
war.","release_date":"2018-12-31"},{"vote_count":0,"id":347392,"video":false,"vote_average":0,"title":"Slice","popularity":3.248065,"poster_path":"/ySWPZihd5ynCc1aNLQUXmiw5H2V.jpg","original_language":"en","original_title":"Slice","genre_ids":[35],"backdrop_path":"/rtL9nzXtSvo1MW05kho9oeimCdb.jpg","adult":false,"overview":"When
a pizza delivery driver is murdered on the job, the city searches for
someone to blame: ghosts? drug dealers? a disgraced
werewolf?","release_date":"2018-12-31"},{"vote_count":0,"id":438674,"video":false,"vote_average":0,"title":"Dragged
Across
Concrete","popularity":3.659627,"poster_path":"/p4tpV4nGeocuOKhp0enuiQNDvhi.jpg","original_language":"en","original_title":"Dragged
Across
Concrete","genre_ids":[18,80,53,9648],"backdrop_path":null,"adult":false,"overview":"Two
policemen, one an old-timer (Gibson), the other his volatile younger
partner (Vaughn), find themselves suspended when a video of their
strong-arm tactics becomes the media's cause du jour. Low on cash and
with no other options, these two embittered soldiers descend into the
criminal underworld to gain their just due, but instead find far more
than they wanted awaiting them in the
shadows.","release_date":"2018-12-31"},{"vote_count":0,"id":437518,"video":false,"vote_average":0,"title":"Friend
of the
World","popularity":4.189267,"poster_path":"/hf3LucIg7t7DUvgGJ9DjQyHcI4J.jpg","original_language":"en","original_title":"Friend
of the
World","genre_ids":[35,18,27,878,53,10752],"backdrop_path":null,"adult":false,"overview":"After
a catastrophic war, an eccentric general guides a filmmaker through a
ravaged bunker.","release_date":"2018-12-31"}]}

import json
import http.client
import requests
import csv
conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()
json_data = json.loads(data)
results=json_data["results"]
for item in results:
print (item('vote_count'))
#write code to get necessary objects to write in csv
This is a way how you can do it. Comment if you have any query.

That looks like a JSON object, so you can parse it into a python dictionary using:
import json
mydict = json.loads(data)
Probably the values you want are in mydict[results] which is another set of key:value pairs. Depending on how you want these you could use a CSV library or just iterate through them and the print the contents with a tab between them.
for item in vars["results"]:
for k in item:
print("{}\t{}".format(k,item.get(k)))

Extract human name from his CV in Python

As you all know names of persons normally on the top of their resume, so i did NER(name entity recognition) tagging using spaCy library on CV's and then i extract the first tag of PERSON (hoping it should be Human Name). Some time it works for me fine but some time it gives me Other things which are not names(because spaCy don't even recognize some names with any NER tag), so it is giving me some other things which it recognize as a PERSON it may be like 'Curriculam vitae' obviously this i don't want.
Following is a Code for which i was talking above...
import spacy
import docx2txt
nlp = spacy.load('en_default')
my_text = docx2txt.process("/home/waqar/CV data/Adnan.docx")
doc_2 = nlp(my_text)
for ent in doc_2.ents:
if ent.label_ == "PERSON":
print('{}'.format(ent))
break
Is there any way through which i can add some name to NER for 'PERSON' tag in spaCy because then it will be able to recognize human names written in CV's
i think my logic is fine but something i am missing....
I would be very thankful if u peoples help me as i am Student and a beginer in python hope u peoples will definitely suggest some thing
OutPut
Abdul Ahad Ghous
but some time it giving me OutPuts like following as NER recognize it as a PERSON and don't even give any tag to the human name in this CV.
Curriculum Vitae

How do I do word Stemming or Lemmatization?

I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.
My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.
See also:
Stemming algorithm that produces real words
Stemming - code examples or open source projects?

If you know Python, The Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.
Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:
>>> import nltk
>>> nltk.download('wordnet')
You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'
There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

I use stanford nlp to perform lemmatization. I have been stuck up with a similar problem in the last few days. All thanks to stackoverflow to help me solve the issue .
import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
public class example
{
public static void main(String[] args)
{
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
pipeline = new StanfordCoreNLP(props, false);
String text = /* the string you want */;
Annotation document = pipeline.process(text);
for(CoreMap sentence: document.get(SentencesAnnotation.class))
{
for(CoreLabel token: sentence.get(TokensAnnotation.class))
{
String word = token.get(TextAnnotation.class);
String lemma = token.get(LemmaAnnotation.class);
System.out.println("lemmatized version :" + lemma);
}
}
}
}
It also might be a good idea to use stopwords to minimize output lemmas if it's used later in classificator. Please take a look at coreNlp extension written by John Conwell.

I tried your list of terms on this snowball demo site and the results look okay....
cats -> cat
running -> run
ran -> ran
cactus -> cactus
cactuses -> cactus
community -> communiti
communities -> communiti
A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.
I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.

The stemmer vs lemmatizer debates goes on. It's a matter of preferring precision over efficiency. You should lemmatize to achieve linguistically meaningful units and stem to use minimal computing juice and still index a word and its variations under the same key.
See Stemmers vs Lemmatizers
Here's an example with python NLTK:
>>> sent = "cats running ran cactus cactuses cacti community communities"
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>>
>>> port = PorterStemmer()
>>> " ".join([port.stem(i) for i in sent.split()])
'cat run ran cactu cactus cacti commun commun'
>>>
>>> wnl = WordNetLemmatizer()
>>> " ".join([wnl.lemmatize(i) for i in sent.split()])
'cat running ran cactus cactus cactus community community'

Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.
If you're really serious about good stemming though you're going to need to start with something like the Porter Algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I worked on once ended up with 800 some exceptions to a modified Porter algorithm.

Based on various answers on Stack Overflow and blogs I've come across, this is the method I'm using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (use whichever method you'd like), and then find the parts of speech (POS) for those words and use that to help stem and lemmatize the words.
You're sample above doesn't work too well, because the POS can't be determined. However, if we use a real sentence, things work much better.
import nltk
from nltk.corpus import wordnet
lmtzr = nltk.WordNetLemmatizer().lemmatize
def get_wordnet_pos(treebank_tag):
if treebank_tag.startswith('J'):
return wordnet.ADJ
elif treebank_tag.startswith('V'):
return wordnet.VERB
elif treebank_tag.startswith('N'):
return wordnet.NOUN
elif treebank_tag.startswith('R'):
return wordnet.ADV
else:
return wordnet.NOUN
def normalize_text(text):
word_pos = nltk.pos_tag(nltk.word_tokenize(text))
lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]
return [x.lower() for x in lemm_words]
print(normalize_text('cats running ran cactus cactuses cacti community communities'))
# ['cat', 'run', 'ran', 'cactus', 'cactuses', 'cacti', 'community', 'community']
print(normalize_text('The cactus ran to the community to see the cats running around cacti between communities.'))
# ['the', 'cactus', 'run', 'to', 'the', 'community', 'to', 'see', 'the', 'cat', 'run', 'around', 'cactus', 'between', 'community', '.']

http://wordnet.princeton.edu/man/morph.3WN
For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive porter stemming.
http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

Look into WordNet, a large lexical database for the English language:
http://wordnet.princeton.edu/
There are APIs for accessing it in several languages.

This looks interesting:
MIT Java WordnetStemmer:
http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html

Take a look at LemmaGen - open source library written in C# 3.0.
Results for your test words (http://lemmatise.ijs.si/Services)
cats -> cat
running
ran -> run
cactus
cactuses -> cactus
cacti -> cactus
community
communities -> community

The top python packages (in no specific order) for lemmatization are: spacy, nltk, gensim, pattern, CoreNLP and TextBlob. I prefer spaCy and gensim's implementation (based on pattern) because they identify the POS tag of the word and assigns the appropriate lemma automatically. The gives more relevant lemmas, keeping the meaning intact.
If you plan to use nltk or TextBlob, you need to take care of finding the right POS tag manually and the find the right lemma.
Lemmatization Example with spaCy:
# Run below statements in terminal once.
pip install spacy
spacy download en
import spacy
# Initialize spacy 'en' model
nlp = spacy.load('en', disable=['parser', 'ner'])
sentence = "The striped bats are hanging on their feet for best"
# Parse
doc = nlp(sentence)
# Extract the lemma
" ".join([token.lemma_ for token in doc])
#> 'the strip bat be hang on -PRON- foot for good'
Lemmatization Example With Gensim:
from gensim.utils import lemmatize
sentence = "The striped bats were hanging on their feet and ate best fishes"
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
#> ['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']
The above examples were borrowed from in this lemmatization page.

If I may quote my answer to the question StompChicken mentioned:
The core issue here is that stemming algorithms operate on a phonetic basis with no actual understanding of the language they're working with.
As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".
If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.

The most current version of the stemmer in NLTK is Snowball.
You can find examples on how to use it here:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball2-pysrc.html#demo

You could use the Morpha stemmer. UW has uploaded morpha stemmer to Maven central if you plan to use it from a Java application. There's a wrapper that makes it much easier to use. You just need to add it as a dependency and use the edu.washington.cs.knowitall.morpha.MorphaStemmer class. Instances are threadsafe (the original JFlex had class fields for local variables unnecessarily). Instantiate a class and run morpha and the word you want to stem.
new MorphaStemmer().morpha("climbed") // goes to "climb"

Do a search for Lucene, im not sure if theres a PHP port but I do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally it and community extras might have something interesting to look at. At the very least you can learn how it's done in one language so you can translate the "idea" into PHP.

.Net lucene has an inbuilt porter stemmer. You can try that. But note that porter stemming does not consider word context when deriving the lemma. (Go through the algorithm and its implementation and you will see how it works)

Martin Porter wrote Snowball (a language for stemming algorithms) and rewrote the "English Stemmer" in Snowball. There are is an English Stemmer for C and Java.
He explicitly states that the Porter Stemmer has been reimplemented only for historical reasons, so testing stemming correctness against the Porter Stemmer will get you results that you (should) already know.
From http://tartarus.org/~martin/PorterStemmer/index.html (emphasis mine)
The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further modification. As a stemmer, it is slightly inferior to the Snowball English or Porter2 stemmer, which derives from it, and which is subjected to occasional improvements. For practical work, therefore, the new Snowball stemmer is recommended. The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable.
Dr. Porter suggests to use the English or Porter2 stemmers instead of the Porter stemmer. The English stemmer is what's actually used in the demo site as #StompChicken has answered earlier.

In Java, i use tartargus-snowball to stemming words
Maven:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-snowball</artifactId>
<version>3.0.3</version>
<scope>test</scope>
</dependency>
Sample code:
SnowballProgram stemmer = new EnglishStemmer();
String[] words = new String[]{
"testing",
"skincare",
"eyecare",
"eye",
"worked",
"read"
};
for (String word : words) {
stemmer.setCurrent(word);
stemmer.stem();
//debug
logger.info("Origin: " + word + " > " + stemmer.getCurrent());// result: test, skincar, eyecar, eye, work, read
}

Try this one here: http://www.twinword.com/lemmatizer.php
I entered your query in the demo "cats running ran cactus cactuses cacti community communities" and got ["cat", "running", "run", "cactus", "cactus", "cactus", "community", "community"] with the optional flag ALL_TOKENS.
Sample Code
This is an API so you can connect to it from any environment. Here is what the PHP REST call may look like.
// These code snippets use an open-source library. http://unirest.io/php
$response = Unirest\Request::post([ENDPOINT],
array(
"X-Mashape-Key" => [API KEY],
"Content-Type" => "application/x-www-form-urlencoded",
"Accept" => "application/json"
),
array(
"text" => "cats running ran cactus cactuses cacti community communities"
)
);

I highly recommend using Spacy (base text parsing & tagging) and Textacy (higher level text processing built on top of Spacy).
Lemmatized words are available by default in Spacy as a token's .lemma_ attribute and text can be lemmatized while doing a lot of other text preprocessing with textacy. For example while creating a bag of terms or words or generally just before performing some processing that requires it.
I'd encourage you to check out both before writing any code, as this may save you a lot of time!

df_plots = pd.read_excel("Plot Summary.xlsx", index_col = 0)
df_plots
# Printing first sentence of first row and last sentence of last row
nltk.sent_tokenize(df_plots.loc[1].Plot)[0] + nltk.sent_tokenize(df_plots.loc[len(df)].Plot)[-1]
# Calculating length of all plots by words
df_plots["Length"] = df_plots.Plot.apply(lambda x :
len(nltk.word_tokenize(x)))
print("Longest plot is for season"),
print(df_plots.Length.idxmax())
print("Shortest plot is for season"),
print(df_plots.Length.idxmin())
#What is this show about? (What are the top 3 words used , excluding the #stop words, in all the #seasons combined)
word_sample = list(["struggled", "died"])
word_list = nltk.pos_tag(word_sample)
[wnl.lemmatize(str(word_list[index][0]), pos = word_list[index][1][0].lower()) for index in range(len(word_list))]
# Figure out the stop words
stop = (stopwords.words('english'))
# Tokenize all the plots
df_plots["Tokenized"] = df_plots.Plot.apply(lambda x : nltk.word_tokenize(x.lower()))
# Remove the stop words
df_plots["Filtered"] = df_plots.Tokenized.apply(lambda x : (word for word in x if word not in stop))
# Lemmatize each word
wnl = WordNetLemmatizer()
df_plots["POS"] = df_plots.Filtered.apply(lambda x : nltk.pos_tag(list(x)))
# df_plots["POS"] = df_plots.POS.apply(lambda x : ((word[1] = word[1][0] for word in word_list) for word_list in x))
df_plots["Lemmatized"] = df_plots.POS.apply(lambda x : (wnl.lemmatize(x[index][0], pos = str(x[index][1][0]).lower()) for index in range(len(list(x)))))
#Which Season had the highest screenplay of "Jesse" compared to "Walt" 
#Screenplay of Jesse =(Occurences of "Jesse")/(Occurences of "Jesse"+ #Occurences of "Walt")
df_plots.groupby("Season").Tokenized.sum()
df_plots["Share"] = df_plots.groupby("Season").Tokenized.sum().apply(lambda x : float(x.count("jesse") * 100)/float(x.count("jesse") + x.count("walter") + x.count("walt")))
print("The highest times Jesse was mentioned compared to Walter/Walt was in season"),
print(df_plots["Share"].idxmax())
#float(df_plots.Tokenized.sum().count('jesse')) * 100 / #float((df_plots.Tokenized.sum().count('jesse') + #df_plots.Tokenized.sum().count('walt') + #df_plots.Tokenized.sum().count('walter')))

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

I need to make a movie recommendation from text using spacy - python-3.x

Related

Information extracting from plain text using NLP

Is there a bi gram or tri gram feature in Spacy?

Python 3: Saving API Results into CSV

Extract human name from his CV in Python

How do I do word Stemming or Lemmatization?

Categories

Resources