Extracting information from plain text using NLP

My friends and I are working on a hobby project, trying to extract data from plain text. Nothing too complicated, just extracting a name, birth date, or something like that.
Let's say that we have a text file like this,
"Hello my name is John and I'm 22 years old. I'm living in USA and I like playing video games"
We want to fill a table like this
Name: John
Age: 22
From: USA
I've been looking into NLP since about last week and I don't even know where to start. Any kind of help is appreciated.

It looks like NER (Named Entity Recognition) is what you are looking for.
Here is a link that explains what NER is.
For the practical part, I suggest you have a look at this, but you can find a lot of free guides on the Internet.
Basically, you will have code that looks more or less like this:
import spacy  # spaCy is a Python module for working with NLP

nlp = spacy.load('en_core_web_sm')  # load the small English NLP model
sentence = "Apple is looking at buying U.K. startup for $1 billion"  # your sentence goes here
doc = nlp(sentence)  # process the sentence with the model and retrieve entities

for ent in doc.ents:  # for every entity, print its text, start index, end index, and label (entity type)
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
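To get from raw entities to your Name/Age/From table, you can map entity labels to fields. A minimal sketch, assuming the small English model tags "John" as PERSON, "USA" as GPE, and "22 years old" as DATE (label assignments can vary between model versions, so inspect doc.ents first):

import spacy

nlp = spacy.load('en_core_web_sm')
text = "Hello my name is John and I'm 22 years old. I'm living in USA and I like playing video games"
doc = nlp(text)

# Map spaCy entity labels to the table fields; label assignments
# can differ between model versions, so check doc.ents first.
table = {}
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        table['Name'] = ent.text
    elif ent.label_ == 'GPE':  # geopolitical entity, e.g. countries
        table['From'] = ent.text
    elif ent.label_ in ('DATE', 'CARDINAL'):  # "22 years old" is often tagged DATE
        table['Age'] = ent.text

print(table)  # e.g. {'Name': 'John', 'Age': '22 years old', 'From': 'USA'}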

Related

Training / using OpenAI GPT-3 for translations

I'm trying to use OpenAI to translate my product descriptions from one language into several others (EN, DE, CZ, SK, HU, PL, SI...). The translations, especially into the SK/CZ/HU/PL languages, are quite bad, mainly grammatically (using the text-davinci-003 model). I've got an idea: I already have a few thousand similar products fully translated into all of these languages by professional translators. Is it possible to use those existing correct translations to train GPT-3 and then use this model to translate new texts? Has anybody already tried something similar?
Have you tried using the Edits endpoint? You can fix grammar and spelling with it.
If you run test.py, the OpenAI API will return the following completion:
What day of the week is it?
test.py
import openai

openai.api_key = 'sk-xxxxxxxxxxxxxxxxxxxx'

response = openai.Edit.create(  # assign the result so it can be read below
    model = 'text-davinci-edit-001',
    input = 'What day of the wek is it?',  # intentional typo for the endpoint to fix
    instruction = 'Fix the spelling mistakes'
)

content = response['choices'][0]['text']
print(content)
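Regarding the actual training question: the (legacy) GPT-3 fine-tuning workflow accepted JSONL files of prompt/completion pairs, so your existing professional translations could in principle be turned into training data. A rough sketch of preparing such a file (the example pair below is made up, and whether a fine-tuned model beats plain prompting is something you would have to evaluate):

import json

# Hypothetical example pair: (source EN description, professional SK translation)
pairs = [
    ("Red cotton T-shirt with round neck",
     "Červené bavlnené tričko s okrúhlym výstrihom"),
]

with open('translations.jsonl', 'w', encoding='utf-8') as f:
    for source, target in pairs:
        record = {
            "prompt": f"Translate to Slovak: {source}\n\n###\n\n",
            "completion": f" {target} END",  # separator/stop conventions from the legacy guide
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Then, with the legacy CLI:
#   openai api fine_tunes.create -t translations.jsonl -m davinci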

Extracting Person Names from a text data in German Language using spacy or nltk?

I am using a spaCy model for the German language to extract named entities such as location names, person names and company names, but I am not getting proper results as output. Is there some concept I am missing here?
import spacy

def city_finder(text_data):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text_data)
    for ents in doc.ents:
        if ents.label_ == 'GPE':
            return ents.text
This is the code I used to find city names in the text data, but its accuracy is not very high. When I run this code the result is something other than the city name. Is there something I am missing in the natural language processing, or in some other area?
You have to load a German language model; you are currently loading an English language model (see the "en" prefix). There are two German models available:
de_core_news_sm
de_core_news_md
Source: https://spacy.io/models/de
You can install them via the following command:
python -m spacy download de_core_news_sm
However, only four different entity types are currently supported (and in my experience, without retraining the entity extraction doesn't work very well in German):
Supported NER: LOC, MISC, ORG, PER
You would have to adapt your code the following way:
def city_finder(text_data):
    nlp = spacy.load('de_core_news_sm')  # load the German language model
    doc = nlp(text_data)
    for ents in doc.ents:
        if ents.label_ == 'LOC':  # GPE is not supported by the German models
            return ents.text
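Note that city_finder returns as soon as it sees the first LOC entity. If you want every location in the text, a small variation (same model assumption as above) collects them all:

import spacy

def find_locations(text_data):
    nlp = spacy.load('de_core_news_sm')
    doc = nlp(text_data)
    # collect all LOC entities instead of returning on the first one
    return [ent.text for ent in doc.ents if ent.label_ == 'LOC']

print(find_locations("Ich wohne in Berlin und arbeite in Hamburg."))
# e.g. ['Berlin', 'Hamburg'] -- exact output depends on the model version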
There are standard libraries available for extracting language-specific POS. You can also check other libraries for extracting nouns; for example, the Pattern library from CLiPS (https://github.com/clips/pattern) implements POS tagging for languages like German and Spanish.

I need to make a movie recommendation from text using spacy

Thank you for taking the time to read this
I AM NOT ASKING YOU TO DO MY HOMEWORK FOR ME... I just need guidance.
I have a homework problem that I can't figure out. I need to do the following with the spaCy library in Python.
The Homework Question
Read in the movies.txt file. Each separate line is a description of a different movie.
Your task is to create a function to return which movies a user would watch
next if they have watched Planet Hulk with the description “Will he save
their world or destroy it? When the Hulk becomes too dangerous for the
Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator.”
The function should take in the description as a parameter and return the
title of the most similar movie.
The movies.txt file contains the following:
Movie A :When Hiccup discovers Toothless isn't the only Night Fury, he must seek "The Hidden World", a secret Dragon Utopia before a hired tyrant named Grimmel finds it first.
Movie B :After the death of Superman, several new people present themselves as possible successors.
Movie C :A darkness swirls at the center of a world-renowned dance company, one that will engulf the artistic director, an ambitious young dancer, and a grieving psychotherapist. Some will succumb to the nightmare. Others will finally wake up.
Movie D :A humorous take on Sir Arthur Conan Doyle's classic mysteries featuring Sherlock Holmes and Doctor Watson.
Movie E :A 16-year-old girl and her extended family are left reeling after her calculating grandmother unveils an array of secrets on her deathbed.
Movie F :In the last moments of World War II, a young German soldier fighting for survival finds a Nazi captain's uniform. Impersonating an officer, the man quickly takes on the monstrous identity of the perpetrators he is trying to escape from.
Movie G :The world at an end, a dying mother sends her young son on a quest to find the place that grants wishes.
Movie H :A musician helps a young singer and actress find fame, even as age and alcoholism send his own career into a downward spiral.
Movie I :Corporate analyst and single mom, Jen, tackles Christmas with a business-like approach until her uncle arrives with a handsome stranger in tow.
Movie J :Adapted from the bestselling novel by Madeleine St John, Ladies in Black is an alluring and tender-hearted comedy drama about the lives of a group of department store employees in 1959 Sydney.
Things that I have tried:
I have tried looking for a feature in spaCy that does something like this, but the only thing I could come across is the similarity function, and that only checks whether sentences have similar values...
Yes, I am new to spaCy.
My code so far
from __future__ import unicode_literals
import spacy

nlp = spacy.load("en_core_web_md")
myfile = open("movies.txt").read()
NlpRead = nlp(myfile)
sentence_to_compare = "Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator"
model_sentences = nlp(sentence_to_compare)

for sentence in myfile.splitlines():  # iterate over lines, not characters
    similarity = nlp(sentence).similarity(model_sentences)
    print(sentence + "-" + str(similarity))
spaCy has several pre-trained models available. You are using "en_core_web_md", which includes word vectors. According to the documentation, these word vectors are 'GloVe vectors trained on Common Crawl'.
As shown in the code and heatmap below, these word vectors capture semantic similarity and can help you cluster topics.
Naturally, this is not a solution to your homework problem, but a hint about a technique which you may find useful.
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pylab as plt
import spacy

nlp = spacy.load("en_core_web_md")
tokens = nlp(u'Hulk Superman Batman dragon elf dance musical handsome romance war soldier')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

labels = [a.text for a in tokens]
print(labels)

# pairwise similarities between all tokens
M = np.zeros((len(tokens), len(tokens)))
for idx, token1 in enumerate(tokens):
    for idy, token2 in enumerate(tokens):
        M[idx, idy] = token1.similarity(token2)

ax = sns.heatmap(M, cmap="RdBu_r", xticklabels=labels, yticklabels=labels)
plt.show()
spaCy also provides part-of-speech tagging, with which you can extract proper nouns and common nouns from sentences:
doc = nlp("Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator")
properNouns = [token.text for token in doc if token.pos_ =='PROPN']
commonNouns = [token.text for token in doc if token.pos_ =='NOUN']
print(properNouns)
# ['Hulk', 'Earth', 'Illuminati', 'Hulk', 'Hulk', 'Hulk', 'Sakaar']
print(commonNouns)
# ['world', 'shuttle', 'space', 'planet', 'peace', 'land', 'planet', 'slavery', 'gladiator']
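If it helps, here is a minimal sketch of how these pieces could combine into the recommendation function itself; it assumes each line of movies.txt has the "Title :description" shape shown above, and it is meant as a starting point rather than a finished solution:

import spacy

nlp = spacy.load("en_core_web_md")

def recommend(description, path="movies.txt"):
    target = nlp(description)
    best_title, best_score = None, -1.0
    for line in open(path, encoding="utf-8"):
        title, sep, text = line.partition(":")  # assumes "Title :description" lines
        if not sep or not text.strip():
            continue  # skip lines that don't match the expected shape
        score = nlp(text).similarity(target)
        if score > best_score:
            best_title, best_score = title.strip(), score
    return best_title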

Is there a bigram or trigram feature in spaCy?

The code below breaks the sentence into individual tokens, and the output is as below:
"cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")

for token in doc:
    print(token.text)
What I would ideally want is to read 'cloud computing' together, as it is technically one word.
Basically, I am looking for bigrams. Is there any feature in spaCy that supports bigrams or trigrams?
spaCy allows the detection of noun chunks. So, to parse your noun phrases as single entities, do this:
Detect the noun chunks
https://spacy.io/usage/linguistic-features#noun-chunks
Merge the noun chunks
Do the dependency parsing again; it will now parse "cloud computing" as a single entity.
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> for noun_phrase in list(doc.noun_chunks):
... noun_phrase.merge(noun_phrase.root.tag_, noun_phrase.root.lemma_, noun_phrase.root.ent_type_)
...
Cloud computing
major manufacturing companies
>>> [(token.text,token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
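Note that the Span.merge() call shown above was deprecated in later spaCy releases; from spaCy 2.1 on, the same merging is done through the retokenizer (a sketch of the equivalent; there is also a built-in merge_noun_chunks pipeline component):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Cloud computing is benefiting major manufacturing companies")

# merge each noun chunk into a single token (replaces the deprecated merge())
with doc.retokenize() as retokenizer:
    for chunk in list(doc.noun_chunks):  # materialize first; merging changes the doc
        retokenizer.merge(chunk)

print([(token.text, token.pos_) for token in doc])
# e.g. [('Cloud computing', 'NOUN'), ..., ('major manufacturing companies', 'NOUN')]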
If you have a spaCy doc, you can pass it to textacy:
import textacy

ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
Warning: this is just an extension of the correct answer by Zuzana.
My reputation does not allow me to comment, so I am making this answer just to address Adit Sanghvi's question above: "How do you do it when you have a list of documents?"
First you need to create a list with the texts of the documents
Then you join the texts into just one document
Now you use the spaCy parser to transform the text document into a spaCy document
Then you use Zuzana's answer to create the bigrams
This is the example code:
Step 1
doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love make bigrams']
listOfDocuments = [doc1,doc2,doc3]
textList = [text for doc in listOfDocuments for text in doc]  # flatten the nested lists
print(textList)
This will print this text:
['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']
Then steps 2 and 3:
import spacy

parser = spacy.load('en_core_web_sm')  # load a model to use as the parser
doc = ' '.join(textList)
spacy_doc = parser(doc)
print(spacy_doc)
and will print this:
all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams
Finally, step 4 (Zuzana's answer):
import textacy

ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)
will print this:
[make bigrams, make bigrams, make bigrams]
I had a similar problem (bigrams, trigrams, like your "cloud computing"). I made a simple list of the n-grams, word_3gram, word_2gram, etc., with the gram as the basic unit (cloud_computing).
Assume I have the sentence "I like cloud computing because it's cheap". Its 2-grams are: "I_like", "like_cloud", "cloud_computing", "computing_because", ... Comparing these against your bigram list, only "cloud_computing" is recognized as a valid bigram; all other bigrams in the sentence are artificial. To recover all the other words, you just take the first part of each:
"I_like".split("_")[0] -> I
"like_cloud".split("_")[0] -> like
"cloud_computing" -> in the bigram list, keep it
skip the next bigram "computing_because" ("computing" is already used)
"because_it's".split("_")[0] -> because, etc.
To also capture the last word in the sentence ("cheap") I added the token "EOL". I implemented this in Python, and the speed was OK (500k words in 3 minutes on an i5 processor with 8 GB RAM). Anyway, you only have to do it once. I find this more intuitive than the official (spaCy-style) chunk approach. It also works for non-spaCy frameworks.
I do this before the official tokenization/lemmatization, since otherwise you would get "cloud compute" as a possible bigram. But I'm not certain this is the best/right approach.
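A minimal sketch of the merge heuristic described above (the bigram list here is hypothetical, and the EOL token captures the final word):

known_bigrams = {"cloud_computing"}  # your precomputed list of valid bigrams

def merge_bigrams(sentence):
    words = sentence.split() + ["EOL"]  # EOL lets the last real word form a bigram
    out, i = [], 0
    while i < len(words) - 1:
        bigram = words[i] + "_" + words[i + 1]
        if bigram in known_bigrams:
            out.append(bigram)  # keep the valid bigram as a single unit
            i += 2              # skip the next, overlapping bigram
        else:
            out.append(words[i])  # keep only the first word of an artificial bigram
            i += 1
    return out

print(merge_bigrams("I like cloud computing because it's cheap"))
# ['I', 'like', 'cloud_computing', 'because', "it's", 'cheap']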

Extract a human name from a CV in Python

As you all know, people's names are normally at the top of their resumes, so I did NER (named entity recognition) tagging on CVs using the spaCy library and then extracted the first PERSON tag (hoping it would be a human name). Sometimes it works fine for me, but sometimes it gives me things that are not names, because spaCy doesn't even recognize some names with any NER tag and instead tags something else as PERSON, for example 'Curriculum Vitae', which obviously I don't want.
Following is the code I was talking about above...
import spacy
import docx2txt

nlp = spacy.load('en_default')
my_text = docx2txt.process("/home/waqar/CV data/Adnan.docx")
doc_2 = nlp(my_text)

for ent in doc_2.ents:
    if ent.label_ == "PERSON":
        print('{}'.format(ent))
        break
Is there any way I can add names to spaCy's NER under the 'PERSON' tag, so that it will be able to recognize human names written in CVs?
I think my logic is fine, but I am missing something...
I would be very thankful if you could help me, as I am a student and a beginner in Python; I hope you can suggest something.
Output
Abdul Ahad Ghous
but sometimes it gives me output like the following, where the NER recognizes this as a PERSON and doesn't give any tag to the human name in the CV:
Curriculum Vitae
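One direction worth exploring (a sketch, not a definitive fix): spaCy's EntityRuler lets you add pattern-based entities, so known names can be matched as PERSON alongside the statistical NER. The pipe name and API below follow spaCy v3, and the pattern is a hypothetical example:

import spacy

nlp = spacy.load("en_core_web_sm")

# Add a rule-based matcher before the statistical NER so its
# PERSON matches take precedence over the model's guesses.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Abdul Ahad Ghous"},  # hypothetical known name
])

doc = nlp("Curriculum Vitae\nAbdul Ahad Ghous\n...")
for ent in doc.ents:
    if ent.label_ == "PERSON":
        print(ent.text)
        break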
