NLP: relating "financial" and "finance" to the same root - nlp

There are two documents: one containing the term financial, and another containing finance.
How can I link them to the same root, finance?
NLTK stemming gives:
financi for financial
financ for finance
Lemmatization gives:
{financial -> financial}
{finance -> finance}

Related

How do you differentiate between names, places, and things?

Here is a list of proper nouns taken from The Lord of the Rings. I was wondering if there is a good way to sort them based on whether they refer to a person, place or thing. Does there exist a natural language processing library that can do this? Is there a way to differentiate between places, names, and things?
Shire, Tookland, Bagginses, Boffins, Marches, Buckland, Fornost, Norbury, Hobbits, Took, Thain, Oldbucks, Hobbitry, Thainship, Isengrim, Michel, Delving, Midsummer, Postmaster, Shirriff, Farthing, Bounders, Bilbo, Frodo
You're talking about Named Entity Recognition (NER). It is the information extraction task of locating and classifying pieces of text into predefined categories such as names, locations, organizations, time expressions, monetary values, etc. You can do this either with unsupervised methods, using a dictionary such as the list of words you have, or with supervised methods such as CRFs or neural networks. For the supervised route, you need a set of sentences annotated with the names and their classes. In this example here, using spaCy (an NLP library), the authors applied NER to the Lord of the Rings novels. You can read more in the link.
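The dictionary route can be sketched in a few lines. The gazetteer below is a tiny hand-made illustration, not a real resource, and the labels in it are assumptions made for the example. Anything missing from it comes back as UNKNOWN, which is the main weakness of this approach.

```python
# Hand-made gazetteer for illustration only; a real one would be far larger
GAZETTEER = {
    "shire": "PLACE",
    "buckland": "PLACE",
    "tookland": "PLACE",
    "bilbo": "PERSON",
    "frodo": "PERSON",
    "isengrim": "PERSON",
}

def classify(names, gazetteer=GAZETTEER):
    """Map each name to its gazetteer label, or UNKNOWN if it is not listed."""
    return {name: gazetteer.get(name.lower(), "UNKNOWN") for name in names}

labels = classify(["Shire", "Frodo", "Postmaster"])
for name, label in labels.items():
    print(name, label)
```

Names the dictionary has never seen are exactly where a trained NER model is meant to help.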
Here is the solution:
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Wikipedia link: https://en.wikipedia.org/wiki/Named-entity_recognition
Named Entity Recognition (NER) is a standard NLP problem which involves spotting named entities (people, places, organizations etc.) from a chunk of text, and classifying them into a predefined set of categories. Some of the practical applications of NER include:
Scanning news articles for the people, organizations and locations reported.
Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
Quickly retrieving geographical locations talked about in Twitter posts.
NER with spaCy
spaCy is often regarded as one of the fastest NLP frameworks in Python, with a single optimized function for each of the NLP tasks it implements. It is easy to learn and use, so simple tasks take only a few lines of code.
Installation:
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output:
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
In the output, the first column specifies the entity, the next two columns the start and end characters within the sentence/document, and the final column specifies the category.
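As a quick sanity check that needs no spaCy install: the start and end values behave like ordinary Python string indices with an exclusive end, so slicing the sentence with them should give back the entity text. The tuples below are copied by hand from the output above rather than produced by a model.

```python
# Offsets copied by hand from the spaCy output shown above
sentence = "Apple is looking at buying U.K. startup for $1 billion"
entities = [
    ("Apple", 0, 5, "ORG"),
    ("U.K.", 27, 31, "GPE"),
    ("$1 billion", 44, 54, "MONEY"),
]

for text, start, end, label in entities:
    # end is exclusive, exactly like a normal Python slice
    assert sentence[start:end] == text
    print(f"{label:<6}{sentence[start:end]!r}")
```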
Further, it is interesting to note that spaCy’s NER model uses capitalization as one of the cues to identify named entities. The same example, when tested with a slight modification, produces a different result.

Find whether a sentence is related to a medical term or not

Input: user enters a sentence
if the sentence is related to any medical term, or if the user needs any medical attention,
Output=True
else
Output=False
I am reading https://www.nltk.org/. I scraped 'https://www.merriam-webster.com/browse/medical/a' to get medical-related words, but I am unable to figure out how to detect sentences that are related to a medical term. I haven't written any code because the algorithm is not clear to me.
I want to know what I should use and where to start; I need a tutorial link to implement this. Any guidance will be highly appreciated.
I will list the various ways you can do this, from naive to intelligent:
Get a large vocabulary of medical terms, iterate over the sentence and return yes or no depending on whether you find any of them.
Get a large vocabulary of medical terms, iterate over the sentence and do a fuzzy match with each word, so that words that are spelling variations of the same word are still detected and caught. [Check the fuzzywuzzy library in Python]
Get a large vocabulary of medical terms with definitions for each. Use pre-trained word embeddings (word2vec, GloVe etc.) for each word in the descriptions of those terms. Take a weighted sum of the word embeddings, with weights set to the TF-IDF of each word, to represent each medical term (its description, to be precise) as a vector. Repeat the process for the sentence as well. Then take the cosine similarity between them to calculate how contextually similar the text is to the description of the medical term. If the similarity is above a certain threshold that you fix, then return True. [This approach doesn't need the exact term; even if the person is only talking about the condition, it should be able to detect it]
Label a large number of sentences with their respective medical terms (annotate using something like the API.AI entity annotation tool or the RASA entity annotation tool). Create a neural network with an input embedding layer (which you can initialise with word2vec embeddings if you like), bi-LSTM layers and a softmax output over the list of medical terms / conditions. This will give you the probability of each condition or term being associated with the sentence.
Create a neural network with an encoder-decoder architecture and an attention layer between them. Create encoder embeddings from the input sentence. Create a decoder whose output is a string of medical terms. Train the encoder-decoder with attention on pre-annotated data.
Create a pointer network which takes as input a sentence with the respective medical terms and returns pointers that point back to the inputs and mark each as a medical term or a non-medical term. (not easy to build, fyi...)
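The embedding-based approach (the third item above) can be sketched without any external library. Everything numeric here is made up for illustration: the 3-dimensional vectors stand in for pre-trained word2vec/GloVe embeddings, the weights stand in for TF-IDF scores, and the 0.8 threshold is arbitrary.

```python
import math

# Toy stand-ins for pre-trained embeddings (real ones have 100+ dimensions)
EMBEDDINGS = {
    "stomach": [0.9, 0.1, 0.0],
    "pain":    [0.8, 0.3, 0.1],
    "abdomen": [0.9, 0.2, 0.0],
    "ache":    [0.7, 0.4, 0.1],
    "movie":   [0.0, 0.1, 0.9],
    "ticket":  [0.1, 0.0, 0.8],
}
# Toy stand-ins for TF-IDF weights; words without a weight default to 1.0
WEIGHTS = {"stomach": 2.0, "pain": 1.5, "abdomen": 2.0, "ache": 1.5}

def text_vector(words):
    """Weighted sum of the embeddings of the words we have vectors for."""
    total = [0.0, 0.0, 0.0]
    for word in words:
        if word in EMBEDDINGS:
            weight = WEIGHTS.get(word, 1.0)
            total = [t + weight * x for t, x in zip(total, EMBEDDINGS[word])]
    return total

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def is_medical(sentence, term_description, threshold=0.8):
    """True if the sentence vector is close enough to the description vector."""
    sentence_vec = text_vector(sentence.lower().split())
    term_vec = text_vector(term_description.lower().split())
    return cosine(sentence_vec, term_vec) >= threshold

description = "pain in the abdomen"  # hypothetical definition of a medical term
print(is_medical("my stomach has a dull ache", description))
print(is_medical("i bought a movie ticket", description))
```

In practice the comparison would run against every term description in the vocabulary, returning True if any of them clears the threshold.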
OK, so which part do you not understand? The idea is rather simple, and one Google search gives you great and easy results. Unless the issue is that you don't know Python; in that case it will be very hard for you to implement this.
The idea itself is simple - tokenize sentence (have each word for itself in a list) and search the list of medical terms. If the current word is in the list, the term is medical so the sentence is related to that medical term as well. If you imagine that you have a list of medical terms in a medical_terms list then in python it would look something like this:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthurs' abdomen was hurting."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthurs', 'abdomen', "was", 'hurting', '.']
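>>> medical_terms = ['abdomen', 'fever']  # hypothetical stand-in for your scraped list
>>> def is_medical(tokens):
...     # Return True as soon as any token is a known medical term
...     for token in tokens:
...         if token.lower() in medical_terms:
...             return True
...     # Only return False after every token has been checked
...     return False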
>>> is_medical(tokens)
True
You just tokenize the input sentence with NLTK and then check whether any of the words in the sentence are in the list of medical terms. You can adapt this function to work with n-grams as well. There are a lot of other approaches and special cases that have to be handled, but this is a good start.
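The n-gram adaptation mentioned above could look roughly like this. It uses a plain whitespace split instead of nltk.word_tokenize so that it runs without NLTK, and the medical_terms set is a small hypothetical stand-in for the scraped vocabulary.

```python
# Hypothetical stand-in for the scraped vocabulary; note the multi-word terms
medical_terms = {"abdomen", "heart attack", "high blood pressure"}

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens joined by single spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def is_medical(sentence, terms=medical_terms, max_n=3):
    """True if any 1- to max_n-word sequence in the sentence is a known term."""
    # Lowercase and strip basic punctuation; a real tokenizer would do better
    tokens = [t.strip(".,!?;:'\"").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    return any(gram in terms
               for n in range(1, max_n + 1)
               for gram in ngrams(tokens, n))

print(is_medical("My father has high blood pressure."))
print(is_medical("We watched a film on Thursday."))
```

Storing the terms in a set keeps each lookup cheap even when the vocabulary is large.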

Word embeddings over user/customer reviews corpus

Most of the publicly available embeddings that I know of are trained on news articles, which use different language and vocabulary from user/customer reviews.
Although such embeddings can be used in NLP tasks concerning reviews and user-generated content, I think the difference in language plays an important role, so I would rather use embeddings trained on user-generated content, such as product reviews.
I'm looking for a corpus of reviews or comments in English (German and Dutch would also be useful) to generate embeddings, or alternatively embeddings already trained on such a corpus.
I found two datasets/corpora in English:
https://www.yelp.com/dataset_challenge
https://snap.stanford.edu/data/web-Amazon.html
in German:
http://www.uni-weimar.de/en/media/chairs/webis/corpora/corpus-webis-cls-10/

Training Tagger with Custom Tags in NLTK

I have a document with tagged data in the format Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model on a set of these types of tagged documents, and then use my model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.
As @AleksandarSavkov wrote already, this is essentially a named entity recognition (NER) task, or more generally a chunking task, as you already realize. How to do it is covered nicely in chapter 7 of the NLTK book. I recommend you ignore the sections on regexp tagging and use the approach in section 3, Developing and evaluating chunkers. It includes code samples you can use verbatim to create a chunker (the ConsecutiveNPChunkTagger). Your responsibility is to select features that will give you good performance.
You'll need to transform your data into the IOB format expected by NLTK's architecture. It expects part-of-speech tags, so the first step should be to run your input through a POS tagger; nltk.pos_tag() will do a decent enough job (once you strip off markup like [KEYWORD ...]), and requires no additional software to be installed. When your corpus is in the following format (word -- POS-tag -- IOB-tag), you are ready to train a recognizer:
Hi NNP O
here RB O
's POS O
my PRP$ O
phone NN B-KEYWORD
number NN I-KEYWORD
, , O
let VB O
me PRP O
...
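Stripping the markup and producing the IOB column can be sketched with the standard library alone. This is a rough illustration rather than the NLTK book's code: the regular expression assumes tags are never nested and look exactly like [TAG some words], tokenization is a naive whitespace split, and the POS column is left out (you would add it with nltk.pos_tag).

```python
import re

# Matches either a tagged span like "[CITY New York]" or a single untagged token
SPAN = re.compile(r"\[([A-Z_]+) ([^\]]+)\]|(\S+)")

def to_iob(text):
    """Convert bracket-tagged text to a list of (token, IOB-tag) pairs."""
    pairs = []
    for match in SPAN.finditer(text):
        tag, span, plain = match.groups()
        if plain is not None:
            # Untagged token: outside any chunk
            pairs.append((plain, "O"))
            continue
        # Tagged span: first token begins the chunk, the rest are inside it
        for i, token in enumerate(span.split()):
            prefix = "B" if i == 0 else "I"
            pairs.append((token, f"{prefix}-{tag}"))
    return pairs

sample = "I live in a [PROP_TYPE condo] in [CITY New York]."
for token, iob in to_iob(sample):
    print(token, iob)
```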
The problem you are looking to solve is most commonly called Named Entity Recognition (NER). There are many algorithms that can help you solve it, but the most important point is that you need to convert your text data into a format suitable for sequence taggers. Here is an example of the BIO format:
I O
love O
Paris B-LOC
and O
New B-LOC
York I-LOC
. O
From there, you can choose to train any type of classifier, such as Naive Bayes, SVM, MaxEnt, CRF, etc. Currently the most popular algorithm for such multi-token sequence classification tasks is CRF. There are available tools that will let you train a BIO model (although originally intended for chunking) from a file using the format shown above (e.g. YamCha, CRF++, CRFSuite, Wapiti). If you are using Python you can look into scikit-learn, python-crfsuite and PyStruct in addition to NLTK.
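Whichever toolkit you pick, the data usually has to be turned into per-token features and a parallel list of labels first. The sketch below is library-independent and only illustrative: the feature names are hand-picked assumptions, and treating a blank line as a sentence boundary is a common convention rather than something the tools above all require.

```python
def read_bio(lines):
    """Group 'token TAG' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit(" ", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

def token_features(sentence, i):
    """Hand-picked, illustrative features for the token at position i."""
    token = sentence[i][0]
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        "is_digit": token.isdigit(),
        "prev": sentence[i - 1][0].lower() if i > 0 else "<BOS>",
        "next": sentence[i + 1][0].lower() if i < len(sentence) - 1 else "<EOS>",
    }

data = ["I O", "love O", "Paris B-LOC", "and O", "New B-LOC", "York I-LOC", ". O"]
sentence = read_bio(data)[0]
X = [token_features(sentence, i) for i in range(len(sentence))]
y = [tag for _, tag in sentence]
print(X[2], y[2])
```

The X and y lists are the general shape a sequence classifier trains on; the exact input format differs between tools, so check the documentation of the one you choose.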

How to make or get corpus of financial documents

I am working on a document classification problem for financial reports/documents. Is there a ready-made corpus for this? I found a couple of use cases, but they all made their own corpus.
You will more than likely have to create your own corpus. I had a similar task, and manually creating such a corpus would have been too tedious. As a result I created News Corpus Builder, a Python module that allows you to quickly develop a corpus based on your particular topics of interest.
The module allows you to generate your own corpus and store the text and associated label in sqlite or as flat files.
from news_corpus_builder import NewsCorpusGenerator
# Location to save generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'
# Save results to sqlite or files per article
ex = NewsCorpusGenerator(corpus_dir,'sqlite')
# Retrieve 50 links related to the search term dogs and assign a category of Pet to the retrieved links
links = ex.google_news_search('dogs','Pet',50)
# Generate and save corpus
ex.generate_corpus(links)
More details on my blog
The finance corpus is available for download here. The corpus has the following categories:
Policy (licenses, regulation, SEC, monetary, fed, fiscal, IMF)
International Finance (global finance, IMF, ECB, trouble in Greece, RMB devaluation)
Economy (GDP, jobs, unemployment, housing, economy)
Raising Capital (IPO, equity)
Real Estate
Mergers & Acquisitions (merger, acquisitions)
Oil (oil, oil prices, natural gas price)
Commodities (commodities, gold, silver)
Fraud (insider trading, ponzi scheme, finance fraud)
Litigation (company litigation, company settlement)
Earning Reports
You can use the Reuters-21578 corpus. http://www.daviddlewis.com/resources/testcollections/reuters21578/
It is a basic corpus for text classification.

Resources