Building a thesaurus from corpus - nlp

I am working on a natural language processing application. I have a text describing 30 domains; each domain is defined by a short paragraph that explains it. My aim is to build a thesaurus from this text so that I can determine from an input string which domains are concerned. The text is about 5000 words long, and each domain is described in about 150 words. My questions are:
Is the text long enough to build a thesaurus from?
Is building a thesaurus a sound idea, or should I just use NLP libraries to analyse my corpus and the input string?
At the moment, I have calculated the total number of occurrences of each word, grouped by domain, because I first thought of an index-based approach. But I am really not sure which method is best. Does anyone have experience in both NLP and thesaurus building?

I think what you are looking for is topic modeling. Given a word, you want the probability that it belongs to each domain. I would recommend using off-the-shelf algorithms that implement LDA (Latent Dirichlet Allocation).
Alternatively, you can visit David Blei's website. He has written some great software that implements LDA, and topic modeling in general. He has also presented several tutorials on topic modeling for beginners.

If your goal is to build a thesaurus, then build a thesaurus; if it is not, then you had better use what is already available.
More generally, for any task in NLP, from data acquisition to machine translation, you are going to face numerous problems (both technical and theoretical), and it is very easy to stray from the path, as these problems are, most of the time, fascinating.
Whatever the task is, first build a system using existing resources. That gives you the big picture; then you can start thinking about improving component A or B.
Good luck.
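As a starting point for the index-based approach mentioned in the question, here is a minimal sketch using only the Python standard library. It weights each word by how often it occurs in a domain and by how few domains contain it (a TF-IDF style weight), then scores an input string against every domain. The three domains, their texts and the function names are made up for illustration; this is a baseline to compare against, not a thesaurus and not LDA.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def build_index(domains):
    """domains: dict of name -> descriptive paragraph. Returns name -> {word: weight}."""
    counts = {name: Counter(tokenize(text)) for name, text in domains.items()}
    n_domains = len(counts)
    # Number of domains in which each word appears at least once.
    df = Counter(word for c in counts.values() for word in c)
    index = {}
    for name, c in counts.items():
        total = sum(c.values())
        # Words present in every domain get weight 0 because log(1) == 0.
        index[name] = {w: (k / total) * math.log(n_domains / df[w]) for w, k in c.items()}
    return index

def rank_domains(index, query):
    words = tokenize(query)
    scores = {name: sum(weights.get(w, 0.0) for w in words)
              for name, weights in index.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical domain descriptions (the real ones would be ~150 words each).
domains = {
    "plumbing": "Repair of pipes, taps and drains. A plumber fixes leaks in pipes.",
    "gardening": "Care of plants, lawns and trees. A gardener prunes trees and plants.",
    "banking": "Management of accounts and loans. A bank grants loans to clients.",
}
index = build_index(domains)
best = rank_domains(index, "my pipes have leaks")[0][0]
print(best)
```

With only 150 words per domain, exact word overlap will often be missing, which is where stemming, synonyms or a topic model would come in.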


NLP - linguistic consistency analysis

I hope you can help me :).
I am working for a translation company.
As you know, every translation consists in splitting the original text into small segments and then re-joining them into the final product.
In other words, the segments are considered as "translation units".
Often, especially in large documents, translators make linguistic consistency errors. Let me explain with an example.
In Spanish, you can use "tú" or "usted", depending on the context, and this determines how formal or informal the sentence sounds.
So, if you consider these two sentences of a document:
Lara, ¿te has lavado las manos? (TÚ)
Lara, ¿usted se lavó las manos? (USTED)
They are BOTH correct, but if you consider the whole document, there is a linguistic inconsistency.
I am studying the basics of NLP in my spare time, and I am trying to figure out how to create a tool that performs a linguistic consistency analysis on a set of sentences.
I am looking in particular at Stanford CoreNLP (I prefer Java to Python).
I guess that I first of all need some linguistic tools to perform verb analysis. And naturally, the tool should be able to work with different languages (EN, IT, ES, FR, PT).
Can anyone help me figure out how to start?
Any help would be appreciated,
thanks in advance!
I'm not sure about Stanford CoreNLP, but if you're considering it as an option, you could train your own tagger and add morphological modifiers to the POS tags. Then, use this as a translation feature.
In other words, instead of just tagging a word as a verb, you could tag it as "a verb in the second person".
There are already good pre-tagged corpora for Spanish that can help you do exactly that. For example, if you look at the Universal Dependencies AnCora corpus, you will find annotations for the Person of a verb.
With a little tweaking, you could build a composite PoS tag such as "Verb-1st-Person" and train a tagger on it.
I've written an article about how to do it in Python, but I bet that you can do it in Java using Weka. You can read the article here.
After this, I guess the next step is to check that the person used in one "translation unit" matches the others, or to do the same thing in a pipeline fashion.
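To make the last step concrete, here is a rough sketch of the consistency check itself. It does not use a trained tagger: as a stand-in, it assumes two tiny hand-written marker lists for Spanish (tú-forms versus usted-forms), labels each segment, and flags the segments that disagree with the majority. The marker lists and function names are illustrative assumptions; in a real tool the label would come from the Person/formality features produced by the tagger.

```python
import re
from collections import Counter

# Hand-picked markers, for illustration only. Verb endings are ignored here,
# which is exactly the part a morphological tagger would supply.
INFORMAL = {"tú", "te", "ti", "tu", "tus", "contigo"}
FORMAL = {"usted", "ustedes"}

def register_of(segment):
    """Return 'formal', 'informal', or None when no marker (or a tie) is found."""
    tokens = re.findall(r"\w+", segment.lower())
    informal = sum(t in INFORMAL for t in tokens)
    formal = sum(t in FORMAL for t in tokens)
    if formal > informal:
        return "formal"
    if informal > formal:
        return "informal"
    return None

def inconsistent_segments(segments):
    """Indices of segments whose register differs from the document's majority."""
    labels = [register_of(s) for s in segments]
    counts = Counter(label for label in labels if label)
    if len(counts) < 2:
        return []  # Zero or one register in use: nothing to flag.
    majority = counts.most_common(1)[0][0]
    return [i for i, label in enumerate(labels) if label and label != majority]

segments = [
    "Lara, ¿te has lavado las manos?",
    "Lara, ¿usted se lavó las manos?",
    "Dime tu nombre, por favor.",
]
print(inconsistent_segments(segments))
```

Each language would need its own marker source (for example tu/vous in French or tu/Lei in Italian), so the per-language part is best kept behind a function like `register_of`.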

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify thousands of reports annually from their field workers and contractors the world over. I'm relatively new to NLP and wanted to seek the group's guidance on how to approach our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the first step of our process, i.e., checking the document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords, i.e., to check for compliance we use the following logic:
1. To check "states purpose", we do a regex match on 'purpose', 'intent'
2. To check "identifies audience", we do a regex match on 'identifies', 'is for'
3. To check "highlights relevance", we do a regex match on 'able to', 'allows', 'enables'
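For reference, the keyword logic above can be written down as a few lines of Python. The rule names and keywords come from the list; the word-boundary anchors (`\b`) and the function name are my own assumptions about how the matching would be done. Running it on the sample report quoted earlier shows the limitation: the report is a good one, yet two of the three checks fail because it says "intended" and "identify" rather than the exact keywords.

```python
import re

# One pattern per required component. \b keeps 'intent' from matching inside
# longer words, which is an assumption about the intended behaviour.
RULES = {
    "states purpose": re.compile(r"\b(purpose|intent)\b", re.IGNORECASE),
    "identifies audience": re.compile(r"\b(identifies|is for)\b", re.IGNORECASE),
    "highlights relevance": re.compile(r"\b(able to|allows|enables)\b", re.IGNORECASE),
}

def missing_components(report):
    """Return the names of the components whose keywords do not occur in the report."""
    return [name for name, pattern in RULES.items() if not pattern.search(report)]

sample = (
    "This document introduces techniques for successfully applying best practices "
    "across developing countries. This study is intended to help low-income farmers "
    "identify a set of best practices for pricing agricultural products in places "
    "where there is no price transparency. By implementing these processes, farmers "
    "will be able to get better prices for their produce and raise their household incomes."
)
print(missing_components(sample))
```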
The current RegEx approach seems very primitive and limited, so I wanted to ask the community whether there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem; I believe it is a genuine research problem! In natural language processing, there are a few techniques that learn and extract templates from text and then use them as gold annotations to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). Your case is more difficult, as you need to learn the structure from a report. From an NLP standpoint your problem is hard to address directly (no standard NLP task matches your problem definition), but you may not need a fancy (complex) model to solve it.
You can start with simple document matching and compute a similarity score. If you have a large collection of positive examples (well-formatted, well-specified reports), you can construct a dictionary based on tf-idf weights and then check for the presence of the dictionary tokens. You can also treat this as a binary classification problem. There are good machine learning classifiers, such as SVMs and logistic regression, which work well on text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be written by field workers and answer only a few questions (you mentioned 3 specific components), I guess simple keyword matching techniques are a good start for your research. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
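Here is a small sketch of that idea, assuming the curators have labelled individual sentences as purpose, audience, relevance or other. It trains a multinomial Naive Bayes classifier from scratch (standard library only, add-one smoothing), labels every sentence of a report, and reports which required components never "fired". The eight training sentences are invented; with real curator annotations you would more likely use scikit-learn and far more data.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes over words, with add-one smoothing."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        known = [w for w in tokenize(sentence) if w in self.vocab]
        if not known:
            return "other"  # No evidence at all: do not count it as a component.
        n_sentences = sum(self.label_counts.values())

        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.label_counts[label] / n_sentences)
            return prior + sum(math.log((counts[w] + 1) / total) for w in known)

        return max(self.label_counts, key=log_score)

# Invented examples standing in for sentences marked by the curators.
TRAIN = [
    ("This document describes irrigation methods for dry regions.", "purpose"),
    ("This report introduces a technique for storing grain.", "purpose"),
    ("It is intended for small-scale farmers.", "audience"),
    ("The guide is written for local health workers.", "audience"),
    ("After reading, farmers will be able to reduce losses.", "relevance"),
    ("Readers will be able to raise their incomes.", "relevance"),
    ("The project started in 2010.", "other"),
    ("Funding came from several donors.", "other"),
]
REQUIRED = {"purpose", "audience", "relevance"}
model = NaiveBayes().fit([s for s, _ in TRAIN], [l for _, l in TRAIN])

def missing_components(model, report):
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    found = {model.predict(s) for s in sentences if s}
    return REQUIRED - found

report = ("This document introduces pricing techniques. "
          "It is intended for low-income farmers. "
          "Farmers will be able to get better prices.")
print(missing_components(model, report))
```

Reports that come back with a non-empty set go to a curator, and their corrections become new training sentences, which is the retraining loop described above.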
If you don't give yourself any hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing reports to conform to a fixed document format, for example three parts, each with a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You then parse this document (from .doc format, for instance) and extract the three parts. Then you can run spell checking, grammar checking and text complexity algorithms. Finally, you can extract, for instance, named entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise, CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether the word is a verb, a noun, an adjective or an adverb. The POS taggers and parsers in these libraries can tell you the type (POS) of the word, and maybe you only care about the verbs that start with "inten", or more strictly, about verbs in the third person singular.
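To show what the POS information buys you, here is a tiny sketch of the filtering step. It does not call CoreNLP or OpenNLP: the `tagged` list is written by hand to stand in for tagger output, using Penn Treebank tags (verb tags start with "VB"). Only the filter itself is the point.

```python
# (word, Penn Treebank tag) pairs, hand-written to stand in for real tagger output.
tagged = [
    ("This", "DT"), ("study", "NN"), ("is", "VBZ"), ("intended", "VBN"),
    ("to", "TO"), ("help", "VB"), ("farmers", "NNS"), (".", "."),
    ("The", "DT"), ("intent", "NN"), ("is", "VBZ"), ("clear", "JJ"), (".", "."),
]

def verbs_with_prefix(tagged_tokens, prefix):
    """Keep only tokens tagged as verbs whose lowercase form starts with the prefix."""
    return [word for word, tag in tagged_tokens
            if tag.startswith("VB") and word.lower().startswith(prefix)]

print(verbs_with_prefix(tagged, "inten"))
```

A plain regex on "inten" would have returned both "intended" and "intent"; with the tags, the noun is dropped. Restricting to third person singular present would mean keeping only the "VBZ" tag.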
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks, such as tokenization, sentence splitting and named entity recognition, and it is up to you to combine these tools with your domain knowledge and creativity to come up with a solution that works for your case.

Extracting information from unstructured text

I have a collection of "articles", each 1 to 10 sentences long, written in noisy, informal English (i.e. social media style).
I need to extract some information from each article, where available, like date and time. I also need to understand what the article is talking about and who is the main "actor".
Example, given the sentence: "Everybody's presence is required tomorrow morning starting from 10.30 to discuss the company's financial forecast.", I need to extract:
the date/time => "10.30 tomorrow morning".
the topic => "company's financial forecast".
the actor => "Everybody".
As far as I know, the date and time could be extracted without using NLP techniques but I haven't found anything as good as Natty (http://natty.joestelmach.com/) in Python.
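As a rough illustration of the "no NLP" route for date and time, here is a sketch that handles only the pattern in the example sentence: a relative day word plus a clock time such as "10.30". It is nowhere near what Natty covers; the word list, the pattern and the function name are assumptions made for this example, and the caller has to supply the reference date.

```python
import re
from datetime import date, datetime, time, timedelta

# Only two relative expressions are handled in this sketch.
RELATIVE_DAYS = {"today": 0, "tomorrow": 1}
# Matches 24-hour times written as 10.30 or 10:30. Note that a price such as
# "3.50" would also match, so real text needs more context than this.
TIME_PATTERN = re.compile(r"\b([01]?\d|2[0-3])[.:]([0-5]\d)\b")

def extract_datetime(text, today):
    """Return a datetime if both a relative day word and a clock time are found."""
    lowered = text.lower()
    offset = next(
        (days for word, days in RELATIVE_DAYS.items()
         if re.search(rf"\b{word}\b", lowered)),
        None,
    )
    match = TIME_PATTERN.search(lowered)
    if offset is None or match is None:
        return None
    day = today + timedelta(days=offset)
    return datetime.combine(day, time(int(match.group(1)), int(match.group(2))))

sentence = ("Everybody's presence is required tomorrow morning starting from 10.30 "
            "to discuss the company's financial forecast.")
print(extract_datetime(sentence, date(2024, 1, 1)))
```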
My understanding on how to proceed after reading some chapters of the NLTK book and watching some videos of the NLP courses on Coursera is the following:
1. Use part of the data to create an annotated corpus. I can't use an off-the-shelf corpus because of the informal nature of the text (e.g. spelling errors, uninformative capitalization, word abbreviations, etc.).
2. Manually (sigh...) annotate each article with tags from the Penn TreeBank tagset. Is there any way to automate this step and just check/fix the results ?
3. Train a POS tagger on the annotated articles. I've found the NLTK-trainer project, which seems promising (http://nltk-trainer.readthedocs.org/en/latest/train_tagger.html).
4. Chunking/Chinking, which means I'll have to manually annotate the corpus again (...) using the IOB notation. Unfortunately, according to this bug report, n-gram chunkers are broken: https://github.com/nltk/nltk/issues/367. This seems like a major issue, and it makes me wonder whether I should keep using NLTK, given that the bug report is more than a year old.
At this point, if I have done everything correctly, I assume I'll find actor, topic and datetime in the chunks. Correct ?
Could I (temporarily) skip 1,2 and 3 and produce a working, but possibly with a high error rate, implementation ? Which corpus should I use ?
I was also thinking of a pre-process step to correct common spelling mistakes or shortcuts like "yess", "c u" and other abominations. Anything already existing I can take advantage of ?
THE question, in a nutshell, is: is my approach at solving this problem correct ? If not, what am I doing wrong ?
"Could I (temporarily) skip 1, 2 and 3 and produce a working, but possibly with a high error rate, implementation? Which corpus should I use?"
"I was also thinking of a pre-process step to correct common spelling mistakes or shortcuts like "yess", "c u" and other abominations. Anything already existing I can take advantage of?"
I would suggest you first have a go at processing standard language text. The pre-processing you refer to is an NLP task in its own right, known as normalization. Here is a resource for Twitter normalization: http://www.ark.cs.cmu.edu/TweetNLP/. Additionally, you can use spell checking, sentence boundary detection, ...
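To give an idea of what the simplest form of normalization looks like, here is a sketch with two rules: squeeze letters repeated three or more times down to two, then look the word up in a replacement table. The table below is a made-up sample; in practice it would be built from your own corpus or taken from an existing resource, and a spell checker would handle what the table misses.

```python
import re

# Hypothetical lookup table of shortcuts and common misspellings.
SHORTCUTS = {"c": "see", "u": "you", "r": "are", "thx": "thanks", "yess": "yes", "soo": "so"}

def normalize(text):
    """Lowercase each word, squeeze long letter runs, and expand known shortcuts."""
    def fix(match):
        # "yesss" -> "yess": runs of 3+ identical characters are cut down to 2.
        word = re.sub(r"(.)\1{2,}", r"\1\1", match.group(0).lower())
        return SHORTCUTS.get(word, word)
    # Only alphabetic runs are rewritten; punctuation and digits stay in place.
    return re.sub(r"[A-Za-z]+", fix, text)

print(normalize("Yesss, c u tomorrow!"))
```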
"THE question, in a nutshell, is: is my approach at solving this problem correct? If not, what am I doing wrong?"
Setting normalization aside, I think your approach is valid. With regard to automating the annotation process: you can bootstrap it by using off-the-shelf components first, after which you correct, retrain, and so on, over several iterations. To get acceptable results, you will need to repeat your steps 2, 3 and 4 a couple of times.
If you are interested in understanding the problem and being able to optimize existing solutions, I would suggest you focus on tools that allow you to develop your own models. If you prioritize getting results over being able to develop your own models, I would recommend looking into existing open source text engineering frameworks such as GATE (https://gate.ac.uk/), UIMA (http://uima.apache.org/) and DKPro (which extends UIMA) (https://code.google.com/p/dkpro-core-asl/). All three frameworks wrap existing components, so you have a wide range of possible solutions.
I'd suggest giving NER and a temporal normalizer a try.
Here is what I see for your example sentence:
You can try the demo here:
http://deagol.cs.illinois.edu:8080/

Semantic search with NLP and elasticsearch

I am experimenting with elasticsearch as a search server and my task is to build a "semantic" search functionality. From a short text phrase like "I have a burst pipe" the system should infer that the user is searching for a plumber and return all plumbers indexed in elasticsearch.
Can that be done directly in a search server like elasticsearch, or do I have to use a natural language processing (NLP) tool such as Maui Indexer? What is the exact terminology for my task at hand: text classification? The given text is very short, though, as it is a search phrase.
There may be several approaches with different implementation complexity.
The easiest one is to create a list of topics (like plumbing), attach a bag of words to each (like "pipe"), identify the topic of a search request by the majority of its keywords, and search only within that topic (you can add a topic field to your Elasticsearch documents and mark it as mandatory with + in the search).
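A minimal sketch of this first approach, assuming a hand-made keyword table and documents that carry a `topic` keyword field and a `description` text field (both field names are invented for this example). The function only builds the request body; sending it with whichever Elasticsearch client you use is left out. The topic goes into a `filter` clause, which plays the role of the mandatory `+` term, and the phrase goes into `should` so that it affects ranking without excluding plumbers whose description never mentions pipes.

```python
import re

# Hand-made bags of words per topic (illustrative only).
TOPIC_KEYWORDS = {
    "plumbing": {"pipe", "leak", "tap", "drain", "burst"},
    "electrical": {"socket", "fuse", "wiring", "power"},
}

def detect_topic(phrase):
    """Pick the topic with the most keyword hits, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", phrase.lower()))
    hits = {topic: len(words & keywords) for topic, keywords in TOPIC_KEYWORDS.items()}
    topic = max(hits, key=hits.get)
    return topic if hits[topic] > 0 else None

def build_query(phrase):
    """Build an Elasticsearch bool query body restricted to the detected topic."""
    bool_query = {"should": [{"match": {"description": phrase}}]}
    topic = detect_topic(phrase)
    if topic is not None:
        bool_query["filter"] = [{"term": {"topic": topic}}]
    return {"query": {"bool": bool_query}}

body = build_query("I have a burst pipe")
print(body)
```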
Of course, if you have lots of documents, manually creating the topic list and the bags of words is very time-consuming. You can use machine learning to automate some of these tasks. Basically, a distance measure between words and/or documents is enough to discover topics automatically (e.g. by data clustering) and to classify a query into one of these topics. A mix of these techniques may also be a good choice (for example, you can manually create topics and assign initial documents to them, but use classification for query assignment). Take a look at Wikipedia's article on latent semantic analysis to better understand the idea. Also pay attention to the 2 linked articles on data clustering and document classification. And yes, Maui Indexer may be a good helper tool here.
Finally, you can try to build an engine that "understands" the meaning of the phrase (rather than just using term frequencies) and searches the appropriate topics. Most probably, this will involve natural language processing and ontology-based knowledge bases. But this field is still under active research, and without previous experience it will be very hard for you to implement something like this.
You may want to explore https://blog.conceptnet.io/2016/11/03/conceptnet-5-5-and-conceptnet-io/.
It combines semantic networks and distributional semantics.
When most developers need word embeddings, the first and possibly only place they look is word2vec, a neural net algorithm from Google that computes word embeddings from distributional semantics. That is, it learns to predict words in a sentence from the other words around them, and the embeddings are the representation of words that make the best predictions. But even after terabytes of text, there are aspects of word meanings that you just won’t learn from distributional semantics alone.
Some results
The ConceptNet Numberbatch word embeddings, built into ConceptNet 5.5, solve these SAT analogies better than any previous system. It gets 56.4% of the questions correct. The best comparable previous system, Turney’s SuperSim (2013), got 54.8%. And we’re getting ever closer to “human-level” performance on SAT analogies — while particularly smart humans can of course get a lot more questions right, the average college applicant gets 57.0%.
Semantic search is basically search with meaning. Elasticsearch uses JSON serialization by default; to apply search with meaning to JSON, you would need to extend it to support edge relations via JSON-LD. You can then apply your semantic analysis over the JSON-LD schema to disambiguate the plumber entity and the burst-pipe context as subject, predicate, object relationships. Elasticsearch has very weak semantic search support, but you can work around it using faceted search and a bag of words. You can index a thesaurus schema for plumbing terms, then do semantic matching over the text phrases in your sentences.
"Elasticsearch 7.3 introduced text similarity search with vector fields".
They describe the application of using text embeddings (e.g., word embeddings and sentence embeddings) to implement this sort of semantic similarity measure.
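As a sketch of what such a query can look like: documents are indexed with a dense vector field holding their embedding, and the search re-scores documents by cosine similarity to the query embedding through a `script_score` query. The field name `embedding` and the 3-dimensional toy vectors are assumptions for illustration (real embeddings have hundreds of dimensions), and the exact script syntax for `cosineSimilarity` has changed between Elasticsearch versions, so check the documentation for the version you run. The pure-Python `cosine` function is only there to show what the score means.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_query(query_vector, field="embedding"):
    """Request body that ranks all documents by cosine similarity to query_vector."""
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 keeps the score non-negative, as Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    }

# Toy vectors standing in for real sentence embeddings.
query_vec = [0.9, 0.1, 0.0]          # "I have a burst pipe"
docs = {"plumber": [1.0, 0.0, 0.0], "baker": [0.0, 0.0, 1.0]}
ranked = sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)
print(ranked)
print(vector_query(query_vec))
```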
A bit late to the party, but part II of this blog seems to address this through "contextual searches". It basically makes a two-part query to Elasticsearch in order to build a list of "seed" documents and then an expanded query via the more-like-this API. The result is a set of documents most contextually similar to the search query.
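I don't have the blog's code at hand, so the following is only my guess at the general shape of such a two-part query, not a reproduction of it. The first body is an ordinary `match` query used to fetch a handful of seed documents; the second feeds the IDs of those seeds into a `more_like_this` query. The index name, field name and parameter values are placeholders.

```python
def seed_query(text, field="description", size=5):
    """Step 1: a plain full-text query whose top hits serve as seed documents."""
    return {"size": size, "query": {"match": {field: text}}}

def expanded_query(index, seed_ids, field="description"):
    """Step 2: ask for documents similar to the seeds via more_like_this."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": [{"_index": index, "_id": doc_id} for doc_id in seed_ids],
                # Low thresholds suit short documents; tune them for real data.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }

step1 = seed_query("burst pipe")
# In practice seed_ids would be read from the hits returned for step1.
step2 = expanded_query("services", ["12", "47"])
print(step2)
```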
It's possible. This GitHub repo shows how to integrate Elasticsearch with BERT (Bidirectional Encoder Representations from Transformers), the state of the art in NLP for semantic representation of language at the time of writing: https://github.com/Hironsan/bertsearch
Good luck.
My suggestion is to use BERT embeddings for your sentences and add an embedding field to your Elasticsearch documents, as described in https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
For BERT embeddings, I suggest using the sentence-transformers library, whose models are hosted on Hugging Face. You can find sample code in https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8
There are several options for that:
You can do it in Elasticsearch itself. Elasticsearch supports indexing dense embeddings of documents. From there, you can write your own search pipeline and use your preferred relevance score formula, e.g. cosine similarity or something else.
Use a Haystack pipeline; refer to my blog, which describes setting up a semantic search pipeline (end-to-end).
You can use Meta's Faiss.

Document Analysis and Tagging

Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian classifier or even Latent Semantic Analysis, but I'm not really familiar with either, other than what I've found in a few Ruby gems.
Can something like this be solved by a Bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or should I just be looking at keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There are definitely a lot of books and articles you can read about it, but I will try to provide a short introduction. I am not a big expert, but I have worked on some of this stuff.
First you need to decide whether you want to classify essays into predefined topics/categories (a classification problem) or you want the algorithm to come up with the groups on its own (a clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need a number of essays that are already separated into different groups, for example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible, but how much is enough depends on the specific algorithm. You also need verification data, which is similar to the training data but kept completely separate. This data is used to judge the quality (or performance, in math-speak) of your algorithm.
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF-based. For Bayes, I am currently developing something similar for myself in Ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow-up questions I will try to answer.
TF-IDF is short for Term Frequency - Inverse Document Frequency. Basically, the idea is, for any given document, to find the documents in the training set that are most similar to it, and then figure out its category based on those. For example, if document D is similar to T1, which is physics, T2, which is physics, and T3, which is chemistry, you guess that D is most likely about physics and a little chemistry.
The way it's done is that you give the most weight to rare words and no weight to common words. For instance, 'nuclei' is a rare physics word, but 'work' is a very common, uninteresting word. (That's why it's called inverse document frequency.) If you can work with Java, there is the very good Lucene library, which provides most of this stuff out of the box. Look for the API for 'similar documents' and look into how it is implemented. Or just google 'TF-IDF' if you want to implement your own.
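If you do want to roll your own, the TF-IDF idea fits in a few dozen lines. The sketch below (standard library only) weights each word by its count times log(N / number of training documents containing it), compares documents by cosine similarity, and lets the k most similar training documents vote on the category. The six one-sentence "essays" are invented, and the particular weighting formula is just one common variant.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one {word: weight} vector per document, plus the idf table."""
    counts = [Counter(tokenize(d)) for d in docs]
    df = Counter(word for c in counts for word in c)
    idf = {w: math.log(len(docs) / df[w]) for w in df}  # Rare words weigh more.
    return [{w: k * idf[w] for w, k in c.items()} for c in counts], idf

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text, train_docs, train_labels, k=3):
    """Majority vote among the k training documents most similar to text."""
    vectors, idf = tfidf_vectors(train_docs)
    # Words never seen in training get weight 0.
    query = {w: c * idf.get(w, 0.0) for w, c in Counter(tokenize(text)).items()}
    ranked = sorted(zip(vectors, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train_docs = [
    "Nuclei contain protons and neutrons.",
    "Neutrons and protons have nearly equal mass.",
    "Gravity and mass bend the path of light.",
    "Acids react with bases to form salts.",
    "Salts dissolve in water.",
    "Acids and bases change the colour of indicators.",
]
train_labels = ["physics"] * 3 + ["chemistry"] * 3
print(classify("The nuclei hold protons and neutrons.", train_docs, train_labels))
```

In this toy run the three nearest neighbours of the query are two physics documents and one chemistry document, which mirrors the T1/T2/T3 example above. Judging how well this works still needs the separate verification set.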
I've done something similar in the past (though it was for short news articles) using a vector-cluster algorithm. I don't remember its name right now, but it was what Google used in its infancy.
Using their paper I was able to have a prototype running in PHP in one or two days, then I ported it to Java for speed purposes.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
