I'm currently working on a project where I take emails, strip out the message bodies using the email package, and then categorize them with labels like sports, politics, technology, etc. I've successfully extracted the message bodies from my emails and am now ready to start classifying.
To create multiple labels like sports, technology, politics and entertainment, I need a set of words for each one to do the labelling. For example, the
Sports label will have the label data: Football, Soccer, Hockey……
Where can I find label data online to help with this?
You can use DMOZ.
Be aware that there are different kinds of text. For example, one of the most common words in email text will be "Hi" or "Hello", but in wiki text "Hi" and "Hello" will not be common words.
What you're trying to do is called topic modeling:
https://en.wikipedia.org/wiki/Topic_model
The list of topics is very dependent on your training dataset and the ultimate purpose for which you're building this.
A good place to start can be here:
https://nlp.stanford.edu/software/tmt/tmt-0.4/
You can look at their topics, and you can probably also use the tool to assign some initial topics to your data and then build on top of them.
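If you want to experiment with topic modeling directly in Python, here is a minimal sketch using gensim's LDA implementation; the toy documents and the number of topics are illustrative assumptions, not part of the Stanford tool above:
# pip install gensim
from gensim import corpora, models

# Toy corpus: in practice these would be your tokenized email bodies.
texts = [
    ["football", "soccer", "hockey", "match", "goal"],
    ["election", "government", "policy", "vote"],
    ["software", "computer", "technology", "internet"],
]

dictionary = corpora.Dictionary(texts)                # map tokens to integer ids
corpus = [dictionary.doc2bow(doc) for doc in texts]   # bag-of-words vectors

# Fit an LDA model with an (assumed) number of topics.
lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics():
    print(topic_id, words)
Keep in mind that LDA only returns weighted word lists per topic, not readable labels like "sports"; mapping discovered topics to label names is still a manual step.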
You can use the BBC dataset.
It has labelled news articles which can help.
For feature extraction, remove stopwords, apply stemming, use n-grams with tf-idf, and then choose the best features.
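A minimal sketch of that feature-extraction pipeline with NLTK and scikit-learn; the toy documents, labels and the number of selected features are illustrative assumptions:
# pip install scikit-learn nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
# import nltk; nltk.download('stopwords')  # run once

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase, drop stopwords, and stem each remaining token.
    tokens = [w for w in text.lower().split() if w not in stop_words]
    return " ".join(stemmer.stem(w) for w in tokens)

docs = ["Football and hockey scores", "New smartphone technology released"]  # toy examples
labels = ["sports", "technology"]

# Unigrams and bigrams weighted by tf-idf.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(preprocess(d) for d in docs)

# Keep the k features most associated with the labels (k=5 is an arbitrary choice here).
X_best = SelectKBest(chi2, k=5).fit_transform(X, labels)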
Related
I want to do topic modelling, but in my case one article may contain many topics:
I have an article (Word file) that contains several topics, and each topic is associated with a company (see the example below).
I have a text as input :
"IBM is an international company specializing in all that is IT, on the other hand Facebook is a social network and Google is a search engine. IBM invented a very powerful computer."
Knowing we have labeled topics : "Products and services","Communications","Products and services"...
I want to have as output:
IBM : Products and services
Facebook : Communications
Google : Products and services
So, I think we can do this by splitting the text, i.e. associating with each company the parts of the text that talk about it, for example:
IBM : ['IBM is an international company specializing in all that is IT', 'IBM invented a very powerful computer.']
Facebook : ['Facebook is a social network']
Google : ['Google is a search engine']
Then, for each company, perform topic modelling on its associated parts of the text ...
OUTPUT:
IBM : Products and services
Facebook : Communications
Google : Products and services
Could you help me with how to split and match the parts of the text to each company, i.e. how to determine which parts talk about Facebook?
It seems like you have two separate problems: (1) Data preparation/cleaning, i.e. splitting your text into the right units for analysis; (2) classifying the different units of text into "topics".
1. Data Preparation
An 'easy' way of doing this would be to split your text into sentences and use the sentence as your unit of analysis. spaCy is good for this, for example (see e.g. this answer here). Your example is more difficult since you want to split sentences even further, so you would have to come up with custom logic for splitting your text according to specific patterns, e.g. using regular expressions. I don't think there is a standard way of doing this, and it depends very much on your data.
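As an illustration of the sentence-splitting step, here is a minimal sketch with spaCy, assuming the small English model has been downloaded:
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("IBM is an international company specializing in all that is IT, "
        "on the other hand Facebook is a social network and Google is a search engine. "
        "IBM invented a very powerful computer.")

doc = nlp(text)
sentences = [sent.text for sent in doc.sents]  # spaCy's sentence boundaries
print(sentences)

# For the finer per-company split you would still need custom rules,
# e.g. cutting sentences at connectors like "on the other hand".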
2. Topic classification
If I understand correctly, you already have the labels ("topics" like ["Products and services", "Communications"]) which you want to attribute to different texts.
In this case, topic modeling is probably not the right tool, because topic modeling is mostly used when you want to discover new topics and don't know the topics/labels yet. And in any case, a topic model would only return the most frequent/exclusive words associated with a topic, not a neat abstract topic label like "Products and services". You also need enough text for a topic model to produce meaningful output.
A more elegant solution is zero-shot classification. This basically means that you take a general machine learning model that has been pre-trained by someone else in a very general way for text classification and simply apply it to your specific use case of "topic classification" without having to train/fine-tune it. The Transformers library has a very easy-to-use implementation of this.
# pip install transformers==3.1.0 # pip install in terminal
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence1 = "IBM is an international company specializing in all that is IT"
sequence2 = "Facebook is a social network."
sequence3 = "Google is a search engine. "
candidate_labels = ["Products and services", "Communications"]
classifier(sequence1, candidate_labels)
# output: {'labels': ['Products and services', 'Communications'], 'scores': [0.8678683042526245, 0.1321316659450531]}
classifier(sequence2, candidate_labels)
# output: {'labels': ['Communications', 'Products and services'], 'scores': [0.525628387928009, 0.47437164187431335]}
classifier(sequence3, candidate_labels)
# output: {'labels': ['Products and services', 'Communications'], 'scores': [0.5514479279518127, 0.44855210185050964]}
=> It classifies all of the texts correctly given your example and labels. The label ("topic") with the highest score is the one the model thinks fits your text best. Note that you have to think hard about which labels are the most suitable. In your example, even as a human I wouldn't be sure which one fits better, and the model is not very sure either. With this zero-shot classification approach you can choose the topic labels that you find most adequate.
Here is an interactive web application to see what it does without coding. Here is a Jupyter notebook which demonstrates how to use it in Python. You can just copy-paste code from the notebook.
I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown and our editors are now becoming the bottleneck.
Solution: We would like to automate the first step of our process, i.e., checking the document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx to check for keywords, i.e., to check for compliance we use the following logic (a rough sketch of this check appears below):
1. To check "states purpose" we do a regex match on 'purpose', 'intent'
2. To check "identifies audience" we do a regex match on 'identifies', 'is for'
3. To check "highlights relevance" we do a regex match on 'able to', 'allows', 'enables'
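For reference, a minimal sketch of this keyword check in Python; the exact patterns are simplified illustrations of the logic above:
import re

# One pattern per required component (simplified; 'inten\w*' covers intent/intended/intention).
CHECKS = {
    "states_purpose": r"\b(purpose|inten\w*)\b",
    "identifies_audience": r"\b(identifies|is for)\b",
    "highlights_relevance": r"\b(able to|allows|enables)\b",
}

def check_report(text):
    # Report which of the three components were found in the text.
    return {name: bool(re.search(pattern, text, re.IGNORECASE))
            for name, pattern in CHECKS.items()}

report = ("This document introduces techniques for successfully applying best practices ... "
          "This study is intended to help low-income farmers ... "
          "farmers will be able to get better prices ...")
print(check_report(report))
# Note: even this 'good' report fails the audience check because it never literally says
# 'identifies' or 'is for' -- exactly the kind of limitation described below.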
The current approach of using RegEx seems very primitive and limited, so I wanted to ask the community if there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem; I believe it's a thorough research problem! In natural language processing there are a few techniques that learn and extract templates from text and then use them as gold annotations to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). Your case is more difficult because you need to learn the structure from a report. From an NLP point of view your problem is harder to pin down (no simple NLP task matches your problem definition), but you may not need any fancy (complex) model to solve it.
You can start with simple document matching and computing a similarity score. If you have a large collection of positive examples (well-formatted, compliant reports), you can construct a dictionary based on tf-idf weights and then check for the presence of the dictionary tokens. You can also treat this as a binary classification problem. There are good machine learning classifiers, such as SVM and logistic regression, which work well for text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing you can use NLTK.
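A minimal sketch of that binary (compliant vs. non-compliant) classifier with scikit-learn; the toy reports and labels are purely illustrative:
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = follows the template, 0 = does not.
reports = [
    "This document introduces ... is intended to help farmers ... will be able to ...",
    "Weekly activity summary with no stated purpose or audience.",
]
labels = [1, 0]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(reports, labels)

new_report = "This study is intended to help field staff price produce ..."
print(model.predict([new_report]))        # predicted class
print(model.predict_proba([new_report]))  # class probabilities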
Since the reports will be generated by field workers and there are a few specific questions the reports must answer (you mentioned 3 specific components), I think simple keyword-matching techniques are a good starting point. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
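A minimal sketch of that setup, assuming the curators' annotations arrive as sentence/label pairs for each clause type; the examples and the choice of a linear SVM are illustrative:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_clause_classifier(sentences, has_clause):
    # One binary classifier per clause type (purpose / audience / relevance).
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(sentences, has_clause)
    return clf

# Hypothetical curator-annotated examples for the "purpose" clause.
purpose_clf = train_clause_classifier(
    ["This study is intended to help low-income farmers.",
     "The weather was pleasant throughout the visit."],
    [1, 0],
)

def document_has_clause(clf, document_sentences):
    # A document satisfies a component if any of its sentences is classified positive.
    return any(clf.predict(document_sentences))

doc = ["We met the cooperative on Monday.",
       "This report is intended to help extension officers plan trainings."]
print(document_has_clause(purpose_clf, doc))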
If you don't give yourself any hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing the reports to conform to some format for the document, e.g. having three parts, each with a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse the document from .doc format, for instance, and extract the three parts. Then you can run spell checking, grammar checking, and text-complexity algorithms. And finally you can extract, for instance, named entities (cf. Named Entity Recognition) and low-TF-IDF words.
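As an illustration of the parsing step, a minimal sketch using the python-docx package, assuming the reports are .docx files and the three pre-defined titles above are used as headings (the file name is hypothetical):
# pip install python-docx
import re
from docx import Document

REQUIRED_SECTIONS = ["Purpose", "Topic / Problem", "Take away"]

def match_heading(text):
    # A heading is a short line consisting of an optional number plus a required title.
    for title in REQUIRED_SECTIONS:
        if re.fullmatch(r"\d*\.?\s*" + re.escape(title), text, re.IGNORECASE):
            return title
    return None

def extract_sections(path):
    # Group paragraph text under the most recent recognized heading.
    sections, current = {}, None
    for para in Document(path).paragraphs:
        text = para.text.strip()
        heading = match_heading(text)
        if heading:
            current = heading
            sections.setdefault(current, [])
        elif current and text:
            sections[current].append(text)
    return sections

sections = extract_sections("report.docx")  # hypothetical file
missing = [s for s in REQUIRED_SECTIONS if not sections.get(s)]
print("Missing sections:", missing)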
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise, CoreNLP and OpenNLP, the libraries I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether the word is a verb, a noun, an adjective or an adverb. The POS taggers and parsers in these libraries can tell you the part of speech of the word, so you could, for example, keep only the verbs that start with "inten", or, more strictly, only the third-person-singular verbs.
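For example, a minimal sketch of that filter with NLTK's POS tagger (Penn Treebank tags beginning with "VB" mark verbs; the sentence is just an illustration):
# pip install nltk
import nltk
# Run once: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "This study is intended to help low-income farmers."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(tagged)  # e.g. [('This', 'DT'), ('study', 'NN'), ('is', 'VBZ'), ('intended', 'VBN'), ...]

# Keep only the verbs that start with the prefix "inten".
inten_verbs = [word for word, tag in tagged
               if tag.startswith("VB") and word.lower().startswith("inten")]
print(inten_verbs)  # expected: ['intended']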
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
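If you want to call OpenIE from Python, one option is the CoreNLP client shipped with the stanza package; a minimal sketch, assuming CoreNLP is installed locally and CORENLP_HOME points to it:
# pip install stanza   (plus a local CoreNLP download, with CORENLP_HOME set)
from stanza.server import CoreNLPClient

text = "Born in a small town, she took the midnight train going anywhere."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma",
                               "depparse", "natlog", "openie"],
                   be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            # Each triple is a (subject, relation, object) extraction.
            print(triple.subject, "|", triple.relation, "|", triple.object)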
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.
I'm trying to make an analysis of a set of phrases, and I don't know exactly how "natural language processing" can help me, or if someone can share his knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way to parse it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis that extracts nouns (for example). But where does a street name begin and where does it end? I assume I will need to compare that analysis with a streets database, but I don't know which method is optimal.
Also, I would like to deduce the level of severity, for example in car accidents. I'm assuming the only way is to establish some heuristic based on the words present in the phrase (for example, if the word "deceased" appears, +100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First, the extraction of the streets themselves. [...] But where does a street name begin and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available that can do this; I like NLTK for Python myself. Depending on your choice, I assume a street-name database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation of your NLP library for details.
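As an illustration, a minimal sketch with NLTK's built-in named entity chunker; note that the default model recognizes generic types such as GPE or PERSON rather than street names specifically, so a street-name gazetteer or a custom-trained model would still help:
# pip install nltk
import nltk
# Run once: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
#           nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "A car accident was reported on Smith Street in London."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Collect the contiguous chunks that were labelled as named entities.
entities = [(" ".join(word for word, tag in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees() if subtree.label() != "S"]
print(entities)  # e.g. [('Smith Street', 'PERSON'), ('London', 'GPE')] -- labels will vary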
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part of speech tags as features, and train a classifier on it (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).
I need to categorize a text or word to a particular category. For example, the text 'Pink Floyd' should be categorized as 'music' or 'Wikimedia' as 'technology' or 'Einstein' as 'science'.
How can this be done? Is there a way I can use the DBpedia for the same? If not, the database has to be trained from time to time, right?
This is a text classification problem. Manning, Raghavan and Schütze's Introduction to Information Retrieval has a nice introductory chapter on it. I think you need neither DBpedia nor NER for this, just a small labeled training data set with enough labeled examples for each of your classes.
Yes, DBpedia may be a good choice for this kind of problem. You'll have to
squash the DBpedia category structure so you get the right granularity (e.g., Pink Floyd is listed under Capitol Records artists and a host of other categories, but not directly under Music). Maybe pick a few large categories and try to find whether your concepts are listed indirectly in them;
normalize text; Einstein is listed as Albert Einstein, not einstein
deal with ambiguity due to terms describing multiple concepts and concepts belonging to multiple top-level categories.
These problems may be solvable using machine learning, but I only see how it can be done if you extract these terms, along with relevant features, from running text. But in that case, you might just as well classify the entire text into one of the categories you choose in step 1.
This is the well-studied named entity recognition problem. Unless you have a particular need to roll your own technology (hint: it's a hard problem in general), using GATE, or perhaps one of the online services that build on it (e.g. TSO's Data Enrichment Service), would be a good option. An alternative online service is OpenCalais.
Map your categories to DBpedia.
Index the selected DBpedia categories with Lucene and label the data with your category names.
Then search for your data; tokenization and normalization will be done by Lucene.
This approach is somewhat related to kNN classification.
Yes, DBpedia is a good choice for text classification, as you can use its predicates/relations to query and extract the meaningful information for a particular category.
You can look into the endpoint for querying Dbpedia:
http://dbpedia.org/sparql
Further, learn the basic syntax of SPARQL to query on the endpoint from the following link:
http://www.w3.org/TR/rdf-sparql-query/
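As an illustration, a minimal sketch in Python using the SPARQLWrapper package to list the Wikipedia categories of a resource via dct:subject, which is how DBpedia links resources to categories (results depend on the live public endpoint):
# pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?category WHERE {
        dbr:Pink_Floyd dct:subject ?category .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    # Each binding is a category URI, e.g. .../Category:English_rock_music_groups
    print(row["category"]["value"])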
I am new to Natural Language Processing and I want to learn more by creating a simple project. NLTK was suggested to be popular in NLP so I will use it in my project.
Here is what I would like to do:
I want to scan our company's intranet pages; approximately 3K pages
I would like to parse and categorize the content of these pages based on certain criteria such as: HR, Engineering, Corporate Pages, etc...
From what I have read so far, I can do this with Named Entity Recognition. I can describe entities for each category of pages, train the NLTK solution and run each page through to determine the category.
Is this the right approach? I appreciate any direction and ideas...
Thanks
It looks like you want to do text/document classification, which is not quite the same as Named Entity Recognition, where the goal is to recognize named entities (proper names, places, institutions, etc.) in text. However, proper names can be very good features when doing text classification in a limited domain; it is, for example, likely that a page mentioning the name of the head engineer should be classified as Engineering.
The NLTK book has a chapter on basic text classification.
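A minimal sketch in the spirit of that chapter, using NLTK's Naive Bayes classifier with bag-of-words features; the example pages and categories are purely illustrative:
import nltk
# Run once: nltk.download('punkt')

def page_features(text):
    # Simple bag-of-words feature dictionary.
    return {word.lower(): True for word in nltk.word_tokenize(text)}

# Toy labelled pages; in practice you would label a sample of the ~3K intranet pages.
train_data = [
    ("Submit your vacation request and benefits enrollment form.", "HR"),
    ("The build pipeline failed: check the Jenkins job and unit tests.", "Engineering"),
    ("Quarterly results and the new corporate mission statement.", "Corporate"),
]

train_set = [(page_features(text), label) for text, label in train_data]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(page_features("How do I enroll in the benefits plan?")))
# Likely 'HR' for this toy model because of the overlapping vocabulary.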