I want to do topic modelling, but in my case one article may contain many topics:
I have an article (Word file) that contains several topics, and each topic is associated with a company (see the example below).
I have a text as input:
"IBM is an international company specializing in all that is IT, on the other hand Facebook is a social network and Google is a search engine. IBM invented a very powerful computer."
Knowing that we have labelled topics: "Products and services", "Communications", "Products and services", ...
I want to have as output:
IBM : Products and services
Facebook : Communications
Google : Products and services
So, I think we can do this by splitting the text: associate the parts of the text that talk about each company, for example:
IBM : ['IBM is an international company specializing in all that is IT', 'IBM invented a very powerful computer.']
Facebook : ['Facebook is a social network']
Google : ['Google is a search engine']
then, for each company, perform topic modelling based on its parts of the text ...
OUTPUT:
IBM : Products and services
Facebook : Communications
Google : Products and services
Could you help me with how to split and match the parts of the text to each company, i.e. how to determine which parts talk about Facebook in the text, for example?
It seems like you have two separate problems: (1) Data preparation/cleaning, i.e. splitting your text into the right units for analysis; (2) classifying the different units of text into "topics".
1. Data Preparation
An 'easy' way of doing this would be splitting your text into sentences and using sentences as your unit of analysis. spaCy is good for this, for example (see e.g. this answer here). Your example is more difficult since you want to split sentences even further, so you would have to come up with custom logic for splitting your text according to specific patterns, e.g. using regular expressions. I don't think there is a standard way of doing this, and it depends very much on your data.
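As a rough illustration of that idea, here is a minimal sketch using spaCy for sentence segmentation plus a simple regular expression and string matching; the company list, the splitting pattern and the en_core_web_sm model are assumptions you would adapt to your own data:
# pip install spacy && python -m spacy download en_core_web_sm
import re
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("IBM is an international company specializing in all that is IT, "
        "on the other hand Facebook is a social network and Google is a search engine. "
        "IBM invented a very powerful computer.")
companies = ["IBM", "Facebook", "Google"]  # assumed to be known in advance

# 1) split into sentences with spaCy
sentences = [sent.text.strip() for sent in nlp(text).sents]

# 2) naively split each sentence further on a few connectives (adapt to your data)
fragments = []
for sentence in sentences:
    parts = re.split(r",?\s*on the other hand\s+|\s+and\s+", sentence)
    fragments.extend(p.strip(" ,") for p in parts if p.strip())

# 3) assign each fragment to every company it mentions
parts_per_company = {c: [f for f in fragments if c in f] for c in companies}
print(parts_per_company)
# e.g. {'IBM': ['IBM is an international company specializing in all that is IT',
#               'IBM invented a very powerful computer.'], ...}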
2. Topic Classification
If I understand correctly, you already have the labels ("topics" like ["Products and services", "Communications"]) which you want to attribute to different texts.
In this case, topic modeling is probably not the right tool, because topic modeling is mostly used when you want to discover new topics and don't know the topics/labels yet. And in any case, a topic model would only return the most frequent/exclusive words associated with a topic, not a neat abstract topic label like "Products and services". You also need enough text for a topic model to produce meaningful output.
A more elegant solution is zero-shot classification. This basically means that you take a general machine learning model that has been pre-trained by someone else in a very general way for text classification, and you simply apply it to your specific use case of "topic classification" without having to train/fine-tune it. The Transformers library has a very easy-to-use implementation of this.
# pip install transformers==3.1.0 # pip install in terminal
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence1 = "IBM is an international company specializing in all that is IT"
sequence2 = "Facebook is a social network."
sequence3 = "Google is a search engine."
candidate_labels = ["Products and services", "Communications"]
classifier(sequence1, candidate_labels)
# output: {'labels': ['Products and services', 'Communications'], 'scores': [0.8678683042526245, 0.1321316659450531]}
classifier(sequence2, candidate_labels)
# output: {'labels': ['Communications', 'Products and services'], 'scores': [0.525628387928009, 0.47437164187431335]}
classifier(sequence3, candidate_labels)
# output: {'labels': ['Products and services', 'Communications'], 'scores': [0.5514479279518127, 0.44855210185050964]}
=> It classifies all texts correctly based on your example and labels. The label ("topic") with the highest score is the one the model thinks fits your text best. Note that you have to think hard about which labels are the most suitable. In your example, I wouldn't even be sure as a human which one fits better, and the model is also not very sure. With this zero-shot classification approach you can choose the topic labels that you find most adequate.
Here is an interactive web application to see what it does without coding. Here is a Jupyter notebook which demonstrates how to use it in Python. You can just copy-paste code from the notebook.
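To tie the two steps together, one possible sketch (reusing the classifier and candidate_labels defined above, plus the hypothetical parts_per_company dictionary from the data-preparation sketch) is to classify each company's concatenated fragments and keep the top label:
def topic_per_company(parts_per_company, candidate_labels):
    results = {}
    for company, parts in parts_per_company.items():
        if not parts:
            continue
        # classify the concatenated fragments for this company
        output = classifier(" ".join(parts), candidate_labels)
        results[company] = output["labels"][0]  # label with the highest score
    return results

print(topic_per_company(parts_per_company, candidate_labels))
# expected to be along the lines of:
# {'IBM': 'Products and services', 'Facebook': 'Communications', 'Google': 'Products and services'}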
Hi, I would like to know whether we can do something like the following example in Doccano:
Let's say we have a sentence like this: "MS is an IT company". I want to label some words in this sentence, for example MS (Microsoft). MS should be labelled as a Company (so imagine that I have an entity named Company), but I also want to say that MS stands for Microsoft.
Is there a way to do that with Doccano?
Thanks
Doccano supports:
Sequence Labelling, good for Named Entity Recognition (NER)
Text Classification, good e.g. for Sentiment Analysis
Sequence To Sequence, good for Machine Translation
What you're describing sounds a little like Entity Linking.
You can see from Doccano's roadmap in its docs that Entity Linking is part of the plans, but not yet available.
For now, I suggest framing this as an NER problem and having different entities for MS (Microsoft) and MS (other). If you have too many entities to choose from, the labelling could become complicated, but then you could break up the dataset into smaller, entity-focused datasets. For example, you could take only the documents with MS in them and label the mentions as one of the few synonyms.
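As an illustration of that workaround, a labelled example might look roughly like the following in the offset-based JSONL format Doccano uses for sequence labelling imports/exports; the exact field names can differ between Doccano versions, so treat this as a sketch:
import json

example = {
    "text": "MS is an IT company",
    "labels": [[0, 2, "Company-Microsoft"]],  # start offset, end offset, entity name
}
print(json.dumps(example))
# {"text": "MS is an IT company", "labels": [[0, 2, "Company-Microsoft"]]}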
I am trying to generate a taxonomy of extracted terminology using topic models. Therefore, I had to use Hierarchical Latent Dirichlet allocation.
However, after getting the topics tree, I would like to annotate topics but I am unable to produce the word-topic distribution in Mallet.
I have checked the parameters, it seems as if the only output file I can get is the output state, and it doesn't show the needed information.
I am using the Mallet implementation from the command line, with the following command:
bin/mallet run cc.mallet.topics.tui.HierarchicalLDATUI --input my_corpus.mallet --output-state topic-statehlda.txt
I managed to get the topic-statehlda.txt file, which contains all the topic paths for the words, and I have also visualized it (an example of the topics tree is in TopicsTree; terms were trimmed because they make the tree big and difficult to navigate). Some terms occur in multiple topics, which is why I am interested in the word-topic distribution, to be able to select the most representative ones.
Can you please advise? Is there another way to retrieve topic labels?
I am applying hLDA over documents from the same topic, and I am only using hLDA to extract possible taxonomies over a list of automatically extracted terminology (noun phrases). Does this look meaningful, or is it bad practice?
The corpus is a collection of OCR'ed insurance documents. An example of my automatically extracted taxonomy is:
motor insurance policy, motor policy schedule, motorcycle policy schedule, policy cover, cover use, cover note, theft cover, windscreen cover, comprehensive cover,
breakdown cover, commercial vehicle policy, commercial vehicle, motor vehicle,
vehicle policyholder, vehicle insurer, insured vehicle
and I am trying to build a taxonomy that suggests that the first 3 phrases, for example, are under the same node (belong to the same level)
I'm currently working on a project where I take emails, strip out the message bodies using the email package, and then want to categorize them using labels like sports, politics, technology, etc. I've successfully stripped the message bodies out of my emails and am now looking to start classifying.
To make multiple labels like sports, technology, politics and entertainment, I need a set of words for each one to do the labelling. For example, the Sports label would have label data like: Football, Soccer, Hockey, ...
Where can I find label data online to help me?
You can use DMOZ.
Be aware that there are different kinds of text. For example, one of the most common words in email text will be "Hi" or "Hello", but in wiki text "Hi" and "Hello" will not be common words.
What you're trying to do is called topic modeling:
https://en.wikipedia.org/wiki/Topic_model
The list of topics is very dependent on your training dataset and the ultimate purpose for which you're building this.
A good place to start can be here:
https://nlp.stanford.edu/software/tmt/tmt-0.4/
You can look at their topics, and you can probably also use it to assign some initial topics to your data and then build on top of them.
You can use the BBC dataset.
It has labeled news articles which can help.
For feature extraction, remove stopwords, do stemming, use n-grams with tf-idf, and then choose the best features.
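A rough sketch of that feature-extraction recipe with scikit-learn and NLTK might look like the following; the sample emails, the tokenizer and the number of selected features are made-up placeholders:
# pip install scikit-learn nltk
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.feature_selection import SelectKBest, chi2

stemmer = PorterStemmer()

def tokenize(text):
    # lowercase, keep alphabetic tokens, drop stopwords, then stem
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

emails = ["What a great goal by the striker in the second half",
          "A new budget vote is scheduled in parliament",
          "The new phone ships with a faster processor"]
labels = ["sports", "politics", "technology"]

# uni- and bi-grams weighted by tf-idf, built on the stemmed, stopword-free tokens
vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None, ngram_range=(1, 2))
X = vectorizer.fit_transform(emails)

# keep the features most associated with the labels (chi-squared test)
X_best = SelectKBest(chi2, k=10).fit_transform(X, labels)
print(X_best.shape)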
I need to categorize a text or word to a particular category. For example, the text 'Pink Floyd' should be categorized as 'music' or 'Wikimedia' as 'technology' or 'Einstein' as 'science'.
How can this be done? Is there a way I can use the DBpedia for the same? If not, the database has to be trained from time to time, right?
This is a text classification problem. Manning, Raghavan and Schütze's Information Retrieval book chapter is a nice introduction. I think you do not need DBpedia or NER for this, just a small labeled training data set with enough labeled examples for all of your classes.
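As a minimal sketch of that idea, a standard scikit-learn pipeline is enough once you have labelled examples; the tiny training set below is invented purely for illustration:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "Pink Floyd released a new album and toured with their band",
    "The orchestra performed the symphony at the concert hall",
    "Einstein developed the theory of relativity in physics",
    "Researchers published an experiment on quantum mechanics",
    "Wikimedia runs large web servers and open source software",
    "The startup launched a new cloud computing platform",
]
train_labels = ["music", "music", "science", "science", "technology", "technology"]

# tf-idf features plus a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Pink Floyd", "Einstein", "Wikimedia"]))
# with enough (real) training data this should print something like
# ['music' 'science' 'technology']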
Yes, DBpedia may be a good choice for this kind of problem. You'll have to:
squash the DBpedia category structure so you get the right granularity (e.g., Pink Floyd is listed under Capitol Records artists and a host of other categories, but not directly under Music); maybe pick a few large categories and check whether your concepts are listed indirectly in them;
normalize the text; Einstein is listed as Albert Einstein, not einstein;
deal with ambiguity due to terms describing multiple concepts and concepts belonging to multiple top-level categories.
These problems may be solvable using machine learning, but I only see how it can be done if you extract these terms, along with relevant features, from running text. But in that case, you might just as well classify the entire text into one of the categories you choose in step 1.
This is the well-studied named entity recognition problem. Unless you have a particular need to roll your own technology (hint: it's a hard problem in general), using Gate, or perhaps one of the online services that builds on it (e.g. TSO's Data Enrichment Service), would be a good option. An alternative online service is OpenCalais.
Map your categories to DBpedia.
Index the selected DBpedia categories with Lucene and label the data with your category names.
Search for your data; tokenization and normalization will be done by Lucene.
This approach is somewhat related to k-NN classification.
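If you want to prototype that Lucene-style approach in Python, Whoosh is a pure-Python stand-in; here is a rough sketch, where the indexed category texts are invented placeholders for real DBpedia category content:
# pip install whoosh
import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser, OrGroup

schema = Schema(category=ID(stored=True), content=TEXT)
os.makedirs("dbpedia_index", exist_ok=True)
ix = index.create_in("dbpedia_index", schema)

writer = ix.writer()
writer.add_document(category="music", content="Pink Floyd English rock band Capitol Records artists albums")
writer.add_document(category="science", content="Albert Einstein physicist theory of relativity Nobel laureate")
writer.add_document(category="technology", content="Wikimedia Foundation wiki software MediaWiki internet")
writer.commit()

def classify(text, topn=3):
    # search the index; the best-matching categories act like nearest neighbours
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema, group=OrGroup).parse(text)
        return [(hit["category"], hit.score) for hit in searcher.search(query, limit=topn)]

print(classify("Pink Floyd"))  # e.g. [('music', ...)]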
Yes, DBpedia is a good choice for text classification, as you can use its predicates/relations to query and extract meaningful information for a particular category.
You can look into the endpoint for querying DBpedia:
http://dbpedia.org/sparql
Further, learn the basic syntax of SPARQL to query on the endpoint from the following link:
http://www.w3.org/TR/rdf-sparql-query/
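For example, the category memberships of a resource can be pulled from that endpoint with a short SPARQL query; here is a hedged sketch using the requests library (the exact categories returned depend on the current DBpedia data):
# pip install requests
import requests

query = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?category WHERE { dbr:Pink_Floyd dct:subject ?category } LIMIT 20
"""

response = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=30,
)
for binding in response.json()["results"]["bindings"]:
    print(binding["category"]["value"])
# e.g. http://dbpedia.org/resource/Category:Capitol_Records_artists, ...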
How do news outlets like Google News automatically classify and rank documents about emerging topics, like "Obama's 2011 budget"?
I've got a pile of articles tagged with baseball data like player names and relevance to the article (thanks, OpenCalais), and would love to create a Google News-style interface that ranks and displays new posts as they come in, especially for emerging topics. I suppose that a naive Bayes classifier could be trained with some static categories, but this doesn't really allow for tracking trends like "this player was just traded to this team, these other players were also involved."
No doubt, Google News may use other tricks (or even a combination thereof), but one relatively cheap trick, computationally, to infer topics from free-text would exploit the NLP notion that a word gets its meaning only when connected to other words.
An algorithm capable of discovering new topic categories from multiple documents could be outlined as follows (a rough code sketch is given after the outline):
POS (part-of-speech) tag the text
We probably want to focus more on nouns and maybe even more so on named entities (such as Obama or New England)
Normalize the text
In particular, replace inflected words by their common stem. Maybe even replace some adjectives by a corresponding Named Entity (e.g., Parisian ==> Paris, legal ==> law).
Also, remove noise words and noise expressions.
Identify some words from a list of manually maintained "current / recurring hot words" (Superbowl, Elections, scandal...)
This can be used in subsequent steps to provide more weight to some N-grams
Enumerate all N-grams found in each document (where N is 1 to, say, 4 or 5)
Be sure to count, separately, the number of occurrences of each N-gram within a given document and the number of documents which cite a given N-gram
The most frequently cited N-grams (i.e. the ones cited in the most documents) are probably the Topics.
Identify the existing topics (from a list of known topics)
[optionally] Manually review the new topics
This general recipe can also be altered to leverage other attributes of the documents and the text therein. For example, the document origin (say cnn/sports vs. cnn/politics ...) can be used to select domain-specific lexicons. As another example, the process can more or less heavily emphasize the words/expressions from the document title (or other areas of the text with particular mark-up).
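Here is a rough Python sketch of the N-gram counting part of that recipe using NLTK; the example documents are placeholders, and real use would also need the hot-word weighting, known-topic matching and manual review steps:
# pip install nltk
import collections
import nltk
from nltk.util import ngrams

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("stopwords", quiet=True)

STOPWORDS = set(nltk.corpus.stopwords.words("english"))

def noun_ngrams(doc, max_n=3):
    # POS-tag the document, keep nouns/proper nouns, and enumerate n-grams over them
    # (n-grams are formed over the filtered noun sequence, so they are only approximate phrases)
    tagged = nltk.pos_tag(nltk.word_tokenize(doc))
    nouns = [w.lower() for w, tag in tagged
             if tag.startswith("NN") and w.lower() not in STOPWORDS]
    grams = set()
    for n in range(1, max_n + 1):
        grams.update(" ".join(g) for g in ngrams(nouns, n))
    return grams

documents = [
    "Obama presented the budget to Congress on Monday.",
    "The budget proposal drew criticism from Congress.",
    "The Superbowl halftime show broke audience records.",
]

# document frequency: in how many documents each noun n-gram occurs
doc_freq = collections.Counter()
for doc in documents:
    doc_freq.update(noun_ngrams(doc))

# the most widely cited n-grams are candidate topics
print(doc_freq.most_common(5))
# e.g. [('budget', 2), ('congress', 2), ...]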
The main algorithms behind Google News have been published in the academic literature by Google researchers:
Original paper.
Talk: Google News Personalization: Scalable Online Collaborative Filtering
Blog discussion.