How can I use Weka for terminology extraction?

I need to extract domain-specific terms (for example, political terms) from a large training corpus. How can I use Weka and its filters to achieve this?
Can I use the feature vector produced by the StringToWordVector filter in Weka to do this or not?

You can, at least partly, as long as you have an appropriate dataset. For instance, let us assume you have a dataset like this one:
@relation test
@attribute text string
@attribute politics {yes,no}
@attribute religion {yes,no}
@data
"this is a text about politics",yes,no
"this text is about religion",no,yes
"this text mixes everything",yes,yes
For instance, for getting terms about politics, you can:
Remove the religion attribute.
Apply the StringToWordVector filter to the text attribute to get terms.
Apply the AttributeSelection filter with Ranker and InfoGainAttributeEval to get the top ranked terms.
This last step will give you a list of the terms that are most predictive of the politics category. Most of them will be terms from the politics domain, although some terms may be predictive precisely because they are not from the politics domain, that is, they provide negative evidence.
The quality of the terms you get depends on the dataset. The more topics it covers, the better your results; so instead of having two classes (politics and religion, as in my toy dataset), it is much better to have plenty of them, with many examples for each category.
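If you want to sketch the same idea outside Weka, the pipeline can be approximated in Python with scikit-learn (a swapped-in equivalent, not Weka itself): CountVectorizer plays the role of StringToWordVector, and mutual information is used as a stand-in for InfoGain ranking. The toy texts below mirror the ARFF example above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

texts = ["this is a text about politics",
         "this text is about religion",
         "this text mixes everything"]
politics = [1, 0, 1]  # the 'politics' class from the toy dataset above

vec = CountVectorizer()
X = vec.fit_transform(texts)

# rank terms by mutual information with the politics label (an InfoGain analogue)
scores = mutual_info_classif(X, politics, discrete_features=True, random_state=0)
ranked = sorted(zip(vec.get_feature_names_out(), scores), key=lambda t: -t[1])
print(ranked[:5])  # top candidate terms for the politics domain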

Related

Cluster similar words using word2vec

I have various restaurant labels, and I also have some words that are unrelated to restaurants, like below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick out the labels related to food choices and leave out words like Oil and Lube or transportation.
I tried using word2vec, but some of the labels contain more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually, but I would like to know whether there is a way to use NLP or word2vec to cluster all the related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like, say, the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) that are used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often other forms of close relation, including oppositeness and other ways words can be interchangeable or used in similar contexts. So whether the word-vector similarity values provide a good threshold cutoff for your particular "related to food" test is something you'd have to try out and tinker with. (For example, whether words that are drop-in replacements for each other are closest to each other, or words that are common in the same topics are closest to each other, can be influenced by whether the window parameter is smaller or larger. So you may find that tuning Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?
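To illustrate the multi-word-token point above, here is a hedged sketch (gensim assumed, version 4 or later; the toy corpus is invented) of learning collocations like oil_and_lube before training word vectors.

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

corpus = [
    ["stopped", "for", "an", "oil", "and", "lube", "change"],
    ["cheap", "oil", "and", "lube", "shop", "downtown"],
    ["great", "vegan", "pizza", "place"],
]

# learn frequent collocations; connector words let "oil and lube" become one token
bigrams = Phrases(corpus, min_count=1, threshold=1,
                  connector_words=ENGLISH_CONNECTOR_WORDS)
print(bigrams[corpus[0]])  # e.g. ['stopped', 'for', 'an', 'oil_and_lube', 'change']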
OK, thank you for the details.
In order to train word2vec you should take into account the following facts:
You need a huge and varied text dataset. Review your training set and make sure it contains the data you need in order to obtain what you want.
Set one sentence/phrase per line.
For preprocessing, you need to delete punctuation and set all strings to lower case.
Do NOT lemmatize or stem, because the text will become less complex!
Try different settings:
5.1 Algorithm: I used word2vec and I can say CBOW (continuous bag of words) gave better results, on different training sets, than skip-gram.
5.2 Number of layers: 200 layers give good results.
5.3 Vector size: a vector length of 300 is fine.
Now run the training algorithm. Then, use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. their vectors) with cosine similarity. From my experience, cosine gives satisfactory results: the distance between two words is a value between 0 and 1. Synonyms have high cosine values; you must find the threshold that separates words which are synonyms from those that are not.
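As a minimal sketch of this recipe, assuming gensim (version 4 or later; the library is not named in the answer) and a tokenised, lower-cased corpus with one sentence per entry:

from gensim.models import Word2Vec

sentences = [
    ["vegan", "burgers", "and", "pizza", "near", "me"],
    ["best", "coffee", "and", "vegetarian", "breakfast"],
    ["oil", "and", "lube", "service", "for", "trucks"],
]  # in practice: a huge, varied corpus, as stressed above

model = Word2Vec(
    sentences,
    vector_size=300,  # the suggested vector length of 300
    window=5,
    min_count=1,      # keep every token in this toy corpus
    sg=0,             # 0 = CBOW, 1 = skip-gram
    epochs=50,
)

# cosine similarity between two words; higher means more related
print(model.wv.similarity("pizza", "burgers"))
print(model.wv.most_similar("coffee", topn=3))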

How to use BERT in image caption tasks, such as im2txt and densecap

Preparing to use BERT: before fine-tuning downstream tasks, let's start with a few general points about NLP, to emphasize BERT's universality. In general, most NLP problems fall into the four types of tasks below:
The first type is sequence labeling, the most typical NLP task. For example, Chinese word segmentation, part-of-speech tagging, named entity recognition, semantic role labeling, etc. can be classified into this type of problem. It is characterized by the requirement that the model assign a classification category to each word in the sentence, given its context.
The second category is classification tasks; our common text classification, sentiment analysis, etc. fall into this category. It is characterized by the fact that, no matter how long the text is, a single classification category is given for it as a whole.
The third type of task is sentence-relationship judgment. For example, entailment, QA, semantic rewriting, natural language inference and other tasks all follow this pattern. Its characteristic is that, given two sentences, the model judges whether the two sentences have some semantic relationship.
The fourth category is generative tasks, such as machine translation, text summarization, poetry writing, image captioning, and so on. It is characterized by the need to autonomously generate another piece of text from the input content.
The first three task types are very common, and now I want to do the image captioning task from the fourth category. My current implementation is based on densecap: A Hierarchical Approach for Generating Descriptive Image Paragraphs; the code is here.
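As a starting point, here is a minimal, hedged sketch (not the densecap code; the caption string is hypothetical) of extracting contextual embeddings from a pretrained BERT with the HuggingFace transformers library. Such embeddings could, in principle, condition or re-score a caption decoder, but the wiring into a full image-captioning model is left open.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

caption = "a group of people standing around a food truck"  # hypothetical caption
inputs = tokenizer(caption, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state       # shape (1, seq_len, 768)
sentence_embedding = token_embeddings.mean(dim=1)  # crude pooled representation
print(sentence_embedding.shape)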

Multiclass text classification with python and nltk

I am given the task of classifying news text into one of the following 5 categories: Business, Sports, Entertainment, Tech and Politics.
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (BBC news data).
I am currently using NLP with the nltk module to calculate the frequency distribution of every word in the training data with respect to each category (excluding stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Here's the actual code.
This algorithm does predict new data accurately, but I am interested in other simple algorithms that I could implement to achieve better results. I have used the Naive Bayes algorithm to classify data into two classes (spam or not spam, etc.) and would like to know how to implement it for multiclass classification, if that is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent and require knowledge about the data, but good-quality features lead to better systems more quickly than tuning or selecting algorithms and parameters.
In your case you can either go with word embeddings, as already said, or you can design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is typically presented? A lot of mistakes, syntactic inversions, bad translations, odd punctuation, slang words... a lot of possibilities! Try to think about your case with sports, business, news, etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at weighting methods other than term frequency, like tf-idf.
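As a hedged baseline sketch of the tf-idf suggestion (scikit-learn assumed; the tiny texts and labels below are placeholders, not the BBC data), multinomial Naive Bayes handles all five classes natively:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["stocks rallied after the earnings call",
         "the striker scored twice in the derby",
         "the new phone ships with a faster chip"]
labels = ["Business", "Sports", "Tech"]

# tf-idf weighting + multinomial Naive Bayes (works for any number of classes)
clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["the minister announced a new tax bill"]))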
Since you're dealing with words, I would propose word embeddings, which give more insight into the relationships/meanings of words with respect to your dataset, and thus much better classification.
If you are looking for other implementations of classification, you can check my sample code here; these models from scikit-learn can easily handle multiple classes. Take a look at the scikit-learn documentation here.
If you want an easy-to-use framework around these classifiers, you can check out rasa-nlu (my sample implementation code is here); it uses the spacy_sklearn model. All you have to do is prepare the dataset in the given format and train the model.
If you want something more powerful, you can check out my Keras implementation here; it uses a CNN for text classification.
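For reference, a hedged sketch of that CNN idea (this is not the repository code linked above; vocabulary size, sequence length and data are illustrative), using tf.keras with integer-encoded sequences:

import numpy as np
from tensorflow.keras import layers, models

vocab_size, max_len, num_classes = 20000, 200, 5  # illustrative values

model = models.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# x: integer-encoded token sequences, y: class ids 0..4 (random stand-in data)
x = np.random.randint(0, vocab_size, size=(32, max_len))
y = np.random.randint(0, num_classes, size=(32,))
model.fit(x, y, epochs=1, batch_size=8)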
Hope this helps.

Short text classification

I am about to start a project where my final goal is to classify short texts into two classes: "may be interested in visiting place X" vs. "not interested or neutral". A place is described by a set of keywords (e.g. meals or types of meals like "Chinese food"). So ideally I need some approach to model the desire of a user based on short-text analysis, and then classify based on a desire score or desire probability. Is there any state of the art in this field? Thank you.
This problem is exactly the same as sentiment analysis of texts. But, instead of the traditional binary classification, you seem to have a "neutral" opinion. State-of-the-art in sentiment analysis is highly domain-dependent. Techniques that have excelled in classifying movies do not perform as well on commercial products, for example.
Additionally, even the feature selection is highly domain-dependent. For example, unigrams work well for movie review classification, but a combination of unigrams and bigrams performs better for classifying Twitter texts.
My best advice is to "play around" with different features. Since you are looking at short texts, Twitter is probably a good motivating example. I would start with unigrams and bigrams as my features. The exact algorithm is not very important; SVMs usually perform very well with correct parameter tuning. Use a small amount of held-out data for tuning these parameters before experimenting on bigger datasets.
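A hedged sketch of that advice with scikit-learn (assumed library; the toy texts and labels are invented): unigram plus bigram features, a linear SVM, and the regularisation parameter tuned via cross-validation on a small set:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

texts = ["would love to try the dumplings there",
         "keen to visit that ramen spot",
         "not a fan of noisy bars",
         "the place was fine, nothing special"]
labels = ["interested", "interested", "neutral", "neutral"]

pipe = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("svm", LinearSVC()),
])
# cv=2 only because the toy set is tiny; use a proper held-out split in practice
search = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=2)
search.fit(texts, labels)
print(search.best_params_)
print(search.predict(["really want to check out the new taco truck"]))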
The more interesting part of this problem is the ranking! A "purity score" has recently been used for this purpose in the following papers (and I'd say they are pretty state-of-the-art):
Sentiment summarization: evaluating and learning user preferences. Lerman, Blair-Goldensohn and McDonald. EACL. 2009.
The viability of web-derived polarity lexicons. Velikovich, Blair-Goldensohn, Hannan and McDonald. NAACL. 2010.

Is there an algorithm for determining the relevance of a text to a theme?

I want to know what can be used to determine the relevance of a page to a theme like games, movies, etc.
Is there research in this area, or does it come down to counting how many times relevant words appear?
The common choice is supervised document classification on bag-of-words (or bag-of-n-grams) features, preferably with tf-idf weighting.
Popular algorithms include Naive Bayes and (linear) SVMs.
For this approach, you'll need labeled training data, i.e. documents annotated with relevant themes.
See, e.g., Introduction to Information Retrieval, chapters 13-15.
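A minimal sketch of that recipe (scikit-learn assumed; the labelled pages below are purely illustrative), using a linear classifier's predicted probability as a relevance score for the "games" theme:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = ["new open-world game announced at the expo",
         "quarterly report shows strong revenue growth",
         "speedrunners break another record this weekend",
         "central bank raises interest rates again"]
is_games = [1, 0, 1, 0]  # 1 = relevant to the "games" theme

# bag of words with tf-idf weighting + a linear classifier
# (logistic regression here so that predict_proba is available)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(pages, is_games)
print(clf.predict_proba(["the studio delayed its next game"])[0, 1])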
