Creating probability matrix from a DocumentTermMatrix - text

I'm an economist, and I'm now analysing some qualitative and text data, which is new for me.
I want to build a Markov model for text prediction based on my interview corpus. I have analysed the corpus with the tm package and, after creating a DocumentTermMatrix and the (equivalent) TermDocumentMatrix with bigrams (pairs of words), I want to compute the probability matrix for each pair of words in order to use it for further Markov chain prediction. So I have tried this piece from http://www.salemmarafi.com/code/twitter-naive-bayes/:
probabilityMatrix <- function(docMatrix) {
  # Sum up the term frequencies
  termSums <- cbind(colnames(as.matrix(docMatrix)), as.numeric(colSums(as.matrix(docMatrix))))
  # Add one
  termSums <- cbind(termSums, as.numeric(termSums[, 2]) + 1)
  # Calculate the probabilities
  termSums <- cbind(termSums, as.numeric(termSums[, 3]) / sum(as.numeric(termSums[, 3])))
  # Calculate the natural log of the probabilities
  termSums <- cbind(termSums, log(as.numeric(termSums[, 4])))
  # Add pretty names to the columns
  colnames(termSums) <- c("term", "count", "additive", "probability", "lnProbability")
  termSums
}
But I'm sure this is not the correct approach to my problem, because this code computes the frequency of each pair but does not consider the transition probability from one word to the next. I have also seen implementations of text prediction algorithms in Python and in Java (see GitHub), but I'm not able to translate them to R. Does anyone have a piece of code that performs this kind of analysis in R, or know of a package that does it directly?
Thanks in advance
Jose
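For what it's worth, the missing step is to normalise each bigram count by the total count of its first word, which gives the conditional transition probability P(next word | current word). Below is a minimal sketch of that idea (in Python rather than R, with a hypothetical bigram_counts dictionary standing in for the counts read off the DocumentTermMatrix):

from collections import defaultdict

# Hypothetical bigram counts, e.g. harvested from the bigram DocumentTermMatrix
bigram_counts = {("the", "economy"): 12, ("the", "market"): 8, ("market", "fell"): 5}

# Total count of each first word, summed over all of its continuations
first_word_totals = defaultdict(int)
for (w1, _), n in bigram_counts.items():
    first_word_totals[w1] += n

# Maximum-likelihood transition probabilities P(w2 | w1) = count(w1, w2) / count(w1, *)
transition = {(w1, w2): n / first_word_totals[w1] for (w1, w2), n in bigram_counts.items()}
print(transition[("the", "economy")])  # 12 / 20 = 0.6

Each "row" of the resulting matrix (one row per current word) sums to 1, which is exactly the transition matrix a Markov chain text predictor needs.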

Related

Cluster similar words using word2vec

I have various restaurant labels, and I also have some words that are unrelated to restaurants, like the list below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick out the labels related to food choices and leave out words like "Oil and Lube" and "transportation".
I tried using word2vec, but some labels have more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually, but I want to know whether there is a way, using NLP or Word2Vec, to cluster all related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like, say, the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often covers other forms of close relation, including oppositeness and other ways words can be interchangeable or used in similar contexts. So whether the word-vector similarity values provide a good threshold cutoff for your particular "related to food" test is something you'd have to try out and tune. (For example, whether words that are drop-in replacements for each other are closest to each other, or whether words common to the same topics are closest to each other, can be influenced by whether the window parameter is smaller or larger. So you could find that tuning the Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available (where do these labels come from? what format are they in? how much do you have?) and on your ultimate goals (why is it important to distinguish between restaurant and non-restaurant labels?).
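As an illustration of the multi-word-token point, here is a hedged sketch using gensim's Phrases; the sentences, min_count, and threshold values are made-up placeholders you would tune on a real corpus:

from gensim.models.phrases import Phrases, Phraser

# Hypothetical tokenized domain sentences (one list of tokens per sentence)
sentences = [
    ["cheap", "oil", "and", "lube", "near", "me"],
    ["best", "oil", "and", "lube", "shop", "in", "town"],
    ["vegan", "pizza", "and", "coffee", "downtown"],
]

# Two passes join frequent collocations, e.g. "oil and lube" -> "oil_and_lube"
bigrams = Phraser(Phrases(sentences, min_count=1, threshold=1))
trigrams = Phraser(Phrases(bigrams[sentences], min_count=1, threshold=1))
print(trigrams[bigrams[["cheap", "oil", "and", "lube"]]])

The joined tokens can then be fed into Word2Vec training so that a label like oil_and_lube gets its own vector.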
OK, thank you for the details.
In order to train word2vec you should take into account the following points:
You need a huge and varied text dataset. Review your training set and make sure it contains the kind of data you need in order to obtain what you want.
Set one sentence/phrase per line.
For preprocessing, you need to delete punctuation and set all strings to lower case.
Do NOT lemmatize or stem, because the text will become less complex!
Try different settings:
5.1 Algorithm: I used word2vec and I can say the continuous bag-of-words (CBOW) architecture provided better results, on different training sets, than skip-gram.
5.2 Number of layers: 200 layers provide good results.
5.3 Vector size: Vector length = 300 is OK.
Now run the training algorithm. Then, use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. their vectors) with cosine similarity. From my experience, cosine similarity gives satisfactory results: the similarity between two words is a value between -1 and 1. Synonyms have high cosine values; you must find the cutoff between words that are synonyms and those that are not.
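A minimal gensim sketch of those settings (assuming gensim 4.x; the tiny corpus is only a placeholder for your full preprocessed dataset):

from gensim.models import Word2Vec

# Hypothetical preprocessed corpus: lower-cased, punctuation removed, one sentence per item
sentences = [["the", "fox", "jumps", "over", "another", "fox"],
             ["the", "hunter", "saw", "a", "fox"]]

# sg=0 selects the CBOW architecture; sg=1 would select skip-gram
model = Word2Vec(sentences, vector_size=300, sg=0, window=5, min_count=1, epochs=20)

# Cosine similarity between two word vectors; near-synonyms score close to 1
print(model.wv.similarity("fox", "hunter"))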

Efficient way to get best matching pairs given a similarity-outputting neural network?

I am trying to come up with a neural network that scores pairs of short texts (for example, a Stack Exchange title and body). Following the Deep Learning Cookbook's example, the network basically works like this:
We have our two inputs (title and body), embed them, and then calculate the cosine similarity between the embeddings. The inputs of the model are [title, body]; the output is [sim].
Now I'd like the closest matching body for a given title. I am wondering if there's a more efficient way of doing this that doesn't involve iterating over every possible pair of (title,body) and calculating the corresponding similarity? Because for very large datasets this is just not feasible.
Any help is much appreciated!
It is indeed not very efficient to iterate over every possible data pair. Instead you could use your model to extract all the embeddings of your titles and text bodies and save them in a database (or simply a .npy file). So, you don't use your model to output a similarity score but instead use your model to output an embedding (from your embedding layer).
At inference time you can then use a library for efficient similarity search such as faiss. Given a title you would simply look up its embedding and search in the whole embedding space of all body embeddings to see which ones get the highest score. I have used this approach myself and been able to search 1M vectors in just 100 ms.
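A rough sketch of that setup with faiss; the dimension, array contents, and k are placeholders, and in practice the embeddings would come from your model's embedding layer rather than from np.random:

import numpy as np
import faiss

d = 300                                                         # embedding dimension (placeholder)
body_embeddings = np.random.rand(100000, d).astype("float32")   # stand-in for the saved body embeddings

faiss.normalize_L2(body_embeddings)          # normalise so inner product equals cosine similarity
index = faiss.IndexFlatIP(d)                 # exact inner-product index
index.add(body_embeddings)

title_embedding = np.random.rand(1, d).astype("float32")        # stand-in for one title embedding
faiss.normalize_L2(title_embedding)
scores, ids = index.search(title_embedding, 5)                  # the 5 closest bodies for this title

For millions of vectors you would typically switch from IndexFlatIP to one of faiss's approximate indexes (IVF or HNSW variants) to keep query times low.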

What's a good measure for classifying text documents?

I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency is a measure that counts how many times a given word appeared across all documents, and document-frequency is a measure that counts how many documents a given word appeared in.
Here's an example with two text articles:
Article I) "A fox jumps over another fox."
Article II) "A hunter saw a fox."
Article I gets split into words (after stemming and dropping stopwords):
["fox", "jump", "another", "fox"].
Article II gets split into words:
["hunter", "see", "fox"].
These two articles produce the following word-frequency and document-frequency counters:
fox (word-frequency: 3, document-frequency: 2)
jump (word-frequency: 1, document-frequency: 1)
another (word-frequency: 1, document-frequency: 1)
hunter (word-frequency: 1, document-frequency: 1)
see (word-frequency: 1, document-frequency: 1)
Given a new text article, how do I measure how similar this article is to previous articles?
I've read about df-idf measure but it doesn't apply here as I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.
For example, if I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to the ones previously seen?
As another example, if I have a new text article that says "deer are funny", then this is a totally new article and the similarity should be 0.
I imagine I somehow need to sum word-frequency and document-frequency counter values but what's a good formula to use?
A standard solution is to apply the Naive Bayes classifier which estimates the posterior probability of a class C given a document D, denoted as P(C=k|D) (for a binary classification problem, k=0 and 1).
This is estimated by computing the priors from a training set of class labeled documents, where given a document D we know its class C.
P(C|D) ∝ P(D|C) * P(C) (1)
Naive Bayes assumes that terms are independent, in which case you can write P(D|C) as
P(D|C) = \prod_{t \in D} P(t|C) (2)
P(t|C) can simply be computed by counting how many times a term occurs in a given class; e.g. you expect that the word football will occur a large number of times in documents belonging to the class (category) sports.
When it comes to the other factor, the class prior P(C), you can estimate it by counting how many labelled documents come from each class; maybe you have more sports articles than finance ones, which makes you believe there is a higher likelihood of an unseen document being classified into the sports category.
It is very easy to incorporate factors, such as term importance (idf), or term dependence into Equation (1). For idf, you add it as a term sampling event from the collection (irrespective of the class).
For term dependence, you have to plug in probabilities of the form P(u|C)*P(u|t), which means that you sample a different term u and change (transform) it to t.
Standard implementations of the Naive Bayes classifier can be found in the Stanford NLP package, Weka, and scikit-learn, among many others.
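As a quick, hedged illustration of Equations (1) and (2) in practice, a scikit-learn sketch with invented toy documents and labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set of class-labelled documents
docs = ["the team won the football match",
        "the striker scored a late goal",
        "stocks fell as markets closed lower",
        "the central bank raised interest rates"]
labels = ["sports", "sports", "finance", "finance"]

# CountVectorizer supplies the term counts behind P(t|C); MultinomialNB estimates priors and likelihoods
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["the goalkeeper saved the match"]))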
It seems that you are trying to answer several related questions:
How to measure similarity between documents A and B? (Metric learning)
How to measure how unusual document C is, compared to some collection of documents? (Anomaly detection)
How to split a collection of documents into groups of similar ones? (Clustering)
How to predict to which class a document belongs? (Classification)
All of these problems are normally solved in 2 steps:
Extract the features: Document --> Representation (usually a numeric vector)
Apply the model: Representation --> Result (usually a single number)
There are lots of options for both feature engineering and modeling. Here are just a few.
Feature extraction
Bag of words: Document --> number of occurrences of each individual word (that is, term frequencies). This is the basic option, but not the only one.
Bag of n-grams (word-level or character-level): co-occurrence of several tokens is taken into account.
Bag of words + grammatical features (e.g. POS tags)
Bag of word embeddings (learned by an external model, e.g. word2vec). You can use the embeddings as a sequence or take their weighted average.
Whatever you can invent (e.g. rules based on dictionary lookup)...
Features may be preprocessed in order to decrease relative amount of noise in them. Some options for preprocessing are:
dividing by IDF, if you don't have a hard list of stop words or believe that words might be more or less "stoppy"
normalizing each column (e.g. word count) to have zero mean and unit variance
taking logs of word counts to reduce noise
normalizing each row to have L2 norm equal to 1
You cannot know in advance which option(s) will be best for your specific application; you have to experiment.
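For instance, two of those preprocessing steps (taking logs of counts, then L2-normalising each row) look like this on a toy count matrix:

import numpy as np
from sklearn.preprocessing import normalize

counts = np.array([[2.0, 0.0, 1.0],
                   [0.0, 5.0, 1.0]])      # toy document-term count matrix (one row per document)

logged = np.log1p(counts)                 # take logs of word counts to reduce noise
rows_l2 = normalize(logged, norm="l2")    # each row now has L2 norm equal to 1
print(rows_l2)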
Now you can build the ML model. Each of 4 problems has its own good solutions.
For classification, the best studied problem, you can use multiple kinds of models, including Naive Bayes, k-nearest-neighbors, logistic regression, SVM, decision trees and neural networks. Again, you cannot know in advance which would perform best.
Most of these models can use almost any kind of features. However, KNN and kernel-based SVM require your features to have a special structure: representations of documents of one class should be close to each other in terms of the Euclidean distance metric. This can sometimes be achieved by simple linear and/or logarithmic normalization (see above). More difficult cases require non-linear transformations, which in principle may be learned by neural networks. Learning these transformations is what people call metric learning, and in general it is a problem which is not yet solved.
The most conventional distance metric is indeed Euclidean. However, other distance metrics are possible (e.g. Manhattan distance), as are different approaches not based on vector representations of texts. For example, you can calculate the Levenshtein distance between texts, based on the number of edit operations needed to transform one text into another. Or you can calculate the "word mover's distance": the sum of distances between word pairs with the closest embeddings.
For clustering, the basic options are k-means and DBSCAN. Both of these models require your feature space to have this Euclidean property.
For anomaly detection you can use density estimates, which are produced by various probabilistic algorithms: classification (e.g. naive Bayes or neural networks), clustering (e.g. mixtures of Gaussians), or other unsupervised methods (e.g. probabilistic PCA). For texts, you can exploit the sequential structure of language by estimating the probability of each word conditional on the previous words (using n-grams or convolutional/recurrent neural nets); these are called language models, and they are usually more effective than the bag-of-words assumption of Naive Bayes, which ignores word order. Several language models (one per class) may be combined into one classifier.
Whatever problem you solve, it is strongly recommended to have a good test set with the known "ground truth": which documents are close to each other, or belong to the same class, or are (un)usual. With this set, you can evaluate different approaches to feature engineering and modelling, and choose the best one.
If you don't have the resources or willingness to do multiple experiments, I would recommend choosing one of the following approaches to evaluate similarity between texts:
word counts + idf normalization + L2 normalization (equivalent to the solution of #mcoav) + Euclidean distance
mean word2vec embedding over all words in text (the embedding dictionary may be googled up and downloaded) + Euclidean distance
Based on one of these representations, you can build models for the other problems - e.g. KNN for classifications or k-means for clustering.
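For example, the first recommended representation with KNN and k-means on top might look roughly like this (the three toy documents are placeholders for your collection):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

docs = ["a fox jumps over another fox",
        "a hunter saw a fox",
        "deer are funny"]

# TfidfVectorizer = word counts * idf, with L2 row normalisation applied by default
X = TfidfVectorizer().fit_transform(docs)

# KNN answers "which known documents are closest to this one?"
knn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X)
distances, indices = knn.kneighbors(X[0])

# k-means groups the collection into clusters of similar documents
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(indices, km.labels_)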
I would suggest tf-idf and cosine similarity.
You can still use tf-idf if you drop stop-words. It is even probable that including stop-words or not would not make much of a difference: the inverse document frequency measure automatically downweights stop-words, since they are very frequent and appear in most documents.
If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.
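A short scikit-learn sketch of both cases on the question's own articles (stemming is left out here, so only literally shared terms overlap):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known = ["a fox jumps over another fox", "a hunter saw a fox"]
vec = TfidfVectorizer(stop_words="english").fit(known)
K = vec.transform(known)

# Shares "hunter" and "fox" with the known articles, so similarity > 0
print(cosine_similarity(vec.transform(["the hunter saw another fox"]), K))

# No shared terms at all, so similarity is 0 with every known document
print(cosine_similarity(vec.transform(["deer are funny"]), K))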
When I search on df-idf I find nothing.
tf-idf with cosine similarity is a widely accepted and common practice.
Filtering out stop words does not break it. For common words idf gives them low weight anyway.
tf-idf is used by Lucene.
Don't get why you want to reinvent the wheel here.
Don't get why you think the sum of df idf is a similarity measure.
For classification: do you have some predefined classes and sample documents to learn from? If so, you can use Naive Bayes, with tf-idf.
If you don't have predefined classes, you can use k-means clustering, with tf-idf.
It depends a lot on your knowledge of the corpus and on your classification objective. With something like litigation-support documents produced to you, you may have no prior knowledge of the content. In the Enron case they used the names of raptors for a lot of the bad stuff, and there is no way you would have known that up front. k-means lets the documents find their own clusters.
Stemming does not always yield better classification. If you later want to highlight the hits, stemming makes that very complex, since the stem will not be the same length as the original word.
Have you evaluated sent2vec or doc2vec approaches? You can play around with the vectors to see how close the sentences are. Just an idea. Not a verified solution to your question.
While in English a single word may be enough, that isn't the case in some other, more complex languages.
A word has many meanings and many different use cases. Two texts can talk about the same things while using few or no matching words.
You need to find the most important words in a text. Then you need to catch their possible synonyms.
For that, the following API can help. It is also doable to create something similar with some dictionaries.
synonyms("complex");

function synonyms(me) {
  const url = 'https://api.datamuse.com/words?ml=' + me;
  fetch(url)
    .then(v => v.json())
    .then(words => {
      for (const w of words) {
        document.body.innerHTML += "<span>" + w.word + "</span> ";
      }
    });
}
From there, comparing the resulting arrays will give much better accuracy and far fewer false positives.
A sufficient solution, in a possibly similar task:
Use a binary bag-of-words (BOW) approach for the vector representation (frequent words are not weighted higher than rare words), rather than a real TF approach.
The word2vec embedding approach is sensitive to sequence and distance effects. Depending on your hyper-parameters, it might make a difference between 'a hunter saw a fox' and 'a fox saw a jumping hunter', so you have to decide whether this adds noise to your task or, alternatively, use it only as an averaged vector over all of your text.
Extract words with high within-sentence correlation (e.g. by using mean-normalized cosine similarities between variables).
Second step: use this list of highly correlated words as a positive list, i.e. as the new vocabulary for a new binary vectorizer (sketched below).
This isolates meaningful words for the second-step cosine comparisons; in my case it worked even for rather small amounts of training text.
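A minimal sketch of that second step; the vocabulary list is a made-up stand-in for the high-correlation words extracted in the first step:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical positive list of high-correlation words from the first step
keep_vocab = ["fox", "hunter", "saw"]

# binary=True gives 0/1 presence indicators, so frequent words are not weighted higher than rare ones
binary_vec = CountVectorizer(binary=True, vocabulary=keep_vocab)
X = binary_vec.transform(["a fox jumps over another fox",
                          "a hunter saw a fox"])
print(X.toarray())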

Python sensitivity analysis from measured data with SALib toolbox

I would like to understand how to use the SALib Python toolbox to perform a Sobol sensitivity analysis (to study the influence of individual parameters and of parameter interactions).
From the original example I'm supposed to proceed this way:
from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.test_functions import Ishigami
import numpy as np
problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[-np.pi, np.pi]] * 3
}
# Generate samples
param_values = saltelli.sample(problem, 1000)
# Run model (example)
Y = Ishigami.evaluate(param_values)
# Perform analysis
Si = sobol.analyze(problem, Y, print_to_console=True)
# Returns a dictionary with keys 'S1', 'S1_conf', 'ST', and 'ST_conf'
# (first- and total-order indices with bootstrap confidence intervals)
Because in my case I'm getting data from experiments, I don't have the model that links the Xi and Yi; I just have an input matrix and an output matrix.
If we assume that my input data were generated from a Latin hypercube (a good statistical spread of samples), how can I use SALib to evaluate the sensitivity of my parameters? From what I see in the code:
Si = sobol.analyze(problem, Y, print_to_console=True)
We only use the input parameters' bounds and the output. But with this approach, how is it possible to know which parameter varies between two sets?
thanks for your help!
There is no direct way to compute the Sobol indices using SALib based on your description of the data. SALib computes the first- and total-order indices by generating two sample matrices (A and B) and then using additional samples obtained by cross-sampling a column from matrix B into matrix A. When the code evaluates the indices, it expects the model output to be in this specific order. This way of computing the indices is based on the method published by Saltelli et al. (2010). Because this is not a Latin hypercube sampling scheme, the experimental data will most likely not work.
One possible way to still complete a sensitivity analysis is to build a surrogate (meta-)model from your experimental data. In this case you use the experimental data to fit an approximation of your true model, and this approximation can then be analyzed by SALib or another sensitivity package. The surrogate model is typically a polynomial or based on kriging. Iooss et al. (2006) describe some methods. Software for this approach includes UQLab (http://www.uqlab.com/, MATLAB-based) and BASS (https://cran.r-project.org/web/packages/BASS/index.html, an R package), among others, depending on the specific type of model and fitting technique you want to use.
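A hedged sketch of that surrogate route, using scikit-learn's GaussianProcessRegressor as the meta-model and the same SALib calls as above; the file names, the three-variable problem, and the sample size are placeholders for your actual experimental setup:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from SALib.sample import saltelli
from SALib.analyze import sobol

# Hypothetical measured data: inputs of shape (n_runs, 3) and outputs of shape (n_runs,)
X_exp = np.loadtxt("inputs.csv", delimiter=",")
Y_exp = np.loadtxt("outputs.csv", delimiter=",")

# Fit a surrogate that approximates the unknown experimental relationship
surrogate = GaussianProcessRegressor().fit(X_exp, Y_exp)

problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[X_exp[:, i].min(), X_exp[:, i].max()] for i in range(3)],
}

# Saltelli design evaluated on the surrogate instead of the real experiment
param_values = saltelli.sample(problem, 1024)
Y_hat = surrogate.predict(param_values)
Si = sobol.analyze(problem, Y_hat, print_to_console=True)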
Another possibility is to find an estimator that is not based on the Saltelli et al (2010) method. I am not sure if such an estimator exists, but it would probably be better to post that question in the Math or Probability and Statistics Stack Exchanges.
References:
Iooss, B., F. Van Dorpe, and N. Devictor (2006). "Response surfaces and sensitivity analyses for an environmental model of dose calculations". Reliability Engineering and System Safety 91:1241-1251.
Saltelli, A., P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and S. Tarantola (2010). "Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index". Computer Physics Communications 181:259-270.

SVM: Adding Clinical Features To Feature Vector Extracted From Image

I'm using SVM to classify clinical images of patients belonging to two different groups (patients vs. controls). I use PCA to extract a vector of features from each image, but I'd like to add other clinical information (for example, the output value of a clinical exam) in order to include it in the classification process.
Is there a way to do this?
I didn't find exhaustive suggestions in the literature.
Thanks in advance.
You could just append the new information at the end of each sample's feature vector. Another approach you could try is to add two more classifiers: one trained on the additional clinical information, and a third classifier that takes the outputs of the other two as input to produce the final prediction.
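A small sketch of the first option (appending the clinical values to the PCA features); the arrays here are random placeholders for your images, exam values and labels:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: flattened images, one clinical exam value per subject, and group labels
images = np.random.rand(40, 4096)
clinical = np.random.rand(40, 1)
y = np.random.randint(0, 2, size=40)          # 0 = control, 1 = patient

pca_features = PCA(n_components=10).fit_transform(images)
X = np.hstack([pca_features, clinical])       # append the clinical column(s) to each PCA vector
X = StandardScaler().fit_transform(X)         # keep all features on a comparable scale
clf = SVC(kernel="rbf").fit(X, y)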
The question is pretty old, but I'll post my answer anyway.
If you have to scale your values, make sure the new values are scaled to a range similar to that of the values in your PCA vector.
If your PCA feature vectors have a constant length, you can simply continue numbering your features from length+1, e.g. for SVM input (libsvm):
1 1:<PCAval1> ... N:<PCAvalN> N+1:<Clinical exam value 1> ...
I ran a test adding such extra features for cell recognition, and the accuracy rose.
This guide describes how to use enumerated features.
P.S.:
In my test I isolated cells from a microscope image and squeezed each into a 16x16 matrix. Each pixel in this matrix was a feature, giving 256 features. Additionally I added some features such as the original size, moments, etc.

Resources