How to define log-count ratio for multiclass text dataset (fastai)? - nlp

I am trying to follow Rachel Thomas's approach to sentiment classification with Naive Bayes. In the video she uses a binary dataset (positive and negative movie reviews). When it comes to applying Naive Bayes, this is what she does:
Definition: log-count ratio r for each word f:
r = log( (ratio of feature f in positive documents) / (ratio of feature f in negative documents) )
where the ratio of feature f in positive documents is the number of times a positive document has the feature divided by the number of positive documents.
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)
r = np.log(pr1/pr0)
--> it is very simple to apply the log-count-ratio to a dataset with 2 labels!
Problem:
My dataset is not binary! Let's assume I have 5 labels: label_1,...,label_5
How do I get the log-count ratio r for a multiclass dataset?
My approach:
p4 = np.squeeze(np.asarray(x[y.items==label_5].sum(0)))
p3 = np.squeeze(np.asarray(x[y.items==label_4].sum(0)))
p2 = np.squeeze(np.asarray(x[y.items==label_3].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==label_2].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==label_1].sum(0)))
log-count-ratio:
pr1 = (p1+1) / ((y.items==label_2).sum() + 1)
pr1_not = (p1+1) / ((y.items!=label_2).sum() + 1)
r_1 = np.log(pr1/pr1_not)
log-count-ratio:
pr2 = (p2+1) / ((y.items==label_3).sum() + 1)
pr2_not = (p2+1) / ((y.items!=label_3).sum() + 1)
r_2 = np.log(pr2/pr2_not)
...
Is this correct? Does it mean I get multiple ratios?

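Yes, the approach is correct. The "negative class" is basically all the classes except the one you are considering. So yes, you will get multiple ratios (one per class). One fix to the code: the numerator of pr1_not should count the feature in the documents that are not label_2, i.e. np.squeeze(np.asarray(x[y.items!=label_2].sum(0))), rather than reusing p1 (and likewise for the other labels).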
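The one-vs-rest idea can be sketched as follows. This is a minimal illustration, not the course's code: it assumes x is a dense NumPy term-count matrix and y a plain label array (the question uses a sparse matrix and y.items), and the function name log_count_ratios and the toy data are made up for the example.

```python
import numpy as np

def log_count_ratios(x, y, labels):
    """Return one log-count ratio vector per label (one-vs-rest)."""
    ratios = {}
    for label in labels:
        in_class = (y == label)
        # feature counts inside the class and in all remaining documents
        p_in = x[in_class].sum(axis=0)
        p_out = x[~in_class].sum(axis=0)
        # add-one smoothing, mirroring the binary version
        pr_in = (p_in + 1) / (in_class.sum() + 1)
        pr_out = (p_out + 1) / ((~in_class).sum() + 1)
        ratios[label] = np.log(pr_in / pr_out)
    return ratios

# toy data (hypothetical): 6 documents, 3 features, 3 labels
x = np.array([[2, 0, 0], [1, 0, 0], [0, 3, 0],
              [0, 1, 0], [0, 0, 1], [0, 0, 2]])
y = np.array(["a", "a", "b", "b", "c", "c"])
r = log_count_ratios(x, y, ["a", "b", "c"])
```

Each entry of r is a vector with one value per feature; a positive value means the feature is relatively more frequent in that class than in the rest.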

From https://marvinlsj.github.io/2018/11/23/NBSVM%20for%20sentiment%20and%20topic%20classification/, the log-count ratio is derived from the posterior probability ratio, which is good for comparing 2 classes to see which is the more probable. I guess you're trying to do the one-vs-one method for a multiclass problem. That ends up with 5x4/2=10 pairs of ratios for classification. If you only want classification, we normally compute the posterior probability for each class and select the best. So in your case, you just select the best from sum(log(p1)), sum(log(p2)), ..., sum(log(p5)).
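Selecting the class with the best summed log-probability might look like the sketch below. It assumes a multinomial-style Naive Bayes with add-one smoothing on dense NumPy count data; the helper names fit_log_probs and predict and the toy data are hypothetical, not taken from the linked post.

```python
import numpy as np

def fit_log_probs(x, y, labels):
    # log prior of each class from its share of documents
    log_prior = np.array([np.log((y == c).mean()) for c in labels])
    # per-class feature counts with add-one smoothing
    counts = np.array([x[y == c].sum(axis=0) for c in labels]) + 1
    log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_lik

def predict(x, log_prior, log_lik, labels):
    # score[i, k] = log prior of class k + sum of count * log P(feature | class k)
    scores = x @ log_lik.T + log_prior
    return np.asarray(labels)[scores.argmax(axis=1)]

# toy data (hypothetical): 6 documents, 3 features, 3 labels
x = np.array([[2, 0, 0], [1, 0, 0], [0, 3, 0],
              [0, 1, 0], [0, 0, 1], [0, 0, 2]])
y = np.array(["a", "a", "b", "b", "c", "c"])
labels = ["a", "b", "c"]
log_prior, log_lik = fit_log_probs(x, y, labels)
pred = predict(x, log_prior, log_lik, labels)
```

Working in log space avoids underflow from multiplying many small probabilities, and the argmax over classes replaces any pairwise comparison.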

Related

How to calculate Covariance and Correlation in Python without using cov and corr?

How can we calculate the correlation and covariance between two variables without using cov and corr in Python3?
At the end, I want to write a function that returns three values:
a boolean that is true if two variables are independent
covariance of two variables
correlation of two variables.
You can find the definition of correlation and covariance here:
https://medium.com/analytics-vidhya/covariance-and-correlation-math-and-python-code-7cbef556baed
I wrote this part for covariance:
'''
import numpy as np

def cov_and_corr(x, y):
    # x and y are equal-length NumPy arrays
    mean_x, mean_y = x.mean(), y.mean()
    n = len(x)
    # population covariance (divides by n)
    cov = sum((x - mean_x) * (y - mean_y)) / n
    sum_x = float(sum(x))
    sum_y = float(sum(y))
    sum_x_sq = sum(xi*xi for xi in x)
    sum_y_sq = sum(yi*yi for yi in y)
    psum = sum(xi*yi for xi, yi in zip(x, y))
    # Pearson correlation from raw sums
    num = psum - (sum_x * sum_y/n)
    den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
    if den == 0:
        return cov, 0
    return cov, num / den
'''
For the covariance, just subtract the respective means and multiply the vectors together (using the dot product). (Of course, make sure whether you're using the sample covariance or population covariance estimate -- if you have "enough" data the difference will be tiny, but you should still account for it if necessary.)
For the correlation, divide the covariance by the standard deviations of both.
As for whether or not two columns are independent, that's not quite as easy. If two random variables are independent, then $\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right] = 0$, where $\mu_X, \mu_Y$ are the means of the two variables (the converse does not hold in general: zero covariance does not imply independence). But, when you have a data set, you are not dealing with the actual probability distributions; you are dealing with a sample. That means that the correlation will very likely not be exactly $0$, but rather a value close to $0$. Whether or not this is "close enough" will depend on your sample size and what other assumptions you're willing to make.
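A minimal sketch of the recipe described above, assuming NumPy arrays as input. The function name cov_corr and the ddof parameter (0 for the population estimate, 1 for the sample estimate) are choices made for this example.

```python
import numpy as np

def cov_corr(x, y, ddof=0):
    """Covariance and Pearson correlation without np.cov / np.corrcoef."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # subtract the respective means
    dx, dy = x - x.mean(), y - y.mean()
    n = len(x)
    # covariance as a dot product of the centered vectors
    cov = np.dot(dx, dy) / (n - ddof)
    # standard deviations computed with the same normalisation
    std_x = np.sqrt(np.dot(dx, dx) / (n - ddof))
    std_y = np.sqrt(np.dot(dy, dy) / (n - ddof))
    denom = std_x * std_y
    # a constant column has no defined correlation; return 0.0 by convention here
    corr = cov / denom if denom > 0 else 0.0
    return cov, corr

# made-up sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
cov, corr = cov_corr(x, y)
```

The correlation does not depend on ddof because the normalisation cancels; only the covariance changes between the sample and population estimates.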

Singular matrix when implementing GMM from scratch

In the EM algorithm for Gaussian mixture models, we go through the above steps. 9.23 makes up the E step and 9.24, 9.25, 9.26 make up the M step.
However, implementing 9.25 always gave a $\Sigma_K$ that was singular. No matter which mean or sigma I chose, the algorithm would create a singular $\Sigma$ after the first iteration.
My implementation of M step:
for k in range(K):
    mu[:,k] = np.sum(X*gamma[:,k][:,None],axis = 0)/N[k]
for k in range(K):
    A = X - mu[:,k]
    SigmaK = A.T@(A *gamma[:,k][:,None])
    SigmaK /=self.N[k]
    sigma[k] = SigmaK
pi = N*(1/N.sum())
Is there something particularly wrong with this implementation that would cause covariance to always be singular?
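For comparison, here is a small self-contained sketch of an M step, assuming the referenced equations are the usual responsibility-weighted updates for means, covariances and mixing coefficients. The function name m_step, the (K, d) layout of mu, and the reg ridge term are assumptions of this example; the ridge is a common safeguard against singular covariances and is not part of the equations themselves.

```python
import numpy as np

def m_step(X, gamma, reg=1e-6):
    """X: (n, d) data, gamma: (n, K) responsibilities from the E step."""
    n, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)               # effective number of points per component
    mu = (gamma.T @ X) / Nk[:, None]     # weighted means, shape (K, d)
    sigma = np.empty((K, d, d))
    for k in range(K):
        A = X - mu[k]                    # center with the NEW mean of component k
        sigma[k] = (A.T @ (A * gamma[:, k][:, None])) / Nk[k]
        sigma[k] += reg * np.eye(d)      # assumption: small ridge to keep sigma invertible
    pi = Nk / n                          # mixing coefficients
    return mu, sigma, pi

# made-up data and responsibilities, only to exercise the function
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
gamma = rng.dirichlet(np.ones(3), size=200)
mu, sigma, pi = m_step(X, gamma)
```

Checking np.linalg.eigvalsh(sigma[k]) after each iteration is one way to see which component collapses and when.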

sklearn customized standardization of data

Suppose I have a 2D numpy array:
X = np.array([
[..., ...],
[..., ...]])
And I want to standardize the data either with:
X = StandardScaler().fit_transform(X)
or:
X = (X - X.mean())/X.std()
The results are different. Why are they different?
Assuming X is a feature matrix of shape (n x m) (n instances and m features). We want to scale each feature so its instances are distributed with a mean of zero and with unit variance.
To do this you need to calculate the mean and standard deviation of each feature for the provided instances (column of X) and then calculate the scaled feature vectors. Currently you are calculating the mean and standard deviation of the whole dataset and scaling the data using these values: this will give you meaningless results in all but a few special cases (e.g., when every feature already has the same mean and standard deviation).
Practically, to calculate these statistics for each feature you will need to set the axis parameter of the .mean() or .std() methods to 0. This will perform the calculations along the columns and return a (1 x m) shaped array (actually an (m,) array, but that's another story), where each value is the mean or standard deviation for the given column. You can then use numpy broadcasting to correctly scale the feature vectors.
The below example shows how you can correctly implement it manually. x1 and x2 are 2 features with 100 training instances. We store them in a feature matrix X.
import numpy as np
from sklearn.preprocessing import StandardScaler

x1 = np.linspace(0, 100, 100)
x2 = 10 * np.random.normal(size=100)
X = np.c_[x1, x2]
# scale the data using the sklearn implementation
X_scaled = StandardScaler().fit_transform(X)
# scale the data taking mean and std along columns
X_scaled_manual = (X - X.mean(axis=0)) / X.std(axis=0)
If you compare the two you will see they match up to floating-point error:
print(np.allclose(X_scaled, X_scaled_manual))
prints True.
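To see why the two formulas in the question disagree, the sketch below scales the same kind of matrix once with global statistics and once with per-column statistics, using NumPy only. The variable names X_global and X_per_col and the fixed random seed are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.linspace(0, 100, 100), 10 * rng.normal(size=100)]

# one mean and one std for the whole matrix (what the question does)
X_global = (X - X.mean()) / X.std()
# one mean and one std per column (what StandardScaler does)
X_per_col = (X - X.mean(axis=0)) / X.std(axis=0)

# per-column statistics after each kind of scaling
print(X_global.mean(axis=0), X_global.std(axis=0))
print(X_per_col.mean(axis=0), X_per_col.std(axis=0))
```

Only the per-column version leaves every feature with zero mean and unit standard deviation; the global version merely shifts and rescales the whole matrix by a single pair of numbers.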

Bayesian network conditional independency

Suppose we observe that it is cloudy and raining. What is the probability that the grass is wet? The answer would be:
P(W=T|C=T,R =T) = P(W=T|R=T,S=T)*P(S=T|C=T)+P(W=T|R=T,S=F)*P(S=F|C=T)
But if we observe that the sprinkler is on and the grass is wet, what is the probability that it is raining? I'm not sure how to set up the query for this problem.
The question is a bit off-topic and better for math, because formulas aren't supported here...
1) First, apply the definition of conditional probability:
p(R|S,W) = p(R,S,W) / p(S,W)
2) The numerator can be computed by the Law of total probability:
p(R,S,W) = p(R,S,W|C)p(C) + p(R,S,W|!C)p(!C)
and Bayesian network condition:
p(R,S,W|C) = p(W|S,R) p(S|C) p(R|C)
3) The denominator is computed likewise, but conditioning on both R and C:
p(S,W) = p(S,W|R,C)p(R|C)p(C) + p(S,W|R,!C)p(R|!C)p(!C) +
p(S,W|!R,C)p(!R|C)p(C) + p(S,W|!R,!C)p(!R|!C)p(!C)
Finally, each conditional term simplifies, for example:
p(S,W|R,C) = p(S,W,R,C) / p(R,C) =
p(W|S,R) p(S|C) p(R, C) / p(R,C) =
p(W|S,R) p(S|C)
This will give you all four: p(S,W|R,C), p(S,W|R,!C), p(S,W|!R,C) and p(S,W|!R,!C), which in turn give p(S,W).
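The same query can be checked numerically by summing the joint distribution over the unobserved variable. The sketch below assumes the usual network structure (C influences S and R, which both influence W); the probability values are illustrative placeholders, not numbers given in the question, and the helper name joint is made up.

```python
from itertools import product

# Illustrative conditional probability tables (assumed values)
p_c = 0.5                                   # p(C)
p_s_given_c = {True: 0.1, False: 0.5}       # p(S | C)
p_r_given_c = {True: 0.8, False: 0.2}       # p(R | C)
p_w_given_sr = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}  # p(W | S, R)

def joint(c, s, r, w):
    """p(C=c, S=s, R=r, W=w) = p(c) p(s|c) p(r|c) p(w|s,r)."""
    pc = p_c if c else 1 - p_c
    ps = p_s_given_c[c] if s else 1 - p_s_given_c[c]
    pr = p_r_given_c[c] if r else 1 - p_r_given_c[c]
    pw = p_w_given_sr[(s, r)] if w else 1 - p_w_given_sr[(s, r)]
    return pc * ps * pr * pw

# numerator p(R, S, W): sum over C
numerator = sum(joint(c, True, True, True) for c in (True, False))
# denominator p(S, W): sum over both C and R
denominator = sum(joint(c, True, r, True)
                  for c, r in product((True, False), repeat=2))
p_rain = numerator / denominator            # p(R | S, W)
```

Enumeration like this is only practical for small networks, but it is a convenient way to verify a hand derivation.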

Incorporating uncertainty into a pymc3 model

I have a set of data for which I have the mean, standard deviation and number of observations for each point (i.e., I have knowledge regarding the accuracy of the measure). In a traditional pymc3 model where I look only at the means, I may do something along the lines of:
import pymc3 as pm

x = data['mean']
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    y = a + b*x
    eps = pm.HalfNormal('eps', sd=1)
    likelihood = pm.Normal('likelihood', mu=y, sd=eps, observed=x)
What is the best way to incorporate the information regarding the variance of the observations into the model? Obviously the result should weight low-variance observations more heavily than high-variance (less certain) observations.
One approach a statistician suggested was to do the following:
import numpy as np

x = data['mean'] # mean of observation
x_sd = data['sd'] # sd of observation
x_n = data['n'] # number of measures for observation
x_sem = x_sd/np.sqrt(x_n)
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    y = a + b*x
    eps = pm.HalfNormal('eps', sd=1)
    obs = pm.Normal('obs', mu=x, sd=x_sem, shape=len(x))
    likelihood = pm.Normal('likelihood', mu=y, sd=eps, observed=obs)
However, when I run this I get:
TypeError: observed needs to be data but got: <class 'pymc3.model.FreeRV'>
I am running the master branch of pymc3 (3.0 has some performance issues resulting in very slow sample times).
You are close; you just need to make some small changes. The main point is that in PyMC3 observed data is always constant. Check the following code:
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    mu = a + b*x
    mu_est = pm.Normal('mu_est', mu, x_sem, shape=len(x))
    likelihood = pm.Normal('likelihood', mu=mu_est, sd=x_sd, observed=x)
Notice that I keep the data fixed and introduce the observed uncertainty at two points: in the estimation of mu_est and in the likelihood. Of course, you are free not to use x_sem or x_sd and to estimate them instead, as you did in your code with the variable eps.
On a historical note, code with "random data" used to work on PyMC3 (at least for some models), but given that it was not really designed to work that way, developers decided to prevent the user from using random data, and that explains the message you got.

Resources