Due to privacy issues I don't have the original raw data matrices; instead, I can get the covariance matrices of the x and y datasets (x'x, y'y, x'y), or the correlation matrix between the two of them (or any other sort of matrix that is not the original data matrix).
I need to find a way to apply canonical correlation analysis directly to those matrices. Browsing the net I didn't find any solution to my problem. Is there an already implemented algorithm able to work on these data? R would be best, but other languages are OK.
Example from the R tutorial for the CCA package (https://stats.idre.ucla.edu/r/dae/canonical-correlation-analysis/):
mm <- read.csv("https://stats.idre.ucla.edu/stat/data/mmreg.csv")
colnames(mm) <- c("Control", "Concept", "Motivation", "Read", "Write", "Math",
"Science", "Sex")
You divide the dataset into x and y:
x <- mm[, 1:3]
y <- mm[, 4:8]
Then the function takes these two datasets as input: cc(x, y) (note that the function standardizes the data by itself).
What I want to know is whether there is a way to perform CCA starting by centering the matrices around the mean:
x = scale(x, scale = F)
y = scale(y, scale = F)
And then computing the cross-product matrices x'x, y'y, x'y:
cvx = crossprod(x); cvy = crossprod(y); cvxy = crossprod(x,y)
The algorithm should then take those matrices as input and compute the canonical variates and correlation coefficients, i.e. something like f(cvx, cvy, cvxy).
This article, for example, describes a solution starting from covariance matrices, but I don't know whether it is just theory or whether someone has actually implemented it:
http://graphics.stanford.edu/courses/cs233-20-spring/ReferencedPapers/CCA_Weenik.pdf
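To illustrate what I have in mind, here is a rough NumPy sketch of that textbook approach (everything here, including the name cca_from_cov, is my own untested code, not from any package; I am asking whether a vetted implementation of something like this exists):

import numpy as np

def cca_from_cov(cvx, cvy, cvxy):
    """Hypothetical f(cvx, cvy, cvxy): canonical correlations and weights
    computed from the centred cross-product matrices only."""
    # Cholesky factors of x'x and y'y (both must be positive definite)
    Lx = np.linalg.cholesky(cvx)
    Ly = np.linalg.cholesky(cvy)
    # M = Lx^{-1} (x'y) Ly^{-T}; its singular values are the canonical correlations
    M = np.linalg.solve(Lx, np.linalg.solve(Ly, cvxy.T).T)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # back-transform the singular vectors into canonical weight vectors
    a = np.linalg.solve(Lx.T, U)     # weights for the x variables
    b = np.linalg.solve(Ly.T, Vt.T)  # weights for the y variables
    return s, a, b

# rho, a, b = cca_from_cov(cvx, cvy, cvxy)

The canonical correlations would come out the same whether the three matrices are raw cross-products (x'x, etc.) or covariance matrices, because the common 1/(n-1) factor cancels; only the scale of the weights changes.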
I hope this is exhaustive enough!
In short: the correlations are used internally in most (probably all) CCA implementations.
At length: you will need to work out how to do that depending on the case. Let me show you an example below.
What is Canonical-correlation analysis (CCA)?
Canonical-correlation analysis (CCA) helps you to identify the best possible linear relations you could create between two datasets. See Wikipedia, and see the references below for examples. I will follow this post for the data and the libraries used.
Set up the libraries, load the data, select some variables, drop missing values, and standardize the data.
import pandas as pd
import numpy as np
df = pd.read_csv('2016 School Explorer.csv')
# choose relevant features
df = df[['Rigorous Instruction %',
'Collaborative Teachers %',
'Supportive Environment %',
'Effective School Leadership %',
'Strong Family-Community Ties %',
'Trust %','Average ELA Proficiency',
'Average Math Proficiency']]
df.corr()
# drop missing values
df = df.dropna()
# separate X and Y groups
X = df[['Rigorous Instruction %',
'Collaborative Teachers %',
'Supportive Environment %',
'Effective School Leadership %',
'Strong Family-Community Ties %',
'Trust %'
]]
Y = df[['Average ELA Proficiency',
'Average Math Proficiency']]
for col in X.columns:
X[col] = X[col].str.strip('%')
X[col] = X[col].astype('int')
# Standardise the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler(with_mean=True, with_std=True)
X_sc = sc.fit_transform(X)
Y_sc = sc.fit_transform(Y)
What are Correlations?
I am pausing here to talk about the idea and the implementation.
First of all, CCA is naturally based on that idea; however, for the numerical solution there are different ways to do it.
The definition from Wikipedia: corr(X, Y) = cov(X, Y) / (σ_X · σ_Y).
I am talking about this because I am going to modify a function of that library (pyrcca) and I want you to pay close attention to it.
See Eq. 4 in Bilenko et al. 2016, but you need to be careful about how everything is arranged.
Notice that strictly speaking you do not need the correlations.
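For reference, in the linear two-dataset case the eigenproblem that function sets up can be written as follows (this is my paraphrase of Eq. 4 in Bilenko et al. 2016, which states it with kernel matrices; $\kappa$ is the regularization parameter reg):

$$\begin{pmatrix} 0 & \Sigma_{xy} \\ \Sigma_{yx} & 0 \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \rho \begin{pmatrix} \Sigma_{xx} + \kappa I & 0 \\ 0 & \Sigma_{yy} + \kappa I \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix}$$

The left-hand side contains only the cross block $\Sigma_{xy}$, and the right-hand side only the auto blocks $\Sigma_{xx}$ and $\Sigma_{yy}$, which is exactly how LH and RH are filled in the code below.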
Let me show you the function that works out that expression, from the pyrcca library here:
def kcca(data, reg=0., numCC=None, kernelcca=True,
ktype='linear',
gausigma=1.0, degree=2):
"""Set up and solve the kernel CCA eigenproblem
"""
if kernelcca:
kernel = [_make_kernel(d, ktype=ktype, gausigma=gausigma,
degree=degree) for d in data]
else:
kernel = [d.T for d in data]
nDs = len(kernel)
nFs = [k.shape[0] for k in kernel]
numCC = min([k.shape[1] for k in kernel]) if numCC is None else numCC
# Get the auto- and cross-covariance matrices
crosscovs = [np.dot(ki, kj.T) for ki in kernel for kj in kernel]
# Allocate left-hand side (LH) and right-hand side (RH):
LH = np.zeros((sum(nFs), sum(nFs)))
RH = np.zeros((sum(nFs), sum(nFs)))
# Fill the left and right sides of the eigenvalue problem
for i in range(nDs):
RH[sum(nFs[:i]) : sum(nFs[:i+1]),
sum(nFs[:i]) : sum(nFs[:i+1])] = (crosscovs[i * (nDs + 1)]
+ reg * np.eye(nFs[i]))
for j in range(nDs):
if i != j:
LH[sum(nFs[:j]) : sum(nFs[:j+1]),
sum(nFs[:i]) : sum(nFs[:i+1])] = crosscovs[nDs * j + i]
LH = (LH + LH.T) / 2.
RH = (RH + RH.T) / 2.
maxCC = LH.shape[0]
r, Vs = eigh(LH, RH, eigvals=(maxCC - numCC, maxCC - 1))
r[np.isnan(r)] = 0
rindex = np.argsort(r)[::-1]
comp = []
Vs = Vs[:, rindex]
for i in range(nDs):
comp.append(Vs[sum(nFs[:i]):sum(nFs[:i + 1]), :numCC])
return comp
The output from here is the canonical covariates (comp); those are a and b in Eq. 4 of Bilenko et al. 2016.
I just want you to pay attention to this:
# Get the auto- and cross-covariance matrices
crosscovs = [np.dot(ki, kj.T) for ki in kernel for kj in kernel]
That is exactly the place where that operation happens. Notice that it is not exactly the definition from Wikipedia; however, it is mathematically equivalent.
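A quick check I am adding (it assumes the X_sc defined above): because the columns were standardised, dividing that cross-product matrix by the number of rows recovers the usual correlation matrix.

n = X_sc.shape[0]  # number of rows (1217 here, after dropna)
# X_sc has standardised columns, so X_sc'X_sc / n is exactly the correlation matrix
print(np.allclose(np.dot(X_sc.T, X_sc) / n, np.corrcoef(X_sc, rowvar=False)))  # True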
Calculation of the correlations
I am going to calculate the correlations as in Wikipedia, and later I will modify that function; it is only a couple of details, to make sure this answers the original question clearly.
# Get the auto- and cross-covariance matrices
crosscovs = [np.dot(ki, kj.T) for ki in kernel for kj in kernel]
print(crosscovs)
[array([[1217. , 746.04496925, 736.14178336, 575.21073838,
517.52474332, 641.25363806],
[ 746.04496925, 1217. , 732.6297358 , 1094.38480773,
572.95747557, 1073.96490387],
[ 736.14178336, 732.6297358 , 1217. , 559.5753228 ,
682.15312862, 774.36607617],
[ 575.21073838, 1094.38480773, 559.5753228 , 1217. ,
495.79248754, 1047.31981248],
[ 517.52474332, 572.95747557, 682.15312862, 495.79248754,
1217. , 632.75610906],
[ 641.25363806, 1073.96490387, 774.36607617, 1047.31981248,
632.75610906, 1217. ]]), array([[367.74099904, 391.82683717],
[348.78464015, 355.81358426],
[440.88117453, 514.22183796],
[326.32173163, 311.97282341],
[216.32441793, 269.72859023],
[288.27601974, 304.20209135]]), array([[367.74099904, 348.78464015, 440.88117453, 326.32173163,
216.32441793, 288.27601974],
[391.82683717, 355.81358426, 514.22183796, 311.97282341,
269.72859023, 304.20209135]]), array([[1217. , 1139.05867099],
[1139.05867099, 1217. ]])]
Have a look at the output. I am going to rescale it a bit so it is between -1 and 1. Again, this modification is minor. Following the definition from Wikipedia, the authors only use the numerator, and I am now going to include the denominator as well.
max_unit = 0
for crosscov in crosscovs:
max_unit = np.max([max_unit,np.max(crosscov)])
"""I normalice"""
crosscovs_new = []
for crosscov in crosscovs:
crosscovs_new.append(crosscov/max_unit)
print(crosscovs_new)
[array([[1. , 0.6130197 , 0.60488232, 0.47264646, 0.4252463 ,
0.52691342],
[0.6130197 , 1. , 0.6019965 , 0.89924799, 0.47079497,
0.88246911],
[0.60488232, 0.6019965 , 1. , 0.45979895, 0.56052024,
0.63629094],
[0.47264646, 0.89924799, 0.45979895, 1. , 0.40738906,
0.86057503],
[0.4252463 , 0.47079497, 0.56052024, 0.40738906, 1. ,
0.51993107],
[0.52691342, 0.88246911, 0.63629094, 0.86057503, 0.51993107,
1. ]]), array([[0.30217009, 0.32196125],
[0.28659379, 0.29236942],
[0.36226884, 0.42253232],
[0.26813618, 0.25634579],
[0.17775219, 0.22163401],
[0.2368743 , 0.24996063]]), array([[0.30217009, 0.28659379, 0.36226884, 0.26813618, 0.17775219,
0.2368743 ],
[0.32196125, 0.29236942, 0.42253232, 0.25634579, 0.22163401,
0.24996063]]), array([[1. , 0.93595618],
[0.93595618, 1. ]])]
For clarity I will show this in a slightly different way, so you can see that the numbers are indeed correlations of the original data.
df.corr()
                          Average ELA Proficiency  Average Math Proficiency
Average ELA Proficiency                  1.000000                  0.935956
Average Math Proficiency                 0.935956                  1.000000
That is also a way to see the variable names. I just want to show you that the numbers above make sense and are what you are calling correlations.
Calculation of the CCA
So now I will modify the function kcca from pyrcca a bit. The idea is for that function to accept the previously calculated correlation matrices.
from rcca import _make_kernel
from scipy.linalg import eigh
def kcca_working(data, reg=0.,
numCC=None,
kernelcca=False,
ktype='linear',
gausigma=1.0,
degree=2,
crosscovs=None):
"""Set up and solve the kernel CCA eigenproblem
"""
if kernelcca:
kernel = [_make_kernel(d, ktype=ktype, gausigma=gausigma,
degree=degree) for d in data]
else:
kernel = [d.T for d in data]
nDs = len(kernel)
nFs = [k.shape[0] for k in kernel]
numCC = min([k.shape[1] for k in kernel]) if numCC is None else numCC
if crosscovs is None:
# Get the auto- and cross-covariance matrices
crosscovs = [np.dot(ki, kj.T) for ki in kernel for kj in kernel]
# Allocate left-hand side (LH) and right-hand side (RH):
LH = np.zeros((sum(nFs), sum(nFs)))
RH = np.zeros((sum(nFs), sum(nFs)))
# Fill the left and right sides of the eigenvalue problem
for i in range(nDs):
RH[sum(nFs[:i]) : sum(nFs[:i+1]),
sum(nFs[:i]) : sum(nFs[:i+1])] = (crosscovs[i * (nDs + 1)]
+ reg * np.eye(nFs[i]))
for j in range(nDs):
if i != j:
LH[sum(nFs[:j]) : sum(nFs[:j+1]),
sum(nFs[:i]) : sum(nFs[:i+1])] = crosscovs[nDs * j + i]
LH = (LH + LH.T) / 2.
RH = (RH + RH.T) / 2.
maxCC = LH.shape[0]
r, Vs = eigh(LH, RH, eigvals=(maxCC - numCC, maxCC - 1))
r[np.isnan(r)] = 0
rindex = np.argsort(r)[::-1]
comp = []
Vs = Vs[:, rindex]
for i in range(nDs):
comp.append(Vs[sum(nFs[:i]):sum(nFs[:i + 1]), :numCC])
return comp, crosscovs
Let's run the function:
comp, crosscovs = kcca_working([X_sc, Y_sc], reg=0.,
numCC=2, kernelcca=False, ktype='linear',
gausigma=1.0, degree=2, crosscovs = crosscovs_new)
print(comp)
[array([[-0.00375779, 0.0078263 ],
[ 0.00061439, -0.00357358],
[-0.02054012, -0.0083491 ],
[-0.01252477, 0.02976148],
[ 0.00046503, -0.00905069],
[ 0.01415084, -0.01264106]]), array([[ 0.00632283, 0.05721601],
[-0.02606459, -0.05132531]])]
So I took the original function and made it possible to pass in the correlations; I also return them, just for checking.
I print the canonical covariates (comp); those are a and b in Eq. 4 of Bilenko et al. 2016.
Comparing results
Now I am going to compare the results from the original and the modified function, to show you that they are equivalent.
I can obtain the original results this way, with crosscovs = None, so it is calculated as in the original code instead of being passed in:
comp, crosscovs = kcca_working([X_sc, Y_sc], reg=0.,
numCC=2, kernelcca=False, ktype='linear',
gausigma=1.0, degree=2, crosscovs = None)
print(comp)
[array([[-0.13109264, 0.27302457],
[ 0.02143325, -0.12466608],
[-0.71655285, -0.2912628 ],
[-0.43693303, 1.03824477],
[ 0.01622265, -0.31573818],
[ 0.49365965, -0.44098996]]), array([[ 0.2205752 , 1.99601077],
[-0.90927705, -1.79051045]])]
I print the canonical covariates (comp); those are a' and b' in Eq. 4 of Bilenko et al. 2016.
a, b and a', b' are different, but they differ only in scale, so for all practical purposes they are equivalent. This is because of the scaling in the correlation definition.
To show that, let me pick numbers from each case and calculate the ratio:
print(0.00061439/-0.00375779)
-0.16349769412340764
print(0.02143325/-0.13109264)
-0.16349697435340382
They are the same up to numerical precision.
Once that modification is in place, you can just build on top of it.
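If you also want the canonical correlation coefficients themselves (the question asks for both the variates and the coefficients), you can get them from the canonical covariates. A minimal sketch I am adding, assuming the X_sc, Y_sc and comp from the runs above (the sign of each eigenvector is arbitrary, hence the abs):

# canonical variates: project the standardised data onto the weights
U = np.dot(X_sc, comp[0])  # variates for the X block
V = np.dot(Y_sc, comp[1])  # variates for the Y block
# canonical correlations = correlations between paired variates;
# they are identical for the original and the rescaled crosscovs runs
cancors = [abs(np.corrcoef(U[:, k], V[:, k])[0, 1]) for k in range(U.shape[1])]
print(cancors)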
References:
Cool post with an example and explanations in Python, using the pyrcca library: https://towardsdatascience.com/understanding-how-schools-work-with-canonical-correlation-analysis-4c9a88c6b913
Bilenko, Natalia Y., and Jack L. Gallant. "Pyrcca: regularized kernel canonical correlation analysis in Python and its applications to neuroimaging." Frontiers in Neuroinformatics 10 (2016): 49. Paper in which pyrcca is explained: https://www.frontiersin.org/articles/10.3389/fninf.2016.00049/full
I am doing fuzzy similarity matching between all rows in the 'name' column using Python/PySpark in a Jupyter notebook. The expected output is a new column containing the similar strings, together with a score for each string as another new column. My question is quite similar to this question; it's just that that question is in R and it used two datasets (mine is only one). As I'm quite new to Python, I'm quite confused about how to do it.
I have also used some simple code with a similarity function; however, I'm not sure how to run it on the dataframe.
Here is the code:
import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
""" levenshtein_ratio_and_distance:
Calculates levenshtein distance between two strings.
If ratio_calc = True, the function computes the
levenshtein distance ratio of similarity between two strings
For all i and j, distance[i,j] will contain the Levenshtein
distance between the first i characters of s and the
first j characters of t
"""
# Initialize matrix of zeros
rows = len(s)+1
cols = len(t)+1
distance = np.zeros((rows,cols),dtype = int)
# Populate the first row and column with index values (edit distances from the empty string)
for i in range(1, rows):
for k in range(1,cols):
distance[i][0] = i
distance[0][k] = k
# Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions
for col in range(1, cols):
for row in range(1, rows):
if s[row-1] == t[col-1]:
cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
else:
# In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
# the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
if ratio_calc == True:
cost = 2
else:
cost = 1
distance[row][col] = min(distance[row-1][col] + 1, # Cost of deletions
distance[row][col-1] + 1, # Cost of insertions
distance[row-1][col-1] + cost) # Cost of substitutions
if ratio_calc == True:
# Computation of the Levenshtein Distance Ratio
Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
return Ratio
else:
# print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
# insertions and/or substitutions
# This is the minimum number of edits needed to convert string a to string b
return "The strings are {} edits away".format(distance[row][col])
# example run on two simple strings
Str1 = "Apple Inc."
Str2 = "Jo Inc"
Distance = levenshtein_ratio_and_distance(Str1,Str2)
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1,Str2,ratio_calc = True)
print(Ratio)
However, the code above is only applicable to strings. What if I want to run it with a dataframe as input instead of strings? For example, the input data is (say the dataset name is customer):
name
1 Ace Co
2 Ace Co.
11 Baes
4 Bayes Inc.
8 Bayes
12 Bays
10 Bcy
15 asd
13 asd
The expected outcome is:
name    b_name                          dist
Ace Co  Ace Co.                         0.64762
Baes    Bayes Inc., Bayes, Bays, Bcy    0.80000, 0.86667, 0.70000, 0.97778
asd     asdf                            0.08333
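Would something along these lines be the right direction? This is a rough, untested PySpark sketch using Spark's built-in levenshtein function (so the scores would not exactly match the expected values above, which come from the ratio function); the customer dataframe and the 0.6 threshold are just placeholders.

from pyspark.sql import functions as F

# assume `customer` is a Spark DataFrame with a single column `name`
left = customer.select(F.col("name"))
right = customer.select(F.col("name").alias("b_name"))

# all pairs of different names
pairs = left.crossJoin(right).filter(F.col("name") != F.col("b_name"))

# similarity score in [0, 1] from the built-in Levenshtein edit distance
pairs = pairs.withColumn(
    "dist",
    1 - F.levenshtein(F.col("name"), F.col("b_name"))
        / F.greatest(F.length("name"), F.length("b_name")))

# keep reasonably similar pairs and collect the matches per name
result = (pairs.filter(F.col("dist") > 0.6)
               .groupBy("name")
               .agg(F.collect_list("b_name").alias("b_name"),
                    F.collect_list(F.round("dist", 5)).alias("dist")))
result.show(truncate=False)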
I am interested in fitting the following nested random effect model in JAGS.
SAS code
proc nlmixed data=data1 qpoints=20;
parms beta0=2 beta1=1 ;
bounds vara >=0, varb_a >=0;
eta = beta0+ beta1*t+ b2+b3;
p = exp(eta)/(1+exp(eta));
model TestResult ~ binary(p);
random b2 ~ normal(0,vara) subject = HHcode;
random b3 ~ normal(0,varb_a) subject = IDNo_N(HHcode);
run;
My question: how do I specify the random effect part?
I have repeated measurements on individuals. These individuals are further nested in households. Note: the number of individuals per household varies!
Looking forward to hearing from you
Let's assume that we have two vectors which indicate which house and which individual a data point belongs to (these are things you will need to create; in R you can make them by converting a factor to numeric via as.numeric). So, if we have 10 data points from 2 houses and 5 individuals, they would look like this.
house_vec = c(1,1,1,1,1,1,2,2,2,2) # 6 points for house 1, 4 for house 2
ind_vec = c(1,1,2,2,3,3,4,4,5,5) # everyone has two observations
N = 10 # number of data points
So, the above vectors tell us that there are 3 individuals in the first house (because the first 6 elements of house_vec are 1 and the first 6 elements of ind_vec range from 1 to 3) and that the second house has 2 individuals (the last 4 elements of house_vec are 2 and the last 4 elements of ind_vec are 4 and 5). With these vectors we can do nested indexing in JAGS to create your random effect structure; something like the code below would suffice. The vectors, together with the counts n_house and n_ind and a vector house_of_ind giving each individual's house (here c(1,1,1,2,2); see the short sketch at the end of this answer for one way to build these), would be supplied in the data list along with TestResult.
# define each random-effect node exactly once: loop over houses and
# individuals, not over the N data points
for(h in 1:n_house){
  mu_house[h] ~ dnorm(0, taua)
}
for(j in 1:n_ind){
  mu_ind[j] ~ dnorm(mu_house[house_of_ind[j]], taub_a)
}
# priors
taua ~ dgamma(0.01, 0.01) # precision
sda <- 1 / sqrt(taua) # derived standard deviation
taub_a ~ dgamma(0.01, 0.01) # precision
sdb_a <- 1 / sqrt(taub_a) # derived standard deviation
You would only need to include mu_ind within the linear predictor, as it is informed by mu_house. So the rest of the model would look like this:
for(i in 1:N){
  logit(p[i]) <- beta0 + beta1 * t[i] + mu_ind[ind_vec[i]]  # t[i] is the time covariate for observation i
  TestResult[i] ~ dbern(p[i])
}
You would then need to set priors for beta0 and beta1 (vague normal priors are a common choice) and supply t in the data list as well.
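For completeness, here is one way to build house_vec, ind_vec and house_of_ind from a long-format data frame (a pandas sketch; the column names HHcode and IDNo_N come from your SAS call, and the toy values are made up):

import pandas as pd

# hypothetical long-format data: one row per repeated measurement
data1 = pd.DataFrame({
    "HHcode": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "IDNo_N": ["a1", "a1", "a2", "a2", "a3", "a3", "b1", "b1", "b2", "b2"],
})

# 1-based consecutive indices, as JAGS expects
house_vec = pd.factorize(data1["HHcode"])[0] + 1
ind_vec = pd.factorize(data1["IDNo_N"])[0] + 1  # assumes individual IDs are unique across households

# house of each individual, needed so each mu_ind is nested in its own house
house_of_ind = (pd.DataFrame({"ind": ind_vec, "house": house_vec})
                  .drop_duplicates("ind")
                  .sort_values("ind")["house"]
                  .to_numpy())

print(house_vec)     # [1 1 1 1 1 1 2 2 2 2]
print(ind_vec)       # [1 1 2 2 3 3 4 4 5 5]
print(house_of_ind)  # [1 1 1 2 2]

n_house and n_ind are then just the number of distinct values in house_vec and ind_vec.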