Multiple linear regression with missing covariates - statistics

Imagine I have a dataset like
df <- data.frame(y=c(11:16), x1=c(23,NA,27,20,20,21), x2=c(NA,9,2,9,7,8))
df
y x1 x2
1 11 23 NA
2 12 NA 9
3 13 27 2
4 14 20 9
5 15 20 7
6 16 21 8
If I perform a multiple linear regression, I get
m <- lm(y~x1+x2, data=df)
summary(m)
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
3 4 5 6
-1.744e-01 -1.047e+00 -4.233e-16 1.221e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.72093 27.06244 0.729 0.599
x1 -0.24419 0.93927 -0.260 0.838
x2 0.02326 1.01703 0.023 0.985
Residual standard error: 1.617 on 1 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.4767, Adjusted R-squared: -0.5698
F-statistic: 0.4556 on 2 and 1 DF, p-value: 0.7234
Here we have 2 observations (1 and 2) deleted due to missingness.
To reduce the effects of missing data, would it be wise to compute 2 different simple linear regressions?
I.e.
m1 <- lm(y~x1, data=df)
m2 <- lm(y~x2, data=df)
In this case, for each model we will have only 1 observation deleted due to missingness.

No, that would probably not be wise, because you run into the issue of omitted variable bias.
You can see how this affects your estimates, for instance for x1, whose coefficient is inflated in magnitude:
summary(lm(y~x1, data=df))
Call:
lm(formula = y ~ x1, data = df)
Residuals:
1 3 4 5 6
-2.5287 0.8276 -0.5460 0.4540 1.7931
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.3276 7.1901 2.966 0.0592 .
x1 -0.3391 0.3216 -1.054 0.3692
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.897 on 3 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2703, Adjusted R-squared: 0.02713
F-statistic: 1.112 on 1 and 3 DF, p-value: 0.3692
Note that your relation of interest is y~x1+x2, i.e. the effect of x1 on y accounting for the effect of x2, and vice versa.
That is of course not the same as estimating y~x1 and y~x2 separately, where you omit the effect of the other explanatory variable.
Now there are of course strategies to deal with missing values.
One option is to estimate a Bayesian model, using JAGS for instance, where you can model the missing values explicitly. An example would be the following, where I take the mean and standard deviation of each variable to model the missing values:
model{
  for(i in 1:N){
    y[i] ~ dnorm(yhat[i], tau)
    yhat[i] <- a + b1*x1[i] + b2*x2[i]
    # Distributions for the covariates, so missing x1/x2 values are imputed by the model
    # (note: the second argument of dnorm in JAGS is a precision, not a standard deviation)
    x1[i] ~ dnorm(22, 3)
    x2[i] ~ dnorm(7, 1.3)
  }
  # Priors (the intercept a needs a prior as well)
  a ~ dnorm(0, .01)
  b1 ~ dnorm(0, .01)
  b2 ~ dnorm(0, .01)
  # Hyperpriors
  tau <- pow(sd, -2)
  sd ~ dunif(0, 20)
}
This is just off the top of my head.
For better and more insightful advice on how to deal with missing values, I would recommend paying a visit to stats.stackexchange.
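Another common strategy is to impute the missing covariates first and then fit the full model on all available rows. The sketch below is my own addition, not part of the answer above: a minimal example assuming a Python workflow with scikit-learn's IterativeImputer is acceptable, using the data from the question.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

y = np.array([11, 12, 13, 14, 15, 16], dtype=float)
X = np.array([[23, np.nan],
              [np.nan, 9],
              [27, 2],
              [20, 9],
              [20, 7],
              [21, 8]], dtype=float)

# Impute x1 and x2 from each other instead of dropping the incomplete rows
X_imputed = IterativeImputer(random_state=0).fit_transform(X)

# Fit y ~ x1 + x2 on all six observations
model = LinearRegression().fit(X_imputed, y)
print(model.intercept_, model.coef_)
In R, the mice package implements the analogous (multiple) imputation workflow; note that single imputation like this understates the uncertainty, which is one reason the fully Bayesian treatment above is attractive.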

Related

Contour plots of noisy data - gridding and averaging

I am trying to make a contour plot from a dataframe in which the x and y coordinates are unevenly spaced and sometimes overlap and the z coordinate is noisy:
x y z
1 15.4707 174.6779 1592.811638
2 15.4707 171.3179 1304.953183
3 61.6107 108.2379 1687.233377
4 46.3707 151.6929 1688.368690
5 30.7107 124.5429 1339.451757
6 31.1307 202.8704 1616.756963
7 0.2307 141.5029 1620.288736
8 15.4707 141.9054 1167.798302
9 46.3707 72.0729 1687.546227
10 15.4707 212.6929 638.059709
What I'd like to do is to define a grid in x and y whose gridlines pass through the coordinates, say
x=[7.5, 22.5, 37.5, 52.5]
y=[60, 120, 180, 240]
In every grid section, I then take the average of the z values and make a new dataframe where the x and y columns are the centres of the grid sections and the z column is the aforementioned average. The dataframe should look something like
x y z
1 15 90 1621.1
2 30 150 1444.2
3 45 210 1651.7
From this stage it is easy to get a contour plot using matplotlib.contourf or similar, but how can I do this type of gridding and averaging? Is there an elegant way to do it in pandas or other Python packages?
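One possible approach, just a sketch rather than the only way: use pandas.cut to assign each point to a grid cell and then groupby/mean over the cells. The grid edges and cell centres below are the ones from the question, and the data are the sample rows shown above.
import pandas as pd

# the sample data from the question
df = pd.DataFrame({
    "x": [15.4707, 15.4707, 61.6107, 46.3707, 30.7107,
          31.1307, 0.2307, 15.4707, 46.3707, 15.4707],
    "y": [174.6779, 171.3179, 108.2379, 151.6929, 124.5429,
          202.8704, 141.5029, 141.9054, 72.0729, 212.6929],
    "z": [1592.811638, 1304.953183, 1687.233377, 1688.368690, 1339.451757,
          1616.756963, 1620.288736, 1167.798302, 1687.546227, 638.059709],
})

x_edges = [7.5, 22.5, 37.5, 52.5]
y_edges = [60, 120, 180, 240]
x_centres = [15.0, 30.0, 45.0]
y_centres = [90.0, 150.0, 210.0]

# label each point with the centre of the grid cell it falls into,
# then average z within each cell
gridded = (
    df.assign(xc=pd.cut(df["x"], bins=x_edges, labels=x_centres),
              yc=pd.cut(df["y"], bins=y_edges, labels=y_centres))
      .groupby(["xc", "yc"], observed=True)["z"]
      .mean()
      .reset_index()
      .rename(columns={"xc": "x", "yc": "y"})
)
print(gridded)  # points outside the outermost grid lines (e.g. x = 61.61 or x = 0.23) are dropped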

Using Regression - Differentiate between two data frame columns, which is Linear and which is polynomial function?

In a dataframe with 6 columns (A B C D E F), one of columns E and F is a linear combination of the first four columns with varying coefficients, while the other is a polynomial function of the same inputs.
The task is to find which column is the linear function and which is the polynomial function.
Here are 30 sample rows from the dataframe (512 rows in total):
A B C D E F
0 28400 28482 28025 28060 738.0 117.570740
1 28136 28382 28135 28184 -146.0 295.430176
2 28145 28255 28097 28119 30.0 132.123714
3 28125 28192 27947 27981 357.0 101.298064
4 28060 28146 27981 28007 124.0 112.153318
5 27995 28100 27945 28022 149.0 182.427089
6 28088 28195 27985 28019 167.0 141.255137
7 28049 28157 27996 28008 22.0 120.069010
8 28025 28159 28025 28109 34.0 218.401641
9 28170 28638 28170 28614 420.0 919.376358
10 28666 28980 28551 28710 234.0 475.389093
11 28660 28779 28531 28634 345.0 222.895307
12 28590 28799 28568 28783 265.0 425.738484
13 28804 28930 28740 28808 138.0 194.449548
14 28770 28770 28650 28719 378.0 69.289005
15 28769 28770 28600 28638 413.0 39.225874
16 28694 28866 28674 28847 214.0 346.158401
17 28843 28928 28807 28874 121.0 152.281425
18 28921 28960 28680 28704 491.0 63.234310
19 28683 28950 28628 28905 397.0 547.115621
20 28877 28877 28712 28749 404.0 37.212629
21 28685 29011 28680 28949 222.0 598.104568
22 29045 29180 29045 29111 -3.0 201.306765
23 29220 29499 29216 29481 259.0 546.566915
24 29439 29485 29310 29376 344.0 112.394063
25 29319 29345 28951 29049 906.0 125.333702
26 29001 29009 28836 28938 526.0 110.611943
27 28905 28971 28851 28917 174.0 132.274514
28 28907 28916 28711 28862 685.0 161.078158
29 28890 29025 28802 28946 329.0 280.114923
I performed linear regression on all 512 rows.
Columns A B C D as input, column E as target values. Output:
Intercept [-2.67164069e-12]
coefficients [[ 2. 3. -1. -4.]]
Columns A B C D as input, column F as target values. Output:
Intercept [0.32815962]
coefficients [[ 1.01293825 -1.0003835 1.00503772 -1.01765453]]
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
For column E
x = df.iloc[:, :4].values
y = df.iloc[:, [4]].values
regressor = LinearRegression()
regressor.fit(x, y)
print(regressor.intercept_)
print(regressor.coef_)
output
[-2.67164069e-12]
[[ 2. 3. -1. -4.]]
For column F
x_new = df.iloc[:, :4].values
y_new = df.iloc[:, [5]].values
regressor_new = LinearRegression()
regressor_new.fit(x_new, y_new)
print(regressor_new.intercept_)
print(regressor_new.coef_)
output
[0.32815962]
[[ 1.01293825 -1.0003835 1.00503772 -1.01765453]]
To restate the task: one of the two columns is a linear combination of the first four columns with varying coefficients, while the other is a polynomial function of the same inputs.
Which column is the linear function and which is the polynomial?
I think the column that is a linear combination can be found by checking the multicollinearity between the columns: the column(s) that are linear combinations of the remaining columns will have a high VIF.
Try plotting the graphs (histograms) of the two columns and see whether you can identify each function as linear or polynomial from the shape of the plot.
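Another, more direct check (my own suggestion, assuming df is the questioner's full 512-row dataframe with columns A..F): refit the plain linear model for each candidate column and compare how well it reproduces the data. The output already posted hints at the answer, since the fit for E has an essentially zero intercept and the integer coefficients 2, 3, -1, -4.
import numpy as np
from sklearn.linear_model import LinearRegression

X = df[["A", "B", "C", "D"]].values
for target in ["E", "F"]:
    y = df[target].values
    reg = LinearRegression().fit(X, y)
    resid = y - reg.predict(X)
    print(target, "R^2 =", reg.score(X, y), "max |residual| =", np.abs(resid).max())

# The column that the linear fit reproduces essentially exactly (R^2 ~ 1, residuals ~ 0)
# is the linear combination; based on the posted coefficients that is E,
# which makes F the polynomial function.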

Find the most similar vector/string in Matlab

Consider that I have some vectors (or strings) of numbers which generally have different lengths, e.g. x=[1 2 3 4 3 3 3 2 5].
Now, for a new vector y, I want to find which one of the existing vectors x is the most similar.
Any idea?
The complete problem:
I want to predict a time series with several neural networks. At every step all the networks predict the next value of the series. When the real value arrives, the network that made the best prediction wins and I write its number to the vector X. After I finish with time series 1, I will have generated a vector X1, and each of its elements will represent the best NN at that step.
Now consider that I have 10 time series, so 10 X vectors. For a new time series Y I will do the same procedure. I want to determine the kind of Y using the similarity between its vector and the X vectors. I think the most important aspect is the succession of the NNs. As output I need something like a number or percentage of similarity.
E.g.:
X1= [ 1 1 2 2 3 3 4 4 5 5 6 6 ]
X2=[1 2 3 4 5 6 1 2 3 4 5 6]
Y=[1 1 1 2 2 3 4 5 5 6 6]
Then Y is more similar to X1
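One common way to score the similarity of integer sequences with different lengths is an edit (Levenshtein) distance normalised to a percentage. The sketch below is only a suggestion, written in Python for brevity; the same dynamic program ports directly to Matlab.
def edit_distance(a, b):
    # Classic Levenshtein dynamic program, keeping two rows at a time
    prev = list(range(len(b) + 1))
    for i, ai in enumerate(a, start=1):
        curr = [i]
        for j, bj in enumerate(b, start=1):
            cost = 0 if ai == bj else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # 100 means identical sequences
    return 100.0 * (1 - edit_distance(a, b) / max(len(a), len(b)))

X1 = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
X2 = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
Y  = [1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6]
print(similarity(Y, X1), similarity(Y, X2))  # Y scores higher against X1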

Naive Bayes (Weka) - Attributes total x Instances total - Why is it different?

I've been running a dataset through Weka, applying Naive Bayes.
I'm stuck on the following problem: while analyzing the output, I noticed a difference between the totals shown in the attribute section and the total number of instances reported in the log.
If you sum the counts for the "a0" attribute, Weka reports 1044 instances.
If you check "Instances", it is 1036.
The dataset actually contains 1036 instances.
Does anyone have an explanation for this? Thanks.
Here's a log paste:
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: teste.carro
Instances: 1036
Attributes: 7
a0
a1
a2
a3
a4
a5
class
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute 0 1
(0.5) (0.5)
===========================
a0
1 105.0 175.0
2 112.0 165.0
3 153.0 109.0
4 152.0 73.0
[total] 522.0 522.0
a1
1 101.0 165.0
2 123.0 165.0
3 136.0 119.0
4 162.0 73.0
[total] 522.0 522.0
a2
1 150.0 107.0
2 122.0 133.0
3 121.0 141.0
4 129.0 141.0
[total] 522.0 522.0
a3
1 247.0 1.0
2 134.0 265.0
3 140.0 255.0
[total] 521.0 521.0
a4
1 189.0 127.0
2 177.0 185.0
3 155.0 209.0
[total] 521.0 521.0
a5
1 244.0 1.0
2 160.0 220.0
3 117.0 300.0
[total] 521.0 521.0
Time taken to build model: 0 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.01 seconds
=== Summary ===
Correctly Classified Instances 957 92.3745 %
Incorrectly Classified Instances 79 7.6255 %
Kappa statistic 0.8475
Mean absolute error 0.1564
Root mean squared error 0.2398
Relative absolute error 31.2731 %
Root relative squared error 47.9651 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 80.2124 %
Total Number of Instances 1036
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,847 0,000 1,000 0,847 0,917 0,858 0,989 0,991 0
1,000 0,153 0,868 1,000 0,929 0,858 0,989 0,988 1
Weighted Avg. 0,924 0,076 0,934 0,924 0,923 0,858 0,989 0,989
=== Confusion Matrix ===
a b <-- classified as
439 79 | a = 0
0 518 | b = 1
Reading from "Data Mining: Practical Machine Learning Tools and Techniques" by Witten and Frank (the companion book for Weka), a known problem with naive Bayes is pointed out.
If a particular attribute value does not appear together with every possible class value, the resulting zero count has undue influence over the class prediction. In Weka, this is avoided by adding one to the count of every categorical attribute value when calculating the conditional probabilities (with the denominator adjusted accordingly). If you look at your example, you can verify this is what was done.
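Plugging in the numbers from the log above as a check: the confusion matrix shows 518 training instances in each class, and each attribute gets one extra pseudo-count per distinct value, which reproduces the [total] rows exactly.
# Reconstructing the [total] rows from the log (numbers taken from the question)
instances_per_class = 518                 # 1036 instances, 518 per class (see the confusion matrix)
n_values = {"a0": 4, "a1": 4, "a2": 4, "a3": 3, "a4": 3, "a5": 3}
for attr, k in n_values.items():
    print(attr, instances_per_class + k)  # 522 for a0-a2, 521 for a3-a5, as in the log
# Summing a0 over both classes gives 2 * 522 = 1044, which explains the
# apparent "1044 vs 1036" mismatch in the question.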
Below I attempt to explain the undue influence that is exhibited by the absence of an attribute value.
The naive Bayes formula:
P(y|x) = ( P(x1|y) * P(x2|y) * ... * P(xn|y) * P(y) ) / P(x)
From the naive Bayes formula we can see what this means.
Say:
P(x1|y1) = 0
P(x2|y1) ... P(xn|y1) all equal 1
From the above formula:
P(y1|x) = 0
Even though all other attributes strongly indicate that the instance belongs to class y1, the resulting probability is zero. The adjustment made by Weka allows for the possibility that the instance still comes from the class y1.
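A tiny numeric illustration of the adjustment, with hypothetical counts (not taken from the log above):
# An attribute value that was never observed together with class y1
count_x1_y1 = 0      # times this value of x1 occurred with class y1
n_y1 = 100           # training instances of class y1
k_x1 = 4             # number of distinct values x1 can take

p_plain    = count_x1_y1 / n_y1                    # 0.0 -> the whole product P(y1|x) collapses to 0
p_adjusted = (count_x1_y1 + 1) / (n_y1 + k_x1)     # ~0.0096 -> class y1 remains possible
print(p_plain, p_adjusted)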
A true numeric example can be found starting around slide 12 on this webpage

Effective reasonable indexing for numeric vector search?

I have a long numeric table where 7 columns form a key and 4 columns hold the values to find.
(In fact I have rendered an object at different distances and perspective angles and calculated Hu moments for its contour, but that is not important to the question, just context.)
So, given 7 values, I need to scan the table, find the row whose 7 key columns are closest, and extract the corresponding 4 values.
The aspects of the task to consider are as follows:
1) the numbers have errors;
2) the scale of the function's domain is not the same as the scale of its values, i.e. the "distance" between points in the 7-dimensional key space should take into account those 4 values and how they are affected;
3) the search should be fast.
So the question is: is there an algorithm out there to solve this task efficiently, i.e. to build some index on those 7 columns, but not the way conventional databases do it, taking the points above into account?
If I understand the problem correctly, you might consider using scipy.cluster.vq (vector quantization):
Suppose your 7 numeric columns look like this (let's call the array code_book):
import scipy.cluster.vq as vq
import scipy.spatial as spatial
import numpy as np
np.random.seed(2013)
np.set_printoptions(precision=2)
code_book = np.random.random((3,7))
print(code_book)
# [[ 0.68 0.96 0.27 0.6 0.63 0.24 0.7 ]
# [ 0.84 0.6 0.59 0.87 0.7 0.08 0.33]
# [ 0.08 0.17 0.67 0.43 0.52 0.79 0.11]]
Suppose the associated 4 columns of values look like this:
values = np.arange(12).reshape(3,4)
print(values)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
And finally, suppose we have some "observations" of 7-column values like this:
observations = np.random.random((5,7))
print(observations)
# [[ 0.49 0.39 0.41 0.49 0.9 0.89 0.1 ]
# [ 0.27 0.96 0.16 0.17 0.72 0.43 0.64]
# [ 0.93 0.54 0.99 0.62 0.63 0.81 0.36]
# [ 0.17 0.45 0.84 0.02 0.95 0.51 0.26]
# [ 0.51 0.8 0.2 0.9 0.41 0.34 0.36]]
To find the 7-valued row in code_book which is closest to each observation, you could use vq.vq:
index, dist = vq.vq(observations, code_book)
print(index)
# [2 0 1 2 0]
The index values refer to rows in code_book. However, if the rows in values are ordered the same way as code_book, we can "lookup" the associated value with values[index]:
print(values[index])
# [[ 8 9 10 11]
# [ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [ 0 1 2 3]]
The above assumes you have all your observations arranged in an array. Thus, to find all the indices you need only one call to vq.vq.
However, if you obtain the observations one at a time and need to find the closest row in code_book before going on to the next observation, then it would be inefficient to call vq.vq each time. Instead, generate a KDTree once, and then find the nearest neighbor(s) in the tree:
tree = spatial.KDTree(code_book)
for observation in observations:
distances, indices = tree.query(observation)
print(indices)
# 2
# 0
# 1
# 2
# 0
Note that the number of points in your code_book (N) must be large compared to the dimension of the data (e.g. N >> 2**7) for the KDTree to be fast compared to simple exhaustive search.
Using vq.vq or KDTree.query may or may not be faster than exhaustive search, depending on the size of your data (code_book and observations). To find out which is faster, be sure to benchmark these versus an exhaustive search using timeit.
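One add-on, which is my own suggestion rather than part of the answer above: the question's point 2 notes that the key columns are not all on the same scale, so it can help to rescale them before building the KDTree, for example by each column's standard deviation (which is what scipy.cluster.vq.whiten does) or by hand-chosen per-column weights.
import numpy as np
import scipy.spatial as spatial

np.random.seed(2013)
code_book = np.random.random((3, 7))      # the 7 key columns, as above
observations = np.random.random((5, 7))

# Divide every key column by its standard deviation so that no column dominates
# the Euclidean distance just because its units happen to be larger.
scale = code_book.std(axis=0)
tree = spatial.KDTree(code_book / scale)
dist, idx = tree.query(observations / scale)
print(idx)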
I don't know if I understood your question well, but I will try to give an answer.
For each row K in the table, compute the distance of your key from the key stored in that row:
( (X1-K1)^2 + (X2-K2)^2 + (X3-K3)^2 + (X4-K4)^2 + (X5-K5)^2 + (X6-K6)^2 + (X7-K7)^2 )^0.5
where {X1,X2,X3,X4,X5,X6,X7} is the key you are searching for and {K1,K2,K3,K4,K5,K6,K7} is the key at row K.
You can make one component of the key more or less relevant than the others by multiplying it while computing the distance; for example, you could replace (X1-K1)^2 in the formula above with 5*(X1-K1)^2 to make that component more influential.
Store the distance in one variable and the row number in a second variable.
Do the same with the following rows, and if the new distance is lower than the one you stored, replace both the distance and the row number.
When you have checked all the rows in your table, the second variable will hold the row nearest to your key.
Here is a short sketch of that loop (the original pseudo-code written out as runnable Python):
import math

def distance(key, row_key):
    # Euclidean distance between the 7-value search key and a row's key
    return math.sqrt(sum((x - k) ** 2 for x, k in zip(key, row_key)))

# table: a list of rows, each row holding 7 key values followed by 4 value columns
# key: the 7 values you are searching for (suppose both are already filled in)
closest_distance = float("inf")
closest_row = 0
for row_index, row in enumerate(table):
    new_distance = distance(key, row[0:7])
    if new_distance < closest_distance:
        closest_distance = new_distance
        closest_row = row_index

value_found = table[closest_row][7:11]  # this should be the value you were looking for
I know it isn't fast, but it is the best I could do; I hope it helped.
P.S. I haven't considered measurement errors, I know.
