Naive Bayes (Weka): attribute totals vs. total instances - why are they different?

I've been running a dataset through Weka, applying Naive Bayes.
I'm stuck on the following problem: while analyzing the output, I noticed a difference between the totals in the attributes section and the total number of instances reported in the log.
If you sum the counts for attribute "a0", Weka reports 1044 instances.
If you check "Instances", it is 1036.
The dataset actually contains 1036 instances.
Does anyone have an explanation for this? Thanks.
Here's a paste of the log:
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: teste.carro
Instances: 1036
Attributes: 7
a0
a1
a2
a3
a4
a5
class
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute 0 1
(0.5) (0.5)
===========================
a0
1 105.0 175.0
2 112.0 165.0
3 153.0 109.0
4 152.0 73.0
[total] 522.0 522.0
a1
1 101.0 165.0
2 123.0 165.0
3 136.0 119.0
4 162.0 73.0
[total] 522.0 522.0
a2
1 150.0 107.0
2 122.0 133.0
3 121.0 141.0
4 129.0 141.0
[total] 522.0 522.0
a3
1 247.0 1.0
2 134.0 265.0
3 140.0 255.0
[total] 521.0 521.0
a4
1 189.0 127.0
2 177.0 185.0
3 155.0 209.0
[total] 521.0 521.0
a5
1 244.0 1.0
2 160.0 220.0
3 117.0 300.0
[total] 521.0 521.0
Time taken to build model: 0 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.01 seconds
=== Summary ===
Correctly Classified Instances 957 92.3745 %
Incorrectly Classified Instances 79 7.6255 %
Kappa statistic 0.8475
Mean absolute error 0.1564
Root mean squared error 0.2398
Relative absolute error 31.2731 %
Root relative squared error 47.9651 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 80.2124 %
Total Number of Instances 1036
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,847 0,000 1,000 0,847 0,917 0,858 0,989 0,991 0
1,000 0,153 0,868 1,000 0,929 0,858 0,989 0,988 1
Weighted Avg. 0,924 0,076 0,934 0,924 0,923 0,858 0,989 0,989
=== Confusion Matrix ===
a b <-- classified as
439 79 | a = 0
0 518 | b = 1

In "Data Mining: Practical Machine Learning Tools and Techniques" by Witten and Frank (the companion book for Weka), a problem with naive Bayes is pointed out.
If a particular attribute value does not appear with every possible class value, then that zero count has undue influence over the class prediction. In Weka, this possibility is avoided by adding one to the count of every attribute value when calculating the conditional probabilities (with the denominator adjusted accordingly). If you look at your example you can verify this is what was done.
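As a quick check against the numbers in your log: the confusion matrix shows 518 instances of each class (439 + 79 and 0 + 518). Attribute a0 has 4 distinct values, so each class column gets one extra count per value, i.e. 518 + 4 = 522, and summing a0 over both classes gives 1036 + 8 = 1044, which is exactly the difference you noticed. Attributes a3, a4 and a5 have 3 values each, hence their totals of 518 + 3 = 521.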
Below I attempt to explain the undue influence exhibited when an attribute value is absent for a class.
The naive Bayes formula:
P(y|x) = ( P(x1|y) * P(x2|y) * ... * P(xn|y) * P(y) ) / P(x)
From the naive Bayes formula we can see what they mean. Say:
P(x1|y1) = 0
P(x2|y1) ... P(xn|y1) all equal 1
From the above formula:
P(y1|x) = 0
Even though all other attributes strongly indicate that the instance belongs to class y1, the resulting probability is zero. The adjustment made by Weka allows for the possibility that the instance still comes from the class y1.
A true numeric example can be found starting around slide 12 on this webpage.
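To make this concrete with the counts above: without the correction, an attribute value that never occurs with class y1 among its 518 training instances would be estimated as P(x1|y1) = 0/518 = 0, and the whole product would collapse to zero. With the add-one correction (for a 4-valued attribute such as a0) the estimate becomes P(x1|y1) = (0 + 1)/(518 + 4) = 1/522, which is small but no longer wipes out the evidence from the other attributes.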

Related

Excel and selecting variables conditionally

I have a data set which contains information by country. For example, Australia_F is the observation for Australia and Australia_Weight is the weight of Australia. Each period represents a specific year.
Period Australia_F Canada_F Denmark_F Japan_F Australia_Weight Canada_Weight Denmark_Weight Japan_weight
1985 0.05 -0.02 0.02 0.03 0.10 0.30 0.45 0.15
1986 -0.04 -0.03 0.02 0.01 0.15 0.30 0.30 0.25
The user can input any value into the following cell. For example, I have inserted 3:
Weight_Modification = 3
The goal is to only include countries where the variable XXXXX_F is positive,
and to use those with the highest values such that the total weight of the countries selected is not greater than 1.
The problem is complicated by the fact that the Weight_Modification variable multiplies each individual country weight by whatever its value is. For example, the weight for Australia would be 0.10 * 3 = 0.3 in 1985.
Total weights can be less than 1.00 but can't be greater than 1.00.
So, taking the above data as an example, for 1985 the results would be:
Australia_Weight Canada_Weight Denmark_Weight Japan_Weight Total_Weight
0.3              -             -              0.45         0.75
This is because in 1985 Australia has the highest value (Australia_F = 0.05), followed by Japan (Japan_F = 0.03).
Each country's weight is multiplied by 3.
Denmark is not selected, even though Denmark_F is positive, because including Denmark would make the total weight exceed 1.
In the actual file there are many more countries (12 in total) and many years.
Any help with how to put this together in Excel is greatly appreciated.
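Put differently, the selection rule I am after is roughly the following (sketched in Python only to make the logic concrete; the data are the 1985 row and the country list is shortened):

weight_modification = 3
f = {"Australia": 0.05, "Canada": -0.02, "Denmark": 0.02, "Japan": 0.03}
weight = {"Australia": 0.10, "Canada": 0.30, "Denmark": 0.45, "Japan": 0.15}

selected, total = {}, 0.0
# consider only countries with a positive _F value, highest _F first
for country in sorted((c for c in f if f[c] > 0), key=lambda c: f[c], reverse=True):
    w = weight[country] * weight_modification
    if total + w <= 1.0:        # skip a country if adding it would push the total above 1
        selected[country] = w
        total += w

print(selected, total)          # Australia ~0.3, Japan ~0.45, total ~0.75 (the 1985 example)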

Train/test set split for LSTM with multivariate Time series

I'm trying to solve a time series prediction problem for multivariate data in Python using an LSTM approach.
Here, the author solves a time series air pollution prediction problem. The data look like this:
pollution dew temp press wnd_dir wnd_spd snow rain
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
Unlike the hourly observations in the tutorial above, I have 30-second time-step observations of soccer matches with over 20 features, where each match (identified by a unique ID) has a different length, ranging from 190 to 200 steps.
The author splits the train/test sets by taking the first year of hourly data (365 * 24 rows) for training, as follows:
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
So my train/test split should be by number of matches (matches * len(match)):
n_train_matches = k * len(match)  # k = some number of matches used for training
train = values[:n_train_matches, :]
test = values[n_train_matches:, :]
I want to translate this to my problem to make a prediction for each feature as early as time t=2, i.e. 30 seconds into a match.
Question
Do I need to apply pre-sequence padding to each match?
Is there a way of solving the problem without padding?
If you are using an LSTM, then I believe you are more likely to benefit from the model if you pad the sequences and feed in multiple 30-second step observations.
If you don't pad the sequences and you want a prediction at t=2, then you will only be able to use the very last step observation.
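As a rough sketch of the pre-padding step (assuming the Keras pad_sequences utility and made-up match lengths; adapt the names to your data):

import numpy as np
from keras.preprocessing.sequence import pad_sequences  # or tensorflow.keras.preprocessing.sequence

# hypothetical input: one 2-D array per match, shaped (match_length, n_features),
# where match_length varies between roughly 190 and 200 steps
n_features = 20
matches = [np.random.random((length, n_features)) for length in (190, 195, 200)]

# pre-pad each match with zeros so they all share the longest length
padded = pad_sequences(matches, maxlen=200, dtype='float32', padding='pre', value=0.0)
print(padded.shape)  # (3, 200, 20): (samples, timesteps, features), ready for an LSTM

With all matches padded to the same length, you can then split train/test by whole matches rather than by rows, so no match is cut in half by the split.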

excel formula in one column based on variable entries in another column

My goal is to populate column G for each row with a CPT code (column B). If I were to copy and paste by hand, the formula would simply be: F2/F5, F3/F5, F4/F5, F6/F15, F7/F15, etc. Column A contains the names of staff. Each month, staff code for different types of procedures (columns B and C) and a variable quantity of each (column E). The rows with "Total" in column B represent the end of a staff person's monthly list of codes. We are hoping to provide monthly updates to our service chiefs. I am hoping someone might be able to help me devise a formula I could copy down column G. We are looking at roughly 3000-4000 rows of codes per month for about 200 different clinical providers.
CPTCode CPTName Work RVU FY16 Total Qty FY16 Total RVU CPT's % of FY RVUs
96119 NEUROPSYCH TESTING BY TECH 0.6 76 41.8
99212 OFFICE/OUTPATIENT VISIT EST 0.5 2 1.0
T1016 CASE MANAGEMENT 0.5 1 0.5
Total 79 43.3
H0038 SELF-HELP/PEER SVC PER 15MIN 0.0 727 0.0
90853 GROUP PSYCHOTHERAPY 0.6 236 139.2
99212 OFFICE/OUTPATIENT VISIT EST 0.5 153 73.4
S9446 PT EDUCATION NOC GROUP 0.4 105 42.0
99211 OFFICE/OUTPATIENT VISIT EST 0.2 44 7.9
90785 PSYTX COMPLEX INTERACTIVE 0.3 10 3.3
99202 OFFICE/OUTPATIENT VISIT NEW 0.9 1 0.9
99213 OFFICE/OUTPATIENT VISIT EST 1.0 1 1.0
H0031 MH HEALTH ASSESS BY NON-MD 0.6 1 0.6
Total 1278 268.4
H0038 SELF-HELP/PEER SVC PER 15MIN 0.0 452 0.0
98967 HC PRO PHONE CALL 11-20 MIN 0.5 1 0.5
Total 453 0.5
Screenshot: http://i.stack.imgur.com/dw44F.jpg
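To make the intended calculation concrete (each row's FY16 Total RVU divided by the "Total" row that closes its block), here is a small sketch in pandas; this is hypothetical code, not the Excel formula being asked for, and it uses only the first provider's block from the table above:

import pandas as pd

# hypothetical recreation of the first provider's block from the sheet above
df = pd.DataFrame({
    "CPTCode":        ["96119", "99212", "T1016", "Total"],
    "FY16 Total RVU": [41.8, 1.0, 0.5, 43.3],
})

# each block ends with its "Total" row, so label every row with the id of the
# Total row that closes its block by counting "Total" rows from the bottom up
block_id = df["CPTCode"].eq("Total")[::-1].cumsum()[::-1]

# divide each row's RVU by the RVU on its block's "Total" row (the last row of the block)
block_total = df.groupby(block_id)["FY16 Total RVU"].transform("last")
df["CPT's % of FY RVUs"] = df["FY16 Total RVU"] / block_total
print(df)  # 41.8/43.3, 1.0/43.3, 0.5/43.3, 1.0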

Multiple linear regression with missing covariates

Imagine I have a dataset like
df <- data.frame(y=c(11:16), x1=c(23,NA,27,20,20,21), x2=c(NA,9,2,9,7,8))
df
y x1 x2
1 11 23 NA
2 12 NA 9
3 13 27 2
4 14 20 9
5 15 20 7
6 16 21 8
If I perform a multiple linear regression, I get
m <- lm(y~x1+x2, data=df)
summary(m)
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
3 4 5 6
-1.744e-01 -1.047e+00 -4.233e-16 1.221e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.72093 27.06244 0.729 0.599
x1 -0.24419 0.93927 -0.260 0.838
x2 0.02326 1.01703 0.023 0.985
Residual standard error: 1.617 on 1 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.4767, Adjusted R-squared: -0.5698
F-statistic: 0.4556 on 2 and 1 DF, p-value: 0.7234
Here we have 2 observations (1 and 2) deleted due to missingness.
To reduce the effects of missing data, would it be wise to compute 2 different simple linear regressions?
I.e.
m1 <- lm(y~x1, data=df)
m2 <- lm(y~x2, data=df)
In this case, for each model we will have only 1 observation deleted due to missingness.
No, that would probably not be wise, because you would run into the issue of omitted variable bias.
You can see how this affects your estimates, for instance for x1, whose coefficient estimate is inflated:
summary(lm(y~x1, data=df))
Call:
lm(formula = y ~ x1, data = df)
Residuals:
1 3 4 5 6
-2.5287 0.8276 -0.5460 0.4540 1.7931
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.3276 7.1901 2.966 0.0592 .
x1 -0.3391 0.3216 -1.054 0.3692
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.897 on 3 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2703, Adjusted R-squared: 0.02713
F-statistic: 1.112 on 1 and 3 DF, p-value: 0.3692
Note that your relation of interest is y ~ x1 + x2, i.e. the effect of x1 on y accounting for the effect of x2, and vice versa.
That is of course not the same as estimating y ~ x1 and y ~ x2 separately, where you omit the effect of the other explanatory variable.
Now, there are of course strategies to deal with missing values.
One option is estimating a Bayesian model, using JAGS for instance, where you can model the missing values. An example would be the following, where I take the mean and standard deviation of each variable to model the missing values:
model{
  for(i in 1:N){
    y[i] ~ dnorm(yhat[i], tau)
    yhat[i] <- a + b1*x1[i] + b2*x2[i]
    # Accounting for missing data
    x1[i] ~ dnorm(22, 3)
    x2[i] ~ dnorm(7, 1.3)
  }
  # Priors
  a ~ dnorm(0, .01)
  b1 ~ dnorm(0, .01)
  b2 ~ dnorm(0, .01)
  # Hyperpriors
  tau <- pow(sd, -2)
  sd ~ dunif(0, 20)
}
This is just off the top of my head.
For better and more insightful advice on how to deal with missing values, I would recommend paying a visit to stats.stackexchange.

Effective reasonable indexing for numeric vector search?

I have a long numeric table where 7 columns are a key and 4 columns are the values to find.
Actually, I have rendered an object at different distances and perspective angles and calculated Hu moments for its contour, but this is not important to the question, just an example to help picture it.
So, when I have 7 values, I need to scan the table, find the closest values in those 7 columns, and extract the corresponding 4 values.
The aspects of the task to consider are as follows:
1) the numbers have errors
2) the scale in the function domain is not the same as the scale in the function value; i.e. the "distance" from a point in the 7-dimensional space should take into account how much each of the 7 columns affects the 4 values
3) the search should be fast
So the question is: is there some algorithm out there to solve this task efficiently, i.e. perform some indexing on those 7 columns, but not the way conventional databases do it, taking the points above into account?
If I understand the problem correctly, you might consider using scipy.cluster.vq (vector quantization):
Suppose your 7 numeric columns look like this (let's call the array code_book):
import scipy.cluster.vq as vq
import scipy.spatial as spatial
import numpy as np
np.random.seed(2013)
np.set_printoptions(precision=2)
code_book = np.random.random((3,7))
print(code_book)
# [[ 0.68 0.96 0.27 0.6 0.63 0.24 0.7 ]
# [ 0.84 0.6 0.59 0.87 0.7 0.08 0.33]
# [ 0.08 0.17 0.67 0.43 0.52 0.79 0.11]]
Suppose the associated 4 columns of values look like this:
values = np.arange(12).reshape(3,4)
print(values)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
And finally, suppose we have some "observations" of 7-column values like this:
observations = np.random.random((5,7))
print(observations)
# [[ 0.49 0.39 0.41 0.49 0.9 0.89 0.1 ]
# [ 0.27 0.96 0.16 0.17 0.72 0.43 0.64]
# [ 0.93 0.54 0.99 0.62 0.63 0.81 0.36]
# [ 0.17 0.45 0.84 0.02 0.95 0.51 0.26]
# [ 0.51 0.8 0.2 0.9 0.41 0.34 0.36]]
To find the 7-valued row in code_book which is closest to each observation, you could use vq.vq:
index, dist = vq.vq(observations, code_book)
print(index)
# [2 0 1 2 0]
The index values refer to rows in code_book. However, if the rows in values are ordered the same way as code_book, we can "lookup" the associated value with values[index]:
print(values[index])
# [[ 8 9 10 11]
# [ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [ 0 1 2 3]]
The above assumes you have all your observations arranged in an array. Thus, to find all the indices you need only one call to vq.vq.
However, if you obtain the observations one at a time and need to find the closest row in code_book before going on to the next observation, then it would be inefficient to call vq.vq each time. Instead, generate a KDTree once, and then find the nearest neighbor(s) in the tree:
tree = spatial.KDTree(code_book)
for observation in observations:
distances, indices = tree.query(observation)
print(indices)
# 2
# 0
# 1
# 2
# 0
Note that the number of points in your code_book (N) must be large compared to the dimension of the data (e.g. N >> 2**7) for the KDTree to be fast compared to simple exhaustive search.
Using vq.vq or KDTree.query may or may not be faster than exhaustive search, depending on the size of your data (code_book and observations). To find out which is faster, be sure to benchmark these versus an exhaustive search using timeit.
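Regarding point 2 in the question (some key columns mattering more than others): one simple option, sketched below with made-up weights, is to scale the columns before building the tree, which amounts to using a weighted Euclidean distance.

import numpy as np
import scipy.spatial as spatial

# continuing with the code_book and observations arrays defined above;
# a larger weight makes that key column count for more in the distance
weights = np.array([1.0, 1.0, 5.0, 1.0, 1.0, 0.5, 1.0])   # hypothetical per-column weights

tree = spatial.KDTree(code_book * weights)                 # scale the stored keys
distances, indices = tree.query(observations * weights)    # scale the queries the same way
print(indices)  # nearest row in code_book for each observation, as before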
I don't know if I understood your question well, but I will try to give an answer.
For each row K in the table, compute the distance of your key from the key in that row:
( (X1-K1)^2 + (X2-K2)^2 + (X3-K3)^2 + (X4-K4)^2 + (X5-K5)^2 + (X6-K6)^2 + (X7-K7)^2 )^0.5
where {X1, X2, X3, X4, X5, X6, X7} is the key and {K1, K2, K3, K4, K5, K6, K7} is the key at row K.
You can make one factor of the key more or less relevant than the others by multiplying it when computing the distance; for example, you could replace (X1-K1)^2 in the formula above with 5*(X1-K1)^2 to make it more influential.
Store the distance in one variable and the row number in a second variable.
Do the same with the following rows, and if the new distance is lower than the one you stored, replace the distance and the row number.
When you have checked all the rows in your table, the second variable will point to the row nearest to your key.
Here is a rough Python sketch of that:
import math

def distance(key, row_key):
    # plain Euclidean distance between the query key and a row's 7 key columns
    return math.sqrt(sum((x - k) ** 2 for x, k in zip(key, row_key)))

# key: the 7 query values (assumed already filled in)
# table: a list of rows, each holding 7 key columns followed by 4 value columns
closest_distance = float("inf")
closest_row = 0

for row in range(len(table)):
    new_distance = distance(key, table[row][0:7])
    if new_distance < closest_distance:
        closest_distance = new_distance
        closest_row = row

value_found = table[closest_row][7:11]  # the 4 values you were looking for
I know it isn't fast, but it is the best I could do; I hope it helps.
P.S. I haven't considered measurement errors, I know.
