Naive Bayes classifier with intersecting/orthogonal feature sets?

I'm faced with a classification problem that seems to lend itself to a Naive Bayes Classifier (NBC). However, I have one problem: the NBC usually works by estimating the most likely class c out of a set of classes C based on an observation x of a random variable X.
In my case I have multiple variables X1, X2, ... which might or might not share features. Variable X1 could have features (xa,xb,xc), X2 might have (xc,xd), and another variable X3 might have (xe). Is it possible to construct one classifier that allows me to classify X1, X2 and X3 simultaneously, in spite of the fact that the feature sets intersect or are even orthogonal?
The problem could be viewed from another point of view: for certain classes I am missing all data in certain features. Consider the following tables:
Classes = {C1,C2}.
Features = X = {X1,X2,X3}, X1={A,B}, X2={1,2}, X3={Y,N}
Class C1:
X1  X2  X3  #observations
A   1   ?   50
A   2   ?   20
B   1   ?   20
B   2   ?   10

Class C2:
X1  X2  X3  #observations
A   1   Y   20
A   1   N   0
A   2   Y   20
A   2   N   10
B   1   Y   10
B   1   N   20
B   2   Y   10
B   2   N   10
As you can see the feature X3 does not have any bearing on Class C1. No data is available for feature X3 in classifying class C1. Can I make a classifier that classifies X=(A,2,N) into both C1 and C2? How would I calculate the conditional probabilities for the missing data for X3 in class C1?
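One pragmatic option (a sketch of one approach, not the only one) is to let each class's likelihood use only the features that class has training data for, falling back to a uniform distribution over a feature's domain when the class never observed it. The Python below assumes the marginal counts implied by the tables above (e.g. 70 of C1's 100 observations have X1=A) and Laplace smoothing with alpha = 1:

```python
from math import exp, log

# Per-class marginal counts, summed from the tables above.
counts = {
    "C1": {"X1": {"A": 70, "B": 30}, "X2": {"1": 70, "2": 30}},  # no X3 data
    "C2": {"X1": {"A": 50, "B": 50}, "X2": {"1": 50, "2": 50},
           "X3": {"Y": 60, "N": 40}},
}
priors = {"C1": 0.5, "C2": 0.5}             # 100 observations per class
domain = {"X1": 2, "X2": 2, "X3": 2}        # domain sizes for smoothing

def log_score(x, cls, alpha=1.0):
    """log P(c) plus the sum of log P(x_f | c) over the observed features."""
    s = log(priors[cls])
    for feat, val in x.items():
        table = counts[cls].get(feat)
        if table is None:
            # Class never observed this feature: use a uniform 1/|domain|.
            s -= log(domain[feat])
            continue
        n = sum(table.values())
        s += log((table.get(val, 0) + alpha) / (n + alpha * domain[feat]))
    return s

x = {"X1": "A", "X2": "2", "X3": "N"}
scores = {c: log_score(x, c) for c in counts}
z = sum(exp(v) for v in scores.values())
post = {c: exp(v) / z for c, v in scores.items()}  # posterior per class
```

The fallback choice matters: silently skipping the missing feature (multiplying by 1) would bias the comparison toward the class with less data, since its likelihood becomes a product of fewer factors that are each at most 1; the uniform 1/|domain| factor is a more neutral stand-in.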

Related

Excel Count/Sum Based on Multiple Criteria for Same Row

I have a relatively simple problem which I am getting stumped on - perhaps it is this brain fog from Covid. I'll try my best to explain the problem.
Here is a simulated dataset:
A B C D E F G H I J K L M N
1 X1 X2 X3 Y1 Y2 Y3 X1 X2 X3 X1 X2 X3 Ct St
2 1 2 0.2 0 2 0.5 1 2 0.1 2 0.3
3 1 2 0.3 1 1 0.2 1 0.3
4 1 2 0.6 1 2 0.1 1 0.6
5 1 2 1.1 2 0.7 1 0.5 1 1.1
A-N reflects the column names while the first column (1-5) reflects the row names in Excel.
Each column has been labelled as either X (e.g., male) and Y (e.g., female). There are three characteristics for male (X1, X2, X3) and three characteristics for female (Y1, Y2, Y3). We can think of adjacent columns as belonging to a trait (e.g., X1, X2, and X3 in columns A, B and C form a set of male characteristics for trait 1; X1, X2, and X3 in columns G, H and I form a set of similar characteristics but for trait 2, etc.).
For each row, I would like to calculate a count total (Ct, see column M) and sum total (St, see column N) based on a set of conditions.
Count total: Count the number of male (X) traits that feature a "1" for X1 and "2" for X2, giving a 'count total'.
Sum total: Sum the X3 values over male (X) traits that feature a "2" for X2, giving a 'sum total'.
I have manually calculated the count totals and sum totals for each row to make these definitions clearer. In row 1, there are two traits that fulfil the count-total criteria (Ct = 2), whereby their X1 values = 1 and X2 values = 2. Notice that while the X2 value in column H qualifies (X2 = 2), X1 in column G is not equal to 1, so it is not counted. Furthermore, we only sum the X3 values for traits 1 and 3 (e.g., X3 in Column C and X3 in Column L), giving us a total of 0.3 (0.2 + 0.1).
The formulae should ignore sets of values that qualify but are for female traits (e.g., see row 3) and should work across missing values (e.g., in col J, row 4, X1 is missing, so it cannot be counted, even if X2 in col K row 4 features a qualifying value of 2).
I hope that makes sense.
My instinct was to use a SUMPRODUCT formula, but I am struggling to integrate the two conditions, e.g., for each row:
=SUMPRODUCT(((A1:L1="X1")*(A2:L2=1))*((A1:L1="X2")*(A2:L2=2)))
Any guidance would be much appreciated.
I haven't checked this thoroughly, but I suggest the following for Ct:
=SUMPRODUCT((A$1:J$1="X1")*(A2:J2=1)*(B$1:K$1="X2")*(B2:K2=2))
and for St
=SUMPRODUCT((A$1:J$1="X1")*(A2:J2=1)*(B$1:K$1="X2")*(B2:K2=2)*(C$1:L$1="X3")*C2:L2)
copied down.
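The offset-range trick in these formulas can be sanity-checked outside Excel. Here is a small Python sketch (with a hypothetical row of data; None marks a blank cell) that applies the same pairwise tests as the suggested formulas:

```python
# Header row A1:L1 and one data row A2:L2; None marks a blank cell.
header = ["X1", "X2", "X3", "Y1", "Y2", "Y3",
          "X1", "X2", "X3", "X1", "X2", "X3"]
row    = [1, 2, 0.2, 0, 2, 0.5, 1, 2, 0.1, None, 2, 0.3]

# Ct: count positions where a cell labelled X1 holds 1 and the cell
# immediately to its right is labelled X2 and holds 2 -- this is what the
# shifted ranges A:J vs B:K in the SUMPRODUCT do.
ct = sum(
    1
    for (h1, v1), (h2, v2) in zip(zip(header, row), zip(header[1:], row[1:]))
    if h1 == "X1" and v1 == 1 and h2 == "X2" and v2 == 2
)

# St: same pair test, then add the X3 value two cells to the right
# (ranges A:J, B:K, C:L in the SUMPRODUCT).
st = sum(
    row[i + 2]
    for i in range(len(row) - 2)
    if header[i] == "X1" and row[i] == 1
    and header[i + 1] == "X2" and row[i + 1] == 2
    and header[i + 2] == "X3"
)
```

The key point, mirrored from the formulas, is that each criteria range is shifted one column to the right of the previous one, so every product term compares a label cell with its right-hand neighbours rather than with itself; blank cells simply fail the comparison.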

Get the values of 4 unknowns from 2 given equations

I have 2 equations involving a total of 10 values; 6 of these are known and the remaining 4 are unknown. Is there any method in Python which could solve this type of problem?
I am talking about the second and fourth equations. Here all the X values are known and in the fourth equation, it is μ12 instead of μ21.
SymPy can tell you what two of the 4 are in terms of the other two (and known values):
>>> from sympy import var, solve
>>> var('i1 x2d k12 x1 x3 mu12 x2 x4 i2 x4d mu21')
(i1, x2d, k12, x1, x3, mu12, x2, x4, i2, x4d, mu21)
>>> solve(
... (i1*x2d-(k12*(x1-x3)+mu12*(x2-x4)),
... i2*x4d-(k12*(x3-x1)+mu21*(x4-x2))),
... (i1,i2,mu12,mu21))
{i1: mu12*(x2 - x4)/x2d + (k12*x1 - k12*x3)/x2d,
i2: mu21*(-x2 + x4)/x4d + (-k12*x1 + k12*x3)/x4d}

how to interpret coefficients in log-log market mix model

I am running a multivariate OLS regression as below using weekly sales and media data. I would like to understand how to calculate the sales contribution when doing log transforms like log-linear, linear-log and log-log.
For example:
Volume_Sales = b0 + b1.TV_GRP + b2.SocialMedia + b3.PaidSearch + e
In this case, the sales contributed by TV is b1 x TV_GRP (the coefficient multiplied by the TV GRP of that week).
Now, my question is: How do we calculate sales contribution for the below cases:
Log-Linear: ln(Volume_Sales) = b0 + b1.TV_GRP + b2.SocialMedia + b3.PaidSearch + e
Linear-Log: Volume_Sales = b0 + b1.ln(TV_GRP) + b2.ln(SocialMedia) + b3.ln(PaidSearch) + e
Log-Log: ln(Volume_Sales) = b0 + b1.ln(TV_GRP) + b2.ln(SocialMedia) + b3.ln(PaidSearch) + e
In general terms, a log transformation takes something that acts on the multiplicative scale and re-represents it on the additive scale, so that certain mathematical assumptions hold: among them, linearity. So, to step beyond the "transform data we don't like" paradigm that many of us are guilty of, I like to think in terms of "does it make more sense for an effect on this variable to be additive (+3 units) or multiplicative (3 times as much, a 20% reduction, etc.)?" That, together with your diagnostic plots (residual, Q-Q, etc.), will do a good job of telling you what's most appropriate in your case.
As for interpreting coefficients, here are some ways I've seen it done.
Linear: y = b0 + b1x + e
Interpretation: there is an estimated b1-unit increase in the mean of y for every 1-unit increase in x.
Log-linear: ln(y) = b0 + b1x + e
Interpretation: there is an estimated change in the median of y by a factor of exp(b1) for every 1-unit increase in x.
Linear-log: y = b0 + b1ln(x) + e
Interpretation: there is an estimated b1*ln(2)-unit increase in the mean of y when x is doubled.
Log-log: ln(y) = b0 + b1ln(x) + e
Interpretation: there is an estimated change in the median of y by a factor of 2^b1 when x is doubled.
Note: these can be fairly readily derived by considering what happens to y if you replace x with (x+1) or with 2x.
These generic-form interpretations tend to make more sense with a bit of context, particularly once you know the sign of the coefficient. Say you've got a log-linear model with an estimated b1 of -0.3. Exponentiated, this is exp(-0.3)=0.74, meaning that there is an estimated change in the median of y by a factor of 0.74 for every 1-unit increase in x ... or better yet, a 26% decrease.
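The arithmetic in that worked example is quick to verify (a sketch reusing the b1 = -0.3 log-linear case above, plus a hypothetical log-log elasticity of 0.5):

```python
from math import exp, log

# Log-linear: a 1-unit increase in x scales (the median of) y by exp(b1).
b1 = -0.3
factor = exp(b1)                  # multiplicative change per 1-unit increase in x
pct_change = (factor - 1) * 100   # roughly a 26% decrease

# Log-log: doubling x scales y by 2**b1 (hypothetical elasticity of 0.5).
b_loglog = 0.5
doubling_factor = 2 ** b_loglog   # identical to exp(b_loglog * log(2))
```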
Log-linear means an exponential: ln(y) = a x + b is equivalent to y = exp(a x) * exp(b), which is of the form A^x * B. Likewise, a log-log transform gives a power law: ln(y) = a ln(x) + b is of the form y = B * x^a, with B = exp(b).
An exponential will thus be a straight line on a log-linear plot, and a power law will be a straight line on a log-log plot.

SVM in R language

I have two datasets, like below:
1st dataset
x1 x2 types
1 3 1
2 4 1
3 5 1
2nd dataset
x1 x2 types
4 8 -1
2 10 -1
3 12 -1
In the 1st dataset x2 = 2 + x1, and in the 2nd x2 = 2*x1.
How can I train an SVM on this data in R,
so that if I input another point like (2,4) it will be placed in class 2?
You have to tell svm() which column holds the class label that describes your data, what your dataset is, and what parameters you want to use.
For example, using svm() from the e1071 package and supposing all of your data is together in a dataframe called "dataset", you could call:
svm(types ~., data = dataset, kernel = "radial", gamma = 0.1, cost = 1)
To check what parameters are better for your problem you can use tune.svm().
tune.svm(types~., data = dataset, gamma = 10^(-6:-1), cost = 10^(-3:1))

How to do a linear regression in case of incomplete information about output variable

I need to do a linear regression:
y ~ x1 + x2 + x3 + x4
y is not known, but instead of y we have f(y), which depends on y.
For example, y is a probability from 0 to 1 of a binomial distribution over {0, 1},
and instead of y we observe (the number of 0s, the number of 1s) out of (the number of 0s + the number of 1s) experiments.
How should I perform the regression to recover the correct y?
And how should I take into account the amount of information: for some (x1, x2, x3) we have many experiments, which give a high-confidence value of y, but for other (x1, x2, x3) we have a low-confidence value of y due to a small number of measurements?
Sounds like you need something like BUGS (Bayesian inference Using Gibbs Sampling) for the unknown variable y.
It sounds like you might be asking for logistic regression.
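As a concrete sketch of the logistic-regression route (pure Python, with hypothetical counts): transform each observed proportion into an empirical log-odds and run weighted least squares, weighting each point by roughly n*p*(1-p) so that cells backed by many experiments count for more. This is the classical empirical-logit approximation, not a full maximum-likelihood fit:

```python
from math import log

# Hypothetical data: each cell gives a predictor value x, successes k, trials n.
data = [
    (0.0, 12, 100),   # many trials -> high-confidence proportion
    (1.0, 30, 100),
    (2.0, 55, 100),
    (3.0, 4, 5),      # few trials -> low-confidence proportion
]

pts = []
for x, k, n in data:
    p = (k + 0.5) / (n + 1.0)   # continuity-corrected proportion
    z = log(p / (1 - p))        # empirical logit: f^{-1} of the observed rate
    w = n * p * (1 - p)         # approximate inverse variance of z
    pts.append((x, z, w))

# Closed-form weighted least squares for z ~ b0 + b1 * x
sw = sum(w for _, _, w in pts)
xbar = sum(w * x for x, _, w in pts) / sw
zbar = sum(w * z for _, z, w in pts) / sw
b1 = (sum(w * (x - xbar) * (z - zbar) for x, z, w in pts)
      / sum(w * (x - xbar) ** 2 for x, _, w in pts))
b0 = zbar - b1 * xbar
```

A full GLM (e.g. a binomial family in statsmodels, or R's glm with a two-column successes/failures response) does the same job by maximum likelihood, and the BUGS suggestion generalises it to arbitrary f(y).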
