How to compare lme models with different fixed variables?

Suppose you have two linear mixed effects models with the same response variable and the same dataset, but with different fixed effects:
model 1: a * b
model 2: c * d
Can you compare them using AIC? I am using statsmodels in Python, and I know how to calculate AIC; I just could not find information on whether AIC is the right choice here.
Thank you!
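For reference, here is a minimal sketch of what such a comparison could look like in statsmodels; the DataFrame df, the response y, the predictors a, b, c, d, and the grouping column group are all hypothetical stand-ins. One important caveat: the models must be fit by maximum likelihood rather than REML, since REML likelihoods are not comparable across models with different fixed effects.
import statsmodels.formula.api as smf

# Hypothetical data: df with response y, predictors a, b, c, d,
# and a grouping column `group` for the random intercept.
# reml=False is essential here: REML likelihoods are not comparable
# across models with different fixed-effects structures.
m1 = smf.mixedlm("y ~ a * b", df, groups=df["group"]).fit(reml=False)
m2 = smf.mixedlm("y ~ c * d", df, groups=df["group"]).fit(reml=False)

print(m1.aic, m2.aic)  # lower AIC suggests the better-fitting model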

Related

Determining the Distance between two matrices using numpy

I am developing my own architecture search algorithm using Python's numpy. Currently I am trying to develop a cost function that measures the distance between X and Y, two matrices.
I'd like to reduce the difference between the two, to a meaningful scalar value.
Ideally between 0 and 1, so that if both sets of elements within the matrices are the same numerically and positionally, a 0 is returned.
In the example below, I have the output of my algorithm, X. Both X and Y have the same shape. I tried summing the difference between the two matrices; however, I'm not sure summation will work in all conditions, since positive and negative differences can cancel each other out. I also tried returning the mean. I don't think either approach will work. Aside from looping through both matrices and comparing elements directly, is there a way to capture the degree of difference in a scalar?
import numpy as np

Y = np.arange(25).reshape(5, 5)
for i in range(1000):
    X = algorithm(Y)  # the search algorithm under development
    # Try to reduce the difference between the two matrices to a scalar value
    cost = np.sum(X - Y)  # note: positive and negative errors cancel out here
There are many ways to calculate a scalar "difference" between two matrices. Here are just two examples.
The root mean square error:
((m1 - m2) ** 2).mean() ** 0.5
The max absolute error:
np.abs(m1 - m2).max()
The choice of the metric depends on your problem.
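To make these concrete, here is a small self-contained sketch; the noise scale and the normalisation by the largest absolute value of Y are illustrative choices, not part of the original answer. Both metrics return 0 when the matrices are numerically and positionally identical, as the question asks.
import numpy as np

def rmse(m1, m2):
    # Root mean square error between two equal-shaped arrays
    return np.sqrt(((m1 - m2) ** 2).mean())

def max_abs_error(m1, m2):
    # Largest absolute element-wise difference
    return np.abs(m1 - m2).max()

Y = np.arange(25).reshape(5, 5)
X = Y + np.random.default_rng(0).normal(scale=0.1, size=Y.shape)

print(rmse(X, Y))                    # small average error
print(max_abs_error(X, Y))           # worst single element
print(rmse(X, Y) / np.abs(Y).max())  # one ad-hoc way to scale toward [0, 1]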

Issues with OLS Regression - highly similar X and Intercept coefficients

I'm estimating a linear OLS regression using some software, and I have three variables: Y (dependent), X1 (independent), and Intercept (a column of "1"s I manually created). I created Intercept because this particular software doesn't have a function to add a constant term.
The coefficients of X and Intercept are nearly equal in magnitude but opposite in sign (e.g. Intercept coefficient = 1.5 and X coefficient = -1.51). Both Y and X are columns of very small percentage changes (on the order of 0.0001). I've tried adding some other independent variables and quickly ran into multicollinearity issues - though I'm not sure if that's simply because the variables are highly similar.
I'm not very experienced with stats - are these coefficients a dead giveaway of statistical issues with the regression? Any advice is much appreciated, thank you!
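There is no answer in the thread, but as a point of reference, here is a hedged diagnostic sketch in statsmodels (an assumption, since the asker's software is unnamed; the data are synthetic stand-ins for the situation described). The condition number reported by OLS is one quick multicollinearity check.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(scale=0.0001, size=200)                   # tiny percentage changes
y = 1.5 - 1.51 * x + rng.normal(scale=0.0001, size=200)  # mimics the coefficients described

X = sm.add_constant(x)       # adds the column of 1s, no manual Intercept needed
res = sm.OLS(y, X).fit()
print(res.params)            # const and x coefficients
print(res.condition_number)  # very large values flag multicollinearity / scaling issues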

sklearn decision tree classifier: How to control max number of branches of each split

I am trying to code a two-class classification decision tree problem that I previously built in SAS EM, but now in sklearn. The target variable is a two-class categorical variable, and there are a few continuous independent variables. In SAS I could specify the "Maximum Number of Branches" for each split, so when it is set to 4, some nodes split into 2 branches and some into 4 (especially for continuous variables). I could not find an equivalent parameter in sklearn. I looked at "max_leaf_nodes", but that controls the total number of leaf nodes of the entire tree. I am sure some of you have faced the same situation and already found a solution. Please help/share; I will really appreciate it.
I don't think this option is available in sklearn, whose decision trees only make binary splits, but you will find this post very useful for your classification DT, as it lists all the options you have available.
I would recommend creating bins for your continuous variables; this way you force the splits to follow the bins you created.
Example: for a continuous variable Col1 with values between 1-100, you can create 4 bins: 1-25, 26-50, 51-75, 76-100. Or you can create the bins based on the median.
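A minimal sketch of that binning idea, assuming pandas and sklearn; the column names and toy data are made up for illustration. Note that sklearn will still split the binned values with binary thresholds - binning just limits where those thresholds can fall.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "col1": [5, 30, 55, 80, 12, 67, 90, 40],  # continuous, 1-100
    "target": [0, 0, 1, 1, 0, 1, 1, 0],       # two-class target
})

# Fixed-width bins 1-25, 26-50, 51-75, 76-100, as suggested above
df["col1_binned"] = pd.cut(df["col1"], bins=[0, 25, 50, 75, 100], labels=False)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(df[["col1_binned"]], df["target"])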

modelling multiplicative relationships with categorical data

If I want to create a model that best describes the price of an asset using a multiplicative relationship, that is,
Price = base_rate * size_of_asset * number_of_subassets
(size of asset, number of subassets are both 0,1,2,3... N)
can I do this with a linear combination when the variables are categorical? If they were numerical I could log everything, which would do exactly that... however, the same approach can't be applied to categorical data, can it?
NB: I want to keep it as a multiplicative relationship so it's highly interpretable from a ratio perspective - that is, one can say that increasing size_of_asset by 30% increases the price by a fixed percentage.
Thanks for the advice!
I think a log-linear model might be your solution, as it can help you analyse the multiplicative effects of one or more categorical independent variables on a categorical dependent variable.
Check this out:
http://members.home.nl/jeroenvermunt/esbs2005c.pdf
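Here is a hedged sketch of how the log-linear idea could look with a continuous price and categorical predictors in statsmodels; the synthetic data and column names are made up. Exponentiating the fitted coefficients recovers the multiplicative ratios the question asks for.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "size_of_asset": rng.integers(1, 4, size=200),
    "number_of_subassets": rng.integers(1, 4, size=200),
})
# Price generated multiplicatively: base_rate * size * subassets * noise
df["price"] = (100.0 * df["size_of_asset"] * df["number_of_subassets"]
               * rng.lognormal(sigma=0.05, size=200))

# C(...) treats each predictor as categorical: every level gets its own effect
model = smf.ols("np.log(price) ~ C(size_of_asset) + C(number_of_subassets)",
                data=df).fit()
print(np.exp(model.params))  # multiplicative ratios relative to the baseline level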

spark - MLlib: transform and manage categorical features

Consider big datasets with 2 billion+ samples and approximately 100+ features per sample. Among these, 10% of the features are numerical/continuous variables and the rest are categorical variables (position, languages, URL, etc.).
Let's use some examples:
e.g: dummy categorical feature
feature: Position
real values: SUD | CENTRE | NORTH
encoded values: 1 | 2 | 3
...it would make sense to use a reduction like SVD, because the encoding implies a distance (sud:north > sud:centre); moreover, it's possible to encode this variable (e.g. OneHotEncoder, StringIndexer) because of the small cardinality of its value set.
e.g: real categorical feature
feature: url
real values: very high cardinality
encoded values: ?????
1) In MLlib, roughly 90% of the models work only with numerical values (apart from Frequent Itemset and DecisionTree techniques).
2) Feature transformers/reducers/extractors such as PCA or SVD are not well suited to this kind of data, and there is no implementation of (e.g.) MCA.
a) Which could be your approach to engage with this kind of data in spark, or using Mllib?
b) Do you have any suggestions to cope with this much categorical values?
c) After reading a lot of the literature and counting the models implemented in Spark, my idea is that, to make inference on one of those features using the other (categorical) ones, the models at point 1 could be the best choice. What do you think about it?
(As a standard use case, you can imagine the problem of inferring the gender of a person from visited URLs and other categorical features.)
Given that I am a newbie with regard to MLlib, may I ask you to provide a concrete example?
Thanks in advance
Well, first I would say Stack Overflow works in a different way: you should be the one providing a working example of the problem you are facing, and we help you out using that example.
Anyway, I was intrigued by the use of categorical values like the one you show as position. If this is a categorical value, as you mention, with 3 levels SUD, CENTRE, NORTH, there is no distance between them if they are truly categorical. In this sense I would create dummy variables like:
        SUD_Cat  CENTRE_Cat  NORTH_Cat
SUD        1         0           0
CENTRE     0         1           0
NORTH      0         0           1
This is a true dummy representation of a categorical variable.
On the other hand, if you want to take that distance into account, then you have to create another feature which encodes it explicitly, but that is not a dummy representation.
If the problem you are facing is that after you write your categorical features as dummy variables (note that now all of them are numerical) you have very many features and you want to reduce your feature space, then that is a different problem.
As a rule of thumb, I try to utilize the entire feature space first - a plus in Spark, since its computing power allows you to run modelling tasks on big datasets. If it is too big, then I would go for dimensionality reduction techniques such as PCA.
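For concreteness, here is a minimal PySpark sketch of that dummy-variable (one-hot) idea, using the Spark 3.x StringIndexer/OneHotEncoder API; the toy data are illustrative. For very high-cardinality features like URLs, feature hashing (e.g. pyspark.ml.feature.FeatureHasher) is a common alternative, since a one-hot vector per URL would be enormous.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("SUD",), ("CENTRE",), ("NORTH",), ("SUD",)], ["position"]
)

# StringIndexer maps each level to an index; OneHotEncoder turns the
# index into a sparse dummy vector, so no artificial ordering remains.
# dropLast=False keeps one indicator per level, matching the table above.
indexer = StringIndexer(inputCol="position", outputCol="position_idx")
encoder = OneHotEncoder(inputCols=["position_idx"], outputCols=["position_vec"],
                        dropLast=False)

encoded = Pipeline(stages=[indexer, encoder]).fit(df).transform(df)
encoded.show(truncate=False)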
