Analyze catagorical variable with a continuous variable - python-3.x

I have a data set with more than 20 columns. there I have a main categorical variable(x) as target. Assume x have 3 levels.
high
moderate
low
So I want to find out is there any relationship between each and every levels of X(high, low, moderate) independently and other variables.
I tried with python. using featurewiz and simple correlation heat map. and tied one hot encoding. but things didn't went good.

Related

How do I deal with 3 categorical IVs, one continuous IV and 4 ordinal DVs?

So, I am writing up a paper for a study I conducted, and I am finding the statistics to be more difficult than I thought. Essentially, I have:
three categorical independent variables, each of which are binary
one categorical variable which is out of 4 points, and
four ordinal DVs, each of which have three levels.
Ideally, what I would want is to be able to run an ANOVA of some sort, but the data do not meet parametric assumptions. I do not know of any tests which would allow me to test all of my IVs and DVs in a single model (the assumptions for linearity, normality, and parametric data are not met). Ideally, I'd also like to run a test where multiple comparisons wouldn't be an issue.
I know that is a lot, so please forgive the long windedness of the question!
Thank you so much for all your help!

Scale before PCA

I'm using PCA from sckit-learn and I'm getting some results which I'm trying to interpret, so I ran into question - should I subtract the mean (or perform standardization) before using PCA, or is this somehow embedded into sklearn implementation?
Moreover, which of the two should I perform, if so, and why is this step needed?
I will try to explain it with an example. Suppose you have a dataset that includes a lot features about housing and your goal is to classify if a purchase is good or bad (a binary classification). The dataset includes some categorical variables (e.g. location of the house, condition, access to public transportation, etc.) and some float or integer numbers (e.g. market price, number of bedrooms etc). The first thing that you may do is to encode the categorical variables. For instance, if you have 100 locations in your dataset, the common way is to encode them from 0 to 99. You may even end up encoding these variables in one-hot encoding fashion (i.e. a column of 1 and 0 for each location) depending on the classifier that you are planning to use. Now if you use the price in million dollars, the price feature would have a much higher variance and thus higher standard deviation. Remember that we use square value of the difference from mean to calculate the variance. A bigger scale would create bigger values and square of a big value grow faster. But it does not mean that the price carry significantly more information compared to for instance location. In this example, however, PCA would give a very high weight to the price feature and perhaps the weights of categorical features would almost drop to 0. If you normalize your features, it provides a fair comparison between the explained variance in the dataset. So, it is good practice to normalize the mean and scale the features before using PCA.
Before PCA, you should,
Mean normalize (ALWAYS)
Scale the features (if required)
Note: Please remember that step 1 and 2 are not the same technically.
This is a really non-technical answer but my method is to try both and then see which one accounts for more variation on PC1 and PC2. However, if the attributes are on different scales (e.g. cm vs. feet vs. inch) then you should definitely scale to unit variance. In every case, you should center the data.
Here's the iris dataset w/ center and w/ center + scaling. In this case, centering lead to higher explained variance so I would go with that one. Got this from sklearn.datasets import load_iris data. Then again, PC1 has most of the weight on center so patterns I find in PC2 I wouldn't think are significant. On the other hand, on center | scaled the weight is split up between PC1 and PC2 so both axis should be considered.

spark - MLlib: transform and manage categorical features

For big datasets with 2bil+ samples and approximately 100+ features per sample. Among these, 10% features you have are numerical/continuous variables and the rest of it are categorical variables (position, languages, url etc...).
Let's use some examples:
e.g: dummy categorical feature
feature: Position
real values: SUD | CENTRE | NORTH
encoded values: 1 | 2 | 3
...would have sense use reduction like SVD because distance beetween sud:north > sud:centre and, moreover, it's possible to encode (e.g OneHotEncoder, StringIndexer) this variable because of the small cardinality of it values-set.
e.g: real categorical feature
feature: url
real values: very high cardinality
encoded values: ?????
1) In MLlibthe 90% of the model works just with numerical values (a part of Frequent Itemset and DecisionTree techniques)
2) Features transformers/reductor/extractor as PCA or SVD are not good for these kind of data, and there is no implementation of (e.g) MCA
a) Which could be your approach to engage with this kind of data in spark, or using Mllib?
b) Do you have any suggestions to cope with this much categorical values?
c) After reading a lot in literature, and counting the implemented model in spark, my idea, about make inference on one of that features using the others (categorical), the models at point 1 could be the best coiche. What do you think about it?
(to standardize a classical use case you can imagine the problem of infer the gender of a person using visited url and other categorical features).
Given that I am a newbie in regards to MLlib, may I ask you to provide a concrete example?
Thanks in advance
Well, first I would say stackoverflow works in a different way, you should be the one providing a working example with the problem you are facing and we help you out using this example.
Anyways I got intrigued with the use of the categorical values like the one you show as position. If this is a categorical value as you mention with 3 levels SUD,CENTRE, NORTH, there is no distance between them if they are truly categorical. In this sense I would create dummy variables like:
SUD_Cat CENTRE_Cat NORTH_Cat
SUD 1 0 0
CENTRE 0 1 0
NORTH 0 0 1
This is a truly dummy representation of a categorical variable.
On the other hand if you want to take that distance into account then you have to create another feature which takes this distance into account explicitly, but that is not a dummy representation.
If the problem you are facing is that after you wrote your categorical features as dummy variables (note that now all of them are numerical) you have very many features and you want to reduce your feature's space, then is a different problem.
As a rule of thumbs I try to utilize the entire feature space first, now a plus since in spark computing power allows you to run modelling tasks with big datasets, if it is too big then I would go for dimensionality reduction techniques, PCA etc...

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
gives you a visual representation of the entire
classification/regression analysis (in the form of a binary tree),
which distinguishes it from any other analytical/statistical
technique that i am aware of;
decision tree algorithms require very little pre-processing on your data, no normalization, no rescaling, no conversion of discrete variables into integers (eg, Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, the tree itself is a visual summary of feature importance ranking
(ie, significant variables)--the most significant variable is the
root node, and is more significant than the two child nodes, which in
turn are more significant than their four combined children. "significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable' or the thing
you are trying to predict). One proviso: from a visual inspection of
a decision tree you cannot distinguish variable significance from
among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees handle both categorical and discrete data usually without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case in would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your r/v is comprised of 'continuous values' from 1 to 100, you might sensibly bin them into 5 bins, 0-20, 21-40, 41-60, and so on.
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note, I have not used the decision tree module in R/M, nor have i used RapidMiner at all.)
The other set of techniques i would consider is usually grouped under the rubric Dimension Reduction. Feature Extraction and Feature Selection are two perhaps the most common terms after D/R. The most widely used is PCA, or principal-component analysis, which is based on an eigen-vector decomposition of the covariance matrix (derived from to your data matrix).
One direct result from this eigen-vector decomp is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course let's you access R inside RapidMiner.R has plenty of PCA libraries (Packages). The ones i mention below are all available on CRAN, which means any of the PCA Packages there satisfy the minimum Package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, i can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but i mentioned it here because it's an excellent tutorial and the two techniques are used for the similar purposes.

Resources