Decorrelating 3 categorical variables - python-3.x

I have a table of 3 categorical variables (salary, face_amount, and area_code) that I am looking to decorrelate from one another. In other words, I'm trying to find how much of some output can be attributed solely to each of these variables. For example, I would want to see how much of the output is due to salary alone, and not to the part of salary that is correlated with face_amount.
I noticed that Multiple Correspondence Analysis (MCA) can decorrelate variables of this type; however, the issue I'm having is that I need the original variables, not the components that MCA produces. I'm unsure how to analyze this kind of problem and would appreciate any help.
Sample of data:
salary     face_amount  area_code
'1-50'     1000         67
'1-50'     500          600
'1-50'     500          600
'51-200'   2000         623
'51-200'   1000         623
'201-500'  500          700
I'm not exactly sure how to go about this kind of problem
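Not an answer to the attribution question itself, but as a reference point, here is a minimal sketch of running MCA on the sample above. It assumes the third-party prince package (pip install prince); any MCA implementation would do.

import pandas as pd
import prince  # third-party MCA implementation (assumed): pip install prince

# Recreate the sample, treating every column as categorical
df = pd.DataFrame(
    {
        "salary": ["1-50", "1-50", "1-50", "51-200", "51-200", "201-500"],
        "face_amount": [1000, 500, 500, 2000, 1000, 500],
        "area_code": [67, 600, 600, 623, 623, 700],
    }
).astype(str)

mca = prince.MCA(n_components=2, random_state=0)
mca = mca.fit(df)

# Row coordinates on the new, mutually orthogonal components.
# These components are combinations of the one-hot-encoded categories,
# which is exactly why they no longer correspond to the original three variables.
print(mca.transform(df))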

Related

Missing data while using fixed effects model panel data

I have panel data consisting of 99 cross-sectional units and 7 time periods. When I estimate the fixed effects model in gretl, only 158 observations, 56 cross-sectional units, and 4 time periods are used. My model has one dependent variable and 5 independent variables. Some values are missing, and I have been told this shouldn't be a problem for linear regression. But gretl removes every observation in which even one variable is missing: even when I have data for all the other 5 variables, the observation is dropped from the model completely because one value is missing. This reduces my dataset significantly. Could you please advise how to fix this?
I'm not very experienced with gretl, so I don't know what to do.
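For what it's worth, the behaviour described is listwise (casewise) deletion: the whole observation is dropped as soon as any model variable is missing. A tiny illustration with made-up data (pandas, not gretl; all values are assumptions):

import numpy as np
import pandas as pd

# Hypothetical mini-panel: one missing regressor value in unit 1, year 2002
df = pd.DataFrame({
    "unit": [1, 1, 2, 2],
    "year": [2001, 2002, 2001, 2002],
    "y":    [3.1, 2.9, 4.0, 4.2],
    "x1":   [1.0, np.nan, 2.0, 2.1],
    "x2":   [0.5, 0.6, 0.7, 0.8],
})

# Listwise deletion: the entire row is dropped because a single value is missing,
# even though y and x2 are observed for that row -- this is what shrinks the panel.
print(df.dropna())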

How to handle a highly unbalanced dataset

I was checking the CERT V4.1 dataset, which was synthesized to simulate insider threats. I realized that it contains about 850K samples, of which only about 200 are labelled as malicious. Is this normal? Am I missing something here? If this is the case, how can I handle such data if I want to use deep learning?
If you have unbalanced data you have many options (see the link below).
In addition to these, there is a really interesting approach that works like this:
1. Randomly split your 850K negative samples into blocks of 200.
2. Build one classifier for every block, training it on all of the positive samples together with one block of the negative samples.
3. Use all classifiers in parallel and let them vote; find a good threshold for how many positive votes you need to be "sure enough" to classify a test sample as positive.
Given that your data is 200 vs. 850K (meaning around 4,250 classifiers), you might consider combining this approach with one of the others, such as the duplication mentioned by #Prune or one of the approaches explained in the link below; a rough sketch of the block-and-vote idea follows the link.
Here you have some approaches dealing with imbalanced data
http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
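A minimal sketch of that block-and-vote ensemble, using synthetic stand-in data and a scikit-learn logistic regression as a placeholder base learner (both are assumptions, not part of the original answer):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_block_ensemble(X_pos, X_neg, block_size=200, random_state=0):
    """Train one classifier per block of negatives, each paired with all positives."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(X_neg))
    models = []
    for start in range(0, len(order), block_size):
        block = X_neg[order[start:start + block_size]]
        X = np.vstack([X_pos, block])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(block))])
        models.append(LogisticRegression(max_iter=1000).fit(X, y))
    return models

def vote(models, X_new, min_fraction=0.5):
    """Flag a sample as positive when at least min_fraction of the models agree."""
    fraction_positive = np.mean([m.predict(X_new) for m in models], axis=0)
    return fraction_positive >= min_fraction

# Tiny synthetic demo standing in for the real 200 vs. 850K data
rng = np.random.default_rng(1)
X_pos = rng.normal(1.0, 1.0, size=(200, 5))
X_neg = rng.normal(0.0, 1.0, size=(5_000, 5))
models = fit_block_ensemble(X_pos, X_neg)          # here: 25 classifiers
print(vote(models, rng.normal(1.0, 1.0, size=(3, 5))))

With the full 850K negatives this trains roughly 4,250 small models, which is why the answer suggests combining it with duplication or one of the other resampling tactics.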
Yes, this is normal in many paradigms: a large majority of the traffic is "normal". You handle this simply by being careful to distribute the rare malicious samples proportionately across your train, test, and validation sets. For instance, if your desired proportions are 50-30-20, make sure that you have about 100 malicious samples in the training set, 60 in testing, and 40 in validation.
If training fails in this paradigm, you can also try adding multiple instances of each malicious sample to each of the sets: duplicate those 100 training records several times. For instance, add 10 copies of each sample to each of the data sets, but still do not let a sample cross from one set to another -- you would then have 1000 malicious samples in the training set, not 10 copies each of the original 200. (A short sketch of this split-then-duplicate idea follows.)
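A minimal sketch of that split-then-duplicate idea on synthetic stand-in data (the array names, sizes, and class proportion are assumptions for illustration only):

import numpy as np
from sklearn.model_selection import train_test_split

def oversample_minority(X, y, n_copies=10):
    """Duplicate minority-class rows within a single split (never across splits)."""
    pos = y == 1
    X_extra = np.repeat(X[pos], n_copies - 1, axis=0)   # originals stay, so 10 copies total
    y_extra = np.ones(len(X_extra), dtype=y.dtype)
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])

# Synthetic stand-in: 0 = normal (vast majority), 1 = malicious (rare)
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.02).astype(int)

# Stratified 50-30-20 split keeps the malicious samples in proportion
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, train_size=0.6, stratify=y_rest, random_state=0)

# Duplicate the malicious samples inside each set independently
X_train, y_train = oversample_minority(X_train, y_train)
X_test, y_test = oversample_minority(X_test, y_test)
X_val, y_val = oversample_minority(X_val, y_val)
print(y_train.sum(), y_test.sum(), y_val.sum())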

How to compare means of two sets when one set is a subset of another and the sample sizes are not equal

I have two sets containing citation counts for some publications. One of the sets is a subset of the other; that is, the subset contains some of the exact citation counts appearing in the larger set. For example:
Set1  Set2 (subset)
50    50
24    24
12    -
5     5
4     4
43    43
2     -
2     -
1     -
1     -
So I want to decide whether the numbers from the subset are good enough to represent Set1. On this matter:
1. I intended to apply a Student's t-test, but I could not be sure how to apply it. The sets are dependent, so I could not use an unpaired t-test, which requires that both sets come from independent populations. On the other hand, a paired t-test also does not look suitable, since it requires equal sample sizes.
2. In case of an outlier, should I remove it? To me that is not logical, since it is not really an outlier: the publication simply happens to be cited a lot, so it belongs to the same sample. How should I deal with such cases? If I do not remove it, it makes the variance very large, which affects the statistical tests. Is it a good idea to use the median instead of the mean, since citation distributions generally tend to be highly skewed?
How could I remedy this issue?

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each person surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS treats each of that young man's database entries as if it represented 28000 measurements when it actually represents just one, which makes SPSS think we have much more data than we actually do. As a result, SPSS gives very low standard error estimates and very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale the weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in SPSS Statistics is treated as a frequency weight.
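To see the mechanics with made-up numbers: treating a sampling weight as a frequency weight makes the effective sample size equal to the sum of the weights, which is what collapses the standard error. A rough illustration in plain numpy (not SPSS; all values are assumed):

import numpy as np

rng = np.random.default_rng(0)
n = 400                                    # actual number of respondents (made up)
y = rng.normal(50, 10, size=n)             # some surveyed measurement
w = rng.uniform(5_000, 30_000, size=n)     # population-expansion weights

def freq_weighted_se(y, w):
    """SE of the mean when the weights are treated as frequency counts."""
    mean = np.average(y, weights=w)
    var = np.average((y - mean) ** 2, weights=w)
    return np.sqrt(var / w.sum())          # 'sample size' = sum of the weights

print(freq_weighted_se(y, w))              # tiny: pretends there are millions of cases
print(freq_weighted_se(y, w / w.mean()))   # weights rescaled to mean 1: treats N as 400

Rescaling the weights to average 1 restores a sensible apparent sample size, but it still ignores the design effect of unequal weights, which is why the Complex Samples procedures are the appropriate tool.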

Randomly select increasing subset of data to see where mean levels off

Could anyone please advise the best way to do the following?
I have three variables (X, Y & Z) and four groups (1, 2, 3 & 4). I have been using discriminant function analysis in SPSS to predict group membership of known grouped data for use with future ungrouped data.
Ideally I would like to be able to randomly sample increasingly large subsets of the data to see how many observations are required to hit a desired correct classification percentage.
However, I understand this might be difficult. Therefore, I'm looking to do this for the means instead.
For example, let's say variable X has a mean of 141 for group 1. This mean might have been calculated from 2000 observations, but the mean might already have stabilised at, say, 700 observations. I would like to be able to calculate at what number of observations/cases the mean levels off in my data: for example, starting at 10 observations and repeating the random draw say 50 or 100 times, then increasing to 20 observations, and so on.
I understand this is a form of Monte Carlo testing. I have access to SPSS 15, 17 and 18 and Excel. I also have access to Minitab 15 & 16 and Amos 17, and I have downloaded R, but I'm not familiar with these. My experience is with SPSS and Excel. I have tried some SPSS syntax modified from http://pages.infinit.net/rlevesqu/Syntax/RandomSampling/Select2CasesFromEachGroup.txt but it would still be quite time consuming on my part to enter the subset sizes and so on.
Hope someone can help.
Thanks for reading.
Andy
The text you linked to is a good start (you can also use the SAMPLE command in SPSS, but IMO the Raynald script you linked to is more flexible for constructing the samples this way).
In pseudo-code, the process might look like:
do n for sample size (a to b)
    loop 100 times
        draw sample of size n
        compute (and save) statistics
Here is where SPSS's macro language comes into play (I think this document is a good introduction, plus you can examine other references on the SPSS tag wiki). Basically once you figure out how to draw the sample and compute the stats you want, you just need to figure out how to write a macro so you can loop through the process (and pass it the sample size parameter). I include the loop 100 times because you want to be able to make some type of estimate about the error associated with each sample size.
If you give an example of how you compute the statistics I may be able to give examples of how to make that into a macro function and loop through the desired number of times.
#Andy W
#Oliver
Thanks for your suggestions guys. I've managed to find a workaround using the following macro from http://www.spsstools.net/Syntax/Bootstrap/GetRandomSampleOfVariousSizeCalcStats.txt However, for this I need to copy and paste the variable data for a given group into a new data window, which is not too much of a problem. To take this further, would anyone know how:
1. I could get other statistics recorded, e.g. standard error, standard deviation, etc.?
2. I could use other analyses, ideally discriminant function analysis, and record in a new data window the percentage of correct classifications rather than producing lots of output tables?
3. I could avoid copying and pasting the variables for each group, so I can just run the macro specifying n samples of variable X for groups 1, 2, 3 & 4?
Thanks again.
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
* myvar = the variable of interest (here we want the mean of salary).
* nbsampl = the number of samples.
* size = the size of each sample.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='c:\Program Files\SPSS\employee data.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='c:\temp\sample.sav'.
!IFEND
SAVE OUTFILE='c:\temp\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=salary BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
* ----------------END OF MACRO ----------------------------------------------.
* Call macro (parameters are the number of samples (here 20) and the sizes of samples (here 5, 10, 15, 30, 50)).
* Thus 20 samples of size 5.
* Thus 20 samples of size 10, etc.
!sample myvar=salary nbsampl=20 size= 5 10 15 30 50.
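As a side note, outside the SPSS route this thread uses: the same draw-increasing-samples-and-record-statistics idea is compact in Python, and it makes it straightforward to record extra statistics per draw (point 1 above) and to loop over groups without copying and pasting (point 3). Everything below is made-up stand-in data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def resample_stats(values, sizes=(10, 20, 50, 100, 200, 500), n_reps=100):
    """For each sample size, draw n_reps random subsamples without replacement
    and record summary statistics, to see where the mean levels off."""
    rows = []
    for n in sizes:
        for rep in range(n_reps):
            sample = rng.choice(values, size=min(n, len(values)), replace=False)
            rows.append({"sample_size": n, "rep": rep,
                         "mean": sample.mean(),
                         "sd": sample.std(ddof=1),
                         "se": sample.std(ddof=1) / np.sqrt(len(sample))})
    return pd.DataFrame(rows)

# Made-up stand-in for the real data: variable X measured in four groups
df = pd.DataFrame({"group": rng.integers(1, 5, 2000),
                   "X": rng.normal(141, 30, 2000)})

frames = []
for g, sub in df.groupby("group"):
    stats = resample_stats(sub["X"].to_numpy())
    stats["group"] = g
    frames.append(stats)
results = pd.concat(frames, ignore_index=True)

# The spread of the subsample means shrinks as the sample size grows
print(results.groupby(["group", "sample_size"])["mean"].agg(["mean", "std"]))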
