SVM and cross validation - svm

The problem is as follows. When I do support vector machine training, suppose I have already performed cross validation on 10000 training points with a Gaussian kernel and have obtained the best parameter C and \sigma. Now I have another 40000 new training points and since I don't want to waste time on cross validation, I stick to the original C and \sigma that I obtained from the first 10000 points, and train the entire 50000 points on these parameters. Is there any potentially major problem with this? It seems that for C and \sigma in some range, the final test error wouldn't be that bad, and thus the above process seems okay.

There is one major pitfal of such appraoch. Both C and sigma are data dependant. In particular, it can be shown, that optimal C strongly depends on the size of the training set. So once you make your training data 5 times bigger, even if it brings no "new" knowledge - you should still find new C to get the exact same model as before. So, you can do such procedure, but keep in mind, that best parameters for smaller training set do not have to be the best for the bigger one (even though, they sometimes still are).
To better see the picture. If this procedure would be fully "ok" than why not fit C on even smaller data? 5 times? 25 times smaller? Mayone on one single point per class? 10,000 may seem "a lot", but it depends on the problem considered. In many real life domains this is just a "regular" (biology) or even "very small" (finance) dataset, so you won't be sure, if your procedure is fine for this particular problem until you test it.

Related

The bounding box's position and size is incorrect, how to improve it's accuracy?

I'm using detectron2 for solving a segmentation task,
I'm trying to classify an object into 4 classes,
so I have used COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml.
I have applied 4 kind of augmentation transforms and after training I get about 0.1
total loss.
But for some reason the accuracy of the bbox is not great on some images on the test set,
the bbox is drawn either larger or smaller or doesn't cover the whole object.
Moreover sometimes the predictor draws few bboxes, it assumes there are few different objects although there is only a single object.
Are there any suggestions how to improve it's accuracy?
Are there any good practice approaches how to resolve this issue?
Any suggestion or reference material will be helpful.
I would suggest the following:
Ensure that your training set has the object you want to detect in all sizes: in this way, the network learns that the size of the object can be different and less prone to overfitting (detector could assume your object should be only big for example).
Add data. Rather than applying all types of augmentations, try adding much more data. The phenomenon of detecting different objects although there is only one object leads me to believe that your network does not generalize well. Personally I would opt for at least 500 annotations per class.
The biggest step towards improvement will be achieved by means of (2).
Once you have a decent baseline, you could also experiment with augmentations.

Under what conditions can two classes have different average values, yet be indistinguishable to a SVM?

I am asking because I have observed sometimes in neuroimaging that a brain region might have different average activation between two experimental conditions, but sometimes an SVM classifier somehow can't distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.
As described in the comments - there are no such conditions because SVM will separate data just fine (I am not talking about any generalisation here, just separating training data). For the rest of the answer I am assuming there are no two identical points with different labels.
Non-linear case
For a kernel case, using something like RBF kernel, SVM will always perfectly separate any training set, given that C is big enough.
Linear case
If data is linearly separable then again - with big enough C it will separate data just fine. If data is not linearly separable, cranking up C as much as possible will lead to smaller and smaller training error (of course it will not get 0 since data is not linearly separable).
In particular for the data you provided kernelized SVM will get 100%, and any linear model will get 50%, but it has nothing to do with means being different or variances relations - it is simply a dataset where any linear separator has at most 50% accuracy, literally every decision point, thus it has nothing to do with SVM. In particular it will separate them "in the middle", meaning that the decision point will be somewhere around "5".

How to handle a highly unbalanced dataset

I was checking the dataset CERT V4.1 which was synthesized to simulate insider threats. I realized that it contains about 850K samples and there are about 200 samples considered as malicious data. Is this normal? am I missing something here? If this is the case, how can I handle such data if I want to use deep learning?
If you have unbalanced Data you have many options (see link below).
Additional to these there is a really interesting approach that works like this:
1: you randomly split your 850K negative samples in blocks of 200
2: you build one classifier for every block where you put all positive samples in together with one block of the negative samples
3: Use all classifiers in paralell and let them vote, find a good threshold of how many positive votes you need to be "sure enough" to classify the test sample as positive
Regarding that your data is 200 vs 850K (meaning around 4250 Classifiers) you might consider to combine this approach with one of the others, like duplicating mentioned by #Prune or one of the approaches explained in the link below.
Here you have some approaches dealing with imbalanced data
http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Yes, this is normal in many paradigms: a large majority of the traffic is "normal". You handle this simply by being careful to distribute the negative samples proportionately in your train, test, and validation sets. For instance, if your desired proportions are 50-30-20, make sure that you have about 100 malicious samples in the training set, 60 in testing, and 20 in validation.
If the training fails in this paradigm, you can also try adding multiple instances of each malicious sample to each of the sets: duplicate those 100 records several times; for instance, add 10 copies of each sample to each of the data sets (but still do not cross from one set to another -- you would now have 1000 malicious samples in the training set, not 10 copies each of the original 200).

How to interpret some syntax (n.adapt, update..) in jags?

I feel very confused with the following syntax in jags, for example,
n.iter=100,000
thin=100
n.adapt=100
update(model,1000,progress.bar = "none")
Currently I think
n.adapt=100 means you set the first 100 draws as burn-in,
n.iter=100,000 means the MCMC chain has 100,000 iterations including the burn-in,
I have checked the explanation for this question a lot of time but still not sure whether my interpretation about n.iter and n.adapt is correct and how to understand update() and thinning.
Could anyone explain to me?
This answer is based on the package rjags, which takes an n.adapt argument. First I will discuss the meanings of adaptation, burn-in, and thinning, and then I will discuss the syntax (I sense that you are well aware of the meaning of burn-in and thinning, but not of adaptation; a full explanation may make this answer more useful to future readers).
Burn-in
As you probably understand from introductions to MCMC sampling, some number of iterations from the MCMC chain must be discarded as burn-in. This is because prior to fitting the model, you don't know whether you have initialized the MCMC chain within the characteristic set, the region of reasonable posterior probability. Chains initialized outside this region take a finite (sometimes large) number of iterations to find the region and begin exploring it. MCMC samples from this period of exploration are not random draws from the posterior distribution. Therefore, it is standard to discard the first portion of each MCMC chain as "burn-in". There are several post-hoc techniques to determine how much of the chain must be discarded.
Thinning
A separate problem arises because in all but the simplest models, MCMC sampling algorithms produce chains in which successive draws are substantially autocorrelated. Thus, summarizing the posterior based on all iterations of the MCMC chain (post burn-in) may be inadvisable, as the effective posterior sample size can be much smaller than the analyst realizes (note that STAN's implementation of Hamiltonian Monte-Carlo sampling dramatically reduces this problem in some situations). Therefore, it is standard to make inference on "thinned" chains where only a fraction of the MCMC iterations are used in inference (e.g. only every fifth, tenth, or hundredth iteration, depending on the severity of the autocorrelation).
Adaptation
The MCMC samplers that JAGS uses to sample the posterior are governed by tunable parameters that affect their precise behavior. Proper tuning of these parameters can produce gains in the speed or de-correlation of the sampling. JAGS contains machinery to tune these parameters automatically, and does so as it draws posterior samples. This process is called adaptation, but it is non-Markovian; the resulting samples do not constitute a Markov chain. Therefore, burn-in must be performed separately after adaptation. It is incorrect to substitute the adaptation period for the burn-in. However, sometimes only relatively short burn-in is necessary post-adaptation.
Syntax
Let's look at a highly specific example (the code in the OP doesn't actually show where parameters like n.adapt or thin get used). We'll ask rjags to fit the model in such a way that each step will be clear.
n.chains = 3
n.adapt = 1000
n.burn = 10000
n.iter = 20000
thin = 50
my.model <- jags.model(mymodel.txt, data=X, inits=Y, n.adapt=n.adapt) # X is a list pointing JAGS to where the data are, Y is a vector or function giving initial values
update(my.model, n.burn)
my.samples <- coda.samples(my.model, params, n.iter=n.iter, thin=thin) # params is a list of parameters for which to set trace monitors (i.e. we want posterior inference on these parameters)
jags.model() builds the directed acyclic graph and then performs the adaptation phase for a number of iterations given by n.adapt.
update() performs the burn-in on each chain by running the MCMC for n.burn iterations without saving any of the posterior samples (skip this step if you want to examine the full chains and discard a burn-in period post-hoc).
coda.samples() (from the coda package) runs the each MCMC chain for the number of iterations specified by n.iter, but it does not save every iteration. Instead, it saves only ever nth iteration, where n is given by thin. Again, if you want to determine your thinning interval post-hoc, there is no need to thin at this stage. One advantage of thinning at this stage is that the coda syntax makes it simple to do so; you don't have to understand the structure of the MCMC object returned by coda.samples() and thin it yourself. The bigger advantage to thinning at this stage is realized if n.iter is very large. For example, if autocorrelation is really bad, you might run 2 million iterations and save only every thousandth (thin=1000). If you didn't thin at this stage, you (and your RAM) would need to manipulate an object with three chains of two million numbers each. But by thinning as you go, the final object only has 2 thousand numbers in each chain.

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.

Resources