dimension reduction makes data non-linearly separable - svm

I am working on a project to classify hearing disorders using an SVM. I collected real data from the UCI repository (http://archive.ics.uci.edu/ml/machine-learning-databases/audiology/) and initially decided to use two classes: patients with normal ears and patients with any disorder. Varying the regularization parameter C from 0.1 to 10, I get one misclassification between the two classes (at C=10).
However, I want to plot the data together with the decision boundary, but the data set has around 68 features, so it cannot be plotted directly. I used PCA to reduce it to 2D and trained an SVM on the reduced data to see the results. After PCA, however, the data is no longer linearly separable, and a linear decision boundary cannot separate the 2D PCA data. Is there a way to reduce the dimension while retaining the nature of the data (that is, its separability by a linear decision boundary)? Can anyone please help me?
Thanks
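One possible way to keep linear separability visible in 2D, sketched here under assumptions, is to choose the projection from the fitted classifier instead of from PCA alone: use the unit normal of the SVM hyperplane as the first axis and the direction of largest remaining variance as the second. The synthetic 68-feature data, the seed, and all variable names below are illustrative stand-ins, not the audiology data.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for a 68-feature, two-class data set (an assumption).
rng = np.random.default_rng(0)
n_samples, n_features = 200, 68
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = (X @ true_w > 0).astype(int)

# Fit the linear SVM in the full feature space.
clf = SVC(kernel="linear", C=10.0).fit(X, y)
w = clf.coef_[0]
b = clf.intercept_[0]
w_norm = np.linalg.norm(w)
u = w / w_norm  # unit normal of the separating hyperplane

# Axis 1: signed position of each sample along the hyperplane normal.
z1 = X @ u

# Axis 2: direction of largest variance after removing the axis-1 component.
residual = X - np.outer(z1, u)
residual_centered = residual - residual.mean(axis=0)
_, _, vt = np.linalg.svd(residual_centered, full_matrices=False)
v = vt[0]
z2 = residual @ v

Z = np.column_stack([z1, z2])  # 2D coordinates for plotting

# In this view the decision boundary is the vertical line z1 = -b / ||w||.
boundary_x = -b / w_norm
pred_2d = (w_norm * Z[:, 0] + b > 0).astype(int)
print("boundary at z1 =", boundary_x)
```

Because the first axis is exactly the direction the classifier uses, the 2D picture has the same errors as the full model: whatever the SVM separates in 68 dimensions stays separated in the plot. The second axis is only there to spread the points out.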

Related

Using discretization before or after splitting data?

I am new to data mining concepts and have a question regarding implementation of a technique.
I am using a dataset with large continuous values.
I am trying to code an algorithm for which I need to discretize the data (not scale it, since scaling has no impact here and the algorithm is not distance-based).
For discretization, I have the same question that comes up with scaling and the train/test split.
For scaling, I know we should split the data first, then fit and transform on the training set, and transform the test set using what was fitted on the training set.
But what do we do for discretization? I am using scikit-learn's KBinsDiscretizer and trying to work out whether I should split first and then discretize, the same way we normally scale, or discretize first and then split.
The issue came up because I used 17 bins with the uniform strategy (value range 0-16).
With split then discretize, I get the full 0-16 range in train but not in test.
With discretize then split, I get the 0-16 range in both.
With the former strategy my accuracy is around 85%, but with the latter it is a whopping 97%, which leads me to believe I have definitely overfit the data.
Please advise on what I should be doing for discretization and whether my interpretation of the results is correct.
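For reference, the "split first, then discretize" variant described above can be sketched as follows. The bin edges are learned from the training set only and then reused on the test set, mirroring the usual scaling workflow. The random data, the seed, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic continuous data standing in for the real dataset (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=1000.0, scale=250.0, size=(500, 3))

# Split first ...
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# ... then learn the bin edges on the training set only.
disc = KBinsDiscretizer(n_bins=17, encode="ordinal", strategy="uniform")
X_train_binned = disc.fit_transform(X_train)

# The test set is binned with the edges learned from the training set.
X_test_binned = disc.transform(X_test)

print("train bin range:", X_train_binned.min(), X_train_binned.max())
print("test bin range:", X_test_binned.min(), X_test_binned.max())
```

With uniform bins fitted on the training set, the training data always spans bins 0 to 16, while the test data need not reach the extreme bins. That matches the observation in the question and is not by itself a sign of a bug.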

Why do more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forest module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features seem to make it less accurate.
I suspect that splitting might only be done across one dimension, which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes. The additional features you added might not have good predictive power, and because a random forest draws random subsets of features when building each tree, the original 50 features may be passed over more often. To test this hypothesis, you can plot the variable importances using sklearn.
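A minimal sketch of that check, assuming synthetic data in which only the first two columns carry signal and the rest are pure noise (the data, seed, and names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Two informative columns plus ten pure-noise columns (an assumption).
rng = np.random.default_rng(0)
n_samples = 1000
informative = rng.normal(size=(n_samples, 2))
noise = rng.normal(size=(n_samples, 10))
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one value per feature, normalized to sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first

for idx in ranking[:5]:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

If the newly added features dominate the bottom of such a ranking, that supports the idea that they add noise instead of signal. The same values can be passed to a bar plot if a figure is preferred.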
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting. For instance, this 2D plot shows the different functions that could be learned for a binary classification task. Because the function on the right has too many parameters, it learns wrong patterns in the data that don't generalize properly.

LSTM prediction how to incorporate multiple autocorrelation

I am working on a project which aims at predicting a highly autocorrelated time series. An LSTM seems ideal for my purpose. However, does anyone know how I can incorporate multiple strong autocorrelations into my prediction network? For example, there is a very strong yearly correlation and a seasonal correlation; how can I include this information in the LSTM network?
Thank you sincerely
If there is autocorrelation, the correlation is linear (not nonlinear), because the common autocorrelation measures test for linear correlation. Any LSTM can capture such linear correlations by default; it does not matter how many linear correlations there are in the time series, the LSTM will capture them. A problem could be the length of memory: an LSTM has a memory of roughly 200 to 500 time steps ( https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/ ), so if the long-term linear correlations in the time series span more than this, the LSTM will not be able to capture them because it lacks the memory (not physical computer memory, but the memory in the structure of the LSTM).
So simply build the LSTM model in Keras and let it predict,
as Upasana Mittal said in his comment, cf. http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html
Updated answer, because there is not enough space in the comments. In http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html a lagged time series is used to determine the ACF. This is objective; otherwise it would be impossible to determine the ACF:
First, we need to review the Autocorrelation Function (ACF), which is
the correlation between the time series of interest in lagged versions
of itself. The acf() function from the stats library returns the ACF
values for each lag as a plot. However, we’d like to get the ACF
values as data so we can investigate the underlying data. To do so,
we’ll create a custom function, tidy_acf(), to return the ACF values
in a tidy tibble.
No specially lagged time series is used as input there, and using the history of the system, or past system states, to predict future system states is also an objective ansatz and essential in any RNN.
So the way of proceeding in http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html is objective.
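To make the ACF discussed above concrete, here is a small pure-Python sketch that computes the sample autocorrelation of a series with a 12-step cycle. The sine series and the period of 12 (think monthly data with a yearly cycle) are assumptions for illustration, and `acf` is a hypothetical helper, not the `tidy_acf()` function from the linked article.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

# Toy series: 20 full cycles of a sine wave with period 12 (an assumption).
series = [math.sin(2 * math.pi * t / 12) for t in range(240)]
acf_values = acf(series, max_lag=24)

for lag in (0, 6, 12, 24):
    print(f"lag {lag:2d}: {acf_values[lag]:+.3f}")
```

A strong positive value at lag 12 and a strong negative value at lag 6 are what a yearly cycle in monthly data would look like. Such peaks indicate how many past time steps the model's input window would need to cover.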
Another point you could mean is the stateful mode. It is vital that you use it, because only in stateful mode are the samples not shuffled, which increases accuracy. Stateless neural nets work on probability distributions, and shuffling a probability distribution does not change it (permutation invariance); stateful neural nets include the sequential ordering of the data, so shuffling changes the result. Search the net for 'shuffling multifractal data':
In normal (or “stateless”) mode, Keras shuffles the samples, and the
dependencies between the time series and the lagged version of itself
are lost. However, when run in “stateful” mode, we can often get high
accuracy results by leveraging the autocorrelations present in the
time series.
LSTMs by definition use a time series and a lagged version of the time series (timesteps,...), so this is also an objective ansatz.
If you want to dig deeper into the matter and go beyond the linear correlations captured by the ACF, you should learn about nonlinear dynamical systems (chaos theory, fractality, multifractality), because they involve nonlinear systems and nonlinear correlations. For example, the lag plot of a time series from a nonlinear dynamical system in its chaotic state always exhibits the kind of nonlinearity involved: the lag plot of the logistic map in its chaotic region shows a parabola, the lag plot of a cubic nonlinear map shows a cubic curve, and so on. RNNs can only model or approximate accurately those systems whose lag plot shows a sufficiently simple structure (circles, spirals, lemniscates, cubic curves, quadratic curves, ...). For a neural net it is impossible, for instance, to approximate the sequence of prime gaps, because the lag plot of that sequence is structured in too complex a way (however, it shows a clear pattern for lag = 1 when the sequential ordering is neglected).
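The logistic-map remark can be checked with a few lines of Python. The sketch below generates a trajectory in the chaotic region and builds the lag-1 pairs that a lag plot would display; the parameter r = 3.9, the start value, and the names are assumptions chosen for illustration.

```python
# Logistic map x_{t+1} = r * x_t * (1 - x_t) in its chaotic region.
r = 3.9          # assumed parameter value
x = 0.2          # assumed start value
trajectory = [x]
for _ in range(500):
    x = r * x * (1.0 - x)
    trajectory.append(x)

# Lag-1 pairs (x_t, x_{t+1}): these are the points of the lag plot.
lag_pairs = list(zip(trajectory[:-1], trajectory[1:]))

# Largest vertical distance of any lag-plot point from the parabola r*x*(1-x).
max_deviation = max(abs(nxt - r * cur * (1.0 - cur)) for cur, nxt in lag_pairs)
print("max deviation from the parabola:", max_deviation)
```

Although the trajectory itself looks irregular, every point of the lag plot lies on the same parabola. That is the "sufficiently simple structure" the paragraph refers to.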

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would just be to reshape the dataset so that it is 2-dimensional, however I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data, reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data, throw in the raw data and see if it works. If you have an idea of which properties of the data may be beneficial for classification, try to work them into a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
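As a rough sketch of using the raw data directly, the 3D array can be flattened so that each sample becomes one row of n_timepoints * 2 values and then passed to svm.SVC(). The synthetic "straight" versus "wavy" swipes, the noise level, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic swipes: 60 samples, 25 time points, 2D positions (an assumption).
rng = np.random.default_rng(0)
n_samples, n_timepoints, d = 60, 25, 2
t = np.linspace(0.0, 1.0, n_timepoints)
y = np.arange(n_samples) % 2  # class 0: straight swipe, class 1: wavy swipe

X3d = np.empty((n_samples, n_timepoints, d))
X3d[:, :, 0] = t  # x position moves steadily to the right
X3d[:, :, 1] = np.where(y[:, None] == 1, np.sin(2 * np.pi * t), 0.0)
X3d += rng.normal(scale=0.05, size=X3d.shape)  # measurement noise

# Flatten each series into one row of n_timepoints * d raw features.
X2d = X3d.reshape(n_samples, n_timepoints * d)

clf = SVC(kernel="linear").fit(X2d, y)
train_accuracy = clf.score(X2d, y)
print("shape:", X2d.shape, "training accuracy:", train_accuracy)
```

The reshape keeps the two coordinates of each time point next to each other in the row. The classifier does not care about the column order as long as it is the same for every sample.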
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
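A velocity feature of this kind could be sketched with a difference along the time axis. The two toy swipes, the sampling interval `dt`, and the choice of mean speed as the summary are assumptions for illustration.

```python
import numpy as np

# Two toy swipes with 5 time points each: one slow, one fast (an assumption).
positions = np.array([
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]],  # slow
    [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [6.0, 0.0], [8.0, 0.0]],  # fast
])
dt = 0.01  # assumed sampling interval in seconds

# Velocity as a simple difference along the time axis (axis 1).
velocity = np.diff(positions, axis=1) / dt   # shape (2, 4, 2)
speed = np.linalg.norm(velocity, axis=2)     # shape (2, 4)
mean_speed = speed.mean(axis=1)              # one value per swipe

# Append the engineered feature as a further column to the flattened raw data.
n_samples = positions.shape[0]
features = np.column_stack([positions.reshape(n_samples, -1), mean_speed])
print(mean_speed, features.shape)
```

The velocity components themselves are linear in the raw positions, which is why a linear classifier may gain little from them. The speed (a norm) is not linear, so it is the kind of derived column that could add information.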
If you have lots and lots and lots and lots of data (say an internet full of good examples), deep learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to trial and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.

Linear SVM vs Nonlinear SVM high dimensional data

I am working on a project where I use Spark MLlib's linear SVM (with L2 regularization) to classify some data. I have about 200 positive observations and 150 (generated) negative observations, each with 744 features, which represent a person's level of activity in different regions of a house.
I have run some tests and the "areaUnderROC" metric was 0.991, so the model seems quite good at classifying the data I provide to it.
I did some research and found that a linear SVM works well on high-dimensional data, but I don't understand how something linear can divide my data so well.
I think in 2D, and maybe this is the problem, but looking at the image below, I am 90% sure that my data looks more like a non-linear problem.
So is it normal that I get good results on the tests? Am I doing something wrong? Should I change the approach?
I think your question is: 'Why can a linear SVM classify my high-dimensional data well even though the data should be non-linear?'
Some data sets look non-linear in low dimensions, like your example image on the right, but it is hard to say that a data set is definitely non-linear in high dimensions, because a problem that is non-linear in nD may be linear in (n+1)D space. So I don't know why you are 90% sure that your data set is non-linear even though it is high-dimensional.
In the end, I think it is normal that you get a good result on the test samples: it indicates that your data set is linear, or nearly linear, in high dimensions; otherwise the model would not work so well. Cross-validation could help you confirm whether your approach is suitable.
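A tiny worked example of the "non-linear in nD, linear in (n+1)D" idea, assuming a toy 1D data set where the positive class sits on both sides of the negative class. The points, the labels, and the helper `separable_by_threshold` are hypothetical and chosen for illustration.

```python
def separable_by_threshold(values, labels):
    """True if a single threshold on `values` splits the two classes perfectly."""
    sorted_labels = [label for _, label in sorted(zip(values, labels))]
    # A perfect split means the sorted labels change class at most once.
    changes = sum(1 for a, b in zip(sorted_labels, sorted_labels[1:]) if a != b)
    return changes <= 1

# 1D points: class 1 lies on both sides of class 0 (an assumption).
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1 if abs(x) > 1 else 0 for x in xs]

# In 1D, no single threshold separates the classes.
separable_1d = separable_by_threshold(xs, labels)

# Add one extra dimension, x**2. In the plane (x, x**2) the horizontal line
# x**2 = 1.5 separates the classes, which is a linear boundary in 2D.
squares = [x * x for x in xs]
separable_lifted = separable_by_threshold(squares, labels)

print(separable_1d, separable_lifted)  # prints: False True
```

With 744 features there are many more directions available than in this toy case, so a separating hyperplane is much easier to find than 2D intuition suggests.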
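A cross-validation check along these lines could look like the sketch below. It uses scikit-learn instead of Spark MLlib, and synthetic data with the same shape as in the question (200 positives, 150 negatives, 744 features) in which the classes differ in 20 features. The library swap, the data, and the names are all assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: 200 positive and 150 negative samples, 744 features.
rng = np.random.default_rng(0)
n_pos, n_neg, n_features = 200, 150, 744
y = np.array([1] * n_pos + [0] * n_neg)
X = rng.normal(size=(n_pos + n_neg, n_features))
X[y == 1, :20] += 2.0  # the classes differ in the first 20 features only

# 5-fold stratified cross-validation of a linear SVM, scored by ROC AUC.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5, scoring="roc_auc")
print("AUC per fold:", np.round(scores, 3))
print("mean AUC:", scores.mean())
```

If the per-fold scores are consistently high and close to each other, that is some evidence the linear model generalizes and did not just fit one lucky split.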
