Classification of unknown dataset into known categories - python-3.x

I have a number of datasets, each consisting of an array of x, y, z coordinates of segment endpoints. The first and second points represent one segment, the third and fourth another, and so on...
The above data represents just a part of a dataset... The entire dataset is a lot bigger.
I need to train my model on several datasets like this, so that it can later predict the category of any unknown dataset... The test datasets will have the same form as the above.
I need help with the approach. Which algorithm or approach can I use here to classify any unknown dataset into these known categories?

It's an unsupervised learning problem. If you know roughly into how many classes your data should be split, use K-Means (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
Otherwise, a combination of t-SNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) and K-Means usually works well: transform the data using t-SNE and run K-Means on the transformed data.
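A minimal sketch of that combination, assuming each dataset has already been flattened into a fixed-length feature vector and stacked into a matrix X (the array shape and n_clusters below are placeholders, not part of the original answer):
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

X = np.random.rand(200, 30)  # placeholder for your stacked feature vectors
X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X)  # reduce to 2-D
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_embedded)  # cluster the embedding
print(labels[:10])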

Related

How to deal with the unlabeled nodes in Pytorch Geometric?

I have a dataset of my own, and the dataset contains two classes, let's say 0 and 1. Besides, there is a large portion of nodes whose class is unlabeled. My goal is to predict these unlabeled nodes using a GCN, but I am confused about how to deal with these unlabeled nodes in PyTorch Geometric.
As far as I can think, I could label the nodes with 3 classes: 0, 1 and unknown. But if I do it this way, doesn't that mean I am trying to classify the dataset into three classes? (That's not what I want, since unknown is not a class.)
Another way to deal with these nodes is to ignore them and simply run PyG on the labeled nodes. But in this way, it seems that the unlabeled nodes (which do have features) are useless in the dataset?
That very much depends on your use case and the data!
Case 1 - Graph Autoencoder
For this case let's assume the task is to find similar tweets. A way of doing this is to train a Graph Autoencoder (see example).
This approach is completely unsupervised and thus does not need any data to be labeled.
The resulting model should be able to generate an embedding for each node (in this case each tweet) so that the distance between similar tweets is lower than between non-similar ones (measured, e.g., by cosine distance).
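A minimal sketch of this unsupervised setup with PyG's GAE model (the two-layer GCN encoder and the toy tensors below are assumptions, not part of the original answer):
import torch
from torch_geometric.nn import GAE, GCNConv

class Encoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels)
        self.conv2 = GCNConv(2 * out_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

x = torch.randn(4, 16)                      # toy node features (4 nodes, 16 features)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])   # toy edges

model = GAE(Encoder(in_channels=16, out_channels=8))
z = model.encode(x, edge_index)             # one embedding per node / tweet
loss = model.recon_loss(z, edge_index)      # unsupervised reconstruction loss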
Case 2 - Semi-Supervised GCN
Another case would be to classify tweets as advertisement vs. non-advertisement. Since the idea behind GCNs is to train in a semi-supervised manner, it is no problem to only have labels for some of the tweets.
In order to tell PyG which nodes have labels and should be used for training, you can define a train_mask. All nodes with missing labels still technically need a y-value, which can be set to -1.
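A minimal sketch of masking the loss to the labeled nodes (the toy graph and the one-layer GCNConv classifier below are assumptions for illustration only):
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy graph: 4 nodes with 8 features each; y = -1 marks the unlabeled nodes
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
y = torch.tensor([0, 1, -1, -1])
data = Data(x=x, edge_index=edge_index, y=y)
data.train_mask = data.y != -1            # only labeled nodes enter the loss

conv = GCNConv(8, 2)                      # minimal stand-in for a full GCN classifier
out = conv(data.x, data.edge_index)       # logits for every node, labeled or not
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
loss.backward()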

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The dataset was provided to me by my professor for this comparative study.
My Jupyter notebook and dataset can be found here.
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the built-in breast cancer dataset in sklearn, which is very similar to my dataset, as both are binary classification problems. There I get a mean accuracy of 95%. So I now think that the problem might be my dataset. Can I get some help on how to pre-process my data, or any other steps I might look into to improve accuracy?
Encode labels
Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group, etc. We will use LabelEncoder to encode the categorical data. LabelEncoder is part of the scikit-learn library in Python and converts categorical, or text, data into numbers, which our predictive models can understand better.
# Encode categorical label values as integers (Y is assumed to hold the class labels)
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the time, your dataset will contain features that vary widely in magnitude, units and range. Since most machine learning algorithms use the Euclidean distance between data points in their computations, we need to bring all features to the same level of magnitude. This can be achieved by scaling, i.e. transforming your data so that it fits within a specific range, such as 0-100 or 0-1. We will use the StandardScaler class from the scikit-learn library.
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data only
X_test = sc.transform(X_test)        # apply the same scaling to the test data
Choosing the right model
You might also want to choose an appropriate model. You can't just use neural networks for every problem; that is the no free lunch theorem. To compare candidate models you could use k-fold cross-validation, AIC and BIC.
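A minimal sketch of comparing two candidate models with k-fold cross-validation (the built-in breast cancer dataset is used as a stand-in for your own data, and the two models are examples only):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Compare mean 5-fold cross-validated accuracy of two candidate models
X, y = load_breast_cancer(return_X_y=True)
for model in (DecisionTreeClassifier(random_state=0), LogisticRegression(max_iter=5000)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))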

Clustering data after dimension reduction with PCA

Say we have a dataset of large dimension, which we have reduced to a lower dimension using PCA. Would it be wise/accurate to then use a clustering algorithm on said data, assuming that we do not know how many clusters to expect?
Using PCA on the Iris dataset (with the data in the CSV ordered such that all of the first class is listed, then the second, then the third) yields the following plot:
It can be seen that the three classes in the Iris dataset have been retained. However, when the order of the samples is randomised, the following plot is produced:
Above, it is not clear how many clusters/classes are contained in the dataset. In this case (the more real-world case), how would one identify the number of classes? Would a clustering algorithm such as K-Means be effective?
Would there be inaccuracies due to discarding the lower-order principal components?
EDIT: To be clear, I am asking whether a dataset can be clustered after running PCA, and if so, what the most accurate method would be.
Say we have a dataset of a large dimension, which we have reduced to a lower dimension using PCA, would it be wise/accurate to then use a clustering algorithm on said data? Assuming that we do not know how many clusters to expect.
Your data might well separate in a low-variance dimension. I would not recommend running PCA prior to clustering.
Above, it is not clear how many clusters/classes are contained in the data set. In this case (the more real world case), how would one identify the number of classes, would a clustering algorithm such as K-Means be effective?
There are effective clustering algorithms that do not require prior knowledge of the number of classes, such as Mean Shift and DBSCAN.
Try sorting the dataset after PCA, then plotting it.
The Iris data set is much too simple to draw any valid conclusions about the behaviour of high-dimensional data and the benefits of PCA.
Plus, "wise" - in which sense? If you want to eat pizza, it is not wise to plot the iris data set.
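A minimal sketch of clustering PCA-reduced data with DBSCAN, which does not need the number of clusters up front (the Iris data and the eps/min_samples values are assumptions for illustration):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Reduce Iris to two principal components, then cluster without fixing k
X = load_iris().data
X_reduced = PCA(n_components=2).fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_reduced)
print(set(labels))   # cluster ids found by density; -1 marks points treated as noise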

What are the ways of pre-processing categorical data before applying classification algorithms?

I am new to machine learning and I am working on a classification problem with categorical (nominal) data. I have tried applying BayesNet and a couple of tree- and rule-based classification algorithms to the raw data, and I am able to achieve an AUC of 0.85.
I further want to improve the AUC by pre-processing or transforming the data. However, since the data is categorical, I don't think log transforms, addition, multiplication, etc. of different columns will work here.
Can somebody list the most common transformations applied to categorical datasets? (I tried one-hot encoding but it takes a lot of memory!)
In my experience, categorical data is best dealt with by one-hot encoding (i.e. converting each category to a binary vector), as you've mentioned. If memory is an issue, it may be worthwhile to use an online classification algorithm and generate the encoded vectors on the fly, as sketched below.
Apart from this, if the categories represent a range (for example, a range of values such as age, height or income), it may be possible to treat the centre of each category's range (or some appropriate mean, if there is an intra-label distribution) as a real number.
If you were applying clustering, you could also treat the categorical labels as points on an axis (1, 2, 3, 4, 5, etc.), scaled appropriately relative to the other features.
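A minimal sketch of the memory-friendly route: hashed sparse encodings combined with an online classifier trained via partial_fit (chunks_of_data() is a hypothetical generator yielding batches of rows and labels, and the hashing dimension is an assumption):
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2**12, input_type="string")
clf = SGDClassifier(loss="log_loss")   # use loss="log" on older scikit-learn versions

# chunks_of_data() is hypothetical: it yields (rows, labels) batches,
# where each row is a dict mapping column name -> category value.
for rows, labels in chunks_of_data():
    # Encode each row as sparse hashed features on the fly, then update the model
    X_sparse = hasher.transform([f"{col}={val}" for col, val in row.items()] for row in rows)
    clf.partial_fit(X_sparse, labels, classes=np.array([0, 1]))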

dimension reduction makes data non-linearly separable

I am working on a project to classify hearing disorders using an SVM. I have collected data from the site (http://archive.ics.uci.edu/ml/machine-learning-databases/audiology/) and initially decided to go for two classes, to separate patients with a normal ear from patients with any disorder. Varying the regularization parameter C from 0.1 to 10, I get one misclassification between the two classes (at C=10).
However, I want to plot the data with the decision boundary, but the dataset has around 68 features, so it is not possible to plot it directly. I used PCA to reduce it to 2D and ran the SVM on this data to see the results. But when I use PCA, the data is no longer linearly separable and a linear decision boundary cannot separate the 2D PCA data. So I want to know if there is a way to reduce the dimension while retaining the nature of the data (nature as in separability by a linear decision boundary). Can anyone please help me?
Thanks
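For reference, a minimal sketch of the PCA-to-2D plus linear SVM pipeline described above (the random arrays are placeholders for the numerically encoded audiology features and the binary labels):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Placeholders for the encoded audiology features (68 columns) and normal-vs-disorder labels
X = np.random.rand(100, 68)
y = np.random.randint(0, 2, size=100)

X_2d = PCA(n_components=2).fit_transform(X)      # reduce to 2-D for plotting
clf = SVC(kernel="linear", C=10).fit(X_2d, y)    # linear SVM on the reduced data
print(clf.score(X_2d, y))                        # accuracy after the reduction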
