How to find out what a cluster represents on a PCA biplot? - python-3.x

I am building a K-means model and have multiple variables to feed into it. Because of this I am using PCA to reduce the data to two dimensions. When I display the PCA biplot I don't understand what similarities the data points share that cause them to be grouped into a specific cluster. I am using a customer segmentation dataset. For example, I want to be able to tell that a specific cluster is a cluster because its customers have a low income but spend a lot of money on products.

Since you are using k-means:
Compute the mean of each cluster on the original data. Now you can compare these attributes.
Alternatively: don't use PCA in the first place if it hampers your analysis; k-means is as good as PCA at coping with several dozen variables.
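For example, a minimal sketch of the "cluster means on the original data" idea (the column names "income" and "spending_score" are just illustrative stand-ins for the customer-segmentation variables):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the customer-segmentation data
df = pd.DataFrame({
    "income": [15, 16, 81, 79, 25, 90],
    "spending_score": [80, 76, 10, 12, 70, 8],
})

# Cluster on scaled data (or on the PCA scores, if you keep PCA)
X = StandardScaler().fit_transform(df)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Mean of each original, untransformed variable per cluster
print(df.assign(cluster=labels).groupby("cluster").mean())

A cluster whose mean income is low while its mean spending score is high is exactly the "low income, spends a lot" segment you are after.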

Related

How to deal with the unlabeled nodes in Pytorch Geometric?

I have my own dataset, and it contains two classes, let's say 0 and 1. Besides these, there is a large portion of nodes whose class is unlabeled. My goal is to predict these unlabeled nodes using a GCN, but I am confused about how to deal with them in PyTorch Geometric.
As far as I can tell, I could label the nodes with 3 classes: 0, 1 and unknown. But if I do it this way, doesn't that mean I am trying to classify the dataset into three classes? (That's not what I want, since unknown is not a class.)
Another way to deal with these nodes is to ignore them and simply run PyG on the labeled nodes. But then it seems that the unlabeled nodes (which do have features) are useless in the dataset?
That very much depends on your use case and the data!
Case 1 - Graph Autoencoder
For this case let's assume the task is to find similar tweets. A way of doing this is to train a Graph Autoencoder (see example).
This approach is completely unsupervised and thus does not need any data to be labeled.
The resulting model should be able to generate an embedding for each node (in this case each tweet) so that the distance between similar tweets is lower than between non-similar ones (measured, e.g., by cosine distance).
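A rough sketch of that setup with PyTorch Geometric's GAE on a toy graph (the graph, sizes and training loop here are placeholders, not the linked example):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GAE, GCNConv

# Toy graph: 5 nodes with 8 features each (stand-ins for tweet features)
x = torch.randn(5, 8)
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 3, 4, 0]])

class Encoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels)
        self.conv2 = GCNConv(2 * out_channels, out_channels)
    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

model = GAE(Encoder(in_channels=8, out_channels=4))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for _ in range(100):
    optimizer.zero_grad()
    z = model.encode(x, edge_index)          # node embeddings
    loss = model.recon_loss(z, edge_index)   # unsupervised edge-reconstruction loss
    loss.backward()
    optimizer.step()

with torch.no_grad():
    z = model.encode(x, edge_index)
    sim = F.cosine_similarity(z[0], z[1], dim=0)   # similarity between two nodes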
Case 2 - Semi-Supervised GCN
Another case would be to classify tweets as advertisement vs. non-advertisement. Since the idea behind GCNs is to train in a semi-supervised manner it would be no problem to only have labels for some of the tweets.
In order to tell PyG which nodes have labels and should be used for training, you can define a train_mask. All nodes with missing labels will still technically need a y value, which can be set to -1.
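A minimal sketch of that masking on a toy graph (the data and model sizes are made up; only the y = -1 / train_mask pattern is the point):

import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy graph: 4 nodes, 3 features each; node 3 is unlabeled (y = -1)
x = torch.randn(4, 3)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
y = torch.tensor([0, 1, 0, -1])
data = Data(x=x, edge_index=edge_index, y=y)
data.train_mask = data.y != -1   # only labeled nodes contribute to the loss

class GCN(torch.nn.Module):
    def __init__(self, num_features, hidden, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, num_classes)
    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

model = GCN(num_features=3, hidden=16, num_classes=2)
out = model(data.x, data.edge_index)
# Unlabeled nodes still take part in message passing through their features
# and edges, but they are excluded from the loss:
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])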

Multi-features modeling based on one binary-feature which is rarely 1 (imbalanced data) when there is a cost

I need to model multivariate time-series data to predict a binary target which is rarely 1 (imbalanced data).
This means we want to model based on one binary feature (outbreak) which is rarely 1.
All of the features are binary and rarely 1.
What is the suggested solution?
This feature affects the cost according to the cost function below. We want to decide whether to be prepared or not prepared, given the following cost structure.
Problem Definition:
Model based on outbreak, which is rarely 1.
Decide whether to be prepared or not prepared for the outbreak of a disease; the cost of an unprepared outbreak is 20 times the cost of preparation.
Cost for each day (next day):
cost = 20 * outbreak * (not prepared) + prepared
Model: for which days should we prepare (prepare for the next day) for an outbreak?
Questions:
Build a model to predict outbreaks?
Report the cost estimation for every year
A csv file is provided; the data is recorded at the end of each day.
Each row of the csv file is a day with its different features. Some of them are binary, and the last feature is outbreak, which is rarely 1 and is the main feature entering the cost.
You are describing class imbalance.
A typical approach is to generate balanced training data by repeatedly running through the examples containing your (rare) positive class, and each time choosing a new random sample from the negative class.
Also, pay attention to your cost function. You wouldn't want to reward a simple model for always choosing the majority class.
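A minimal sketch of that resampling idea (X and y are random placeholders for your features and rare binary target):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.03).astype(int)   # rare positive class

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

def balanced_batch():
    # All positives plus a fresh random sample of equally many negatives
    sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, sampled_neg])
    rng.shuffle(idx)
    return X[idx], y[idx]

X_batch, y_batch = balanced_batch()

Because each pass re-draws the negatives, the model eventually sees most of the majority class while every batch stays balanced.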
My suggestions:
Supervised Approach
SMOTE for upsampling
XGBoost, tuning scale_pos_weight
Replicate the minority class, e.g. 10 times
Try ensemble tree algorithms; trying to fit a linear decision surface is risky in your case.
Since your data is a time series, you can generate minority-class days just before the real disease occurred. For example, say you have a minority-class observation at 2010-07-20 and the last observation before it is at 2010-06-27. You can generate observations such as 2010-07-15 and 2010-07-18 by slightly varying the existing values.
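A rough sketch of the first two suggestions (imbalanced-learn and xgboost are assumed to be installed; the data is a random placeholder):

import numpy as np
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (rng.random(500) < 0.05).astype(int)    # rare positive class

# Option 1: upsample the minority class with SMOTE before fitting any model
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Option 2: keep the data as-is and reweight positives in XGBoost
ratio = (y == 0).sum() / max((y == 1).sum(), 1)
model = XGBClassifier(scale_pos_weight=ratio, n_estimators=200, eval_metric="logloss")
model.fit(X, y)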
Unsupervised Approach
Try anomaly detection algorithms, such as IsolationForest (also try the extended version of it).
Cluster your observations and check whether the minority class becomes a cluster by itself. If it does, you can label your data with cluster names (cluster1, cluster2, cluster3, etc.) and then train a decision tree to see the split patterns (KMeans + DecisionTreeClassifier).
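A small sketch of both unsupervised ideas on placeholder data:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))

# Anomaly detection: -1 flags potential outliers (candidate outbreak days)
iso_labels = IsolationForest(random_state=0).fit_predict(X)

# Cluster, then fit a shallow tree on the cluster labels to read off
# the split rules that characterise each cluster
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, clusters)
print(export_text(tree))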
Model Evaluation
Set up a cost matrix. Do not use confusion-matrix metrics such as precision directly. You can find further information about cost matrices here: http://mlwiki.org/index.php/Cost_Matrix
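For this particular problem the cost matrix follows directly from the question's cost function; a small sketch of evaluating a policy with it (the example days are made up):

import numpy as np

# Per-day cost from the question: cost = 20 * outbreak * (not prepared) + prepared
# i.e. a missed outbreak costs 20, preparing costs 1, doing nothing on a calm day costs 0
def total_cost(outbreak, prepared):
    outbreak = np.asarray(outbreak)
    prepared = np.asarray(prepared)
    return int(np.sum(20 * outbreak * (1 - prepared) + prepared))

actual = [0, 0, 1, 0, 0, 1, 0]
print(total_cost(actual, [0] * 7))   # never prepare -> 40
print(total_cost(actual, [1] * 7))   # always prepare -> 7

Compare models on this total cost, not on accuracy, since always predicting "no outbreak" would look accurate but cost the most.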
Note:
According to the OP's question in the comments, grouping by year could be done like this:
df["date"] = pd.to_datetime(df["date"])
df.groupby(df["date"].dt.year).mean()
You can also use other aggregators (sum, count, etc.).

How to use clustering to group sentences with similar intents?

I'm trying to develop a program in Python that can process raw chat data and cluster sentences with similar intents so they can be used as training examples to build a new chatbot. The goal is to make it as quick and automatic (i.e. no parameters to enter manually) as possible.
1- For feature extraction, I tokenize each sentence, stem its words and vectorize it using Sklearn's TfidfVectorizer.
2- Then I perform clustering on those sentence vectors with Sklearn's DBSCAN. I chose this clustering algorithm because it doesn't require the user to specify the desired number of clusters (like the k parameter in k-means). It throws away a lot of sentences (considering them as outliers), but at least its clusters are homogeneous.
The overall algorithm works on relatively small datasets (10000 sentences) and generates meaningful clusters, but there are a few issues:
On large datasets (e.g. 800000 sentences), DBSCAN fails because it requires too much memory, even with parallel processing on a powerful machine in the cloud. I need a less computationally-expensive method, but I can't find another algorithm that doesn't make weird and heterogeneous sentence clusters. What other options are there? What algorithm can handle large amounts of high-dimensional data?
The clusters that are generated by DBSCAN are sentences that have similar wording (due to my feature extraction method), but the targeted words don't always represent intents. How can I improve my feature extraction so it better captures the intent of a sentence? I tried Doc2vec but it didn't seem to work well with small datasets made of documents that are the size of a sentence...
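For reference, a minimal sketch of the pipeline described in steps 1 and 2 above (eps, min_samples and the stemmer choice are placeholders to tune; nltk is assumed to be available):

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

sentences = [
    "hi, I want to reset my password",
    "hello, how do I reset my password?",
    "what are your opening hours?",
    "when are you open?",
]

stemmer = SnowballStemmer("english")
def stem_text(text):
    return " ".join(stemmer.stem(tok) for tok in text.split())

X = TfidfVectorizer(preprocessor=stem_text).fit_transform(sentences)

# cosine distance works directly on the sparse TF-IDF matrix
labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(X)
print(labels)   # -1 marks sentences treated as noise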
A standard implementation of DBSCAN is supposed to need only O(n) memory. You cannot get lower than this memory requirement. But I read somewhere that sklearn's DBSCAN actually uses O(n²) memory, so it is not the optimal implementation. You may need to implement this yourself then, to use less memory.
Don't expect these methods to be able to cluster "by intent". There is no way an unsupervised algorithm can infer what is intended. Most likely, the clusters will just be based on a few key words. But this could be whether people say "hi" or "hello". From an unsupervised point of view, this distinction gives two nice clusters (and some noise, maybe also another cluster "hola").
I suggest to train a supervised feature extraction based on a subset where you label the "intent".

scikit-learn clustering: predict(X) vs. fit_predict(X)

In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the doc:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two, they seem equivalent to me.
In order to use 'predict' you must call the 'fit' method first, so using 'fit()' and then 'predict()' is essentially the same as using 'fit_predict()'. However, using only 'fit()' can be useful when you need to inspect the fitted parameters of your model, whereas 'fit_predict()' only gives you the labels from running the model on the data.
fit_predict is usually used for unsupervised machine learning transductive estimator.
Basically, fit_predict(x) is equivalent to fit(x).predict(x).
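A quick sketch of that equivalence, and of what predict() alone adds for k-means:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_a = km.fit_predict(X)       # one call
labels_b = km.fit(X).predict(X)    # two calls, same labels on the same data
assert (labels_a == labels_b).all()

# predict() alone can later assign brand-new points to the learned centroids
new_labels = km.predict([[0.0, 0.0], [5.0, 5.0]])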
This might be a very late answer, but someone might benefit from it in the future.
The reason I can see for having predict in KMeans but only fit_predict in DBSCAN is this:
In KMeans you get centroids based on the number of clusters considered. So once you have trained your data points using fit(), you can use predict() to assign a new single data point to a specific cluster.
In DBSCAN you don't have centroids; clusters are formed based on min_samples and eps (the maximum distance between two points for them to be considered neighbors) that you define. The algorithm returns cluster labels for all the data points it was fit on, which explains why there is no predict() method for a new single data point. The difference between fit() and fit_predict() was already explained by another user.
Another spatial clustering algorithm, HDBSCAN, gives us an option to predict for new points using approximate_predict(). It's worth exploring.
Again, this is my understanding based on the source code I explored; experts are welcome to point out any differences.
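A sketch of that HDBSCAN option (hdbscan is a separate package, and approximate_predict needs prediction_data=True at fit time):

import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_new, _ = make_blobs(n_samples=5, centers=3, random_state=1)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(X)

# Assign new points to the existing clusters without refitting
labels, strengths = hdbscan.approximate_predict(clusterer, X_new)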

Clustering data after dimension reduction with PCA

Say we have a dataset of large dimension which we have reduced to a lower dimension using PCA; would it be wise/accurate to then use a clustering algorithm on that data, assuming that we do not know how many clusters to expect?
Using PCA on the Iris dataset (with the data in the csv ordered such that all of the first class is listed, then the second, then the third) yields the following plot:
It can be seen that the three classes in the Iris dataset have been retained. However, when the order of the samples is randomised, the following plot is produced:
Above, it is not clear how many clusters/classes are contained in the data set. In this case (the more real-world case), how would one identify the number of classes? Would a clustering algorithm such as K-means be effective?
Would there be inaccuracies due to the discarding of lower-order principal components?
EDIT:- To be clear, I am asking if a dataset can be clustered after running PCA, and if so, what the most accurate method would be.
Say we have a dataset of a large dimension, which we have reduced to a lower dimension using PCA, would it be wise/accurate to then use a clustering algorithm on said data? Assuming that we do not know how many clusters to expect.
Your data might well separate only in a low-variance dimension, which PCA would discard. I would not recommend running PCA prior to clustering.
Above, it is not clear how many clusters/classes are contained in the data set. In this case (the more real world case), how would one identify the number of classes, would a clustering algorithm such as K-Means be effective?
There are effective clustering algorithms that do not require prior knowledge of the number of classes, such as Mean Shift and DBSCAN.
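For instance, a minimal sketch on the iris data (eps and min_samples are illustrative and would need tuning):

from sklearn.cluster import MeanShift, DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Neither estimator takes a number of clusters; it is inferred from the data
ms_labels = MeanShift().fit_predict(X)
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

print(len(set(ms_labels)))                      # clusters found by MeanShift
print(len(set(db_labels)) - (-1 in db_labels))  # DBSCAN clusters, excluding noise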
Try sorting the dataset after PCA, then plotting it.
The iris data set is much too simple to draw any valid conclusions about the behaviour of high-dimensional data and the benefits of PCA.
Plus, "wise" - in which sense? If you want to eat pizza, it is not wise to plot the iris data set.
