Separating assignment from centers update in scikit-learn's k-means

Is there a way to execute separately the assignment and the update of centroids using sklearn implementation of k-means?
I would like to measure the execution time of these two steps, and I would like to do this measurement on a good implementation of k-means (as sklearn's seems to be).
Therefore, I would like to do something like:
Assign initial centroids
Measure time execution of assignment phase
Measure time execution of centroids update
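One workaround (a sketch, not scikit-learn's optimized internal loop): run a single Lloyd iteration yourself with scikit-learn/NumPy primitives and time each phase separately. The dataset below is a synthetic stand-in.

import time
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, _ = make_blobs(n_samples=100_000, centers=8, random_state=0)
rng = np.random.default_rng(0)
centers = X[rng.choice(len(X), size=8, replace=False)]  # assign initial centroids

t0 = time.perf_counter()
labels = pairwise_distances_argmin(X, centers)  # assignment phase
t1 = time.perf_counter()
centers = np.array([X[labels == k].mean(axis=0)  # centroids update phase
                    for k in range(8)])
t2 = time.perf_counter()

print(f"assignment: {t1 - t0:.4f}s, update: {t2 - t1:.4f}s")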

Related

How to make any sklearn model verbose?

I am trying to implement a data clustering algorithm, specifically DBSCAN, using scikit-learn. I am using the Jaccard index for my metric. However, DBSCAN() doesn't have the verbose parameter that other models have. This means I can't see which epoch my DBSCAN is on, and I have no intuition of how long it is going to take. Also, to my (somewhat limited) knowledge of clustering algorithms, they may fail to ever converge if they get stuck in a loop; hence, knowing which iteration the algorithm is in is quite important.
Is there any way I can have scikit-learn print info on which epoch I am on? If not, is there a way to code such a function myself and have scikit-learn run it at the end of every iteration (or something like that)? Or do I have to code the entire DBSCAN() function myself to get printed statements about the epoch and the associated accuracy scores?
Thanks!
I am not familiar with an option that lets scikit-learn's implementation of DBSCAN() print the iteration it is in. Nevertheless, you could reason about your data to see whether it makes sense that it takes so long to converge.
DBSCAN() works really well if you have regions with dense clusters (of any shape, which is one of its main advantages) and other regions with few datapoints. So if you first visualize your data in 2D or 3D after PCA, you can get a first indication of whether your data is one blob or whether there are high- and low-density regions. If the data is indeed a blob, then DBSCAN() will likely have a hard time converging, and if it converges it will likely choose one cluster with many anomalies. Moreover, your epsilon parameter is a very important one in DBSCAN(), because it determines the proximity within which points are regarded as belonging to one cluster. The lower the epsilon, the more clusters you are likely to find.
I think the points above might explain why your clustering algorithm takes so long to run, because DBSCAN() normally has a roughly linear (in the number of datapoints) computational complexity.
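A rough sketch of that sanity check (synthetic data standing in for yours): project the data to 2D with PCA and look for dense regions before running DBSCAN.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=2000, n_features=10, centers=4, random_state=0)

X_2d = PCA(n_components=2).fit_transform(X)  # project to 2D for inspection
plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.title("Data after PCA: dense regions vs. one blob")
plt.show()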

Multi-feature modeling based on one binary feature which is rarely 1 (imbalanced data) when there is a cost

I need to model multi-variate time-series data to predict a binary target which is rarely 1 (imbalanced data).
This means we want to model based on one binary feature (outbreak) that is rarely 1.
All of the features are binary and rarely 1.
What is the suggested solution?
This feature affects the cost via the cost function below. We want to decide whether to be prepared or not prepared, given that the cost is as follows.
Problem Definition:
Model based on outbreak, which is rarely 1.
Decide whether or not to prepare to avoid the outbreak of a disease; the cost of an outbreak is 20 times the cost of preparation.
Cost of each day (next day):
cost = 20 * outbreak * (1 - prepared) + prepared
Model: for which days should we prepare (prepare for the next day) for an outbreak?
Questions:
Build a model to predict outbreaks.
Report the cost estimation for every year.
A csv file is uploaded and the data is recorded at the end of each day.
The csv file contains rows, each of which is a day with its features; some of them are binary, and the last feature is outbreak, which is rarely 1 and is the main feature entering the cost.
You are describing class imbalance.
A typical approach is to generate balanced training data by repeatedly running through the examples containing your (rare) positive class, and each time choosing a new random sample from the negative class.
Also, pay attention to your cost function. You wouldn't want to reward a simple model for always choosing the majority class.
My suggestions:
Supervised Approach
SMOTE for upsampling (a sketch follows this list)
XGBoost, tuning scale_pos_weight (also sketched below)
Replicate the minority class, e.g. 10 times
Try ensemble tree algorithms; trying to fit a linear decision surface is risky in your case.
Since your data is a time series, you can generate minority-class days just before a real outbreak happened. For example, if you have a minority-class observation at 2010-07-20 and the last observation before that is at 2010-06-27, you can generate observations for 2010-07-15, 2010-07-18, etc. by slightly varying the values.
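A minimal sketch of the first two suggestions (synthetic data, not the OP's csv; the scale_pos_weight value is the usual negative/positive ratio heuristic, an assumption rather than a tuned value):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Synthetic stand-in for the OP's data: roughly 3% positive class
X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)

# (a) SMOTE: oversample the rare positive class to balance the training set
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# (b) XGBoost: keep the original data but reweight the positive class
ratio = (y == 0).sum() / (y == 1).sum()
clf = XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss").fit(X, y)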
Unsupervised Approach
Try anomaly detection algorithms such as IsolationForest (also try the extended version of it).
Cluster your observations and check whether the minority class forms a cluster of its own. If it does, you can label your data with cluster names (cluster1, cluster2, cluster3, etc.) and then train a decision tree to see the split patterns (KMeans + DecisionTreeClassifier; a sketch follows below).
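A minimal sketch of the KMeans + DecisionTreeClassifier idea (synthetic data as a placeholder): label observations with their cluster id, then fit a shallow tree to expose the split patterns.

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = make_classification(n_samples=2000, n_features=6, weights=[0.97], random_state=0)

cluster_labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, cluster_labels)
print(export_text(tree))  # which features/thresholds define each cluster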
Model Evaluation
Set up a cost matrix. Do not use the confusion matrix, precision, etc. directly. You can find further information about cost matrices here: http://mlwiki.org/index.php/Cost_Matrix (a sketch using the OP's cost function follows below).
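A minimal sketch (with hypothetical predictions) of evaluating with the OP's own cost function, cost = 20 * outbreak * (1 - prepared) + prepared, instead of accuracy:

import numpy as np

outbreak = np.array([0, 0, 1, 0, 1, 0, 0, 0])  # ground truth (rarely 1)
prepared = np.array([0, 1, 1, 0, 0, 0, 1, 0])  # model's decision for each day

daily_cost = 20 * outbreak * (1 - prepared) + prepared
print(daily_cost.sum())  # 20 for the missed outbreak + 3 days of preparation = 23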
Note:
According to the OP's question in the comments, grouping by year could be done like this:
df["date"] = pd.to_datetime(df["date"])
df.groupby(df["date"].dt.year).mean()
You can use other aggregators as well (sum, count, etc.).

Is every step in a scikit-learn pipeline refitted for each grid-searched hyper-parameter configuration?

Assume you have a scikit-learn pipeline with two steps: first, a transformer object (e.g., PCA) and, second and last, an object with a predict method (e.g., a logistic regression). Suppose that your grid contains four different hyperparameter configurations, consisting of two different numbers of principal components (for PCA) and two different regularization parameters (for the logistic regression).
Is the first object (PCA) fitted 4 times (once for each configuration)? Or, on the contrary, is it fitted only twice: for each number of principal components, fit PCA once and the logistic regression twice? The latter seems more efficient and equivalent in terms of results.
I thought that it was the second way, but I got confused by a comment on the scikit-learn GitHub (https://github.com/scikit-learn/scikit-learn/issues/4813#issuecomment-205185156)
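For reference, a minimal sketch of the setup in question: by default, GridSearchCV fits a fresh clone of the whole pipeline for every configuration (and every CV fold), so PCA is refitted each time; Pipeline's memory argument is the documented way to cache and reuse identical transformer fits.

from tempfile import mkdtemp
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline(
    [("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))],
    memory=mkdtemp(),  # cache fitted transformer steps on disk
)

param_grid = {"pca__n_components": [2, 3], "clf__C": [0.1, 1.0]}  # 4 configurations
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
print(search.best_params_)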

LSTM prediction: how to incorporate multiple autocorrelations

I am working on a project which aims at predicting a highly autocorrelated time series. LSTM seems ideal for my purpose. However, does anyone know how I can incorporate multiple strong autocorrelations into my prediction network? That is, there is a very strong yearly correlation as well as a seasonal correlation; how can I include this information in the LSTM network?
Thank you sincerely
If there is autocorrelation, the correlation is linear (not non-linear), because common autocorrelation tests check for linear correlation. Any LSTM is able to capture these linear correlations by default; it does not matter how many linear correlations are in the time series, the LSTM will capture them. A problem could be the length of memory: an LSTM has a memory of roughly 200 to 500 timesteps (https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/), so if the long-term linear correlations lie at lags longer than this, the LSTM will not be able to capture them because it lacks the memory (not physical computer memory, but the memory in the structure of LSTMs).
So simply build the LSTM model in Keras and let it predict, as Upasana Mittal said in their comment; cf. http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html
Updated answer, because there is not enough space in the comments. In http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html a lagged time series is used to determine the ACF; this is objective, otherwise it would be impossible to determine the ACF:
First, we need to review the Autocorrelation Function (ACF), which is
the correlation between the time series of interest in lagged versions
of itself. The acf() function from the stats library returns the ACF
values for each lag as a plot. However, we’d like to get the ACF
values as data so we can investigate the underlying data. To do so,
we’ll create a custom function, tidy_acf(), to return the ACF values
in a tidy tibble.
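For reference, a rough Python analogue of that idea (getting the ACF values as data rather than only as a plot), using statsmodels on a toy series; the series and lag count are placeholders, not from the linked post:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

# Toy monthly-seasonal series standing in for the real data
series = pd.Series(np.sin(np.arange(1000) * 2 * np.pi / 12)
                   + 0.1 * np.random.randn(1000))

acf_values = acf(series, nlags=400, fft=True)  # ACF for lags 0..400
tidy_acf = pd.DataFrame({"lag": np.arange(len(acf_values)), "acf": acf_values})
print(tidy_acf.sort_values("acf", ascending=False).head(10))  # strongest lags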
There is no use of a specially lagged time series as input; using the history of the system, i.e. past system states, to predict future system states is also an objective ansatz and is essential in any RNN.
So the way of proceeding in http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html is objective.
Another point you might mean is the stateful mode; it is vital that you use it, because only in stateful mode are the samples not shuffled, which increases accuracy. Stateless neural nets work on probability distributions, and shuffling a probability distribution does not change it (permutation invariance); stateful neural nets include the sequential ordering of the data, so shuffling changes the result (search the net for 'shuffling multifractal data'):
In normal (or “stateless”) mode, Keras shuffles the samples, and the
dependencies between the time series and the lagged version of itself
are lost. However, when run in “stateful” mode, we can often get high
accuracy results by leveraging the autocorrelations present in the
time series.
LSTMs by definition use a time series and a lagged version of the time series (timesteps,...), so this is also an objective ansatz.
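A minimal sketch of the stateful setup (toy data, shapes, and the Keras 2.x/tf.keras API are assumptions, not taken from the linked post): stateful=True together with shuffle=False preserves the sequential ordering across batches, and states are reset manually between epochs.

import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

timesteps, features, batch_size = 12, 1, 1
series = np.sin(np.arange(200) * 2 * np.pi / 12)  # toy seasonal series
X = np.array([series[i:i + timesteps] for i in range(len(series) - timesteps)])
y = series[timesteps:]
X = X.reshape(-1, timesteps, features)

model = Sequential([
    LSTM(32, stateful=True, batch_input_shape=(batch_size, timesteps, features)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(5):  # manual epoch loop so the states can be reset
    model.fit(X, y, batch_size=batch_size, epochs=1, shuffle=False, verbose=0)
    model.reset_states()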
If you want to dig deeper into the matter and go beyond the linear correlations captured by the ACF, you should learn about nonlinear dynamical systems (chaos theory, fractality, multifractality), because they involve nonlinear systems and nonlinear correlations; i.e., the lag plot of a time series of a nonlinear dynamical system in its chaotic state always exhibits that system's species of nonlinearity. The lag plot of the logistic map in its chaotic region shows a parabola, the lag plot of a cubic nonlinear map shows a cubic curve, and so on. RNNs are only capable of modeling/approximating with good accuracy those systems whose lag plot shows a sufficiently simple structure (circles, spirals, lemniscates, cubic curves, quadratic curves, ...); e.g., it is impossible for a neural net to approximate the sequence of prime gaps, because the lag plot of the sequence of prime gaps is too complexly structured (although it shows a clear pattern for lag = 1 when the sequential ordering is neglected).
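A minimal sketch (not from the answer) illustrating the claim that the lag plot of the logistic map in its chaotic regime traces a parabola:

import numpy as np
import matplotlib.pyplot as plt
from pandas import Series
from pandas.plotting import lag_plot

# Logistic map x_{n+1} = r * x_n * (1 - x_n) with r = 4 (chaotic regime)
x = np.empty(2000)
x[0] = 0.3
for n in range(1, len(x)):
    x[n] = 4.0 * x[n - 1] * (1.0 - x[n - 1])

lag_plot(Series(x), lag=1)  # plots x_n against x_{n+1}: the points lie on a parabola
plt.show()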

scikit-learn clustering: predict(X) vs. fit_predict(X)

In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the doc:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two; they seem equivalent to me.
In order to use predict() you must use the fit() method first, so using fit() and then predict() is definitely the same as using fit_predict(). However, one could benefit from using only fit() in cases where you need to inspect the fitted parameters of your model, rather than fit_predict(), where you just obtain the labeling results of running your model on the data.
fit_predict is usually used for unsupervised, transductive estimators.
Basically, fit_predict(X) is equivalent to fit(X).predict(X).
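A minimal sketch (not from the answers) checking this for KMeans with a fixed random_state; the two calls should yield identical labels here:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels_a = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
labels_b = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X).predict(X)
print(np.array_equal(labels_a, labels_b))  # expected: True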
This might be very late to add an answer here, but it may benefit someone in the future.
The reason I see for having predict() in KMeans and only fit_predict() in DBSCAN is:
In KMeans you get centroids based on the number of clusters considered. So once you have trained on your datapoints using fit(), you can use predict() to assign a new single datapoint to a specific cluster.
In DBSCAN you don't have centroids; clusters are formed based on min_samples and eps (the maximum distance between two points for them to be considered neighbors) that you define. The algorithm returns cluster labels for all the datapoints. This behavior explains why there is no predict() method for a single new datapoint. The difference between fit() and fit_predict() was already explained by other users.
Another spatial clustering algorithm, hdbscan, gives us an option to predict new points using approximate_predict(). It is worth exploring (see the sketch below).
Again, this is my understanding based on the source code I explored. Experts are welcome to highlight any differences.
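A minimal sketch (assumed usage, with synthetic data) of hdbscan's approximate_predict(): fit with prediction_data=True, then assign new points to the existing clusters.

import hdbscan
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
labels, strengths = hdbscan.approximate_predict(clusterer, new_points)
print(labels, strengths)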
