I am trying to implement a data clustering algorithm, specifically DBSCAN, using Scikit learn. I am using the Jaccard Index for my metric. However, DBSCAN() doesn't have the verbose parameter that other models have. This means I can't see which epoch my DBSCAN is on and I have no intuition of how long it is going to take. Also, to my (somewhat limited) knowledge of clustering algorithms, they may fail to ever converge if they get stuck in a loop; hence, knowing which iteration the algorithm is in is quite important.
Is there any way that I can have scikit print info on which epoch I am on? If not, is there a way to code such a function myself and have scikit learn run this function at the end of every output (or something like that)? Or do I have to code the entire DBSCAN() function myself to have printed statements about the epoch and the associated accuracy scores?
Thanks!
I am not familiar with an option to let Scikit's implementation of DBSCAN() print the iteration it is in. Nevertheless, you could try to reason about your data whether it would make sense that it takes so long to converge.
DBSCAN() works really well if you have regions with dense clusters (in any shape; which is one of its main advantages) and other regions with few datapoints. So, if you first try to visualize your data in 2D or 3D after PCA, you could obtain a first indication of whether your data is one blob or whether there are high and low density regions. If the data is indeed a blob, then the DBSCAN() likely will have a hard time converging and if it converges it would like choose one cluster with many anomalies. Moreover, your epsilon parameter is a very important one in DBSCAN() because that will actually determine the proximity of points that will be regarded to one cluster. The lower the epsilon, the more clusters you likely find/
I think the above points might explain why your clustering algorithm takes so long to run, because the DBSCAN() normally has a roughly linear (to the number of datapoints) computational complexity.
Related
I am new to data mining concepts and have a question regarding implementation of a technique.
I am using the a dataset with large continuous values.
Now, I am trying to code an algorithm where I need to discretize data (not scale as it makes no impact on data along with the fact that algorithm is not a distance based one, hence no scaling needed).
Now for discretization, I have a similar question with regards to scaling and train test split.
For scaling, I know we should split data and then fit transform the train and transform the test based on what we fit from train.
But what do we do for discretization? I am using scikit learns KBinsDiscretizer and trying to make sense of whether I should split first and discretize the same way we normally scale or discretize first then scale.
The issue came up because I used the 17 bins, uniform strategy (0-16 value range)
With split then discretize, I get (0-16) range throughout in train but not in test.
With discretize and split, I get (0-16) range in both.
With former strategy, my accuracy is around 85% but with the latter, its a whopping 97% which leads me to believe I have definitely overfit the data.
Please advise on what I should be doing for discretization and whether the data interpretation was correct.
I am using K-means for topic modelling using Word2Vec and would like to understand the implications of vectorizing up to, let's say, 10 dimensions, against embedding it with 200 dimensions and then using PCA to get down to 10. Does the second approach make sense at all?
Which one worked better for your specific purposes, & your specific data, after trying both & comparing the end-results against each other, either in some ad-hoc ("eyeballing") or rigorous way?
There's no reason to prematurely reject any approach, given how many details about your data & ultimate end-goals are unstated.
It would be atypical to train a word2vec model to have only 10 dimensions. Published work most often shows the use of 100 to 1000 dimensions, often 300 or 400, assuming you've got enough bulk training data to make the algorithm worthwhile.
(Word2vec needs a lot of varied training text, with many contrasting usage examples for every word of interest, to generate good results. You may occasionally see toy-sized demos, on smaller amounts of data, just to quickly show steps, or some major qualities of the results. But good results, in the aspects for which word2vec is most appreciated, depend on plentiful training data.)
Also, whether or not your aims would be helped by the extra step of PCA to reduce the dimensionality of a larger word2vec model seems another separable question, to be determined experimentally by comparing results with and without that step, on your actual data/problem, rather than guessed at from intuitions from other projects that might not be comparable.
Sklearn EllipticEnvelope calculates the covariance between two or more features and estimates the outliers. Instead of using two features, I created one new feature by dividing first with the second. When I apply EllipticEnvelope on just this one new feature. It works well. But my question is this a correct way to do it since the model relies on the covariance of two or more features?
I found the answer. It works for both univariate and multivariate. But still would love to see more answers about how it works with a single feature.
“EllipticEnvelope is a function that tries to figure out the key parameters of your data's general distribution by assuming that your entire data is an expression of an underlying multivariate Gaussian distribution. That's an assumption that cannot hold true for all datasets, yet when it does, it proves an effective method indeed for spotting outliers. Simplifying the complex estimations working behind the algorithm as much as possible, we can say that it checks the distance of each observation with respect to a grand mean that takes into account all the variables in your dataset. For this reason, it is able to spot both univariate and multivariate outliers.”
Source: Alberto Boschetti. “Python Data Science Essentials.”.
I have a particular classification problem that I was able to improve using Python's abs() function. I am still somewhat new when it comes to machine learning, and I wanted to know if what I am doing is actually "allowed," so to speak, for improving a regression problem. The following line describes my method:
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predictions values, even though my particular case, these predictions should never be negative, as they are a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed". I mean if you want to make certain statistical statements (like a 95% CI e.g.) you need to be careful. However, most ML practitioners do not care too much about underlying statistical assumptions and just want a blackbox model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML, you just have to be careful not to overfit. Maybe a more sensible solution to your problem would be to use a function that truncates at 0 like f(x) = x if x > 0 else 0. This way larger negative values don't suddenly become large positive ones.
On a side note, you should probably try some other models as well with more parameters like a SVR with a non-linear kernel. The thing is obviously that a LR fits a line, and if this line is not parallel to your x-axis (thinking in the single variable case) it will inevitably lead to negative values at some point on the line. That's one reason for why it is often advised not to use LRs for predictions outside the "fitted" data.
A straight line y=a+bx will predict negative y for some x unless a>0 and b=0. Using logarithmic scale seems natural solution to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case the values are physical quantities and cannot be negative), you could model using a generalized linear model (GLM) with a log link function. This is known as Poisson regression and is helpful for modeling discrete non-negative counts such as the problem you described. The Poisson distribution is parameterized by a single value λ, which describes both the expected value and the variance of the distribution.
I cannot say your approach is wrong but a better way is to go towards the above method.
This results in an approach that you are attempting to fit a linear model to the log of your observations.
I am trying to predict the inter-arrival time of the incoming network packets. I measure the inter-arrival times of network packets and represent this data in the form of binary features: xi= 0,1,1,1,0,... where xi=0 if the inter-arrival time is less than a break-even-time and 1 otherwise. The data has to be mapped into two possible classes C={0,1}, where C=0 represents a short inter-arrival time and 1 represents a long inter-arrival time. Since I want to implement the classifier in an online feature, where as soon as I observe a vector of features xi=0,1,1,0..., I calculate the MAP class. Since I don't have a prior estimation of the conditional and prior probabilities, I initialize them as follows:
p(x=0|c=0)=p(x=1|c=0)=p(x=0|c=1)=p(x=1|c=1)=0.5
p(c=0)=p(c=1)=0.5
For each feature vector (x1=m1,x2=m2,...,xn=mn), when I output a class C, I update the conditional and prior probabilities as follows:
p(xi=mi|y=c)=a+(1-a)*p(p(xi=mi|c)
p(y=c)=b+(1-b)*p(y=c)
The problem is that, I am always getting a biased prediction. Since the number of long inter-arrival times are comparatively less than the short, the posterior of short always remains higher than the long. Is there any way to improve this? or am I doing something wrong? Any help will be appreciated.
Since you have a long time series, the best path would probably be to take into account more than a single previous value. the standard way of doing this would be to use a time-window, i.e. split the long vector Xi to overlapping pieces of a constant length, with the last value treated as the class, and use them as the train set. This could be also done on streaming data in an online manner, by incrementally updating the NB model with new data as it arrives.
Note that Using this method, other regression algorithms might end up being a better choice than NB.
Weka (version 3.7.3 and up) has a very nice dedicated tool supporting time-series analysis. alternatively, MOA is also based on Weka, and supports modeling of streaming data.
EDIT: it might also be a good idea to move from binary features to the real values (maybe normalized), and apply the threshold post-classification. This might give more information to the regression model (NB or other), allowing better accuracy.