Sklearn recommended cluster algorithms for Unsupervised Learning [closed] - scikit-learn

Slowly getting into the world of Sklearn, more specifically unsupervised clustering algorithms.
I'm working on a project that flattens an XML file into a CSV file; that part is done.
Now I want to use one of sklearn's methods to detect anomalies in my data.
The CSV file is loaded into a DataFrame where some columns contain descriptions and others contain values. These values might also be decimals written with a comma, e.g. 55,2.
Which of the Sklearn algorithms are most suitable for anomaly detection using unsupervised learning?
To begin with, I just want to try to find anomalies in the numbers, i.e. whether there is any number that doesn't belong there.

First of all, clustering algorithms and anomaly detection algorithms are not the same thing.
In clustering, the goal is to assign each of your instances to a group (cluster), where each group contains similar instances.
In anomaly detection, the goal is to find instances that are not similar to any of the other instances.
Some clustering algorithms, for example DBSCAN, create an "anomaly cluster". This cluster holds all the instances that don't belong to any other cluster. I would suggest trying it to see if it solves your problem.
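For illustration, here is a minimal DBSCAN sketch; the data, eps and min_samples values are made up and would need tuning for your own data:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
# Toy numeric data: the last row is an obvious outlier
X = np.array([[55.2, 1.0], [54.9, 1.1], [55.4, 0.9], [55.0, 1.0], [300.0, 50.0]])
# Scaling first usually helps distance-based algorithms like DBSCAN
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X_scaled)
# DBSCAN marks instances that fit in no cluster with the label -1
print(X[labels == -1])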
Almost all clustering algorithms expect a vector of numbers as input. If you want to use string columns, you can use methods like One-Hot Encoding to transform the strings into vectors of numbers. There are many ways to do that, and you can find some scikit-learn implementations here.
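As a minimal sketch (the column names and values here are hypothetical, not taken from your file):
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Hypothetical DataFrame with one description column and one numeric column
df = pd.DataFrame({"description": ["pump", "valve", "pump"], "value": [55.2, 3.1, 54.9]})
# One-hot encode the string column; .toarray() gives a dense matrix
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["description"]]).toarray()
# Stack the encoded columns next to the numeric ones to get an all-numeric matrix
X = np.hstack([encoded, df[["value"]].to_numpy()])
print(X)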

Which of the Sklearn algorithms are most suitable for anomaly detection using unsupervised learning?
The most used algorithms are the ones recommended by sklearn.
To begin with, I just want to try to find anomalies in the numbers, i.e. whether there is any number that doesn't belong there.
As I see it, you can try a novelty detection approach; here you have a basic explanation. In my experience, OneClassSVM is a reliable algorithm.
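A minimal sketch, assuming purely numeric data and an nu value guessed at roughly the expected fraction of anomalies:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM
# Toy data: most rows are "normal", the last one is an obvious outlier
X = np.array([[55.2], [54.9], [55.4], [55.1], [55.0], [54.8], [55.3], [55.2], [54.7], [300.0]])
X_scaled = StandardScaler().fit_transform(X)
# nu roughly bounds the fraction of points flagged as anomalous (assumed 10% here)
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_scaled)
# predict() returns +1 for inliers and -1 for anomalies
print(model.predict(X_scaled))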

Related

Tensorflow Object Detection: Using new detected images to retrain model [closed]

I have searched through this forum for similar questions but found them unanswered (Updating Tensorflow Object detection model with new images). I have managed to create my custom trained model (let's name it model1). I was wondering: can I use new images that are processed by model1 to further train model1? Will it improve the accuracy of the model?
Accuracy will depend on the number of correctly classified images and not only on the total number of training images. https://developers.google.com/machine-learning/crash-course/classification/accuracy. If you consider that the new images are to be used for training (have correct labels), then you should consider re-training the model. Take a look at this post https://datascience.stackexchange.com/questions/12761/should-a-model-be-re-trained-if-new-observations-are-available
You can use your current model (model1) in a number of ways:
on new images to detect bad results (hard examples) for new training (see the sketch after this list)
on new images to detect good results for evaluation
on the images in the existing dataset to detect bad images (wrong label etc.)
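As a rough sketch of the first point, assuming a hypothetical run_detection(image) helper that returns a list of (label, score) pairs for an image (this helper is a placeholder, not part of the TensorFlow API):
SCORE_THRESHOLD = 0.5  # assumed cut-off below which a detection counts as "unsure"

def select_hard_examples(images, run_detection):
    """Keep images the model is unsure about so they can be hand-labeled
    and added to the training set as hard examples."""
    hard_examples = []
    for image in images:
        detections = run_detection(image)  # hypothetical: [(label, score), ...]
        best_score = max((score for _, score in detections), default=0.0)
        if best_score < SCORE_THRESHOLD:
            hard_examples.append(image)
    return hard_examples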
Some of the bad results from new images will be non-objects (adversarial) and not directly usable for training (but see this: https://github.com/tensorflow/models/issues/3578#issuecomment-375267920).
Removal of bad images from the existing dataset requires retraining from scratch unless there is some funky way of "untraining" images from a model.
Eventually one would end up approaching a perfect dataset that makes best use of the capacity of the chosen model architecture, although the domain may evolve over time.
I think the reason this is not much discussed is because most researchers have to work with common datasets so they can compare their approaches (brilliant read: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697567/).
It might improve it, but it is tricky and could lead to overfitting. Improving the data set would actually help, but not with images detected by its own model: these images are detected because the model already performs well on them, so they don't add much.
What you actually need is quite the opposite: you need to teach the model to recognize the images that it didn't recognize before.
The main problem of machine learning (which is the approach you are using for object detection here) is generalization. In your case, it is the ability to recognize objects of the same type as the images used for training, in images that were not used during training.
Obviously, if you were able to use all possible images during training, your system would be perfect (actually, it would be a simple exact image matching problem). In a more realistic setup, the more training images you use, the higher the chance of obtaining a better object detector.
However, it is usually more valuable to add hard examples to your training set. Hence, if your application allows it (in terms of computation time in particular), you can indeed add all the images that are wrongly detected to your dataset (with the correct labels), and it will probably help you get a better model, one able to detect the objects under harder conditions in new images.
However, it really depends on what you are doing. If you want to compare your system to another one, you need to use the same (training and) test images to be fair. For benchmarking, you are not allowed to include test images in the training dataset! When you compute the accuracy (on a validation/test dataset) to compare several settings, be sure you are fair in this comparison.

Why is PyTorch called PyTorch? [closed]

I have been looking into deep learning frameworks lately and have been wondering about the origin of the name of PyTorch.
With Keras, their home page nicely explains the name's origin, and with something like TensorFlow, the reasoning behind the name seems rather clear. For PyTorch, however, I cannot seem to come across why it is so named.
Of course, I understand the "Py-" prefix and also know that PyTorch is a successor in some sense of Torch. But I am still wondering: what is the original idea behind the "-Torch" part? Is it known what the origin of the name is?
Here is a short answer, formed as another question:
Torch, SMORCH ???
PyTorch developed from Torch7. A precursor to the original Torch was a library called SVM-Torch, which was developed around 2001. The SVM stands for Support Vector Machines.
SVM-Torch is a decomposition algorithm similar to SVM-Light, but adapted to regression problems, according to this paper.
Also around this time, G. W. Flake described the sequential minimal optimization algorithm (SMO), which could be used to train SVMs on sparse data sets, and this was incorporated into NODElib.
Interestingly, this was called the SMORCH algorithm.
You can find out more about SMORCH in the NODElib docs
Optimization of the SVMs is performed by a variation of John Platt's sequential minimal optimization (SMO) algorithm. This version of SMO is generalized for regression, uses kernel caching, and incorporates several heuristics; for these reasons, we refer to the optimization algorithm as SMORCH.
So SMORCH =
Sequential
Minimal
Optimization
Regression
Caching
Heuristics
I can't answer definitively, but my thinking is "Torch" is a riff or evolution of "Light" from SVM-Light combined with a large helping of SMORCHiness. You'd need to check in with the authors of SVMTorch and SVM-Light to confirm that this is indeed what "sparked" the name. It is reasonable to assume that the "TO" of Torch stands for some other optimization, rather than SMO, such as Tensor Optimization, but I haven't found any direct reference... yet.

'normalize=True' parameter needed in Lasso and Ridge Regressions, if features already scaled? [closed]

I have already standardized my data with the help of StandardScaler() in Python. While applying Lasso regression, do I need to set the normalize parameter to True or not, and why?
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_new = scaler.fit_transform(x)
Now, I want to use Lasso regression.
from sklearn.linear_model import Lasso
lreg = Lasso(alpha=0.1, max_iter=100, normalize=True)
I want to know if 'normalize=True' is still needed or not?
Standardizing and normalizing are two different actions. If you do both without knowing what they do and why you do it, you'll end up losing accuracy.
Standardization removes the mean and divides by the standard deviation. Normalization rescales everything to lie between 0 and 1.
Depending on the penalization (lasso, ridge, elastic net) you'll prefer one over the other, but it's not recommended to do both.
So no, it's not needed.
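A minimal sketch of the usual pattern: do the scaling explicitly (for example in a Pipeline) and skip the estimator's normalize argument; note that recent scikit-learn versions have removed the normalize parameter from Lasso entirely.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
# Toy data just for illustration
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.1, size=100)
# Standardization happens once, inside the pipeline; no normalize argument needed
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=1000))
model.fit(X, y)
print(model.named_steps["lasso"].coef_)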

SVM for text classification - tutorial on machine learning? How do I get started? [closed]

I'm looking for a really good tutorial on machine learning for text classification perhaps using Support vector machine (SVM) or other appropriate technology for large-scale supervised text classification. If there isn't a great tutorial, can anyone give me pointers to how a beginner should get started and do a good job with things like feature detection for English language Text Classification.
Books, articles, anything that can help beginners get started would be super helpful!
In its classical flavour the Support Vector Machine (SVM) is a binary classifier (i.e., it solves classification problems involving two classes). However, it can also be used to solve multi-class classification problems by applying techniques like One-versus-One, One-versus-All, or Error Correcting Output Codes [Alwein et al.]. More recently, a modification of the classical SVM, the multiclass SVM, allows multi-class classification problems to be solved directly [Crammer et al.].
Now, as far as document classification is concerned, your main problem is feature extraction (i.e., how to derive classification features from your documents). This is not a trivial task and there is a considerable body of literature on the topic (e.g., [Rehman et al.], [Lewis]).
Once you've overcome the obstacle of feature extraction, and have labeled your document samples and placed them in a feature space, you can apply any classification algorithm such as SVMs, AdaBoost, etc.
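As a concrete, if minimal, sketch of that pipeline in scikit-learn, using TF-IDF features and a linear SVM (the tiny corpus below is made up for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
# Tiny made-up corpus with two classes
texts = ["the match ended in a draw", "the team won the league",
         "stocks fell sharply today", "the market rallied after the report"]
labels = ["sports", "sports", "finance", "finance"]
# TF-IDF turns documents into feature vectors; LinearSVC is a linear SVM classifier
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["the striker scored twice"]))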
Introductory books on machine learning:
[Flach], [Mohri], [Alpaydin], [Bishop], [Hastie]
Books specific for SVMs:
[Schlkopf], [Cristianini]
Some specific bibliography on document classification and SVMs:
[Miner et al.], [Srivastava et al.], [Weiss et al.], [Pilászy], [Joachims], [Joachims01], [Joachims97], [Sassano]

Multi layer perceptron for OCR [closed]

I intend to use a multi-layer perceptron network trained with backpropagation (one hidden layer, inputs fed in as 8x8 bit matrices containing the B/W pixels of the image). The following questions arise:
which type of learning should I use: batch or on-line?
how could I estimate the right number of nodes in the hidden layer? I intend to process the 26 letters of the English alphabet.
how could I stop the training process, to avoid overfitting?
(not quite related) is there another NN proven to perform better than the MLP? I know about MLPs getting stuck in local minima, overfitting and so on, so is there a better (soft-computing-based) approach?
Thanks
Most of these questions are things you need to try different options for, to see what works best. That is the problem with ANNs: there is no "best" way to do almost anything. You need to find out what works for your specific problem. Nevertheless, I will give my advice on your questions.
1) I prefer incremental learning. I think it is important for the network weights to be updated after each pattern.
2) This is a tough question. It really depends on the complexity of your network: how many input nodes, output nodes, and training patterns there are. For your problem, I might start with 100 and try ranges up and down from 100 to see if there is improvement.
3) I usually calculate the total error of the network when applied to the test set (not the training set) after each epoch. If that error increases for about 5 epochs, I stop training and then use the network that was created before the increase occurred. It is important not to use the error of the training set when deciding to stop training; doing so will cause overfitting. (See the sketch after this list for one way to set this up.)
4) You could also try a probabilistic neural network if you are representing your output as 26 nodes, each representing a letter of the alphabet. This network architecture is good for classification problems. Again, it may be a good idea just to try a few different architectures to see what works best for your problem.
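As an illustration of points 2 and 3 with scikit-learn's MLPClassifier (the digits dataset is 8x8 images of digits rather than letters, but the input shape matches the question; the hidden layer size and patience are arbitrary starting points):
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
X, y = load_digits(return_X_y=True)  # 8x8 images flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# One hidden layer of 100 nodes; early_stopping holds out a validation split
# and stops when the validation score stops improving for n_iter_no_change epochs
mlp = MLPClassifier(hidden_layer_sizes=(100,), early_stopping=True,
                    validation_fraction=0.1, n_iter_no_change=5,
                    max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))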
Regarding number 3, one way to find out when your ANN starts to overfit is by graphing the accuracy of the net on your training data and your test data vs the number of epochs performed. At some point, as your training accuracy continues to increase (tending towards 100%), your test accuracy will probably start to actually decrease because the ANN is overfitting to the training data. See what epoch that starts to happen and make sure not to train past that.
If your data is very regular and consistent, then it might not overfit until very late in the game, or not at all. And if your data is highly irregular, then your ANN will start to overfit much earlier.
Also, one way to test how regular your data is would be to do something like k-fold cross-validation, as in the sketch below.
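A minimal k-fold cross-validation sketch with scikit-learn (reusing the 8x8 digits data; the fold count of 5 is an arbitrary choice):
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
X, y = load_digits(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
# 5-fold cross-validation: a large spread across folds suggests irregular data
scores = cross_val_score(mlp, X, y, cv=5)
print(scores.mean(), scores.std())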
