How to use SMOTE in Microsoft Azure - azure

There is a module named SMOTE (Synthetic Minority Oversampling Technique) which increases the number of samples of under-represented data. I guess we should choose a feature (the feature to be predicted) which is under-represented. How do we choose it? There seems to be no option for choosing the column.

I guess you are referring to the target variable (label column). You can set that using a Metadata Editor module. Choose your label column using the column selector and set the Fields property to Labels.

Here is the SMOTE definition: SMOTE is an approach for the construction of classifiers from imbalanced datasets, i.e. datasets in which the classification categories are not approximately equally represented. The classification category is the feature that the classifier is trying to learn. There is no option for choosing the column in the SMOTE module because it should be the label column.
Here are the details on how to use SMOTE in Azure Machine Learning: https://msdn.microsoft.com/en-us/library/azure/dn913076.aspx?f=255&MSPPError=-2147217396

You can do it through the column selector. In the sample below, the blood donation data (a sample dataset in Azure ML) has 25% of people who donated (class 1).
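Outside Azure ML, the core idea behind SMOTE can be sketched in plain Python. This is a minimal illustration of the interpolation step SMOTE performs, not Azure's implementation; the `smote` helper and the data are made up for the example:

```python
import numpy as np

def smote(X_minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class
    dists = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=2)
    # indices of the k nearest neighbours, excluding the point itself
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(len(X_minority))            # a random minority sample
        nb = X_minority[rng.choice(neighbours[j])]   # one of its neighbours
        gap = rng.random()                           # interpolation factor in [0, 1)
        synthetic[i] = X_minority[j] + gap * (nb - X_minority[j])
    return synthetic

# 10 minority points in 2-D; generate 20 synthetic ones (a 200% increase)
X_min = np.random.default_rng(42).normal(size=(10, 2))
X_new = smote(X_min, n_synthetic=20)
print(X_new.shape)  # (20, 2)
```

Each synthetic point lies on the line segment between two real minority points, which is why SMOTE needs to know which column is the label: it oversamples rows belonging to the minority class of that column.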

Related

Regression analysis with All text data

I want to know the best approach to handle a regression analysis on an all-text data type. I have the following dataset.
My feature columns are: Strength, area of development, leadership, satisfactory.
The values of these columns come from a predefined set of texts, e.g. "Continuous Improvement,Self-Development,Coaching and Mentoring,Creativity,Adaptability".
Based on the values in these columns, I want to predict the label (overall Performance): Outstanding, Exceeding Expectation, or Meeting Expectation.
What would be the best approach to deal with this dataset?
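One common approach (a sketch, not the only answer): since each column holds a comma-separated set of predefined phrases, each column can be binarized with scikit-learn's MultiLabelBinarizer and the resulting 0/1 features fed to an ordinary classifier. The rows below are invented for illustration and show a single feature column; the other columns would be encoded the same way and the matrices concatenated:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression

# Hypothetical rows: each cell is a comma-separated set of predefined phrases
strength = ["Creativity,Adaptability", "Coaching and Mentoring", "Creativity"]
labels   = ["Outstanding", "Meeting Expectation", "Exceeding Expectation"]

# Turn each set of phrases into a binary indicator vector
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(s.split(",") for s in strength)
print(mlb.classes_)  # each predefined phrase becomes one binary column

# Any standard classifier can then consume the binary features
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```

Since the target is one of three ordered ratings, this is really a classification (or ordinal regression) problem rather than a numeric regression.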

precision, recall, F1 metrics exclude a label sklearn

I have a classifier for a NER task, and since 'O' labels are by far more than all others, I want to exclude it in metrics calculation.
I want to compute macro and micro scores with the sklearn package. Macro scores can be calculated with precision_recall_fscore_support, because it returns the precision, recall, F1, and support for each label separately.
Can I use the sklearn package to compute micro scores as well?
The answer turns out to be very simple: the labels parameter of the function determines which labels to include in the score calculation. It also works in combination with the macro and micro averages.
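A minimal sketch of that labels parameter, using made-up NER tags; passing every label except 'O' restricts both the per-label scores and the micro average to the listed labels:

```python
from sklearn.metrics import precision_recall_fscore_support

# Invented example predictions for a tiny NER task
y_true = ["O", "O",   "PER", "LOC", "O", "PER", "LOC", "O"]
y_pred = ["O", "PER", "PER", "LOC", "O", "O",   "LOC", "O"]

labels = ["PER", "LOC"]  # everything except the dominant 'O' tag

# Per-label scores (what a macro average would be taken over)
p, r, f, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None)

# Micro average computed over the same restricted label set
micro = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="micro")
print(micro[:3])  # (0.75, 0.75, 0.75)
```

Over {PER, LOC} there are 3 true positives, 1 false positive, and 1 false negative, so micro precision, recall, and F1 are all 3/4; the 'O' predictions never enter the counts.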

Is there an MNIST dataset where the digits are in colour?

I would like to use the MNIST dataset, where each digit is assigned a specific colour. Not the background, the digit itself.
The following dataset colours the background of the image: https://www.wouterbulten.nl/blog/tech/getting-started-with-gans-2-colorful-mnist/
Maybe you are looking for the coloured MNIST dataset?
I have seen two papers proposing it:
Invariant Risk Minimization: source code to generate the data
Predicting with High Correlation Features: source code to generate the data

Why does k=1 in KNN give the best accuracy?

I am using Weka IBk for text classification. Each document is basically a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy. How can this be explained?
If you are querying your learner with the same dataset you trained on, then with k=1 the output values should be perfect, unless you have data points with identical parameters but different outcome values. Do some reading on overfitting as it applies to KNN learners.
In the case where you are querying with the same dataset as you trained with, the query will come in for each learner with some given parameter values. Because that point exists in the learner from the dataset you trained with, the learner will match that training point as closest to the parameter values and therefore output whatever Y value existed for that training point, which in this case is the same as the point you queried with.
The possibilities are:
The training data and the test data are the same data
The test data are highly similar to the training data
The boundaries between classes are very clear
The optimal value of k depends on the data. In general, a larger value of k reduces the effect of noise on the classification, but makes the boundaries between classes more blurred.
If your result variable contains values of 0 or 1, make sure you are using as.factor (in R); otherwise the data might be interpreted as continuous.
Accuracy should generally be calculated on points that are not in the training dataset, i.e. unseen data points, because only the accuracy on unseen values tells you how the model will perform on new data.
If you calculate accuracy on the training dataset with k=1, you get 100%, since every value has already been seen by the model and the k=1 decision boundary passes through each training point. On unseen data the model then performs badly: the training error is very low, but the actual error is very high. So it is better to choose an optimal k. To do that, plot a graph of error against k for the unseen (test) data, and choose the k where the error is lowest.
To answer your question now, either:
1) you might have taken the entire dataset as the training set and chosen a subset of it as the test set,
(or)
2) you might have measured accuracy on the training dataset.
If neither of these is the case, then check the accuracy values for higher k; you will get even better accuracy for k > 1 on the unseen (test) data.
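The advice above (compare training accuracy with test accuracy, then scan k) can be sketched with scikit-learn; the digits dataset here just stands in for the text-classification data:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Training accuracy at k=1 is essentially perfect: each point is its
# own nearest neighbour, so it only "predicts" its own label.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("train accuracy, k=1:", knn1.score(X_tr, y_tr))

# What actually matters is accuracy on held-out data, scanned over k
test_scores = {
    k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    for k in (1, 3, 5, 7, 9)
}
print(test_scores)
```

If k=1 still wins on a genuinely held-out test set, that is a legitimate result (possibility 3 above: very clean class boundaries); the red flag is only when the "test" points also appear in the training set.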

How to calculate the percentage of total area of features having specific attribute values with QGIS?

I'm working in QGIS with different layers covering the same geographical extent. Taking the intersection of those layers, I generated a new one whose attribute table contains all the attributes from the different layers. I would like to know if there is a tool in QGIS that would allow me to compute the percentage of the total area covered by features with specific attribute values. Would it be possible, for example, to compare the percentage area of the features characterised by values A and B of Attributes 1 and 2 with that of the features characterised by values C and D of the same attributes?
Thank you very much for your help.
Regards,
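As a complement to doing it inside QGIS: once the intersected layer's attribute table is exported with a computed area column (e.g. from the $area expression), the percentages can be computed with pandas. The table below is invented for illustration:

```python
import pandas as pd

# Hypothetical attribute table exported from the intersected layer:
# one row per feature, with its area and the attribute values of interest
df = pd.DataFrame({
    "attr1": ["A", "A", "C", "C", "A"],
    "attr2": ["B", "B", "D", "B", "B"],
    "area":  [10.0, 5.0, 20.0, 5.0, 10.0],
})

# Percentage of the total area covered by each attribute combination
total = df["area"].sum()
pct = df.groupby(["attr1", "attr2"])["area"].sum() / total * 100
print(pct)
```

The same grouping can then be filtered to any combination of interest, e.g. comparing `pct[("A", "B")]` against `pct[("C", "D")]`.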
