Build confusion matrix for multiclass multilabel classification

I would like to build a confusion matrix for multiclass multilabel classification, in order to then calculate precision, recall and F1.
One idea is to build it from all label combinations that occur in the training and test sets, e.g.:

        A1    A2A3  A1A3
A1      x     x     x
A2A3    x     x     x
A1A3    x     x     x
The other idea is to build it like the confusion matrix for single-label classification, but use double values for the cells of the matrix, e.g.:

        A1      A2      A3
A1      double  double  double
A2      double  double  double
A3      double  double  double
The question in this case is how to calculate these values meaningfully.
Does anybody have experience with building such matrices? Which version is more rational?
If there is some other way to build such a confusion matrix, I would be glad to hear about it.
Greetings, Andriy

If it also interests somebody, here is how it works for me:
I used the first idea and calculated label-based measures following the description in: Gj. Madjarov et al., "An extensive experimental comparison of methods for multi-label learning", Pattern Recognition (2012).
The corresponding code can be found in the evaluation module of dkpro-tc (DKPro Text Classification Framework).
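If a concrete example helps: below is a minimal sketch in plain NumPy of the label-based measures (not the dkpro-tc code itself; the function name and the example data are my own assumptions). It computes per-label precision, recall and F1 from binary indicator matrices and then macro-averages them.

import numpy as np

def label_based_scores(y_true, y_pred):
    # y_true, y_pred: (n_samples, n_labels) binary indicator arrays.
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0).astype(float)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0).astype(float)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0).astype(float)
    # Guard against empty labels so 0/0 becomes 0 instead of NaN.
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    f1 = np.divide(2 * precision * recall, precision + recall,
                   out=np.zeros_like(tp), where=(precision + recall) > 0)
    return precision, recall, f1

# Example: 4 samples, 3 labels (A1, A2, A3)
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 1], [1, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]])
p, r, f = label_based_scores(y_true, y_pred)
print("macro-F1:", f.mean())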

Related

How to compute linear regression using the multivariate least squares method without the scikit-learn library?

My question is about classifying the iris dataset using multivariate linear regression without using the scikit-learn library.
I have this formula that is needed to find the beta values for the dataset.
β̂ = (X′X)⁻¹X′Y
This is the dataset in question: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
How do I compute the linear regression using this formula? I understand that linear regression is
Yi = β0 + β1X1i + ... + βkXki + ϵi
I have computed the beta values with the formula above using matrix multiplication. How do I find the linear regression equation now? I took the first 4 columns as the X matrix and the label column as the Y matrix, with the classes encoded as 1, 2, 3 respectively.
How do I compute the ϵi values? Do I assume them to be zero? Any help is appreciated. Thanks in advance.
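Since this question has no answer in the thread, here is a minimal sketch of the normal equation β̂ = (X′X)⁻¹X′Y in plain NumPy, assuming the file from the URL above has been saved as iris.data and the classes are encoded as 1, 2, 3 as described. The ϵi are not assumed to be zero; they are the residuals left over after fitting.

import numpy as np

# Load the four numeric columns and map the species names to 1, 2, 3.
raw = np.genfromtxt("iris.data", delimiter=",", dtype=str)
raw = raw[raw[:, 0] != ""]  # drop any blank trailing rows
X = raw[:, :4].astype(float)
classes = {name: i + 1 for i, name in enumerate(np.unique(raw[:, 4]))}
Y = np.array([classes[name] for name in raw[:, 4]], dtype=float)

# Prepend a column of ones so beta[0] plays the role of the intercept β0.
X = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equation: beta = (X'X)^(-1) X'Y
beta = np.linalg.inv(X.T @ X) @ X.T @ Y

# The epsilon_i are the residuals once beta is fitted, not zero by assumption.
eps = Y - X @ beta
print("beta:", beta)
print("mean squared residual:", np.mean(eps ** 2))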

Confusion matrix 5x5 formula for finding accuracy, precision, recall, and F1-score

I'm trying to study confusion matrices. I know about the 2x2 confusion matrix, but I still don't understand how to compute accuracy, precision, recall and F1-score from a 5x5 confusion matrix. Can anyone help me with this? I appreciate every bit of help.
See my answer here: Calculating Equal error rate (EER) for a multi-class classification problem
In short, one strategy is to split the multiclass problem into a set of binary classification problems, one "one vs. all others" classification per class. For each binary problem you can then calculate F1, precision and recall, and if you want you can average (uniformly or weighted) the per-class scores to get a single F1 score that represents the multiclass problem.
As for confusion matrices larger than 2x2: the rows are the true labels and the columns are the predicted labels. The number in cell (i,j) is then the number of samples from class i which were classified as class j (note that i=j corresponds to a correct prediction). The accuracy is the trace of the confusion matrix divided by the number of samples.
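To make the cell (i,j) description concrete, here is a small sketch that derives accuracy and the per-class ("one vs. all others") precision, recall and F1 directly from an NxN confusion matrix; the 5x5 example matrix is invented for illustration.

import numpy as np

def scores_from_confusion(cm):
    # cm[i, j] = number of samples of true class i predicted as class j.
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)             # correct predictions per class
    fp = cm.sum(axis=0) - tp     # column total minus the diagonal
    fn = cm.sum(axis=1) - tp     # row total minus the diagonal
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()   # trace divided by number of samples
    return accuracy, precision, recall, f1

# Invented 5x5 confusion matrix
cm = [[50, 2, 0, 1, 0],
      [3, 40, 4, 0, 1],
      [0, 5, 45, 2, 0],
      [1, 0, 3, 47, 2],
      [0, 1, 0, 4, 44]]
acc, p, r, f = scores_from_confusion(cm)
print("accuracy:", acc)
print("macro-F1:", f.mean())   # uniform average over the 5 classes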

How to restore (predict) data based on correlation/regression in Excel?

I have some data in which a feature (height) is correlated with the output variable (price). How can I restore missing values (nulls) in the height feature based on the existing dependency (correlation) between these variables?
To be more clear:
The input and output variables have a clear correlation. I guess that predicting missing values in Excel is not a difficult procedure, but I need some directions on how to implement it.
If you put the slope (m) and intercept (c) of the regression line in E2 and E3 (say):-
=SLOPE(C2:C9,B2:B9)
=INTERCEPT(C2:C9,B2:B9)
you can rearrange the simple regression equation y = mx + c to predict the x-values:
x = (y - c)/m
So your predicted heights would be:-
=IF(ISBLANK(B2),(C2-E$3)/E$2,B2)
starting in D2.
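If you want to sanity-check the same rearrangement outside Excel, here is a tiny sketch in Python (the heights and prices are invented placeholders): it fits price = m*height + c on the complete rows and then fills the gaps with x = (y - c)/m, just like the formula above.

import numpy as np

price  = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 24.0])
height = np.array([1.0, 1.2, np.nan, 1.8, np.nan, 2.4])  # NaN = missing

known = ~np.isnan(height)
# Fit price = m*height + c on the complete rows, like SLOPE/INTERCEPT.
m, c = np.polyfit(height[known], price[known], 1)

# Rearrange to x = (y - c)/m and fill only the missing heights.
filled = np.where(known, height, (price - c) / m)
print(filled)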
You might try the FORECAST¹ function. The first blank does not have enough preceding data to generate a forecast result, so a simple ratio will have to suffice there, but the remaining values can be generated, and they take previously generated FORECAST results into consideration for their own result(s).
The formula in E2 is,
=IF(ISBLANK(B2), FORECAST(C2, B$2:B$9, C$2:C$9), B2)
¹ See Forecasting functions for alternative algorithms in data prediction.

How do I run the Spark logistic regression with categorical features using python?

I have data with some categorical variables and I want to run a logistic regression using MLlib; it seems like the model supports only continuous variables.
Does anyone know how to deal with this, please?
Logistic regression, like the other linear models, takes as input an RDD of LabeledPoint, where a LabeledPoint is a Double (the label) together with the associated Vector (an array of doubles).
Categorical values (Strings) are not supported; however, you can convert them to binary columns.
For example, if you have a column RAG taking the values Red, Amber and Green, you would add three binary columns isRed, isAmber and isGreen, of which exactly one is 1 (true) and the others are 0 (zero) for each sample.
See for further explanation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
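A minimal sketch of that conversion with the RDD-based MLlib API; the tuple layout, the tiny inline dataset and the assumption that sc is an existing SparkContext are all illustrative, not prescriptive.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

RAG_VALUES = ["Red", "Amber", "Green"]

def to_labeled_point(row):
    # Assumed row layout: (label, rag_string, numeric_feature).
    label, rag, numeric = row
    # One-hot encode RAG into isRed, isAmber, isGreen.
    one_hot = [1.0 if rag == value else 0.0 for value in RAG_VALUES]
    return LabeledPoint(float(label), one_hot + [float(numeric)])

# Tiny invented dataset; sc is an existing SparkContext.
data = sc.parallelize([(1.0, "Red", 3.2), (0.0, "Green", 1.5), (1.0, "Amber", 2.7)])
model = LogisticRegressionWithLBFGS.train(data.map(to_labeled_point))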

Excel linest formula for weighted polynomial fit

How do I specify an Excel LINEST formula for a weighted polynomial fit? Something like
LINEST(y*w^0.5,IF({1,0},1,x)*w^0.5,FALSE,TRUE)
works, but only for a linear fit. I'm looking for a similar formula for a 2nd-order and 3rd-order polynomial regression fit.
In a reply to the other post, Weighted trendline, an approach for weighted polynomials was already suggested. For example, for a cubic fit, try the following with CTRL+SHIFT+ENTER in a 4x1 range:
=LINEST(y*w^0.5,(x-1E-99)^{0,1,2,3}*w^0.5,FALSE)
(-1e-99 ensures that 0^0=1). Similar to the linear case, for R^2 try:
=INDEX(LINEST((y-SUMPRODUCT(y,w)/SUM(w))*w^0.5,(x-1E-99)^{0,1,2,3}*w^0.5,FALSE,TRUE),3,1)
Derivation
In standard least squares we find the vector b that minimises: |y - Xb|² = (y - Xb)'(y - Xb)
In the weighted case, b is instead chosen to minimise: |W(y - Xb)|² = (y - Xb)'W'W(y - Xb)
So the weighted regression is Wy on WX, where W'W = W² is the diagonal matrix of the weights.
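The same derivation carries over directly to code. Here is a short sketch, assuming w holds the raw weights (so each row is scaled by sqrt(w), matching W'W = W² above), of regressing Wy on WX with a polynomial design matrix; the sample data are invented.

import numpy as np

def weighted_polyfit(x, y, w, degree):
    # Design matrix X with columns x^0 .. x^degree (like the ^{0,1,2,3} array above).
    X = np.vander(x, degree + 1, increasing=True)
    sw = np.sqrt(w)  # W'W = diag(w)  =>  W = diag(sqrt(w))
    # Regress W*y on W*X, as in the derivation.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta      # coefficients b0 .. b_degree

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.7, 5.8, 11.2, 18.9])
w = np.array([1.0, 1.0, 2.0, 2.0, 4.0])  # invented weights
print(weighted_polyfit(x, y, w, 3))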
