Finding the top three relevant categories and their corresponding probabilities - python-3.x

From the script below, I find the highest probability and its corresponding category in a multi-class text classification problem. How do I find the top 3 predicted probabilities and their corresponding categories efficiently, without using loops?
probabilities = classifier.predict_proba(X_test)
max_probabilities = probabilities.max(axis=1)
order=np.argsort(probabilities, axis=1)
classification=(classifier.classes_[order[:, -1:]])
print(accuracy_score(classification,y_test))
Thanks in advance.
(I have around 50 categories. I want to extract the 3 most relevant categories out of the 50 for each of my narrations and display them in a dataframe.)

You've done most of the hard work here; you're just missing a bit of numpy-fu to finish it off. Your line
order = np.argsort(probabilities, axis=1)
contains the indices of the sorted probabilities, so [[lowest_prob_class_1, ..., highest_prob_class_1], ...] for each of your samples. You have already used this to get your classification with order[:, -1:], i.e. the index of the highest-probability class. So to get the top three classes we can make a simple change:
top_3_classes = classifier.classes_[order[:, -3:]]
Then to get the corresponding probabilities (each row ordered from third-highest to highest, like top_3_classes) we can use
top_3_probabilities = probabilities[np.repeat(np.arange(order.shape[0]), 3),
order[:, -3:].flatten()].reshape(order.shape[0], 3)
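
As an alternative to the repeat/flatten/reshape indexing above, here is a minimal sketch using np.take_along_axis. The small probabilities and classes arrays are made-up stand-ins for classifier.predict_proba(X_test) and classifier.classes_, and the column reversal is an optional extra so that the best class comes first in each row.

```python
import numpy as np

# Hypothetical stand-ins for classifier.predict_proba(X_test) and classifier.classes_
probabilities = np.array([
    [0.10, 0.50, 0.05, 0.35],
    [0.40, 0.10, 0.30, 0.20],
])
classes = np.array(["a", "b", "c", "d"])

k = 3
# argsort is ascending: keep the last k columns, then reverse so the best class is first
top_k_idx = np.argsort(probabilities, axis=1)[:, -k:][:, ::-1]

# Fancy indexing maps column indices to class labels, row by row
top_k_classes = classes[top_k_idx]

# take_along_axis picks, for each row, the probabilities at that row's indices
top_k_probabilities = np.take_along_axis(probabilities, top_k_idx, axis=1)

print(top_k_classes)
print(top_k_probabilities)
```

Both results have shape (n_samples, 3), so they can be passed to pd.DataFrame directly if a dataframe is needed.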

Related

Why is ColumnTransformer producing a different output using the same code but different .csv files?

I am trying to finish this course tooth and nail with the hopes of being able to do this kind of stuff at entry level by springtime. This is my first post here on this incredible resource, and I will do my best to conform to the posting format. As a potential way to reinforce my learning and commit it to long-term memory, I'm trying the same things on my own dataset of > 500 entries containing data more relevant to me as opposed to dummy data.
I'm learning about the data preprocessing phase where you fill in missing values and separate the columns into their respective X and Y to be fed into the models later on, if I understand correctly.
So in the course example, it's the top left dataset of countries. Then the bottom left is my own database of data I've been keeping for about a year on a multiplayer game I play. It has 100 or so characters you can choose from who are played between 5 different categorical roles.
[Image: course dataset (top left) and personal dataset (bottom left)]
[Image: column-transformed results for the personal dataset]
What's up with the different outputs that are produced, with the only difference being the dataset (.csv file)? The course's dataset looks right; that first column of countries (textual categories) gets turned into binary vectors in the output no? Why is the output on my data set omitting columns, and producing these bizarre looking tuples followed by what looks like a random number? I've tried removing the np.array function, I've tried printing each output at each level, unable to see what's causing the difference. I expected on my dataset it would transform the characters' names into binary vectors (combinations of 1s/0s?) so the computer can understand the difference and map them to the appropriate results. Instead I'm getting that weird looking output I've never seen before.
EDIT: It turns out these bizarre number combinations are what's called a "sparse matrix." I had to do some research, starting with type(), which yielded csr_array. If I understood what I read correctly, all the stuff inside takes up one column, so I just tried all rows/columns using [:] and I didn't get an error.
Really appreciate your time and assistance.
EDIT: Thanks to this thread I was able to make my way to the end of this data preprocessing/import/cleaning/ phase exercise, to feature scaling using my own dataset of ~ 550 rows.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# IMPORT RAW DATA // ASSIGN X AND Y RAW
df = pd.read_csv('datasets/winpredictor.csv')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# TRANSFORM CATEGORICAL DATA
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0, 1])],
    remainder='passthrough')
le = LabelEncoder()
X = ct.fit_transform(X)
y = le.fit_transform(y)
# SPLIT THE DATA INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=.8, test_size=.2, random_state=1)
# FEATURE SCALING
sc = StandardScaler(with_mean=False)
X_train[:, :] = sc.fit_transform(X_train[:, :])
X_test[:, :] = sc.transform(X_test[:, :])
First of all, I encourage you to keep working through this course; I'm sure you will be a great data scientist in a few weeks.
Let's talk about your problem. It seems that you only have a visualization problem, caused by the large number of distinct "Hero" values (I think you have 37 unique values).
Let me explain the output you printed. The program only shows the entries of each sample that are different from 0:
(0,10)=1 --> 0 refers to the first sample, and 10 is the (zero-based) column
index of the entry that is equal to 1.
(0,37)=5 --> 0 refers to the first sample, and 37 is the (zero-based) column index of the entry that is equal to 5.
etc.
So your first sample will be something like:
[0,0,0,0,0,0,0,0,0,0,1,.........., 5, 980,-30, 1000, 6023]
Which is the way to express the first sample of "Jakiro".
["Jakiro",5, 980,-30, 1000, 6023]
To sum up, the first 37 values come from your OneHotEncoder, and the last 5 are your original numerical values.
So the output seems to be correct; it is just displayed differently because the categorical variable has so many classes.
You can try to reduce the number of X rows (to 4 for example), and try the same process. Then you will have a similar output as the course.
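
To make the (row, column) value notation concrete, here is a small numpy-only sketch. It builds a dense version of the first sample described above (the 37-wide one-hot block and the five numeric values are taken from the example; the total width of 42 is an assumption that follows from them) and lists only its non-zero entries, which is essentially what the sparse printout does.

```python
import numpy as np

# Hypothetical dense version of the first encoded sample:
# 37 one-hot columns followed by 5 numeric columns.
row = np.zeros(42)
row[10] = 1                           # one-hot position of the category
row[37:] = [5, 980, -30, 1000, 6023]  # the original numeric values

dense = row.reshape(1, -1)            # a single-sample matrix

# Coordinates of every entry that is different from 0
rows, cols = np.nonzero(dense)
coords = [((int(r), int(c)), float(dense[r, c])) for r, c in zip(rows, cols)]

for (r, c), value in coords:
    print(f"({r}, {c})\t{value}")
```

Going the other way, a SciPy sparse result can be turned into an ordinary array with its toarray() method if the dense view is easier to inspect.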

Evaluation with ground truth label and list of predicted labels

Currently, I am trying to predict the top 5/10 subjects for a statistics exercise based on the exercise's description. The subjects and exercises (with ground truth label, as an integer) are provided in CSV format. The ground truth label is also present in the subjects' CSV, where it is called "id".
My current model produces a tuple for every exercise: the first element is the ground truth label, and the second element is a list of the predicted labels.
My question: how do I compute (Accuracy,) Precision, Recall, and F1 (and, if possible, also MRR and MAR)?
Also, all exercises and subjects are converted to vectors. Furthermore, I calculate accuracy by counting all instances for which the ground truth is present in the top 5/10, and dividing this by the total number of exercises.
*note: in the code exercise = question, and subject = kc
My variables are as follows:
question_data = df[['all_text_clean', 'all_text_as_vector', 'groud_truth_id'] ].values
kc_data = subject_df[['id', 'all_text_as_vector']].values
Then, I loop over every exercise-subject pair:
from operator import itemgetter
from scipy.spatial import distance

question_candidates = []
for qtext, qvec, gt_id in question_data:
    scores = []
    for kc_id, kc_vec in kc_data:
        score = distance.cosine(qvec, kc_vec)  # cosine distance (1 - cosine similarity)
        scores.append((kc_id, score))  # store each kc_id with its distance
    scores = sorted(scores, key=itemgetter(1))  # ascending distance: most similar first
    candidates = [kc_id for kc_id, score in scores][:5]  # only the ids matter; these are the suggestions
    question_candidates.append((gt_id, candidates))
Accuracy is moderate: around 0.59. I don't expect anything higher, since this is just a baseline model.
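
One possible way to compute these metrics from the (ground truth, candidates) tuples is sketched below. It assumes exactly one correct subject per exercise, so recall@k equals the hit rate and precision@k is the hit rate divided by k. The evaluate function and the example data are hypothetical, not part of the original code.

```python
def evaluate(question_candidates, k=5):
    """Hit rate, precision@k, recall@k, F1 and MRR for single-label ground truth."""
    n = len(question_candidates)
    hits = 0
    reciprocal_rank_sum = 0.0
    for gt_id, candidates in question_candidates:
        top_k = list(candidates)[:k]
        if gt_id in top_k:
            hits += 1
            # rank is 1-based: first suggestion has rank 1
            reciprocal_rank_sum += 1.0 / (top_k.index(gt_id) + 1)
    hit_rate = hits / n
    precision = hit_rate / k   # at most 1 of the k suggestions can be correct
    recall = hit_rate          # one relevant subject per exercise
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {
        "accuracy@k": hit_rate,
        "precision@k": precision,
        "recall@k": recall,
        "f1@k": f1,
        "mrr": reciprocal_rank_sum / n,
    }


# Made-up example: ground truth found at rank 1, at rank 2, and not at all
example = [(3, [3, 7, 1, 9, 4]), (2, [5, 2, 8, 1, 0]), (6, [1, 2, 3, 4, 5])]
metrics = evaluate(example, k=5)
print(metrics)
```

With a single relevant subject per exercise, precision@k is capped at 1/k, so MRR is usually the more informative number for comparing models.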

How to feed multiple time series sets into LSTM model for prediction?

I am trying to take this dataset and predict news popularity levels over time.
The dataset is made up of 145 columns (1 being the ID linked to the actual news story in a separate file, 2 - 145 for 144 20-minute time slices where each cell in a row records the popularity level of the corresponding news story).
I have already normalized the dataset "Facebook_Economy.csv" to range from 0 to 1. At the moment I can only feed a single time series into my model (train on ~100 time slices and test on ~44 time slices). My aim is to train on several rows of 144 time slices and test on several other rows, e.g., train on the time series data for news stories 1-20 and test on news stories 21-30, etc.
This is how I am currently feeding data into my model:
def run(filename):
    series = read_csv(filename, header=0, index_col=0)
    repeats = 1
    results = DataFrame()
    timesteps = 1
    for i in range(len(series)):
        results['results'] = experiment(repeats, series.iloc[i].squeeze(), timesteps)
        # Where experiment(repeats, series, timesteps)
    print(results.describe())
As well (for some insight into how the rest of my code looks) I have been following this tutorial from Jason Brownlee for some guidance.
I'm not sure I understood the question well, but I think I have already done something similar. First, you have to stack all your features in an array. I found this link very helpful: https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/
Here is the code I use (fen_pred = input size, n_output = output size):
dataset_train_features = np.hstack((dataset_train, new_features_train, ssa_feature1_train, ssa_feature2_train))
dataset_train_labels = dataset_train
features_set = list()
labels = list()
# X_train
for i in range(fen_pred, len_train):
    if len(dataset_train_features[i:i+n_output]) < n_output:
        break
    features_set.append(dataset_train_features[i-fen_pred:i])
# y_train
for i in range(fen_pred, len_train):
    if len(dataset_train_labels[i:i+n_output]) < n_output:
        break
    labels.append(dataset_train_labels[i:i+n_output])
X_train = np.array(features_set)
y_train = np.array(labels)
(Please note that I want to predict several timesteps of a single time series; that's why I don't predict multiple features.)
In your case, I would edit the array dataset_train_features to include every feature you need to train your model, and then repeat this technique to create your test set.
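
For the multi-row case in the question, the same windowing idea can be applied to each row and the results stacked. The sketch below assumes each row is one news story's series; make_windows, the window sizes and the random data are all hypothetical placeholders, and the 1-20 / 21-30 split follows the example given in the question.

```python
import numpy as np


def make_windows(series_rows, n_input, n_output):
    """Slice each row (one time series) into (input window, output window) pairs."""
    X, y = [], []
    for row in series_rows:
        for start in range(len(row) - n_input - n_output + 1):
            X.append(row[start:start + n_input])
            y.append(row[start + n_input:start + n_input + n_output])
    # LSTM layers typically expect (samples, timesteps, features); here features = 1
    return np.array(X)[..., np.newaxis], np.array(y)


# Hypothetical data: 30 stories x 144 time slices, already scaled to [0, 1]
rng = np.random.default_rng(0)
data = rng.random((30, 144))

# Windows never cross row boundaries, so stories stay separate
X_train, y_train = make_windows(data[:20], n_input=10, n_output=1)
X_test, y_test = make_windows(data[20:], n_input=10, n_output=1)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Splitting by row before windowing keeps every test story completely unseen during training.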

Random Forest Classifier :To which class corresponds the probabilities

I am using the RandomForestClassifier from pyspark.ml.classification
I run the model on a binary class dataset and display the probabilities.
I have the following in the probability column:
+-----+----------+---------------------------------------+
|label|prediction|probability |
+-----+----------+---------------------------------------+
|0.0 |0.0 |[0.9005918461098429,0.0994081538901571]|
|1.0 |1.0 |[0.6051335859900139,0.3948664140099861]|
+-----+----------+---------------------------------------+
I have a list of 2 elements, which obviously correspond to the probabilities of the predicted classes.
My question: does probability[0] always correspond to the value of prediction? The Spark documentation is not clear on this.
I am interpreting your question as asking: does the first element in the array under the column 'probability' always correspond to the "predicted class", by which you mean the label the Random Forest Classifier predicted the observation should have?
If I have that correct, the answer is Yes.
The items in the arrays in both probability rows can be read as the model telling you:
['My confidence that the predicted label = the true label',
'My confidence that the label != the true label']
In the case of multiple labels being predicted, then you would have the model telling you:
['My confidence that the label I predict = specific label 1',
'My confidence that the label I predict = specific label 2',
...'My confidence that the label I predict = specific label N']
This is indexed by the N labels you are trying to predict (which means you have to be careful about the way the labels are structured).
Perhaps it would help to take a look at this answer. You could do something like:
model = pipeline.fit(training_data)
predictions = model.transform(test_data)
predictions.show(10)  # show() prints the rows itself and returns None
(Using the relevant pipeline and data from your examples.)
This will show you the probabilities for each class.
I posted almost the same question here, and I think the answer might help you:
Scala: how to know which probability correspond to which class?
The answer lies before the model is fit.
To fit the model we use a labelIndexer on the target. This label indexer transforms the target into an index ordered by descending frequency.
E.g., if my target contains 20% "aa" and 80% "bb", the label indexer will create a column "label" that takes the value 0 for "bb" and 1 for "aa" (because "bb" is more frequent than "aa").
When we fit a random forest, the probabilities follow this frequency order.
In binary classification:
first proba = probability that the class is the most frequent class in the train set
second proba = probability that the class is the least frequent class in the train set
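
The frequency-ordering idea can be illustrated without Spark. The sketch below mimics the indexer's descending-frequency ordering with collections.Counter and then maps each position of a probability vector back to its original label. The target values and probability vectors are made up; in PySpark itself, the fitted StringIndexerModel exposes the same ordering through its labels attribute.

```python
from collections import Counter

import numpy as np

# Hypothetical training target: 80% "bb", 20% "aa"
target = ["bb"] * 8 + ["aa"] * 2

# Mimic the indexer's default ordering: the most frequent label gets index 0
labels = [label for label, _ in Counter(target).most_common()]

# Hypothetical probability vectors, as they would appear in the probability column
probability = np.array([[0.90, 0.10], [0.35, 0.65]])

# Position i in each vector belongs to labels[i]; the prediction is the argmax
predicted_index = probability.argmax(axis=1)
predicted_label = [labels[i] for i in predicted_index]

print(labels)           # index -> original label
print(predicted_label)
```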

Scikit-Learn Linear Regression how to get coefficient's respective features?

I'm trying to perform feature selection by evaluating my regression's coefficient outputs and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are returned from the coef_ attribute. The documentation says:
Estimated coefficients for the linear regression problem. If multiple
targets are passed during the fit (y 2D), this is a 2D array of
shape (n_targets, n_features), while if only one target is passed,
this is a 1D array of length n_features.
I am passing A and B into regression.fit(A, B), where A is a 2-D array with the tfidf value for each feature in a document. Example format:
        "feature1"  "feature2"
"Doc1"  .44         .22
"Doc2"  .11         .6
"Doc3"  .22         .2
B are my target values for the data, which are just numbers 1-100 associated with each document:
"Doc1" 50
"Doc2" 11
"Doc3" 99
Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.
What I found to work was:
X = your independent variables
coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logistic.coef_))], axis = 1)
The assumption you stated, that the order of regression.coef_ is the same as the column order in the TRAIN set, holds true in my experience (it works with the underlying data and also checks out with the correlations between X and y).
You can do that by creating a data frame:
cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
print(cdf)
coefficients = pd.DataFrame({"Feature": X.columns, "Coefficients": logistic.coef_[0]})  # coef_[0]: each dict value must be 1-D
I suppose you are working on some feature selection task. Using regression.coef_ does give the coefficients in the same order as the features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". This should be what you want.
I would also recommend the tree models from sklearn, which can be used for feature selection as well. For specifics, check out here.
Coefficients and features in zip
print(list(zip(X_train.columns.tolist(),logreg.coef_[0])))
Coefficients and features in DataFrame
pd.DataFrame({"Feature":X_train.columns.tolist(),"Coefficients":logreg.coef_[0]})
This is the easiest and most intuitive way:
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)
or the same but transposing index and columns
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T
Suppose your training data X is in 'df_X'; then you can map it into a dictionary and feed it into a pandas DataFrame to get the mapping:
pd.DataFrame(dict(zip(df_X.columns,model.coef_[0])),index=[0]).T
Try putting them in a series with the data columns names as index:
coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending = False)
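
Since the original goal was to pick the features with the highest-magnitude coefficients, here is a small numpy-only sketch that ranks by absolute value rather than signed value. feature_names and coef are made-up stand-ins for X.columns and a 1-D regression.coef_.

```python
import numpy as np

# Hypothetical stand-ins for X.columns and regression.coef_ (single target, 1-D)
feature_names = ["feature1", "feature2", "feature3"]
coef = np.array([0.8, -2.5, 0.1])

# Sort indices by descending absolute value so large negative weights rank high too
order = np.argsort(-np.abs(coef))
ranked = [(feature_names[i], float(coef[i])) for i in order]

for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

Coefficient magnitudes are only comparable across features when the features are on a similar scale, so this ranking is most meaningful after scaling or with inputs like tfidf that already share a range.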
