Keras prediction result (getting score, use of argmax) - keras

I am trying to use the ELMo model for text classification on my own dataset. Training is complete and the number of classes is 4 (Keras model with an ELMo embedding). For prediction, I get a NumPy array. I am attaching the sample code and the result below.
import tensorflow as tf
import keras.backend as K

new_text_pr = np.array(data, dtype=object)[:, np.newaxis]

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model_elmo = build_model(classes)
    model_elmo.load_weights(model + "/" + elmo_model)
    import time
    t = time.time()
    predicted = model_elmo.predict(new_text_pr)
    print("time: ", time.time() - t)
    print(predicted)
    # print(predicted[0][0])
    print("result:", np.argmax(predicted[0]))
    return np.argmax(predicted[0])
When I print the predicted variable I get this:
time: 1.561854362487793
[[0.17483692 0.21439584 0.24001297 0.3707543 ]
[0.15607062 0.24448264 0.4398888 0.15955798]
[0.06494818 0.3439018 0.42254424 0.16860574]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.0365712 0.4194748 0.3321385 0.21181548]
[0.05350104 0.18225929 0.56712115 0.19711846]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.09541835 0.19085276 0.41069734 0.30303153]
[0.03930932 0.40526104 0.45785302 0.09757669]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.09784496 0.2292052 0.44426462 0.22868524]
[0.06089798 0.31685832 0.47317514 0.14906852]
[0.03956613 0.46605557 0.3502095 0.14416872]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.15165758 0.22900137 0.50939053 0.10995051]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.11404029 0.21311268 0.46880838 0.2040386 ]
[0.07556026 0.20502563 0.52019936 0.19921473]
[0.11096822 0.23295449 0.36192006 0.29415724]
[0.05018891 0.16656907 0.60114646 0.18209551]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.09596984 0.18282187 0.5053091 0.2158991 ]
[0.09428936 0.13995855 0.62395805 0.14179407]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.08244281 0.15743142 0.5462735 0.21385226]
[0.07199708 0.2446867 0.44568574 0.23763043]
[0.1339082 0.27288827 0.43478844 0.15841508]
[0.07354636 0.24499843 0.44873005 0.23272514]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.08924995 0.36547357 0.40014726 0.14512917]
[0.05132649 0.28190497 0.5224545 0.14431408]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04849219 0.36724472 0.39698333 0.1872797 ]
[0.07206573 0.31368822 0.4667826 0.14746341]
[0.05948553 0.28048623 0.41831577 0.2417125 ]
[0.07582933 0.18771031 0.54879296 0.18766735]
[0.03858965 0.20433436 0.5596278 0.19744818]
[0.07443814 0.20681688 0.3933627 0.32538226]
[0.0639974 0.23687115 0.5357675 0.16336392]
[0.11005415 0.22901568 0.4279426 0.23298755]
[0.12625505 0.22987585 0.31619486 0.32767424]
[0.08893713 0.14554602 0.45740074 0.30811617]
[0.07906891 0.18683094 0.5214609 0.21263924]
[0.06316617 0.30398315 0.4475617 0.185289 ]
[0.07060979 0.17987429 0.4829593 0.26655656]
[0.0720717 0.27058697 0.41439256 0.24294883]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04745338 0.25831962 0.46751252 0.22671448]
[0.06624557 0.20708969 0.54820716 0.17845756]]
result:3
Does anyone have any idea why only the 0th index value is taken? Treating this as a list of lists, the 0th index means the first list, and argmax returns the index of the maximum value in that list. Then what is the use of the other values in the lists? Why aren't they considered? Also, is it possible to get the score from this? I hope the question is clear. Is this the correct way, or is it wrong?
I have found the issue; just posting it for others who run into the same problem.
Answer: When predicting with the ELMo model, it expects a list of strings. In my code the prediction data was split into words, so the model predicted once per word; that's why I got this huge array. I used a temporary fix: the text is appended to a list, then an empty string is appended to the same list. The model predicts both list entries, but I keep only the first prediction. This is not the correct way, but it works as a quick fix and I hope to find a proper one in the future.
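As a rough sketch of that workaround (variable names here are illustrative; raw_text is assumed to hold the full input string and model_elmo the loaded model, run inside the same session as above):

texts = [raw_text, ""]  # keep the whole text as one element, pad with an empty string
new_text_pr = np.array(texts, dtype=object)[:, np.newaxis]
predicted = model_elmo.predict(new_text_pr)
label = int(np.argmax(predicted[0]))   # class index predicted for the real text (first row)
score = float(np.max(predicted[0]))    # softmax probability of that class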

To find the predicted class for each test example, you need to use axis=1. So, in your case the predicted classes will be:
>>> predicted_classes = predicted.argmax(axis=1)
>>> predicted_classes
[3 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
2 2 2 2 2 2 3 2 2 2 2 2 1 2 2]
This means that the first test example belongs to class 3 (the last of the four classes, using zero-based indices), the second test example belongs to class 2, and so on.
The previous part answers your question (I think). Now let's see what np.argmax(predicted) does: calling np.argmax() without specifying an axis flattens the predicted matrix and returns the index of the maximum value in the flattened array.
Let's see this simple example to know what I mean:
>>> x = np.matrix(np.arange(12).reshape((3,4)))
>>> x
matrix([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
>>> x.argmax()
11
11 is the index of the value 11, which is the biggest number in the whole (flattened) matrix.
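If you also want a confidence score for each prediction, the maximum along the same axis gives the probability the model assigned to the chosen class (a small sketch based on the predicted array above):

predicted_classes = predicted.argmax(axis=1)  # class index for each test example
class_scores = predicted.max(axis=1)          # probability assigned to that class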

Related

Python all possible combinations/permutation of X items and of length X+1

I've been searching everywhere but can't find a thing for my issue.
Let's say I've got three numbers : ['1','2','3'].
I want, using itertools or not, all possible combinations/permutations with a length of 4 that contain all three of these numbers (so I don't want '1111' or '1221' and the like).
The wanted result would be like that :
1 2 3 1
1 1 2 3
2 2 3 1
from itertools import combinations_with_replacement as irep
res = [''.join(x) for x in irep('123', 4) if {'1','2','3'}.issubset(x)]
# output
# ['1123', '1223', '1233']
Or, with product, if the order of the digits matters (every ordering rather than each multiset once):
from itertools import product
res = [''.join(x) for x in product('123', repeat=4) if {'1','2','3'}.issubset(x)]
# output
# ['1123', '1132', '1213', '1223', '1231', '1232', '1233',
# '1312', '1321', '1322', '1323', '1332', '2113', '2123',
# '2131', '2132', '2133', '2213', '2231', '2311', '2312',
# '2313', '2321', '2331', '3112', '3121', '3122', '3123',
# '3132', '3211', '3212', '3213', '3221', '3231', '3312', '3321']
import itertools

elements = ['1', '2', '3']
permutations = [''.join(combination)
                for combination in itertools.product(elements, repeat=4)
                if all(elem in combination for elem in elements)]
Does this produce what you are looking for?
The code produces the following output:
['1123', '1132', '1213', '1223', '1231', '1232', '1233', '1312', '1321', '1322', '1323', '1332', '2113', '2123', '2131', '2132', '2133', '2213', '2231', '2311', '2312', '2313', '2321', '2331', '3112', '3121', '3122', '3123', '3132', '3211', '3212', '3213', '3221', '3231', '3312', '3321']
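For completeness, a quick sketch of why the two approaches above return different numbers of results: combinations_with_replacement yields each multiset only once (order ignored), while product enumerates every ordering.

from itertools import combinations_with_replacement, product

multisets = [''.join(c) for c in combinations_with_replacement('123', 4)
             if {'1', '2', '3'}.issubset(c)]
orderings = [''.join(p) for p in product('123', repeat=4)
             if {'1', '2', '3'}.issubset(p)]
print(len(multisets), len(orderings))  # 3 36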

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features, and 2 classes [0, 1], i.e.
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ...
My problem is that the output model generated by svm-train (svm_train_model.txt) has 12 fewer data lines than the input file. The model file has 450 lines in total, but 9 of those are header lines at the beginning showing the various parameters generated,
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
So 441 data lines remain, meaning 12 lines of the original 453 input lines are gone. I am new to SVMs and was hoping someone could shed some light on why this might have happened.
Thanks in advance.
Update:
I now believe that, in generating the model, lines for which the label and all the feature values are exactly the same have been removed.
To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on whether or not they are involved in a particular process (1 = Yes, 0 = No). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result are removed from the output model. My question is why the output model does this and how I can get around it (whilst using the same features). Although some of the labels and their corresponding feature values are identical within the input file, they still represent different miRNAs.
NOTE: the input file does not have a feature for the miRNA name (which would clearly distinguish each line). In terms of the features used (nucleotide percentage content), some of the miRNAs have exactly the same percentage content of A, U, G and C, and as a result they are treated as duplicates and removed from the output model, even though they are not duplicates (hence the output model has fewer lines).
the format of the input file is:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like the following (the very first two lines appear identical, yet each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
My first observation is that you should align your input file properly. The libsvm code doesn't look for exactly 4 features; it identifies features by the index:value separators you provide between the label and the feature values. I suggest manually converting your input file to create the desired input format.
Try the following Python code.
Requirements: h5py (if your input comes from MATLAB .mat files):
pip install h5py
import numpy as np
import h5py

# Read the training labels exported from MATLAB (traininglabel.mat).
f = h5py.File('traininglabel.mat', 'r')
variables = f.items()
for var in variables:
    data = var[1]
    lables = data.value[0]

# Convert the labels to the strings '0' / '1'.
trainlabels = []
for i in lables:
    trainlabels.append(str(i))
trainlabels = np.array(trainlabels)
for i in range(0, len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print trainlabels[i]

# Read the feature matrix (training_features.mat) and write one
# "label feature1 feature2 ..." row per sample to a plain-text file.
f = h5py.File('training_features.mat', 'r')
variables = f.items()
lables = []
outfile = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    lables = data.value
for i in range(0, 1000):      # number of training samples in features.mat
    outfile.write(str(trainlabels[i]))
    outfile.write(' ')
    for j in range(0, 49):    # number of features per sample
        outfile.write(str(lables[j][i]))
        outfile.write(' ')
    outfile.write('\n')
outfile.close()

python - cannot make corr work

I'm struggling to get a simple correlation to work. I've tried everything that was suggested under similar questions.
Here are the relevant parts of the code, the various attempts I've made and their results.
import numpy as np
import pandas as pd
try01 = data[['ESA Index_close_px', 'CCMP Index_close_px' ]].corr(method='pearson')
print (try01)
Out:
Empty DataFrame
Columns: []
Index: []
try04 = data['ESA Index_close_px'][5:50].corr(data['CCMP Index_close_px'][5:50])
print (try04)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
using numpy
try05 = np.corrcoef(data['ESA Index_close_px'],data['CCMP Index_close_px'])
print (try05)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
converting the columns to lists
ESA_Index_close_px_list = list()
start_value = 1
end_value = len(data['ESA Index_close_px']) + 1
for items in data['ESA Index_close_px']:
    ESA_Index_close_px_list.append(items)
    start_value = start_value + 1
    if start_value == end_value:
        break
    else:
        continue

CCMP_Index_close_px_list = list()
start_value = 1
end_value = len(data['CCMP Index_close_px']) + 1
for items in data['CCMP Index_close_px']:
    CCMP_Index_close_px_list.append(items)
    start_value = start_value + 1
    if start_value == end_value:
        break
    else:
        continue
try06 = np.corrcoef(['ESA_Index_close_px_list','CCMP_Index_close_px_list'])
print (try06)
Out:
TypeError: cannot perform reduce with flexible type
I also tried .astype, but it made no difference:
data['ESA Index_close_px'].astype(float)
data['CCMP Index_close_px'].astype(float)
Using Python 3.5, pandas 0.18.1 and numpy 1.11.1
Would really appreciate any suggestion.
edit 1:
Data is coming from an excel spreadsheet
data = pd.read_excel('C:\\Users\\Ako\\Desktop\\ako_files\\for_corr_tool.xlsx')
Prior to the correlation attempts, there are only column renames and
data = data.drop(data.index[0])
to get rid of a line
regarding the types:
print (type (data['ESA Index_close_px']))
print (type (data['ESA Index_close_px'][1]))
Out:
edit 2:
parts of the data:
print (data['ESA Index_close_px'][1:10])
print (data['CCMP Index_close_px'][1:10])
Out:
2 2137
3 2138
4 2132
5 2123
6 2127
7 2126.25
8 2131.5
9 2134.5
10 2159
Name: ESA Index_close_px, dtype: object
2 5241.83
3 5246.41
4 5243.84
5 5199.82
6 5214.16
7 5213.33
8 5239.02
9 5246.79
10 5328.67
Name: CCMP Index_close_px, dtype: object
Well, I encountered the same problem today.
Try using .astype('float64') to make the dtype correct:
data['ESA Index_close_px'][5:50].astype('float64').corr(data['CCMP Index_close_px'][5:50].astype('float64'))
This works well for me. Hope it can help you as well.
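Along the same lines, the empty DataFrame from try01 is most likely because both columns have dtype object, and DataFrame.corr() silently drops non-numeric columns. A minimal sketch, assuming the columns hold numeric strings:

numeric = data[['ESA Index_close_px', 'CCMP Index_close_px']].astype(float)
print(numeric.corr(method='pearson'))  # now returns a 2x2 correlation matrix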
You can also try the following:
Top15['Citable docs per capita']=(Top15['Citable docs per capita']*100000)
Top15['Citable docs per capita'].astype('int').corr(Top15['Energy Supply per Capita'].astype('int'))
It worked for me.

scikit-learn roc_curve: why does it return a threshold value = 2 some time?

Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers in [0, 1]. However, it sometimes gives me an array whose first number is close to 2. Is this a bug or did I do something wrong? Thanks.
In [1]: import numpy as np
In [2]: from sklearn.metrics import roc_curve
In [3]: np.random.seed(11)
In [4]: aa = np.random.choice([True, False],100)
In [5]: bb = np.random.uniform(0,1,100)
In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)
In [7]: thresholds
Out[7]:
array([ 1.97396826, 0.97396826, 0.9711752 , 0.95996265, 0.95744405,
0.94983331, 0.93290463, 0.93241372, 0.93214862, 0.93076592,
0.92960511, 0.92245024, 0.91179548, 0.91112166, 0.87529458,
0.84493853, 0.84068543, 0.83303741, 0.82565223, 0.81096657,
0.80656679, 0.79387241, 0.77054807, 0.76763223, 0.7644911 ,
0.75964947, 0.73995152, 0.73825262, 0.73466772, 0.73421299,
0.73282534, 0.72391126, 0.71296292, 0.70930102, 0.70116428,
0.69606617, 0.65869235, 0.65670881, 0.65261474, 0.6487222 ,
0.64805644, 0.64221486, 0.62699782, 0.62522484, 0.62283401,
0.61601839, 0.611632 , 0.59548669, 0.57555854, 0.56828967,
0.55652111, 0.55063947, 0.53885029, 0.53369398, 0.52157349,
0.51900774, 0.50547317, 0.49749635, 0.493913 , 0.46154029,
0.45275916, 0.44777116, 0.43822067, 0.43795921, 0.43624093,
0.42039077, 0.41866343, 0.41550367, 0.40032843, 0.36761763,
0.36642721, 0.36567017, 0.36148354, 0.35843793, 0.34371331,
0.33436415, 0.33408289, 0.33387442, 0.31887024, 0.31818719,
0.31367915, 0.30216469, 0.30097917, 0.29995201, 0.28604467,
0.26930354, 0.2383461 , 0.22803687, 0.21800338, 0.19301808,
0.16902881, 0.1688173 , 0.14491946, 0.13648451, 0.12704826,
0.09141459, 0.08569481, 0.07500199, 0.06288762, 0.02073298,
0.01934336])
Most of the time these thresholds are not used, for example when calculating the area under the curve or when plotting the false positive rate against the true positive rate.
Yet to plot what looks like a reasonable curve, one needs a threshold point that includes zero data points (nothing predicted positive). Since scikit-learn's ROC curve function does not require normalised probabilities as thresholds (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf would be sensible, but coders often expect finite data (and the implementation should also work for integer scores). Instead, the implementation uses max(score) + epsilon, where epsilon = 1. This may be cosmetically unpleasant, but you haven't given any reason why it's an actual problem.
From the documentation:
thresholds : array, shape = [n_thresholds]
Decreasing thresholds on the decision function used to compute
fpr and tpr. thresholds[0] represents no instances being predicted
and is arbitrarily set to max(y_score) + 1.
So the first element of thresholds is close to 2 because it equals max(y_score) + 1, which in your case is thresholds[1] + 1.
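A quick way to confirm this in your session (assuming the max(y_score) + 1 convention quoted above):

import numpy as np
assert np.isclose(thresholds[0], bb.max() + 1)  # the artificial first threshold
real_thresholds = thresholds[1:]                # drop it if you only care about attainable scores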
This seems like a bug to me: in roc_curve(aa, bb), 1 is added to the first threshold. You could create an issue here: https://github.com/scikit-learn/scikit-learn/issues

Weka ignoring unlabeled data

I am working on an NLP classification project using the Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, hence I am working with unlabeled data. When I test the model obtained from my labeled training data on an independent set of unlabeled test data, Weka ignores all the unlabeled instances. Can anybody please guide me on how to solve this? Someone has already asked this question here before, but no appropriate solution was provided. Here is a sample test file:
@relation referents
@attribute feature1 NUMERIC
@attribute feature2 NUMERIC
@attribute feature3 NUMERIC
@attribute feature4 NUMERIC
@attribute class {1,-1}
@data
1, 7, 1, 0, ?
1, 5, 1, 0, ?
-1, 1, 1, 0, ?
1, 1, 1, 1, ?
-1, 1, 1, 1, ?
The problem is that when you specify a training set with -t train.arff and a test set with -T test.arff, the default mode of operation is to calculate the performance of the model on the test set. But you can't calculate any kind of performance without knowing the actual classes: without them, how would you know whether your predictions are right or wrong?
I used the data you gave as train.arff and as test.arff with arbitrary class labels assigned by me. The relevant output lines are:
=== Error on training data ===
Correctly Classified Instances 4 80 %
Incorrectly Classified Instances 1 20 %
Kappa statistic 0.6154
Mean absolute error 0.2429
Root mean squared error 0.4016
Relative absolute error 50.0043 %
Root relative squared error 81.8358 %
Total Number of Instances 5
=== Confusion Matrix ===
a b <-- classified as
2 1 | a = 1
0 2 | b = -1
and
=== Error on test data ===
Total Number of Instances 0
Ignored Class Unknown Instances 5
=== Confusion Matrix ===
a b <-- classified as
0 0 | a = 1
0 0 | b = -1
Weka can give you those statistics for the training set, because it knows the actual class labels and the predicted ones (applying the model on the training set). For the test set, it can't get any information about the performance, because it doesn't know about the true class labels.
What you might want to do is:
java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff -p 1-4
which in my case would give you:
=== Predictions on test data ===
inst# actual predicted error prediction (feature1,feature2,feature3,feature4)
1 1:? 1:1 1 (1,7,1,0)
2 1:? 1:1 1 (1,5,1,0)
3 1:? 2:-1 0.786 (-1,1,1,0)
4 1:? 2:-1 0.861 (1,1,1,1)
5 1:? 2:-1 0.861 (-1,1,1,1)
So, you can get the predictions, but you can't get a performance, because you have unlabeled test data.
