I am trying to obtain the std of this output using numpy.std()
[[array([0.92473118, 0.94117647]), array([0.98850575, 0.69565217]), array([0.95555556, 0.8 ]), 0.923030303030303], [array([0.85555556, 0.8 ]), array([0.95061728, 0.55172414]), array([0.9005848 , 0.65306122]), 0.8353285811932428]]
To obtain that output I used the code (it goes through a loop, in this example, it went through two iterations)
precision, recall, fscore, support = precision_recall_fscore_support(np.argmax(y_test_0, axis=-1), np.argmax(probas_, axis=-1))
eval_test_metric = [precision, recall, fscore, avg_fscore]
test_metric1.append(eval_test_metric)
std_matrix1 = np.std(test_metric1, axis=0)
I would like to get an output similar in structure to when I do np.mean(), Please excuse the 'precision', 'recall' I just made that in my code for clarity.
dr_test_metric = dict(zip(['specificity avg', 'sensitivity avg', 'ppv avg', 'npv avg'], np.mean(test_metric2, axis=0)))
print(dr_test_metric,'\n')
output, (where 0.89014337 in 'precision avg': array([0.89014337, 0.87058824] is the average of precision of class 0 for my model and 0.8705 is the average of the precision for class 1 for my model)
{'precision avg': array([0.89014337, 0.87058824]), 'recall avg': array([0.96956152, 0.62368816]), 'fscore avg': array([0.92807018, 0.72653061]), 'avg_fscore avg': 0.8791794421117729}
Related
To solve a 5 parameter model, I need at least 5 data points to get a unique solution. For x and y data below:
import numpy as np
x = np.array([[-0.24155831, 0.37083184, -1.69002708, 1.4578805 , 0.91790011,
0.31648635, -0.15957368],
[-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752,
0.32080956, -0.82986857],
[ 0.33815532, 3.1123936 , -0.29317028, 3.01493602, 1.64978158,
0.56301755, 1.3958912 ],
[ 0.84486735, 4.74567324, 0.7982888 , 3.56604097, 1.47633894,
1.38743513, 3.0679506 ],
[-0.2752026 , 2.9110031 , 0.19218081, 2.0691105 , 0.49240373,
1.63213241, 2.4235483 ],
[ 0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902 ,
1.91907482, 3.70126468]])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647,
1.11315445])
I used only 6 data to fit a 8 parameter model (7 slopes and 1 intercept).
lr = LinearRegression().fit(x, y)
print(lr.coef_)
array([-0.83916772, -0.57249998, 0.73025938, -0.02065629, 0.47637768,
-0.36962192, 0.99128474])
print(lr.intercept_)
0.2978781587718828
Clearly, it's using some kind of assignment to reduce the degrees of freedom. I tried to look into the source code but couldn't found anything about that. What method do they use to find the parameter of under specified model?
You don't need to reduce the degrees of freedom, it simply finds a solution to the least squares problem min sum_i (dot(beta,x_i)+beta_0-y_i)**2. For example, in the non-sparse case it uses the linalg.lstsq module from scipy. The default solver for this optimization problem is the gelsd LAPACK driver. If
A= np.concatenate((ones_v, X), axis=1)
is the augmented array with ones as its first column, then your solution is given by
x=numpy.linalg.pinv(A.T*A)*A.T*y
Where we use the pseudoinverse precisely because the matrix may not be of full rank. Of course, the solver doesn't actually use this formula but uses singular value Decomposition of A to reduce this formula.
I made my own Keras CNN and used the code below to predict. The prediction give all the 143 prediction while I only want the four major classes with the highest percentage.
Code:
preds = model.predict(imgs)
for cls in train_generator.class_indices:
x = preds[0][train_generator.class_indices[cls]]
x_pred = "{:.1%}".format(x)
value = (cls+":"+ x_pred)
print (value)
Prediction:
Acacia_abyssinica:0.0%
Acacia_kirkii:0.0%
Acacia_mearnsii:0.0%
Acacia_melanoxylon:0.0%
Acacia_nilotica:0.0%
Acacia_polyacantha:0.0%
Acacia_senegal:0.0%
Acacia_seyal:0.0%
Acacia_xanthophloea:0.0%
Afrocarpus_falcatus:0.0%
Afzelia_quanzensis:0.0%
Albizia_gummifera:0.0%
Albizia_lebbeck:0.0%
Allanblackia_floribunda:0.0%
Artocarpus_heterophyllus:0.0%
Azadirachta_indica:0.0%
Balanites_aegyptiaca:0.0%
Bersama_abyssinica:0.0%
Bischofia_javanica:0.0%
Brachylaena_huillensis:0.0%
Bridelia_micrantha:0.0%
Calodendron_capensis:0.0%
Calodendrum_capense:0.0%
Casimiroa_edulis:0.0%
Cassipourea_malosana:0.0%
Casuarina_cunninghamiana:0.0%
Casuarina_equisetifolia:4.8%
Catha_edulis:0.0%
Cathium_Keniensis:0.0%
Ceiba_pentandra:39.1%
Celtis_africana:0.0%
Chionanthus_battiscombei:0.0%
Clausena_anisat:0.0%
Clerodendrum_johnstonii:0.0%
Combretum_molle:0.0%
Cordia_africana:0.0%
Cordia_africana_Cordia:0.0%
Cotoneaster_Pannos:0.0%
Croton_macrostachyus:0.0%
Croton_megalocarpus:0.0%
Cupressus_lusitanica:0.0%
Cussonia_Spicata:0.2%
Cussonia_holstii:0.0%
Diospyros_abyssinica:0.0%
Dodonaea_angustifolia:0.0%
Dodonaea_viscosa:0.0%
Dombeya_goetzenii:0.0%
Dombeya_rotundifolia:0.0%
Dombeya_torrida:0.0%
Dovyalis_abyssinica:0.0%
Dovyalis_macrocalyx:0.0%
Drypetes_gerrardii:0.0%
Ehretia_cymosa:0.0%
Ekeber_Capensis:0.0%
Erica_arborea:0.0%
Eriobotrya_japonica:0.0%
Erythrina_abyssinica:0.0%
Eucalyptus_camaldulensis:0.0%
Eucalyptus_globulus:55.9%
Eucalyptus_grandis:0.0%
Eucalyptus_grandis_saligna:0.0%
Eucalyptus_hybrids:0.0%
Eucalyptus_saligna:0.0%
Euclea_divinorum:0.0%
Ficus_indica:0.0%
Ficus_natalensi:0.0%
Ficus_sur:0.0%
Ficus_sycomorus:0.0%
Ficus_thonningii:0.0%
Flacourtia_indica:0.0%
Flacourtiaceae:0.0%
Fraxinus_pennsylvanica:0.0%
Grevillea_robusta:0.0%
Hagenia_abyssinica:0.0%
Jacaranda_mimosifolia:0.0%
Juniperus_procera:0.0%
Kigelia_africana:0.0%
Macaranga_capensis:0.0%
Mangifera_indica:0.0%
Manilkara_Discolor:0.0%
Markhamia_lutea:0.0%
Maytenus_senegalensis:0.0%
Melia_volkensii:0.0%
Meyna_tetraphylla:0.0%
Milicia_excelsa:0.0%
Moringa_Oleifera:0.0%
Murukku_Trichilia_emetica:0.0%
Myrianthus_holstii:0.0%
Newtonia_buchananii:0.0%
Nuxia_congesta:0.0%
Ochna_holstii:0.0%
Ochna_ovata:0.0%
Ocotea_usambarensis:0.0%
Olea_Europaea:0.0%
Olea_africana:0.0%
Olea_capensis:0.0%
Olea_hochstetteri:0.0%
Olea_welwitschii:0.0%
Osyris_lanceolata:0.0%
Persea_americana:0.0%
Pinus_radiata:0.0%
Podocarpus _falcatus:0.0%
Podocarpus_latifolius:0.0%
Polyscias_fulva:0.0%
Polyscias_kikuyuensis:0.0%
Pouteria_adolfi_friedericii:0.0%
Prunus_africana:0.0%
Psidium_guajava:0.0%
Rauvolfia_Vomitoria:0.0%
Rhus_natalensis:0.0%
Rhus_vulgaris:0.0%
Schinus_molle:0.0%
Schrebera_alata:0.0%
Sclerocarya_birrea:0.0%
Scolopia_zeyheri:0.0%
Senna_siamea:0.0%
Sinarundinaria_alpina:0.0%
Solanum_mauritianum:0.0%
Spathodea_campanulata:0.0%
Strychnos_usambare:0.0%
Syzygium_afromontana:0.0%
Syzygium_cordatum:0.0%
Syzygium_cuminii:0.0%
Syzygium_guineense:0.0%
Tamarindus_indica:0.0%
Tarchonanthus_camphoratus:0.0%
Teclea_Nobilis:0.0%
Teclea_simplicifolia:0.0%
Terminalia_brownii:0.0%
Terminalia_mantaly:0.0%
Toddalia_asiatica:0.0%
Trema_Orientalis:0.0%
Trichilia_emetica:0.0%
Trichocladus_ellipticus:0.0%
Trimeria_grandifolia:0.0%
Vangueria_madagascariensis:0.0%
Vepris_nobilis:0.0%
Vepris_simplicifolia:0.0%
Vernonia_auriculifera:0.0%
Vitex_keniensis:0.0%
Warburgia_ugandensis:0.0%
Zanthoxylum_gilletii:0.0%
Mahogany_tree:0.0%
You can just get all your predictions, sort them and take top four
preds = model.predict(imgs)
sorted_preds = []
for cls in train_generator.class_indices:
x = preds[0][train_generator.class_indices[cls]]
x_pred = "{:.1%}".format(x)
sorted_preds.append([x, x_pred, cls])
top_4 = sorted(sorted_preds, reverse=True)[:4]
I am working with multivariate linear regression and using stochastic gradient descent to optimize.
Working on this dataSet
http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/
for every run all hyperParameters and all remaining things are same, epochs=200 and alpha=0.1
when I first run then I got final_cost=0.0591, when I run the program again keeping everything same I got final_cost=1.0056
, running again keeping everything same I got final_cost=0.8214
, running again final_cost=15.9591, running again final_cost=2.3162 and so on and on...
As you can see that keeping everything same and running, again and again, each time the final cost changes by large amount sometimes so large like from 0.8 to direct 15.9 , 0.05 to direct 1.00 and not only this the graph of final cost after every epoch within the same run is every zigzag unlike in batch GD in which the cost graph decreases smoothly.
I can't understand that why SGD is behaving so weirdly, different results in the different run.
I tried the same with batch GD and everything is absolutely fine and smooth as per expectations. In case of batch GD no matter how many times I run the same code the result is exactly the same every time.
But in the case of SGD, I literally cried,
class Abalone :
def __init__(self,df,epochs=200,miniBatchSize=250,alpha=0.1) :
self.df = df.dropna()
self.epochs = epochs
self.miniBatchSize = miniBatchSize
self.alpha = alpha
print("abalone created")
self.modelTheData()
def modelTheData(self) :
self.TOTAL_ATTR = len(self.df.columns) - 1
self.TOTAL_DATA_LENGTH = len(self.df.index)
self.df_trainingData =
df.drop(df.index[int(self.TOTAL_DATA_LENGTH * 0.6):])
self.TRAINING_DATA_SIZE = len(self.df_trainingData)
self.df_testingData =
df.drop(df.index[:int(self.TOTAL_DATA_LENGTH * 0.6)])
self.TESTING_DATA_SIZE = len(self.df_testingData)
self.miniBatchSize = int(self.TRAINING_DATA_SIZE / 10)
self.thetaVect = np.zeros((self.TOTAL_ATTR+1,1),dtype=float)
self.stochasticGradientDescent()
def stochasticGradientDescent(self) :
self.finalCostArr = np.array([])
startTime = time.time()
for i in range(self.epochs) :
self.df_trainingData =
self.df_trainingData.sample(frac=1).reset_index(drop=True)
miniBatches=[self.df_trainingData.loc[x:x+self.miniBatchSize-
((x+self.miniBatchSize)/(self.TRAINING_DATA_SIZE-1)),:]
for x in range(0,self.TRAINING_DATA_SIZE,self.miniBatchSize)]
self.epochCostArr = np.array([])
for j in miniBatches :
tempMat = j.values
self.actualValVect = tempMat[ : , self.TOTAL_ATTR:]
tempMat = tempMat[ : , :self.TOTAL_ATTR]
self.desMat = np.append(
np.ones((len(j.index),1),dtype=float) , tempMat , 1 )
del tempMat
self.trainData()
currCost = self.costEvaluation()
self.epochCostArr = np.append(self.epochCostArr,currCost)
self.finalCostArr = np.append(self.finalCostArr,
self.epochCostArr[len(miniBatches)-1])
endTime = time.time()
print(f"execution time : {endTime-startTime}")
self.graphEvaluation()
print(f"final cost :
{self.finalCostArr[len(self.finalCostArr)-1]}")
print(self.thetaVect)
def trainData(self) :
self.predictedValVect = self.predictResult()
diffVect = self.predictedValVect - self.actualValVect
partialDerivativeVect = np.matmul(self.desMat.T , diffVect)
self.thetaVect -=
(self.alpha/len(self.desMat))*partialDerivativeVect
def predictResult(self) :
return np.matmul(self.desMat,self.thetaVect)
def costEvaluation(self) :
cost = sum((self.predictedValVect - self.actualValVect)**2)
return cost / (2*len(self.actualValVect))
def graphEvaluation(self) :
plt.title("cost at end of all epochs")
x = range(len(self.epochCostArr))
y = self.epochCostArr
plt.plot(x,y)
plt.xlabel("iterations")
plt.ylabel("cost")
plt.show()
I kept epochs=200 and alpha=0.1 for all runs but I got a totally different result in each run.
The vector mentioned below is the theta vector where the first entry is the bias and remaining are weights
RUN 1 =>>
[[ 5.26020144]
[ -0.48787333]
[ 4.36479114]
[ 4.56848299]
[ 2.90299436]
[ 3.85349625]
[-10.61906207]
[ -0.93178027]
[ 8.79943389]]
final cost : 0.05917831328836957
RUN 2 =>>
[[ 5.18355814]
[ -0.56072668]
[ 4.32621647]
[ 4.58803884]
[ 2.89157598]
[ 3.7465471 ]
[-10.75751065]
[ -1.03302031]
[ 8.87559247]]
final cost: 1.0056239103948563
RUN 3 =>>
[[ 5.12836056]
[ -0.43672936]
[ 4.25664898]
[ 4.53397465]
[ 2.87847224]
[ 3.74693215]
[-10.73960775]
[ -1.00461585]
[ 8.85225402]]
final cost : 0.8214901206702101
RUN 4 =>>
[[ 5.38794798]
[ 0.23695412]
[ 4.43522951]
[ 4.66093372]
[ 2.9460605 ]
[ 4.13390252]
[-10.60071883]
[ -0.9230675 ]
[ 8.87229324]]
final cost: 15.959132174895712
RUN 5 =>>
[[ 5.19643132]
[ -0.76882106]
[ 4.35445135]
[ 4.58782119]
[ 2.8908931 ]
[ 3.63693031]
[-10.83291949]
[ -1.05709616]
[ 8.865904 ]]
final cost: 2.3162151072779804
I am unable to figure out what is going Wrong. Does SGD behave like this or I did some stupidity while transforming my code from batch GD to SGD. And if SGD behaves like this then how I get to know that how many times I have to rerun because I am not so lucky that every time in the first run I got such a small cost like 0.05 sometimes the first run gives cost around 10.5 sometimes 0.6 and maybe rerunning it a lot of times I got cost even smaller than 0.05.
when I approached the exact same problem with exact same code and hyperParameters just replacing the SGD function with normal batch GD I get the expected result i.e, after each iteration over the same data my cost is decreasing smoothly i.e., a monotonic decreasing function and no matter how many times I rerun the same program I got exactly the same result as this is very obvious.
"keeping everything same but using batch GD for epochs=20000 and alpha=0.1
I got final_cost=2.7474"
def BatchGradientDescent(self) :
self.costArr = np.array([])
startTime = time.time()
for i in range(self.epochs) :
tempMat = self.df_trainingData.values
self.actualValVect = tempMat[ : , self.TOTAL_ATTR:]
tempMat = tempMat[ : , :self.TOTAL_ATTR]
self.desMat = np.append( np.ones((self.TRAINING_DATA_SIZE,1),dtype=float) , tempMat , 1 )
del tempMat
self.trainData()
if i%100 == 0 :
currCost = self.costEvaluation()
self.costArr = np.append(self.costArr,currCost)
endTime = time.time()
print(f"execution time : {endTime - startTime} seconds")
self.graphEvaluation()
print(self.thetaVect)
print(f"final cost : {self.costArr[len(self.costArr)-1]}")
SomeBody help me figure out What actually is going on. Every opinion/solution is big revenue for me in this new field :)
You missed the most important and only difference between GD ("Gradient Descent") and SGD ("Stochastic Gradient Descent").
Stochasticity - Literally means "the quality of lacking any predictable order or plan". Meaning randomness.
Which means that while in the GD algorithm, the order of the samples in each epoch remains constant, in SGD the order is randomly shuffled at the beginning of every epochs.
So every run of GD with the same initialization and hyperparameters will produce the exact same results, while SGD will most defiantly not (as you have experienced).
The reason for using stochasticity is to prevent the model from memorizing the training samples (which will results in overfitting, where accuracy on the training set will be high but accuracy on unseen samples will be bad).
Now regarding to the big differences in final cost values between runs at your case, my guess is that your learning rate is too high. You can use a lower constant value, or better yet, use a decaying learning rate (which gets lower as epochs get higher).
When using JAGS, how does one receive output from a model in the format:
Inference for Bugs model at "model.txt", fit using jags,
3 chains, each with 10000 iterations (first 5000 discarded)
n.sims = 15000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
mu 9.950 0.288 9.390 9.755 9.951 10.146 10.505 1.001 11000
sd.obs 3.545 0.228 3.170 3.401 3.534 3.675 3.978 1.001 13000
deviance 820.611 3.460 818.595 819.132 819.961 821.366 825.871 1.001 15000
I assumed, as with BUGS, it would appear when the model completes however I only get something in the format:
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 1785
Unobserved stochastic nodes: 1843
Total graph size: 61542
Initializing model
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Apologies for the basic question. If anyone can provide useful JAGS introductory material that would also be useful.
Kind regards.
If you only get the 'plus' signs, it means you only initialized the model. When jags really runs, it typically produces '***' signs after. So you are missing a line here (would have been nice to see your code). For instance if you use r2jags, you would write:
out <- jags(data = data, parameters.to.save = params, n.chains = 3, n.iter = 90000,n.burnin = 5000,
model.file = modFile)
out.upd <- update(abundance.out.mod, n.iter=10000)
Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers that are in [0,1]. However, it sometimes gives me an array with the first number close to "2". Is it a bug or I did sth wrong? Thanks.
In [1]: import numpy as np
In [2]: from sklearn.metrics import roc_curve
In [3]: np.random.seed(11)
In [4]: aa = np.random.choice([True, False],100)
In [5]: bb = np.random.uniform(0,1,100)
In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)
In [7]: thresholds
Out[7]:
array([ 1.97396826, 0.97396826, 0.9711752 , 0.95996265, 0.95744405,
0.94983331, 0.93290463, 0.93241372, 0.93214862, 0.93076592,
0.92960511, 0.92245024, 0.91179548, 0.91112166, 0.87529458,
0.84493853, 0.84068543, 0.83303741, 0.82565223, 0.81096657,
0.80656679, 0.79387241, 0.77054807, 0.76763223, 0.7644911 ,
0.75964947, 0.73995152, 0.73825262, 0.73466772, 0.73421299,
0.73282534, 0.72391126, 0.71296292, 0.70930102, 0.70116428,
0.69606617, 0.65869235, 0.65670881, 0.65261474, 0.6487222 ,
0.64805644, 0.64221486, 0.62699782, 0.62522484, 0.62283401,
0.61601839, 0.611632 , 0.59548669, 0.57555854, 0.56828967,
0.55652111, 0.55063947, 0.53885029, 0.53369398, 0.52157349,
0.51900774, 0.50547317, 0.49749635, 0.493913 , 0.46154029,
0.45275916, 0.44777116, 0.43822067, 0.43795921, 0.43624093,
0.42039077, 0.41866343, 0.41550367, 0.40032843, 0.36761763,
0.36642721, 0.36567017, 0.36148354, 0.35843793, 0.34371331,
0.33436415, 0.33408289, 0.33387442, 0.31887024, 0.31818719,
0.31367915, 0.30216469, 0.30097917, 0.29995201, 0.28604467,
0.26930354, 0.2383461 , 0.22803687, 0.21800338, 0.19301808,
0.16902881, 0.1688173 , 0.14491946, 0.13648451, 0.12704826,
0.09141459, 0.08569481, 0.07500199, 0.06288762, 0.02073298,
0.01934336])
Most of the time these thresholds are not used, for example in calculating the area under the curve, or plotting the False Positive Rate against the True Positive Rate.
Yet to plot what looks like a reasonable curve, one needs to have a threshold that incorporates 0 data points. Since Scikit-Learn's ROC curve function need not have normalised probabilities for thresholds (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf is sensible but coders often expect finite data (and it's possible the implementation also works for integer thresholds). Instead the implementation uses max(score) + epsilon where epsilon = 1. This may be cosmetically deficient, but you haven't given any reason why it's a problem!
From the documentation:
thresholds : array, shape = [n_thresholds]
Decreasing thresholds on the decision function used to compute
fpr and tpr. thresholds[0] represents no instances being predicted
and is arbitrarily set to max(y_score) + 1.
So the first element of thresholds is close to 2 because it is max(y_score) + 1, in your case thresholds[1] + 1.
this seems like a bug to me - in roc_curve(aa,bb), 1 is added to the first threshold. You should create an issue here https://github.com/scikit-learn/scikit-learn/issues