Random generator in python - python-3.x

I want to select one of 0 or 1 based on some probability of getting 1 and some initial seed.
I tried the following:
import random

population = [0, 1]
random.seed(33)
probabilities = [0.4, 0.2, 0.5]

def sampleIt():
    selectedProb = random.randrange(0, 3, 1)  # select one of probabilities
    print('Selected Probability: ', selectedProb)
    return random.choices(population, [0, probabilities[selectedProb-1]])

for i in range(100):
    sample = sampleIt()
    print(sample[0])
Below is sample output:
Selected Probability: 0.2
1
Selected Probability: 0.5
1
Selected Probability: 0.4
1
Selected Probability: 0.2
1
Selected Probability: 0.5
1
Selected Probability: 0.2
1
Doubts:
As you can see, it is able to randomly select probabilities. But for each selected probability, it ends up selecting 1 from the population. If it selected probability 0.2, then I expect it to select 1 with probability 0.2. In this way, it should have selected 0 at least once. But that is not happening. Why is this so?
Is the seed set correctly, or do we have to set it differently?
Also, what changes do I need to make if I expect sampleIt() to be called from different threads?
Also, is there any standard practice to improve performance, say if I run this millions of times? Do I have to use numpy for random number generation?
Do random.randrange() and random.choices() follow a uniform distribution?

There are several critical errors here. Let's talk about those and then the correct way to do this.
First, if this were working properly, you'd be getting 1 with a net probability of 0.37, which is 1/3 * (0.2 + 0.4 + 0.5), because you are randomly choosing which probability to use.
You are passing weights to random.choices in the second positional argument, and you are passing a weight of 0 for option zero, so it will never be picked. In that same statement, you are also unnecessarily subtracting 1 from the index you just drew: randrange(0, 3, 1) already yields a valid index 0-2, and the -1 makes index 0 wrap around to the last probability.
So, to do this properly for Bernoulli trials, you can just draw a random number and compare it to the probability you want. Or you can use random.choices correctly and get a list.
In [14]: def gen_sample(p_success):
    ...:     if random.random() < p_success:
    ...:         return 1
    ...:     return 0
    ...:
In [15]: gen_sample(0.95)
Out[15]: 1
In [16]: gen_sample(0.02)
Out[16]: 0
In [17]: p_success = 0.85
In [18]: random.choices([0, 1], weights=[1-p_success, p_success], k=10)
Out[18]: [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
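Applying that to the original code, here is a minimal corrected sketch of sampleIt() under the same setup as the question (seed 33 and the three probabilities are carried over from the question; the only changes are passing both weights to random.choices and dropping the stray -1):

import random

random.seed(33)
population = [0, 1]
probabilities = [0.4, 0.2, 0.5]

def sampleIt():
    # pick one of the three probabilities uniformly; indices 0-2 need no -1 shift
    p_one = probabilities[random.randrange(0, 3)]
    # weight for 0 is 1 - p_one, weight for 1 is p_one
    return random.choices(population, weights=[1 - p_one, p_one])[0]

samples = [sampleIt() for _ in range(100)]
print(sum(samples) / len(samples))  # should hover around 0.37, per the note above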

Related

Multiclass classification per class recall equals per class accuracy?

I've got a multiclass problem. I'm using sklearn.metrics to calculate the confusion matrix, overall accuracy, per class precision, per class recall and per class F1-score.
Now I wanted to calculate the per-class accuracy. Since there is no method in sklearn for this, I used another one which I got from a Google search. I've now realised that the per-class recall equals the per-class accuracy. Can anyone explain to me if this holds true and, if yes, why?
I found an explanation here, but I'm not sure, since there the micro-recall equals the overall accuracy, if I'm understanding it correctly. And I'm looking for the per-class accuracy.
I experienced the same results, because per-class recall = TP / (TP + FN), and here TP + FN is the same as all the samples of a class, so the formula becomes similar to accuracy.
This generally doesn't hold. Accuracy and recall are calculated using different formulas and are different measures explaining different things.
Recall is the percentage of actually positive data points that are correctly predicted as positive by your classifier.
Accuracy is the percentage of all examples that are classified correctly, including positive and negative.
If they are equal, this is either coincidence or you have an error in your method of calculating them. Most likely this will be coincidence.
EDIT:
I will show why it's not the case with an example that can be generalised to N classes.
Let's assume three classes: 0, 1, 2 with the following confusion matrix:
[[3 0 1]
 [2 5 0]
 [0 1 4]]
When we want to calculate measures per class, we do this in a binary, one-vs-rest fashion. For example, for class 0 we combine classes 1 and 2 into 'not 0'. This results in the following binary confusion matrix:
[[ 3  1]
 [ 2 10]]
Resulting in:
TP = 3
FP = 2
FN = 1
TN = 10
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
So you can already tell from these formulas that they are not the same thing. To disprove a hypothesis in mathematics, it suffices to show a counterexample; in this case, an example showing that accuracy is not equal to recall.
Filling in this example we get:
Accuracy = 13/16
Recall = 3/4 = 12/16
And 13/16 is not equal to 12/16, thus disproving the hypothesis that per-class accuracy is equal to per-class recall.
It is, however, also possible to construct examples for which the two coincide. But because the equality does not hold in general, it is disproven.
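As a quick check of the arithmetic above, here is a small sketch (plain numpy, using the confusion matrix from this example) that binarizes each class one-vs-rest and compares the resulting accuracy and recall:

import numpy as np

# confusion matrix from the example above (rows = actual, columns = predicted)
cm = np.array([[3, 0, 1],
               [2, 5, 0],
               [0, 1, 4]])
total = cm.sum()

for c in range(cm.shape[0]):
    tp = cm[c, c]
    fn = cm[c, :].sum() - tp      # actual c, predicted as something else
    fp = cm[:, c].sum() - tp      # predicted c, actually something else
    tn = total - tp - fn - fp
    accuracy = (tp + tn) / total  # one-vs-rest accuracy for class c
    recall = tp / (tp + fn)       # per-class recall
    print(f"class {c}: accuracy={accuracy:.3f}, recall={recall:.3f}")

For class 0 this prints accuracy=0.812 and recall=0.750, matching 13/16 and 3/4.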
Not sure if you are looking for average per-class accuracy as a single metric or per-class accuracy as separate metrics for each class.
For per-class accuracy as a separate metric for each class, see the code below. It's the same as recall-micro per class.
For average per-class accuracy as a single metric, it is equivalent to recall-macro (which is equivalent to balanced accuracy in sklearn). See the code below.
Here is the empirical demonstration in code.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
label_class1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels = label_class1 + label_class2
pred_class1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
pred = pred_class1 + pred_class2
# 1. calculate accuracy scores per class
score_accuracy_class1 = accuracy_score(label_class1, pred_class1)
score_accuracy_class2 = accuracy_score(label_class2, pred_class2)
print(score_accuracy_class1) # 0.6
print(score_accuracy_class2) # 0.9
# 2. calculate recall scores per class
score_recall_class1 = recall_score(label_class1, pred_class1, average='micro')
score_recall_class2 = recall_score(label_class2, pred_class2, average='micro')
print(score_recall_class1) # 0.6
print(score_recall_class2) # 0.9
assert score_accuracy_class1 == score_recall_class1
assert score_accuracy_class2 == score_recall_class2
# 3. this also means that average per-class accuracy is equivalent to averaged recall and balanced accuracy
score_balanced_accuracy1 = (score_accuracy_class1 + score_accuracy_class2) / 2
score_balanced_accuracy2 = (score_recall_class1 + score_recall_class2) / 2
score_balanced_accuracy3 = balanced_accuracy_score(labels, pred)
score_balanced_accuracy4 = recall_score(labels, pred, average='macro')
print(score_balanced_accuracy1) # 0.75
print(score_balanced_accuracy2) # 0.75
print(score_balanced_accuracy3) # 0.75
print(score_balanced_accuracy4) # 0.75
# balanced accuracy, average per-class accuracy and recall-macro are equivalent
assert score_balanced_accuracy1 == score_balanced_accuracy2 == score_balanced_accuracy3 == score_balanced_accuracy4
These official docs say: "balanced accuracy ... is defined as the average of recall obtained on each class."
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html

Finding the mean of a distribution

My code generates a number of distributions (I only plotted one below to make it more legible). The Y-axis here represents a probability density function and the X-axis is a simple array of values.
In more detail.
Y = [0.02046505 0.10756612 0.24319883 0.30336375 0.22071875 0.0890625 0.015625 0 0 0]
And X is generated using np.arange(0,10,1) = [0 1 2 3 4 5 6 7 8 9]
I want to find the mean of this distribution (i.e. roughly where the curve peaks on the X-axis), not the mean of the Y values. I know how to use numpy's np.mean to find the mean of Y, but that's not what I need.
By eye, the mean here is about x = 3, but I would like to compute it in code to make it more accurate.
Any help would be great.
By definition, the mean (more precisely, the expected value of the random variable x, which is what you want here since you have its PDF) is sum(p(x[j]) * x[j]), where p(x[j]) is the value of the PDF at x[j]. You can implement this in code like this:
>>> import numpy as np
>>> Y = np.array(eval(",".join("[0.02046505 0.10756612 0.24319883 0.30336375 0.22071875 0.0890625 0.015625 0 0 0]".split())))
>>> Y
array([0.02046505, 0.10756612, 0.24319883, 0.30336375, 0.22071875,
0.0890625 , 0.015625 , 0. , 0. , 0. ])
>>> X = np.arange(0, 10)
>>> Y.sum()
1.0
>>> (X * Y).sum()
2.92599253
So the (approximate) answer is 2.92599253.
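Equivalently, numpy's np.average computes the same weighted mean directly (it divides by sum(Y), which is 1 here, so the result matches):

import numpy as np

Y = np.array([0.02046505, 0.10756612, 0.24319883, 0.30336375, 0.22071875,
              0.0890625, 0.015625, 0.0, 0.0, 0.0])
X = np.arange(0, 10)

# weighted mean: sum(X * Y) / sum(Y); with sum(Y) == 1 this equals (X * Y).sum()
print(np.average(X, weights=Y))  # 2.92599253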

Python numpy array: Index error, Index out of bounds

The following code is a minimal example of my original code:
batch_size = 10
target_q = np.ones((10, 1))
actions = np.ones((10, ), dtype=int)
batch_index = np.arange(batch_size, dtype=np.int32)
print(target_q[batch_index, actions])
print(target_q.shape)
I get the following error:
IndexError: index 1 is out of bounds for axis 1 with size 1.
Can someone please explain what this means and how to rectify it?
Thanks in advance.
In numpy you can index arrays of size N up to index N-1 (along a given axis); otherwise you will get the IndexError you are seeing. To check how high an index can go, you can print target_q.shape. In your case it will tell you (10, 1), which means that if you index target_q[i, j], then i can be at most 9 and j can be at most 0.
What you do in the line target_q[batch_index, actions] is insert actions as so-called fancy indexing in the second position (j), and actions is full of ones. Thus you are repeatedly trying to index with 1, whereas the highest allowed index value is 0.
What would work would be:
import numpy as np
batch_size = 10
target_q = np.ones((10, 1))
# changed to zeros below
actions = np.zeros((10, ), dtype=int)
batch_index = np.arange(batch_size, dtype=np.int32)
print(actions)
print(target_q.shape)
print(target_q[batch_index, 0])
print(target_q[batch_index, actions])
that prints:
[0 0 0 0 0 0 0 0 0 0]
(10, 1)
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
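If the intent was instead for target_q to hold one value per action (a common pattern when actions picks a column per row), then widening target_q rather than zeroing actions would also make the original indexing valid. A minimal sketch, assuming a hypothetical n_actions = 4 and random values purely for illustration:

import numpy as np

batch_size = 10
n_actions = 4  # assumed number of actions, for illustration only

rng = np.random.default_rng(0)
target_q = rng.random((batch_size, n_actions))  # one value per (sample, action)
actions = np.ones((batch_size,), dtype=int)     # column index 1 is now in bounds
batch_index = np.arange(batch_size, dtype=np.int32)

# fancy indexing pairs batch_index[i] with actions[i]:
# it picks target_q[0, actions[0]], target_q[1, actions[1]], ...
print(target_q[batch_index, actions])  # shape (10,)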

Change Values in One Array, Based On Value from Column of Second Array

I have a one dimensional array called Y_train that contains a series of 1's and 0's. I have another array called sample_weight that is an array of all 1's that has the shape of Y_train, defined as:
sample_weight = np.ones(Y_train.shape, dtype=int)
I'm trying to change the values in sample_weight to a 2, where the corresponding value in Y_train == 0. So initially side by side it looks like:
Y_train    sample_weight
0          1
0          1
1          1
1          1
0          1
1          1
and I'd like it to look like this after the transformation:
Y_train    sample_weight
0          2
0          2
1          1
1          1
0          2
1          1
What I tried was to use a for loop (shown below) but none of the 1's are changing to 2's in sample_weight. I'd like to somehow use the np.where() function if possible, but it's not crucial, just would like to avoid a for loop:
sample_weight = np.ones(Y_train.shape, dtype=int)
for num, i in enumerate(Y_train):
    if i == 0:
        sample_weight[num] == 2
I tried using the solution shown here but with no success with the second array. Any ideas? Thanks!
import numpy as np
Y_train = np.array([0,0,1,1,0,1])
sample_weight = np.where(Y_train == 0, 2, Y_train)
>> print(sample_weight)
[2 2 1 1 2 1]
The np.where basically works just like Excel's "IF":
np.where(condition, then, else)
Works for transposed arrays, too:
Y_train = np.array([[0,0,1,1,0,1]]).T
sample_weight = np.where(Y_train == 0, 2, Y_train)
>> print(sample_weight)
[[2]
[2]
[1]
[1]
[2]
[1]]
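As a side note, the loop in the question uses == (a comparison) where it needs = (an assignment), which is why nothing changed. If you would rather modify sample_weight in place than build it with np.where, boolean-mask assignment does the same job:

import numpy as np

Y_train = np.array([0, 0, 1, 1, 0, 1])

# start from all ones, then assign 2 wherever Y_train == 0 (note: =, not ==)
sample_weight = np.ones(Y_train.shape, dtype=int)
sample_weight[Y_train == 0] = 2
print(sample_weight)  # [2 2 1 1 2 1]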

How does sklearn compute the precision_score metric?

Hello, I am working with sklearn, and in order to understand the metrics better, I followed this example of precision_score:
from sklearn.metrics import precision_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print(precision_score(y_true, y_pred, average='macro'))
The result that I got was the following:
0.222222222222
I understand that sklearn computes that result following these steps:
for label 0 precision is tp / (tp + fp) = 2 / (2 + 1) = 0.66
for label 1 precision is 0 / (0 + 2) = 0
for label 2 precision is 0 / (0 + 1) = 0
and finally sklearn calculates mean precision by all three labels: precision = (0.66 + 0 + 0) / 3 = 0.22
this result is given if we take this parameters:
precision_score(y_true, y_pred, average='macro')
on the other hand if we take this parameters, changing average='micro' :
precision_score(y_true, y_pred, average='micro')
then we get:
0.33
and if we take average='weighted':
precision_score(y_true, y_pred, average='weighted')
then we obtain:
0.22.
I don't understand well how sklearn computes this metric when the average parameter is set to 'weighted' or 'micro'; I would really appreciate it if someone could give me a clear explanation of this.
'micro':
Calculate metrics globally by considering each element of the label indicator matrix as a label.
'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).
'samples':
Calculate metrics for each instance, and find their average.
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
For the meaning of support, see:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
Basically, it is the number of samples belonging to each class.
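To make 'micro' and 'weighted' concrete for the example in the question, here is a small sketch that reproduces the three numbers from the per-class counts already worked out above (TP, FP, and support per label):

from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# per-class counts from the question's breakdown
tp = {0: 2, 1: 0, 2: 0}
fp = {0: 1, 1: 2, 2: 1}
support = {0: 2, 1: 2, 2: 2}   # number of true instances per label

per_class = {c: tp[c] / (tp[c] + fp[c]) for c in tp}  # 0.67, 0.0, 0.0

macro = sum(per_class.values()) / 3                                  # 0.22
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))     # 2/6 = 0.33
weighted = sum(support[c] * per_class[c] for c in tp) / sum(support.values())  # 0.22

print(macro, micro, weighted)
print(precision_score(y_true, y_pred, average='macro'))     # 0.222...
print(precision_score(y_true, y_pred, average='micro'))     # 0.333...
print(precision_score(y_true, y_pred, average='weighted'))  # 0.222...

Because the supports here are all equal (2 per label), 'weighted' coincides with 'macro', which is why both come out to 0.22.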
3.3.2.12. Receiver operating characteristic (ROC)
The function roc_curve computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia :
“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.”
TN / True Negative: case was negative and predicted negative.
TP / True Positive: case was positive and predicted positive.
FN / False Negative: case was positive but predicted negative.
FP / False Positive: case was negative but predicted positive
# Basic terminology
from sklearn import metrics

confusion = metrics.confusion_matrix(expected, predicted)  # expected/predicted: true and predicted labels
print(confusion, "\n")
TN, FP = confusion[0, 0], confusion[0, 1]
FN, TP = confusion[1, 0], confusion[1, 1]
print('Specificity: ', round(TN / float(TN + FP), 3) * 100, "\n")
print('Sensitivity: ', round(TP / float(TP + FN), 3) * 100, "(Recall)")
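As a minimal sketch of the roc_curve function quoted above (the labels and scores below are made-up values purely for illustration):

from sklearn.metrics import roc_curve, auc

# made-up binary labels and classifier scores, for illustration only
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr)            # false positive rate at each threshold
print(tpr)            # true positive rate at each threshold
print(auc(fpr, tpr))  # area under the ROC curve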
