Finding the mean of a distribution - python-3.x

My code generates a number of distributions (I only plotted one below to keep things legible). The Y axis represents a probability density function and the X axis is a simple array of values.
In more detail.
Y = [0.02046505 0.10756612 0.24319883 0.30336375 0.22071875 0.0890625 0.015625 0 0 0]
And X is generated using np.arange(0,10,1) = [0 1 2 3 4 5 6 7 8 9]
I want to find the mean of this distribution (i.e. where the curve peaks on the X axis), not the mean of the Y values. I know how to use numpy's np.mean to find the mean of Y, but that's not what I need.
By eye, the mean here is about x=3, but I would like to compute it in code to make it more accurate.
Any help would be great.

By definition, the mean (more precisely, the expected value of the random variable x) is sum(p(x[j]) * x[j]), where p(x[j]) is the value of the PDF at x[j]. Since you already have the PDF values, you can implement this directly:
>>> import numpy as np
>>> Y = np.array([0.02046505, 0.10756612, 0.24319883, 0.30336375, 0.22071875, 0.0890625, 0.015625, 0, 0, 0])
>>> Y
array([0.02046505, 0.10756612, 0.24319883, 0.30336375, 0.22071875,
       0.0890625 , 0.015625  , 0.        , 0.        , 0.        ])
>>> X = np.arange(0, 10)
>>> Y.sum()
1.0
>>> (X * Y).sum()
2.92599253
So the (approximate) answer is 2.92599253. Note that this is the mean (expected value); the point where the curve peaks (about x=3 here) is strictly speaking the mode, and for an asymmetric distribution like this one the two need not coincide.

Related

Numpy take along multiple axes

I have an N-dimensional array, where N is a variable, from which I want to take elements along a given set of axes.
The objective is similar to that question, except that the solution there only seems to work when the dimensions and the axes are fixed.
For example, suppose that from a 3D array we want to extract elements along axis 0 for every multi-index along the other two axes. If the value of N is known beforehand, this can be hard-coded:
import numpy as np
a = np.arange(12).reshape((2,3,2))
ydim = a.shape[1]
zdim = a.shape[2]
for y in range(ydim):
    for z in range(zdim):
        print(a[:,y,z])
which gives the output
[0 6]
[1 7]
[2 8]
[3 9]
[ 4 10]
[ 5 11]
Q: How can this be achieved when N and the axes are not known beforehand?
For a single axis, numpy.take or numpy.take_along_axis does the job. I am looking for a similar function but for multiple axes. A function, say, take_along_axes(), which could be used as follows:
ax = [1, 2]  ## list of axes from which indices are taken
it = np.nditer(a, op_axes=ax, flags=['multi_index'])  ## every index along those axes
while not it.finished:
    print(np.take_along_axes(a, it.multi_index, axes=ax))
    it.iternext()
The expected output is the same as the previous one.
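There is no answer to this one in the extract, but as a hedged sketch (take_along_axes above is hypothetical, and so is the helper below): one way to get this behavior with existing NumPy primitives is to move the iterated axes to the front with np.moveaxis and flatten them into a single axis to loop over.
import numpy as np

def take_along_axes(a, axes):
    # Hypothetical helper: iterate over every multi-index along `axes`,
    # yielding the slice of `a` taken along the remaining axes.
    axes = tuple(ax % a.ndim for ax in axes)
    kept = [ax for ax in range(a.ndim) if ax not in axes]
    # Move the iterated axes to the front, then merge them into one axis.
    moved = np.moveaxis(a, axes, list(range(len(axes))))
    flat = moved.reshape(-1, *(a.shape[ax] for ax in kept))
    for s in flat:
        yield s

a = np.arange(12).reshape((2,3,2))
for s in take_along_axes(a, [1, 2]):
    print(s)  # [0 6], [1 7], [2 8], [3 9], [ 4 10], [ 5 11]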

Random generator in python

I want to select one of 0 or 1 based on some probability of getting 1 and some initial seed.
I tried the following:
import random
population = [0, 1]
random.seed(33)
probabilities = [0.4, 0.2, 0.5]
def sampleIt():
    selectedProb = random.randrange(0, 3, 1)  # select one of the probabilities
    print('Selected Probability: ', probabilities[selectedProb-1])
    return random.choices(population, [0, probabilities[selectedProb-1]])
for i in range(100):
    sample = sampleIt()
    print(sample[0])
Below is sample output:
Selected Probability: 0.2
1
Selected Probability: 0.5
1
Selected Probability: 0.4
1
Selected Probability: 0.2
1
Selected Probability: 0.5
1
Selected Probability: 0.2
1
Doubts:
As you can see, it is able to randomly select probabilities. But for each selected probability, it ends up selecting 1 from the population. If it selected probability 0.2, then I expect it to select 1 with probability 0.2; that way, it should have selected 0 at least once. But that is not happening. Why is this so?
Is the seed set correctly, or do we have to set it differently?
Also, what changes do I need to make if I expect sampleIt() to be called from different threads?
Also, is there any standard practice to improve performance, say if I run this millions of times? Do I have to use numpy for random number generation?
Do random.randrange() and random.choices() follow a uniform distribution?
There are several critical errors here. Let's go through them and then look at the correct way to do this.
First, if this were working properly, you'd be getting 1 with a net probability of about 0.37, which is 1/3 * (0.2 + 0.4 + 0.5), because you are first choosing one of the three probabilities at random.
Second, you are passing weights to random.choices in the second positional argument, and the weight you pass for option zero is 0, so zero will never be picked. In that same statement, you are also unnecessarily subtracting 1 from the index you drew, which shifts which probability is used.
To do this properly for Bernoulli trials, you can just draw a random number and compare it to the probability you want, or you can use random.choices correctly and get a list:
In [14]: def gen_sample(p_success):
    ...:     if random.random() < p_success:
    ...:         return 1
    ...:     return 0
    ...:
In [15]: gen_sample(0.95)
Out[15]: 1
In [16]: gen_sample(0.02)
Out[16]: 0
In [17]: p_success = 0.85
In [18]: random.choices([0, 1], weights=[1-p_success, p_success], k=10)
Out[18]: [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
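On the side questions about performance and threading (not covered in the answer above): for millions of draws, a vectorized NumPy generator is typically much faster than calling random.choices in a Python loop, and giving each thread its own Generator instance avoids shared-state issues. A minimal sketch, assuming NumPy is acceptable:
import numpy as np

rng = np.random.default_rng(33)   # seeded, reproducible generator
p_success = 0.2

# One million Bernoulli(p_success) draws in a single vectorized call
samples = (rng.random(1_000_000) < p_success).astype(int)
print(samples[:10])    # e.g. [0 0 1 0 0 0 0 0 1 0]
print(samples.mean())  # sample mean should be close to 0.2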

How to avoid NaN in numpy implementation of logistic regression?

EDIT: I already made significant progress. My current question is written after my last edit below and can be answered without the context.
I currently follow Andrew Ng's Machine Learning Course on Coursera and tried to implement logistic regression today.
Notation:
X is an (m x n) matrix with vectors of input variables as rows (m training samples of n-1 variables; the entries of the first column are all equal to 1 to represent a constant term).
y is the corresponding vector of expected output samples (column vector with m entries equal to 0 or 1)
theta is the vector of model coefficients (row vector with n entries)
For an input row vector x the model will predict the probability sigmoid(x * theta.T) for a positive outcome.
This is my Python3/numpy implementation:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

def logistic_cost(X, y, theta):
    summands = np.multiply(y, np.log(vec_sigmoid(X*theta.T))) + np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
    return - np.sum(summands) / len(y)

def gradient_descent(X, y, learning_rate, num_iterations):
    num_parameters = X.shape[1]                              # dim theta
    theta = np.matrix([0.0 for i in range(num_parameters)])  # init theta
    cost = [0.0 for i in range(num_iterations)]
    for it in range(num_iterations):
        error = np.repeat(vec_sigmoid(X * theta.T) - y, num_parameters, axis=1)
        error_derivative = np.sum(np.multiply(error, X), axis=0)
        theta = theta - (learning_rate / len(y)) * error_derivative
        cost[it] = logistic_cost(X, y, theta)
    return theta, cost
This implementation seems to work fine, but I encountered a problem when calculating the logistic cost. At some point the gradient descent algorithm converges to a pretty well-fitting theta, and the following happens:
For some input row X_i with expected outcome 1, X_i * theta.T becomes positive with a good margin (for example 23.207). This leads sigmoid(X_i * theta.T) to become exactly 1.0 (because of lost floating-point precision, I think). This is a good prediction (since the expected outcome is 1), but it breaks the calculation of the logistic cost, since np.log(1 - vec_sigmoid(X*theta.T)) evaluates to -inf there. That shouldn't be a problem in itself, since the term is multiplied by 1 - y = 0, but 0 * -inf = NaN, and once a NaN occurs, the whole sum is broken.
How should I handle this in the vectorized implementation, since np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T))) is calculated for every row of X (not only where y = 0)?
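To see the failure mode in isolation (a quick demonstration, not part of the original question):
import numpy as np
print(np.log(1 - 1.0))      # -inf: log of exactly zero (emits a RuntimeWarning)
print(0 * np.log(1 - 1.0))  # nan: 0 * -inf does not cancel, it propagates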
Example input:
X = np.matrix([[1. , 0. , 0. ],
               [1. , 1. , 0. ],
               [1. , 0. , 1. ],
               [1. , 0.5, 0.3],
               [1. , 1. , 0.2]])
y = np.matrix([[0],
               [1],
               [1],
               [0],
               [1]])
Then theta, _ = gradient_descent(X, y, 10000, 10000) (yes, in this case we can set the learning rate this large) will set theta as:
theta = np.matrix([[-3000.04008972, 3499.97995514, 4099.98797308]])
This leads vec_sigmoid(X * theta.T) to be the really good prediction:
np.matrix([[0.00000000e+00],  # 0
           [1.00000000e+00],  # 1
           [1.00000000e+00],  # 1
           [1.95334953e-09],  # nearly zero
           [1.00000000e+00]]) # 1
but logistic_cost(X, y, theta) evaluates to NaN.
EDIT:
I came up with the following solution. I just replaced the logistic_cost function with:
def new_logistic_cost(X, y, theta):
    term1 = vec_sigmoid(X*theta.T)
    term1[y == 0] = 1
    term2 = 1 - vec_sigmoid(X*theta.T)
    term2[y == 1] = 1
    summands = np.multiply(y, np.log(term1)) + np.multiply(1 - y, np.log(term2))
    return - np.sum(summands) / len(y)
By using the mask I just calculate log(1) at the places where the result will be multiplied by zero anyway. Now log(0) will only happen in incorrect implementations of gradient descent.
Open questions: How can I make this solution cleaner? Is it possible to achieve a similar effect in a cleaner way?
If you don't mind using SciPy, you could import expit and xlog1py from scipy.special:
from scipy.special import expit, xlog1py
and replace the expression
np.multiply(1 - y, np.log(1 - vec_sigmoid(X*theta.T)))
with
xlog1py(1 - y, -expit(X*theta.T))
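Putting that together, the whole cost function could look like this (a sketch; using xlogy for the first term is my own addition, as the answer only mentions xlog1py):
import numpy as np
from scipy.special import expit, xlogy, xlog1py

def stable_logistic_cost(X, y, theta):
    z = X * theta.T
    # xlogy(y, p) computes y*log(p) with the convention 0*log(0) == 0;
    # xlog1py(1-y, -p) computes (1-y)*log(1-p) with the same convention.
    summands = xlogy(y, expit(z)) + xlog1py(1 - y, -expit(z))
    return -np.sum(summands) / len(y)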
I know it is an old question, but I ran into the same problem, and maybe it can help others in the future: I actually solved it by normalizing the data before appending X0.
def normalize_data(X):
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std
After this all worked well!

Why does contourf (matplotlib) switch x and y coordinates?

I am trying to get contourf to plot my stuff right, but it seems to switch the x and y coordinates. In the example below, I show this by evaluating a 2D Gaussian function that has different widths in the x and y directions. With the values given, the width in the y direction should be larger. Here is the script:
from numpy import *
from matplotlib.pyplot import *
xMax = 50
xNum = 100
w0x = 10
w0y = 15
dx = xMax/xNum
xGrid = linspace(-xMax/2+dx/2, xMax/2-dx/2, xNum, endpoint=True)
yGrid = xGrid
Int = zeros((xNum, xNum))
for idX in range(xNum):
    for idY in range(xNum):
        Int[idX, idY] = exp(-((xGrid[idX]/w0x)**2 + (yGrid[idY]/w0y)**2))
fig = figure(6)
clf()
ax = subplot(2,1,1)
X, Y = meshgrid(xGrid, yGrid)
contour(X, Y, Int, colors='k')
plot(array([-xMax, xMax])/2, array([0, 0]), '-b')
plot(array([0, 0]), array([-xMax, xMax])/2, '-r')
ax.set_aspect('equal')
xlabel("x")
ylabel("y")
subplot(2,1,2)
plot(xGrid, Int[:, int(xNum/2)], '-b', label='I(x, y=max/2)')
plot(xGrid, Int[int(xNum/2), :], '-r', label='I(x=max/2, y)')
ax.set_aspect('equal')
legend()
xlabel(r"x or y")
ylabel(r"I(x or y)")
The resulting figure (image not included here) shows the following:
On top is the contour plot, which has the larger width in the x direction (not y). Below it are two slices: one along the x direction (at constant y=0, blue), the other along the y direction (at constant x=0, red). Here everything looks fine: the y direction is broader than the x direction. So why do I have to transpose the array in order to have it plotted the way I want? This seems unintuitive to me and not in agreement with the documentation.
It helps if you think of a 2D array's shape not as (x, y) but as (rows, columns), because that is how most math routines interpret them - including matplotlib's 2D plotting functions. Therefore, the first dimension is vertical (which you call y) and the second dimension is horizontal (which you call x).
Note that this convention is very prominent, even in numpy: np.vstack, which concatenates arrays vertically, works along the first dimension, and np.hstack, which concatenates horizontally, works along the second.
To illustrate the point:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([[0, 0, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [1, 1, 1, 1, 1]])
a[:, 2] = 2 # set column
print(a)
plt.imshow(a)
plt.contour(a, colors='k')
This prints
[[0 0 2 0 0]
 [0 1 2 1 0]
 [1 1 2 1 1]]
and both plots consistently show the change in the third column (images omitted here).
According to your convention that an array is (x, y), the command a[:, 2] = 2 should have assigned to the third row, but numpy and matplotlib both agree that it was the column :)
You can of course use your own convention how to interpret the dimensions of your arrays, but in the long run it will be more consistent to treat them as (y, x).
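Applied to the question's script, that means either transposing Int before plotting or building the array in the (row=y, column=x) convention from the start, e.g. via meshgrid. A minimal sketch of the second option, reusing the question's widths:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-25, 25, 100)
y = np.linspace(-25, 25, 100)
X, Y = np.meshgrid(x, y)                    # X varies along columns, Y along rows
Int = np.exp(-((X / 10)**2 + (Y / 15)**2))  # w0x=10, w0y=15, as in the question

plt.contour(X, Y, Int, colors='k')          # now visibly broader in y
plt.gca().set_aspect('equal')
plt.show()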

Sample from a discrete random distribution in Python

I was hoping to find out whether there is a command in numpy or scipy to pick an element from a discrete random distribution, i.e.:
For example, I have a discrete distribution x = (0.5, 0.3, 0.2) and I want to sample from y = (1, 2, 3)...
>>> sample(x, y)
2
>>> sample(x, y)
3
>>> sample(x, y)
3
>>> sample(x, y)
1
Hope my question is clear. Thanks.
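This question has no answer in this extract, but as a sketch of the usual approach: numpy.random.Generator.choice accepts a probability vector through its p parameter.
import numpy as np

rng = np.random.default_rng()
x = [0.5, 0.3, 0.2]   # probabilities
y = [1, 2, 3]         # values to sample from

print(rng.choice(y, p=x))          # a single draw, e.g. 2
print(rng.choice(y, size=5, p=x))  # several draws, e.g. [1 3 1 1 2]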
