Skewed random sample from Numpy random generator sample (numpy.random.Generator.choice) - python-3.x

I have written a piece of Python code to generate a mixture of normal distributions and I want to sample from it. Since the result is my probability density function, I want the sample to be representative of the original distribution.
So I have developed the function to create the pdf:
def gaussian_pdf(amplitude, mean, std, sample_int):
    coeff = (amplitude / std) / np.sqrt(2 * np.pi)
    if len(amplitude) > 1:
        # create mixture distribution
        # get distribution support
        absciss_array = np.linspace(np.min(mean) - 4 * std[np.argmin(mean)],
                                    np.max(mean) + 4 * std[np.argmax(mean)],
                                    sample_int)
        normal_array = np.zeros(len(absciss_array))
        for index in range(0, len(amplitude)):
            normal_array += coeff[index] * np.exp(-((absciss_array - mean[index]) / std[index]) ** 2)
    else:
        # create simple gaussian distribution
        absciss_array = np.linspace(mean - 4 * std, mean + 4 * std, sample_int)
        normal_array = coeff * np.exp(-((absciss_array - mean) / (2 * std)) ** 2)
    return np.ascontiguousarray(normal_array / np.sum(normal_array))
And I have tested the sampling with the main part of the script:
def main():
    amplitude = np.asarray([1, 2, 1])
    mean = np.asarray([0.5, 1, 2.5])
    std = np.asarray([0.1, 0.2, 0.3])
    no_sample = 10000
    # create mixture gaussian array
    gaussian_array = gaussian_pdf(amplitude, mean, std, no_sample)
    # plot data
    fig, ax = plt.subplots()
    absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
    ax.plot(absciss, gaussian_array)
    # create random generator to sample from distribution
    rng = np.random.default_rng(424242)
    # sample from distribution
    sample = rng.choice(a=gaussian_array, size=100, replace=True, p=gaussian_array)
    # plot results
    ax.plot(sample, np.full_like(sample, -0.00001), '|k', markeredgewidth=1)
    plt.show()
    return None
I then get the following result:
You can see from the dark lines the samples that have been drawn from the distribution. The problem is that, even though I pass the probability array to the numpy function, the sampling is skewed towards the end of the distribution. I have tried several times with other seeds but the result does not change...
I expect to have more samples in the area where the probability density is greater...
Would someone please help me? Am I missing something here?
Thanks in advance.

Well, actually the answer was to sample from a uniformly spaced abscissa array, using the pdf as the probability weights, rather than from the pdf values themselves. Thanks to @amzon-ex for pointing it out.
The code is then:
absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
sample_other = rng.choice(a=absciss, size=100, replace=True, p=gaussian_array)
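A quick way to check that the sampling is no longer skewed is to draw a larger sample and compare its histogram against the pdf. This is only a sketch, assuming absciss is the grid of x-values over which gaussian_array was evaluated (the support of the distribution):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(424242)
# draw a larger sample of x-positions, weighted by the pdf values
check_sample = rng.choice(a=absciss, size=10000, replace=True, p=gaussian_array)

# gaussian_array sums to 1 over the grid, so dividing by the grid spacing
# gives a density that is comparable with a density-normalized histogram
dx = absciss[1] - absciss[0]

fig, ax = plt.subplots()
ax.plot(absciss, gaussian_array / dx, label='pdf')
ax.hist(check_sample, bins=100, density=True, alpha=0.5, label='sample histogram')
ax.legend()
plt.show()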

Related

How to judge the similarity between two pieces of audio?

I want to implement a singing-scoring feature that determines how similar two audio clips are, but I do not know of a simple implementation. I have looked at a lot of GitHub projects; what they mostly mention is simhash, but I feel that may not work very well for audio, so I am asking here for advice.
One approach would be to find the frequencies that are present in an audio segment by using autocorrelation.
There are many implementations of this, e.g. in librosa, scipy, numpy and so on.
Here is a very loose and sloppy implementation to give you an understanding of the algorithm without external libs:
import matplotlib.pyplot as plt
import math

'''
create a test signal
'''
sr = 44100      # hz
freq = 200      # hz
duration = 0.1  # sec
signal = [math.sin(x * (math.pi * 2 * freq) * (1 / sr)) for x in range(0, int(sr * duration))]

'''
compute the auto correlation at a given frequency
'''
def auto_correlation(signal, freq, sr):
    sample_delay = int(sr / freq)
    score = 0
    for i in range(0, len(signal) - sample_delay):
        score += signal[i] * signal[i + sample_delay]
    return score / len(signal)

'''
iterate over the frequency spectrum and test the autocorrelation
'''
start_freq = 150
end_freq = 1000
scores = []
for freq in range(start_freq, end_freq):
    scores.append(auto_correlation(signal, freq, sr))

max_index = scores.index(max(scores))
print("estimated frequency : {} hz".format(max_index + start_freq))

plt.ylabel("correlation")
plt.xlabel("frequency Hz")
plt.plot([x + start_freq for x in range(len(scores))], scores)
plt.show()
You could then iterate through the audio files, test segments for the dominating frequency and compare the scores.
Another possibility is to do this by computing the FFT of the audio files and comparing those. librosa is a great lib if you're in Python territory.
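To give a rough idea of the FFT route, here is a loose sketch along the same lines as the autocorrelation example above: it compares the magnitude spectra of two mono clips sampled at the same rate, using cosine similarity as a crude score (the function name and the sine-wave test are just for illustration):
import numpy as np

def spectrum_similarity(signal_a, signal_b, n_fft=4096):
    # magnitude spectra of the two clips (zero-padded or truncated to n_fft)
    spec_a = np.abs(np.fft.rfft(signal_a, n=n_fft))
    spec_b = np.abs(np.fft.rfft(signal_b, n=n_fft))
    # cosine similarity of the spectra: 1.0 means identical spectral shape
    return float(np.dot(spec_a, spec_b) /
                 (np.linalg.norm(spec_a) * np.linalg.norm(spec_b) + 1e-12))

# quick test with two sine waves of nearby frequencies
sr = 44100
t = np.arange(0, 0.1, 1 / sr)
print(spectrum_similarity(np.sin(2 * np.pi * 200 * t), np.sin(2 * np.pi * 210 * t)))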

Improper cost function outputs for Vectorized Logistic Regression

I'm trying to implement vectorized logistic regression on the Iris dataset, following the implementation from Andrew Ng's YouTube series on deep learning. My best predictions using this method have been 81% accuracy, while sklearn's implementation achieves 100% with completely different values for the coefficients and bias. Also, I can't seem to get proper outputs from my cost function. I suspect it is an issue with computing the gradients of the weights and bias with respect to the cost function, though in the course he provides all of the necessary equations (unless something in the actual exercise, which I don't have access to, is being left out). My code is as follows.
n = 4
m = 150
y = y.reshape(1, 150)
X = X.reshape(4, 150)
W = np.zeros((4, 1))
b = np.zeros((1, 1))

for epoch in range(1000):
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z)  # 1/(1 + e ** (-Z))
    J = -1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))  # cost function
    dz = A - y
    dw = 1/m * np.dot(X, dz.T)
    db = np.sum(dz)
    W = W - 0.01 * dw
    b = b - 0.01 * db
    if epoch % 100 == 0:
        print(J)
My output looks something like this.
-1.6126604413879289
-1.6185960074767125
-1.6242504226045396
-1.6296400635926438
-1.6347800862216104
-1.6396845400653066
-1.6443664703028427
-1.648838008214648
-1.653110451818512
-1.6571943378913891
W and b values are:
array([[-0.68262679, -1.56816916, 0.12043066, 1.13296948]])
array([[0.53087131]])
Whereas sklearn outputs:
(array([[ 0.41498833, 1.46129739, -2.26214118, -1.0290951 ]]),
array([0.26560617]))
I understand sklearn uses L2 regularization, but even with that turned off the result is still far from the correct values. Any help would be appreciated. Thanks.
You are likely getting strange results because you are trying to use logistic regression where y is not a binary choice. Categorizing the iris data is a multiclass problem, y can be one of three values:
> np.unique(iris.target)
> array([0, 1, 2])
The cross-entropy cost function expects y to be either one or zero. One way to handle this is the one-vs-all method.
You can check each class by making y a boolean of whether the iris is in that class or not. For example, here you can make y indicate whether each sample is in class 1 or not:
y = (iris.target == 1).astype(int)
With that, your cost function and gradient descent should work, but you'll need to run it once per class and, for each example, pick the class with the best score. Andrew Ng's class talks about this method.
EDIT:
It's not clear what you are starting with for data. When I do this, I don't reshape the inputs, so you should double-check that all your multiplications deliver the shapes you want. One thing I notice that's a little odd is the last term in your cost function. I generally do this:
cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
not:
-1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))
Here's code that converges for me using the dataset from sklearn:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data

# Iris is a multiclass problem. Here, just calculate the probability that
# the class is `iris_class`
iris_class = 0
Y = np.expand_dims((iris.target == iris_class).astype(int), axis=1)
# Y is now a data set of booleans indicating whether the sample is or isn't a member of iris_class

# initialize w and b
W = np.random.randn(4, 1)
b = np.random.randn(1, 1)
a = 0.1         # learning rate
m = Y.shape[0]  # number of samples

def sigmoid(Z):
    return 1/(1 + np.exp(-Z))

for i in range(1000):
    Z = np.dot(X, W) + b
    A = sigmoid(Z)
    dz = A - Y
    dw = 1/m * np.dot(X.T, dz)
    db = np.mean(dz)
    W -= a * dw
    b -= a * db
    cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
    if i % 100 == 0:
        print(cost)
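As a quick sanity check, the trained W and b can be turned into hard predictions and a training accuracy. This is just a sketch reusing the variables from the snippet above; for the full one-vs-all scheme you would repeat the training loop for each iris_class and, per sample, keep the class whose model gives the highest sigmoid output:
# probability that each sample belongs to iris_class, using the trained W and b
probs = sigmoid(np.dot(X, W) + b)
preds = (probs >= 0.5).astype(int)
print("training accuracy for class {}: {:.3f}".format(iris_class, np.mean(preds == Y)))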

Fitting distribution functions to dataset in Python 3

I'm trying to find the probability distribution that best fits my data. I've tried the code I've found in different threads, but the results are not what I'm expecting.
The descriptive statistics and histogram for my data are as follows:
[Data histogram]
count 865.000000
mean 43.476713
std 12.486362
min 4.075682
25% 34.934609
50% 41.917304
75% 51.271708
max 88.843940
I tried to find a proper distribution function using the following code, but the results are not what I expected.
size = 865
kappa = 99
x = scipy.arange(size)
y = scipy.int_(scipy.round_(st.vonmises.rvs(kappa, size=size) * 100))
h = plt.hist(df['spreadMaizChicagoAtlantico'], bins=100, color='b')

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)

plt.xlim(0, 100)
plt.legend(loc='upper right')
plt.show()
[Data histogram with fitted functions]
Can anyone please tell me what I'm doing wrong and guide me towards a better understanding of these solutions?
Thanks to the earlier reply, I found my mistake.
I got all the values from the DataFrame and made a numpy array.
ser=df.values
Then I ran code similar to the above, this time fitting the distributions to the proper data:
size = 867
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5, size=size) * 60))
h = plt.hist(ser, bins=range(80))

dist_names = ['beta', 'rayleigh', 'norm']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(ser)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)

plt.xlim(0, 100)
plt.legend(loc='upper right')
plt.show()
The result is as follows, showing the histogram and three probability density functions.
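If you would rather rank the candidate distributions than eyeball the curves, one common approach is to fit each candidate with scipy.stats and compare a goodness-of-fit measure such as the Kolmogorov-Smirnov statistic. This is only a sketch (ser would be the 1-D array of observations, as above); the synthetic data at the bottom is just for illustration:
import numpy as np
import scipy.stats as st

def rank_fits(data, dist_names=('gamma', 'beta', 'rayleigh', 'norm', 'pareto')):
    # fit each candidate by maximum likelihood and rank by the KS statistic (lower is better)
    results = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(data)                      # shape params, loc, scale
        ks_stat, p_value = st.kstest(data, name, args=params)
        results.append((name, ks_stat, p_value))
    return sorted(results, key=lambda r: r[1])

# illustration with synthetic data; replace with ser from the DataFrame
data = np.random.normal(43.5, 12.5, 865)
for name, ks, p in rank_fits(data):
    print("{:10s} KS={:.4f} p={:.3g}".format(name, ks, p))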
The distfit library can do this job as it searches for the best fit among 89 theoretical distributions.
pip install distfit
import numpy as np
from distfit import distfit

# Example data
X = np.random.normal(10, 3, 2000)

# Initialize
dfit = distfit()

# Search for best theoretical fit on your empirical data
dfit.fit_transform(X)

# The plot function will now also include the predictions of y
dfit.plot(chart='PDF',
          emp_properties={'linewidth': 4, 'color': 'k'},
          bar_properties={'edgecolor': 'k', 'color': 'g'},
          pdf_properties={'linewidth': 4, 'color': 'r'})

Linear Regression algorithm works with one data-set but not on another, similar data-set. Why?

I created a linear regression algorithm following a tutorial and applied it to the data-set provided and it works fine. However the same algorithm does not work on another similar data-set. Can somebody tell me why this happens?
def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    params = int(theta.ravel().shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        err = (X * theta.T) - y
        for j in range(params):
            term = np.multiply(err, X[:, j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost

alpha = 0.01
iters = 1000
g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g)
On running the algorithm on this dataset I get matrix([[ nan, nan]]) as output, along with the following warnings:
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: RuntimeWarning: overflow encountered in power
from ipykernel import kernelapp as app
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:11: RuntimeWarning: invalid value encountered in double_scalars
However this data set works just fine and outputs matrix([[-3.24140214, 1.1272942 ]])
Both datasets are similar; I have been over them many times but can't seem to figure out why it works on one dataset but not on the other. Any help is welcome.
Edit: Thanks Mark_M for editing tips :-)
[Much better question, btw]
It's hard to know exactly what's going on here, but basically your cost is going in the wrong direction and spiraling out of control, which results in an overflow when you try to square the value.
I think in your case it boils down to your step size (alpha) being too big, which can cause gradient descent to go the wrong way. You need to watch the cost during gradient descent and make sure it's always going down; if it's not, either something is broken or alpha is too large.
Personally, I would reevaluate the code and try to get rid of the loops. It's a matter of preference, but I find it easier to work with X and Y as column vectors. Here is a minimal example:
import numpy as np
from numpy import genfromtxt

# this is your 'bad' data set from github
my_data = genfromtxt('testdata.csv', delimiter=',')

def computeCost(X, y, theta):
    inner = np.power(((X @ theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    for i in range(iters):
        # you don't need the extra loop - this can be vectorized,
        # making it much faster and simpler
        theta = theta - (alpha/len(X)) * np.sum((X @ theta.T - y) * X, axis=0)
        cost = computeCost(X, y, theta)
        if i % 10 == 0:  # just look at cost every ten loops for debugging
            print(cost)
    return (theta, cost)

# notice small alpha value
alpha = 0.0001
iters = 100

# here x is columns
X = my_data[:, 0].reshape(-1, 1)
ones = np.ones([X.shape[0], 1])
X = np.hstack([ones, X])

# theta is a row vector
theta = np.array([[1.0, 1.0]])

# y is a column vector
y = my_data[:, 1].reshape(-1, 1)

g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g, cost)
Another useful technique is to normalize your data before doing the regression. This is especially useful when you have more than one feature.
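A minimal sketch of that normalization, reusing the names from the example above (z-scoring every feature column while leaving the bias column of ones untouched):
# z-score every feature column except the leading column of ones (the bias term)
features = X[:, 1:]
X_norm = np.hstack([X[:, :1], (features - features.mean(axis=0)) / features.std(axis=0)])

# gradient descent then runs on the scaled features; the learned theta lives in
# the scaled space, so apply the same mean/std to any new data before predicting
g_norm, cost_norm = gradientDescent(X_norm, y, theta, alpha, iters)
print(g_norm, cost_norm)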
As a side note: if your step size is right you shouldn't get overflows no matter how many iterations you do, because the cost will decrease with every iteration and the rate of decrease will slow.
After 1000 iterations I arrived at a theta and cost of:
[[ 1.03533399 1.45914293]] 56.041973778
after 100:
[[ 1.01166889 1.45960806]] 56.0481988054
You can use this to look at the fit in an iPython notebook:
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(my_data[:, 0].reshape(-1,1), y)
axes = plt.gca()
x_vals = np.array(axes.get_xlim())
y_vals = g[0][0] + g[0][1]* x_vals
plt.plot(x_vals, y_vals, '--')

Normalizing CDF in Python

I want to calculate and plot the cumulative distribution function (CDF) of a given sample, new_dO18, and then overlay the CDF of a normal distribution with a given mean and standard deviation on the same plot. I am having problems normalizing the CDF: I should have values ranging between 0 and 1. Can someone guide me as to where I went wrong? I'm sure it's a simple fix, but I'm very new to Python. I've included my steps so far. Thanks!
# Use np.histogram to get counts in each bin. See the help page or
# documentation on how to use this function, and what it returns.

# normalize the data new_dO18 using a for loop
norm_newdO18 = []
for element in new_dO18:
    x = element
    y = (x - np.mean(new_dO18)) / np.std(new_dO18)
    norm_newdO18.append(y)
print('normalized dO18 values, excluding outliers:', norm_newdO18)
print()

# Use the histogram function to bin the data
num_bins = 20
counts, bin_edges = np.histogram(norm_newdO18, bins=num_bins, normed=0)

# Calculate and plot CDF of sample
cdf = np.cumsum(counts)
scale = 1.0 / cdf[-1]
norm_cdf = scale * cdf
plt.plot(bin_edges[1:], norm_cdf, label='dO18 values')
plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
plt.xlabel('normalized dO18 data')
plt.ylabel('frequency')

# Calculate and overlay the CDF of a normal distribution with sample mean and std
# as parameters.

# specific normally distributed function with mean and st. dev
mu, sigma = np.mean(new_dO18), np.std(new_dO18)
norm_theoretical = np.random.normal(mu, sigma, 1000)

# Calculate and plot CDF of theoretical sample
counts1, bin_edges1 = np.histogram(norm_theoretical, bins=20, normed=0)
cdft = np.cumsum(counts1)
scale = 1.0 / cdft[-1]
norm_cdft = scale * cdf
plt.plot(bin_edges[1:], norm_cdft, label='theoretical values')
plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
plt.show()
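For comparison, here is a sketch of one way to get both curves with the cumulative probability running from 0 to 1: an empirical CDF built by sorting the standardized sample, overlaid with the analytical normal CDF from scipy.stats. This is only an illustration and assumes new_dO18 is the 1-D data array from the question:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# empirical CDF of the standardized sample
norm_newdO18 = (np.asarray(new_dO18) - np.mean(new_dO18)) / np.std(new_dO18)
sorted_data = np.sort(norm_newdO18)
ecdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)   # rises from ~0 to 1

# analytical CDF of a standard normal over the same range
xs = np.linspace(sorted_data.min(), sorted_data.max(), 200)

plt.plot(sorted_data, ecdf, label='dO18 values (empirical CDF)')
plt.plot(xs, st.norm.cdf(xs), '--', label='normal CDF')
plt.xlabel('normalized dO18 data')
plt.ylabel('cumulative probability')
plt.legend(loc='upper left')
plt.show()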
