Sample from a distribution function in python - python-3.x

I have an array.
array([1,1,1,1,1,
1,1,1,0.96227599,0,
0,1,1,1,1,
0,0,1,0,0,
1,1,1,0,1,
1,1,0,1,0,
0,1,0,0,1,
0,0,1,1,1,
1,1,0,1,1,
1,1,1,1,1,
1,1,1,1,1,
1,1,0,0,0,
1,0,1,1,1,
1,1,1,1,1,
1,1,1,1,1,
0.94795539,0.85308765,0,0,1,
1,1,0.9113806,1,1,
1,1,1,1,1,
1,0,1,1,0,
1,1,1,1,1,
1,1,0.20363486,0.50635838,0.52025932,
0,0.34747655,0.50147493,0,0.4848249,
0,0.88495575,0,0.27620151,0.3981369,
0,0,0])
Values range from 0 to 1.
How can I plot a probability distribution function for these values? And how can I then fill a table with 1000 rows based on it, where each row has 5 columns, i.e. fill the table with samples of 5 values each?

To get a pdf from your samples you could use a kernel density estimator.
One option is the gaussian_kde from scipy.stats.
It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.
Your samples look strongly bimodal with clusters at 0 and 1, so you might be better advised to use sklearn's KernelDensity. Here you have more control over the specific algorithm, kernel and bandwidth.
Sklearn also has an introduction to Density Estimation.
The workflow with both methods is quite similar:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
a = np.array([1,1,1,1,1,1,1,1,0.96227599,0,0,1,1,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,1,0,0,1,0,0,1,0,0,1,1,1,1,1,0,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,0,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0.94795539,0.85308765,0,0,1,1,1,0.9113806,1,1,1,1,
1,1,1,1,0,1,1,0,1,1,1,1,1,1,1,0.20363486,0.50635838,0.52025932,0,0.34747655,0.50147493,0,0.4848249,0,
0.88495575,0,0.27620151,0.3981369,0,0,0])
# scipy: Gaussian KDE with automatic bandwidth selection, evaluated on a grid
kde1 = stats.gaussian_kde(a)
x1 = np.linspace(0, 1, 100)
y1 = kde1.pdf(x1)
# sklearn: two kernel density estimates with explicit bandwidths,
# visualised via histograms of samples drawn from each estimate
kde2 = KernelDensity(bandwidth=0.1).fit(a.reshape(-1, 1))
y2 = kde2.sample(10000)
kde3 = KernelDensity(bandwidth=0.01).fit(a.reshape(-1, 1))
y3 = kde3.sample(10000)
fig, ax = plt.subplots()
ax.plot(x1, y1, c='b')                                             # scipy pdf
ax.hist(y2.ravel(), bins=100, density=True, color='r', alpha=0.7)  # sklearn, bandwidth 0.1
ax.hist(y3.ravel(), bins=100, density=True, color='m', alpha=0.7)  # sklearn, bandwidth 0.01
plt.show()
Note that these methods do not limit your pdf to values in [0, 1].
You have to take care of this yourself, e.g. by filtering out-of-range samples in a second step. However, if you choose a small bandwidth, you should come pretty close.
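For example, a minimal sketch of that filtering step, reusing kde2 from above (the oversampling count of 20000 is arbitrary):
samples = kde2.sample(20000)                        # oversample on purpose
samples = samples[(samples >= 0) & (samples <= 1)]  # keep only values inside [0, 1]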
I do not quite understand the second part of your question.
If you want to draw new samples from the estimated distribution you can do so via kde.sample() (sklearn) / kde.resample() (scipy). And filling those values into a table is a different question for which you definitely will find answers here on StackOverflow.
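If it helps, here is a minimal sketch of how a 1000-row, 5-column table could be filled from the scipy estimate kde1; pandas and the column names are assumptions of mine, not part of the question:
import pandas as pd
raw = kde1.resample(20000)[0]              # resample returns an array of shape (1, n); draw generously
raw = raw[(raw >= 0) & (raw <= 1)][:5000]  # keep 5000 values that fall inside [0, 1]
table = pd.DataFrame(raw.reshape(1000, 5), columns=list("ABCDE"))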

Related

Plotting new points in a subspace after dimensionality reduction

I would like to plot points with 100 parameters each, with values between 0-99, on a 2 dimensional plot. This should be straightforward with normal methods of dimensionality reduction (PCA/tSNE/UMAP etc), but I need to be able to add subsequent points to the plot without it needing to recalculate and therefore shift the points that are already plotted.
I am picturing an algorithm that takes a data-point with its 100 values and converts it to X,Y coordinates that can then be plotted, such that points proximal in the 2D projection are proximal in the original 100D space. Does such an algorithm exist? If not, are there any alternative approaches?
Thanks
I am not sure I understood the question correctly but with an initial set X, we can fit a PCA to compute the principal components. Then, we can use these principal components to transform new samples.
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
n_samples, n_feats = 50, 100
X = np.random.randint(0, 99, size=n_samples * n_feats).reshape(n_samples, n_feats)
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
This plots the projected points.
Then, when a new sample comes in
new_sample = np.random.randint(0, 99, size=100).reshape(1, 100)
new_sample_reduced = pca.transform(new_sample)
plt.scatter(new_sample_reduced[:, 0], new_sample_reduced[:, 1], color="red")
We can plot it in the same 2D space.
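As a quick sanity check on how faithful the 2D view is, you can look at how much of the original variance the two components retain (a small sketch reusing the pca object fitted above):
# Fraction of the original 100-dimensional variance kept by the 2 components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())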

How to interpolate 2D spatial data with kriging in Python?

I have a spatial 2D domain, say [0,1]×[0,1]. In this domain, there are 6 points where some scalar quantity of interest has been observed (e.g., temperature, mechanical stress, fluid density, etc.). How can I predict the quantity of interest at unobserved points? In other words, how may I interpolate spatial data in Python?
For example, consider the following coordinates for points in the 2D domain (inputs) and corresponding observations of the quantity of interest (outputs).
import numpy as np
coordinates = np.array([[0.0,0.0],[0.5,0.0],[1.0,0.0],[0.0,1.0],[0.5,1.],[1.0,1.0]])
observations = np.array([1.0,0.5,0.75,-1.0,0.0,1.0])
The X and Y coordinates can be extracted with:
x = coordinates[:,0]
y = coordinates[:,1]
The following script creates a scatter plot where yellow (resp. blue) represents high (resp. low) output values.
import matplotlib.pyplot as plt
fig = plt.figure()
plt.scatter(x, y, c=observations, cmap='viridis')
plt.colorbar()
plt.show()
I would like to use Kriging to predict the scalar quantity of interest on a regular grid within the 2D input domain. How can I do this in Python?
In OpenTURNS, the KrigingAlgorithm class can estimate the hyperparameters of a Gaussian process model based on the known output values at specific input points. The getMetamodel method of KrigingAlgorithm, then, returns a function which interpolates the data.
First, we need to convert the Numpy arrays coordinates and observations to OpenTURNS Sample objects:
import openturns as ot
input_train = ot.Sample(coordinates)
output_train = ot.Sample(observations, 1)
The array coordinates has shape (6, 2), so it is turned into a Sample of size 6 and dimension 2. The array observations has shape (6,), which is ambiguous: Is it going to be a Sample of size 6 and dimension 1, or a Sample of size 1 and dimension 6? To clarify this, we specify the dimension (1) in the call to the Sample class constructor.
In the following, we define a Gaussian process model with constant trend function and squared exponential covariance kernel:
inputDimension = 2
basis = ot.ConstantBasisFactory(inputDimension).build()
covariance_kernel = ot.SquaredExponential([1.0]*inputDimension, [1.0])
algo = ot.KrigingAlgorithm(input_train, output_train,
covariance_kernel, basis)
We then fit the value of the trend and the parameters of the covariance kernel (amplitude parameter and scale parameters) and obtain a metamodel:
# Fit
algo.run()
result = algo.getResult()
krigingMetamodel = result.getMetaModel()
The resulting krigingMetamodel is a Function which takes a 2D Point as input and returns a 1D Point. It predicts the quantity of interest. To illustrate this, let us build the 2D domain [0,1]×[0,1] and discretize it with a regular grid:
# Create the 2D domain
myInterval = ot.Interval([0.0, 0.0], [1.0, 1.0])
# Define the number of intervals in each direction of the box
nx = 20
ny = 10
myIndices = [nx - 1, ny - 1]
myMesher = ot.IntervalMesher(myIndices)
myMeshBox = myMesher.build(myInterval)
Using our krigingMetamodel to predict the values taken by the quantity of interest on this mesh can be done with the following statements. We first get the vertices of the mesh as a Sample, and then evaluate the predictions with a single call to the metamodel (there is no need for a for loop here):
# Predict
vertices = myMeshBox.getVertices()
predictions = krigingMetamodel(vertices)
In order to see the result with Matplotlib, we first have to create the data required by the pcolor function:
# Format for plot
X = np.array(vertices[:, 0]).reshape((ny, nx))
Y = np.array(vertices[:, 1]).reshape((ny, nx))
predictions_array = np.array(predictions).reshape((ny,nx))
The following script produces the plot:
# Plot
import matplotlib.pyplot as plt
fig = plt.figure()
plt.pcolor(X, Y, predictions_array)
plt.colorbar()
plt.show()
We see that the predictions of the metamodel are equal to the observations at the observed input points.
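A quick way to check this, reusing the objects defined above (just a small sketch):
# Evaluate the metamodel at the training inputs and compare with the observations
predictions_at_train = krigingMetamodel(input_train)
print(predictions_at_train)
print(output_train)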
This metamodel is a smooth function of the coordinates: its smoothness increases with the smoothness of the covariance kernel, and the squared exponential kernel happens to be very smooth.

FFT on MPU6050 output signal

I want to perform an FFT on a data array that I have extracted from an MPU6050 sensor connected to an Arduino UNO, using Python.
Please find a data sample below:
0.13,0.04,1.03
0.14,0.01,1.02
0.15,-0.04,1.05
0.16,0.02,1.05
0.14,0.01,1.02
0.16,-0.03,1.04
0.15,-0.00,1.04
0.14,0.03,1.02
0.14,0.01,1.03
0.17,0.02,1.05
0.15,0.03,1.03
0.14,0.00,1.02
0.17,-0.02,1.05
0.16,0.01,1.04
0.14,0.02,1.01
0.15,0.00,1.03
0.16,0.03,1.05
0.11,0.03,1.01
0.15,-0.01,1.03
0.16,0.01,1.05
0.14,0.02,1.03
0.13,0.01,1.02
0.15,0.02,1.05
0.13,0.00,1.03
0.08,0.01,1.03
0.09,-0.01,1.03
0.09,-0.02,1.03
0.07,0.01,1.03
0.06,0.00,1.05
0.04,0.00,1.04
0.01,0.01,1.02
0.03,-0.05,1.02
-0.03,-0.05,1.03
-0.05,-0.02,1.02
I have taken the 1st column (X axis) and saved it in an array.
Reference: https://hackaday.io/project/12109-open-source-fft-spectrum-analyzer/details
From this I took the FFT part, and the code is below:
from scipy.signal import filtfilt, iirfilter, butter, lfilter
from scipy import fftpack, arange
import numpy as np
import string
import matplotlib.pyplot as plt
sample_rate = 0.2
accx_list_MPU=[]
outputfile1='C:/Users/Meena/Desktop/SensorData.txt'
def fftfunction(array):
    n = len(array)
    print('The length is....', n)
    k = arange(n)
    fs = sample_rate / 1.0
    T = n / fs
    freq = k / T
    freq = freq[range(n // 2)]
    Y = fftpack.fft(array) / n
    Y = Y[range(n // 2)]
    plt.plot(freq, abs(Y))
    plt.grid()
    plt.show()
with open(outputfile1) as f:
    string1 = f.readlines()
N1 = len(string1)
for i in range(10, N1):
    if i % 2 == 0:
        new_list = string1[i].split(',')
        l = len(new_list)
        if l == 3:
            accx_list_MPU.append(float(new_list[0]))
fftfunction(accx_list_MPU)
I have got the output of the FFT as shown in the attached FFToutput plot.
I do not understand whether the graph is correct. This is the first time I'm working with FFT; how do we relate it to the data?
This is what I got after the suggested changes: FFTnew
Here's a little rework of your fftfunction:
def fftfunction(array):
    N = len(array)
    amp_spec = abs(fftpack.fft(array)) / N
    freq = np.linspace(0, 1, num=N, endpoint=False)
    plt.plot(freq, amp_spec, "o-", markerfacecolor="none")
    plt.xlim(0, 0.6)  # easy way to hide datapoints
    plt.margins(0.05, 0.05)
    plt.xlabel("Frequency $f/f_{sample}$")
    plt.ylabel("Amplitude spectrum")
    plt.minorticks_on()
    plt.grid(True, which="both")
fftfunction(X)
Specifically, it removes the fs=sample_rate/1.0 part; shouldn't that be the inverse, i.e. 1.0/sample_rate?
The plot then basically tells you how strong which frequency (relative to the sample frequency) was. Looking at your image, at f=0 you have your signal offset or mean value, which is around 0.12. For the rest of it, there's not much going on, no peaks whatsoever that indicate a certain frequency being overly present in the measurement data.
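If you would rather have the frequency axis in Hz, scale the normalised axis by your actual sampling frequency. A minimal sketch (added at the end of fftfunction, reusing freq and amp_spec; the 5 Hz value is only a placeholder for whatever rate the MPU6050 data was logged at):
fs_hz = 5.0                                     # placeholder: your real sampling rate in Hz
plt.figure()
plt.plot(freq * fs_hz, amp_spec, "o-", markerfacecolor="none")
plt.xlabel("Frequency [Hz]")
plt.ylabel("Amplitude spectrum")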

Using python and networkx to find the probability density function

I'm struggling to draw a power law graph for Facebook data that I found online. I'm using NetworkX, and I've found how to draw a degree histogram and a degree rank plot. The problem I'm having is that I want the y axis to be a probability, so I'm assuming I need to sum up each y value and divide by the total number of nodes? Can anyone please help me do this? Once I've got this, I'd like to draw a log-log graph to see if I can obtain a straight line. I'd really appreciate it if anyone could help! Here's my code:
import collections
import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms import community
import math
import pylab as plt
g = nx.read_edgelist("/Users/Michael/Desktop/anaconda3/facebook_combined.txt","r")
nx.info(g)
degree_sequence = sorted([d for n, d in g.degree()], reverse=True)
degreeCount = collections.Counter(degree_sequence)
deg, cnt = zip(*degreeCount.items())
fig, ax = plt.subplots()
plt.bar(deg, cnt, width=0.80, color='b')
plt.title("Degree Histogram for Facebook Data")
plt.ylabel("Count")
plt.xlabel("Degree")
ax.set_xticks([d + 0.4 for d in deg])
ax.set_xticklabels(deg)
plt.show()
plt.loglog(degree_sequence, 'b-', marker='o')
plt.title("Degree rank plot")
plt.ylabel("Degree")
plt.xlabel("Rank")
plt.show()
You seem to be on the right track, but some simplifications will likely help you. The code below uses only 2 libraries.
Without access to your graph, we can use some graph generators instead. I've chosen 2 qualitatively different types here, and deliberately chosen different sizes so that the normalization of the histogram is needed.
import networkx as nx
import matplotlib.pyplot as plt
g1 = nx.scale_free_graph(1000, )
g2 = nx.watts_strogatz_graph(2000, 6, p=0.8)
# we don't need to sort the values since the histogram will handle it for us
deg_g1 = [d for _, d in g1.degree()]
deg_g2 = [d for _, d in g2.degree()]
# there are smarter ways to choose bin locations, but since
# degrees must be discrete, we can be lazy...
max_degree = max(deg_g1 + deg_g2)
# plot different styles to see both
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(deg_g1, bins=range(0, max_degree), density=True, histtype='bar', rwidth=0.8, label='scale-free')
ax.hist(deg_g2, bins=range(0, max_degree), density=True, histtype='step', lw=3, label='Watts-Strogatz')
# setup the axes to be log/log scaled
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('degree')
ax.set_ylabel('relative density')
ax.legend()
plt.show()
This produces an output plot like the following (both g1 and g2 are randomised, so the results won't be identical):
Here we can see that g1 has an approximately straight line decay in the degree distribution -- as expected for scale-free distributions on log-log axes. Conversely, g2 does not have a scale-free degree distribution.
To say anything more formal, you could look at the toolboxes from Aaron Clauset: http://tuvalu.santafe.edu/~aaronc/powerlaws/ which implement model fitting and statistical testing of power-law distributions.
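For instance, the third-party powerlaw package (which implements the Clauset et al. methodology) could be used roughly like this; a sketch, assuming the deg_g1 list from above and that the package is installed:
import powerlaw
# Fit a discrete power law to the degree sequence and report the fitted exponent and cutoff
fit = powerlaw.Fit(deg_g1, discrete=True)
print(fit.power_law.alpha, fit.power_law.xmin)
# Compare the power-law fit against a log-normal alternative
R, p = fit.distribution_compare('power_law', 'lognormal')
print(R, p)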

Measure the uniformity of distribution of points in a 2D square

I am currently running into this problem: I have a 2D square and a set of points inside it, say 1000 points. I need a way to see whether the points are spread out over the square (more or less uniformly distributed) or whether they tend to gather together in some spots.
I need a mathematical/statistical (not programming) way to determine this. I googled and found things like goodness of fit and the Kolmogorov-Smirnov test, and I wonder if there are other approaches to achieve this. I need this for a class paper.
So: Inputs: a 2D square, and 1000 points.
Output: yes/no (yes = evenly spread out, no = gathering together in some spots).
Any idea would be appreciated.
Thanks
If your points are independent you can just check the distribution for each dimension individually. The Kolmogorov-Smirnov test (a measure of the distance between 2 distributions) is a good test for this. First let's generate and plot some Gaussian-distributed points so you can see how you can use the KS test (statistic) to detect a nonuniform distribution.
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> X = np.random.randn(1000, 2)  # 1000 2-D points, normally distributed
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> X = scaler.fit_transform(X) # fit to default uniform dist range 0-1
>>> X
array([[ 0.46169481, 0.7444449 ],
[ 0.49408692, 0.5809512 ],
...,
[ 0.60877526, 0.59758908]])
>>> plt.scatter(X[:, 0], X[:, 1])
>>> from scipy import stats
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>>> stats.kstest(MinMaxScaler().fit_transform(X[:, [0]]).ravel(), 'uniform')
KstestResult(statistic=0.24738043186386116, pvalue=0.0)
The low p-value and high KS statistic (distance from the uniform distribution) say these samples almost certainly did not come from a uniform distribution between 0 and 1.
>>> stats.kstest(StandardScaler().fit_transform(X[:, [0]]).ravel(), 'norm')
KstestResult(statistic=0.028970945967462303, pvalue=0.36613946547024456)
But they probably did come from a normal distribution with mean 0 and standard deviation 1, given the high p-value and low KS distance.
Then you'd just repeat the KS tests for the second dimension (Y).
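A minimal sketch of that second test, mirroring the calls above (output omitted, since it depends on the data):
>>> stats.kstest(MinMaxScaler().fit_transform(X[:, [1]]).ravel(), 'uniform')
>>> stats.kstest(StandardScaler().fit_transform(X[:, [1]]).ravel(), 'norm')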
