Discrepancy in histograms generated by plt.hist() [duplicate]

This question already has answers here: numpy arange: how to make "precise" array of floats? (4 answers)
I need help figuring out why there is a discrepancy between histograms A and B generated in the code below. I'm a physicist, and some colleagues and I noticed this while plotting the same data in Python, IDL and Matlab. Python and IDL have the same problem; Matlab does not. Matlab always reproduces histogram B.
import numpy as np
import matplotlib.pyplot as plt
t = np.random.randint(-1000,1000,10**3)
# A
tA = t/1000
binsizeA = 0.05
xminA = -1
xmaxA = 1
binsA = np.arange(xminA, xmaxA+binsizeA, binsizeA)
hA, _ , _ = plt.hist(tA, bins=binsA, histtype="step", label="A")
# B
tB = t
binsizeB = 50
xminB = -1000
xmaxB = 1000
binsB = np.arange(xminB, xmaxB+binsizeB, binsizeB)
hB, _ , _ = plt.hist(tB/1000, bins=binsB/1000, histtype="step", label="B")
plt.legend()
plt.show()
print(hA==hB)
Plot showing the histograms
The original data are time-tagged measurements with microsecond precision, saved as integers. The problem seems to arise when the arrays are divided by 1000 (from microseconds to milliseconds). Is there a way to avoid this?

I start by "recreating" scenario A, but directly by scaling everything (data + bins) from B:
C - binsB / 1000
# C
tC = tB / 1000
xminC = xminB / 1000
xmaxC = xmaxB / 1000
binsC = binsB / 1000
hC, _ , _ = plt.hist(tC, bins=binsC, histtype="step", label="C")
assert((hB == hC).all())
This produces the same histogram as hB, so the problem is in the way binsA is made:
binsA = np.arange(xminA, xmaxA+binsizeA, binsizeA)
From its docstring:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use numpy.linspace for these cases.
So either go route C or use linspace to create the non-integer bins with smaller rounding error.
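To make that concrete, here is a small sketch comparing the three ways of building the same 41 bin edges (the exact lengths can vary, which is precisely arange's problem):
import numpy as np

binsize = 0.05
# arange accumulates the float step, so the edges pick up rounding error
# and the number of edges can even be off by one:
bins_arange = np.arange(-1, 1 + binsize, binsize)
# linspace fixes both endpoints and the count, which keeps the edges cleaner:
bins_linspace = np.linspace(-1, 1, 41)
# integer edges scaled once afterwards are the most robust (route C):
bins_scaled = np.arange(-1000, 1000 + 50, 50) / 1000

print(len(bins_arange), len(bins_linspace), len(bins_scaled))
print(np.allclose(bins_linspace, bins_scaled))  # close, but not guaranteed bit-identical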
D - np.linspace
Interestingly, using linspace does not yield bins that are floating-point-equal to binsB / 1000:
# D
tD = t / 1000
bincountD = 41
xminD = -1
xmaxD = 1
binsD = np.linspace(xminD, xmaxD, bincountD)
hC, _ , _ = plt.hist(tC, bins=binsC, histtype="step", label="C")
hD, _ , _ = plt.hist(tD, bins=binsD, histtype="step", label="D")
plt.legend()
plt.show()
By inspection, binsC and binsD look equal, but they still differ in their least significant digits. I can "clamp" them to yield the same histogram via binsX.round(2).
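A minimal sketch of that clamping, reusing tC/binsC and tD/binsD from above:
# rounding both edge arrays to 2 decimals makes them value-identical,
# so the two histograms agree exactly:
hC2, _, _ = plt.hist(tC, bins=binsC.round(2), histtype="step")
hD2, _, _ = plt.hist(tD, bins=binsD.round(2), histtype="step")
assert (hC2 == hD2).all()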
All in all, this serves as a reminder of how tricky it is to achieve "exact" results. Note that the effect is amplified here because all your samples were integers to begin with, so many of them fall exactly on bin edges; if your data were floating point as well, bins and samples would not be value-identical in the first place.

What's a potentially better algorithm to solve this Python nested for loop than the one I'm using?

I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, where each row has an X,Y location in 2D space. A window of length 10 goes through all 1M data rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D (X-Y) plane.
For each window of 10 points/rows and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the value is less than R we add 2 to H. Then we calculate H/(N**5) and store it in c_10 at the index corresponding to that diameter of investigation.
For the first 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points therefore have no value calculated for them, and are given np.inf to be distinguished later.
Then the window moves one point down the rows and this process repeats until S_wind is complete.
What's a potentially better algorithm for this than the one I'm using, in Python 3.x?
Many thanks in advance!
import numpy as np
import pandas as pd

#### generating input data frame
df = pd.DataFrame(data=np.random.randint(2000, 6000, (1000000, 2)))
df.columns = ['X', 'Y']

#### creating upper and lower bounds for the diameter of the investigation circles
x_range = max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range, y_range) / 20
d = 2
N = 10  #### number of points in each window
# r1 = 2*R*(1/N)**(1/d)
# r2 = R/(1+d)
# r_test = np.arange(r1, r2, 0.05)
## === avoiding generation of an empty r_test
r1 = 80
r2 = 800
r_test = np.arange(r1, r2, 5)

S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])):  #### maybe the code runs slower because of using len() instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10  ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii - 10, ii):
            for j in range(ii - 10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = H / (N**2)
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops compare ten 2D points against each other in pairs. That's just dying for a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted it into something less immediately useful for numeric operations!
Note that this code matches the output of your loop code. I'm not sure it's actually doing what you want, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to re-initializing H to zero only in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or, equivalently, move the H = 0 line between the for ind and for i loops in your original code).
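As an aside, a sketch of a more idiomatic way to get both columns into one numpy array (assuming a pandas version with to_numpy), done once outside the loop:
# one (N, 2) float array for all rows, extracted once:
xy = df[['X', 'Y']].to_numpy(dtype=float)

# inside the loop, each window is then just a slice,
# replacing the hstack of the two column views:
points = xy[ii - 10:ii]  # shape (10, 2)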

Scipy Optimize Basin Hopping fails

I am working on a cost-minimizing function to help with allocation/weights in a portfolio of stocks. I have the following code for the "objective function". It works fine when I try it with 15 variables (stocks); however, when I tried it with 55 stocks it failed. The num_assets variable below is the number of stocks in the portfolio.
def get_metrics(weights):
    weights = np.array(weights)
    returnsR = np.dot(returns_annualR, weights)
    volatilityR = np.sqrt(np.dot(weights.T, np.dot(cov_matrixR, weights)))
    sharpeR = returnsR / volatilityR
    drawdownR = np.multiply(weights, dailyDD).sum(axis=1, skipna=True).min()
    drawdownR = f(drawdownR)
    calmarR = returnsR / drawdownR
    results = (sharpeR * 0.3) + (calmarR * 0.7)
    return np.array([returnsR, volatilityR, sharpeR, drawdownR, calmarR, results])

def objective(weights):
    # the number 5 is the index of `results` in the get_metrics array
    return get_metrics(weights)[5] * -1

def check_sum(weights):
    # returns 0 if the sum of the weights is 1
    return np.sum(weights) - 1

bound = (0.0, 1.0)
bnds = tuple(bound for x in range(num_assets))
bx = list(bnds)
""" Custom step-function """
class RandomDisplacementBounds(object):
"""random displacement with bounds: see: https://stackoverflow.com/a/21967888/2320035
Modified! (dropped acceptance-rejection sampling for a more specialized approach)
"""
def __init__(self, xmin, xmax, stepsize=0.5):
self.xmin = xmin
self.xmax = xmax
self.stepsize = stepsize
def __call__(self, x):
"""take a random step but ensure the new position is within the bounds """
min_step = np.maximum(self.xmin - x, -self.stepsize)
max_step = np.minimum(self.xmax - x, self.stepsize)
random_step = np.random.uniform(low=min_step, high=max_step, size=x.shape)
xnew = x + random_step
return xnew
bounded_step = RandomDisplacementBounds(np.array([b[0] for b in bx]), np.array([b[1] for b in bx]))
minimizer_kwargs = {"method":"L-BFGS-B", "bounds": bnds}
globmin = sco.basinhopping(objective,
x0=num_assets*[1./num_assets],
minimizer_kwargs=minimizer_kwargs,
take_step=bounded_step,
disp=True)
The output should be an array of numbers that add up to 1 or 100%. However, this is not happening.
This function is a failure on my end as well. It failed to choose values which were lower; i.e., regardless of the output of the optimization function (negative or positive), it persisted until the parameter I was optimizing was as bad as it could possibly be. I suspect that since the function violates function encapsulation and relies on "function attributes" to adjust the stepsize, the developer may not have respected encapsulated function scope elsewhere, and surprising behavior is happening as a result.
Regardless, in terms of theory, anything else is just a (dubious) "performance gain" based on an estimated numerical partial second derivative (a numerical Hessian, or "estimated curvature" for us mere mortals), which reduces to a randomly-biased annealer in discrete, chaotic (continuous phase-space) or mixed (continuous and discrete) search spaces with volatile curvatures or planar areas (due to numerical underflow and loss of precision).
Anyway, use scipy.optimize.dual_annealing instead (the function is named dual_annealing, not dual_anneal).
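A minimal sketch of swapping it in, reusing the objective and bnds defined above:
import scipy.optimize as sco

# dual_annealing needs only the objective and per-variable bounds;
# bnds is the tuple of (0.0, 1.0) pairs from the question
result = sco.dual_annealing(objective, bounds=bnds)
print(result.x)    # candidate weights
print(result.fun)  # objective value at those weights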

python scipy fmin not completing successfully

I have a function that I am attempting to minimize for multiple values. For some values it terminates successfully; for others it gives the error:
Warning: Maximum number of function evaluations has been exceeded.
I am unsure of the role of maxiter and maxfun and how to increase or decrease these in order to successfully reach the minimum. My understanding is that these values are optional, so I am also unsure of what the defaults are.
import numpy as np
import scipy.optimize

# x_vals and raw_data are assumed to be defined elsewhere in the script

# create starting parameters, parameters equal to sin(x)
a = 1
k = 0
h = 0
wave_params = [a, k, h]

def wave_func(func_params):
    """This function calculates the difference between a sine wave (sin(x)) and raw_data (a different sine wave).
    It is the function that will be minimized by modulating the a, b, k, and h parameters in order to minimize
    the difference between the curves."""
    a = func_params[0]
    b = 1
    k = func_params[1]
    h = func_params[2]
    y_wave = a * np.sin((x_vals - h) / b) + k
    error = np.sum((y_wave - raw_data) * (y_wave - raw_data))
    return error

wave_optimized = scipy.optimize.fmin(wave_func, wave_params)
You can try using scipy.optimize.minimize with method='Nelder-Mead': https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
Then you can just do
minimum = scipy.optimize.minimize(wave_func, wave_params, method='Nelder-Mead')
n_function_evaluations = minimum.nfev
n_iterations = minimum.nit
or you can customize the search algorithm like this:
minimum = scipy.optimize.minimize(
    wave_func, wave_params, method='Nelder-Mead',
    options={'maxiter': 10000, 'maxfev': 8000}
)
I don't know anything about fmin, but my guess is that it behaves extremely similarly.
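For what it's worth, fmin itself also accepts maxiter and maxfun keyword arguments, so a sketch of simply raising the budget on the original call would be:
# same objective as before, just with a larger evaluation budget
wave_optimized = scipy.optimize.fmin(wave_func, wave_params,
                                     maxiter=10000, maxfun=10000)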

pairwise subtraction of rows of an array and division in for loop

I have three arrays: flow_stress (4x3), strain_rate (1x4) and T (1x3). I have interpolated the log of flow_stress with respect to Temp = 1000/(T+273), linearly spaced into 50 terms, and with respect to srate (the log of the strain rate, linearly spaced into 100 terms), so that flow_stress3 is a (100x50) array.
I am trying to create an array m equal to the difference of consecutive rows of flow_stress3 divided by the difference of consecutive terms of srate.
Though the arrays flow_stress3 and srate have correct values, the values of m are wrong.
import numpy as np
import math
from scipy import interpolate as sp
import matplotlib as plt

T = np.array([750, 800, 850])  ## temperature values in degrees
strain_rate = np.array([0.0003, 0.001, 0.01, 0.1])
flow_stress = np.array([[95.96, 49.46, 28.16],
                        [126.62, 80.51, 46.45],
                        [235.14, 151.46, 107.94],
                        [319.15, 228.77, 165.63]])
Temp = 1000 / (273 + T)
k = (max(T) - min(T)) / 2
TT = np.linspace(max(Temp), min(Temp), int(k))  ## divide the temp range into k parts (int() added: linspace needs an integer count)
S = np.log10(flow_stress)
flow_stress1 = np.empty(shape=[len(strain_rate), len(TT)])  ## empty array of shape (len(strain_rate), len(TT))
SR = np.log10(strain_rate)
n = (max(SR) - min(SR)) / 0.025  ## divide the SR range by 0.025 to get the number of terms
l = n // 1  ## operator // truncates the fraction
srate = np.linspace(min(SR), max(SR), int(l))  ## divides SR into l equal parts
len_srate = len(srate)

## first interpolate between temp and log flow stress
for i in range(len(strain_rate)):
    f_linear = sp.interp1d(Temp, S[i, :])
    flow_stress1[i, :] = f_linear(TT)  ## interpolate at the values given by TT

flow_stress2 = np.empty(shape=[len(TT), len(srate)])
for i in range(len(TT)):
    f_linear = sp.interp1d(SR, flow_stress1[:, i])
    flow_stress2[i, :] = f_linear(srate)

flow_stress3 = flow_stress2.T
print(len(flow_stress3))
print(len(flow_stress3[0, :]))
print(len(srate))
print(len(TT))
srate = srate.T
m = np.zeros(shape=[len(srate), len(TT)], dtype=np.ndarray)
for i in range(len(srate) - 1):
    m[i, :] = np.array((flow_stress3[i + 1, :] - flow_stress3[i, :]) / (srate[i + 1] - srate[i]))
m[len(srate) - 1, :] = m[len(srate) - 2, :]
I get a contour plot of m with respect to srate and T as shown in fig 1. A plot of the same data done in Matlab is shown in fig 2. We know for sure that the Matlab data is correct. With Python, as can be seen, the values of many rows and columns are identical, which should not be the case.
fig1
fig2
If I understand correctly, your requirement is simply to calculate the first derivative of flow_stress3 with respect to srate. The code seems rather complex for that. In particular, I don't understand what purpose the last line serves.
Since you're already using scipy, I would suggest the UnivariateSpline function. One wrinkle: UnivariateSpline expects a 1-D y, so fit one spline per temperature column and evaluate its derivative at the sample points. Your code will shrink to something like:
import numpy as np
from scipy.interpolate import UnivariateSpline

# one spline per temperature column; derivative() returns a new spline,
# which is then evaluated at the srate sample points
m = np.column_stack([UnivariateSpline(srate, flow_stress3[:, j]).derivative()(srate)
                     for j in range(flow_stress3.shape[1])])
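For comparison, a sketch of the same finite-difference derivative done with plain numpy (np.gradient is my suggestion here, not part of the original answer); it is closest in spirit to the loop in the question:
import numpy as np

# central differences along the srate axis, one-sided at the two ends;
# no duplicated last row as in the question's loop
m = np.gradient(flow_stress3, srate, axis=0)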

Python Index and Bounds error using data set

Our class is using Python as a solution tool for models. However, this is my first time with Python, or any programming language since VB in 1997, so I'm struggling. We have the following code provided to us:
from numpy import loadtxt, array, ones, column_stack
from numpy import dot, sqrt
from scipy.linalg import inv
from scipy.stats import norm, t
f = loadtxt('text data.raw')
y = f[:,4]
n = y.size
x = array([f[:,2],f[:,8],f[:,4]])
one = ones(n)
#xa = column_stack([one,f[:,3],f[:,4]])
xa = column_stack([one,x.T])
k = xa.shape[1]
xx = dot(xa.T,xa)
invx = inv(xx)
xy = dot(xa.T,y)
b = dot(invx,xy)
# Compute cov(b)
e = y - dot(xa,b)
s2 = dot(e.T,e)/(n-k)
covb = invx*s2
# Compute t-stat
tstat = b[1]/sqrt(covb[1][1])
#compute p-value
p = 1 - norm.cdf(tstat,0,1)
pt = 1 - t.cdf(tstat,88)
Our data set is a 10x88 matrix. Our goal is to create a linear program and find a few answers. In our data, column 1 is already set to price, which in our linear program is our desired output, and I need to use columns 3, 4, and 5 as my x1, x2, and x3. I'm not sure how, or to what, the values on lines 9 and 11 need to be changed in order to accomplish that task, nor do I currently understand what those two lines are specifically calling for or doing in the program. Again, I'm not familiar with programming.
Everything I try generally yields an error similar to
IndexError: index 5 is out of bounds for axis 1 with size 5
Any suggestions?
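A hedged sketch (my assumption, not from the thread): if the column descriptions are 1-based, price in column 1 maps to 0-based numpy index 0, and columns 3-5 map to indices 2-4. Checking f.shape first shows how many columns were actually loaded:
print(f.shape)  # verify the number of rows and columns actually loaded
# hypothetical re-indexing, assuming 1-based column descriptions:
y = f[:, 0]                              # column 1: price, the dependent variable
x = array([f[:, 2], f[:, 3], f[:, 4]])   # columns 3, 4, 5 as x1, x2, x3
xa = column_stack([one, x.T])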
