Unusual NaNs Returned by scipy LinearNDInterpolator - python-3.x

I'm trying to interpolate the following data with Python (3.8.1) using the aforementioned scipy function (official documentation here; source code here). The official documentation is incredibly sparse, so I'm hoping someone with experience using this function can identify the source of this issue. Specifically, I run the following four lines of code:
predictor = [[-1.7134013337139833, 0.9582376963057636, -0.21528572746395735], [3.25933089248862, -0.7087236333980123, 0.012808817274351122], [-0.5596739049487544, -1.8723369742231246, 0.03114189522349198], [0.23080764211370225, 1.0639221305852422, -0.602148693975945], [-0.9879484423429669, -0.16678510825693527, 0.5570132252912631], [0.0029439785978213986, -0.10016927713200409, -0.18197412051828055], [0.3530872261969887, 0.6347161018351574, 0.7285361235605389], [-1.122894723267098, 0.22837861478723648, -0.9022469946784363], [-0.02862856314533664, 0.014623415207400122, 3.078346263312741], [-1.3367570531570616, -0.3218239542354167, 0.489878302042675]]
response = [0.020235605909933625, 1.4729016163456679e-05, 0.021931080605237303, 0.21271851410989498, 0.26870984350693583, 0.9577608837143238, 0.3470452852299319, 0.11918254249689647, 7.657429164576589e-05, 0.1187813551565562]
from scipy.interpolate import LinearNDInterpolator
away = LinearNDInterpolator(predictor, response)
Now, if I write away.__call__([0,0,0])[0] then Python returns 0.8208492283847619,
which is the desired outcome and a sensible value given the test data. Similarly, away.__call__([0,0,1])[0] returns 0.22018657078617598, which is also a sensible value.
However, away.__call__([0,1,1])[0] returns nan. What changed? Does anyone happen to know?
Thank you.

This occurs when away.__call__(x) is passed a value x that lies outside the convex hull of the predictor points - essentially, when x lies outside the region of interpolation, LinearNDInterpolator returns nan by design.
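A quick way to confirm this, as a sketch reusing the predictor and response lists from the question: scipy.spatial.Delaunay.find_simplex returns -1 for points outside the hull, and LinearNDInterpolator's fill_value parameter lets you substitute a default for the nan.
import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator

pts = np.array(predictor)                          # the ten 3-D points from the question
hull = Delaunay(pts)
print(hull.find_simplex([[0, 0, 0], [0, 1, 1]]))   # [>=0, -1]: the second point is outside the hull

# Optionally substitute a default value instead of nan:
away = LinearNDInterpolator(pts, response, fill_value=0.0)
print(away([0, 1, 1]))                             # array([0.]) instead of array([nan])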

Related

Vectorize nested for loop in python to find curl

I am trying to find the curl of a 3D vector field (with x, y, z components) defined over a 3D grid of size (1200,1200,400). I was able to compute the curl with the finite difference method using nested for loops, but only for a section of the data; the computation time is far too high for the entire set of (1200,1200,400) grid points. So, I tried using the numba package to speed it up, but it didn't work. So, I tried vectorizing the whole thing. The problem is that there is something wrong (a broadcasting error) with the way I am indexing the vector field.
NB: I am relatively new to python
So, here is my approach:
Create three 1D arrays x, y, z to represent the grid axes, which can be used to index the vector field.
Use the arrays as indices of the field. E.g. for vx[x,y,z] I expect to get the values of vx over the entire grid.
To find the curl I need to add and subtract 1 from the indices (when I use the finite difference method). The error I get is
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (1200,) (1200,) (400,)
I tried looking it up and tried changing the shapes to (1200,1) instead of (1200,), but the error remained.
This is the function I defined:
def curl(rx, ry, rz):
    curlvx = (vz[rx, np.add(ry, 1) % 1200, rz] - vz[rx, np.add(ry, -1) % 1200, rz]) / 0.02 - (vy[rx, ry, np.add(rz, 1) % 400] - vy[rx, ry, np.add(rz, -1) % 400]) / 0.02
    curlvy = (vx[rx, ry, np.add(rz, 1) % 400] - vx[rx, ry, np.add(rz, -1) % 400]) / 0.02 - (vz[np.add(rx, 1) % 1200, ry, rz] - vz[np.add(rx, -1) % 1200, ry, rz]) / 0.02
    curlvz = (vy[np.add(rx, 1) % 1200, ry, rz] - vy[np.add(rx, -1) % 1200, ry, rz]) / 0.02 - (vx[rx, np.add(ry, 1) % 1200, rz] - vx[rx, np.add(ry, -1) % 1200, rz]) / 0.02
    return [curlvx, curlvy, curlvz]
Where I call my function like this:
x=np.arange(0,1200)
y=np.arange(0,1200)
z=np.arange(0,400)
curl(x,y,z)
This is the line where I'm getting error.
curlvx = (vz[rx, np.add(ry, 1) % 1200, rz] - vz[rx, np.add(ry, -1) % 1200, rz]) / 0.02 - (vy[rx, ry, np.add(rz, 1) % 400] - vy[rx, ry, np.add(rz, -1) % 400]) / 0.02
It is the part vz[rx, np.add(ry, 1) % 1200, rz] that raises the error.
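For reference, the failure and one possible fix can be reproduced on a small stand-in grid - a sketch, where np.ix_ builds an "open mesh" of index arrays that do broadcast, and np.roll is an alternative for periodic centered differences:
import numpy as np

v = np.random.rand(6, 6, 4)           # small stand-in for one (1200, 1200, 400) component
x, y, z = np.arange(6), np.arange(6), np.arange(4)

# v[x, y, z] raises IndexError: shapes (6,), (6,), (4,) cannot broadcast together.
X, Y, Z = np.ix_(x, y, z)             # shapes (6,1,1), (1,6,1), (1,1,4) broadcast to (6,6,4)
d_dy = (v[X, (Y + 1) % 6, Z] - v[X, (Y - 1) % 6, Z]) / 0.02
print(d_dy.shape)                     # (6, 6, 4)

# Equivalent periodic centered difference without fancy indexing:
d_dy_roll = (np.roll(v, -1, axis=1) - np.roll(v, 1, axis=1)) / 0.02
print(np.allclose(d_dy, d_dy_roll))   # True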

How does the standard normal distribution work in practice in NumPy and PyTorch?

I have two points to ask about:
1)
I would like to understand what exactly is returned by np.random.randn from NumPy and torch.randn from PyTorch. They both return a tensor of random numbers from a normal distribution with mean 0 and std 1, hence a standard normal distribution. However, that is not the same thing as putting x values into the standard normal density function and getting their respective image values y. The values returned by PyTorch and NumPy do not look like that.
For me, it seems that both np.random.randn and torch.randn return the x values themselves, not the image values y as I calculated below. Is that correct?
normal = np.array([(1/np.sqrt(2*np.pi))*np.exp(-(1/2)*(i**2)) for i in range(-38,39)])
Printing the normal variable shows me something like this.
array([1.10e-314, 2.12e-298, 1.51e-282, 3.94e-267, 3.79e-252, 1.34e-237,
1.75e-223, 8.36e-210, 1.47e-196, 9.55e-184, 2.28e-171, 2.00e-159,
6.45e-148, 7.65e-137, 3.34e-126, 5.37e-116, 3.17e-106, 6.90e-097,
5.52e-088, 1.62e-079, 1.76e-071, 7.00e-064, 1.03e-056, 5.53e-050,
1.10e-043, 8.00e-038, 2.15e-032, 2.12e-027, 7.69e-023, 1.03e-018,
5.05e-015, 9.13e-012, 6.08e-009, 1.49e-006, 1.34e-004, 4.43e-003,
5.40e-002, 2.42e-001, 3.99e-001, 2.42e-001, 5.40e-002, 4.43e-003,
1.34e-004, 1.49e-006, 6.08e-009, 9.13e-012, 5.05e-015, 1.03e-018,
7.69e-023, 2.12e-027, 2.15e-032, 8.00e-038, 1.10e-043, 5.53e-050,
1.03e-056, 7.00e-064, 1.76e-071, 1.62e-079, 5.52e-088, 6.90e-097,
3.17e-106, 5.37e-116, 3.34e-126, 7.65e-137, 6.45e-148, 2.00e-159,
2.28e-171, 9.55e-184, 1.47e-196, 8.36e-210, 1.75e-223, 1.34e-237,
3.79e-252, 3.94e-267, 1.51e-282, 2.12e-298, 1.10e-314])
2) Also, if I ask these libraries for a matrix of values from a standard normal distribution, does it mean that all rows and columns are drawn from the same standard distribution? If I want i.i.d. samples in every row, would I need to call np.random.randn in a for loop for each row and then vstack the results?
1) Yes, they give you x and not phi(x), since phi(x) is the probability density of sampling a value x. If you want to know the probability of getting values in an interval [a,b], you need to integrate phi(x) between a and b. Intuitively, if you look at the function phi(x) you'll see that you're more likely to get values near zero than, say, values near 1.
An easy way to see this is to look at a histogram of the sampled values.
import numpy as np
import matplotlib.pyplot as plt
samples = np.random.normal(size=[1000])
plt.hist(samples)
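If you normalize the histogram, it lines up with phi(x) - a quick sketch:
import numpy as np
import matplotlib.pyplot as plt

samples = np.random.normal(size=10000)
xs = np.linspace(-4, 4, 200)
phi = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)   # the standard normal density

plt.hist(samples, bins=50, density=True)        # density=True scales bar areas to match the pdf
plt.plot(xs, phi)
plt.show()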
2) They're i.i.d. Just use a 2D size, like so:
samples = np.random.normal(size=[10, 10])

How to extract the hidden layer features in H2ODeepLearningEstimator?

I found H2O has the function h2o.deepfeatures in R to pull the hidden layer features
https://www.rdocumentation.org/packages/h2o/versions/3.20.0.8/topics/h2o.deepfeatures
train_features <- h2o.deepfeatures(model_nn, train, layer=3)
But I couldn't find any example for Python. Can anyone provide some sample code?
Most Python/R API functions are wrappers around REST calls. See http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/model_base.html#ModelBase.deepfeatures
So, to convert an R example to a Python one, move the model to be the object the method is called on, and shuffle all the other arguments along. I.e. the example from the manual becomes (with dots in variable names changed to underscores):
prostate_hex = ...
prostate_dl = ...
prostate_deepfeatures_layer1 = prostate_dl.deepfeatures(prostate_hex, 1)
prostate_deepfeatures_layer2 = prostate_dl.deepfeatures(prostate_hex, 2)
Sometimes the function name will change slightly (e.g. h2o.importFile() vs. h2o.import_file()), so you need to hunt for it at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html
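To make that concrete, here is a hypothetical end-to-end sketch - the file path and column names are placeholders, and note that the layer index may be 0-based in Python where R's is 1-based, so double-check against the docs:
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("prostate.csv")          # placeholder path
train["CAPSULE"] = train["CAPSULE"].asfactor()   # placeholder response column

model = H2ODeepLearningEstimator(hidden=[200, 200, 200], epochs=10)
model.train(x=train.columns[2:], y="CAPSULE", training_frame=train)

# Activations of one hidden layer, returned as an H2OFrame:
layer_features = model.deepfeatures(train, 1)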

Pandas .rolling.corr using date/time offset

I am having a bit of an issue with pandas's rolling function and I'm not quite sure where I'm going wrong. If I mock up two test series of numbers:
import numpy as np
import pandas as pd

df_index = pd.date_range(start='1990-01-01', end='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))
Then it's easy to have a look at their rolling annual correlation:
test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()
which produces a sensible-looking rolling correlation plot.
All good so far. If I then try to do the same thing using a datetime offset:
test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()
I get a wildly different (and obviously wrong) result.
Is there something wrong with pandas or is there something wrong with me?
Thanks in advance for any light you can shed on this troubling conundrum.
It's very tricky; the behavior of window as an int and as an offset is different:
New in version 0.19.0 is the ability to pass an offset (or convertible) to a .rolling() method and have it produce variable-sized windows based on the passed time window. For each time point, this includes all preceding values occurring within the indicated time delta.
This can be particularly useful for a non-regular time frequency index.
You should check out the docs on Time-aware Rolling.
r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)
r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()
This code produces similar-shaped plots for r2.corr().plot() and r3.corr().plot(), but note that the calculated values still differ: r2.corr(test_df['Series2']) and r3.corr(test_df['Series2']) are not equal element-wise.
I think for a regular time-frequency index, you should just stick with r1.
This is mainly because the results of the two rollings, 365 and '365D', are different.
For example
sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]:
1990-01-01 NaN
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
sub['Series2'].rolling('2D').sum()
Out[16]:
1990-01-01 -0.043692
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
Since there are a lot of NaNs in the rolling-365 result, the correlations of the two series computed in the two ways are quite different.
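If you do want the offset window to behave like the int window on this regular daily index, you can raise min_periods explicitly - a sketch; values can still differ slightly because '365D' is a time span rather than an observation count:
test_df['Series1'].rolling('365D', min_periods=365).corr(test_df['Series2']).plot()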

(python) solving for roots of equation that includes probabilistic distribution

I am quite new to this, but I need a solution to a mathematical problem that involves finding the roots of a function which contains (several) cumulative distribution functions.
For simplicity I tried to code a similar procedure with as simple a function as possible, but even that doesn't work.
Would anyone please tell me what I am doing wrong?
from scipy.optimize import fsolve
import sympy as sy
import numpy as np
from scipy.stats import norm
y=sy.Symbol('y')
def cdf(x):
    cdf_norm = norm.cdf(x, 100, 20)
    return cdf_norm
result=fsolve(y**2-14*y+7-cdf(y))
print(result)
The problem seems to be that fsolve requires a function as its first argument. You passed it an expression, which Python tries to evaluate immediately, before fsolve is ever called; since y is a SymPy symbol, cdf(y) fails rather than producing a numeric value. fsolve also requires one more argument: an ndarray of initial guesses for the roots. So, one easy solution is to define another function:
def f(y):
    return y**2 - 14*y + 7 - cdf(y)

result = fsolve(f, np.array([1, 0]))
print(result)
I get the following result:
array([ 0.51925928, 0.51925928])
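As an aside, the quadratic part y**2 - 14*y + 7 also has a second root near 7 + sqrt(42), about 13.48 (the cdf term is tiny this far into the left tail of a Normal(100, 20)); starting fsolve from guesses on either side of the vertex should recover both roots - a sketch:
result = fsolve(f, np.array([0.0, 14.0]))
print(result)   # approximately [ 0.519, 13.481]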
