How to implement frame_equal_missing with tolerance (like pandas)? - rust

assert!(expected.frame_equal_missing(&res1)); fails when the result is computed with pct_change,
because of float rounding differences: (a-b)/b can differ from a/b - 1 in the last bits.
How can I call frame_equal_missing with a tolerance, as pandas allows?
let df = df! [
    "A" => [1., 2., 3.],
    "B" => [4., 5., 6.],
]?;
let expected = df! [
    "A" => [None, Some((2./1.) - 1.), Some((3./2.) - 1.)],
    "B" => [None, Some((5./4.) - 1.), Some((6./5.) - 1.)],
]?;
let res1 = &df
    .clone()
    .lazy()
    .with_column(dtype_cols([DataType::Float64]).pct_change(1))
    .collect()?;
let res2 = &df
    .clone()
    .lazy()
    .select([col("*") / col("*").shift(1) - lit(1.0)])
    .collect()?;
println!("{:?}", expected);
println!("{:?}", res1);
println!("{:?}", res2);
// This will pass
assert!(expected.frame_equal_missing(&res2));
// This will fail
assert!(expected.frame_equal_missing(&res1));
Ok(())
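The idea behind a tolerance-based comparison can be sketched independently of the Polars API. The snippet below is in Python rather than Rust and uses only the standard library; columns_close is a hypothetical helper (not a Polars or pandas function) and the tolerance defaults are illustrative assumptions. It treats two missing values as equal and compares everything else with math.isclose.

```python
import math

def columns_close(left, right, rel_tol=1e-9, abs_tol=1e-12):
    """Hypothetical helper: compare two columns of optional floats.

    None stands for a missing value; two missing values count as equal,
    mirroring the "equal_missing" behaviour described in the question.
    """
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if a is None or b is None:
            if a is not b:  # exactly one side is missing
                return False
        elif not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True

# The two formulas from the question, applied to column "B".
values = [4.0, 5.0, 6.0]
ratio_form = [None] + [values[i] / values[i - 1] - 1.0 for i in (1, 2)]
diff_form = [None] + [(values[i] - values[i - 1]) / values[i - 1] for i in (1, 2)]

exact_match = ratio_form == diff_form                # strict equality
close_match = columns_close(ratio_form, diff_form)   # equality within tolerance
```

With these inputs the strict comparison fails on the last element while the tolerant one passes, which is the behaviour the question asks for.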

Related

How can I replace part of a matrix with a new matrix in torch

Consider the following matrices:
>>> a = torch.Tensor([[1,2,3],[4,5,6], [7,8,9]])
>>> a
tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])
>>> b = torch.tensor([[1,1],[1,1]])
>>> b
tensor([[1, 1],
        [1, 1]])
I want to replace the 4 elements of a at the row indices X = [0,2] and column indices Y = [0,2] with b, to get:
>>> a
tensor([[1., 2., 1.],
        [4., 5., 6.],
        [1., 8., 1.]])
I am looking for an operation like scatter or index_put_ that updates the matrix in a few commands (no loops).
If X and Y are tensors holding the row and column indices, the following works:
a[X.reshape(-1,1), Y] = b
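The reshape matters because the two index arrays are broadcast against each other: a column of row indices combined with a row of column indices addresses the full 2x2 grid of positions. The sketch below uses NumPy as a stand-in for torch (an assumption: it illustrates the advanced-indexing broadcast rule that the answer relies on), and b is created as float so that its dtype matches a.

```python
import numpy as np

a = np.arange(1.0, 10.0).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]] as floats
b = np.ones((2, 2))                      # replacement block, same dtype as a
X = np.array([0, 2])                     # row indices
Y = np.array([0, 2])                     # column indices

# X as a (2, 1) column broadcasts against Y of shape (2,) to a (2, 2) grid:
# positions (0,0), (0,2), (2,0), (2,2).
a[X.reshape(-1, 1), Y] = b

# Without the reshape, X and Y are paired element-wise: only (0,0) and (2,2).
paired = np.arange(1.0, 10.0).reshape(3, 3)
paired[X, Y] = 1.0
```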

torch suppress to kth largest values

I have the following function, which works except for half-precision values (kthvalue raises a NotImplemented error).
def suppress_small_probabilities(probabilities: torch.FloatTensor, k: int) -> torch.FloatTensor:
    kth_largest, _ = (-probabilities).kthvalue(k, dim=-1, keepdim=True)
    return probabilities * (probabilities >= -kth_largest)
How would you do the equivalent without using kthvalue? I'm guessing topk has something to do with it, but I want to suppress the smaller values. probabilities is of size batch_size x 1000.
Implement your own topk, e.g.
def mytopk(xs: Tensor, k: int) -> Tensor:
    mask = torch.zeros_like(xs)
    batch_idx = torch.arange(0, len(xs))
    for _ in range(k):
        # entries already selected are replaced by -1e4 so the next max skips them
        _, index = torch.where(mask == 0, xs, -1e4).max(-1)
        mask[(batch_idx, index)] = 1
    return mask
This returns a mask tensor of 0s and 1s (not a boolean tensor) in which the row-wise top-k elements have value 1 and the rest 0.
Then use the mask to select from your original tensor, e.g.
xs = torch.rand(3, 5, dtype=torch.float16)
# tensor([[0.0626, 0.9620, 0.5596, 0.4423, 0.1932],
# [0.5289, 0.0857, 0.7802, 0.7730, 0.4807],
# [0.8272, 0.5016, 0.1169, 0.4372, 0.1843]], dtype=torch.float16)
mask = mytopk(xs, 2)
# tensor([[0., 1., 1., 0., 0.],
# [0., 0., 1., 1., 0.],
# [1., 1., 0., 0., 0.]])
top_only = torch.where(mask == 1, xs, 0)
# tensor([[0.0000, 0.9620, 0.5596, 0.0000, 0.0000],
# [0.0000, 0.0000, 0.7802, 0.7730, 0.0000],
# [0.8271, 0.5016, 0.0000, 0.0000, 0.0000]], dtype=torch.float16)

Replace elements of array with their average

Let's say I have a numpy array as such:
a = [0, 1, …, i-1, i, i+1, …, j, j+1, …, n]
and I'd like to replace the i-th, (i+1)-th, …, j-th elements with a single one, their average:
b = [0, 1, …, i-1, average, j+1, …, n]
How would I do that with as compact code as possible?
Slice and concatenate the arrays:
np.concatenate([a[:i], a[i:j].mean().reshape(1,), a[j:]])
Example
a = np.array(list(range(20)))
i = 5
j = 10
np.concatenate([a[:i], a[i:j].mean().reshape(1,), a[j:]])
array([ 0., 1., 2., 3., 4., 7., 10., 11., 12., 13., 14., 15., 16.,
17., 18., 19.])
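Note that the slice a[i:j] stops before index j, so in the example above element 10 survives next to the average. If the j-th element should be part of the average, as the question describes, a minimal sketch could look like this (replace_with_mean is a hypothetical helper name, not a NumPy function):

```python
import numpy as np

def replace_with_mean(a, i, j):
    """Hypothetical helper: replace a[i..j] (both ends inclusive) with its mean."""
    a = np.asarray(a, dtype=float)
    # a[i:j + 1] includes index j; a[j + 1:] resumes right after it.
    return np.concatenate([a[:i], [a[i:j + 1].mean()], a[j + 1:]])

result = replace_with_mean(np.arange(20), 5, 10)
```

Here the six elements 5 through 10 collapse into their mean, so the result has 15 entries.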

How to correctly guess the initial points in LogLog plot linear regression?

I have 5 data sets, drawn as 5 differently colored error bars in the following code (caps are not shown). The errorbar plot uses a logarithmic scale on both axes. Using curve_fit, I am trying to find the best linear regression through these error bars. However, the power-law function I defined does not easily find the best-fit slope for the 5 lines. I expect all 5 colored lines to be straight with negative slopes. I had a hard time figuring out which starting point p0 to specify for the fit. Even with my hard-to-guess initial values, I still don't get straight lines for all sets, and some of them are far off from my points. What is the issue here?
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x_mean = [2.81838293e+20, 5.62341325e+20, 1.12201845e+21, 2.23872114e+21, 4.46683592e+21, 8.91250938e+21, 1.77827941e+22]
mean_1 = [52., 21.33333333, 4., 1., 0., 0., 0.]
mean_2 = [57., 16.66666667, 5.66666667, 2.33333333, 0.66666667, 0., 0.33333333]
mean_3 = [67.33333333, 20., 8.66666667, 3., 0.66666667, 1., 0.33333333]
mean_4 = [79.66666667, 25., 8.33333333, 3., 1., 0., 0.]
mean_5 = [54.66666667, 16.66666667, 8.33333333, 2., 2., 1., 0.]
error_1 = [4.163332, 2.66666667, 1.15470054, 0.57735027, 0., 0., 0.]
error_2 = [4.35889894, 2.3570226, 1.37436854, 0.8819171, 0.47140452, 0., 0.33333333]
error_3 = [4.7375568, 2.5819889, 1.69967317, 1., 0.47140452, 0.57735027, 0.33333333]
error_4 = [5.15320828, 2.88675135, 1.66666667, 1., 0.57735027, 0., 0.]
error_5 = [4.26874949, 2.3570226, 1.66666667, 0.81649658, 0.81649658, 0.57735027, 0.]
newX = np.logspace(20, 22.3)
def myExpFunc(x, a, b):
    return a*np.power(x, b)
popt_1, pcov_1 = curve_fit(myExpFunc, x_mean, mean_1, sigma=error_1, absolute_sigma=True, p0=(4e31,-1.5))
popt_2, pcov_2 = curve_fit(myExpFunc, x_mean, mean_2, sigma=error_2, absolute_sigma=True, p0=(4e31,-1.5))
popt_3, pcov_3 = curve_fit(myExpFunc, x_mean, mean_3, sigma=error_3, absolute_sigma=True, p0=(4e31,-1.5))
popt_4, pcov_4 = curve_fit(myExpFunc, x_mean, mean_4, sigma=error_4, absolute_sigma=True, p0=(4e31,-1.5))
popt_5, pcov_5 = curve_fit(myExpFunc, x_mean, mean_5, sigma=error_5, absolute_sigma=True, p0=(4e31,-1.5))
fig, ax1 = plt.subplots(figsize=(3,5))
ax1.errorbar(x_mean, mean_1, yerr=error_1, ecolor = 'magenta', fmt= 'mo', ms=0, elinewidth = 1, capsize = 0, capthick=0)
ax1.errorbar(x_mean, mean_2, yerr=error_2, ecolor = 'red', fmt= 'ro', ms=0, elinewidth = 1, capsize = 0, capthick=0)
ax1.errorbar(x_mean, mean_3, yerr=error_3, ecolor = 'orange', fmt= 'yo', ms=0, elinewidth = 1, capsize = 0, capthick=0)
ax1.errorbar(x_mean, mean_4, yerr=error_4, ecolor = 'green', fmt= 'go', ms=0, elinewidth = 1, capsize = 0, capthick=0)
ax1.errorbar(x_mean, mean_5, yerr=error_5, ecolor = 'blue', fmt= 'bo', ms=0, elinewidth = 1, capsize = 0, capthick=0)
ax1.plot(newX, myExpFunc(newX, *popt_1), 'm-', label='{:.2f} \u00B1 {:.2f}'.format(popt_1[1], pcov_1[1,1]**0.5))
ax1.plot(newX, myExpFunc(newX, *popt_2), 'r-', label='{:.2f} \u00B1 {:.2f}'.format(popt_2[1], pcov_2[1,1]**0.5))
ax1.plot(newX, myExpFunc(newX, *popt_3), 'y-', label='{:.2f} \u00B1 {:.2f}'.format(popt_3[1], pcov_3[1,1]**0.5))
ax1.plot(newX, myExpFunc(newX, *popt_4), 'g-', label='{:.2f} \u00B1 {:.2f}'.format(popt_4[1], pcov_4[1,1]**0.5))
ax1.plot(newX, myExpFunc(newX, *popt_5), 'b-', label='{:.2f} \u00B1 {:.2f}'.format(popt_5[1], pcov_5[1,1]**0.5))
ax1.legend(handlelength=0, loc='upper right', ncol=1, fontsize=10)
ax1.set_xlim([2e20, 3e22])
ax1.set_ylim([2e-1, 1e2])
ax1.set_xscale("log")
ax1.set_yscale("log")
plt.show()
Your X values are enormous (around 1e20 to 1e22), which makes it hard to guess a workable p0 for the power-law form. Try taking the log of both sides and fitting that instead:
log Y = log(a) + b*log(X)
You won't even need curve_fit at that point; it's a standard linear regression.
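The linearised fit described above can be sketched with np.polyfit. This is a minimal illustration under stated assumptions: fit_power_law is a hypothetical helper, points with y <= 0 are simply dropped because their logarithm is undefined, and the measurement errors are ignored.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on log10(y) versus log10(x).

    Points with y <= 0 are dropped (assumption of this sketch);
    no error weighting is applied.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = y > 0
    slope, intercept = np.polyfit(np.log10(x[keep]), np.log10(y[keep]), 1)
    return 10.0 ** intercept, slope   # (a, b)

# First data set from the question.
x_mean = [2.81838293e+20, 5.62341325e+20, 1.12201845e+21, 2.23872114e+21,
          4.46683592e+21, 8.91250938e+21, 1.77827941e+22]
mean_1 = [52., 21.33333333, 4., 1., 0., 0., 0.]
a_1, b_1 = fit_power_law(x_mean, mean_1)
```

Because the fit is linear in log space, no starting point is needed, and b_1 is directly the slope of the straight line on the log-log plot.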
EDIT
Please see my rough and not very well checked implementation (NOTE: I only have Python 2, so adjust to fit):
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimize
# array (not list) so that x**index in powerlaw works element-wise
x_mean = np.array([2.81838293e+20, 5.62341325e+20, 1.12201845e+21, 2.23872114e+21, 4.46683592e+21, 8.91250938e+21, 1.77827941e+22])
mean_1 = [52., 21.33333333, 4., 1., 0., 0., 0.]
mean_2 = [57., 16.66666667, 5.66666667, 2.33333333, 0.66666667, 0., 0.33333333]
mean_3 = [67.33333333, 20., 8.66666667, 3., 0.66666667, 1., 0.33333333]
mean_4 = [79.66666667, 25., 8.33333333, 3., 1., 0., 0.]
mean_5 = [54.66666667, 16.66666667, 8.33333333, 2., 2., 1., 0.]
error_1 = [4.163332, 2.66666667, 1.15470054, 0.57735027, 0., 0., 0.]
error_2 = [4.35889894, 2.3570226, 1.37436854, 0.8819171, 0.47140452, 0., 0.33333333]
error_3 = [4.7375568, 2.5819889, 1.69967317, 1., 0.47140452, 0.57735027, 0.33333333]
error_4 = [5.15320828, 2.88675135, 1.66666667, 1., 0.57735027, 0., 0.]
error_5 = [4.26874949, 2.3570226, 1.66666667, 0.81649658, 0.81649658, 0.57735027, 0.]
def powerlaw(x, amp, index):
    return amp * (x**index)
# define our (line) fitting function
def fitfunc(p, x):
    return p[0] + p[1] * x
def errfunc(p, x, y, err):
    out = (y - fitfunc(p, x)) / err
    # points with zero error (including y = 0) contribute nothing to the fit
    out[~np.isfinite(out)] = 0.0
    return out
pinit = [1.0, -1.0]
fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
for indx in range(1, 6):
    mean = eval('mean_%d'%indx)
    error = eval('error_%d'%indx)
    logx = np.log10(x_mean)
    logy = np.log10(mean)
    logy[~np.isfinite(logy)] = 0.0
    logyerr = np.array(error) / np.array(mean)
    logyerr[~np.isfinite(logyerr)] = 0.0
    out = optimize.leastsq(errfunc, pinit, args=(logx, logy, logyerr), full_output=1)
    pfinal = out[0]
    covar = out[1]
    index = pfinal[1]
    amp = 10.0**pfinal[0]
    indexErr = np.sqrt(covar[1][1])      # p[1] is the slope (index)
    ampErr = np.sqrt(covar[0][0]) * amp  # p[0] is log10(amp)
    # Plotting data: linear axes on top, log-log axes below
    ax1.plot(x_mean, powerlaw(x_mean, amp, index), label=u'{:.2f} \u00B1 {:.2f}'.format(pfinal[1], covar[1,1]**0.5)) # Fit
    ax1.errorbar(x_mean, mean, yerr=error, fmt='k.', label='__no_legend__') # Data
    ax2.loglog(x_mean, powerlaw(x_mean, amp, index), label=u'{:.2f} \u00B1 {:.2f}'.format(pfinal[1], covar[1,1]**0.5))
    ax2.errorbar(x_mean, mean, yerr=error, fmt='k.', label='__no_legend__') # Data
ax1.set_title('Best Fit Power Law', fontsize=18, fontweight='bold')
ax1.set_xlabel('X', fontsize=14, fontweight='bold')
ax1.set_ylabel('Y', fontsize=14, fontweight='bold')
ax1.grid()
ax2.set_xlabel('X (log scale)', fontsize=14, fontweight='bold')
ax2.set_ylabel('Y (log scale)', fontsize=14, fontweight='bold')
ax2.grid(True, which='major', linestyle='--', color='darkgrey')
ax2.grid(True, which='minor', linestyle=':', color='grey')
ax1.legend()
ax2.legend()
plt.show()

numpy.histogram2d() returning a histogram of all zeros

I'm trying to reproduce a phenomenon I've encountered when constructing 2D histograms using numpy.histogram2d, specifically when using the "bins" parameter. When I use an integer for the bins parameter (e.g. bins=20), I see the expected 2D histogram. However, I want my histogram to have consistently-sized bins, so I want to create the histogram with set minimum and maximum x- and y-values. Currently, I'm creating the bin divisions using numpy.linspace to get arrays of evenly-spaced values.
x_bins = np.linspace(min_range, max_range, num=num_bins+1) #numpy is imported as np
y_bins = np.linspace(0, max_even, num=num_bins+1)
I use these arrays for the bins argument in numpy.histogram2d.
hist, xedges, yedges = np.histogram2d(x, y, bins=(x_bins, y_bins))
The arrays x and y are arrays of numbers between the values of min_range and max_range (for x), and between 0 and max_even (for y). When I define the bins with arrays, some of the histograms I generate have all zeros. All x and y arrays are the same length, and the only thing I can think of that changes is the number ranges fed into numpy.histogram2d.
Numbers in these x and y ranges yield histograms that are not all zeros:
x: min_range = 0.07, max_range = 142.095; y: 0, max_even = 471.64
x: min_range = 0.218, max_range = 195.178; y: 0, max_even = 1493.489
Numbers in these ranges yield histograms with all zeros:
x: min_range = 0.006, max_range = 6.916; y: 0, max_even = 1.101
x: min_range = 0, max_range = 5.58; y: 0, max_even = 1.205
The x and y arrays are both numpy arrays. Printing out the x and y bins and values shows that all the x and y values should fall into the defined bins. Trying to replicate the error with arrays of random values within the ranges of interest wasn't successful, so I apologize for the lack of examples; any suggestions for replication are welcome. What might cause the histogram2d function to return a histogram of all zeros?
EDIT
I tried using the range parameter of histogram2d to define the min and max x and y values, and using an integer for the bins parameter (code below). That had no effect on the histograms with all zeros.
hist, xedges, yedges = np.histogram2d(x, y, bins=10, range=[[min_range, max_range], [0, max_even]])
Below is a case that produces an all-zero array. If the range arguments are swapped between the axes, or the same range is passed for both axes, histogram2d can return a 2D histogram of zeros for certain x and y arrays.
import numpy as np
np.random.seed(100)
x = 5*np.random.rand(40)+5.
y = 3*np.random.rand(40)+10.
x_min = x.min()
x_max = x.max()
y_min = y.min()
y_max = y.max()
np.histogram2d( x , y, bins = [ 5, 3 ], range = [[ x_min, x_max ], [ x_min, x_max ]])
# x_min & x_max both times!!
# (array([[0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.]]),
# array([5.02359428, 5.99979628, 6.97599828, 7.95220028, 8.92840228, 9.90460429]),
# array([5.02359428, 6.65059762, 8.27760095, 9.90460429]))
# Corrected version
np.histogram2d( x , y, bins = [ 5, 3 ], range = [[ x_min, x_max ], [ y_min, y_max ]])
# (array([[5., 5., 2.],
# [1., 4., 3.],
# [0., 3., 2.],
# [2., 0., 1.],
# [5., 5., 2.]]),
# array([5.02359428, 5.99979628, 6.97599828, 7.95220028, 8.92840228, 9.90460429]),
# array([10.0613174 , 11.01588476, 11.97045212, 12.92501948]))
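One way to check for this kind of mismatch is to count how many points actually land inside the bin edges before looking at the histogram itself. The sketch below is an illustration only; count_inside is a hypothetical helper and the small arrays are made-up test data.

```python
import numpy as np

def count_inside(x, y, x_edges, y_edges):
    """Hypothetical diagnostic: number of (x, y) points that fall inside the edges."""
    hist, _, _ = np.histogram2d(x, y, bins=(x_edges, y_edges))
    return int(hist.sum())

x = np.array([5.0, 6.0, 7.0, 8.0])
y = np.array([10.0, 11.0, 12.0, 13.0])
x_edges = np.linspace(x.min(), x.max(), num=5)
y_edges = np.linspace(y.min(), y.max(), num=4)

inside_ok = count_inside(x, y, x_edges, y_edges)    # y edges cover the y data
inside_bad = count_inside(x, y, x_edges, x_edges)   # x edges reused for y by mistake
```

If the count is far below len(x), the edges (or the range argument) do not match the data on at least one axis.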
