Rolling mean with irregular boundaries - python-3.x

I am using a rolling mean to smooth my data. My data can be found here.
An illustration of my original data is:
Currently, I am using:
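import pandas as pd
import numpy as np

data = pd.read_excel('data.xlsx')
# np.float was removed in NumPy 1.24; the builtin float is equivalent
data = np.array(data, dtype=float)
window_length = 9
# pd.rolling_mean only exists in old pandas releases; the modern equivalent is
# pd.Series(data[:, 2]).rolling(window_length, min_periods=1, center=True).mean()
res = pd.rolling_mean(np.array(data[:, 2]), window_length, min_periods=1, center=True)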
This is what I get after applying a rolling mean with a window_length of 9:
When I increase the window_length to 20, I get a smoother image, but at the boundaries the data seems to be erroneous.
The problem is that, as seen in the figures above, the rolling mean introduces errors at the boundaries of my data that do not exist in the original data.
Is there any way to correct this?
My guess is that, at the boundary, part of the window falls outside my data, which exaggerates the mean.
Is there a way to correct this error using the pandas rolling mean, or is there a better, more Pythonic way of doing this? Thanks.
PS: I am aware that the pandas rolling mean function I am using is deprecated in newer versions.

You can try a native 2D convolution such as scipy.ndimage.convolve (exposed as scipy.ndimage.filters.convolve in older SciPy releases), using weights that turn the kernel into an averaging (mean) function.
The weights would be:
n = 3  # size of the kernel over which to calculate the mean
weights = np.ones((n, n)) / n**2
If the white areas of your data are represented by NaNs, this will shrink the footprint of the result by roughly n, since any kernel stamp that includes a NaN returns NaN. If this is really an issue, take a look at astropy.convolution, which has better NaN handling.
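If pulling in astropy is not an option, one possible workaround is a normalized convolution: sum only the valid pixels in each window and divide by the number of valid pixels. The sketch below is an illustration under stated assumptions, not part of the original answer. The function name nanmean_filter is made up for this example, and it assumes the missing (white) areas are stored as NaN.

```python
import numpy as np
from scipy import ndimage


def nanmean_filter(image, n=3):
    """Mean over an n x n window that ignores NaN pixels (hypothetical helper)."""
    image = np.asarray(image, dtype=float)
    valid = np.isfinite(image)
    kernel = np.ones((n, n))

    # Sum of the valid values in each window; NaNs and the area outside
    # the array both contribute zero.
    total = ndimage.convolve(np.where(valid, image, 0.0), kernel,
                             mode="constant", cval=0.0)
    # Number of valid pixels in each window.
    count = ndimage.convolve(valid.astype(float), kernel,
                             mode="constant", cval=0.0)

    out = np.full(image.shape, np.nan)
    np.divide(total, count, out=out, where=count > 0.5)
    # Keep pixels that were missing in the input missing in the output.
    out[~valid] = np.nan
    return out


image = np.full((5, 5), 2.0)
image[0, 0] = np.nan
smoothed = nanmean_filter(image, n=3)
```

Because each window is divided by its own count of valid pixels, windows that overlap the boundary are averaged over fewer values instead of being pulled towards zero or NaN.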

Related

Does fitting a Weibull distribution to data using scipy.stats perform poorly?

I am working on fitting Weibull distribution on some integer data and estimating relevant shape, scale, location parameters. However, I noticed poor performance of scipy.stats library while doing so.
So, I took a different direction and checked the fit performance by using the code below. I first create 100 numbers using Weibull distribution with parameters shape=3, scale=200, location=1. Subsequently, I estimate the best distribution fit using fitter library.
from fitter import Fitter
import numpy as np
from scipy.stats import weibull_min
# generate numbers
x = weibull_min.rvs(3, scale=200, loc=1, size=100)
# make them integers
data = np.asarray(x, dtype=int)
# fit one of the four distributions
f = Fitter(data, distributions=["gamma", "rayleigh", "uniform", "weibull_min"])
f.fit()
f.summary()
I expect the best fit to be Weibull distribution. I have tried re-running this test. Sometimes Weibull fit is a good estimate. However, most of the time Weibull fit is reported as the worst result. In this case, the estimated parameters are = (0.13836651040093312, 66.99999999999999, 1.3200752378443505). I assume these parameters correspond to shape, scale, location in order. Below is the summary of the fit procedure.
$ f.summary()
             sumsquare_error          aic          bic  kl_div
gamma               0.001601  1182.739756 -1090.410631     inf
rayleigh            0.001819  1154.204133 -1082.276256     inf
uniform             0.002241  1113.815217 -1061.400668     inf
weibull_min         0.004992  1558.203041  -976.698452     inf
Additionally, the following plot is produced.
Also, Rayleigh distribution is a special case of Weibull with shape parameter = 2. So, I expect the resulting Weibull fit to be at least as good as Rayleigh.
Update
I ran the tests above on Linux/Ubuntu 20.04 machine with numpy version 1.19.2 and scipy version 1.5.2. The code above seems to run as expected and return proper results for Weibull distribution on a Mac machine.
I have also tested fitting a Weibull distribution on data x generated above on the Linux machine by using an R library fitdistrplus as:
fit.weib <- fitdist(x, "weibull")
and observed that the estimated shape and scale values are found to be very close to the initially given values. The best guess so far is that the problem is due to some Python-Ubuntu bug/incompatibility.
I can be considered as a newbie in this area. So, I am wondering, am I doing something wrong here? Or is this result somehow expected? Any help is greatly appreciated.
Thank you.
The fitter library does not let you specify distribution parameters such as a or loc. Strangely, the Mac produces a better fit while Linux gives much worse results for the best fit, with the same versions of NumPy and SciPy. Possible underlying reasons include the different BLAS/LAPACK implementations used on Linux and Mac (https://stackoverflow.com/a/49274049/6806531), weibull_min not initializing the parameter a = 1 (which is discussed online), or the default floating-point accuracy. However, the error can be worked around inside the fitter library. Since weibull_min is exponweib with the parameter a fixed at 1, change the run function inside the _timed_run function in fitter.py to
def run(self):
    try:
        if distribution == "exponweib":
            self.result = func(args, floc=0, fa=1, **kwargs)
        else:
            self.result = func(args, floc=0, **kwargs)
    except Exception as err:
        self.exc_info = sys.exc_info()
Using exponweib in place of weibull_min then gives nearly the same results as R's fitdist.
I am not familiar with the Fitter library, but in order to draw some conclusions I would suggest:
Retry your code, but by taking size=10,000. In this case, there are sufficient datapoints for the fitting methods to utilize. Theoretically, you would then expect the Weibull to deliver the best fit.
I noticed that the location parameter can sometimes be a pain. You could try to run your fits with the location parameter fixed via floc=1 (i.e. equal to the location you sampled with). What do you get? Additionally, FYI, with MLE it suffices to take loc=min(x), where x is your dataset. For the exponential distribution, this is in fact the MLE of the location parameter. For other distributions I am not sure, but I wouldn't be surprised if it holds for them as well. This would reduce the fitting procedure by one parameter.
Lastly, I noticed that if you take small values for location/scale/shape for some distributions, the functions logpdf and logcdf of scipy.stats distributions result in np.inf values. In this scenario, you could perhaps use the Powell optimization algorithm and set bounds on the values of your parameters.
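The first two suggestions can be tried directly with scipy.stats, without going through fitter. The following is a minimal sketch, assuming the same sampling parameters as in the question; the seed and the sample size of 10,000 are choices made for this example. Note that scipy's fit returns the parameters in the order (shape, loc, scale).

```python
import numpy as np
from scipy.stats import weibull_min

# Larger synthetic sample with the same parameters as in the question.
x = weibull_min.rvs(3, loc=1, scale=200, size=10_000, random_state=0)

# Fix the location at the known sampling value, so that only the
# shape and scale are estimated.
shape, loc, scale = weibull_min.fit(x, floc=1)
print(f"shape={shape:.3f}, loc={loc:.3f}, scale={scale:.3f}")
```

If the estimates land close to 3 and 200 here but not through fitter, that would point towards the free location parameter (or the optimizer's starting values) rather than towards the data.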

How to apply a Pearson Correlation Analysis over all pairs of pixels of a DataArray as a Correlation Matrix?

I am facing serious difficulties in generating a pixel-by-pixel correlation matrix from a single NetCDF file with dimensions ('lon', 'lat', 'time'). My final intent is to generate what is called a Teleconnectivity Map.
This map is composed of correlation coefficients. Each pixel holds the highest correlation (in absolute value) found between that pixel and every other pixel in the DataArray.
Therefore, to create my Teleconnectivity Map, instead of looping over every longitude ('lon') and latitude ('lat') and then checking all possible combinations for the correlation of highest magnitude, I was thinking of applying xr.apply_ufunc with a wrapped correlation function inside.
Despite my efforts, I still don't understand what truly happens behind the scenes in xr.apply_ufunc. All I managed to get was a single resulting matrix with all pixels equal to 1 (perfect correlation).
See code below:
import numpy as np
import xarray as xr
def correlation(x, y):
    return np.corrcoef(x, y)[0,0] # to return a single correlation index, instead of a matrix

def wrapped_correlation(da, x, coord='time'):
    """Finds the correlation along a given dimension of a dataarray."""
    from functools import partial
    fpartial = partial(correlation, x.values)
    return xr.apply_ufunc(fpartial,
                          da,
                          input_core_dims=[[coord]],
                          output_core_dims=[[]],
                          vectorize=True,
                          output_dtypes=[float]
                          )
# testing the wrapped correlation for a sample data:
ds = xr.tutorial.open_dataset('air_temperature').load()
# testing for a single point in space.
x = ds['air'].sel(dict(lon=1, lat=92), method='nearest')
# over all points in the DataArray
Corr_over_x = wrapped_correlation(ds['air'], x)
Corr_over_x  # notice that the resulting DataArray consists solely of ones (perfect correlation). This is impossible; I would expect a different correlation value for each pixel
# if one plotted the data, it should show a variety of correlation values (see example below):
Corr_over_x.plot()
This is an important asset for meteorologists and remote sensing researchers. It allows the evaluation of potential geophysical patterns over a given area of study.
I thank you for your time, and I hope hearing from you soon.
Sincerely yours,
Firstly, you need to use np.corrcoef(x, y)[0,1]. In the end, you don't need to use partial at all, see below:
def correlation(x1, x2):
    return np.corrcoef(x1, x2)[0,1] # to return a single correlation index, instead of a matrix

def wrapped_correlation(da, x, coord='time'):
    """Finds the correlation along a given dimension of a dataarray."""
    return xr.apply_ufunc(correlation,
                          da,
                          x,
                          input_core_dims=[[coord],[coord]],
                          output_core_dims=[[]],
                          vectorize=True,
                          output_dtypes=[float]
                          )
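The reason the index matters can be seen on a small NumPy-only example. np.corrcoef(x, y) returns a 2 x 2 matrix whose diagonal holds the correlation of each input with itself, so [0,0] is always 1, while the off-diagonal entry [0,1] is the correlation between x and y. The arrays below are synthetic and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)  # partly correlated with x

matrix = np.corrcoef(x, y)   # shape (2, 2)
self_corr = matrix[0, 0]     # correlation of x with itself
cross_corr = matrix[0, 1]    # correlation between x and y

print(matrix.shape, self_corr, cross_corr)
```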
I have managed to solve my question. The script has become a bit long. Nevertheless, it does what was originally intended.
The code is adapted from this reference.
Since it is too long to show as a snippet here, I am posting a link to my GitHub account, where the algorithm (organized in a package named Teleconnection_using_xarray_data) can be checked here.
The package has two modules with similar results.
The first module (teleconnection_with_connecting_pathways) is slower than the second (teleconnection_via_numpy), but it allows the connecting pathways between the partial teleconnection maps to be evaluated.
The second only returns the resulting teleconnection map, without the connecting lines (geopandas LineStrings), though it is much faster.
Feel free to collaborate. If possible, I would like to combine both modules, ensuring both speed and pathway analysis in the Teleconnection algorithm.
Sincerely yours,
Philipe Leal
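For readers who only need the map described in the question (the highest absolute correlation of each pixel with any other pixel), a compact NumPy sketch is given below. It is not the code from the package mentioned above; the function name teleconnectivity_map and the (time, lat, lon) axis order are assumptions made for this illustration. It builds the full pixel-by-pixel correlation matrix, so memory grows with the square of the number of pixels, and pixels that are constant in time come out as NaN.

```python
import numpy as np


def teleconnectivity_map(data):
    """data: array of shape (time, lat, lon). Returns a (lat, lon) array
    holding, for each pixel, the largest |r| with any other pixel."""
    n_time, n_lat, n_lon = data.shape
    flat = data.reshape(n_time, n_lat * n_lon)

    # rowvar=False: every column (pixel) is a variable, every row a time step.
    corr = np.corrcoef(flat, rowvar=False)
    # Ignore the trivial correlation of each pixel with itself.
    np.fill_diagonal(corr, 0.0)

    return np.abs(corr).max(axis=1).reshape(n_lat, n_lon)


rng = np.random.default_rng(0)
cube = rng.normal(size=(20, 3, 4))
cube[:, 2, 3] = cube[:, 0, 0]  # make two pixels perfectly correlated
tele = teleconnectivity_map(cube)
```

With an xarray DataArray, something like da.transpose('time', 'lat', 'lon').values could be passed in, and the result wrapped back into a DataArray with the original lat/lon coordinates.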

How to remove duplicates from a dataframe and create a new one with a weight for each sample?

I'm working on a classification problem where I know the labels. I'm comparing two different algorithms, K-Means and DBSCAN. However, the latter has the well-known memory problem when computing the distance metric. If my dataset contains a lot of duplicated samples, can I delete them, count their occurrences, and then use these counts as weights in the algorithm? All of this is to save memory.
I do not know how to do it. This is my code:
df = dimensionality_reduction(dataframe = df_balanced_train)
train = np.array(df.iloc[:,1:])
### DBSCAN
# DBSCAN has no centroids
y_dbscan, centroidi = Cluster(data = train, algo = "DBSCAN")
err, colori = error_Cluster(y_dbscan, df)
#These are the functions:
#DBSCAN Algorithm
#nbrs = NearestNeighbors(n_neighbors= 1500).fit(data)
#distances, indices = nbrs.kneighbors(data)
#print("The mean distance is about : " + str(np.mean(distances)))
#np.median(distances)
dbscan = DBSCAN(eps= 0.9, min_samples= 1000, metric="euclidean",
n_jobs = 1)
y_result = dbscan.fit_predict(data)
centroidi = "In DBSCAN there are not Centroids"
For a sample of 30k elements everything is fine, but for 800k I always run into memory problems. Could deleting duplicates and counting their occurrences solve my problem?
DBSCAN should take only O(n) memory - just as k-means.
But apparently the sklearn implementation first computes all neighbors, and thus uses O(n²) memory in the worst case, which makes it less scalable. I'd consider this a bug in sklearn, but apparently they are well aware of this limitation, which seems to be mostly a problem when you choose bad parameters. To guarantee O(n) memory it may be enough to just implement the standard DBSCAN yourself.
Merging duplicates is certainly an option, but A) it usually means you are using inappropriate data for these algorithms or for this distance, and B) you need an implementation that supports weights (your own, or a library one that accepts per-sample weights), because DBSCAN then has to use weight sums instead of neighbor counts.
Last but not least: if you have labels and a classification problem, these seem to be the wrong choice. They are clustering, not classification. Their job is not to recreate the labels you have, but to find new labels from the data.
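The duplicate-merging idea can be sketched as follows. np.unique with axis=0 collapses identical rows and returns how often each one occurs. scikit-learn's DBSCAN accepts a sample_weight argument in fit and fit_predict, so the counts can be passed there; whether this saves enough memory depends on how many rows are actually duplicated. The helper names below are made up for this example, and the eps and min_samples defaults are simply the values from the question.

```python
import numpy as np


def collapse_duplicates(data):
    """Return the unique rows, their counts, and a map back to the input rows."""
    unique_rows, inverse, counts = np.unique(
        data, axis=0, return_inverse=True, return_counts=True
    )
    # ravel() keeps the inverse index 1-D across NumPy versions.
    return unique_rows, counts, inverse.ravel()


def dbscan_with_weights(data, eps=0.9, min_samples=1000):
    # Imported here so collapse_duplicates() can be used without scikit-learn.
    from sklearn.cluster import DBSCAN

    unique_rows, counts, inverse = collapse_duplicates(data)
    labels_unique = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        unique_rows, sample_weight=counts
    )
    # Expand the labels back to one label per original row.
    return labels_unique[inverse]


data = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]])
unique_rows, counts, inverse = collapse_duplicates(data)
```

With the question's variables this would be called as dbscan_with_weights(train), which returns one label per row of train.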

STL decomposition getting rid of NaN values

The following links were investigated but did not provide the answer I was looking for or fix my problem: First, Second.
Due to confidentiality issues I cannot post the actual decomposition, but I can show my current code and give the lengths of the data set. If this isn't enough, I will remove the question.
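import numpy as np
from statsmodels.tsa import seasonal

def stl_decomposition(data):
    data = np.array(data)
    # flatten the list of lists into a single series
    data = [item for sublist in data for item in sublist]
    # note: newer statsmodels releases name this argument `period` instead of `freq`
    decomposed = seasonal.seasonal_decompose(x=data, freq=12)
    seas = decomposed.seasonal
    trend = decomposed.trend
    res = decomposed.resid
    return seas, trend, res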
In a plot it shows it decomposes correctly according to an additive model. However the trend and residual lists have NaN values for the first and last 6 months. The current data set is of size 10*12. Ideally this should work for something as small as only 2 years.
Is this still too small as said in the first link? I.e. I need to extrapolate the extra points myself?
EDIT: Seems that always half of the frequency is NaN on both ends of trend and residual. Same still holds for decreasing size of data set.
According to this GitHub link, another user had a similar question, and the issue was 'fixed' there. To avoid NaNs, an extra parameter can be passed.
decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')
It then uses a linear least-squares fit to extrapolate the trend at both ends. (Source)
Obviously the information was literally on their documentation and clearly explained but I completely missed/misinterpreted it. Hence I am answering my own question for someone who has the same issue, to save them the adventure I had.
According to the parameter definition below, setting extrapolate_trend other than 0 makes the trend estimation revert to a different estimation method. I faced this issue when I had a few observations for estimation.
extrapolate_trend : int or 'freq', optional
    If set to > 0, the trend resulting from the convolution is
    linear least-squares extrapolated on both ends (or the single one
    if two_sided is False) considering this many (+1) closest points.
    If set to 'freq', use `freq` closest points. Setting this parameter
    results in no NaN values in trend or resid components.
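Why exactly half a period is missing at each end can be reproduced with NumPy alone. seasonal_decompose is a classical moving-average decomposition, and for an even period the trend comes from a centred moving average that spans period + 1 points, so the first and last period // 2 positions have no complete window. The sketch below is an illustration of that mechanism, not the statsmodels source; the function name is made up and it assumes an even period.

```python
import numpy as np


def centred_moving_average(x, period=12):
    """Centred moving average for an even period; NaN where the window does not fit."""
    x = np.asarray(x, dtype=float)
    # Half weights on the two end points so the window stays centred.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    half = period // 2

    trend = np.full(x.shape, np.nan)
    # mode="valid" only returns positions where the whole window lies inside x.
    trend[half:len(x) - half] = np.convolve(x, weights, mode="valid")
    return trend


two_years = np.arange(24, dtype=float)  # a purely linear series of 24 months
trend = centred_moving_average(two_years, period=12)
n_missing = int(np.isnan(trend).sum())
```

Even with only two years of monthly data, 12 trend values are still available; only the outer 6 on each side are missing, which is what extrapolate_trend fills in.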

How to detect rate of change in a stream of data Python3

I have an input source that gives me integers in [0..256].
I want to be able to locate spikes in this data, i.e. a new input.
I've tried using a rolling average in conjunction with finding the percent error. But this doesn't really work.
Basically, I want my program to find where a graph of the data would spike up, but I want it to ignore smooth transitions.
Thoughts?
A simple thought which follows my comment. First
>>> import numpy as np
Suppose we have the following time series
>>> sample = np.random.randint(0, 257, size=100)  # random_integers is deprecated; randint excludes the upper bound
To know whether or not a spike can be considered as a rare event, we have to know the likelihood associated to each event. Since you are dealing with "rates of change", let us compute those
>>> sample_vars = np.abs(sample[1:] / sample[:-1] - 1)  # relative change; a zero in sample[:-1] gives inf or nan
We can then define the variation which has at most 5 percent (sample-) chance of occurring
>>> spike_defining_threshold = np.percentile(sample_vars, 95)
Finally, the latest point counts as a spike if sample_vars[-1] > spike_defining_threshold.
Would be great if others have thoughts to share as well...
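For a live stream, the same percentile idea can be applied over a sliding window of recent changes. The sketch below is one possible variant, not the method from the answer above: it uses absolute differences instead of ratios, which avoids dividing by zero when the input contains 0, and the function name and default parameters are choices made for this example.

```python
from collections import deque

import numpy as np


def find_spikes(stream, window=30, percentile=95.0, min_history=10):
    """Yield the index of each sample whose jump from the previous sample
    exceeds the given percentile of the recent jumps."""
    recent = deque(maxlen=window)  # absolute changes seen recently
    previous = None
    for i, value in enumerate(stream):
        if previous is not None:
            change = abs(value - previous)
            # Only judge a change once enough history has been collected.
            if len(recent) >= min_history:
                threshold = np.percentile(list(recent), percentile)
                if change > threshold:
                    yield i
            recent.append(change)
        previous = value


ramp = list(range(100))               # smooth transition, no spikes expected
jumpy = ramp[:50] + [200] + ramp[51:]  # one sudden jump at index 50
spikes = list(find_spikes(jumpy))
```

A slow ramp never exceeds its own recent changes and is ignored, while the sudden jump is flagged. Note that the drop back down right after the jump is also a large change, so it may be reported as well.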
