STL decomposition getting rid of NaN values - python-3.x

The following links were investigated but didn't provide the answer I was looking for or fix my problem: First, Second.
Due to confidentiality issues I cannot post the actual decomposition, but I can show my current code and give the length of the data set. If this isn't enough, I will remove the question.
import numpy as np
from statsmodels.tsa import seasonal

def stl_decomposition(data):
    data = np.array(data)
    # flatten the nested lists into one 1-D series
    data = [item for sublist in data for item in sublist]
    decomposed = seasonal.seasonal_decompose(x=data, freq=12)
    seas = decomposed.seasonal
    trend = decomposed.trend
    res = decomposed.resid
    return seas, trend, res
In a plot it shows that the data decomposes correctly according to an additive model. However, the trend and residual arrays have NaN values for the first and last 6 months. The current data set has a size of 10*12. Ideally this should also work for something as small as 2 years.
Is this still too small, as stated in the first link? I.e., do I need to extrapolate the missing points myself?
EDIT: It seems that half of the frequency is always NaN on both ends of the trend and residual. The same still holds when the size of the data set is decreased.

According to this GitHub link, another user had a similar question and 'fixed' the issue. To avoid NaNs, an extra parameter can be passed.
decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')
It will then use linear least squares to best approximate the values. (Source)
Admittedly, the information was literally in the documentation and clearly explained, but I completely missed/misinterpreted it. Hence I am answering my own question, to save anyone who has the same issue the adventure I had.
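For reference, a minimal sketch of the original function with the fix applied (the function name is mine; note that newer statsmodels releases use period instead of the freq keyword):
import numpy as np
from statsmodels.tsa import seasonal

def stl_decomposition_no_nans(data):
    data = np.array(data)
    data = [item for sublist in data for item in sublist]
    # extrapolate_trend='freq' fills the half-window of NaNs at both ends
    decomposed = seasonal.seasonal_decompose(x=data, freq=12,
                                             extrapolate_trend='freq')
    return decomposed.seasonal, decomposed.trend, decomposed.resid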

According to the parameter definition below, setting extrapolate_trend to anything other than 0 makes the trend estimation fall back to a different estimation method at the ends. I faced this issue when I had only a few observations available for estimation.
extrapolate_trend : int or 'freq', optional
If set to > 0, the trend resulting from the convolution is
linear least-squares extrapolated on both ends (or the single one
if two_sided is False) considering this many (+1) closest points.
If set to 'freq', use `freq` closest points. Setting this parameter
results in no NaN values in trend or resid components.

Related

Assessing features to labelencode or get_dummies() on dataset in Python

I'm working on the heart attack analysis on Kaggle in Python.
I am a beginner and I'm trying to figure out whether it's still necessary to one-hot-encode or LabelEncode these features. I see so many people encoding the values for this project, but I'm confused because everything already looks scaled (apart from age, thalach, oldpeak and slope).
age: age in years
sex: (1 = male; 0 = female)
cp: ordinal values 1-4
thalach: maximum heart rate achieved
exang: (1 = yes; 0 = no)
oldpeak: depression induced by exercise
slope: the slope of the peak exercise
ca: values (0-3)
thal: ordinal values 0-3
target: 0= less chance, 1= more chance
Would you say it's still necessary to one-hot-encode, or should I just use a StandardScaler straight away?
I've seen many people encode the whole dataset for this project, but that makes no sense to me. Would using only a StandardScaler be enough?
When you apply StandardScaler, the columns will have values on the same scale. That helps the model keep its weights bounded, and gradient descent will not shoot off while converging, so the model converges faster.
Independently, to decide between ordinal values and one-hot encoding, consider whether the distance between column values is meaningful. If it is, choose ordinal values. If you know the hierarchy of the categories, you can assign the ordinal values manually; otherwise you can use LabelEncoder. It seems the heart attack data already comes with manually assigned ordinal values, for example higher chest pain = 4.
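To make that concrete, a minimal sketch, assuming a pandas DataFrame df with the columns listed in the question (the split into continuous vs. already-encoded columns is my own reading of the descriptions above):
import pandas as pd
from sklearn.preprocessing import StandardScaler

# continuous columns: put them on a common scale
continuous_cols = ['age', 'thalach', 'oldpeak']
# binary/ordinal columns: already numeric, can be left as-is
encoded_cols = ['sex', 'cp', 'exang', 'slope', 'ca', 'thal']

X = df.copy()
y = X.pop('target')
X[continuous_cols] = StandardScaler().fit_transform(X[continuous_cols])

# if you prefer to treat cp or thal as unordered categories instead,
# one-hot encode just those columns:
# X = pd.get_dummies(X, columns=['cp', 'thal'])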
Also, it helps to refer to notebooks that perform well. Take a look at the one below for reference.
95% Accuracy - https://www.kaggle.com/code/abhinavgargacb/heart-attack-eda-predictor-95-accuracy-score

Can I tell Excel to keep an automated quadratic model y>= 0, without trying to run a manual model?

I am trying to create a quadratic model in Excel with tennis ranks data.
When running the automatic model trendline function it gives me a model with negative y values, which can obviously not occur for ranks.
How do I tell Excel to keep model y-values >=0?
Thank you!
See the screenshot below/here.
There are several advantages to understanding the formulation/construct of the quadratic trend. For instance, replicating the 'automatic trend' using LINEST as follows gives the user additional control over the individual terms and can highlight any graphical errors:
=$L$3+$K$3*D3+$J$3*D3^2+$I$3*D3^3
This demonstrates a cubic regression (white dots), which coincides with Excel's 'automatic' trend line.
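As an aside, the coefficients referenced above can be produced with a single LINEST array formula; a sketch, assuming the y-values sit in B3:B20 and the x-values in D3:D20 (illustrative ranges only):
=LINEST(B3:B20, D3:D20^{1,2,3})
Entered across four cells, this returns the cubic, quadratic, linear and intercept terms (consistent with I3, J3, K3 and L3 in the formula above); drop the ^3 term for a purely quadratic fit.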
Summary of possible issues
There are several potential issues | remedies you can consider, depending on the goodness of fit, the data in question, etc. A non-exhaustive list of issues you may be encountering includes the following:
Issue | Resolve
1) Overfitting | Reduce # of terms (e.g. order = 2 instead of 3, etc.)
2) Wrong fit | Attempt lognormal
3) Negative left | Set intercept
4) Graphical error | Use a scatter chart, sort x values (ascending)
5) Outliers | Various: exclude/adjust, fit a separate curve (Extreme Value Theory), manually adjust polynomial terms noting the reduction in goodness of fit, etc.
1) Overfitting
Trendline options: reduce the order per screenshot:
2) Lognormal | Other
Transform / consider other fits/curves (you can also place the y and x axes on a lognormal scale, which will automatically remove negatives, although consider outliers and the impact on R-squared / goodness of fit).
3) Negative left
In certain circumstances, a negative left may be removed by setting the intercept to an appropriate value.
4) Graphical error
It's often easier to use a scatter chart, with x-values ordered per description (regression parameters may be affected otherwise).
5) Outliers
It may be the case that you're fitting to 1 or 2 outliers. Consider reducing the complexity/number of terms, or adjusting/omitting outliers suitably. There is an entire branch of statistics that deals with the distribution of extreme values/outliers (Extreme Value Theory, beyond the scope of the present answer).
Other remarks:
Rounding errors in the automatic trend-line function can lead to inaccuracies, as can human error when replicating the 'automatic trend-line' displayed on the chart; this suggests LINEST / the exact formulation is preferable.
Reference(s)
Data / formulation for first screenshot here
Useful video content: here

Overlapping/crowded labels on y-axis python [duplicate]

This question already has answers here:
How to change spacing between ticks
(4 answers)
Closed 5 months ago.
I'm kind of in a rush to finish this for tomorrow's presentation to the project owner. We are a small group of economics students in Germany trying to figure out machine learning with Python. We set up a Random Forest classifier and are desperate to show the estimator's important features in a neat plot. Through a Google search we came up with the following solution, which kind of does the trick but leaves us unsatisfied because the labels on the y-axis overlap. The code we used looks like this:
feature_importances = clf.best_estimator_.feature_importances_
feature_importances = 100 * (feature_importances / feature_importances.max())
sorted_idx = np.argsort(feature_importances)
pos = np.arange(sorted_idx.shape[0])
plt.barh(pos, feature_importances[sorted_idx], align='center', height=0.8)
plt.yticks(pos, df_year_four.columns[sorted_idx])
plt.show()
Due to privacy, let me say this: the feature names on the y-axis are overlapping (there are about 30 of them). I looked into the matplotlib documentation to understand how to do this myself, but unfortunately I couldn't find anything helpful. It seems like training and testing models is easier than understanding matplotlib and creating plots :D
Thank you so much for helping out and taking the time, I appreciate it.
I see your solution, and I want to just add this link here to explain why: How to change spacing between ticks in matplotlib?
The spacing between ticklabels is exclusively determined by the space between ticks on the axes. Therefore the only way to obtain more space between given ticklabels is to make the axes larger.
The question I linked shows that by making the graph large enough, your axis labels would naturally be spaced better.
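As a concrete illustration of that advice, a minimal sketch building on the snippet from the question (it assumes clf and df_year_four are already defined as in the question):
import numpy as np
import matplotlib.pyplot as plt

feature_importances = clf.best_estimator_.feature_importances_
feature_importances = 100 * (feature_importances / feature_importances.max())
sorted_idx = np.argsort(feature_importances)
pos = np.arange(sorted_idx.shape[0])

# make the figure tall enough that ~30 horizontal bars get readable labels
fig, ax = plt.subplots(figsize=(8, 10))
ax.barh(pos, feature_importances[sorted_idx], align='center', height=0.8)
ax.set_yticks(pos)
ax.set_yticklabels(df_year_four.columns[sorted_idx], fontsize=8)
fig.tight_layout()
plt.show()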
You are using np.argsort, which returns a NumPy array with many indices, and you are using that array to pick the labels for your y-axis; hence the overlapping labels.
My suggestion would be to use a single index into sorted_idx, like:
plt.yticks(pos, df_year_four.columns[sorted_idx[0]])
This will plot only one label.
Got it, guys!
'Geistesblitz', as we say in Germany! (a flash of inspiration)
See the variable feature_importances in the third row from the top? Slice it with feature_importances[:-15]
to view only the top half of the features and loosen up the y-axis. Yes!!! This works well because there are far fewer important features.
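For anyone adapting that idea: since np.argsort sorts in ascending order, one way to keep only the N most important features is to slice the sorted index array from the end; a sketch with an arbitrary N of 15 (not necessarily the exact slice the answer above used):
import numpy as np
import matplotlib.pyplot as plt

top_n = 15
sorted_idx = np.argsort(feature_importances)[-top_n:]  # indices of the 15 largest importances
pos = np.arange(top_n)

plt.barh(pos, feature_importances[sorted_idx], align='center', height=0.8)
plt.yticks(pos, df_year_four.columns[sorted_idx])
plt.show()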

How to apply a Pearson Correlation Analysis over all pairs of pixels of a DataArray as a Correlation Matrix?

I am facing serious difficulties in generating a correlation matrix (pixel by pixel) from a single NetCDF with dimensions ('lon', 'lat', 'time'). My final intent is to generate what is called a Teleconnectivity Map.
This map is composed of correlation coefficients. Each pixel has a value that represents the highest correlation value (in absolute terms) found in the correlation matrix over all pairs of pixels in the DataArray.
Therefore, in order to create my Teleconnectivity Map, instead of looping over every longitude ('lon') and every latitude ('lat') and later checking all possible correlation combinations for the one highest in magnitude, I was thinking of applying xr.apply_ufunc with a wrapped correlation function inside.
Despite my efforts, I still don't get what is truly happening behind the scenes in xr.apply_ufunc. All I managed to get was a single resultant matrix with all pixels equal to 1 (perfect correlation).
See code below:
import numpy as np
import xarray as xr

def correlation(x, y):
    # return a single correlation coefficient instead of the full matrix
    return np.corrcoef(x, y)[0, 0]

def wrapped_correlation(da, x, coord='time'):
    """Finds the correlation along a given dimension of a dataarray."""
    from functools import partial
    fpartial = partial(correlation, x.values)
    return xr.apply_ufunc(fpartial,
                          da,
                          input_core_dims=[[coord]],
                          output_core_dims=[[]],
                          vectorize=True,
                          output_dtypes=[float])
# testing the wrapped correlation on sample data:
ds = xr.tutorial.open_dataset('air_temperature').load()

# a single point in space
x = ds['air'].sel(dict(lon=1, lat=92), method='nearest')

# correlate it against all points in the DataArray
Corr_over_x = wrapped_correlation(ds['air'], x)

# notice that the resultant DataArray is composed solely of ones (a perfect
# correlation everywhere). This is impossible; I would expect a different
# correlation value for each pixel here.
Corr_over_x

# plotting the data should show a variety of correlation values:
Corr_over_x.plot()
This is an important asset for meteorologists and remote sensing researchers. It allows the evaluation of potential geophysical patterns over a given area of study.
I thank you for your time, and I hope to hear from you soon.
Sincerely yours,
Firstly, you need to use np.corrcoef(x, y)[0, 1]. Also, you don't need to use partial at all; see below:
def correlation(x1, x2):
    # return a single correlation coefficient instead of the full matrix
    return np.corrcoef(x1, x2)[0, 1]

def wrapped_correlation(da, x, coord='time'):
    """Finds the correlation along a given dimension of a dataarray."""
    return xr.apply_ufunc(correlation,
                          da,
                          x,
                          input_core_dims=[[coord], [coord]],
                          output_core_dims=[[]],
                          vectorize=True,
                          output_dtypes=[float])
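For reference, a quick check with the same tutorial data as in the question (assuming the corrected functions above have been defined):
ds = xr.tutorial.open_dataset('air_temperature').load()
x = ds['air'].sel(dict(lon=1, lat=92), method='nearest')

# one correlation value per (lat, lon) pixel, no longer all ones
corr_over_x = wrapped_correlation(ds['air'], x)
corr_over_x.plot()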
I have managed to solve my question. The script has become a bit long; nevertheless, it does what it was intended to do.
The code is adapted from this reference.
Since it is too long to show as a snippet here, I am posting a link to my GitHub account, where the algorithm (organized in a package named Teleconnection_using_xarray_data) can be checked here.
The package has two modules with similar results.
The first module (teleconnection_with_connecting_pathways) is slower than the second (teleconnection_via_numpy), but it allows one to evaluate the connecting pathways between the partial teleconnection maps.
The second only returns the resultant teleconnection map, without the connecting lines (geopandas LineStrings), though it is much faster.
Feel free to collaborate. If possible, I would like to combine both modules, ensuring both speed and pathway analysis in the teleconnection algorithm.
Sincerely yours,
Philipe Leal

How to identify data points that are significantly smaller than the others in a data set?

I have an array of real-valued data points. I wish to identify those data points whose values are significantly smaller than the others. Are there any well-known algorithms?
For example, the data set could be {0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0}. I can manually tell that 0.01 is significantly smaller than the others. However, I would like to know whether there is any analysis method for this purpose in statistics. I tried outlier detection on my data set, but it could not find any outliers (such as detecting 0.01 as an outlier).
I have deleted a segment I wrote explaining the use of z-scores for your problem because it was incorrect. I hope the information below is accurate; just in case, use it as a guide only...
The idea is to build a z-distribution from the scores you are testing, minus the test score, and then use that distribution to get a z-score for the test score. Any z greater than 1.96 in absolute value is unlikely to belong to your test population.
I am not certain that this works properly, because you remove your test score's influence from the distribution; thus large scores will have inflated z-scores, because they contribute to a greater variance (the denominator in the z-score equation).
This could be a start until someone with a modicum of expertise sets us straight :)
e.g.
% leave-one-out z-score for each data point (MATLAB-style)
for i = 1:length(data_set)
    test_score = data_set(i)
    sample_pop = data_set(data_set ~= test_score)
    sample_mean = mean(sample_pop)
    sample_stdev = std(sample_pop)
    test_z(i) = (test_score - sample_mean) / sample_stdev
end
This can be done for higher dimensions by using the dim input for mean.
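For a Python/NumPy equivalent of the same leave-one-out z-score idea, a minimal sketch using the example data from the question (flagging values whose z is below -1.96 as significantly smaller):
import numpy as np

data = np.array([0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0])

z = np.empty_like(data)
for i, value in enumerate(data):
    rest = np.delete(data, i)              # the sample without the tested point
    z[i] = (value - rest.mean()) / rest.std(ddof=1)

significantly_smaller = data[z < -1.96]
print(z)
print(significantly_smaller)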
