I am working on a new model which is very sensitive to the interpolation/fit used to describe a certain dataset. I have had some success with linear splines and logarithmic fits, but I think there is still significant room for improvement. My supervisor suggested I take a look at exponential splines. I have found some books on the theory of exponential splines but no reference to a library or code example to follow.
Is there a library that I am unaware of that supports this feature?
I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology because google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says it should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest. Namely, you would estimate the squared error (or some other measure of fit) on the full sample and on the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change; a small sketch of this sub-sample comparison follows after these two points.
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
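To make (1) concrete, here is a minimal sketch of the sub-sample comparison, assuming the two series are stored in numpy arrays x and y and that a candidate break point is already given; the array names, the synthetic data, and the simple linear fit are assumptions for illustration.

```python
# Minimal sketch: compare the fit of one full-sample linear regression against
# two sub-sample regressions split at a candidate break point.
import numpy as np

def sse_of_linear_fit(x, y):
    """Fit y = a*x + b by least squares and return the sum of squared errors."""
    a, b = np.polyfit(x, y, deg=1)
    residuals = y - (a * x + b)
    return np.sum(residuals ** 2)

def break_gain(x, y, break_index):
    """SSE reduction obtained by splitting the sample at break_index."""
    sse_full = sse_of_linear_fit(x, y)
    sse_split = (sse_of_linear_fit(x[:break_index], y[:break_index])
                 + sse_of_linear_fit(x[break_index:], y[break_index:]))
    return sse_full - sse_split  # large values favour the model with a break

# Synthetic example whose slope changes halfway through the sample
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.where(x < 5, 1.0 * x, 2.0 * x - 5.0) + rng.normal(0, 0.3, size=100)
print(break_gain(x, y, break_index=50))
```

In practice you would also want a formal criterion for deciding whether the gain is large enough (for example a Chow-type test), and for approach (2) a rolling-window fit or a state-space model would replace the single split.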
I hope this helps!
In sklearn, the documentation of QuantileTransformer says:
This method transforms the features to follow a uniform or a normal distribution
and the documentation of PowerTransformer says:
Apply a power transform featurewise to make data more Gaussian-like
It seems both of them can transform features to a Gaussian/normal distribution. What are the differences in this respect, and when should I use which?
The terminology is confusing because "Gaussian" and "normal" refer to the same distribution.
QuantileTransformer and PowerTransformer are both non-linear.
To answer your question about what exactly the difference is, according to https://scikit-learn.org:
"QuantileTransformer provides non-linear transformations in which distances between marginal outliers and inliers are shrunk. PowerTransformer provides non-linear transformations in which data is mapped to a normal distribution to stabilize variance and minimize skewness."
Source and more info here: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#:~:text=QuantileTransformer%20provides%20non%2Dlinear%20transformations,stabilize%20variance%20and%20minimize%20skewness.
The main difference is that PowerTransformer() is parametric while QuantileTransformer() is non-parametric. Box-Cox or Yeo-Johnson will make your data look more 'normal' (i.e. less skewed and more centered), but it's often still far from a perfect Gaussian. QuantileTransformer(output_distribution='normal') results usually look much closer to Gaussian, at the cost of distorting linear relationships somewhat more. I believe there's no rule of thumb to decide which one will work better in a given case, but it's worth noting you can select the optimal scaler in a pipeline when doing e.g. GridSearchCV().
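To see the difference in practice, here is a small sketch on skewed synthetic data; the lognormal toy data and the parameter choices are assumptions for illustration.

```python
# Compare PowerTransformer (parametric) and QuantileTransformer (non-parametric)
# on a heavily right-skewed feature.
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # right-skewed input

pt = PowerTransformer(method="yeo-johnson")  # fits one lambda per feature
qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000)

X_pt = pt.fit_transform(X)
X_qt = qt.fit_transform(X)

# Skewness closer to 0 means a more symmetric, more Gaussian-looking output.
for name, Z in [("raw", X), ("PowerTransformer", X_pt), ("QuantileTransformer", X_qt)]:
    print(f"{name}: skewness = {skew(Z.ravel()):.3f}")
```

Plotting histograms of the raw and transformed columns is the quickest way to see how close each output is to a Gaussian for your own data.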
I have to fit the following exponential function to a time series (data).
$C(t) = C_{\infty}\left(1-\exp\left(-\frac{t}{\tau}\right)\right)$
I want to compute the time scale $\tau$ over which $C(t)$ approaches $C_{\infty}$. I would like to ask for suggestions on how $\tau$ can be computed. I found an example here that uses curve fitting, but I am not sure how to set up the problem described above with the curve_fit function in scipy.
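For reference, here is a minimal sketch of one way this could be set up with scipy.optimize.curve_fit; the placeholder data, starting values, and bounds are assumptions for illustration, not the actual dataset.

```python
# Fit C(t) = C_inf * (1 - exp(-t / tau)) to a time series with curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def model(t, C_inf, tau):
    return C_inf * (1.0 - np.exp(-t / tau))

# Placeholder data: replace with the real time series
t = np.linspace(0, 10, 50)
C = model(t, 2.0, 3.0) + np.random.default_rng(0).normal(0, 0.05, t.size)

# p0 gives rough starting values; the bounds keep both parameters positive
popt, pcov = curve_fit(model, t, C, p0=[C.max(), 1.0], bounds=(0, np.inf))
C_inf_hat, tau_hat = popt
print(f"C_inf = {C_inf_hat:.3f}, tau = {tau_hat:.3f}")
```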
One cannot expect a good fit along the whole curve with the chosen function.
This is because, especially at t=0, this function returns C=0 while the data value is C=2.5, which is very far off considering the order of magnitude.
Nevertheless, one can try to fit this function for a rough result. A non-linear regression calculation is necessary: this is the usual approach with available software, and it is the recommended method in the context of academic exercises.
Alternatively, and more simply, a linear regression can be used thanks to a non-conventional method explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales.
The result is shown below.
For a better fit, one has to take into account the almost constant value of the data in the neighborhood of t=0. Choosing a function made of two logistic functions would be recommended, but the calculation is more complicated.
In addition, after the OP changed the data:
The change of data makes the above answer out of date.
In fact, artificially changing the origin of the y-scale so that y=0 at t=0 changes nothing: the slope at t=0 of the chosen function is far from zero, while the slope of the data curve is almost zero. This remains incompatible.
Definitively, the chosen function y = C*(1-exp(-t/tau)) cannot fit the data correctly (neither the preceding data nor the new data).
As already pointed out, for a better fit one has to take into account the almost constant value of the data in the neighborhood of t=0. Choosing a function made of two logistic functions would be recommended, but the calculation is more complicated.
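If one wants to try the two-logistics suggestion, a rough sketch with scipy could look like the following, interpreting "a function made of two logistic functions" as a sum of two logistic terms; the parameter names, starting values, and placeholder data are assumptions, and good initial guesses matter a lot for such a non-linear fit.

```python
# Rough sketch: fit a sum of two logistic functions with scipy's curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def two_logistics(t, A1, k1, t1, A2, k2, t2):
    return (A1 / (1.0 + np.exp(-k1 * (t - t1)))
            + A2 / (1.0 + np.exp(-k2 * (t - t2))))

# t and C would be the observed time series; synthetic placeholders here
t = np.linspace(0, 20, 100)
C = (two_logistics(t, 1.5, 1.0, 3.0, 1.0, 0.5, 12.0)
     + np.random.default_rng(1).normal(0, 0.02, t.size))

# Rough starting guesses: split the amplitude in two and place the inflection
# points in the first and last quarters of the time range.
p0 = [C.max() / 2, 1.0, np.percentile(t, 25), C.max() / 2, 1.0, np.percentile(t, 75)]
popt, _ = curve_fit(two_logistics, t, C, p0=p0, maxfev=10000)
print(popt)
```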
I am struggling to find methods that can be used to detect the periodicity of a binary time series (the binary time series looks like 0,0,1,0,0,1,0,0,1... or 1,0,0,0,1,1,0,1,0,0,0,0...).
You can try to solve this problem with a Python library, I think. I have never used it myself, but when I googled "python how to detect a periodicity", one of the first links in the search results was this.
According to the package description:
Useful tools for analysis of periodicities in time series data.
Documentation: https://periodicity.readthedocs.io
Currently includes:
- Auto-Correlation Function
- Spectral methods:
  - Lomb-Scargle periodogram
  - Wavelet Transform
  - Hilbert-Huang Transform (WIP)
- Phase-folding methods:
  - String Length
  - Phase Dispersion Minimization
  - Analysis of Variance (soon™)
- Gaussian Processes:
  - george implementation
  - celerite implementation
  - pymc3 implementation (soon™)
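Independently of that package, a minimal autocorrelation-based sketch can already recover the period of a simple binary series; the example series and the lag search range below are assumptions for illustration.

```python
# Estimate the period of a binary series as the lag that maximises the
# (mean-removed) autocorrelation function.
import numpy as np

def dominant_period(series, max_lag=None):
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = x.size
    max_lag = max_lag or n // 2
    acf = np.array([np.dot(x[:n - k], x[k:]) for k in range(1, max_lag + 1)])
    return int(np.argmax(acf)) + 1  # lag with the strongest self-similarity

series = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print(dominant_period(series))  # prints 3 for this perfectly periodic example
```

For noisier series, a Lomb-Scargle periodogram or one of the other spectral methods listed above would be the more robust option.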
Hope this helps.
Despite going through lots of similar questions related to this, I still could not understand why some algorithms are susceptible to it while others are not.
So far I have found that SVM and K-means are susceptible to feature scaling, while Linear Regression and Decision Tree are not. Can somebody please explain why, either in general or in relation to these 4 algorithms?
As I am a beginner, please explain this in layman terms.
One reason I can think of off-hand is that SVM and K-means, at least with a basic configuration, use an L2 distance metric. An L1 or L2 distance between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear map that best describes the data, effectively choosing how to transform the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, rescaling a feature just rescales the corresponding coefficient, and the predictions and the quality of fit are unchanged. So the result is invariant to feature scaling.
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test; that count is what goes into your entropy function. Because this rule format does not depend on the scale of a dimension, and there is no continuous distance metric involved, we again have invariance.
Somewhat different reasons for each, but I hope that helps.
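To make the contrast concrete, here is a small sketch on toy data; the feature scales, labels, and model settings are assumptions for illustration.

```python
# K-means clusters change when features are rescaled, while a decision tree's
# threshold rules simply rescale with the feature and leave the fit unchanged.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200),        # feature on a small scale
                     rng.normal(0, 1000, 200)])    # feature on a huge scale
y = (X[:, 0] > 0).astype(int)                      # label depends only on feature 0
X_scaled = X / X.std(axis=0)                       # crude feature scaling

# K-means: the large-scale feature dominates the L2 distances on raw data,
# so the partition typically changes once the features are rescaled
# (an adjusted Rand index of 1 would mean identical partitions).
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print("K-means partition agreement:", adjusted_rand_score(labels_raw, labels_scaled))

# Decision tree: rules of the form "x < N" adapt to the feature scale,
# so the training accuracy is identical before and after rescaling.
acc_raw = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
acc_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y).score(X_scaled, y)
print("Tree accuracy raw vs scaled:", acc_raw, acc_scaled)
```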