Fitting a regression model - statistics

I'm trying to solve a question from a Chinese textbook on linear statistical models; the chapter containing this question is about weighted least squares.
The question and the way I solved it are as follows:
As you can see, the predicted values are very different from the actual values, so I wonder whether I solved it correctly.
Could somebody tell me what is wrong with it?
And if there are mistakes, how do I correct them?

The predicted values are actually not that far off from the actual values. This seems fine and looks like a sensible result here.


Gradient boosting machine formula issue

In the related formula I don't exactly understand what the first term is. I know f_m(x) is the weak learner, but I don't have any idea what the ziita term refers to.

How do I analyze the change in the relationship between two variables?

I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology because google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and fit a linear regression. I thought this might be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line says they should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest. Namely, you would estimate the squared error (or some other measure of fit) on the full sample and on the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change.
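To make the comparison concrete, here is a rough sketch (my own illustration with simulated data, not part of the original answer) of the split-sample idea: fit a line on the full sample and on the two sub-samples around a candidate break point, then compare the sums of squared errors.
import numpy as np

def sse(x, y):
    # sum of squared errors from a simple least-squares line fit
    slope, intercept = np.polyfit(x, y, 1)
    return np.sum((y - (intercept + slope * x)) ** 2)

rng = np.random.default_rng(1)
x = np.arange(100.0)
y = 0.5 * x + rng.normal(scale=5.0, size=100)
y[60:] += 0.5 * x[60:]      # the relationship changes after observation 60
k = 60                      # candidate break point (assumed known here)

sse_full = sse(x, y)
sse_split = sse(x[:k], y[:k]) + sse(x[k:], y[k:])
print("SSE, full sample :", round(sse_full, 1))
print("SSE, with break  :", round(sse_split, 1))
A large drop in the SSE when the sample is split favours the model with the break; a Chow test formalises this comparison.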
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
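As a much simpler stand-in for a full state-space model, a rolling-window regression already shows whether the slope is drifting. This is my own sketch with simulated data, not the estimation method described above:
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(100.0)
y = 0.5 * x + rng.normal(scale=5.0, size=100)
y[60:] += 0.5 * x[60:]      # the slope increases in the later part of the sample

window = 30
slopes = [np.polyfit(x[i:i + window], y[i:i + window], 1)[0]   # local slope
          for i in range(len(x) - window + 1)]
print("first-window slope:", round(slopes[0], 2))
print("last-window slope :", round(slopes[-1], 2))
A persistent trend in the rolling slope is a sign that the relationship between the variables is evolving over time.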
I hope this helps!

What are the things to note in order to maximize the auc_roc_score?

What are the tips & tricks to improve our auc_roc_score?
Example:
Is balanced data required?
Is recall more important than precision?
Is oversampling usually better than undersampling?
Thanks again!
This is highly dependent upon the type of data in use, the type of model it is trained on, and the hyperparameters currently used. The ROC AUC is just the outcome of how well the data preprocessing and model building are done, so you must look into ways of improving your model. Also, precision and recall are each useful in their own way; it depends on the scenario.
Precision answers the question:
What proportion of the predicted positive labels was actually correct?
whereas recall answers:
What proportion of the actually positive labels was identified correctly?
Thus you would want a higher recall when identifying, say, a COVID-19 patient, while you would prefer a higher precision when the cost of acting on a positive prediction is high and the cost of not acting is low.
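For concreteness, here is a small sketch (mine, not the answerer's) of how these quantities are computed with scikit-learn, where y_true are the actual labels and y_prob the model's predicted probabilities for the positive class:
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.55]
y_pred = [int(p >= 0.5) for p in y_prob]   # hard labels at a 0.5 threshold

print("precision:", precision_score(y_true, y_pred))   # share of predicted positives that are correct
print("recall   :", recall_score(y_true, y_pred))      # share of actual positives that were found
print("ROC AUC  :", roc_auc_score(y_true, y_prob))     # threshold-free ranking quality
Note that the ROC AUC is computed from the probabilities rather than the thresholded labels, which is why changing the decision threshold alone does not change it.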
Also, what exactly you mean by balanced data varies from situation to situation.
So if you could specify the kind of model, problem, and data in use, we may be able to help you more.
Please also share the code of your model.

How can I get the initial values from my dataset for a combined lorentzian and gaussian fit?

I am trying to fit data using standard model functions (Lorentzian & Gaussian) from the lmfit package. The program works quite well for some data sets, but for another one it is not able to fit because the initial values don't seem right. Is there any algorithm which can extract the initial values from the data set and do some iterations in order to find the best fit?
I tried some common methods like a brute-force algorithm, but the results are not satisfactory and it costs a lot of time.
It is always recommended to provide a small, complete example script that shows the problem you are having. How could we know why it works in some cases and not in others?
lmfit.GaussianModel and lmfit.LorentzianModel both have guess methods. These should work reasonably well for data with an isolated peak, used like:
import lmfit

# build a Gaussian model and let it guess initial parameters from the data
model = lmfit.models.GaussianModel()
params = model.guess(ydata, x=xdata)
for p in params.values():
    print(p)            # inspect the guessed starting values

# fit using the guessed parameters as the starting point
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
If the data doesn't have a clear isolated peak, that might not work so well.
If finding the peak(s) is the actual problem, try scipy.signal.find_peaks (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html) or peakutils (https://peakutils.readthedocs.io/en/latest/). Either of these should give you a good estimate of the center parameter, which is probably the one most likely to cause bad fits if a poor initial value is given.
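Putting the two suggestions together, a combined Gaussian + Lorentzian fit might look like the sketch below. This is an illustration under my own assumptions (xdata and ydata are 1-D numpy arrays with at least one clearly visible peak), not a guaranteed recipe:
import numpy as np
from scipy.signal import find_peaks
import lmfit

gauss = lmfit.models.GaussianModel(prefix='g_')
lorentz = lmfit.models.LorentzianModel(prefix='l_')
model = gauss + lorentz

# seed the centers with the most prominent peaks found in the raw data
peaks, props = find_peaks(ydata, prominence=0.05 * ydata.max())
order = np.argsort(props['prominences'])[::-1]
centers = xdata[peaks[order][:2]]   # assumes at least one peak was found

params = model.make_params(
    g_center=centers[0], g_sigma=1.0, g_amplitude=ydata.max(),
    l_center=centers[-1], l_sigma=1.0, l_amplitude=ydata.max(),
)

result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
The prefixes keep the two components' parameters apart; the sigma and amplitude starting values here are arbitrary and may need adjusting for your data scale.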

Logistic regression with missing values

Can I fit a logistic regression with missing values?
I have many continuous attributes and some categorical ones; could I set them as user-missing? Would that be useful?
To run a regression analysis you need all variables measured for each observation. Perhaps another technique works with missing attributes, but not regression.
BTW, you should try posting the question at https://stats.stackexchange.com/
HTH!
Most regression procedures require complete data, but there are a variety of methods for dealing with missing values. This is a subtle topic, so I won't pretend to give a complete answer here, and recommend doing some reading on the subject. Briefly, though:
1. Never delete observations to fix this problem.
2. Deletion of variables is always allowed, but obviously it is quite severe in terms of one's data budget.
3. Filling in missing values with global constants, such as the mean or median of the non-missing values, should be done sparingly (when the proportion of missings is very low) if at all.
4. Filling in missing values with values chosen based on the other independent variables is preferred over number 3, above.
To learn more about this subject, seek information on the terms "imputation", especially "single imputation" and "multiple imputation", "missing at random" and "missing completely at random".
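As a concrete illustration of options 3 and 4 above (my own sketch with simulated data, using scikit-learn rather than any particular package from the question), one can compare a global-constant fill with a model-based fill inside a logistic regression pipeline:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan   # ~10% of entries missing completely at random

# option 3: fill with a global constant (the median of the non-missing values)
median_fill = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())

# option 4: fill using the other independent variables (single imputation)
model_fill = make_pipeline(IterativeImputer(random_state=0), LogisticRegression())

for name, pipe in [("median fill", median_fill), ("iterative fill", model_fill)]:
    pipe.fit(X, y)
    print(name, "training accuracy:", round(pipe.score(X, y), 3))
Multiple imputation goes a step further by repeating the model-based fill several times and pooling the resulting estimates.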
