Missing data when estimating a fixed effects model on panel data

I have panel data with 99 cross-sectional units and 7 time periods. When I estimate the fixed effects model in gretl, only 158 observations, 56 cross-sectional units and 4 time periods are used. My model has one dependent variable and 5 independent variables. Some values are missing, and I have been told this shouldn't be a problem when running a linear regression. But gretl drops every observation in which even a single variable is missing. Even when I have data for the other five variables, gretl removes the observation from the model completely because one value is missing. This reduced my dataset significantly. Could you please advise how to fix this?
I'm not very experienced with gretl, so I don't know what to do.
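Like most packages, gretl estimates on complete cases only, so any row with a missing value in any variable of the model is dropped (listwise deletion). A minimal pandas sketch, outside gretl, to see which variable is causing most of the loss; the file name and column names are hypothetical:

    import pandas as pd

    # Hypothetical file and column names -- replace with your own panel file.
    df = pd.read_csv("panel.csv")
    cols = ["y", "x1", "x2", "x3", "x4", "x5"]

    # Missing values per variable: the column with the most NaNs is the one
    # responsible for most of the dropped observations.
    print(df[cols].isna().sum())

    # Complete cases, i.e. the rows a listwise-deleting estimator actually uses.
    complete = df.dropna(subset=cols)
    print(len(complete), "of", len(df), "rows are complete cases")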

Related

Analyze a categorical variable with a continuous variable

I have a data set with more than 20 columns. In it, I have a main categorical variable (x) as the target. Assume x has 3 levels:
high
moderate
low
So I want to find out whether there is any relationship between each level of x (high, moderate, low) taken independently and the other variables.
I tried this in Python, using featurewiz and a simple correlation heat map, and also tried one-hot encoding, but it didn't go well.
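A minimal sketch of the one-vs-rest idea described above (one 0/1 dummy per level of x, correlated with each numeric column via the point-biserial correlation); the file name and the assumption that the other columns are numeric are mine, not the asker's:

    import pandas as pd
    from scipy.stats import pointbiserialr

    # Hypothetical file and target column name -- replace with your own.
    df = pd.read_csv("data.csv")
    numeric_cols = df.select_dtypes("number").columns

    for level in ["high", "moderate", "low"]:
        indicator = (df["x"] == level).astype(int)  # one-vs-rest dummy for this level
        for col in numeric_cols:
            r, p = pointbiserialr(indicator, df[col])
            print(level, col, round(r, 3), round(p, 4))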

The bounding box's position and size are incorrect; how can I improve its accuracy?

I'm using detectron2 to solve a segmentation task.
I'm trying to classify an object into 4 classes,
so I have used COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml.
I have applied 4 kinds of augmentation transforms, and after training I get a total loss of about 0.1.
But for some reason the bbox accuracy is not great on some images in the test set:
the bbox is drawn either larger or smaller than the object, or doesn't cover the whole object.
Moreover, the predictor sometimes draws several bboxes, assuming there are several different objects although there is only a single object.
Are there any suggestions for improving its accuracy?
Are there any good-practice approaches for resolving this issue?
Any suggestion or reference material would be helpful.
I would suggest the following:
1. Ensure that your training set contains the object you want to detect at all sizes: this way, the network learns that the object's size can vary and is less prone to overfitting (otherwise the detector could assume, for example, that your object is always large).
2. Add data. Rather than applying all types of augmentations, try adding much more data. The phenomenon of detecting several objects when there is only one leads me to believe that your network does not generalize well. Personally, I would opt for at least 500 annotations per class.
The biggest step towards improvement will be achieved by means of (2).
Once you have a decent baseline, you could also experiment with augmentations.
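For example, a rough sketch of scale-focused augmentation using detectron2's transforms API; the sizes and the trainer wiring below are illustrative, not taken from the asker's setup:

    from detectron2.data import DatasetMapper, build_detection_train_loader
    from detectron2.data import transforms as T
    from detectron2.engine import DefaultTrainer

    # Resize to several shortest-edge sizes so the object is seen at different
    # scales, plus a horizontal flip; the values here are illustrative only.
    AUGS = [
        T.ResizeShortestEdge(short_edge_length=(480, 640, 800), sample_style="choice"),
        T.RandomFlip(horizontal=True),
    ]

    class AugTrainer(DefaultTrainer):
        @classmethod
        def build_train_loader(cls, cfg):
            mapper = DatasetMapper(cfg, is_train=True, augmentations=AUGS)
            return build_detection_train_loader(cfg, mapper=mapper)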

Decorrelating 3 categorical variables

I have a table of 3 categorical variables (salary, face_amount, and area_code) that I was looking to decorrelate from one another. In other words, I'm trying to find how much of some output can be attributed solely to each one of these variables. So I would want to see how much of this output is due to salary alone, rather than to salary's correlation with face_amount, for example, if that makes sense.
I noticed that Multiple Correspondence Analysis exists for this type of problem and will decorrelate the variables; however, the issue I'm having is that I need the original variables, not the components produced by multiple correspondence analysis. I'm very confused as to how to analyze this type of problem and would appreciate any help.
Sample of data:
salary face_amount area_code
'1-50' 1000 67
'1-50' 500 600
'1-50' 500 600
'51-200' 2000 623
'51-200' 1000 623
'201-500' 500 700
I'm not exactly sure how to approach this kind of problem.
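Not something the thread settles, but one possible first step is simply to quantify how strongly the three categorical variables are associated with each other, for example with Cramér's V. This is only a diagnostic sketch, and the file name is a placeholder:

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def cramers_v(a, b):
        """Cramer's V association between two categorical series (0 = none, 1 = perfect)."""
        table = pd.crosstab(a, b)
        chi2, _, _, _ = chi2_contingency(table)
        n = table.to_numpy().sum()
        r, k = table.shape
        return np.sqrt((chi2 / n) / (min(r, k) - 1))

    # Hypothetical file name -- replace with your own data.
    df = pd.read_csv("policies.csv")
    pairs = [("salary", "face_amount"), ("salary", "area_code"), ("face_amount", "area_code")]
    for x, y in pairs:
        print(x, y, round(cramers_v(df[x], df[y]), 3))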

Excel Polynomial Regression with multiple variables

I have seen a lot of tutorials online on how to do polynomial regression in Excel and how to do multiple regression, but none that explain how to handle multiple variables AND polynomial terms at the same time.
In my spreadsheet (screenshot not shown), the left columns contain all my variables X1, X2, X3, X4 (say they are features of a car), and Y1 is the price of the car I am looking for.
I have about 5000 rows of data obtained by running a model with various values of X1, X2, X3, X4, and I am looking to fit a regression so that I can get a good estimate of the model's output without having to run it (saving me valuable computing time).
So far I've managed to do multiple linear regression using the Data Analysis pack in Excel, just using X1, X2, X3, X4. I noticed, however, that the regression looks very messy and inaccurate in places, because my variables X1, X2, X3, X4 affect my output Y1 non-linearly.
I had a look online, and to add polynomials to the mix, tutorials suggest adding an X^2 column. But when I do that (see the right part of the chart), my regression is much worse than when I use linear fits.
I know that polynomials can over-fit the data, but I thought that using a quadratic form was safe, since the regression would only have to return a coefficient of 0 to ignore any excess polynomial orders.
Any help would be very welcome.
For info, I get an adjusted R^2 of 0.91 for linear fits and 0.66 when I add a few X^2 columns.
So far this is the best regression I can get (black line is 1:1; chart not shown):
As you can see, I would like to improve the fit in the bottom-left and top-right parts of the curve.
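Not an Excel fix, but as a cross-check outside Excel, a quadratic fit with cross terms can be sketched in a few lines of Python; the file and column names are placeholders:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Hypothetical file and column names -- replace with your own export.
    df = pd.read_csv("runs.csv")
    X, y = df[["X1", "X2", "X3", "X4"]], df["Y1"]

    # A degree-2 expansion adds squared terms *and* cross terms (X1*X2, ...),
    # which a plain "add an X^2 column" approach leaves out.
    model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          LinearRegression())
    model.fit(X, y)
    print("R^2 on the training data:", model.score(X, y))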

SVM and cross validation

The problem is as follows. When I do support vector machine training, suppose I have already performed cross-validation on 10000 training points with a Gaussian kernel and have obtained the best parameters C and \sigma. Now I have another 40000 new training points, and since I don't want to waste time on cross-validation, I stick to the original C and \sigma obtained from the first 10000 points and train on the entire 50000 points with these parameters. Is there any potentially major problem with this? It seems that for C and \sigma in some range the final test error wouldn't be that bad, so the above process seems okay.
There is one major pitfall of such an approach. Both C and sigma are data dependent. In particular, it can be shown that the optimal C strongly depends on the size of the training set. So once you make your training data 5 times bigger, even if it brings no "new" knowledge, you should still find a new C to get the same model as before. You can follow such a procedure, but keep in mind that the best parameters for the smaller training set do not have to be the best for the bigger one (although they sometimes still are).
To see the bigger picture: if this procedure were fully "ok", then why not fit C on even smaller data? 5 times smaller? 25 times smaller? Maybe on a single point per class? 10,000 may seem like "a lot", but it depends on the problem considered. In many real-life domains this is just a "regular" (biology) or even "very small" (finance) dataset, so you cannot be sure your procedure is fine for this particular problem until you test it.
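For instance, instead of a full grid search on the 50000 points, one could re-tune over a small grid centred on the previously found values. A scikit-learn sketch, where the data and the old parameter values are placeholders (note that scikit-learn parameterises the RBF kernel by gamma = 1 / (2 * sigma^2)):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Stand-in data and placeholder values found on the smaller subset.
    X_full, y_full = make_classification(n_samples=2000, n_features=20, random_state=0)
    old_C, old_sigma = 10.0, 0.5
    old_gamma = 1.0 / (2.0 * old_sigma ** 2)  # scikit-learn's RBF uses gamma, not sigma

    # Coarse grid centred on the old optimum, re-tuned on the full training set.
    param_grid = {"C": [old_C / 10, old_C, old_C * 10],
                  "gamma": [old_gamma / 10, old_gamma, old_gamma * 10]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)
    search.fit(X_full, y_full)
    print(search.best_params_)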
