I am currently using a machine learning model (written in Python 3) to predict product delivery dates, but due to the nature of our business, customers always complain when the actual delivery date is later than the predicted one. So I am trying to force the predicted date to always be later than the actual delivery date, while still keeping it as close to the actual date as possible. Can anyone advise me how to do this, or suggest any particular algorithms/methods I can search for? Thank you in advance!
When you use a prediction from a machine learning model, you're saying that this date is the most probable date on which the item will be delivered.
Every prediction has an error associated with the real values. You can describe these errors as a normal distribution of the probability of occurrence over time. Thus, there is a 50% chance that the real value falls before your prediction and a 50% chance that it falls after.
You'll hardly ever predict the exact value.
But how can you overcome this?
You can use the root mean squared error (RMSE). This metric tells you "how far" your predictions are from the real values on average. So, if you add 2 times the RMSE to your predictions before sending them to users, the probability that the real value falls after your prediction is below 5% (roughly 2.5% under the normal-error assumption).
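A minimal sketch of that adjustment, assuming `actual_days` and `predicted_days` are validation-set delivery times expressed in days since the order date (the arrays below are made-up placeholders):

```python
import numpy as np

# Validation-set delivery times in days since the order date (placeholder values).
actual_days = np.array([3.0, 5.0, 4.0, 7.0, 6.0])
predicted_days = np.array([3.5, 4.0, 4.5, 6.0, 6.5])

# RMSE measures the typical size of the prediction error.
rmse = np.sqrt(np.mean((actual_days - predicted_days) ** 2))

# Quote a date shifted 2 * RMSE later: under a normal-error assumption only a
# small fraction of deliveries should arrive after the quoted date.
quoted_days = predicted_days + 2 * rmse
print(rmse, quoted_days)
```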
I have the following task:
I have a database of output measurements from a solar power system and I am supposed to detect errors, or rather abnormalities. The following link shows example data. What apparently happened is that the sensors didn't provide any measurements at 11:00am, and the same happened from 16:00 to 18:00. Then, at 12:00pm and at 19:00, all the prior missing measurements were summed up into a single reading.
I am supposed to create a system that automatically detects these types of abnormalities, and my first thought is to use classification (maybe decision trees or a Naive Bayes classifier) to predict whether a row is an error or not. My question is whether this is a reasonable method, or whether it is completely wrong to use classification here (and what other methods could solve this problem)?
Thanks in advance
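For illustration, here is a minimal sketch of the classification framing described above; the column names, the hand-crafted features and the labels are all assumptions, not taken from the linked data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-in data: the column names, features and labels are assumptions.
df = pd.DataFrame({
    "output_kwh": [1.2, 1.3, 0.0, 2.6, 1.4, 1.3, 0.0, 0.0, 3.9, 1.2],
    "is_error":   [0,   0,   1,   1,   0,   0,   1,   1,   1,   0],  # manual labels
})

# Simple row-level features: the value itself, whether it is zero, and how it
# compares to the previous reading (a summed-up catch-up reading looks like a spike).
df["prev"] = df["output_kwh"].shift(1).fillna(df["output_kwh"])
df["is_zero"] = (df["output_kwh"] == 0).astype(int)
df["ratio_to_prev"] = df["output_kwh"] / df["prev"].replace(0, 1)

X = df[["output_kwh", "is_zero", "ratio_to_prev"]]
y = df["is_error"]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict(X))
```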
I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed and that this may require a transformation (log, SQRT, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict the highest pain score on hospital admission:
DV: numeric pain score (0 = no pain to 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate, and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and each IV.
As part of the SPSS output, it provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these checks all that is needed to verify the normality assumption after fitting a model?
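(For reference, the same residual checks can be reproduced outside SPSS. A minimal Python sketch, in which the data frame and column names, pain, age, weight, sex, deprivation and race, are synthetic stand-ins for the real admissions data:)

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic stand-in for the admissions data; only the structure matters here.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "pain": rng.integers(0, 6, n),
    "age": rng.normal(55, 15, n),
    "weight": rng.normal(75, 12, n),
    "sex": rng.choice(["F", "M"], n),
    "deprivation": rng.integers(1, 6, n),
    "race": rng.choice(["A", "B", "C"], n),
})

model = smf.ols("pain ~ age + weight + C(sex) + deprivation + C(race)", data=df).fit()

# Residuals vs fitted values: look for a patternless scatter around zero.
plt.scatter(model.fittedvalues, model.resid)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot of residuals: the normality assumption concerns these residuals,
# not the raw independent variables.
sm.qqplot(model.resid, line="s")
plt.show()
```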
Many Thanks in advance!
I'm working on a protein classification problem with SVMs, so I use LibSVM for string data. The string kernel defined in LibSVM is the edit-distance kernel, which depends on a parameter gamma. During cross-validation, whatever C and gamma parameters I choose, I get 75% accuracy every time! Moreover, even when changing the number of training-set patterns, I get the same accuracy. I use the SCOP database. I have no idea what causes this behavior!
Look at the counts of each type of misclassification error. If you are getting a constant error rate like this, it is quite possible that every observation is being assigned to the same class. For example, if 75% of your training observations are in one class and the classifier assigns every observation to that class, then you'll see 75% accuracy (an error rate of 25%).
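One quick way to check this is to print a confusion matrix. A minimal sketch, where `y_true` and `y_pred` are stand-ins for one cross-validation fold's labels and LibSVM's predictions:

```python
from sklearn.metrics import confusion_matrix

# If every prediction lands in the majority class, the confusion matrix makes it obvious.
y_true = [1, 1, 1, 0, 1, 0, 1, 1]   # 75% of observations in class 1
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]   # classifier predicts class 1 for everything

print(confusion_matrix(y_true, y_pred))
# [[0 2]
#  [0 6]]  -> all observations assigned to class 1, accuracy = 6/8 = 75%
```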
I'm trying to model 10 years of monthly time-series data that is very choppy and overall has an upward trend. At first glance it looks like a strongly seasonal series; however, the test results indicate that it is definitely not seasonal. This is a pricing variable that I'm trying to model as a function of the macroeconomic environment, such as interest rates and yield curves.
I've tried linear OLS regression (proc reg), but I don't get a very good model with that.
I've also tried autoregressive error models (proc autoreg), but they capture 7 lags of the error term as significant factors. I don't really want to include that many lags of the error term in the model. In addition, most of the macroeconomic variables become insignificant when I include all these error lags in the model.
Any suggestions on modeling methods/techniques that could help me model this choppy data would be greatly appreciated.
On a past project, we used proc arima to predict future product sales based on a time series of past sales:
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_arima_sect019.htm (note that ARIMA is also an autoregressive model)
But as Joe said, for proper statistical feedback on your question, you're better off asking on the Cross Validated site.
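If SAS isn't a hard requirement, a rough Python analogue of the proc arima approach (regression on the macro variables with ARIMA errors) can be sketched with statsmodels; the series, the single macro regressor and the (1, 1, 1) order below are placeholders, not a recommendation:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series standing in for the real pricing and macro data.
idx = pd.date_range("2010-01-31", periods=120, freq="M")
rng = np.random.default_rng(0)
price = pd.Series(np.cumsum(rng.normal(0.2, 1.0, 120)) + 100, index=idx)
macro = pd.DataFrame({"interest_rate": rng.normal(2.0, 0.5, 120)}, index=idx)

# Regression on the macro variables with ARIMA(1, 1, 1) errors: the serial
# correlation lives in the error model instead of many explicit error lags.
result = ARIMA(price, exog=macro, order=(1, 1, 1)).fit()
print(result.summary())

# Forecast 6 months ahead (future macro values are needed; reusing the last
# observed ones here purely as a placeholder).
print(result.forecast(steps=6, exog=macro.tail(6).to_numpy()))
```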
I am trying to predict the inter-arrival time of incoming network packets. I measure the inter-arrival times of network packets and represent this data in the form of binary features: xi = 0,1,1,1,0,..., where xi = 0 if the inter-arrival time is less than a break-even time and 1 otherwise. The data has to be mapped into two possible classes C = {0,1}, where C = 0 represents a short inter-arrival time and C = 1 represents a long one. I want to implement the classifier in an online fashion, so that as soon as I observe a feature vector xi = 0,1,1,0,..., I calculate the MAP class. Since I don't have a prior estimate of the conditional and prior probabilities, I initialize them as follows:
p(x=0|c=0)=p(x=1|c=0)=p(x=0|c=1)=p(x=1|c=1)=0.5
p(c=0)=p(c=1)=0.5
For each feature vector (x1=m1,x2=m2,...,xn=mn), when I output a class C, I update the conditional and prior probabilities as follows:
p(xi=mi|y=c)=a+(1-a)*p(xi=mi|y=c)
p(y=c)=b+(1-b)*p(y=c)
The problem is that I am always getting a biased prediction. Since the number of long inter-arrival times is comparatively smaller than the number of short ones, the posterior of the short class always remains higher than that of the long class. Is there any way to improve this, or am I doing something wrong? Any help will be appreciated.
Since you have a long time series, the best path would probably be to take into account more than a single previous value. The standard way of doing this would be to use a time window, i.e. split the long vector Xi into overlapping pieces of constant length, with the last value treated as the class, and use them as the training set. This could also be done on streaming data in an online manner, by incrementally updating the NB model with new data as it arrives.
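A minimal sketch of that windowing idea with scikit-learn's incremental Naive Bayes; the sequence, the window size and the batch size are all made-up placeholders:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Slice the long 0/1 sequence into overlapping windows, use the last value of
# each window as the class, and update the NB model incrementally as batches arrive.
sequence = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0])
window = 4  # 3 previous values as features, the 4th as the label (assumed size)

windows = np.array([sequence[i:i + window]
                    for i in range(len(sequence) - window + 1)])

nb = BernoulliNB()
for start in range(0, len(windows), 4):          # pretend data arrives in batches
    batch = windows[start:start + 4]
    nb.partial_fit(batch[:, :-1], batch[:, -1], classes=[0, 1])

# Predict the class of the next inter-arrival time from the latest 3 values.
print(nb.predict(sequence[-(window - 1):].reshape(1, -1)))
```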
Note that using this method, other regression algorithms might end up being a better choice than NB.
Weka (version 3.7.3 and up) has a very nice dedicated tool supporting time-series analysis. Alternatively, MOA is also based on Weka and supports modeling of streaming data.
EDIT: it might also be a good idea to move from binary features to the real values (maybe normalized) and to apply the threshold post-classification. This might give more information to the regression model (NB or other), allowing better accuracy.
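A sketch of that variant, using an online regression model on the real (normalized) inter-arrival times and applying the break-even threshold only to the predicted value; the data, the window size and the break-even value are placeholders:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic inter-arrival times standing in for the measured packet data.
rng = np.random.default_rng(0)
times = rng.exponential(2.0, 500)                 # raw inter-arrival times
norm = (times - times.mean()) / times.std()       # normalized values
break_even = (5.0 - times.mean()) / times.std()   # threshold on the same scale

# Overlapping windows of real values; the next value is the regression target.
window = 5
X = np.array([norm[i:i + window] for i in range(len(norm) - window)])
y = norm[window:]

reg = SGDRegressor(random_state=0)
for start in range(0, len(X), 50):                # pretend data arrives in chunks
    reg.partial_fit(X[start:start + 50], y[start:start + 50])

# Predict the next inter-arrival time, then threshold it post-classification.
next_pred = reg.predict(norm[-window:].reshape(1, -1))[0]
predicted_class = int(next_pred > break_even)     # 1 = long inter-arrival time
print(next_pred, predicted_class)
```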