Predict Measurement Errors/Abnormalities - statistics

I have the following task:
I have a database of output measurements from a solar power system, and I am supposed to detect errors, or rather abnormalities. The following link shows example data. What apparently happened is that the sensors didn't provide any measurements at 11:00am, and the same happened from 16:00 to 18:00. Then, at 12:00pm and at 19:00, all of the prior missing measurements were summed up.
I am supposed to create a system that automatically detects these types of abnormalities, and my first thought is to use classification (maybe decision trees or a naive Bayes classifier) to predict whether a row is an error or not. My question is whether this is a reasonable approach, or whether it is completely wrong to use classification here (and, if so, what other methods could solve this problem)?
Thanks in advance
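
A minimal, purely rule-based sketch of one label-free alternative, assuming the readings sit in a pandas DataFrame with hypothetical columns timestamp and output, and that missing hours show up as NaN or zero. It flags the dropout rows and the "catch-up" rows that follow them by comparing each reading with a rolling median of recent output:

```python
import pandas as pd

def flag_dropout_and_catchup(df, window=24, spike_factor=3.0):
    """Flag hours with no output and the 'catch-up' rows right after them.

    df: DataFrame with columns 'timestamp' and 'output' (hypothetical names).
    window: number of past rows used as the rolling baseline.
    spike_factor: multiple of the baseline above which a reading counts as a spike.
    """
    df = df.sort_values("timestamp").reset_index(drop=True)

    # Rows where the sensor apparently reported nothing.
    dropout = df["output"].isna() | (df["output"] == 0)

    # Baseline from recent readings, shifted so a row is not compared with itself.
    baseline = df["output"].rolling(window, min_periods=4).median().shift(1)

    # A 'catch-up' row: an unusually large reading immediately after a dropout row.
    after_dropout = dropout.shift(1, fill_value=False)
    df["dropout"] = dropout
    df["catchup"] = (df["output"] > spike_factor * baseline) & after_dropout
    return df
```

A classifier such as a decision tree or naive Bayes is also viable, but it needs rows labelled as error/non-error to train on; a rule or unsupervised anomaly detector like the sketch above does not.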

Related

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that the independent variables need to be normally distributed, and that this may require a transformation (log, square root, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain -> 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, square root) be appropriate, and why? Is it best to do this before or after fitting the model? I assume I am trying to get close to a linear relationship between my DV and IVs.
As part of its output, SPSS provides plots of the standardised residuals against the predicted values, and also normal P-P plots of the standardised residuals. Are these checks all that is needed to verify the normality assumption after fitting a model?
Many Thanks in advance!
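
Outside SPSS, the same post-fit checks can be sketched in Python with statsmodels; the file name and all column names below (pain_score, age, weight, sex, deprivation, race) are hypothetical stand-ins for the actual data:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# One row per admission; column names are placeholders for the real variables.
df = pd.read_csv("admissions.csv")

# Multiple regression; C() treats sex and race as nominal factors,
# while deprivation (ordinal) is entered here as a numeric score.
model = smf.ols(
    "pain_score ~ age + weight + C(sex) + deprivation + C(race)", data=df
).fit()
print(model.summary())

# The normality assumption applies to the residuals, not to the raw predictors.
resid = model.resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model.fittedvalues, resid, s=10)   # residuals vs fitted: look for funnels or curves
axes[0].axhline(0, color="grey")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")
sm.qqplot(resid, line="45", fit=True, ax=axes[1])  # roughly straight line if residuals are normal
plt.tight_layout()
plt.show()
```

These are close analogues of the SPSS residual-versus-predicted plot and the normal P-P plot mentioned above (the second panel is a Q-Q plot, which serves the same purpose).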

Problem when using Machine learning to predict product delivery date

I am currently using a machine learning model (written in Python 3) to predict the product delivery date, but due to the nature of our business, customers always complain when the actual delivery date is later than the predicted one. So I want to force the predicted date to always be later than the actual delivery date, while keeping it as close to the actual date as possible. Can anyone advise me how to do this, or suggest particular algorithms/methods I can search for? Thank you in advance!
When you use a prediction from a machine learning model, you're saying that this date is the most probable date on which the product will be delivered.
Every prediction has an error relative to the real value. You can describe these errors as a roughly normal distribution, so there is about a 50% chance that the real delivery date falls before your prediction and about a 50% chance that it falls after.
You'll hardly predict the exact value.
But how can you overcome this?
You can use the root mean squared error (RMSE). This metric tells you "how far", on average, your predictions are from the real values. So, if you add 2 times the RMSE to your predictions before sending them to users, the probability that the real delivery date falls after your prediction drops below 5% (under the roughly-normal-errors assumption above).
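
A minimal scikit-learn sketch of this idea, plus a quantile-regression alternative that targets a high percentile of the delivery time directly; the data below are random placeholders, not from the question:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder data standing in for real features and delivery times in days.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 7 + X @ rng.normal(size=5) + rng.gamma(2.0, 1.0, size=1000)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Option 1: ordinary regression plus a safety margin of 2 * validation RMSE.
mean_model = GradientBoostingRegressor().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_val, mean_model.predict(X_val)))
padded_pred = mean_model.predict(X_val) + 2 * rmse

# Option 2: quantile regression, predicting the 95th percentile directly,
# so roughly 95% of actual deliveries land on or before the quoted date.
q95_model = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X_train, y_train)
quantile_pred = q95_model.predict(X_val)

print("share late with padded RMSE:", np.mean(y_val > padded_pred))
print("share late with 95th percentile:", np.mean(y_val > quantile_pred))
```

The quantile-regression route does not rely on the errors being normal, which can matter if delivery times are skewed.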

Tensorflow object detection API - validation loss behaviour

I am trying to use the TensorFlow object detection API to recognize a specific object (guitars) in pictures and videos.
As for the data, I downloaded the images from the OpenImage dataset, and derived the .tfrecord files. I am testing with different numbers, but for now let's say I have 200 images in the training set and 100 in the evaluation one.
I'm training the model using "ssd_mobilenet_v1_coco" as a starting point and the "model_main.py" script, so that I can get both training and validation results.
When I visualize the training progress in TensorBoard, I get plots of the training loss and the validation loss, respectively (plots not reproduced here).
I am generally new to computer vision and trying to learn, so I was trying to figure out the meaning of these plots.
The training loss goes as expected, decreasing over time.
In my (probably simplistic) view, I was expecting the validation loss to start at high values, decrease as training goes on, and then start increasing again if the training goes on for too long and the model starts overfitting.
But in my case, I don't see this behavior for the validation curve, which seems to be trending upwards basically all the time (excluding fluctuations).
Have I been training the model for too little time to see the behavior I'm expecting? Are my expectations wrong in the first place? Am I misinterpreting the curves?
Ok, I fixed it by decreasing the initial_learning_rate from 0.004 to 0.0001.
It was the obvious solution, considering the wild oscillations of the validation loss, but at first I thought it wouldn't work, since there already seems to be a learning rate scheduler in the config file.
However, immediately below that (in the config file) there's a num_steps option, and the comment states:
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
Honestly, I don't remember if I commented out the num_steps option... If I didn't, it seems my learning rate was kept at the initial value of 0.004, which turned out to be too high.
If I did comment it out (so that the learning rate schedule was active), I guess that, instead of decaying, it still started from too high a value.
Anyway, it's working much better now, I hope this can be useful if anyone is experiencing the same problem.

How to efficiently implement training multiple related time series in Keras?

I have 5 time series that I want a neural network to predict. The time series are related to each other. Each time series consists of numbers between 0 and 100. I want to predict the next number for each time series. I already have a model to train one time series using a GRU and that works reasonably well. I have tried two strategies:
I normalized the numbers and framed it as a regression problem. The best validation accuracy so far is 0.38.
I one-hot-encoded the time series, and this works significantly better (about 0.15 higher accuracy), but it costs 100 times as much memory.
For 5 time series, I tried 5 independent models, but in that case the relationship between the 5 time series was lost. I am wondering what an efficient strategy to proceed might be. I can think of two myself, but I might be missing something:
I can stack the inputs so that I have a five-hot-encoded input instead of 5 one-hot-encoded ones. Can this be done?
I can create 5 models and merge them. I am not sure what to do with the output. Should I split the model, with one output for each time series?
Is there a strategy I have overlooked? Memory is a problem: with thousands of time series, each with a sample length of 100, the data uses a lot of memory and processing time. I Googled around but could not find an efficient strategy. Could someone suggest how to implement this efficiently in Keras?
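
A minimal Keras sketch of one option, with hypothetical shapes: treat the 5 related series as a single multivariate sequence of shape (timesteps, 5), feed it to one shared GRU, and predict the next value of all 5 series from a 5-unit regression head. This keeps memory at the normalized-regression level while still letting the network see the relationships between the series:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 100   # input window length (assumption)
N_SERIES = 5   # number of related series

# Random placeholder data: (samples, timesteps, series), values scaled to [0, 1].
X = np.random.rand(1000, WINDOW, N_SERIES)
y = np.random.rand(1000, N_SERIES)   # next value of each of the 5 series

model = keras.Sequential([
    layers.Input(shape=(WINDOW, N_SERIES)),
    layers.GRU(64),          # one shared GRU reads all 5 series jointly
    layers.Dense(N_SERIES),  # one regression output per series
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, validation_split=0.2, epochs=10, batch_size=32)
```

If thousands of series or long histories don't fit in memory, the same model can be fed from a generator (keras.utils.Sequence or a tf.data pipeline) that builds the windows on the fly instead of materializing one big array.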

Time Series Modeling of Choppy Data

I'm trying to model 10 years of monthly time series data that is very choppy and, overall, has an upward trend. At first glance it looks like a strongly seasonal series; however, the test results indicate that it is definitely not seasonal. This is a pricing variable that I'm trying to model as a function of the macroeconomic environment, such as interest rates and yield curves.
I've tried linear OLS regression (proc reg), but I don't get a very good model with that.
I've also tried autoregressive error models (proc autoreg), but they capture 7 lags of the error term as significant factors. I don't really want to include that many lags of the error term in the model. In addition, most of the macroeconomic variables become insignificant when I include all these error lags in the model.
Any suggestions on modeling methods/techniques that could help me model this choppy data are really appreciated.
On a past project, we used proc arima to predict future product sales based on a time series of past sales:
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_arima_sect019.htm (note that ARIMA is also an autoregressive model)
But as Joe said, for proper statistical feedback on your question, you're better off asking on the Cross Validated site.
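
Outside SAS, the same kind of model (regression on the macro drivers with ARIMA errors, rather than 7 raw error lags) can be sketched in Python with statsmodels; the file and column names below are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm

# Monthly data with the price series and macro drivers (hypothetical column names).
df = pd.read_csv("monthly_data.csv", parse_dates=["date"], index_col="date")
y = df["price"]
exog = df[["interest_rate", "yield_spread"]]

# Regression with ARIMA(1, 1, 1) errors: the exogenous macro variables explain the level,
# while a short ARMA structure absorbs the autocorrelation of the residuals.
model = sm.tsa.statespace.SARIMAX(y, exog=exog, order=(1, 1, 1), trend="c")
result = model.fit(disp=False)
print(result.summary())

# One-step-ahead in-sample predictions for a quick check of the fit.
pred = result.get_prediction().predicted_mean
```

The (p, d, q) order is only an assumption here; in practice it would be chosen from the ACF/PACF or by an information criterion, much like the identification stage in proc arima.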
