I'm trying to solve a time series prediction problem for multivariate data in Python using an LSTM approach.
In this tutorial, the author solves a time series air pollution prediction problem. The data looks like this:
pollution dew temp press wnd_dir wnd_spd snow rain
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
Unlike the hourly observations in the tutorial above, I have 30-second time-step observations on soccer matches with over 20 features, where each match (identified by a unique ID) has a different length, ranging from 190 to 200 time steps.
The author splits the train/test set using the first year of hourly data (365 × 24 rows), as follows:
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
So my train/test split should instead be by number of matches:
n_train_matches = k * match_len   # k training matches, each roughly match_len rows long
train = values[:n_train_matches, :]
test = values[n_train_matches:, :]
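For illustration, a minimal sketch of such a match-level split, assuming the rows of all matches are stacked in one DataFrame df with a hypothetical match_id column (the 80/20 ratio is also an assumption):
# df is assumed to hold all matches stacked row-wise, with a 'match_id' column
match_ids = df['match_id'].unique()            # one entry per match
k = int(0.8 * len(match_ids))                  # k = number of training matches
train_ids, test_ids = match_ids[:k], match_ids[k:]

train = df[df['match_id'].isin(train_ids)].values
test = df[df['match_id'].isin(test_ids)].values
Splitting on match IDs rather than on a fixed row count also copes with the varying match lengths (190 to 200 steps).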
I want to translate this to my problem and make a prediction for each feature as early as time t=2, i.e. 30 seconds into a match.
Question
Do I need to apply pre-sequence padding to each match?
Is there a way of solving the problem without padding?
If you are using an LSTM, then I believe you are more likely to benefit from that model if you pad the sequences and feed in multiple 30-second step observations.
If you didn't pad the sequences and you wanted a prediction at t=2, then you'd only be able to use the very last step observation.
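As a minimal sketch of pre-padding, assuming TensorFlow/Keras is available and using dummy arrays in place of the real matches:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 200      # longest match length mentioned above
n_features = 20

# dummy matches of varying length, each of shape (n_steps, n_features)
matches = [np.random.rand(np.random.randint(190, 201), n_features)
           for _ in range(5)]

# pre-padding inserts zero rows at the start of the shorter matches
padded = pad_sequences(matches, maxlen=max_len, dtype='float32',
                       padding='pre', value=0.0)
print(padded.shape)   # (5, 200, 20)
A Masking layer (mask_value=0.0) placed in front of the LSTM can then tell the model to skip the padded steps.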
Related
I need to have 10 results in my CSV file, but it shows only one. I looked through some previously posted questions, and they said my earlier iterations might be getting overwritten.
How should I edit my code in order to get all 10 repetitions in my CSV file?
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

for x in range(10):
    report = classification_report(Y_test, Y_pred, output_dict=True)
    CR = pd.DataFrame(report).transpose()
    CR.to_csv('LR_CR.csv')

    matrix = confusion_matrix(Y_test, Y_pred)
    CM = pd.DataFrame(matrix).transpose()
    CM.to_csv('LR_CM.csv')
Output
precision recall f1-score support
0 0.421053 0.444444 0.432432 18.000000
1 0.777778 0.760870 0.769231 46.000000
accuracy 0.671875 0.671875 0.671875 0.671875
macro avg 0.599415 0.602657 0.600832 64.000000
weighted avg 0.677449 0.671875 0.674506 64.000000
0 1
0 8 14
1 10 32
precision recall f1-score support
0 0.625000 0.277778 0.384615 18.00
1 0.767857 0.934783 0.843137 46.00
accuracy 0.750000 0.750000 0.750000 0.75
macro avg 0.696429 0.606280 0.613876 64.00
weighted avg 0.727679 0.750000 0.714178 64.00
0 1
0 5 3
1 13 43
What is happening here is that you are overwriting your CSV file on each iteration of the loop.
If you want 10 separate CSV files, you need to name each one differently, which can be achieved with f-strings:
for i in range(10):
    df.to_csv(f'some_name_{i}.csv')  # to_csv is a DataFrame method, not a pandas-level function
Or if you need all the results in a single CSV file, then just append to the existing file
df.to_csv('existing.csv', mode='a', index=False, header=False)
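Putting it together, a minimal sketch of the loop from the question writing one pair of files per iteration; the file names are illustrative, and Y_test / Y_pred are assumed to be produced inside the loop:
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

for i in range(10):
    # ... refit / re-predict here so Y_test and Y_pred change between runs ...
    report = classification_report(Y_test, Y_pred, output_dict=True)
    pd.DataFrame(report).transpose().to_csv(f'LR_CR_{i}.csv')

    matrix = confusion_matrix(Y_test, Y_pred)
    pd.DataFrame(matrix).transpose().to_csv(f'LR_CM_{i}.csv')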
This might seem like a stupid question, but I have been trying out time series forecasting techniques, and both of the techniques I tried (Prophet and Auto-ARIMA) seem to give predictions in constantly increasing order.
What could be the possible reasons for getting constantly increasing predictions? I think it might be due to the seasonality factor, but I am not really sure.
I can share the code if required.
The code is as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet  # 'from prophet import Prophet' in newer versions

data_v2 = pd.read_csv('data_v1.csv')
data_v2.shape
data_v2.head()
data_v2.dtypes
data_v2['Date'] = pd.to_datetime(data_v2.Date,format='%m/%d/%Y')
data_v2.index = data_v2['Date']
#preparing data
data_v2.rename(columns={'Invoice Amount': 'y', 'Date': 'ds'}, inplace=True)
data_v2.head()
#train and validation
train = data_v2[:16]
train.shape
valid = data_v2[16:]
valid.shape
#fit the model
model = Prophet()
model.fit(train)
#predictions
close_prices = model.make_future_dataframe(periods=len(valid))
forecast = model.predict(close_prices)
forecast.head()
#rmse
forecast_valid = forecast['yhat'][16:]
rms=np.sqrt(np.mean(np.power((np.array(valid['y'])-np.array(forecast_valid)),2)))
print(rms)
valid['Predictions'] = 0
valid['Predictions'] = forecast_valid.values
plt.plot(train['y'])
plt.plot(valid[['y', 'Predictions']])
# Plot the forecast
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
fig = model.plot(forecast, ax=ax)
plt.show()
fig = model.plot_components(forecast)
The following are the predictions:
0 30505.608982
1 31618.779403
2 32731.949825
3 33737.394077
4 34850.564499
5 35927.826201
6 37040.996625
7 38118.258327
8 39231.428751
9 40344.599176
10 41421.860877
11 42535.031302
12 43612.293004
13 44725.463429
14 45838.633854
15 46844.078108
16 46879.986832
17 46915.895555
18 46951.804279
19 46987.713002
20 47023.621725
21 47059.530449
22 47095.439172
23 47131.347896
The following are the actual values:
2016-12-01 63662.5
2017-01-01 35167.5
2017-02-01 24810.0
2017-03-01 25352.5
2017-04-01 19355.0
2017-05-01 21860.0
2017-06-01 21420.0
2017-07-01 30260.0
2017-08-01 26810.0
2017-09-01 29510.0
2017-10-01 84722.5
2017-11-01 71706.5
2017-12-01 44935.0
2018-01-01 43835.0
2018-02-01 35405.0
2018-03-01 40307.5
2018-04-01 26665.0
2018-05-01 27395.0
2018-06-01 89142.5
2018-07-01 100497.5
2018-08-01 41722.5
2018-09-01 30760.0
2018-10-01 183562.5
2018-11-01 90650.0
Thanks in advance!
loan_amnt funded_amnt funded_amnt_inv term
0 5000.0 5000.0 4975.0 36 months
1 2500.0 2500.0 2500.0 60 months
My dataframe goes like the above.
I need to change '36 months' under term to '3' (converting months to years), but I'm unable to do so, as the dtype of 'term' is object.
loan.term.replace('36months', '3', inplace=True)  # --> no change in dataframe
I tried the below code for type conversion but it still returns the dtype as Object
loan['term']=loan.term.astype(str)
Expected output:
loan_amnt funded_amnt funded_amnt_inv term
0 5000.0 5000.0 4975.0 3
1 2500.0 2500.0 2500.0 5
Any help would be dearly appreciated. Thank you for your time.
Your data is in a pandas DataFrame, so iterating over it yields column names rather than rows; use the vectorized string methods instead:
loan['term'] = loan['term'].str.split(' ').str[0].astype(int) // 12
print(loan)
you'll get your output! For the two sample rows this gives 3 and 5, matching the expected output.
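If the column format varies across rows, a hedged alternative is to extract the leading digits with a regular expression (the pattern here is an assumption about the data):
# pull out the first run of digits, then convert months to years
loan['term'] = loan['term'].str.extract(r'(\d+)', expand=False).astype(int) // 12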
I want to read, in Python, a file which contains a variable-length header, and then extract into a dataframe/series the variables which come after the header.
The data looks like :
....................................................................
Data coverage and measurement duty cycle:
When the instrument duty cycle is not in measure mode (i.e. in-flight
calibrations) the data is not given here (error flag = 2).
The measurements have been found to exhibit a strong sensitivity to cabin
pressure.
Consequently the instrument requires calibrated at each new cabin
pressure/altitude.
Data taken at cabin pressures for which no calibration was performed is
not given here (error flag = 2).
Measurement sensivity to large roll angles was also observed.
Data corresponding to roll angles greater than 10 degrees is not given
here (error flag = 2)
......................................................................
High Std: TBD ppb
Target Std: TBD ppb
Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
31638 0.62 0
31639 0.64 0
31640 0.71 0
.....
.....
So what I want is to extract the data as :
Time C2H6 Flag
0 31636 0.69 0 NaN
1 31637 0.66 0 NaN
2 31638 0.62 0 NaN
3 31639 0.64 0 NaN
4 31640 0.71 0 NaN
5 31641 0.79 0 NaN
6 31642 0.85 0 NaN
7 31643 0.81 0 NaN
8 31644 0.79 0 NaN
9 31645 0.85 0 NaN
I can do that with:
infile = "/nfs/potts.jasmin-north/scratch/earic/AEOG/data/mantildas_faam_20180911_r1_c118.na"
flightdata = pd.read_fwf(infile, skiprows=53, header=None, names=['Time', 'C2H6', 'Flag'])
but I'm skipping 53 rows only because I counted by hand how many rows to skip. I have a bunch of these files, and some don't have exactly 53 rows in the header, so I am wondering: what would be the best way to deal with this, and what criterion would make Python always read just the three columns of data once it finds them? For instance, suppose I want Python to start reading the data right after it encounters
Mole fraction error flag description :
0 : Valid data
2 : Missing data
what should I do? Is there another criterion that would work better?
You can split on the header delimiter, like so:
import io
import pandas as pd

with open(filename, 'r') as f:
    myfile = f.read()

# keep only the text after the last occurrence of the delimiter
infile = myfile.split('Mole fraction error flag description :')[-1]

# drop the remaining flag-description lines ("0 : Valid data", "2 : Missing data");
# you know the data better, so a stricter filter may be more appropriate
infile = '\n'.join(line for line in infile.split('\n') if ' : ' not in line)

# create dataframe; read_fwf needs a path or buffer, so wrap the string in StringIO
flightdata = pd.read_fwf(io.StringIO(infile), header=None, names=['Time', 'C2H6', 'Flag'])
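Another option is to locate the marker line in each file and compute skiprows per file. A sketch, assuming the marker line and the two flag-description lines always sit immediately above the data (find_skiprows is a hypothetical helper):
import pandas as pd

def find_skiprows(path, marker='Mole fraction error flag description'):
    # header length = everything up to and including the marker line,
    # plus the two flag-description lines below it
    with open(path) as f:
        for i, line in enumerate(f):
            if marker in line:
                return i + 3
    raise ValueError(f'marker not found in {path}')

flightdata = pd.read_fwf(infile, skiprows=find_skiprows(infile),
                         header=None, names=['Time', 'C2H6', 'Flag'])
Here infile is the file path, as in the question.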
I've been running a dataset through Weka, applying naive Bayes (NB).
I am stuck on the following problem: while analyzing the output, I noticed a difference between the totals in the attributes section and the total number of instances reported in the log.
If you sum the counts for attribute "a0", Weka reports 1044 instances (522 + 522).
If you check "Instances", it is 1036.
The dataset actually contains 1036 instances.
Does anyone have an explanation for this? Thanks.
Here's a log paste:
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: teste.carro
Instances: 1036
Attributes: 7
a0
a1
a2
a3
a4
a5
class
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute 0 1
(0.5) (0.5)
===========================
a0
1 105.0 175.0
2 112.0 165.0
3 153.0 109.0
4 152.0 73.0
[total] 522.0 522.0
a1
1 101.0 165.0
2 123.0 165.0
3 136.0 119.0
4 162.0 73.0
[total] 522.0 522.0
a2
1 150.0 107.0
2 122.0 133.0
3 121.0 141.0
4 129.0 141.0
[total] 522.0 522.0
a3
1 247.0 1.0
2 134.0 265.0
3 140.0 255.0
[total] 521.0 521.0
a4
1 189.0 127.0
2 177.0 185.0
3 155.0 209.0
[total] 521.0 521.0
a5
1 244.0 1.0
2 160.0 220.0
3 117.0 300.0
[total] 521.0 521.0
Time taken to build model: 0 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.01 seconds
=== Summary ===
Correctly Classified Instances 957 92.3745 %
Incorrectly Classified Instances 79 7.6255 %
Kappa statistic 0.8475
Mean absolute error 0.1564
Root mean squared error 0.2398
Relative absolute error 31.2731 %
Root relative squared error 47.9651 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 80.2124 %
Total Number of Instances 1036
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,847 0,000 1,000 0,847 0,917 0,858 0,989 0,991 0
1,000 0,153 0,868 1,000 0,929 0,858 0,989 0,988 1
Weighted Avg. 0,924 0,076 0,934 0,924 0,923 0,858 0,989 0,989
=== Confusion Matrix ===
a b <-- classified as
439 79 | a = 0
0 518 | b = 1
Reading from "Data Mining: Practical Machine Learning Tools and Techniques" by Witten and Frank (the companion book for Weka), a problem with naive Bayes is pointed out:
If a particular attribute value never occurs together with some class value, then that zero count has undue influence over the class prediction. In Weka, this possibility is avoided by adding one to the count of every categorical attribute value when calculating the conditional probabilities (with the denominator adjusted accordingly). You can verify this in your log: there are 518 instances per class, attributes a0, a1 and a2 have 4 possible values each, so their per-class [total] is 518 + 4 = 522, while a3, a4 and a5 have 3 values each, giving 518 + 3 = 521.
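To make the correction concrete, here is a minimal Python sketch; the raw counts are inferred from the log by subtracting the added one from the a0 column for class 0:
# raw counts of a0 = 1..4 within class 0 (they sum to 518, the class size)
raw = [104, 111, 152, 151]

# Weka's correction: add one to every value count
smoothed = [c + 1 for c in raw]            # [105, 112, 153, 152], as in the log
total = sum(smoothed)                      # 522, the [total] line
probs = [c / total for c in smoothed]      # P(a0 = v | class 0)
print(smoothed, total, probs)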
Below I attempt to explain the undue influence that is exhibited by the absence of an attribute value.
The naive Bayes formula:
P(y|x) = ( P(x1|y) * P(x2|y) * ... * P(xn|y) * P(y) ) / P(x)
From the naive Bayes formula we can see what they mean:
Say:
P(x1|y1) = 0
P(x2|y1) ... P(xn|y1) all equal 1
From the above formula:
P(y1|x) = 0
Even though all other attributes strongly indicate that the instance belongs to class y1, the resulting probability is zero. The adjustment made by Weka allows for the possibility that the instance still comes from the class y1.
A true numeric example can be found starting around slide 12 on this webpage.