Time Series Forecasting Python | Constantly Increasing Predictions - python-3.x

This might seem like a stupid question, but I have been trying out time series forecasting techniques, and both of the techniques I tried (Prophet and auto-ARIMA) seem to give predictions in a constantly increasing order.
What could be the possible reasons for getting constantly increasing predictions? I think it might be due to a seasonality factor, but I am not really sure.
I can share the code if required.
The code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet  # or `from fbprophet import Prophet` on older installs

data_v2 = pd.read_csv('data_v1.csv')
data_v2.shape
data_v2.head()
data_v2.dtypes
data_v2['Date'] = pd.to_datetime(data_v2.Date,format='%m/%d/%Y')
data_v2.index = data_v2['Date']
#preparing data
data_v2.rename(columns={'Invoice Amount': 'y', 'Date': 'ds'}, inplace=True)
data_v2.head()
#train and validation
train = data_v2[:16]
train.shape
valid = data_v2[16:]
valid.shape
#fit the model
model = Prophet()
model.fit(train)
#predictions
close_prices = model.make_future_dataframe(periods=len(valid))
forecast = model.predict(close_prices)
forecast.head()
#rmse
forecast_valid = forecast['yhat'][16:]
rms=np.sqrt(np.mean(np.power((np.array(valid['y'])-np.array(forecast_valid)),2)))
print(rms)
valid['Predictions'] = 0
valid['Predictions'] = forecast_valid.values
plt.plot(train['y'])
plt.plot(valid[['y', 'Predictions']])
# Plot the forecast
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
fig = model.plot(forecast, ax=ax)
plt.show()
fig = model.plot_components(forecast)
The following are the predictions:
0 30505.608982
1 31618.779403
2 32731.949825
3 33737.394077
4 34850.564499
5 35927.826201
6 37040.996625
7 38118.258327
8 39231.428751
9 40344.599176
10 41421.860877
11 42535.031302
12 43612.293004
13 44725.463429
14 45838.633854
15 46844.078108
16 46879.986832
17 46915.895555
18 46951.804279
19 46987.713002
20 47023.621725
21 47059.530449
22 47095.439172
23 47131.347896
Following are the actual values:
2016-12-01 63662.5
2017-01-01 35167.5
2017-02-01 24810.0
2017-03-01 25352.5
2017-04-01 19355.0
2017-05-01 21860.0
2017-06-01 21420.0
2017-07-01 30260.0
2017-08-01 26810.0
2017-09-01 29510.0
2017-10-01 84722.5
2017-11-01 71706.5
2017-12-01 44935.0
2018-01-01 43835.0
2018-02-01 35405.0
2018-03-01 40307.5
2018-04-01 26665.0
2018-05-01 27395.0
2018-06-01 89142.5
2018-07-01 100497.5
2018-08-01 41722.5
2018-09-01 30760.0
2018-10-01 183562.5
2018-11-01 90650.0
Thanks in advance!
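A note on one thing worth checking (a minimal sketch, not an authoritative diagnosis, assuming the model, train, forecast and close_prices objects from the code above): Prophet only enables its automatic yearly seasonality when the history spans roughly two years, so with only 16 monthly training points the forecast may consist almost entirely of the fitted linear trend, which would look exactly like a constantly increasing prediction. Inspecting the component columns of the forecast frame shows whether that is the case; forcing yearly_seasonality=True is an assumption to test, not a guaranteed fix.
# 'trend' and 'yhat' are always present in Prophet's forecast frame; a 'yearly'
# column only appears if a yearly seasonality was actually fitted.
print([c for c in forecast.columns if c in ('trend', 'yearly', 'yhat')])
print(forecast[['ds', 'trend', 'yhat']].tail())

# Force a yearly component despite the short history and compare:
model_seasonal = Prophet(yearly_seasonality=True)
model_seasonal.fit(train)
forecast_seasonal = model_seasonal.predict(close_prices)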

Related

Convert monthly cumulative values to current month values in Pandas

For the following data df1 with missing January data, cumul_val1 and cumul_val2 are the monthly cumulative values of value1 and value2 respectively.
date cumul_val1 cumul_val2
0 2020-05-31 48702.97 45919.59
1 2020-06-30 69403.68 62780.21
2 2020-07-31 83631.36 75324.61
3 2020-08-31 98485.95 88454.14
4 2020-09-30 117072.67 103484.20
5 2020-10-31 133293.80 116555.76
6 2020-11-30 150834.45 129492.36
7 2020-12-31 176086.22 141442.95
8 2021-02-28 17363.14 13985.87
9 2021-03-31 36007.05 27575.82
10 2021-04-30 50305.00 40239.76
11 2021-05-31 66383.32 54318.08
12 2021-06-30 88635.35 72179.07
13 2021-07-31 101648.18 84895.41
14 2021-08-31 114192.81 98059.73
15 2021-09-30 130331.78 112568.07
16 2021-10-31 143040.71 124933.62
17 2021-11-30 158130.73 137313.96
18 2021-12-31 179433.41 147602.08
19 2022-02-28 15702.61 14499.38
20 2022-03-31 31045.96 27764.95
21 2022-04-30 39768.15 39154.31
22 2022-05-31 50738.38 52133.62
I now want to convert them into monthly values. For example, the value of value1 on 2021-04-30 is calculated as 50305.00 - 36007.05. Since the January value is missing, the current-month value for February is the cumulative value itself, and the current-month value for March is the March cumulative value minus the February cumulative value.
How can I achieve this?
The expected result:
date cumul_val1 cumul_val2 month_val1 month_val2
0 2020-05-31 48702.97 45919.59 NaN NaN
1 2020-06-30 69403.68 62780.21 20700.71 16860.62
2 2020-07-31 83631.36 75324.61 14227.68 12544.40
3 2020-08-31 98485.95 88454.14 14854.59 13129.53
4 2020-09-30 117072.67 103484.20 18586.72 15030.06
5 2020-10-31 133293.80 116555.76 16221.13 13071.56
6 2020-11-30 150834.45 129492.36 17540.65 12936.60
7 2020-12-31 176086.22 141442.95 25251.77 11950.59
8 2021-02-28 17363.14 13985.87 17363.14 13985.87
9 2021-03-31 36007.05 27575.82 18643.91 13589.95
10 2021-04-30 50305.00 40239.76 14297.96 12663.94
11 2021-05-31 66383.32 54318.08 16078.32 14078.32
12 2021-06-30 88635.35 72179.07 22252.03 17860.99
13 2021-07-31 101648.18 84895.41 13012.83 12716.34
14 2021-08-31 114192.81 98059.73 12544.63 13164.32
15 2021-09-30 130331.78 112568.07 16138.97 14508.34
16 2021-10-31 143040.71 124933.62 12708.94 12365.55
17 2021-11-30 158130.73 137313.96 15090.02 12380.34
18 2021-12-31 179433.41 147602.08 21302.68 10288.12
19 2022-02-28 15702.61 14499.38 15702.61 14499.38
20 2022-03-31 31045.96 27764.95 15343.35 13265.57
21 2022-04-30 39768.15 39154.31 8722.19 11389.36
22 2022-05-31 50738.38 52133.62 10970.22 12979.31
Note: to simplify the question, I added an alternative sample dataframe df2 without missing months:
date cumul_val monthly_val
0 2020-09-30 32144142.46 NaN
1 2020-10-31 36061223.45 3917080.99
2 2020-11-30 40354684.50 4293461.05
3 2020-12-31 44360036.58 4005352.08
4 2021-01-31 4130729.28 4130729.28
5 2021-02-28 7985781.64 3855052.36
6 2021-03-31 12306556.74 4320775.10
7 2021-04-30 16873032.10 4566475.36
8 2021-05-31 21730065.01 4857032.91
9 2021-06-30 26816787.85 5086722.84
10 2021-07-31 31785276.80 4968488.95
11 2021-08-31 37030178.38 5244901.58
12 2021-09-30 42879767.13 5849588.75
13 2021-10-31 48392250.79 5512483.66
14 2021-11-30 53655448.65 5263197.86
15 2021-12-31 59965790.04 6310341.39
16 2022-01-31 5226910.15 5226910.15
17 2022-02-28 9481147.06 4254236.91
18 2022-03-31 14205738.71 4724591.65
19 2022-04-30 19096746.32 4891007.61
20 2022-05-31 24033460.77 4936714.45
21 2022-06-30 28913566.31 4880105.54
22 2022-07-31 34099663.15 5186096.84
23 2022-08-31 39082926.81 4983263.66
24 2022-09-30 44406354.61 5323427.80
25 2022-10-31 48889431.89 4483077.28
26 2022-11-30 52956747.09 4067315.20
27 2022-12-31 57184652.60 4227905.51
Had there been no gap in the data, the problem would have been an easy .diff(). However, since there are gaps, we need to fill those gaps with 0, calculate the diff, then keep only the original months.
idx = pd.to_datetime(df["date"])
month_val = (
    df[["cumul_val1", "cumul_val2"]]
    # Fill the gap months with 0
    .set_index(idx)
    .reindex(pd.date_range(idx.min(), idx.max(), freq="M"), fill_value=0)
    # Take the diff
    .diff()
    # Keep only the original months
    .loc[idx]
    # Beat into shape for the subsequent concat
    .set_axis(["month_val1", "month_val2"], axis=1)
    .set_index(df.index)
)
result = pd.concat([df, month_val], axis=1)
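As a quick sanity check on the gap handling (a hypothetical two-row example, not part of the answer above): reindexing to the full month-end range inserts the missing 2021-01-31 as 0, so the diff at 2021-02-28 is just the February cumulative value itself, matching the expected output.
check_idx = pd.to_datetime(["2020-12-31", "2021-02-28"])
s = pd.Series([176086.22, 17363.14], index=check_idx)
full = s.reindex(pd.date_range(check_idx.min(), check_idx.max(), freq="M"), fill_value=0)
print(full.diff().loc[check_idx])  # NaN for 2020-12-31, 17363.14 for 2021-02-28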
Edit: the OP clarified that for the first entry of the year, be it Jan or Feb, the monthly value is the same as the cumulative value. In that case, use this:
cumul_cols = ["cumul_val1", "cumul_val2"]
monthly_cols = [f"month_val{i+1}" for i in range(len(cumul_cols))]

# Make sure `date` is of type Timestamp and the dataframe is sorted. Your data
# may have satisfied both conditions already.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# Return True if the current row is in the same year as the previous row.
# Repeat the result for each cumul_val column.
is_same_year = np.tile(
    df["date"].dt.year.diff().eq(0).to_numpy()[:, None],
    (1, len(cumul_cols)),
)
month_val = np.where(
    is_same_year,
    df[cumul_cols].diff(),
    df[cumul_cols],
)
month_val[0, :] = np.nan

df[monthly_cols] = month_val
It would be much easier to use your date as a PeriodIndex with a monthly frequency:
# set up the date as a monthly period Index
df2 = df.assign(date=pd.to_datetime(df['date']).dt.to_period('M')).set_index('date')
# subtract the previous month
df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2)
Output:
cumul_val1 cumul_val2
date
2020-05 48702.97 45919.59
2020-06 20700.71 16860.62
2020-07 14227.68 12544.40
2020-08 14854.59 13129.53
2020-09 18586.72 15030.06
2020-10 16221.13 13071.56
2020-11 17540.65 12936.60
2020-12 25251.77 11950.59
2021-02 17363.14 13985.87
2021-03 18643.91 13589.95
2021-04 14297.95 12663.94
2021-05 16078.32 14078.32
2021-06 22252.03 17860.99
2021-07 13012.83 12716.34
2021-08 12544.63 13164.32
2021-09 16138.97 14508.34
2021-10 12708.93 12365.55
2021-11 15090.02 12380.34
2021-12 21302.68 10288.12
2022-02 15702.61 14499.38
2022-03 15343.35 13265.57
2022-04 8722.19 11389.36
2022-05 10970.23 12979.31
If you want to assign back to the original DataFrame:
df[['month_val1', 'month_val2']] = df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2).to_numpy()
Updated df:
date cumul_val1 cumul_val2 month_val1 month_val2
0 2020-05-31 48702.97 45919.59 48702.97 45919.59
1 2020-06-30 69403.68 62780.21 20700.71 16860.62
2 2020-07-31 83631.36 75324.61 14227.68 12544.40
3 2020-08-31 98485.95 88454.14 14854.59 13129.53
4 2020-09-30 117072.67 103484.20 18586.72 15030.06
5 2020-10-31 133293.80 116555.76 16221.13 13071.56
6 2020-11-30 150834.45 129492.36 17540.65 12936.60
7 2020-12-31 176086.22 141442.95 25251.77 11950.59
8 2021-02-28 17363.14 13985.87 17363.14 13985.87
9 2021-03-31 36007.05 27575.82 18643.91 13589.95
10 2021-04-30 50305.00 40239.76 14297.95 12663.94
11 2021-05-31 66383.32 54318.08 16078.32 14078.32
12 2021-06-30 88635.35 72179.07 22252.03 17860.99
13 2021-07-31 101648.18 84895.41 13012.83 12716.34
14 2021-08-31 114192.81 98059.73 12544.63 13164.32
15 2021-09-30 130331.78 112568.07 16138.97 14508.34
16 2021-10-31 143040.71 124933.62 12708.93 12365.55
17 2021-11-30 158130.73 137313.96 15090.02 12380.34
18 2021-12-31 179433.41 147602.08 21302.68 10288.12
19 2022-02-28 15702.61 14499.38 15702.61 14499.38
20 2022-03-31 31045.96 27764.95 15343.35 13265.57
21 2022-04-30 39768.15 39154.31 8722.19 11389.36
22 2022-05-31 50738.38 52133.62 10970.23 12979.31
Fortunately pandas offers a diff function for this:
df = pd.DataFrame(
    [['2020-05-31', 48702.97, 45919.59],
     ['2020-06-30', 69403.68, 62780.21],
     ['2020-07-31', 83631.36, 75324.61]],
    columns=['date', 'cumul_val1', 'cumul_val2'],
)
df['val1'] = df['cumul_val1'].diff()
df['val2'] = df['cumul_val2'].diff()
print(df)
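For this three-row sample the result works out as follows (hand-checked from the values above; the exact pandas print layout may differ slightly):
         date  cumul_val1  cumul_val2      val1      val2
0  2020-05-31    48702.97    45919.59       NaN       NaN
1  2020-06-30    69403.68    62780.21  20700.71  16860.62
2  2020-07-31    83631.36    75324.61  14227.68  12544.40
Note that a plain .diff() does not handle the January/February resets in the full data, which is why the other answers reindex the months or check for a year change first.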

ValueError: cannot reindex from a duplicate axis while shift one column in Pandas

Given a dataframe df with date index as follows:
value
2017-03-31 NaN
2017-04-01 27863.7
2017-04-02 27278.5
2017-04-03 27278.5
2017-04-04 27278.5
...
2021-10-27 NaN
2021-10-28 NaN
2021-10-29 NaN
2021-10-30 NaN
2021-10-31 NaN
I'm able to shift value column by one year use df['value'].shift(freq=pd.DateOffset(years=1)):
Out:
2018-03-31 NaN
2018-04-01 27863.7
2018-04-02 27278.5
2018-04-03 27278.5
2018-04-04 27278.5
...
2022-10-27 NaN
2022-10-28 NaN
2022-10-29 NaN
2022-10-30 NaN
2022-10-31 NaN
But when I use it to replace the original value with df['value'] = df['value'].shift(freq=pd.DateOffset(years=1)), it raises an error:
ValueError: cannot reindex from a duplicate axis
Since the code below works smoothly, I think the issue is caused by the NaNs in the value column:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130101', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
df
df.B = df.B.shift(freq=pd.DateOffset(years=1))
I also tried df['value'].shift(freq=relativedelta(years=+1)), but it raises: pandas.errors.NullFrequencyError: Cannot shift with no freq
Could someone help me deal with this issue? Sincere thanks.
Since the code below works smoothly, I think the issue is caused by the NaNs in the value column
No, I don't think so. It's more likely because your second sample's date range does not contain a 29 February, so shifting by one year never produces a duplicate date.
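Concretely, pd.DateOffset(years=1) maps 29 February onto 28 February of the following year, so two source dates can land on the same shifted date (a minimal sketch):
import pandas as pd

print(pd.Timestamp("2020-02-28") + pd.DateOffset(years=1))  # 2021-02-28
print(pd.Timestamp("2020-02-29") + pd.DateOffset(years=1))  # 2021-02-28 as well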
Reproducible error with a range that includes a leap day:
# 2018 (365 days), 2019 (365 days) and 2020 (366 days, includes 2020-02-29)
dates = pd.date_range('20180101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
...
ValueError: cannot reindex from a duplicate axis
...
The example below works:
# 2017, 2018 and 2019 (365 days each, no leap day in the range)
dates = pd.date_range('20170101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
Just look at value_counts:
# 2018 -> 2020
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2021-02-28 2 # The duplicated index
2020-12-29 1
2021-01-04 1
2021-01-03 1
2021-01-02 1
..
2020-01-07 1
2020-01-08 1
2020-01-09 1
2020-01-10 1
2021-12-31 1
Length: 1095, dtype: int64
# 2017 -> 2019
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2018-01-01 1
2019-12-30 1
2020-01-05 1
2020-01-04 1
2020-01-03 1
..
2019-01-07 1
2019-01-08 1
2019-01-09 1
2019-01-10 1
2021-01-01 1
Length: 1096, dtype: int64
Solution
Obviously, the solution is to remove the duplicated index entry, in our case '2021-02-28', by using resample('D') and an aggregate function (first, last, min, max, mean, sum, or a custom one):
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28']
2021-02-28 41
2021-02-28 96
Name: B, dtype: int64
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28'] \
.resample('D').agg(('first', 'last', 'min', 'max', 'mean', 'sum')).T
2021-02-28
first 41.0
last 96.0
min 41.0
max 96.0
mean 68.5
sum 137.0
# Choose `last` for example
df.B = df.B.shift(freq=pd.DateOffset(years=1)).resample('D').last()
Note: you can replace .resample(...).func with .loc[lambda x: ~x.index.duplicated()] to keep the first of the duplicated rows instead.
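For example, keeping the last value for any duplicated day (mirroring the .resample('D').last() choice above) could look like the sketch below; treat it as a sketch rather than a drop-in replacement, since index.duplicated only removes exact duplicate timestamps.
df.B = (
    df.B.shift(freq=pd.DateOffset(years=1))
        .loc[lambda x: ~x.index.duplicated(keep='last')]
)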

Pandas calculating over duplicated entries

This is my sample dataframe
Price DateOfTrasfer PAON Street
115000 2018-07-13 00:00 4 THE LANE
24000 2018-04-10 00:00 20 WOODS TERRACE
56000 2018-06-22 00:00 6 HEILD CLOSE
220000 2018-05-25 00:00 25 BECKWITH CLOSE
58000 2018-05-09 00:00 23 AINTREE DRIVE
115000 2018-06-21 00:00 4 EDEN VALE MEWS
82000 2018-06-01 00:00 24 ARKLESS GROVE
93000 2018-07-06 00:00 14 HORTON CRESCENT
42500 2018-06-27 00:00 18 CATHERINE TERRACE
172000 2018-05-25 00:00 67 HOLLY CRESCENT
This is the task to perform:
For any address that appears more than once in the dataset, define a holding period as the time between any two consecutive transactions involving that property (i.e. N(holding_periods) = N(appearances) - 1). Implement a function that takes price paid data and returns the average length of a holding period and the annualised change in value between the purchase and sale, grouped by the year a holding period ends and the property type.
def holding_time(df):
    df = df.copy()
    # to work only with dates (day)
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    cols = ['PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    df.drop(["PAON", 'Street'], axis=1, inplace=True)
    df = df.groupby(['address', 'Price'], as_index=False).agg({'PPD': 'size'}) \
           .rename(columns={'PPD': 'count_2'})
    return df
This script creates columns containing the individual holding times, the average holding time for that property, and the price changes during the holding times:
import numpy as np
import pandas as pd
# assume df is defined above ...
hdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:,1]).reset_index(name='hgb')
pdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:,0]).reset_index(name='pgb')
df['holding_periods'] = hdf['hgb'].apply(lambda c: np.diff(c.astype(np.datetime64)))
df['price_changes'] = pdf['pgb'].apply(lambda c: np.diff(c.astype(np.int64)))
df['holding_periods'] = df['holding_periods'].fillna("").apply(list)
df['avg_hold'] = df['holding_periods'].apply(lambda c: np.array(c).astype(np.float64).mean() if c else 0).fillna(0)
df.drop_duplicates(subset=['Street','avg_hold'], keep=False, inplace=True)
I created 2 new dummy entries for "Heild Close" to test it:
# Input:
Price DateOfTransfer PAON Street
0 115000 2018-07-13 4 THE LANE
1 24000 2018-04-10 20 WOODS TERRACE
2 56000 2018-06-22 6 HEILD CLOSE
3 220000 2018-05-25 25 BECKWITH CLOSE
4 58000 2018-05-09 23 AINTREE DRIVE
5 115000 2018-06-21 4 EDEN VALE MEWS
6 82000 2018-06-01 24 ARKLESS GROVE
7 93000 2018-07-06 14 HORTON CRESCENT
8 42500 2018-06-27 18 CATHERINE TERRACE
9 172000 2018-05-25 67 HOLLY CRESCENT
10 59000 2018-06-27 12 HEILD CLOSE
11 191000 2018-07-13 1 HEILD CLOSE
# Output:
Price DateOfTransfer PAON Street holding_periods price_changes avg_hold
0 115000 2018-07-13 4 THE LANE [] [] 0.0
1 24000 2018-04-10 20 WOODS TERRACE [] [] 0.0
2 56000 2018-06-22 6 HEILD CLOSE [5 days, 16 days] [3000, 132000] 10.5
3 220000 2018-05-25 25 BECKWITH CLOSE [] [] 0.0
4 58000 2018-05-09 23 AINTREE DRIVE [] [] 0.0
5 115000 2018-06-21 4 EDEN VALE MEWS [] [] 0.0
6 82000 2018-06-01 24 ARKLESS GROVE [] [] 0.0
7 93000 2018-07-06 14 HORTON CRESCENT [] [] 0.0
8 42500 2018-06-27 18 CATHERINE TERRACE [] [] 0.0
9 172000 2018-05-25 67 HOLLY CRESCENT [] [] 0.0
Your question also mentions the annualised change in value between the purchase and sale, grouped by the year a holding period ends and the property type, but there is no property type column (PAON maybe?) and grouping by year would make the table extremely difficult to read, so I did not implement it. As it stands, you have the holding time between each transaction and the change of price at each time, so it should be trivial to implement a function to use this information to plot annualized data, if you so choose.
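If you do want the annualised change, here is a rough sketch (not part of the answer above) of the usual compounding formula applied to a single holding period; the example numbers come from the HEILD CLOSE rows shown earlier (56000 -> 59000 over 5 days).
def annualised_change(purchase_price, sale_price, holding_days):
    # Compound annual growth rate between purchase and sale.
    return (sale_price / purchase_price) ** (365.0 / holding_days) - 1

# Very short holds annualise to very large numbers, which is expected.
print(annualised_change(56000, 59000, 5))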
After manually checking the maximum and minimum average differences, I had to modify the accepted solution so that it matches the manual results.
These are the data sources; this function is a bit slow, so I would appreciate a faster implementation.
urls = ['http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2020.csv',
'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2019.csv',
'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018.csv']
def holding_time(df):
    df = df.copy()
    df = df[['Price', 'DateOfTrasfer', 'Prop_Type', 'Postcode', 'PAON', 'Street']]
    df = df[df.duplicated(subset=['Postcode', 'PAON', 'Street'], keep=False)]
    cols = ['Postcode', 'PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
    df['address'] = df['address'].apply(lambda x: x.replace(' ', '_'))
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    df['avg_price'] = df.groupby(['address'])['Price'].transform(lambda x: x.diff().mean())
    df['avg_hold'] = df.groupby(['address'])['DateOfTrasfer'].transform(lambda x: x.diff().dt.days.mean())
    df.drop_duplicates(subset=['address'], keep='first', inplace=True)
    df.drop(['Price', 'DateOfTrasfer', 'address'], axis=1, inplace=True)
    df = df.dropna()
    df['avg_hold'] = df['avg_hold'].map('Days {:.1f}'.format)
    df['avg_price'] = df['avg_price'].map('£{:,.1F}'.format)
    return df

reshape dataframe time series

I have a dataframe of weather data in a certain shape and I want to transform it, but I am struggling with it.
My dataframe looks like this:
city    temp_day1  temp_day2  temp_day3  ...  hum_day1  hum_day2  hum_day4  ...  condition
city_1  12         13         20              44        44.5      44             good
city_1  12         13         20              44        44.5      44             bad
city_2  14         04         33              44        44.5      44             good
I want to transform it to something like this (one block of temperature / humidity / condition columns per city, indexed by day):
        city_1                             city_2 ...
day     temperature  humidity  condition   ...
1       12           44        good
2       13           44.5      bad
3       20           NaN       bad
4       NaN          44
Some days don't have temperature or humidity values.
Thanks for your help.
Use wide_to_long with DataFrame.unstack, and finally DataFrame.swaplevel and DataFrame.sort_index:
df1 = (pd.wide_to_long(df,
                       stubnames=['temp', 'hum'],
                       i='city',
                       j='day',
                       sep='_',
                       suffix='\w+')
         .unstack(0)
         .swaplevel(1, 0, axis=1)
         .sort_index(axis=1))
print (df1)
city city_1
hum temp
day
day1 44.0 12.0
day2 44.5 13.0
day3 NaN 20.0
day4 44.0 NaN
Alternative solution:
df1 = df.set_index('city')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack([0,1]).unstack([0,1])
If need extract numbers from index:
df1 = (pd.wide_to_long(df,
                       stubnames=['temp', 'hum'],
                       i='city',
                       j='day',
                       sep='_',
                       suffix='\w+')
         .unstack(0)
         .swaplevel(1, 0, axis=1)
         .sort_index(axis=1))
df1.index = df1.index.str.extract('(\d+)', expand=False)
print (df1)
city city_1
hum temp
day
1 44.0 12.0
2 44.5 13.0
3 NaN 20.0
4 44.0 NaN
EDIT:
Solution with real data:
df1 = df.set_index(['condition', 'ACTIVE', 'mode', 'apply', 'spy', 'month'], append=True)
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack([0,1]).unstack([0,-2])
If need remove unnecessary levels in MultiIndex:
df1 = df1.reset_index(level=['condition', 'ACTIVE', 'mode', 'apply', 'spy', 'month'], drop=True)
You can use the pandas transpose method like this: df.T
This swaps the rows and columns of your dataframe. If you then create multiple columns, you can slice it with indexing and assign each slice to an independent column.
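A tiny hypothetical illustration of what df.T does (the frame and values here are made up):
import pandas as pd

df_small = pd.DataFrame({"temp_day1": [12, 14], "temp_day2": [13, 4]},
                        index=["city_1", "city_2"])
print(df_small.T)  # column labels become the index, cities become the columns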

Train/test set split for LSTM with multivariate Time series

I'm trying to solve a time series prediction problem for multivariate data in Python using an LSTM approach.
Here, the author solves a time series air pollution prediction problem. The data looks like this:
pollution dew temp press wnd_dir wnd_spd snow rain
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
As opposed to the yearly data in the above tutorial, I have 30-second time-step observations of soccer matches with over 20 features, where each match (with a unique ID) has a different length, ranging from 190 to 200 observations.
The author splits the train/test set by one year of hourly data (365 * 24 rows), as follows:
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
So my train/test split should be by number of matches:
(matches*len(match))
n_train_matches = some k number of matches * len(match)
train = values[:n_train_matches, :]
test = values[n_train_matches:, :]
I want to translate this to my problem, to make a prediction for each feature as early as time t=2, i.e. 30 seconds into a match.
Question
Do I need to apply pre-sequence padding on each match?
Is there a way of solving the problem without padding?
If you are using an LSTM, then I believe you are more likely to benefit from that model if you pad the sequences and feed in multiple 30-second step observations.
If you didn't pad the sequences and you wanted a prediction at t=2, then you would only be able to use the very last step observation.
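A minimal sketch of pre-sequence padding with Keras (the 190-200 step lengths and the 20-feature assumption come from the question; the random data is made up for illustration):
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Five fake matches, each an array of shape (n_steps, n_features) with 190-200 steps.
matches = [np.random.rand(np.random.randint(190, 201), 20) for _ in range(5)]

# Pre-pad with zeros so every match has exactly 200 steps.
padded = pad_sequences(matches, maxlen=200, dtype="float32", padding="pre", value=0.0)
print(padded.shape)  # (5, 200, 20)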
