Find SARIMAX AIC and pdq values, statsmodels - python-3.x

I'm trying to find the values of p, d, q and the seasonal values of P, D, Q using statsmodels (imported as "sm") in Python.
The data set I'm using is a CSV file containing three years of time series data recording energy consumption. The file was split into a smaller data frame in order to work with it. Here is what df_test.head() looks like:
                       time  total_consumption
122400  2015-05-01 00:01:00            106.391
122401  2015-05-01 00:11:00            120.371
122402  2015-05-01 00:21:00            109.292
122403  2015-05-01 00:31:00             99.838
122404  2015-05-01 00:41:00             97.387
Here is my code so far.
# Import the time series data set from a local file
df = pd.read_csv(r"C:\Users\path\Name of the file.csv")
# Rename the columns, put time as index and convert the time column to datetime
df.columns = ["time","total_consumption"]
df['time'] = pd.to_datetime(df.time)
df.set_index('time')
# Select the test df (there is data from 2015-05-01 to 2015-06-01)
df_test = df.loc[(df['time'] >= '2015-05-01') & (df['time'] <= '2015-05-14')]
# Find the (p,d,q)(P,D,Q) combination with the minimal AIC value
p = range(0,2)
d = range(0,2)
q = range(0,2)
pdq = list(itertools.product(p,d,q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p,d,q))]
warnings.filterwarnings("ignore")
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(df_test,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
When I run the code as it is, the program doesn't even seem to enter the "for" loop. But when I take out the try/except/continue, the program gives me this error message:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
How could I remedy that, and is there a way to automate the process so it directly outputs the parameters with the lowest AIC value (without having to look for it through all the possibilities)?
Thanks !
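One way to address both points: the ValueError usually appears when non-numeric data (here the raw time column of df_test) is handed to SARIMAX, and the lowest-AIC combination can be tracked inside the loop instead of read off the printout. This is only a minimal sketch, assuming df_test has the time and total_consumption columns shown above:
# Sketch only: pass a purely numeric series to SARIMAX and keep the best AIC seen so far.
series = df_test.set_index('time')['total_consumption']   # numeric data, datetime index

best_aic, best_order, best_seasonal = float('inf'), None, None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(series,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit(disp=False)
            if results.aic < best_aic:
                best_aic, best_order, best_seasonal = results.aic, param, param_seasonal
        except Exception:
            continue

print('Best: ARIMA{}x{}12 - AIC:{}'.format(best_order, best_seasonal, best_aic))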

Related

How to extract 4-dimensional data from a list of pandas dataframes?

I have a list of 500 dataframes (in the form of .csv files); 500 = 20 (time) x 25 (energy) bins. In other words, each dataframe is a measurement of flux at a single time and energy, represented as a 150x150 mesh grid corresponding to the x and y spatial coordinates. However, I would like to transform these data into 4-d coordinates of the form Flux(x, y, t, E), so that I have a new set of dataframes with columns E and rows t for any given (x, y) position.
I am not sure how to approach the problem. I would appreciate your help in giving me some sort of roadmap for doing this procedure.
Note:
The time and energy of each dataframe are in the name of the corresponding .csv file, in the form time-5e+35-energy0.00023-position.csv, where t = -5×10^35 and E = 0.00023.
What I know:
The 500 dataframes of 20 t × 25 E bins must be converted to 22,500 dataframes, one for each of the 150x150 (x, y) coordinates. However, this is very time consuming and I am not sure whether there is another package in Python 3 that can do the job more easily.
Here is code that combines your files into one big Pandas dataframe with 11,250,000 rows (25 × 20 × 150 × 150):
import pandas as pd
from glob import glob
import re
from datetime import datetime

pattern_file_name = re.compile(r'time-(.*)-energy(.*)-position.csv')
start_time = datetime.now()

result_df = None
for file_name in glob('time-*.csv'):
    # extract time and energy values from file name
    if not pattern_file_name.match(file_name):
        raise ValueError(f'file name {file_name} failed pattern match.')
    time_s, energy_s = pattern_file_name.findall(file_name)[0]
    time, energy = float(time_s), float(energy_s)
    print(f'Processing | {time_s} | {energy_s} |...')
    df = pd.read_csv(file_name, header=None)
    # assuming the CSV (i) has no headers (ii) is an array of 150x150...
    # ...floats with no missing or problematic values (iii) each row...
    # ...represents a fixed y-coordinate; adjust to your needs
    df.index.name = 'y'
    df = df.stack()
    df.index.rename('x', level=-1, inplace=True)
    df = df.swaplevel().sort_index().reset_index().rename(columns={0: 'flux'})
    # df is now (x, y, f)
    # x and y will each vary from 0 to 149
    df.insert(0, 't', time)
    df.insert(0, 'E', energy)
    result_df = df if result_df is None else pd.concat([result_df, df])

result_df = result_df.set_index(['E', 't', 'x', 'y']).sort_index()
# result_df is now (E, t, x, y) -> flux

result_df.to_csv('output.csv', index=True)

final_time = datetime.now()
delta_time = final_time - start_time
print(f'Completed in {delta_time}')
The main steps are as follows:
Loop over file names
Extract t and E values from file name
Read square matrix of flux values from file
Transform 150 × 150 square matrix to Pandas dataframe of length 22,500
Add columns to keep track of E and t
Append local result to a global, ever-increasing result vector
Finally, leave the loop and save results to disk as CSV
The resulting CSV file will have 5 columns. The first four would represent (E,t,x,y) and the last column would be the value of the flux field at those co-ordinates.
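To get back to the per-position tables the question asks for (one table with E columns and t rows for a given (x, y) point), here is a rough follow-up sketch, assuming the output.csv produced above; the coordinates x0 = 10, y0 = 20 are purely illustrative:
import pandas as pd

# Read the combined file back and restore the (E, t, x, y) index.
result_df = pd.read_csv('output.csv').set_index(['E', 't', 'x', 'y'])

# Flux table for one spatial position: rows = t, columns = E.
x0, y0 = 10, 20   # hypothetical coordinates, for illustration only
flux_at_point = result_df.xs((x0, y0), level=('x', 'y'))['flux'].unstack('E')
print(flux_at_point.shape)   # expected (20, 25) under the stated assumptions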

How to get feature importances/feature ranking from summary plot in SHAP without crashing?

I am attempting to get shap values out of an array which was created by
explainer = shap.Explainer(xg_clf, X_train)
shap_values2 = explainer(X_train)
using my XGBoost data, to make a dataframe of feature_names and their SHAP importance, as they would appear in a SHAP bar or summary plot.
Following advice from "how to extract the most important feature names?" and "How to get feature names of shap_values from TreeExplainer?", specifically the comment by user Thoo, which shows how the values can be extracted to make a dataframe:
vals= np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns,vals)),columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],ascending=False,inplace=True)
feature_importance.head()
shap_values has 11595 persons with 595 features each, which I understand is large, but creating the vals variable runs very slowly (about 58 minutes on my laptop) and uses almost all the RAM on the computer.
After 58 minutes I get an error:
Command terminated by signal 9
which as far as I understand, means that the computer ran out of RAM.
I've tried converting the 2nd line in Thoo's code to
feature_importance = pd.DataFrame(list(zip(X_train.columns,np.abs(shap_values2).mean(0))),columns=['col_name','feature_importance_vals'])
so that vals isn't stored, but this change doesn't reduce RAM usage at all.
I've also tried a different comment from the same GitHub issue (user "ba1mn"):
def global_shap_importance(model, X):
    """Return a dataframe containing the features sorted by Shap importance.

    Parameters
    ----------
    model : The tree-based model
    X : pd.Dataframe
        training set/test set/the whole dataset ... (without the label)

    Returns
    -------
    pd.Dataframe
        A dataframe containing the features sorted by Shap importance
    """
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    cohorts = {"": shap_values}
    cohort_labels = list(cohorts.keys())
    cohort_exps = list(cohorts.values())
    for i in range(len(cohort_exps)):
        if len(cohort_exps[i].shape) == 2:
            cohort_exps[i] = cohort_exps[i].abs.mean(0)
    features = cohort_exps[0].data
    feature_names = cohort_exps[0].feature_names
    values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
    feature_importance = pd.DataFrame(
        list(zip(feature_names, sum(values))), columns=['features', 'importance'])
    feature_importance.sort_values(
        by=['importance'], ascending=False, inplace=True)
    return feature_importance
but global_shap_importance returns the feature importances in the wrong order, and I don't see how I can alter global_shap_importance so that the features are returned in the same order as summary_plot (beeswarm plot).
How can I get the feature importance ranking into a dataframe?
I pulled this straight from the source code. Confirmed identical to the summary_plot.
def shapley_feature_ranking(shap_values, X):
    feature_order = np.argsort(np.mean(np.abs(shap_values), axis=0))
    return pd.DataFrame(
        {
            "features": [X.columns[i] for i in feature_order][::-1],
            "importance": [
                np.mean(np.abs(shap_values), axis=0)[i] for i in feature_order
            ][::-1],
        }
    )

shapley_feature_ranking(shap_values[0], X)
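If memory is the bottleneck, one possible alternative (not part of the original answer) is to accumulate the mean absolute SHAP values in chunks, so the absolute values of the full array are never materialised at once. A rough sketch, assuming shap_values is a plain 2-D NumPy array of shape (n_samples, n_features), as in shapley_feature_ranking above:
import numpy as np
import pandas as pd

def mean_abs_shap_chunked(values, chunk_size=1000):
    # Accumulate the per-feature sum of |SHAP| one chunk of rows at a time, then average.
    total = np.zeros(values.shape[1])
    for start in range(0, values.shape[0], chunk_size):
        total += np.abs(values[start:start + chunk_size]).sum(axis=0)
    return total / values.shape[0]

vals = mean_abs_shap_chunked(shap_values)   # shap_values: (n_samples, n_features) array
feature_importance = (pd.DataFrame({'features': X.columns, 'importance': vals})
                        .sort_values('importance', ascending=False))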

'numpy.ndarray' object has no attribute 'iterrows' while predicting value using lstm in python

I have a dataset with several inputs and I am trying to predict the next value of X1 from a combination of the previous input values.
My inputs are X1, X2, X3, X4.
So here I am trying to predict the next future value of X1. To predict the next X1, the four inputs combine as:
X1 + X2 - X3 - X4
I wrote this code inside a class. Then I wrote the code to run the LSTM, and after that the code to predict a value. Then it gave me this error. Can anyone help me solve this problem?
my code:
def model_predict(data):
    pred = []
    for index, row in data.iterrows():
        val = row['X1']
        if np.isnan(val):
            data.iloc[index]['X1'] = pred[-1]
            row['X1'] = pred[-1]
        f = row['X1','X2','X3','X4']
        s = row['X1'] - row['X2'] + row['X3'] - row['X4']
        val = model.predict(s)
        pred.append(val)
    return np.array(pred)
After the LSTM code, I wrote the code to predict a value:
pred = model_predict(x_test_n)
It gave me this error:
---> 5 pred = model_predict(x_test_n)

    def model_predict(data):
        pred=[]
-->     for index, row in data.iterrows():
            val = row['X1']
            if np.isnan(val):

AttributeError: 'numpy.ndarray' object has no attribute 'iterrows'
Apparently, the data argument of your function is a NumPy array, not a DataFrame.
As a np.ndarray, data also has no named columns.
One possible solution, keeping the argument as a np.ndarray, is to:
iterate over the rows of this array using np.apply_along_axis(),
refer to columns by indices (instead of names).
Another solution is to create a DataFrame from data, setting proper column names, and iterate over its rows.
One possible way to write the code without a DataFrame
Assume that data is a NumPy array with 4 columns, containing X1, X2, X3 and X4 respectively:
[[ 1  2  3  4]
 [10  8  1  3]
 [20  6  2  5]
 [31  3  3  1]]
Then your function can be:
def model_predict(data):
    s = np.apply_along_axis(lambda row: row[0] + row[1] - row[2] - row[3],
                            axis=1, arr=data)
    return model.predict(s)
Note that:
s (all the input values to your model) can be computed in a single instruction, calling apply_along_axis for each row (axis=1),
the predictions can also be computed "all at once", passing a single NumPy vector, s.
For demonstration purposes, you can compute s and print it, as in the sketch below.
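A small sketch of that demonstration, using the sample values shown above and assuming the column order X1, X2, X3, X4:
import numpy as np

data = np.array([[ 1,  2,  3,  4],
                 [10,  8,  1,  3],
                 [20,  6,  2,  5],
                 [31,  3,  3,  1]])

# One value per row: X1 + X2 - X3 - X4
s = np.apply_along_axis(lambda row: row[0] + row[1] - row[2] - row[3],
                        axis=1, arr=data)
print(s)   # [-4 14 19 30]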

Concat dataframe to multi index dataframe with gradient values

I have a multi-index dataframe with multiple test result values.
For further data analysis I want to add the derivative to the dataframe.
I tried to calculate it via a lambda function directly after grouping the dataframe. Grouping (taking mean values) is required due to the noise in the sampling.
Later I want to delete the rows from my dataframes where the derivative is <= 0.
The simplified Multi-index dataframe looks like this:
arrays = [['LS13', 'LS13', 'LS13', 'LS13','LS14','LS14','LS14','LS14','LS14','LS14','LS14','LS14'],[0, 2, 2.5, 3,0,2,5,5.5,6,6.5,7,7.5]]
index = pd.MultiIndex.from_arrays(arrays, names=('File', 'Flow Rate Setpoint [l/s]'))
df = pd.DataFrame({('Flow Rate [l/s]','mean') : [-0.057,2.089,2.496,3.011,0.056,2.070,4.995,5.519,6.011,6.511,7.030,7.499],('Time [s]','mean') : [42.225,104.909,165.676,226.446,42.225,104.918,469.560,530.328,591.100,651.864,712.660,773.034],('Shear Stress [Pa]','mean') : [-0.698,5.621,7.946,11.278,-0.774,6.557,40.610,48.370,54.685,58.414,58.356,56.254]},index=index)
If I run my code:
import numpy as np
xls = ['LS13', 'LS14']
gradient = [pd.Series(np.gradient(df.loc[(i),('Shear Stress [Pa]','mean')],df.loc[(i),('Time [s]','mean')])) for i in xls]
Now I want to concat gradient to df on axis=1; the column title could be df['Gradient','values'].
So my pd.Series looks like:
     Gradient
       values
0    0.100808
1    0.069048
2    0.04654
3    0.054801
0    0.116941
1    0.087431
2    0.149521
3    0.115805
4    0.082639
5    0.030213
6   -0.017938
7   -0.034806
next step would be to remove/drop the rows where ['Gradient','values'] <= 0, in my example ['LS14','7':'7.5']
When I tried to concatenate both Dataframe df and Series gradient (I'm aware that the indexes are different)
merged = pd.concat([pd.DataFrame(df),pd.Series(gradient)], axis=1 , ignore_index = True)
Errors are usually one of the following:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
TypeError: cannot concatenate object of type "<class 'list'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I would also assume there is an easier way to get this done with a lambda function and just apply it in place.
merged = pd.concat([df, pd.Series([gradient], name=('Gradient','value'))], axis=1)
I would have expected that to work, but I also get a mismatch error:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
When I try:
df[("Gradient","value")] = pd.Series([pd.Series(np.gradient(df.loc[(i),('Shear Stress [Pa]','mean')],df.loc[(i),('Time [s]','mean')])) for i in xls])
the ('Gradient','value') column gets correctly added to the dataframe, but the values are again NaN.
You can try groupby().apply():
def get_gradients(x):
    gradients = np.gradient(x[('Shear Stress [Pa]', 'mean')], x[('Time [s]', 'mean')])
    return pd.Series(gradients, index=x.index)

df[('Gradient','Value')] = (df.groupby('File', group_keys=False)
                              .apply(get_gradients)
                            )
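As a follow-up for the last step mentioned in the question (dropping the rows where the derivative is not positive), a one-line sketch, assuming the new column is named ('Gradient', 'Value') as above:
# Keep only the rows with a strictly positive gradient
df = df[df[('Gradient', 'Value')] > 0]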

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement the K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, when I try to run the same code for the Iris.csv dataset, my implementation fails and gives a KeyError.
The only difference between the 2 datasets is that in Breast-Cancer-Wisconsin.csv there are only 2 classes ('2' for malignant and '4' for benign) and both labels are integers, whereas in Iris.csv there are 3 classes ('setosa', 'versicolor', 'virginica') and all 3 labels are strings.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)

test_size = 0.2
train_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
test_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how I can modify this algorithm to classify more classes (instead of 2 or 3) in the future.
Also, how do I handle classes that are strings instead of integers?
One solution I thought of was to convert all the string labels to integers and try again, but would that work?
REFERENCES
Iris.csv
Breast-Cancer-Wisconsin.csv
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', 'virginica' (something like Iris-setosa perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discussed before, you just need to avoid hard-coding the names of the classes in your code.
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
    classes_seen = {}
    for i in data:
        _class = i[-1]   # the class label is the last element of each row
        if _class not in classes_seen:
            classes_seen[_class] = len(classes_seen)
    return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
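A hypothetical usage of that mapping (assuming full_data is a list of rows whose last element is the class label, as in the code from the question):
# Replace the string label in each row with its numeric code
class_map = get_class_values(full_data)
full_data = [row[:-1] + [class_map[row[-1]]] for row in full_data]

# Build the per-class buckets from the codes instead of hard-coded names
train_set = {code: [] for code in class_map.values()}
test_set = {code: [] for code in class_map.values()}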
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
# read the csv file
df = pd.read_csv('iris.csv')
# clean the data file
df.replace('?', -99999, inplace=True)
# convert the string classes into integer types.
# integers are assigned from 0 to N-1.
# species is the name of the column which has the class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
# convert the data frame to a list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
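An optional sanity check (my addition, not part of the original snippet), assuming the conversion above worked and species_value is the last column of full_data:
# The class codes should now be exactly 0, 1 and 2
print(sorted(set(int(row[-1]) for row in full_data)))   # expected: [0, 1, 2]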
Post Debugging
It turns out that we need not use the above piece of code either, i.e. I can get the answer without explicitly converting the string labels into integer labels using the above snippet.
I have posted the original code after some minor changes (below) and the KeyError is now fixed. Also, I am now getting an accuracy of 97% to 100% (on the Iris dataset only).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work! Simple!
However, please note that the numbers have to be given as integers and not strings (otherwise it would again lead to a KeyError!).
Wrap-Up
There are some commented-out lines in the original code which I thought would be good to explain in case somebody runs into issues. Here's one snippet with those lines uncommented (compare with the original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, the kernel threw an error because a string cannot be converted to float!
So one way to go about it is to not convert the data into floating point values, and then you won't get this error. However, in many cases you need to convert all the data into floating point (e.g. for normalisation, accuracy calculations, long mathematical computations, preventing loss of precision, etc.).
Hence, after heavy debugging and going through a lot of articles, I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)

test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy : ', (correct/total)*100, '%')
Hope this helps!
