Stock Prediction on the basis of Symbol, Date, Average Price - python-3.x

I am trying to predict stock prices for the next 7 days based on the last 5 years of data (columns: Symbol, Date, Average Price).
I am trying to apply support vector regression (SVR) to this data set. I have already converted the Date column to pandas datetime using data.Date = pd.to_datetime(data.Date), but I still get this error:
float() argument must be a string or a number, not 'Timestamp'.
My code is as follows:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions

# keep only one symbol
adaniPorts = data[data.Symbol == 'ADANIPORTS']

X = adaniPorts[['Symbol', 'Date']]
Y = adaniPorts['Average Price']

x_train, x_test, y_train, y_test = train_test_split(X, Y)
classifier = SVR().fit(x_train, y_train)
Is there any way to resolve this datetime problem?

When you train the SVR you can only use numerical features. One way to include the datetime information is to convert it to a number, for example with (df.Date - df.Date.min()).dt.total_seconds(), so the regressor is fed a numerical feature representing the date. Another way is to include the different fields of the datetime object (year, month, day) as separate predictors.
However, using an SVR for time-series forecasting only makes sense if the features provide enough information to overcome the temporal component, which is doubtful here.
Furthermore, you are using train_test_split, which generates random train and test subsets from the original data.
This cannot be applied directly to time-series data, as it assumes there is no relationship between the observations. When dealing with time series, the data must be split respecting the temporal order in which the values were observed.
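A minimal sketch of both points (a numerical date feature plus a chronological split), assuming a DataFrame data with 'Date' and 'Average Price' columns as in the question:
df = data.sort_values('Date')  # keep the temporal order

# numerical feature: seconds elapsed since the first observation
df['date_numeric'] = (df['Date'] - df['Date'].min()).dt.total_seconds()

X = df[['date_numeric']]
y = df['Average Price']

# chronological split instead of a random train_test_split
split = int(len(df) * 0.8)
x_train, x_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

from sklearn.svm import SVR
regressor = SVR().fit(x_train, y_train)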
I also suggest you take a look at recurrent neural networks or ARIMA models.

As the answer from Alexandre says, only numerical features are supported, so string and date features have to be turned into numbers first. You have several options. The first one is what he does: transform each date into a single numerical value (e.g. seconds). However, I think it is better to transform each part of the date into a one-hot encoding.
data['day'] = data.Date.dt.day
data['month'] = data.Date.dt.month
data['year'] = data.Date.dt.year
With this you have the day, month and year separated. Now you can one-hot encode them: build a vector of 0s for each possible value and put a 1 in the position corresponding to the value at hand. For example, the 3rd day of a month becomes:
[0,0,1,0,....,0] -> 1x31
To do that with pandas you can use something like this.
data = pd.concat([data, pd.get_dummies(data.year, prefix='year')], axis=1, sort=False)
data = pd.concat([data, pd.get_dummies(data.month, prefix='month')], axis=1, sort=False)
data = pd.concat([data, pd.get_dummies(data.day, prefix='day')], axis=1, sort=False)
It can also be interesting to add the day of the week, because on weekends everything stops (markets are closed).
data['week_day'] = data.Date.dt.dayofweek
And before passing the data to the SVR, drop the Date column: data.drop(['Date'], axis=1, inplace=True)
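A rough sketch of the final step, assuming the remaining columns (the dummies and week_day) are numeric and that the non-numeric Symbol column has been dropped or encoded as well:
from sklearn.svm import SVR

# target and encoded features (Date already dropped above)
target = data['Average Price']
features = data.drop(['Average Price'], axis=1)

regressor = SVR().fit(features, target)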
I hope this works.
PS: I would recommend an LSTM (a neural network) or ARIMA (a statistical model) for this task.

Related

Should I standardize the second dataset with the same scaling as the first dataset?

I am very confused.
I have two datasets. One is considered the source domain (Dataset A) and the other is considered the target domain (Dataset B).
First, I standardized each column of Dataset A using the mean and standard deviation of the respective column. I have 600 points in Dataset A. Then I split the dataset into training, validation and testing sets. I trained a CNN model and then tested it on the testing set. It gives pretty accurate results (predictions).
I calculated the mean and standard deviation of each column in Dataset A as follows:
thicknessMean = np.mean(thick_SD)
MaxForceMean = np.mean(maxF_SD)
MeanForceMean = np.mean(meanF_SD)
thicknessstd = np.std(thick_SD)
MaxForcestd = np.std(maxF_SD)
MeanForcestd = np.std(meanF_SD)
thick_SD_scaled = (thick_SD - thicknessMean)/thicknessstd
maxF_SD_scaled = (maxF_SD - MaxForceMean)/MaxForcestd
meanF_SD_scaled = (meanF_SD - MeanForceMean)/MeanForcestd
Now, I want to make predictions from the model by feeding it Dataset B. Therefore, I saved the already trained model (as a .pth file). Then I standardized Dataset B, but this time I transformed it using the mean and standard deviation of Dataset A. After doing this, I evaluated the already trained model on Dataset B, but it gives much worse predictions.
thick_TD_scaled = (thick_TD - thicknessMean)/thicknessstd
maxF_TD_scaled = (maxF_TD - MaxForceMean)/MaxForcestd
meanF_TD_scaled = (meanF_TD - MeanForceMean)/MeanForcestd
As you can see, to scale Dataset B I used the mean (e.g. thicknessMean) and standard deviation (e.g. thicknessstd) of Dataset A.
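For reference, the same transformation can be written with sklearn's StandardScaler, fitted on Dataset A only and reused for Dataset B (a sketch, assuming the three feature arrays are stacked column-wise):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_A = np.column_stack([thick_SD, maxF_SD, meanF_SD])  # source domain (Dataset A)
X_B = np.column_stack([thick_TD, maxF_TD, meanF_TD])  # target domain (Dataset B)

scaler = StandardScaler().fit(X_A)      # learns Dataset A's column means and stds
X_A_scaled = scaler.transform(X_A)
X_B_scaled = scaler.transform(X_B)      # Dataset B scaled with Dataset A statistics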
My questions are:
(1) Where am I going wrong? What should I do to make my predictions close to accurate?
(2) When I check prediction accuracy on two different datasets, should I standardize the second dataset with the same scaling as the first dataset?

Multi-step time series forecast using Holt-Winters algorithm in python

This is a time-series forecasting problem with a dataset that has almost no seasonality and a trend that follows the input data. The data is stationary (the p-value is less than 5%).
I am trying to convert the single-step forecast into a multi-step forecast by feeding the predictions back as inputs to the Holt-Winters algorithm, in order to obtain predictions for multiple days.
Please find below a small snippet of the logic.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

data = pd.read_csv('test_data.csv')

# After time series decomposition and stationarity check using the AD Fuller test
model = ExponentialSmoothing(data).fit()

number_of_days = 5
for i in range(0, number_of_days):
    # one-step-ahead prediction, appended back onto the data
    yhat = model.predict(len(data), len(data))
    data = pd.DataFrame(data)
    data = data.append(pd.DataFrame(yhat), ignore_index=True)

data_length = data.size
The forecast (output) is the same value for all the days.
Can anyone help me understand how to tune the algorithm (and/or the logic above) to get a better forecast?
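For what it's worth, a minimal sketch of what the loop appears to intend, refitting the model on the extended series before each one-step forecast (assuming test_data.csv contains a single numeric column):
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

series = pd.read_csv('test_data.csv').iloc[:, 0]

number_of_days = 5
history = series.tolist()
forecasts = []

for _ in range(number_of_days):
    # refit on the history extended with the previous forecasts,
    # so each one-step-ahead prediction uses the latest information
    model = ExponentialSmoothing(history).fit()
    yhat = model.forecast(1)[0]
    forecasts.append(yhat)
    history.append(yhat)

print(forecasts)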

Incremental OneHotEncoding and Target Encoding

I am working with a large tabular dataset that has many categorical columns. I want to train a regression model (XGBoost) on this data while using as many regressors as possible.
Because of the size of the data, I am using incremental training: following the sklearn .fit(X, y) API, I cannot hold the entire matrix X in memory, so I train the model on a few rows at a time. The problem is that in every batch the model expects the same number of columns in X.
This is where it gets tricky: because some variables are categorical, one-hot encoding one batch of data may produce one shape (e.g. 20 columns), while the next batch produces 26 columns, simply because not every unique level of the categorical feature was present in the previous batch. Sklearn allows you to account for this, and a custom function can also be used to keep the number of columns in matrix X fixed.
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def one_hot_known(dataf, list_levels, col):
    """Creates a dummy coded matrix with as many columns as unique levels"""
    return np.array(
        [np.eye(len(list_levels))[list_levels.index(i)] for i in dataf[col]])

# Load some dataset with a categorical variable
df_orig = sns.load_dataset('tips')

# List of unique levels - known a priori
day_level = list(df_orig['day'].unique())

# Imagine we have a batch of data (a subset of the original data) in which one
# categorical level ('Sun') is not present
df = df_orig.loc[lambda d: d['day'] != 'Sun']

# The missing category is filled with 0s; in the next batch, if present, its column will have 1s
OneHotEncoder(categories=[day_level], sparse=False).fit_transform(
    np.array(df['day']).reshape(-1, 1))

# Custom function, can be used in an incremental (data batch / chunk) fashion
one_hot_known(df, day_level, 'day')
What I would like to do now is use the TargetEncoding approach instead, so that matrix X does not end up with a huge number of columns. However, it still needs to be done in an incremental fashion, just like the one-hot encoding above.
I am writing this as a post because I think it will be useful to many people, and I would like to know how to apply the same strategy to TargetEncoding.
I am aware that deep learning allows for embedding layers, which represent categorical features in a continuous space, but I would like to apply TargetEncoding.
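One possible direction (my own sketch, not from any particular library): keep running per-category sums and counts across batches, so the per-category target mean can be updated incrementally and applied to each new batch.
from collections import defaultdict

class IncrementalTargetEncoder:
    """Keeps a running sum/count of the target per category across batches."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def partial_fit(self, categories, targets):
        # update the running statistics with one batch
        for c, t in zip(categories, targets):
            self.sums[c] += t
            self.counts[c] += 1
        return self

    def transform(self, categories, default=0.0):
        # unseen categories fall back to a default value
        return [self.sums[c] / self.counts[c] if self.counts[c] else default
                for c in categories]

# usage on two batches
enc = IncrementalTargetEncoder()
enc.partial_fit(['Thur', 'Fri', 'Thur'], [10.0, 12.0, 8.0])
enc.partial_fit(['Sat', 'Fri'], [20.0, 14.0])
print(enc.transform(['Thur', 'Sat', 'Sun']))   # 'Sun' is unseen -> default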

SkLearn: Feature Union with a dictionary and text data

I have a DataFrame like:
   text_data          worker_dicts                               outcomes
0  "Some string"      {"Sector": "Finance", "State": "NJ"}       0
1  "Another string"   {"Sector": "Programming", "State": "NY"}   1
It has both text information and a column that is a dictionary (the real worker_dicts has many more fields). I'm interested in the binary outcome column.
What I initially tried was to combine text_data and worker_dicts by crudely concatenating both columns, and then running Multinomial NB on that:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

df['stacked_features'] = df['text_data'].astype(str) + '_' + df['worker_dicts'].astype(str)
stacked_features = np.array(df['stacked_features'])
outcomes = np.array(df['outcomes'])

text_clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(stacked_features, outcomes)
But I got very bad accuracy, and I think that fitting two independent models would be a better use of data than fitting one model on both types of features (as I am doing with stacking).
How would I go about utilizing Feature Union? worker_dicts is a little weird because it's a dictionary, so I'm very confused as to how I'd go about parsing that.
If your dictionary entries are categorical, as they appear to be in your example, then I would create separate columns from the dictionary entries before doing any additional processing:
new_features = pd.DataFrame(df['worker_dicts'].values.tolist())
Then new_features will be its own DataFrame with columns Sector and State, and you can one-hot encode those as needed, in addition to TF-IDF or other feature extraction for your text_data column. In order to use that in a pipeline, you would need to create a new transformer class, so I might suggest applying the dictionary parsing and the TF-IDF separately, then stacking the results, and adding OneHotEncoding to your pipeline, as that allows you to specify which columns to apply the transformer to. (As the categories you want to encode are strings, you may want to use the LabelBinarizer class instead of the OneHotEncoder class for the encoding transformation.)
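For example, one simple way to do the encoding step, using pandas get_dummies here rather than the sklearn encoder classes mentioned above (column names taken from the example):
import pandas as pd

# one-hot encode the columns extracted from the dictionary
encoded_features = pd.get_dummies(new_features[['Sector', 'State']])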
If you want to just use TFIDF on all of the columns individually with a pipeline, you would need to use a nested Pipeline and FeatureUnion set up to extract columns as described here.
If you have your one-hot encoded features in DataFrames X1 and X2 as described below and your text features in X3, you could do something like the following to create a pipeline. (There are many other options; this is just one way.)
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

X = pd.concat([X1, X2, X3], axis=1)

def select_text_data(X):
    return X['text_data']

def select_remaining_data(X):
    return X.drop('text_data', axis=1)

# pipeline to get tf-idf features for the text column
text_pipeline = Pipeline([
    ('column_selection', FunctionTransformer(select_text_data, validate=False)),
    ('tfidf', TfidfVectorizer())
])

final_pipeline = Pipeline([
    ('feature-union', FeatureUnion([
        ('text-features', text_pipeline),
        ('other-features', FunctionTransformer(select_remaining_data, validate=False))
    ])),
    ('clf', LogisticRegression())
])
(MultinomialNB doesn't have transform or fit_transform methods, so it can only be used as the final estimator of a Pipeline, not inside the FeatureUnion.)
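A possible way to use the assembled pipeline (a sketch; it assumes X was built as above and that the labels live in df['outcomes']):
y = df['outcomes']

final_pipeline.fit(X, y)
predictions = final_pipeline.predict(X)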

Dealing with no data values

During training, none of my features had '0' values, so I successfully built my SVM model.
However, the data I use for prediction with that model has '0' values at some sample locations. These '0's are no-data values. How can I deal with no-data values during prediction? I could impute during training, but if I remove the no-data values during prediction I will be missing prediction results at those sample locations.
At those sample points, not all features are void, only some of them.
Any suggestions are appreciated.
If some data values are NaN, then you need an imputer to fill in those missing values. A common strategy is to replace them with the 'mean' or 'median' of the column:
from sklearn.impute import SimpleImputer  # formerly sklearn.preprocessing.Imputer

imputer = SimpleImputer(strategy='mean')
X_data = imputer.fit_transform(X_data_with_missing_values)
You can then fit the SVM using this imputed X_data.
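A short sketch of the full flow, reusing the imputer fitted on the training data at prediction time (X_train, y_train and X_new are assumed to be defined; if the no-data marker is 0 rather than NaN, SimpleImputer's missing_values parameter can be set accordingly):
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

imputer = SimpleImputer(strategy='mean')          # or SimpleImputer(missing_values=0, strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)  # learn the column means on the training data

model = SVC().fit(X_train_imputed, y_train)

# at prediction time, transform with the imputer fitted on the training data
X_new_imputed = imputer.transform(X_new)
predictions = model.predict(X_new_imputed)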
