Distribution of time series data in TensorFlow - python-3.x

I currently have a kdb+ database with ~1mil rows of financial tick data. Using Python3, TensorFlow, and numpy, what is the best way to break up time-series financial data into train/dev/test sets?
This paper suggests the use of k-fold cross-validation, which partitions the data into complimentary subsets. But it's from Spring-2014 and after reading it I'm still unclear on how to implement it in practice. Is this the best solution or is something like hold-out validation more appropriate for financial data?
I'm also interested in learning best practices for importing locally stored time-series data into my TensorFlow model.
Thank you.

One can use qPython to load the data to the Python process and then KFold from sklearn to repeatedly split the data set into training and test part.
Suppose we have the following table defined on the KDB+ side:
t:([] time:.z.t+til 30;ask:100.0+30?1.0;bid:98+30?1.0)
Then on the Python side you can do the following to produce indices of the train/test splits:
from qpython import qconnection
import pandas as pd
from sklearn.model_selection import KFold
with qconnection.QConnection(host = 'localhost', port = 5001, pandas = True) as q:
X = q.sync('t')
kf = KFold(n_splits=4)
for train_index, test_index in kf.split(X):
print("TRAIN:", train_index, "TEST:", test_index)
See the KFold documentation for other variants of KFold.

Related

is it possible to set the splitting strategy for GridSearchCv?

I'm optimizing model's hyperparameters by GridSearchCv. And because the data I'm working with is very imbalanced, I need to "choose" the manner that the algortihm splits the train/test sets in order to ensure that the underrepresented points are in both sets.
By reading scikit-learn's documentation, I have the idea that it's possible to set the splitting strategy for GridSearch but I'm not sure how or if this is the case.
I would be very grateful if someone could help me with this.
Yes, pass in the GridSearchCV as cv a StratifiedKFold object.
from sklearn.model_selection import StratifiedKFold
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
skf = StratifiedKFold(n_splits=5)
clf = GridSearchCV(svc, parameters, cv = skf)
clf.fit(iris.data, iris.target)
By default, if you are training a classification model with GridSearchCV, the default method for splitting the dataset is StratifiedKFold, that takes care of balancing the dataset according to the target variable.
If your dataset is imbalanced for some other reason (not the target variable), you can choose another criteria to perform the split. Carefully read the documentation of GridSearchCV, and select an appropriate CV splitter.
In the scikit-learn documentation of model selection, there are many Splitter Classes that you could use. Or you can define your own splitter class according to your criteria, but it would be more difficult.

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The data-set is provided to me by my professor for this comparative study.
my jupyter notebook and dataset can be found here
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the inbuilt data-set in sklearn (breast cancer data-set) which is very similar to my data-set as both are binary classifications. Here I get an mean accuracy of 95 %. So I think right now that the problem might be my data-set. Can I get some help on how do I pre-process my data or any other steps that I might look into to improve accuracy.
Encode labels
Categorical data are variables that contain label values rather than numeric values.The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group etc. We will use Label Encoder to label the categorical data. Label Encoder is the part of SciKit Learn library in Python and used to convert categorical data, or text data, into numbers, which our predictive models can better understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations. We need to bring all features to the same level of magnitudes. This can be achieved by scaling. This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1. We will use StandardScaler method from SciKit-Learn library.
#Feature Scalingfrom sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing Right model
You kight also want to vhoose the appropriate model. You can't just use neural nets or so for all problems it's the no free luch theorem. For this you could use K-fold cross validation, AIC and BIC

Pickle for datapreprocessing

I was going through various tutorials and articles on using pickle on the ml model so that that can be used later.
But I am not able to get something pickle or something similar for data pre- processing. I am doing the preprocessing:
Changing the datatype of few columns/features.
Feature engineering.
Hot Encoding/Dummy variables
Scaling the data using below code
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Now, I want to do this for every dataset which I pass for predictions.
Is there any way to do something like pickle to load the data preprocessing steps before I was this to loaded ML model from pickle.
Please guide
I created a function and saved it a independent file. Then called that function whenever required.
Below is the code on how I am calling the data pre process function
from DataPreparationv3 import Data_Preprocess
Base_Data = pd.read_csv('Validate.csv')
DataReady = Data_Preprocess(Base_Data)
This solved my problem.
Regards
Sudhir

Scikit-Learn: Avoiding Data Leakage During Cross-Validation

I've just been reading up on k-fold cross-validation and have realized that I'm inadvertently leaking data with my current preprocessing setup.
Usually, I have a train and test dataset. I do a bunch of data imputation and one-hot encoding on my entire train dataset and then run k-fold cross-validation.
The leakage comes in because, if I'm doing 5-fold cross-validation, I'm training on 80% of my train data and testing it on the remaining 20% of the train data.
I really should just be imputing the 20% based on the 80% of train (whereas I was using 100% of the data before).
1) Is this the right way to think about cross-validation?
2) I've been looking at the Pipeline class in sklearn.pipeline and it seems useful for doing a bunch of transformations and then finally fitting a model to the resulting data. However, I'm doing a bunch of stuff like "impute missing data in float64 columns with the mean", "impute all other data with the mode", etc.
There isn't an obvious transformer for this kind of imputation. How would I go about adding this step to a Pipeline? Would I just make my own subclass of BaseEstimator?
Any guidance here would be great!
1) Yes, you should impute the 20% test data using the 80% training data.
2) I wrote a blog post that answers your second question, but I'll include the core parts here.
With sklearn.pipeline, you can apply separate preprocessing rules to different feature types (e.g., numeric, categorical). In the example code below, I impute the median of numeric features before scaling them. The categorical and boolean features are imputed with the mode -- the categorical features are one-hot encoded.
You can include an estimator at the end of the pipeline for regression, classification, etc.
import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, Imputer, StandardScaler
preprocess_pipeline = make_pipeline(
FeatureUnion(transformer_list=[
("numeric_features", make_pipeline(
TypeSelector(np.number),
Imputer(strategy="median"),
StandardScaler()
)),
("categorical_features", make_pipeline(
TypeSelector("category"),
Imputer(strategy="most_frequent"),
OneHotEncoder()
)),
("boolean_features", make_pipeline(
TypeSelector("bool"),
Imputer(strategy="most_frequent")
))
])
)
The TypeSelector portion of the pipeline assumes the object X is a pandas DataFrame. The subset of columns with the given data type are selected with TypeSelector.transform.
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class TypeSelector(BaseEstimator, TransformerMixin):
def __init__(self, dtype):
self.dtype = dtype
def fit(self, X, y=None):
return self
def transform(self, X):
assert isinstance(X, pd.DataFrame)
return X.select_dtypes(include=[self.dtype])
I recommend thinking of 5-fold cross validation as simply splitting up the data into 5 parts (or folds). You hold out one fold for testing and using the other 4 together for your training set. We repeat this process another 4 times until each fold has had the chance to be tested.
For your imputation to work correctly and not be subject to contamination, you would need to determine the mean from the 4 folds used for testing, and use it to impute that value in both the training set and test set.
I like to implement the CV split with StratifiedKFold. This will ensure you have the same number of samples for each class in the folds.
To answer your question about using Pipelines, I would say you should probably subclass the BaseEstimator with your custom Imputation transformer. Inside of your loop for the CV-split, you should compute the mean from your training set then set this mean as a parameter in your transformer. Then you can call fit or transform.

ValueError: could not convert string to float in panda

My code is :
import pandas as pd
data = pd.read_table('train.tsv')
X=data.Phrase
Y=data.Sentiment
from sklearn import cross_validation
X_train,X_test,Y_train,Y_test=cross_validation.train_test_split(X,Y,test_size=0.2,random_state=0)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X,Y)
I get the error :ValueError: could not convert string to float:
What changes can I make that my code works?
You can't pass in text data into MultinomialNB of scikit-learn as stated in its documentation.
None of the algorithms in scikit-learn works directly with text data. You need to do some preprocessing to get desired output. You'll need to first extract the features from text data using techniques like bagging or tokenizing. Have a look at this link for better understanding.
You also might want to look at using NLTK for such use cases as yours.
ValueError when using Multinomial Naive Bayes classifier
You probably should preprocess your data as shown in the answer above.

Resources