Pickle for data preprocessing - python-3.x

I was going through various tutorials and articles on using pickle on an ML model so that it can be used later.
But I have not found anything like pickle for the data preprocessing steps. My preprocessing consists of:
Changing the datatype of a few columns/features.
Feature engineering.
One-hot encoding/dummy variables.
Scaling the data using the code below:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Now, I want to apply the same preprocessing to every dataset I pass in for predictions.
Is there any way to do something like pickle to load the data preprocessing steps before I pass the data to the ML model loaded from pickle?
Please guide.

I created a function and saved it in an independent file, then called that function whenever required.
Below is the code showing how I call the data preprocessing function:
from DataPreparationv3 import Data_Preprocess
Base_Data = pd.read_csv('Validate.csv')
DataReady = Data_Preprocess(Base_Data)
This solved my problem.
Regards
Sudhir
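Worth noting as an alternative: scikit-learn transformers such as StandardScaler are themselves picklable, so a fitted preprocessing pipeline can be saved the same way as the model and reapplied to new data. A minimal sketch, with toy arrays and a hypothetical LogisticRegression standing in for the real data and model:

import pickle
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real training set.
X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, 100)

# Fit preprocessing and model together so they are saved as one object.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression())])
pipe.fit(X_train, y_train)

# Persist the fitted pipeline; the scaler's learned mean/variance go with it.
with open('pipeline.pkl', 'wb') as f:
    pickle.dump(pipe, f)

# Later, load it and predict: new data gets the same preprocessing applied.
with open('pipeline.pkl', 'rb') as f:
    loaded = pickle.load(f)
predictions = loaded.predict(np.random.rand(10, 4))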

Related

Dataiku - Saving Models in DSS Python Recipes

How do I save a model in Dataiku?
This is the tutorial that I am using: https://doc.dataiku.com/dss/latest/python-api/model-evaluation-stores.html
Example Code:
from sklearn import linear_model
reg = linear_model.LinearRegression()
m = dataiku.Model(reg)
> TypeError: argument of type 'LinearRegression' is not iterable
Not sure what you're trying to achieve, but dataiku.Model(...) expects the parameter to be a str corresponding to the id of the model, according to the doc.
You might either want to:
Turn your linear regression into an mlflow model and import it into Dataiku DSS (a sketch of the mlflow side follows below).
Use the AutoML pipeline to train your linear regression in Dataiku.
In both cases, the corresponding doc can be found here: link
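For the first option, the mlflow export side might look like this (a sketch with toy data; the import into DSS itself then happens through Dataiku per the doc above):

import numpy as np
import mlflow.sklearn
from sklearn import linear_model

# Train a toy regression as a stand-in for your real model.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
reg = linear_model.LinearRegression().fit(X, y)

# Export as an MLflow model directory, which DSS can then import.
mlflow.sklearn.save_model(reg, "linreg_model")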

What are common sources of randomness in Machine Learning projects with Keras?

Reproducibility is important. In a closed-source machine learning project I'm currently working on, it is hard to achieve. What are the parts to look at?
Setting seeds
Computers have pseudo-random number generators which are initialized with a value called the seed. For machine learning, you might need to do the following:
# I've heard the order here is important
import random
random.seed(0)
import numpy as np
np.random.seed(0)
import tensorflow as tf
tf.set_random_seed(0)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
from keras import backend as K
K.set_session(sess) # tell keras about the seeded session
# now import keras stuff
See also: Keras FAQ: How can I obtain reproducible results using Keras during development?
sklearn
sklearn.model_selection.train_test_split has a random_state parameter.
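A minimal sketch, with toy arrays standing in for real data:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Fixing random_state makes the split identical across runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)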
What to check
Am I loading the data in the same order every time?
Do I initialize the model the same way?
Do I use external data that might change?
Do I use external state that might change (e.g. datetime.now)?

Loading a dataset in uint8 format - python

I was looking through the load_data() function in Python that returns X_train, X_test, Y_train and Y_test, as in this link. As you can see, it is for the CIFAR10 and CIFAR100 datasets, and it returns the above-mentioned values as uint8 arrays.
I wanted to know: is there some other function like this for loading datasets stored locally on our system?
If so, please help me with its usage; if not, please suggest an alternative.
Thanks in advance.
load_data() is not part of Python itself but is defined in the keras.datasets.cifar10 module. To load the CIFAR dataset (or any other dataset), there are many possible methods depending on how the dataset is packaged/formatted. Usually, the pandas module can be used for loading/saving/manipulating table-like data.
For cifar data, here is another example: loading an image from cifar-10 dataset
Here the author uses the pickle module to unpack the dataset and then the PIL and numpy modules to load and manipulate individual images.
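For reference, a sketch of that unpickling approach, assuming the standard CIFAR-10 "python version" batch files have been downloaded and extracted locally (the path below is hypothetical):

import pickle
import numpy as np

def load_cifar_batch(path):
    # CIFAR-10 python-version batches are pickled dicts with bytes keys.
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    data = batch[b'data']               # uint8 array of shape (10000, 3072)
    labels = np.array(batch[b'labels'])
    # Reshape flat rows into 32x32 RGB images (channels last).
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels

# images, labels = load_cifar_batch('cifar-10-batches-py/data_batch_1')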

Distribution of time series data in TensorFlow

I currently have a kdb+ database with ~1 million rows of financial tick data. Using Python 3, TensorFlow, and numpy, what is the best way to break up time-series financial data into train/dev/test sets?
This paper suggests the use of k-fold cross-validation, which partitions the data into complementary subsets. But it's from spring 2014, and after reading it I'm still unclear on how to implement it in practice. Is this the best solution, or is something like hold-out validation more appropriate for financial data?
I'm also interested in learning best practices for importing locally stored time-series data into my TensorFlow model.
Thank you.
One can use qPython to load the data into the Python process and then KFold from sklearn to repeatedly split the data set into training and test parts.
Suppose we have the following table defined on the KDB+ side:
t:([] time:.z.t+til 30;ask:100.0+30?1.0;bid:98+30?1.0)
Then on the Python side you can do the following to produce indices of the train/test splits:
from qpython import qconnection
import pandas as pd
from sklearn.model_selection import KFold
with qconnection.QConnection(host='localhost', port=5001, pandas=True) as q:
    X = q.sync('t')

kf = KFold(n_splits=4)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
See the KFold documentation for other variants of KFold.
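One caveat: plain KFold lets training folds come after the test fold in time, which leaks future information into a financial model. sklearn's TimeSeriesSplit keeps every training index before the test indices; a sketch with a toy array standing in for the 30-row table above:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(30).reshape(-1, 1)  # stand-in for the table t

# Each split trains only on observations preceding the test window.
tscv = TimeSeriesSplit(n_splits=4)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)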

name 'classification_model' is not defined

I'm trying to build a model in Python 3.5 and am following an example that can be found here.
I have imported all the required libraries from sklearn.
However, I'm getting the following error.
Code:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, loan,predictor_var,outcome_var)
When I run the above code I get the following error:
NameError: name 'classification_model' is not defined
I'm not sure how to resolve this, as I have tried importing sklearn and all its sub-libraries.
P.S. I'm new to Python, hence I'm trying to figure out the basic steps.
Depending on the exact details this may not be what you want, but I have never had a problem with:
import sklearn.linear_model as sk
logreg = sk.LogisticRegressionCV()
logreg.fit(predictor_var, outcome_var)
This means you have to explicitly separate your training and test sets, but having fit to a training set (the process in the final line of my code), you can then use the methods detailed in the documentation [1].
For example, you can find out what score you get on unseen data (how many predictions were correct) with the .score method.
[1] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
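A runnable version of that flow, with toy arrays standing in for the loan data:

import numpy as np
import sklearn.linear_model as sk

# Toy data standing in for the real predictors and outcome.
X = np.random.rand(100, 1)
y = (X.ravel() > 0.5).astype(int)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

logreg = sk.LogisticRegressionCV()
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))  # mean accuracy on unseen data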
It appears this code came from this tutorial.
The issue is exactly what the error describes: classification_model is currently undefined. You need to create this function yourself before you can call it. Check out this part of that tutorial so you can see how it's defined. Note that from sklearn.metrics import classification_report imports an unrelated built-in reporting utility; it will not define classification_model for you. Good luck!
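For illustration, a minimal sketch of what such a helper might look like (an assumption about the tutorial's function, not its exact code; it expects data as a pandas DataFrame, and uses the modern sklearn.model_selection.KFold rather than the deprecated sklearn.cross_validation import from the question):

from sklearn import metrics
from sklearn.model_selection import KFold

def classification_model(model, data, predictors, outcome):
    # Fit on the full data and report (optimistic) training accuracy.
    model.fit(data[predictors], data[outcome])
    predictions = model.predict(data[predictors])
    print("Accuracy: %.3f" % metrics.accuracy_score(data[outcome], predictions))

    # 5-fold cross-validation for a less optimistic estimate.
    kf = KFold(n_splits=5)
    scores = []
    for train_idx, test_idx in kf.split(data):
        model.fit(data[predictors].iloc[train_idx], data[outcome].iloc[train_idx])
        scores.append(model.score(data[predictors].iloc[test_idx],
                                  data[outcome].iloc[test_idx]))
    print("Cross-validation: %.3f" % (sum(scores) / len(scores)))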
